
Docker Model Runner Integration

This guide shows how to integrate Docker Model Runner with your existing Docker stack for AI-powered trading applications.

📁 Files Overview

  • docker-compose.yml - Main compose file with model runner services
  • docker-compose.model-runner.yml - Standalone model runner configuration
  • model-runner.env - Environment variables for configuration
  • integrate_model_runner.sh - Integration script for existing stacks
  • docker-compose.integration-example.yml - Example integration with trading services

🚀 Quick Start

Option 1: Use with Existing Stack

# Run integration script
./integrate_model_runner.sh

# Start services
docker-compose up -d

# Test API
curl http://localhost:11434/api/tags

Option 2: Standalone Model Runner

# Use dedicated compose file
docker-compose -f docker-compose.model-runner.yml up -d

# Test with specific profile
docker-compose -f docker-compose.model-runner.yml --profile llama-cpp up -d

🔧 Configuration

Environment Variables (model-runner.env)

# AMD GPU Configuration
HSA_OVERRIDE_GFX_VERSION=11.0.0  # AMD GPU version override
GPU_LAYERS=35              # Layers to offload to GPU
THREADS=8                  # CPU threads
BATCH_SIZE=512             # Batch processing size
CONTEXT_SIZE=4096          # Context window size

# API Configuration
MODEL_RUNNER_PORT=11434    # Main API port
LLAMA_CPP_PORT=8000        # Llama.cpp server port
METRICS_PORT=9090          # Metrics endpoint
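
As a quick sanity check, a script can read these values and build the API base URL before wiring them into other services. A minimal sketch in Python, assuming model-runner.env uses plain KEY=VALUE lines:

# Sketch: parse model-runner.env (plain KEY=VALUE lines assumed)
from pathlib import Path

def load_env(path="model-runner.env"):
    config = {}
    for line in Path(path).read_text().splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" in line:
            key, value = line.split("=", 1)
            config[key.strip()] = value.strip()
    return config

config = load_env()
print(f"API at http://localhost:{config.get('MODEL_RUNNER_PORT', '11434')}")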

Ports Exposed

  • 11434 (Docker Model Runner) - Ollama-compatible API
  • 8083 (Docker Model Runner) - Alternative API port
  • 8000 (Llama.cpp Server) - Advanced llama.cpp features
  • 9090 (Metrics) - Prometheus metrics
  • 8050 (Trading Dashboard) - Example dashboard
  • 9091 (Model Monitor) - Performance monitoring

🛠️ Usage Examples

Basic Model Operations

# List available models
curl http://localhost:11434/api/tags

# Pull a model
docker-compose exec docker-model-runner /app/model-runner pull ai/smollm2:135M-Q4_K_M

# Run a model
docker-compose exec docker-model-runner /app/model-runner run ai/smollm2:135M-Q4_K_M "Hello!"

# Pull Hugging Face model
docker-compose exec docker-model-runner /app/model-runner pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF

API Usage

# Generate text (Ollama-compatible API)
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2:135M-Q4_K_M",
    "prompt": "Analyze market trends",
    "temperature": 0.7,
    "max_tokens": 100
  }'

# Chat completion
curl -X POST http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2:135M-Q4_K_M",
    "messages": [{"role": "user", "content": "What is your analysis?"}]
  }'

Integration with Your Services

# Example: Python integration
import requests

class AIModelClient:
    """Minimal client for the model runner's Ollama-compatible HTTP API."""

    def __init__(self, base_url="http://localhost:11434"):
        self.base_url = base_url

    def generate(self, prompt, model="ai/smollm2:135M-Q4_K_M"):
        # Single-prompt completion via /api/generate
        response = requests.post(
            f"{self.base_url}/api/generate",
            json={"model": model, "prompt": prompt},
            timeout=120,
        )
        response.raise_for_status()
        return response.json()

    def chat(self, messages, model="ai/smollm2:135M-Q4_K_M"):
        # Multi-turn conversation via /api/chat
        response = requests.post(
            f"{self.base_url}/api/chat",
            json={"model": model, "messages": messages},
            timeout=120,
        )
        response.raise_for_status()
        return response.json()

# Usage
client = AIModelClient()
analysis = client.generate("Analyze BTC/USDT market")

🔗 Service Integration

With Existing Trading Dashboard

# Add to your existing docker-compose.yml
services:
  your-trading-service:
    # ... your existing config
    environment:
      - MODEL_RUNNER_URL=http://docker-model-runner:11434
    depends_on:
      - docker-model-runner
    networks:
      - model-runner-network

Internal Networking

Services communicate using Docker networks:

  • http://docker-model-runner:11434 - Internal API calls
  • http://llama-cpp-server:8000 - Advanced features
  • http://model-manager:8001 - Management API
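
Inside the compose network, code should prefer the service-name URLs above over localhost. A minimal sketch, assuming MODEL_RUNNER_URL is injected as in the integration example (falling back to the host-mapped port for local runs):

# Resolve the model runner URL from the environment, as set in the
# compose integration example; fall back to the host-mapped port.
import os
import requests

BASE_URL = os.environ.get("MODEL_RUNNER_URL", "http://localhost:11434")

def list_models():
    response = requests.get(f"{BASE_URL}/api/tags", timeout=10)
    response.raise_for_status()
    return response.json()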

📊 Monitoring and Health Checks

Health Endpoints

# Main service health
curl http://localhost:11434/api/tags

# Metrics endpoint
curl http://localhost:9090/metrics

# Model monitor (if enabled)
curl http://localhost:9091/health
curl http://localhost:9091/models
curl http://localhost:9091/performance
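
These endpoints are easy to fold into a small watchdog. A sketch that polls each one and prints up/down status (the endpoint list mirrors the curls above; the monitor endpoint only exists if that service is enabled):

# Sketch: poll the health endpoints above and print up/down status
import requests

ENDPOINTS = {
    "model-runner": "http://localhost:11434/api/tags",
    "metrics": "http://localhost:9090/metrics",
    "monitor": "http://localhost:9091/health",
}

for name, url in ENDPOINTS.items():
    try:
        up = requests.get(url, timeout=5).ok
    except requests.RequestException:
        up = False
    print(f"{name}: {'up' if up else 'DOWN'}")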

Logs

# View all logs
docker-compose logs -f

# Specific service logs
docker-compose logs -f docker-model-runner
docker-compose logs -f llama-cpp-server

⚡ Performance Tuning

GPU Optimization

# Adjust GPU layers based on VRAM
GPU_LAYERS=35              # For 8GB VRAM
GPU_LAYERS=50              # For 12GB VRAM
GPU_LAYERS=65              # For 16GB+ VRAM

# CPU threading
THREADS=8                  # Match CPU cores
BATCH_SIZE=512            # Increase for better throughput

Memory Management

# Context size affects memory usage
CONTEXT_SIZE=4096         # Standard context
CONTEXT_SIZE=8192         # Larger context (more memory)
CONTEXT_SIZE=2048         # Smaller context (less memory)
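
To see why context size matters, a rough back-of-the-envelope for KV-cache memory (layer count and hidden size below are illustrative placeholders, not values for any specific model):

# Rough KV-cache estimate: 2 (keys + values) * layers * context * hidden
# * bytes per element. Illustrative numbers only.
def kv_cache_mb(context_size, n_layers=32, hidden_dim=4096, bytes_per_elem=2):
    return 2 * n_layers * context_size * hidden_dim * bytes_per_elem / 1024**2

for ctx in (2048, 4096, 8192):
    print(f"CONTEXT_SIZE={ctx}: ~{kv_cache_mb(ctx):,.0f} MB")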

🧪 Testing and Validation

Run Integration Tests

# Test basic connectivity
docker-compose exec docker-model-runner curl -f http://localhost:11434/api/tags

# Test model loading
docker-compose exec docker-model-runner /app/model-runner run ai/smollm2:135M-Q4_K_M "test"

# Test parallel requests
for i in {1..5}; do
  curl -X POST http://localhost:11434/api/generate \
    -H "Content-Type: application/json" \
    -d '{"model": "ai/smollm2:135M-Q4_K_M", "prompt": "test '$i'"}' &
done
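
The same parallel test in Python, which also waits for and reports each response code (a sketch using only the endpoint shown above):

# Sketch: fire 5 generate requests concurrently and report status codes
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:11434/api/generate"

def ask(i):
    payload = {"model": "ai/smollm2:135M-Q4_K_M",
               "prompt": f"test {i}", "stream": False}
    return requests.post(URL, json=payload, timeout=120).status_code

with ThreadPoolExecutor(max_workers=5) as pool:
    print(list(pool.map(ask, range(1, 6))))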

Benchmarking

# Simple benchmark
time curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "ai/smollm2:135M-Q4_K_M", "prompt": "Write a detailed analysis of market trends"}'

🛡️ Security Considerations

Network Security

# Restrict network access
services:
  docker-model-runner:
    networks:
      - internal-network
    # No external ports for internal-only services

networks:
  internal-network:
    internal: true

API Security

# Use API keys (if supported)
MODEL_RUNNER_API_KEY=your-secret-key

# Enable authentication
MODEL_RUNNER_AUTH_ENABLED=true
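
If your runner build enforces an API key, clients must send it with each request. A sketch, assuming a standard bearer-token header (the header name is an assumption; check your runner's documentation):

# Sketch: attach the API key from the environment to each request
# (bearer-token header is an assumption, not a documented contract)
import os
import requests

headers = {"Authorization": f"Bearer {os.environ['MODEL_RUNNER_API_KEY']}"}
print(requests.get("http://localhost:11434/api/tags",
                   headers=headers, timeout=10).status_code)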

📈 Scaling and Production

Multiple GPU Support

# Use multiple GPUs
environment:
  - CUDA_VISIBLE_DEVICES=0,1  # Use GPU 0 and 1
  - GPU_LAYERS=35             # Layers per GPU

Load Balancing

# Multiple model runner instances
services:
  model-runner-1:
    # ... config
    deploy:
      placement:
        constraints:
          - node.labels.gpu==true

  model-runner-2:
    # ... config
    deploy:
      placement:
        constraints:
          - node.labels.gpu==true
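
Without an external load balancer, clients can spread work across the instances themselves. A naive round-robin sketch (service names must match your compose file):

# Sketch: naive client-side round-robin across the two instances above
import itertools
import requests

INSTANCES = itertools.cycle([
    "http://model-runner-1:11434",
    "http://model-runner-2:11434",
])

def generate(prompt, model="ai/smollm2:135M-Q4_K_M"):
    base_url = next(INSTANCES)
    response = requests.post(f"{base_url}/api/generate",
                             json={"model": model, "prompt": prompt},
                             timeout=120)
    response.raise_for_status()
    return response.json()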

🔧 Troubleshooting

Common Issues

  1. GPU not detected

    # Check NVIDIA drivers
    nvidia-smi
    
    # Check Docker GPU support
    docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
    
  2. Port conflicts

    # Check port usage
    netstat -tulpn | grep :11434
    
    # Change ports in model-runner.env
    MODEL_RUNNER_PORT=11435
    
  3. Model loading failures

    # Check available disk space
    df -h
    
    # Check model file permissions
    ls -la models/
    

Debug Commands

# Full service logs
docker-compose logs

# Container resource usage
docker stats

# Model runner debug info
docker-compose exec docker-model-runner /app/model-runner --help

# Test internal connectivity
docker-compose exec trading-dashboard curl http://docker-model-runner:11434/api/tags

📚 Advanced Features

Custom Model Loading

# Load custom GGUF model
docker-compose exec docker-model-runner /app/model-runner pull /models/custom-model.gguf

# Use specific model file
docker-compose exec docker-model-runner /app/model-runner run /models/my-model.gguf "prompt"

Batch Processing

# Process multiple prompts
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2:135M-Q4_K_M",
    "prompt": ["prompt1", "prompt2", "prompt3"],
    "batch_size": 3
  }'
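
If your runner build rejects array prompts, the same effect can be approximated client-side. A fallback sketch (sequential requests, not the runner's native batching):

# Fallback sketch: client-side batching when array prompts are unsupported
import requests

def generate_batch(prompts, model="ai/smollm2:135M-Q4_K_M"):
    results = []
    for prompt in prompts:
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=120,
        )
        response.raise_for_status()
        results.append(response.json())
    return results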

Streaming Responses

# Enable streaming
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2:135M-Q4_K_M",
    "prompt": "long analysis request",
    "stream": true
  }'
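
Consuming the stream from Python looks like the sketch below; newline-delimited JSON with a "response" field per chunk follows the Ollama convention and is an assumption for this runner:

# Sketch: read the streamed response chunk by chunk
# (NDJSON with a "response" field is the Ollama convention; assumed here)
import json
import requests

with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "ai/smollm2:135M-Q4_K_M",
          "prompt": "long analysis request", "stream": True},
    stream=True, timeout=300,
) as response:
    for line in response.iter_lines():
        if line:
            print(json.loads(line).get("response", ""), end="", flush=True)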

This setup provides a complete AI model-serving environment that plugs into your existing trading infrastructure, with support for parallel requests and GPU acceleration.