
Docker Model Runner Integration

This guide shows how to integrate Docker Model Runner with your existing Docker stack for AI-powered trading applications.

📁 Files Overview

  • docker-compose.yml - Main compose file with model runner services
  • docker-compose.model-runner.yml - Standalone model runner configuration
  • model-runner.env - Environment variables for configuration
  • integrate_model_runner.sh - Integration script for existing stacks
  • docker-compose.integration-example.yml - Example integration with trading services

🚀 Quick Start

Option 1: Use with Existing Stack

# Run integration script
./integrate_model_runner.sh

# Start services
docker-compose up -d

# Test API
curl http://localhost:11434/api/tags

Option 2: Standalone Model Runner

# Use dedicated compose file
docker-compose -f docker-compose.model-runner.yml up -d

# Test with specific profile
docker-compose -f docker-compose.model-runner.yml --profile llama-cpp up -d

🔧 Configuration

Environment Variables (model-runner.env)

# AMD GPU Configuration
HSA_OVERRIDE_GFX_VERSION=11.0.0  # AMD GPU version override
GPU_LAYERS=35              # Layers to offload to GPU
THREADS=8                  # CPU threads
BATCH_SIZE=512             # Batch processing size
CONTEXT_SIZE=4096          # Context window size

# API Configuration
MODEL_RUNNER_PORT=11434    # Main API port
LLAMA_CPP_PORT=8000        # Llama.cpp server port
METRICS_PORT=9090          # Metrics endpoint
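
As a quick sanity check, a script can read these values and build the API base URL before wiring them into other services. A minimal sketch in Python, assuming model-runner.env uses plain KEY=VALUE lines:

# Sketch: parse model-runner.env (plain KEY=VALUE lines assumed)
from pathlib import Path

def load_env(path="model-runner.env"):
    config = {}
    for line in Path(path).read_text().splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" in line:
            key, value = line.split("=", 1)
            config[key.strip()] = value.strip()
    return config

config = load_env()
print(f"API at http://localhost:{config.get('MODEL_RUNNER_PORT', '11434')}")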

Ports Exposed

  • 11434 (Docker Model Runner) - Ollama-compatible API
  • 8083 (Docker Model Runner) - Alternative API port
  • 8000 (Llama.cpp Server) - Advanced llama.cpp features
  • 9090 (Metrics) - Prometheus metrics
  • 8050 (Trading Dashboard) - Example dashboard
  • 9091 (Model Monitor) - Performance monitoring

🛠️ Usage Examples

Basic Model Operations

# List available models
curl http://localhost:11434/api/tags

# Pull a model
docker-compose exec docker-model-runner /app/model-runner pull ai/smollm2:135M-Q4_K_M

# Run a model
docker-compose exec docker-model-runner /app/model-runner run ai/smollm2:135M-Q4_K_M "Hello!"

# Pull Hugging Face model
docker-compose exec docker-model-runner /app/model-runner pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF

API Usage

# Generate text (Ollama-compatible API)
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2:135M-Q4_K_M",
    "prompt": "Analyze market trends",
    "temperature": 0.7,
    "max_tokens": 100
  }'

# Chat completion
curl -X POST http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2:135M-Q4_K_M",
    "messages": [{"role": "user", "content": "What is your analysis?"}]
  }'

Integration with Your Services

# Example: Python integration
import requests

class AIModelClient:
    """Minimal client for the model runner's Ollama-compatible HTTP API."""

    def __init__(self, base_url="http://localhost:11434"):
        self.base_url = base_url

    def generate(self, prompt, model="ai/smollm2:135M-Q4_K_M"):
        # Single-prompt completion via /api/generate
        response = requests.post(
            f"{self.base_url}/api/generate",
            json={"model": model, "prompt": prompt},
            timeout=120,
        )
        response.raise_for_status()
        return response.json()

    def chat(self, messages, model="ai/smollm2:135M-Q4_K_M"):
        # Multi-turn conversation via /api/chat
        response = requests.post(
            f"{self.base_url}/api/chat",
            json={"model": model, "messages": messages},
            timeout=120,
        )
        response.raise_for_status()
        return response.json()

# Usage
client = AIModelClient()
analysis = client.generate("Analyze BTC/USDT market")

🔗 Service Integration

With Existing Trading Dashboard

# Add to your existing docker-compose.yml
services:
  your-trading-service:
    # ... your existing config
    environment:
      - MODEL_RUNNER_URL=http://docker-model-runner:11434
    depends_on:
      - docker-model-runner
    networks:
      - model-runner-network

Internal Networking

Services communicate using Docker networks:

  • http://docker-model-runner:11434 - Internal API calls
  • http://llama-cpp-server:8000 - Advanced features
  • http://model-manager:8001 - Management API
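
Inside the compose network, code should prefer the service-name URLs above over localhost. A minimal sketch, assuming MODEL_RUNNER_URL is injected as in the integration example (falling back to the host-mapped port for local runs):

# Resolve the model runner URL from the environment, as set in the
# compose integration example; fall back to the host-mapped port.
import os
import requests

BASE_URL = os.environ.get("MODEL_RUNNER_URL", "http://localhost:11434")

def list_models():
    response = requests.get(f"{BASE_URL}/api/tags", timeout=10)
    response.raise_for_status()
    return response.json()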

📊 Monitoring and Health Checks

Health Endpoints

# Main service health
curl http://localhost:11434/api/tags

# Metrics endpoint
curl http://localhost:9090/metrics

# Model monitor (if enabled)
curl http://localhost:9091/health
curl http://localhost:9091/models
curl http://localhost:9091/performance
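
These endpoints are easy to fold into a small watchdog. A sketch that polls each one and prints up/down status (the endpoint list mirrors the curls above; the monitor endpoint only exists if that service is enabled):

# Sketch: poll the health endpoints above and print up/down status
import requests

ENDPOINTS = {
    "model-runner": "http://localhost:11434/api/tags",
    "metrics": "http://localhost:9090/metrics",
    "monitor": "http://localhost:9091/health",
}

for name, url in ENDPOINTS.items():
    try:
        up = requests.get(url, timeout=5).ok
    except requests.RequestException:
        up = False
    print(f"{name}: {'up' if up else 'DOWN'}")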

Logs

# View all logs
docker-compose logs -f

# Specific service logs
docker-compose logs -f docker-model-runner
docker-compose logs -f llama-cpp-server

⚡ Performance Tuning

GPU Optimization

# Adjust GPU layers based on VRAM
GPU_LAYERS=35              # For 8GB VRAM
GPU_LAYERS=50              # For 12GB VRAM
GPU_LAYERS=65              # For 16GB+ VRAM

# CPU threading
THREADS=8                  # Match CPU cores
BATCH_SIZE=512            # Increase for better throughput

Memory Management

# Context size affects memory usage
CONTEXT_SIZE=4096         # Standard context
CONTEXT_SIZE=8192         # Larger context (more memory)
CONTEXT_SIZE=2048         # Smaller context (less memory)
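
To see why context size matters, a rough back-of-the-envelope for KV-cache memory (layer count and hidden size below are illustrative placeholders, not values for any specific model):

# Rough KV-cache estimate: 2 (keys + values) * layers * context * hidden
# * bytes per element. Illustrative numbers only.
def kv_cache_mb(context_size, n_layers=32, hidden_dim=4096, bytes_per_elem=2):
    return 2 * n_layers * context_size * hidden_dim * bytes_per_elem / 1024**2

for ctx in (2048, 4096, 8192):
    print(f"CONTEXT_SIZE={ctx}: ~{kv_cache_mb(ctx):,.0f} MB")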

🧪 Testing and Validation

Run Integration Tests

# Test basic connectivity
docker-compose exec docker-model-runner curl -f http://localhost:11434/api/tags

# Test model loading
docker-compose exec docker-model-runner /app/model-runner run ai/smollm2:135M-Q4_K_M "test"

# Test parallel requests
for i in {1..5}; do
  curl -X POST http://localhost:11434/api/generate \
    -H "Content-Type: application/json" \
    -d '{"model": "ai/smollm2:135M-Q4_K_M", "prompt": "test '$i'"}' &
done
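
The same parallel test in Python, which also waits for and reports each response code (a sketch using only the endpoint shown above):

# Sketch: fire 5 generate requests concurrently and report status codes
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:11434/api/generate"

def ask(i):
    payload = {"model": "ai/smollm2:135M-Q4_K_M",
               "prompt": f"test {i}", "stream": False}
    return requests.post(URL, json=payload, timeout=120).status_code

with ThreadPoolExecutor(max_workers=5) as pool:
    print(list(pool.map(ask, range(1, 6))))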

Benchmarking

# Simple benchmark
time curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "ai/smollm2:135M-Q4_K_M", "prompt": "Write a detailed analysis of market trends"}'

🛡️ Security Considerations

Network Security

# Restrict network access
services:
  docker-model-runner:
    networks:
      - internal-network
    # No external ports for internal-only services

networks:
  internal-network:
    internal: true

API Security

# Use API keys (if supported)
MODEL_RUNNER_API_KEY=your-secret-key

# Enable authentication
MODEL_RUNNER_AUTH_ENABLED=true
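
If your runner build enforces an API key, clients must send it with each request. A sketch, assuming a standard bearer-token header (the header name is an assumption; check your runner's documentation):

# Sketch: attach the API key from the environment to each request
# (bearer-token header is an assumption, not a documented contract)
import os
import requests

headers = {"Authorization": f"Bearer {os.environ['MODEL_RUNNER_API_KEY']}"}
print(requests.get("http://localhost:11434/api/tags",
                   headers=headers, timeout=10).status_code)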

📈 Scaling and Production

Multiple GPU Support

# Use multiple GPUs
environment:
  - CUDA_VISIBLE_DEVICES=0,1  # Use GPU 0 and 1
  - GPU_LAYERS=35             # Layers per GPU

Load Balancing

# Multiple model runner instances
services:
  model-runner-1:
    # ... config
    deploy:
      placement:
        constraints:
          - node.labels.gpu==true

  model-runner-2:
    # ... config
    deploy:
      placement:
        constraints:
          - node.labels.gpu==true
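
Without an external load balancer, clients can spread work across the instances themselves. A naive round-robin sketch (service names must match your compose file):

# Sketch: naive client-side round-robin across the two instances above
import itertools
import requests

INSTANCES = itertools.cycle([
    "http://model-runner-1:11434",
    "http://model-runner-2:11434",
])

def generate(prompt, model="ai/smollm2:135M-Q4_K_M"):
    base_url = next(INSTANCES)
    response = requests.post(f"{base_url}/api/generate",
                             json={"model": model, "prompt": prompt},
                             timeout=120)
    response.raise_for_status()
    return response.json()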

🔧 Troubleshooting

Common Issues

  1. GPU not detected

    # Check NVIDIA drivers
    nvidia-smi
    
    # Check Docker GPU support
    docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
    
  2. Port conflicts

    # Check port usage
    netstat -tulpn | grep :11434
    
    # Change ports in model-runner.env
    MODEL_RUNNER_PORT=11435
    
  3. Model loading failures

    # Check available disk space
    df -h
    
    # Check model file permissions
    ls -la models/
    

Debug Commands

# Full service logs
docker-compose logs

# Container resource usage
docker stats

# Model runner debug info
docker-compose exec docker-model-runner /app/model-runner --help

# Test internal connectivity
docker-compose exec trading-dashboard curl http://docker-model-runner:11434/api/tags

📚 Advanced Features

Custom Model Loading

# Load custom GGUF model
docker-compose exec docker-model-runner /app/model-runner pull /models/custom-model.gguf

# Use specific model file
docker-compose exec docker-model-runner /app/model-runner run /models/my-model.gguf "prompt"

Batch Processing

# Process multiple prompts
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2:135M-Q4_K_M",
    "prompt": ["prompt1", "prompt2", "prompt3"],
    "batch_size": 3
  }'
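
If your runner build rejects array prompts, the same effect can be approximated client-side. A fallback sketch (sequential requests, not the runner's native batching):

# Fallback sketch: client-side batching when array prompts are unsupported
import requests

def generate_batch(prompts, model="ai/smollm2:135M-Q4_K_M"):
    results = []
    for prompt in prompts:
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=120,
        )
        response.raise_for_status()
        results.append(response.json())
    return results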

Streaming Responses

# Enable streaming
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2:135M-Q4_K_M",
    "prompt": "long analysis request",
    "stream": true
  }'
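
Consuming the stream from Python looks like the sketch below; newline-delimited JSON with a "response" field per chunk follows the Ollama convention and is an assumption for this runner:

# Sketch: read the streamed response chunk by chunk
# (NDJSON with a "response" field is the Ollama convention; assumed here)
import json
import requests

with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "ai/smollm2:135M-Q4_K_M",
          "prompt": "long analysis request", "stream": True},
    stream=True, timeout=300,
) as response:
    for line in response.iter_lines():
        if line:
            print(json.loads(line).get("response", ""), end="", flush=True)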

This setup provides a complete AI model-serving environment that plugs into your existing trading infrastructure, with support for parallel requests and GPU acceleration.