# Docker Model Runner Integration

This guide shows how to integrate Docker Model Runner with your existing Docker stack for AI-powered trading applications.

## 📁 Files Overview

| File | Purpose |
|------|---------|
| `docker-compose.yml` | Main compose file with model runner services |
| `docker-compose.model-runner.yml` | Standalone model runner configuration |
| `model-runner.env` | Environment variables for configuration |
| `integrate_model_runner.sh` | Integration script for existing stacks |
| `docker-compose.integration-example.yml` | Example integration with trading services |

## 🚀 Quick Start

### Option 1: Use with Existing Stack

```bash
# Run integration script
./integrate_model_runner.sh

# Start services
docker-compose up -d

# Test API
curl http://localhost:11434/api/tags
```

### Option 2: Standalone Model Runner

```bash
# Use the dedicated compose file
docker-compose -f docker-compose.model-runner.yml up -d

# Start with a specific profile enabled
docker-compose -f docker-compose.model-runner.yml --profile llama-cpp up -d
```

## 🔧 Configuration

### Environment Variables (`model-runner.env`)

```bash
# AMD GPU Configuration
HSA_OVERRIDE_GFX_VERSION=11.0.0   # AMD GPU version override
GPU_LAYERS=35                     # Layers to offload to GPU
THREADS=8                         # CPU threads
BATCH_SIZE=512                    # Batch processing size
CONTEXT_SIZE=4096                 # Context window size

# API Configuration
MODEL_RUNNER_PORT=11434           # Main API port
LLAMA_CPP_PORT=8000               # Llama.cpp server port
METRICS_PORT=9090                 # Metrics endpoint
```

### Ports Exposed

| Port | Service | Purpose |
|------|---------|---------|
| 11434 | Docker Model Runner | Ollama-compatible API |
| 8083 | Docker Model Runner | Alternative API port |
| 8000 | Llama.cpp Server | Advanced llama.cpp features |
| 9090 | Metrics | Prometheus metrics |
| 8050 | Trading Dashboard | Example dashboard |
| 9091 | Model Monitor | Performance monitoring |

## 🛠️ Usage Examples

### Basic Model Operations

```bash
# List available models
curl http://localhost:11434/api/tags

# Pull a model
docker-compose exec docker-model-runner /app/model-runner pull ai/smollm2:135M-Q4_K_M

# Run a model
docker-compose exec docker-model-runner /app/model-runner run ai/smollm2:135M-Q4_K_M "Hello!"

# Pull a Hugging Face model
docker-compose exec docker-model-runner /app/model-runner pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
```

### API Usage

```bash
# Generate text (Ollama-style endpoint)
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2:135M-Q4_K_M",
    "prompt": "Analyze market trends",
    "temperature": 0.7,
    "max_tokens": 100
  }'

# Chat completion
curl -X POST http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2:135M-Q4_K_M",
    "messages": [{"role": "user", "content": "What is your analysis?"}]
  }'
```

### Integration with Your Services

```python
# Example: Python integration
import requests


class AIModelClient:
    def __init__(self, base_url="http://localhost:11434"):
        self.base_url = base_url

    def generate(self, prompt, model="ai/smollm2:135M-Q4_K_M"):
        response = requests.post(
            f"{self.base_url}/api/generate",
            json={"model": model, "prompt": prompt},
        )
        response.raise_for_status()
        return response.json()

    def chat(self, messages, model="ai/smollm2:135M-Q4_K_M"):
        response = requests.post(
            f"{self.base_url}/api/chat",
            json={"model": model, "messages": messages},
        )
        response.raise_for_status()
        return response.json()


# Usage
client = AIModelClient()
analysis = client.generate("Analyze BTC/USDT market")
```

## 🔗 Service Integration

### With Existing Trading Dashboard

```yaml
# Add to your existing docker-compose.yml
services:
  your-trading-service:
    # ... your existing config
    environment:
      - MODEL_RUNNER_URL=http://docker-model-runner:11434
    depends_on:
      - docker-model-runner
    networks:
      - model-runner-network
```

### Internal Networking

Services communicate over Docker networks using their service names (see the sketch after this list):

- `http://docker-model-runner:11434` - Internal API calls
- `http://llama-cpp-server:8000` - Advanced features
- `http://model-manager:8001` - Management API

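As a minimal sketch of how a containerized service might call the runner over the internal network: the snippet below reads the `MODEL_RUNNER_URL` environment variable (as injected in the compose example above) and falls back to the internal service name. The `/api/tags` path assumes the Ollama-compatible API listed in the ports table.

```python
import os

import requests

# MODEL_RUNNER_URL is injected via docker-compose (see the example above);
# fall back to the internal service name when running inside the network.
BASE_URL = os.environ.get("MODEL_RUNNER_URL", "http://docker-model-runner:11434")


def runner_is_reachable() -> bool:
    """Return True if the model runner answers on its model-listing endpoint."""
    try:
        return requests.get(f"{BASE_URL}/api/tags", timeout=5).ok
    except requests.RequestException:
        return False


if __name__ == "__main__":
    print("model runner reachable:", runner_is_reachable())
```
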
## 📊 Monitoring and Health Checks

### Health Endpoints

```bash
# Main service health
curl http://localhost:11434/api/tags

# Metrics endpoint
curl http://localhost:9090/metrics

# Model monitor (if enabled)
curl http://localhost:9091/health
curl http://localhost:9091/models
curl http://localhost:9091/performance
```

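Startup scripts that depend on the runner can poll the health endpoint until the service is up before issuing requests. A minimal sketch, assuming only the `/api/tags` endpoint shown above:

```python
import time

import requests


def wait_for_runner(base_url="http://localhost:11434", timeout_s=120, interval_s=2):
    """Poll /api/tags until the model runner responds, or raise on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{base_url}/api/tags", timeout=5).ok:
                return
        except requests.RequestException:
            pass  # not up yet; keep polling
        time.sleep(interval_s)
    raise TimeoutError(f"model runner at {base_url} not ready after {timeout_s}s")


wait_for_runner()
print("model runner is ready")
```
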
### Logs

```bash
# View all logs
docker-compose logs -f

# Specific service logs
docker-compose logs -f docker-model-runner
docker-compose logs -f llama-cpp-server
```

## ⚡ Performance Tuning

### GPU Optimization

```bash
# Adjust GPU layers based on available VRAM
GPU_LAYERS=35   # For 8GB VRAM
GPU_LAYERS=50   # For 12GB VRAM
GPU_LAYERS=65   # For 16GB+ VRAM

# CPU threading
THREADS=8       # Match CPU cores
BATCH_SIZE=512  # Increase for better throughput
```

### Memory Management

```bash
# Context size affects memory usage
CONTEXT_SIZE=4096   # Standard context
CONTEXT_SIZE=8192   # Larger context (more memory)
CONTEXT_SIZE=2048   # Smaller context (less memory)
```

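The memory cost of a larger context comes mainly from the KV cache, which grows linearly with context length. A rough back-of-the-envelope sketch of that relationship; the layer, head, and dimension values below are illustrative assumptions, not measurements of any particular model:

```python
def kv_cache_bytes(context_size, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Rough KV-cache size: K and V tensors per layer, one entry per token.

    bytes_per_elem=2 assumes fp16/bf16 cache entries.
    """
    return 2 * n_layers * context_size * n_kv_heads * head_dim * bytes_per_elem


# Illustrative numbers for a small model (assumed, not measured):
for ctx in (2048, 4096, 8192):
    mib = kv_cache_bytes(ctx, n_layers=16, n_kv_heads=8, head_dim=64) / 2**20
    print(f"CONTEXT_SIZE={ctx}: ~{mib:.0f} MiB KV cache")
```
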
## 🧪 Testing and Validation

### Run Integration Tests

```bash
# Test basic connectivity
docker-compose exec docker-model-runner curl -f http://localhost:11434/api/tags

# Test model loading
docker-compose exec docker-model-runner /app/model-runner run ai/smollm2:135M-Q4_K_M "test"

# Test parallel requests (wait collects the backgrounded curls)
for i in {1..5}; do
  curl -X POST http://localhost:11434/api/generate \
    -H "Content-Type: application/json" \
    -d '{"model": "ai/smollm2:135M-Q4_K_M", "prompt": "test '$i'"}' &
done
wait
```

### Benchmarking

```bash
# Simple benchmark
time curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "ai/smollm2:135M-Q4_K_M", "prompt": "Write a detailed analysis of market trends"}'
```

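For more than a single timing, a small client-side benchmark can report latency statistics over repeated requests. A sketch against the same generate endpoint; the sample size and prompt are arbitrary:

```python
import statistics
import time

import requests

URL = "http://localhost:11434/api/generate"
PAYLOAD = {"model": "ai/smollm2:135M-Q4_K_M", "prompt": "Analyze market trends"}

latencies = []
for _ in range(10):  # arbitrary sample size
    start = time.perf_counter()
    response = requests.post(URL, json=PAYLOAD, timeout=120)
    response.raise_for_status()
    latencies.append(time.perf_counter() - start)

print(f"mean:   {statistics.mean(latencies):.2f}s")
print(f"median: {statistics.median(latencies):.2f}s")
print(f"max:    {max(latencies):.2f}s")
```
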
## 🛡️ Security Considerations

### Network Security

```yaml
# Restrict network access
services:
  docker-model-runner:
    networks:
      - internal-network
    # No external ports for internal-only services

networks:
  internal-network:
    internal: true
```

### API Security

```bash
# Use API keys (if supported)
MODEL_RUNNER_API_KEY=your-secret-key

# Enable authentication
MODEL_RUNNER_AUTH_ENABLED=true
```

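If your build enforces the toggles above, clients need to send the key with each request. A sketch under the assumption of bearer-token auth; the header convention is a guess, not a documented contract of the runner, so check your build's docs:

```python
import os

import requests

# MODEL_RUNNER_API_KEY comes from model-runner.env or the service environment.
API_KEY = os.environ["MODEL_RUNNER_API_KEY"]

# Assumed convention: bearer-token auth. Adjust the header to whatever your
# runner build actually expects.
response = requests.post(
    "http://localhost:11434/api/generate",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": "ai/smollm2:135M-Q4_K_M", "prompt": "auth check"},
)
response.raise_for_status()
```
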
## 📈 Scaling and Production

### Multiple GPU Support

```yaml
# Use multiple GPUs
environment:
  - CUDA_VISIBLE_DEVICES=0,1   # Use GPU 0 and 1
  - GPU_LAYERS=35              # Layers per GPU
```

### Load Balancing

```yaml
# Multiple model runner instances
services:
  model-runner-1:
    # ... config
    deploy:
      placement:
        constraints:
          - node.labels.gpu==true

  model-runner-2:
    # ... config
    deploy:
      placement:
        constraints:
          - node.labels.gpu==true
```

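On the client side, a simple way to spread load across the replicas above is round-robin with failover. A minimal sketch; the per-instance URLs are assumptions, since the compose fragment does not show how each replica publishes its port:

```python
import itertools

import requests

# Assumed per-instance URLs; adjust to however your replicas expose ports.
INSTANCES = ["http://model-runner-1:11434", "http://model-runner-2:11434"]
_rotation = itertools.cycle(INSTANCES)


def generate(prompt, model="ai/smollm2:135M-Q4_K_M"):
    """Try each instance in round-robin order, failing over on errors."""
    last_error = None
    for _ in range(len(INSTANCES)):
        base_url = next(_rotation)
        try:
            response = requests.post(
                f"{base_url}/api/generate",
                json={"model": model, "prompt": prompt},
                timeout=60,
            )
            response.raise_for_status()
            return response.json()
        except requests.RequestException as exc:
            last_error = exc  # instance down or erroring; try the next one
    raise RuntimeError("all model runner instances failed") from last_error
```
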
## 🔧 Troubleshooting

### Common Issues

1. **GPU not detected**

   ```bash
   # Check NVIDIA drivers
   nvidia-smi

   # Check Docker GPU support
   docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
   ```

2. **Port conflicts**

   ```bash
   # Check port usage
   netstat -tulpn | grep :11434

   # Change ports in model-runner.env
   MODEL_RUNNER_PORT=11435
   ```

3. **Model loading failures**

   ```bash
   # Check available disk space
   df -h

   # Check model file permissions
   ls -la models/
   ```

### Debug Commands

```bash
# Full service logs
docker-compose logs

# Container resource usage
docker stats

# Model runner debug info
docker-compose exec docker-model-runner /app/model-runner --help

# Test internal connectivity
docker-compose exec trading-dashboard curl http://docker-model-runner:11434/api/tags
```

## 📚 Advanced Features

### Custom Model Loading

```bash
# Load a custom GGUF model
docker-compose exec docker-model-runner /app/model-runner pull /models/custom-model.gguf

# Use a specific model file
docker-compose exec docker-model-runner /app/model-runner run /models/my-model.gguf "prompt"
```

### Batch Processing

```bash
# Process multiple prompts
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2:135M-Q4_K_M",
    "prompt": ["prompt1", "prompt2", "prompt3"],
    "batch_size": 3
  }'
```

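If your runner build does not accept an array `prompt`, a similar effect can be approximated client-side by issuing the requests concurrently. A sketch using a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:11434/api/generate"
MODEL = "ai/smollm2:135M-Q4_K_M"


def generate(prompt):
    response = requests.post(URL, json={"model": MODEL, "prompt": prompt}, timeout=120)
    response.raise_for_status()
    return response.json()


prompts = ["prompt1", "prompt2", "prompt3"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(generate, prompts))
```
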
### Streaming Responses

```bash
# Enable streaming
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2:135M-Q4_K_M",
    "prompt": "long analysis request",
    "stream": true
  }'
```

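With `"stream": true`, Ollama-style endpoints typically return one JSON object per line. A minimal consumer sketch under that assumption; the `response` and `done` field names follow the Ollama convention and may differ in your build:

```python
import json

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "ai/smollm2:135M-Q4_K_M",
        "prompt": "long analysis request",
        "stream": True,
    },
    stream=True,  # let requests yield the body incrementally
    timeout=300,
)
resp.raise_for_status()

for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    # Assumed Ollama-style fields: "response" carries the generated text,
    # "done" marks the final chunk.
    print(chunk.get("response", ""), end="", flush=True)
    if chunk.get("done"):
        break
print()
```
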
This integration provides a complete AI model-serving environment that plugs into your existing trading infrastructure, with parallel request handling and GPU acceleration built in.