gogo2/GPU_SETUP_SUMMARY.md

# GPU Setup Summary - 2025-11-12

## Problem

Training was using CPU instead of GPU on AMD Strix Halo system (Radeon 8050S/8060S Graphics).

**Root Cause:** PyTorch was installed with CPU-only version (`2.8.0+cpu`), not GPU support.

## Solution

**Use Docker with pre-configured ROCm** instead of installing ROCm directly on the host system.

### Why Docker?

1. ✅ Pre-configured ROCm environment
2. ✅ No package conflicts with host system
3. ✅ Easier to update and maintain
4. ✅ Consistent environment across machines
5. ✅ Better isolation

## What Was Created

### 1. Documentation

📄 **`docs/AMD_STRIX_HALO_DOCKER.md`**
- Complete Docker setup guide
- ROCm driver installation
- Performance tuning
- Troubleshooting
- Strix Halo-specific optimizations

### 2. Docker Files

📄 **`Dockerfile.rocm`**
- Based on `rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0`
- Pre-configured with all project dependencies
- Optimized for AMD RDNA 3.5 (Strix Halo)
- Health checks for GPU availability

📄 **`docker-compose.rocm.yml`**
- GPU device mapping (`/dev/kfd`, `/dev/dri`)
- Memory limits and shared memory (8GB)
- Port mappings for all dashboards
- Environment variables for ROCm optimization
- Includes TensorBoard and Redis services

### 3. Helper Scripts

📄 **`scripts/start-docker-rocm.sh`**
- One-command Docker setup
- Checks Docker installation
- Verifies GPU devices
- Builds and starts containers
- Shows access URLs

### 4. Requirements Update

📄 **`requirements.txt`**
- Removed `torchvision` and `torchaudio` (not needed for trading)
- Added note about Docker for AMD GPUs
- CPU PyTorch as default for development

### 5. README Updates

📄 **`readme.md`**
- Added "AMD GPU Docker Setup" section
- Quick start commands
- Performance metrics
- Link to full documentation

## Quick Start

### For CPU Development (Current Setup)

```bash
# Already installed
python ANNOTATE/web/app.py
```

Training will use CPU (slower but works).

### For GPU Training (Docker)

```bash
# One-command setup
./scripts/start-docker-rocm.sh

# Enter container
docker exec -it gogo2-rocm-training bash

# Inside container
python ANNOTATE/web/app.py
```

Access at: `http://localhost:8051`

## Performance Expected

On AMD Strix Halo (Radeon 8050S/8060S):

| Task | CPU | GPU (Docker+ROCm) | Speedup |
|------|-----|-------------------|---------|
| Training | Baseline | 2-3x faster | 2-3x |
| Inference | Baseline | 5-10x faster | 5-10x |

## Files Modified

```
Modified:
  - requirements.txt
  - readme.md

Created:
  - docs/AMD_STRIX_HALO_DOCKER.md
  - Dockerfile.rocm
  - docker-compose.rocm.yml
  - scripts/start-docker-rocm.sh
  - GPU_SETUP_SUMMARY.md (this file)
```

## Next Steps

### To Use GPU Training:

1. **Install Docker** (if not already):
   ```bash
   sudo apt install docker.io docker-compose
   sudo usermod -aG docker $USER
   newgrp docker
   ```

2. **Install ROCm Drivers** (host system only):
   ```bash
   wget https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/jammy/amdgpu-install_6.2.60204-1_all.deb
   sudo dpkg -i amdgpu-install_*.deb
   sudo amdgpu-install --usecase=graphics,rocm --no-dkms -y
   sudo reboot
   ```

3. **Build and Run**:
   ```bash
   ./scripts/start-docker-rocm.sh
   ```

4. **Verify GPU Works**:
   ```bash
   docker exec -it gogo2-rocm-training bash
   rocm-smi
   python3 -c "import torch; print(torch.cuda.is_available())"
   ```

### To Continue with CPU:

No changes needed! Current setup works on CPU.

## Important Notes

1. **Don't install ROCm PyTorch in venv** - Use Docker instead
2. **torchvision/torchaudio not needed** - Only `torch` for trading
3. **Strix Halo is VERY NEW** - ROCm support is experimental but works
4. **iGPU shares memory with CPU** - Adjust batch sizes accordingly
5. **Docker is recommended** - Cleaner than host installation

## Documentation

- Full guide: `docs/AMD_STRIX_HALO_DOCKER.md`
- Quick start: `readme.md` → "AMD GPU Docker Setup"
- Docker compose: `docker-compose.rocm.yml`
- Start script: `scripts/start-docker-rocm.sh`

---

**Status:** ✅ Documented and ready to use
**Date:** 2025-11-12
**System:** AMD Strix Halo (Radeon 8050S/8060S Graphics, RDNA 3.5)