try fixing GPU (torch)

Author: Dobromir Popov
Date: 2025-11-17 13:06:37 +02:00
parent 4fcadcdbff
commit 43a7d75daf
9 changed files with 1393 additions and 11 deletions


@@ -0,0 +1,186 @@
# Using Existing ROCm Container for Development
## Current Status
**You already have ROCm PyTorch working on the host!**
```bash
PyTorch: 2.5.1+rocm6.2
CUDA available: True
Device: AMD Radeon Graphics (Strix Halo)
Memory: 47.0 GB
```
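The figures above can be reproduced from the host venv with a short check like this (a minimal sketch; paths are taken from this document, and on ROCm builds `torch.cuda.*` is backed by HIP):
```bash
source /mnt/shared/DEV/repos/d-popov.com/gogo2/venv/bin/activate
python - <<'EOF'
import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")  # True on ROCm builds (HIP backend)
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"Device: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {props.total_memory / 1024**3:.1f} GB")
EOF
```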
## Recommendation: Use Host Environment
**Since your host venv already has ROCm support working, this is the simplest option:**
```bash
cd /mnt/shared/DEV/repos/d-popov.com/gogo2
source venv/bin/activate
python ANNOTATE/web/app.py
```
**Benefits:**
- ✅ Already configured
- ✅ No container overhead
- ✅ Direct file access
- ✅ GPU works perfectly
## Alternative: Use Existing Container
You have these containers running:
- `amd-strix-halo-llama-rocm` - ROCm 7rc (port 8080)
- `amd-strix-halo-llama-vulkan-radv` - Vulkan RADV (port 8081)
- `amd-strix-halo-llama-vulkan-amdvlk` - Vulkan AMDVLK (port 8082)
### Option 1: Quick Attach Script
```bash
./scripts/attach-to-rocm-container.sh
```
This script will:
1. Check if project is accessible in container
2. Offer to copy project if needed
3. Check/install Python if needed
4. Check/install PyTorch if needed
5. Attach you to a bash shell
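A rough sketch of what such a helper does (this is **not** the actual script in the repo; the container name, paths, and the Fedora/dnf assumption are taken from this document):
```bash
#!/usr/bin/env bash
# Sketch of an attach helper; adjust names to match your setup.
set -euo pipefail

CONTAINER="amd-strix-halo-llama-rocm"
HOST_PROJECT="/mnt/shared/DEV/repos/d-popov.com/gogo2"
TARGET="/workspace/gogo2"

# 1. Copy the project into the container if it is not there yet
if ! docker exec "$CONTAINER" test -d "$TARGET"; then
    docker exec "$CONTAINER" mkdir -p /workspace
    docker cp "$HOST_PROJECT" "$CONTAINER":/workspace/
fi

# 2. Install Python if the container does not have it
if ! docker exec "$CONTAINER" python3 --version >/dev/null 2>&1; then
    docker exec "$CONTAINER" dnf install -y python3.12 python3-pip
fi

# 3. Install PyTorch with ROCm support if it is missing
if ! docker exec "$CONTAINER" python3 -c "import torch" >/dev/null 2>&1; then
    docker exec "$CONTAINER" pip3 install torch --index-url https://download.pytorch.org/whl/rocm6.2
fi

# 4. Drop into a shell in the project directory
docker exec -it -w "$TARGET" "$CONTAINER" bash
```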
### Option 2: Manual Setup
#### A. Copy Project to Container
```bash
# Create workspace in container
docker exec amd-strix-halo-llama-rocm mkdir -p /workspace
# Copy project
docker cp /mnt/shared/DEV/repos/d-popov.com/gogo2 amd-strix-halo-llama-rocm:/workspace/
# Enter container
docker exec -it amd-strix-halo-llama-rocm bash
```
#### B. Install Python (if needed)
Inside container:
```bash
# Fedora-based container
dnf install -y python3.12 python3-pip python3-devel git
# Create symlinks
ln -sf /usr/bin/python3.12 /usr/bin/python3
ln -sf /usr/bin/python3.12 /usr/bin/python
```
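A quick check inside the container confirms the symlinks resolve as expected:
```bash
python3 --version   # should report Python 3.12.x
pip3 --version
```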
#### C. Install Dependencies
Inside container:
```bash
cd /workspace/gogo2
# Install PyTorch with ROCm
pip3 install torch --index-url https://download.pytorch.org/whl/rocm6.2
# Install project dependencies
pip3 install -r requirements.txt
```
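A one-liner to confirm that the ROCm build of PyTorch sees the GPU from inside the container (a sketch; expect `True` and the Strix Halo device name if `/dev/kfd` and `/dev/dri` are passed through):
```bash
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no GPU visible')"
```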
#### D. Run Application
```bash
# Run ANNOTATE dashboard
python3 ANNOTATE/web/app.py
# Or run training
python3 training_runner.py --mode realtime --duration 4
```
### Option 3: Mount Project on Container Restart
Add volume mount to your docker-compose:
```yaml
services:
amd-strix-halo-llama-rocm:
volumes:
- /mnt/shared/DEV/repos/d-popov.com/gogo2:/workspace/gogo2:rw
```
Then restart:
```bash
docker-compose down
docker-compose up -d
```
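If the container ever has to be recreated from scratch, remember that ROCm also needs the GPU devices passed through. A sketch of a fuller service definition (the image name and group names are placeholders, not taken from your compose file):
```yaml
services:
  amd-strix-halo-llama-rocm:
    image: your-rocm-image          # placeholder - keep the image the container already uses
    devices:
      - /dev/kfd                    # ROCm compute interface
      - /dev/dri                    # GPU render nodes
    group_add:
      - video
      - render
    volumes:
      - /mnt/shared/DEV/repos/d-popov.com/gogo2:/workspace/gogo2:rw
```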
## Port Conflicts
Your ROCm container uses port 8080, which conflicts with the COBY API.
**Solutions:**
1. **Use host environment** (no conflict)
2. **Change ANNOTATE port** in container:
```bash
python3 ANNOTATE/web/app.py --port 8051
```
3. **Expose a different port** when starting the container (see the sketch below)
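For option 3, the port mapping is chosen when (re)creating the container, roughly like this (a sketch; the image name is a placeholder):
```bash
docker run -d --name amd-strix-halo-llama-rocm \
  --device=/dev/kfd --device=/dev/dri \
  -p 8051:8051 \
  -v /mnt/shared/DEV/repos/d-popov.com/gogo2:/workspace/gogo2:rw \
  your-rocm-image
```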
## Comparison
| Aspect | Host (venv) | Container |
|--------|-------------|-----------|
| Setup | ✅ Already done | ⚠️ Needs Python install |
| GPU | ✅ Working | ✅ Should work |
| Files | ✅ Direct access | ⚠️ Need to copy/mount |
| Performance | ✅ Native | ⚠️ Small overhead |
| Isolation | ⚠️ Shares host | ✅ Isolated |
| Simplicity | ✅ Just works | ⚠️ Extra steps |
## Quick Commands
### Host Development (Recommended)
```bash
cd /mnt/shared/DEV/repos/d-popov.com/gogo2
source venv/bin/activate
python ANNOTATE/web/app.py
```
### Container Development
```bash
# Method 1: Use helper script
./scripts/attach-to-rocm-container.sh
# Method 2: Manual attach
docker exec -it amd-strix-halo-llama-rocm bash
cd /workspace/gogo2
python3 ANNOTATE/web/app.py
```
### Check GPU in Container
```bash
docker exec amd-strix-halo-llama-rocm rocm-smi
docker exec amd-strix-halo-llama-rocm python3 -c "import torch; print(torch.cuda.is_available())"
```
## Summary
**For your use case (avoiding heavy downloads): use the host environment.** Your venv already has ROCm PyTorch working.
**Only use container if you need:**
- Complete isolation from host
- Specific ROCm version testing
- Multiple parallel environments
---
**Last Updated:** 2025-11-12
**Status:** Host venv with ROCm 6.2 is ready to use