mines/rin/miner/GPU_PERFORMANCE_ANALYSIS.md
Dobromir Popov b475590b61 gpu optimizations
2025-09-06 14:20:19 +03:00


# GPU Performance Analysis & Optimization
## 🔍 **Performance Bottleneck Discovery**
### Initial Problem:
- **CPU Mining**: 294 kH/s (4 threads)
- **GPU Mining**: 132 H/s (1,024 threads)
- **Performance Gap**: GPU is about **2,200x slower** overall (294,000 / 132 ≈ 2,227), and roughly 570,000x slower per thread
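
The gap can be checked with a few lines of arithmetic. The sketch below uses only the throughput and thread counts quoted above; the constant and function names are illustrative and do not come from the miner source.

```cpp
#include <cstdio>

// Throughput figures quoted above.
constexpr double kCpuHashesPerSec = 294000.0;  // 294 kH/s on 4 CPU threads
constexpr double kGpuHashesPerSec = 132.0;     // 132 H/s on 1,024 GPU threads
constexpr int kCpuThreads = 4;
constexpr int kGpuThreads = 1024;

// Overall ratio: total CPU throughput divided by total GPU throughput.
constexpr double overall_gap() {
    return kCpuHashesPerSec / kGpuHashesPerSec;
}

// Per-thread ratio: the same comparison after dividing each side
// by its thread count.
constexpr double per_thread_gap() {
    return (kCpuHashesPerSec / kCpuThreads) / (kGpuHashesPerSec / kGpuThreads);
}

// Prints both ratios; call this from main() to reproduce the figures.
void print_gaps() {
    std::printf("overall gap:    %.0fx\n", overall_gap());
    std::printf("per-thread gap: %.0fx\n", per_thread_gap());
}
```

The overall ratio is the one that matters for mining income; the per-thread ratio mainly shows how little work each launched GPU thread was doing.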
### Root Cause Analysis:
#### ❌ **GPU Implementation Issues Found:**
1. **Memory Allocation Per Hash**
   - The GPU code path called `hipMalloc()`/`hipFree()` for **every single hash**
   - Each allocation adds ~100 μs of overhead
   - **Solution**: ✅ Implemented memory caching with reuse
2. **Single-Thread GPU Utilization**
   - Kernel used only **1 thread out of 1,024** (`if (threadIdx.x == 0)`)
   - The other 1,023 threads sat completely idle
   - **Solution**: ✅ Reduced to minimal 32-thread kernel for lower latency
3. **Sequential Algorithm Nature**
   - RinHash: BLAKE3 → Argon2d → SHA3 (inherently sequential)
   - A single hash can't be parallelized effectively across multiple threads
   - **Reality**: GPU isn't optimal for this algorithm type
### Current Optimization Status:
#### ✅ **Optimizations Implemented:**
1. **Memory Caching**
```cpp
static uint8_t *d_input_cache = nullptr; // Reused across calls
static uint8_t *d_output_cache = nullptr; // No allocation per hash
static block *d_memory_cache = nullptr; // Persistent Argon2 memory
```
2. **Minimal Kernel Launch**
```cpp
dim3 blocks(1); // Single block
dim3 threads_per_block(32); // Minimal threads for low latency
```
3. **Reduced Memory Footprint**
```cpp
hipMalloc(&d_input_cache, 80); // Fixed 80-byte headers
hipMalloc(&d_output_cache, 32); // 32-byte outputs
hipMalloc(&d_memory_cache, 64 * sizeof(block)); // Argon2 workspace
```
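
The caching pattern above amounts to "allocate on first use, reuse on every later call". The following is a minimal host-only sketch of that idea, not the code in `rinhash.hip.cu`. `CachedBuffer`, `counting_alloc`, `counting_free`, and `hash_once` are hypothetical names, and the allocator wraps `malloc()`/`free()` so the example compiles without ROCm; in the HIP build the same hooks would wrap `hipMalloc()`/`hipFree()`.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical allocator hooks. In the HIP build these would wrap
// hipMalloc()/hipFree(); here they wrap malloc()/free() and count calls
// so the sketch compiles and runs on a host without ROCm.
static int g_alloc_calls = 0;

void *counting_alloc(std::size_t bytes) {
    ++g_alloc_calls;
    return std::malloc(bytes);
}

void counting_free(void *ptr) { std::free(ptr); }

// Lazily allocated buffer that is reused across calls (illustrative name).
class CachedBuffer {
public:
    explicit CachedBuffer(std::size_t bytes) : bytes_(bytes) {}
    ~CachedBuffer() { if (ptr_) counting_free(ptr_); }
    CachedBuffer(const CachedBuffer &) = delete;
    CachedBuffer &operator=(const CachedBuffer &) = delete;

    // Allocates on the first call only; later calls return the same pointer.
    void *get() {
        if (ptr_ == nullptr) ptr_ = counting_alloc(bytes_);
        return ptr_;
    }

private:
    std::size_t bytes_;
    void *ptr_ = nullptr;
};

// Stand-in for the per-hash entry point: it touches the cached buffers
// instead of allocating new ones. Returns false if an allocation failed.
bool hash_once(CachedBuffer &input, CachedBuffer &output) {
    return input.get() != nullptr && output.get() != nullptr;
}
```

Calling `hash_once` a thousand times with an 80-byte input buffer and a 32-byte output buffer triggers two allocations in total, one per buffer, instead of two per hash.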
## 📊 **Expected Performance After Optimization**
| Configuration | Before | After | Estimated improvement |
|---------------|--------|-------|-----------------------|
| **Memory Alloc** | Per-hash | Cached | **~100x faster** |
| **GPU Threads** | 1,024 (1 active) | 32 (optimized) | **32x fewer threads launched** |
| **Kernel Launch** | High overhead | Minimal | **~10x faster** |
### Realistic Performance Target:
- **Previous**: 132 H/s
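- **Optimized**: ~5-15 kH/s (estimated, roughly 40-110x the previous rate)
- **CPU Still Faster**: At 294 kH/s the CPU stays about 20-60x ahead of this estimate, because the sequential algorithm favors CPU threads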
## 🚀 **Build Commands for Optimized Version**
```bash
cd /mnt/shared/DEV/repos/d-popov.com/mines/rin/miner/gpu/RinHash-hip
# Compile optimized kernel
/opt/rocm-6.4.3/bin/hipcc -c -O3 -fPIC rinhash.hip.cu -o build/rinhash.o
/opt/rocm-6.4.3/bin/hipcc -c -O3 -fPIC sha3-256.hip.cu -o build/sha3-256.o
# Link optimized library
/opt/rocm-6.4.3/bin/hipcc -shared -O3 build/rinhash.o build/sha3-256.o \
-o rocm-direct-output/gpu-libs/librinhash_hip.so \
-L/opt/rocm-6.4.3/lib -lamdhip64
# Install system-wide
sudo cp rocm-direct-output/gpu-libs/librinhash_hip.so /usr/local/lib/
sudo ldconfig
```
## 🔬 **Technical Analysis**
### Why GPU Struggles with RinHash:
1. **Algorithm Characteristics**:
   - **Sequential dependency chain**: Each step needs the previous step's result
   - **Memory-bound operations**: Argon2d requires significant memory bandwidth
   - **Small data sizes**: 80-byte headers don't saturate GPU throughput
2. **GPU Architecture Mismatch**:
   - **GPU Optimal**: Parallel, compute-intensive, large datasets
   - **RinHash Reality**: Sequential, memory-bound, small datasets
   - **CPU Advantage**: Better single-thread performance, lower latency
3. **Overhead vs. Compute Ratio**:
   - **GPU Overhead**: Kernel launch + memory transfers + sync
   - **Actual Compute**: ~100μs of hash operations
   - **CPU**: Direct function calls, with no kernel-launch or transfer overhead
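
The overhead-to-compute ratio can be made concrete with a small cost model. This is a sketch under stated assumptions, not a measurement: it assumes a fixed cost per launch, a fixed compute time per hash, and that up to `lanes` hashes run side by side without contending for memory bandwidth. `LaunchModel` and `hashes_per_second` are illustrative names.

```cpp
#include <cstdint>

// Simple launch-cost model (an assumption, not a measurement):
//   time per launch = fixed overhead + waves * per-hash compute time,
// where `waves` is how many rounds are needed when `lanes` hashes can
// run side by side.
struct LaunchModel {
    double overhead_us;   // kernel launch + transfers + sync, per launch
    double compute_us;    // time to compute one hash on one GPU thread
    std::uint32_t lanes;  // hashes that can run concurrently
};

// Estimated hashes per second when `batch` hashes share one launch.
double hashes_per_second(const LaunchModel &m, std::uint32_t batch) {
    if (batch == 0 || m.lanes == 0) return 0.0;
    const std::uint32_t waves = (batch + m.lanes - 1) / m.lanes;  // ceil division
    const double launch_time_us = m.overhead_us + waves * m.compute_us;
    return batch * 1e6 / launch_time_us;
}
```

With placeholder inputs of 900 μs overhead, 100 μs compute, and 1,024 lanes, the model gives 1,000 H/s for one hash per launch and 1,024,000 H/s for 1,024 hashes per launch. Argon2d's memory traffic would pull real numbers well below that ideal, but the model shows why one hash per launch leaves the fixed overhead dominant.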
## 💡 **Recommendations**
### For Maximum Performance:
1. **Use CPU mining** (`-a rinhash`) for RinHash algorithm
2. **Reserve GPU** for algorithms with massive parallelization potential
3. **Hybrid approach**: CPU for RinHash, GPU for other algorithms
### When to Use GPU:
- **Batch processing**: Multiple hashes simultaneously
- **Different algorithms**: SHA256, Scrypt, Ethash (more GPU-friendly)
- **Large-scale operations**: When latency isn't critical
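
Batch processing here means giving every GPU thread its own nonce instead of gating the work on `threadIdx.x == 0`. The sketch below emulates that indexing on the host so it runs without ROCm. `toy_hash` is a placeholder and not RinHash, and `hash_batch` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder mixing function standing in for RinHash (NOT the real algorithm).
std::uint32_t toy_hash(std::uint32_t nonce) {
    std::uint32_t x = nonce * 2654435761u;  // multiplicative scramble
    return x ^ (x >> 16);
}

// Host-side emulation of a one-nonce-per-thread launch. On the GPU the two
// loops disappear: each thread derives `idx` from blockIdx.x, blockDim.x and
// threadIdx.x, then hashes its own nonce independently.
std::vector<std::uint32_t> hash_batch(std::uint32_t base_nonce,
                                      std::uint32_t blocks,
                                      std::uint32_t threads_per_block) {
    std::vector<std::uint32_t> out(static_cast<std::size_t>(blocks) * threads_per_block);
    for (std::uint32_t block = 0; block < blocks; ++block) {
        for (std::uint32_t thread = 0; thread < threads_per_block; ++thread) {
            const std::uint32_t idx = block * threads_per_block + thread;
            out[idx] = toy_hash(base_nonce + idx);  // every thread does useful work
        }
    }
    return out;
}
```

For RinHash each thread would also need its own Argon2d workspace, so the achievable batch size is bounded by device memory rather than by thread count alone.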
The optimized GPU implementation is now **available for testing**, but CPU remains the optimal choice for RinHash mining due to algorithmic characteristics.