gpu optimizations
rin/miner/GPU_PERFORMANCE_ANALYSIS.md
# GPU Performance Analysis & Optimization

## 🔍 **Performance Bottleneck Discovery**

### Initial Problem:
- **CPU Mining**: 294 kH/s (4 threads)
- **GPU Mining**: 132 H/s (1,024 threads)
- **Performance Gap**: GPU is roughly **2,200x slower** in total throughput (294,000 H/s vs. 132 H/s), despite launching 256x more threads

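The gap can be checked with a little arithmetic. The sketch below uses only the two throughput figures and thread counts quoted above; the function and constant names are illustrative and not part of the miner. It separates the total-throughput ratio (about 2,227x) from the per-thread ratio, which is 256 times larger (about 570,000x), although the per-thread figure is mostly a curiosity here because only one GPU thread did any work.

```cpp
#include <cassert>

// Figures taken from the measurements quoted above.
constexpr double kCpuHashesPerSec = 294000.0;  // 294 kH/s on 4 CPU threads
constexpr double kGpuHashesPerSec = 132.0;     // 132 H/s on 1,024 GPU threads
constexpr int kCpuThreads = 4;
constexpr int kGpuThreads = 1024;

// How many times faster the CPU run is in total throughput.
constexpr double total_gap() {
    return kCpuHashesPerSec / kGpuHashesPerSec;
}

// How many times faster one CPU thread is than one launched GPU thread.
constexpr double per_thread_gap() {
    return (kCpuHashesPerSec / kCpuThreads) /
           (kGpuHashesPerSec / kGpuThreads);
}
```

Calling `total_gap()` gives a value just above 2,227, and `per_thread_gap()` gives a value just above 570,000.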
### Root Cause Analysis:

#### ❌ **GPU Implementation Issues Found:**

1. **Memory Allocation Per Hash**
   - The GPU path called `hipMalloc()`/`hipFree()` for **every single hash**
   - Each memory allocation = ~100μs overhead
   - **Solution**: ✅ Implemented memory caching with reuse

2. **Single-Thread GPU Utilization**
   - Kernel used only **1 thread out of 1,024** (`if (threadIdx.x == 0)`)
   - The other 1,023 threads sat completely idle
   - **Mitigation**: ✅ Reduced to a minimal 32-thread kernel for lower latency (this trims idle threads; it does not parallelize the hash)

3. **Sequential Algorithm Nature**
   - RinHash: BLAKE3 → Argon2d → SHA3 (inherently sequential)
   - A single hash can't be split effectively across multiple threads
   - **Reality**: GPU isn't optimal for this algorithm type

### Current Optimization Status:

#### ✅ **Optimizations Implemented:**

1. **Memory Caching**
   ```cpp
   // Device buffers allocated once and reused across calls (HIP C++, hence nullptr)
   static uint8_t *d_input_cache = nullptr;   // Reused across calls
   static uint8_t *d_output_cache = nullptr;  // No allocation per hash
   static block *d_memory_cache = nullptr;    // Persistent Argon2 memory
   ```

2. **Minimal Kernel Launch**
   ```cpp
   dim3 blocks(1);              // Single block
   dim3 threads_per_block(32);  // Minimal threads for low latency
   ```

3. **Reduced Memory Footprint**
   ```cpp
   // One-time allocations; each call returns a hipError_t that should be checked
   hipMalloc(&d_input_cache, 80);                   // Fixed 80-byte headers
   hipMalloc(&d_output_cache, 32);                  // 32-byte outputs
   hipMalloc(&d_memory_cache, 64 * sizeof(block));  // Argon2 workspace
   ```

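The allocate-once pattern behind the memory caching can be sketched on the host alone. This is not the miner's actual code: `device_alloc()` and `device_free()` are hypothetical stand-ins for `hipMalloc()`/`hipFree()` (backed by `malloc` so the example runs without ROCm), and `CachedBuffer` and `prepare_buffers()` are illustrative names. The call counter shows that repeated hashing triggers only the initial allocations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-ins for hipMalloc()/hipFree(); they use host memory
// so the sketch runs without a GPU. The counter tracks real allocations.
int g_alloc_calls = 0;

void* device_alloc(std::size_t bytes) {
    ++g_alloc_calls;
    return std::malloc(bytes);
}

void device_free(void* p) { std::free(p); }

// One cached buffer: allocated on first use, reused afterwards.
struct CachedBuffer {
    void* ptr = nullptr;
    std::size_t capacity = 0;

    // Returns a buffer of at least `bytes`; allocates only when the cache
    // is empty or too small for the request.
    void* get(std::size_t bytes) {
        if (ptr == nullptr || capacity < bytes) {
            device_free(ptr);  // freeing nullptr is a no-op
            ptr = device_alloc(bytes);
            capacity = (ptr != nullptr) ? bytes : 0;
        }
        return ptr;
    }

    // Explicit teardown, e.g. when the miner shuts down.
    void release() {
        device_free(ptr);
        ptr = nullptr;
        capacity = 0;
    }
};

CachedBuffer g_input;
CachedBuffer g_output;

// Hypothetical per-hash setup: requests its buffers on every call, but
// only the first call actually allocates.
bool prepare_buffers() {
    return g_input.get(80) != nullptr     // 80-byte block header
        && g_output.get(32) != nullptr;   // 32-byte hash output
}
```

Calling `prepare_buffers()` a thousand times leaves `g_alloc_calls` at 2, one allocation per buffer. In a real HIP build the stand-ins would be replaced by the runtime calls and their `hipError_t` results checked.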
## 📊 **Expected Performance After Optimization**

| Configuration | Before | After | Expected Improvement |
|---------------|--------|-------|----------------------|
| **Memory Alloc** | Per-hash | Cached | **~100x faster** (estimated) |
| **GPU Threads** | 1,024 (1 active) | 32 (optimized) | **32x fewer threads launched** |
| **Kernel Launch** | High overhead | Minimal | **~10x faster** (estimated) |

### Realistic Performance Target:
- **Previous**: 132 H/s
- **Optimized**: ~5-15 kH/s (estimated)
- **CPU Still Faster**: Sequential algorithm favors CPU threads

## 🚀 **Build Commands for Optimized Version**

```bash
cd /mnt/shared/DEV/repos/d-popov.com/mines/rin/miner/gpu/RinHash-hip

# Create the output directories if they do not exist yet
mkdir -p build rocm-direct-output/gpu-libs

# Compile optimized kernel
/opt/rocm-6.4.3/bin/hipcc -c -O3 -fPIC rinhash.hip.cu -o build/rinhash.o
/opt/rocm-6.4.3/bin/hipcc -c -O3 -fPIC sha3-256.hip.cu -o build/sha3-256.o

# Link optimized library
/opt/rocm-6.4.3/bin/hipcc -shared -O3 build/rinhash.o build/sha3-256.o \
    -o rocm-direct-output/gpu-libs/librinhash_hip.so \
    -L/opt/rocm-6.4.3/lib -lamdhip64

# Install system-wide
sudo cp rocm-direct-output/gpu-libs/librinhash_hip.so /usr/local/lib/
sudo ldconfig
```

## 🔬 **Technical Analysis**

### Why GPU Struggles with RinHash:

1. **Algorithm Characteristics**:
   - **Sequential dependency chain**: Each step needs the previous step's result
   - **Memory-bound operations**: Argon2d requires significant memory bandwidth
   - **Small data sizes**: 80-byte headers don't saturate GPU throughput

2. **GPU Architecture Mismatch**:
   - **GPU Optimal**: Parallel, compute-intensive, large datasets
   - **RinHash Reality**: Sequential, memory-bound, small datasets
   - **CPU Advantage**: Better single-thread performance, lower latency

3. **Overhead vs. Compute Ratio**:
   - **GPU Overhead**: Kernel launch + memory transfers + sync
   - **Actual Compute**: ~100μs of hash operations
   - **CPU**: Direct function calls, no launch or transfer overhead

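The overhead-versus-compute point can be made concrete with a deliberately simple model. It assumes each launch costs a fixed overhead plus the compute time, and (optimistically) that every hash in a batch runs concurrently; the function names are illustrative. Taking the ~100μs compute figure above at face value, the measured 132 H/s corresponds to about 7,576μs per hash, which would leave roughly 7,476μs of overhead per hash before optimization.

```cpp
#include <cassert>

// Hashes per second when `batch` independent hashes share one kernel launch.
// Simplifying assumption: all hashes in the batch run concurrently, so one
// launch costs overhead_us + compute_us regardless of batch size.
double hashes_per_second(double overhead_us, double compute_us, int batch) {
    assert(batch > 0 && overhead_us >= 0.0 && compute_us > 0.0);
    return batch * 1e6 / (overhead_us + compute_us);
}

// Per-hash overhead implied by a measured single-hash-per-launch rate,
// given an assumed compute time per hash.
double implied_overhead_us(double measured_hps, double compute_us) {
    assert(measured_hps > 0.0);
    return 1e6 / measured_hps - compute_us;
}
```

Under this model, `implied_overhead_us(132.0, 100.0)` is a little under 7,476, and throughput scales linearly with batch size because the overhead is paid once per launch rather than once per hash. Real hardware will fall short of that linear scaling once memory bandwidth or occupancy limits are reached.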
## 💡 **Recommendations**

### For Maximum Performance:
1. **Use CPU mining** (`-a rinhash`) for RinHash algorithm
2. **Reserve GPU** for algorithms with massive parallelization potential
3. **Hybrid approach**: CPU for RinHash, GPU for other algorithms

### When to Use GPU:
- **Batch processing**: Multiple hashes simultaneously
- **Different algorithms**: SHA256, Scrypt, Ethash (more GPU-friendly)
- **Large-scale operations**: When latency isn't critical

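For the batch-processing case, the usual approach is to give each GPU thread its own nonce rather than splitting one hash across threads. The host-only sketch below emulates that index arithmetic, the same computation a kernel would do with `blockIdx.x * blockDim.x + threadIdx.x`. `LaunchDims`, `dims_for_batch()`, and `nonce_for_thread()` are hypothetical names, and the current implementation is not confirmed to work this way. Each hash in a batch would also need its own Argon2 workspace, so device memory grows with the batch size.

```cpp
#include <cassert>
#include <cstdint>

// Launch geometry for a batch of nonces (illustrative type).
struct LaunchDims {
    std::uint32_t blocks;
    std::uint32_t threads_per_block;
};

// Enough blocks of `threads_per_block` threads to cover `batch` nonces.
LaunchDims dims_for_batch(std::uint32_t batch, std::uint32_t threads_per_block) {
    assert(batch > 0 && threads_per_block > 0);
    // Round up in 64-bit so the addition cannot overflow.
    std::uint64_t blocks =
        (static_cast<std::uint64_t>(batch) + threads_per_block - 1) / threads_per_block;
    return { static_cast<std::uint32_t>(blocks), threads_per_block };
}

// Maps one (block, thread) pair to the nonce it would hash.
// Returns false for padding threads that fall past the end of the batch.
bool nonce_for_thread(std::uint32_t start_nonce, std::uint32_t batch,
                      std::uint32_t block_idx, std::uint32_t block_dim,
                      std::uint32_t thread_idx, std::uint32_t* nonce_out) {
    std::uint32_t gid = block_idx * block_dim + thread_idx;  // global thread id
    if (gid >= batch) {
        return false;  // this thread has no work
    }
    *nonce_out = start_nonce + gid;  // wraps modulo 2^32, like a 32-bit nonce
    return true;
}
```

With a batch of 1,000 nonces and 32 threads per block, `dims_for_batch(1000, 32)` yields 32 blocks, and the last 24 threads of the final block are padding that `nonce_for_thread()` rejects.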
The optimized GPU implementation is now **available for testing**, but CPU remains the optimal choice for RinHash mining due to algorithmic characteristics.