GPU Performance Analysis & Optimization
🔍 Performance Bottleneck Discovery
Initial Problem:
- CPU Mining: 294 kH/s (4 threads)
- GPU Mining: 132 H/s (1,024 threads)
- Performance Gap: GPU is ~2,200x slower overall (294,000 H/s vs. 132 H/s), despite using 256x more threads
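The gap can be checked with a little arithmetic. The sketch below uses only the two measurements listed above; the function and constant names are illustrative and do not come from the miner's source. Computing both ratios shows that the ~2,200x figure is the overall throughput ratio, while the per-thread ratio is roughly 570,000x.

```cpp
#include <cassert>

// Measured hash rates from the figures above, in hashes per second.
constexpr double kCpuHashRate = 294000.0;  // 294 kH/s on 4 CPU threads
constexpr double kGpuHashRate = 132.0;     // 132 H/s on 1,024 GPU threads

// Ratio of total throughput: how many times faster the CPU is overall.
double overall_gap(double cpu_rate, double gpu_rate) {
    assert(gpu_rate > 0.0);
    return cpu_rate / gpu_rate;
}

// Ratio of per-thread throughput: each rate is divided by its thread count first.
double per_thread_gap(double cpu_rate, int cpu_threads,
                      double gpu_rate, int gpu_threads) {
    assert(cpu_threads > 0 && gpu_threads > 0 && gpu_rate > 0.0);
    return (cpu_rate / cpu_threads) / (gpu_rate / gpu_threads);
}
```

Calling `overall_gap(kCpuHashRate, kGpuHashRate)` and `per_thread_gap(kCpuHashRate, 4, kGpuHashRate, 1024)` gives the two ratios.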
Root Cause Analysis:
❌ GPU Implementation Issues Found:
- Memory Allocation Per Hash
  - GPU was calling `hipMalloc()`/`hipFree()` for every single hash
  - Each memory allocation = ~100μs overhead
  - Solution: ✅ Implemented memory caching with reuse
- Single-Thread GPU Utilization
  - Kernel used only 1 thread out of 1,024 (`if (threadIdx.x == 0)`)
  - 1,023 threads sitting completely idle
  - Solution: ✅ Reduced to minimal 32-thread kernel for lower latency
- Sequential Algorithm Nature
  - RinHash: BLAKE3 → Argon2d → SHA3 (inherently sequential)
  - Can't parallelize a single hash across multiple threads effectively
  - Reality: GPU isn't optimal for this algorithm type
Current Optimization Status:
✅ Optimizations Implemented:
- Memory Caching
      static uint8_t *d_input_cache = nullptr;   // Reused across calls
      static uint8_t *d_output_cache = nullptr;  // No allocation per hash
      static block *d_memory_cache = nullptr;    // Persistent Argon2 memory
- Minimal Kernel Launch
      dim3 blocks(1);             // Single block
      dim3 threads_per_block(32); // Minimal threads for low latency
- Reduced Memory Footprint
      hipMalloc(&d_input_cache, 80);                  // Fixed 80-byte headers
      hipMalloc(&d_output_cache, 32);                 // 32-byte outputs
      hipMalloc(&d_memory_cache, 64 * sizeof(block)); // Argon2 workspace
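The allocate-once pattern behind the memory caching can be illustrated without a GPU. The following is a minimal host-only sketch, not the miner's actual code: `counted_alloc` is a hypothetical stand-in for `hipMalloc()` that counts calls, `hash_once` is a hypothetical per-hash entry point, and the 1024-byte block size assumes `block` is the standard Argon2 block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Stand-in for hipMalloc() so the sketch runs on the host; it counts calls.
static int g_alloc_calls = 0;
static void *counted_alloc(std::size_t bytes) {
    assert(bytes > 0);
    ++g_alloc_calls;
    return std::malloc(bytes);
}

constexpr std::size_t kHeaderBytes = 80;        // fixed-size block header
constexpr std::size_t kOutputBytes = 32;        // hash output
constexpr std::size_t kArgonBlockBytes = 1024;  // ASSUMPTION: sizeof(block)
constexpr std::size_t kArgonBlocks = 64;        // Argon2 workspace, in blocks

static std::uint8_t *input_cache = nullptr;
static std::uint8_t *output_cache = nullptr;
static std::uint8_t *memory_cache = nullptr;

// Allocate each buffer only if it is still null, so repeat calls do no work.
bool ensure_caches() {
    if (!input_cache)
        input_cache = static_cast<std::uint8_t *>(counted_alloc(kHeaderBytes));
    if (!output_cache)
        output_cache = static_cast<std::uint8_t *>(counted_alloc(kOutputBytes));
    if (!memory_cache)
        memory_cache = static_cast<std::uint8_t *>(
            counted_alloc(kArgonBlocks * kArgonBlockBytes));
    return input_cache && output_cache && memory_cache;
}

// Hypothetical per-hash entry point: a real version would copy the header in,
// launch the kernel, and copy the digest out after this check.
bool hash_once() { return ensure_caches(); }

int alloc_calls() { return g_alloc_calls; }

// Free the buffers once at shutdown (hipFree() in the real implementation).
void release_caches() {
    std::free(input_cache);  input_cache = nullptr;
    std::free(output_cache); output_cache = nullptr;
    std::free(memory_cache); memory_cache = nullptr;
}
```

However many times `hash_once()` runs, the allocator is called three times in total, compared with three allocations and three frees per hash in the uncached version.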
📊 Expected Performance After Optimization
| Configuration | Before | After | Improvement |
|---|---|---|---|
| Memory Alloc | Per-hash | Cached | 100x faster |
| GPU Threads | 1,024 (1 active) | 32 (optimized) | 32x less overhead |
| Kernel Launch | High overhead | Minimal | 10x faster |
Realistic Performance Target:
- Previous: 132 H/s
- Optimized: ~5-15 kH/s (estimated)
- CPU Still Faster: Sequential algorithm favors CPU threads
🚀 Build Commands for Optimized Version
cd /mnt/shared/DEV/repos/d-popov.com/mines/rin/miner/gpu/RinHash-hip
# Compile optimized kernel
/opt/rocm-6.4.3/bin/hipcc -c -O3 -fPIC rinhash.hip.cu -o build/rinhash.o
/opt/rocm-6.4.3/bin/hipcc -c -O3 -fPIC sha3-256.hip.cu -o build/sha3-256.o
# Link optimized library
/opt/rocm-6.4.3/bin/hipcc -shared -O3 build/rinhash.o build/sha3-256.o \
-o rocm-direct-output/gpu-libs/librinhash_hip.so \
-L/opt/rocm-6.4.3/lib -lamdhip64
# Install system-wide
sudo cp rocm-direct-output/gpu-libs/librinhash_hip.so /usr/local/lib/
sudo ldconfig
🔬 Technical Analysis
Why GPU Struggles with RinHash:
- Algorithm Characteristics:
  - Sequential dependency chain: Each step needs previous result
  - Memory-bound operations: Argon2d requires significant memory bandwidth
  - Small data sizes: 80-byte headers don't saturate GPU throughput
- GPU Architecture Mismatch:
  - GPU Optimal: Parallel, compute-intensive, large datasets
  - RinHash Reality: Sequential, memory-bound, small datasets
  - CPU Advantage: Better single-thread performance, lower latency
- Overhead vs. Compute Ratio:
  - GPU Overhead: Kernel launch + memory transfers + sync
  - Actual Compute: ~100μs of hash operations
  - CPU: Direct function calls, no launch or transfer overhead
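The overhead-versus-compute ratio can be put into a simple model. This is an idealized sketch under stated assumptions, not a measurement: it assumes one fixed overhead per launch and that all hashes in a batch run fully in parallel, which real hardware will not sustain once memory and occupancy limits are reached. The function name is illustrative.

```cpp
#include <cassert>

// Hashes per second when one launch pays `overhead_us` microseconds of fixed
// cost and processes `batch` hashes concurrently, each taking `compute_us`.
// ASSUMPTION: perfect parallelism within the batch (an upper bound only).
double hashes_per_second(double overhead_us, double compute_us, int batch) {
    assert(batch > 0 && overhead_us >= 0.0 && compute_us > 0.0);
    const double launch_time_us = overhead_us + compute_us;
    return batch * 1e6 / launch_time_us;
}
```

At 132 H/s each hash takes about 7,576 μs, so with ~100 μs of compute the implied overhead is roughly 7,476 μs. `hashes_per_second(7476.0, 100.0, 1)` reproduces about 132 H/s, which shows that nearly all of the time goes to overhead and that a single hash per launch cannot be fast regardless of the kernel itself.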
💡 Recommendations
For Maximum Performance:
- Use CPU mining (`-a rinhash`) for RinHash algorithm
- Reserve GPU for algorithms with massive parallelization potential
- Hybrid approach: CPU for RinHash, GPU for other algorithms
When to Use GPU:
- Batch processing: Multiple hashes simultaneously
- Different algorithms: SHA256, Scrypt, Ethash (more GPU-friendly)
- Large-scale operations: When latency isn't critical
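Batch processing here means giving each GPU thread its own nonce instead of spreading one hash across threads. The host-side sketch below models that index mapping with a plain loop. Several things in it are assumptions not stated above: the header is taken to have a Bitcoin-style layout with a 32-bit little-endian nonce in bytes 76-79, `toy_hash` is a placeholder (FNV-1a) rather than RinHash, and all names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Header = std::array<std::uint8_t, 80>;  // fixed 80-byte header

// ASSUMPTION: 32-bit little-endian nonce stored in the last four header bytes.
constexpr std::size_t kNonceOffset = 76;

// Return a copy of the header with the given nonce written into it.
Header with_nonce(Header h, std::uint32_t nonce) {
    assert(kNonceOffset + 4 <= h.size());
    for (int i = 0; i < 4; ++i)
        h[kNonceOffset + i] = static_cast<std::uint8_t>(nonce >> (8 * i));
    return h;
}

// Placeholder hash (32-bit FNV-1a), NOT RinHash; it only makes the sketch runnable.
std::uint32_t toy_hash(const Header &h) {
    std::uint32_t x = 2166136261u;
    for (std::uint8_t b : h) {
        x ^= b;
        x *= 16777619u;
    }
    return x;
}

// Host model of a batched launch: loop index i plays the role of a GPU
// thread's global index, and each "thread" hashes its own nonce independently.
std::vector<std::uint32_t> hash_batch(const Header &base,
                                      std::uint32_t start_nonce,
                                      std::size_t count) {
    std::vector<std::uint32_t> out(count);
    for (std::size_t i = 0; i < count; ++i)
        out[i] = toy_hash(with_nonce(base, start_nonce + static_cast<std::uint32_t>(i)));
    return out;
}
```

Because every iteration is independent, the loop body is what a kernel would run per thread, with the fixed launch and transfer cost shared across the whole batch.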
The optimized GPU implementation is now available for testing, but CPU remains the optimal choice for RinHash mining due to algorithmic characteristics.