# GPU Performance Analysis & Optimization

## 🔍 **Performance Bottleneck Discovery**

### Initial Problem:
- **CPU Mining**: 294 kH/s (4 threads)
- **GPU Mining**: 132 H/s (1,024 threads)
- **Performance Gap**: GPU is roughly **2,200x slower** overall (294,000 H/s vs. 132 H/s), despite launching 256x more threads

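The gap can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the inputs are the two hash rates quoted above, and the per-thread figure is nominal because it counts every launched GPU thread as if it were doing work.

```cpp
// Hypothetical helpers for comparing the two measured hash rates.
// Nothing here is measured by this code; the caller supplies the figures.

// Overall slowdown: how many times faster the CPU run is than the GPU run.
double overall_ratio(double cpu_hps, double gpu_hps) {
    return cpu_hps / gpu_hps;
}

// Nominal per-thread slowdown: each rate is divided by its launched
// thread count before the two are compared.
double per_thread_ratio(double cpu_hps, int cpu_threads,
                        double gpu_hps, int gpu_threads) {
    double cpu_per_thread = cpu_hps / cpu_threads;
    double gpu_per_thread = gpu_hps / gpu_threads;
    return cpu_per_thread / gpu_per_thread;
}
```

With the figures above, `overall_ratio(294000.0, 132.0)` is about 2,227, which matches the 2,200x figure as an overall ratio. `per_thread_ratio(294000.0, 4, 132.0, 1024)` is about 570,000, so measured strictly per launched thread the gap is far larger.
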
### Root Cause Analysis:

#### ❌ **GPU Implementation Issues Found:**

1. **Memory Allocation Per Hash**
   - The host code called `hipMalloc()`/`hipFree()` for **every single hash**
   - Each allocation costs roughly 100 μs of overhead
   - **Solution**: ✅ Implemented memory caching with reuse

2. **Single-Thread GPU Utilization**
   - Kernel used only **1 thread out of 1,024** (`if (threadIdx.x == 0)`)
   - 1,023 threads sitting completely idle
   - **Solution**: ✅ Reduced to minimal 32-thread kernel for lower latency

3. **Sequential Algorithm Nature**
   - RinHash: BLAKE3 → Argon2d → SHA3 (inherently sequential: each stage consumes the previous stage's output)
   - A single hash can't be split effectively across multiple threads
   - **Reality**: GPU isn't optimal for this algorithm type

### Current Optimization Status:

#### ✅ **Optimizations Implemented:**

1. **Memory Caching**
   ```cpp
   static uint8_t *d_input_cache = nullptr;   // Reused across calls
   static uint8_t *d_output_cache = nullptr;  // No allocation per hash
   static block *d_memory_cache = nullptr;    // Persistent Argon2 memory
   ```

2. **Minimal Kernel Launch**
   ```cpp
   dim3 blocks(1);              // Single block
   dim3 threads_per_block(32);  // Minimal threads for low latency
   ```

3. **Reduced Memory Footprint**
   ```cpp
   hipMalloc(&d_input_cache, 80);                   // Fixed 80-byte headers
   hipMalloc(&d_output_cache, 32);                  // 32-byte outputs
   hipMalloc(&d_memory_cache, 64 * sizeof(block));  // Argon2 workspace
   ```

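The caching idea behind these optimizations can be sketched without a GPU. The example below is a host-only illustration, not the project's code: `std::malloc`/`std::free` stand in for `hipMalloc()`/`hipFree()` so it compiles without ROCm, and `ScratchCache` and `allocations_for` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdlib>

// Host-only sketch of the allocate-once, reuse-afterwards pattern.
// std::malloc/std::free are stand-ins for hipMalloc/hipFree (assumption).
struct ScratchCache {
    void*       ptr = nullptr;    // cached buffer, reused across calls
    std::size_t capacity = 0;     // size of the cached buffer in bytes
    int         alloc_calls = 0;  // number of real allocations performed

    // Return a buffer of at least `bytes`, allocating only when the
    // cached buffer is missing or too small.
    void* get(std::size_t bytes) {
        if (ptr == nullptr || capacity < bytes) {
            std::free(ptr);  // free(nullptr) is a no-op
            ptr = std::malloc(bytes);
            capacity = (ptr != nullptr) ? bytes : 0;
            ++alloc_calls;
        }
        return ptr;
    }

    // Release the buffer once, at shutdown, instead of after every hash.
    void release() {
        std::free(ptr);
        ptr = nullptr;
        capacity = 0;
    }
};

// Simulate hashing `n` headers. Every call requests the same 80-byte
// input buffer, so only the first call allocates.
// Returns the number of real allocations that happened.
int allocations_for(int n) {
    ScratchCache input;
    for (int i = 0; i < n; ++i) {
        (void)input.get(80);
    }
    int calls = input.alloc_calls;
    input.release();
    return calls;
}
```

In this model, `allocations_for(1000)` performs one allocation instead of a thousand. In real HIP code the return value of each `hipMalloc()` call should also be checked against `hipSuccess`.
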
## 📊 **Expected Performance After Optimization**

| Configuration | Before | After | Improvement |
|---------------|---------|-------|-------------|
| **Memory Alloc** | Per-hash | Cached | **100x faster** |
| **GPU Threads** | 1,024 (1 active) | 32 (optimized) | **32x less overhead** |
| **Kernel Launch** | High overhead | Minimal | **10x faster** |

### Realistic Performance Target:
- **Previous**: 132 H/s
- **Optimized**: ~5-15 kH/s (estimated)
- **CPU Still Faster**: Sequential algorithm favors CPU threads

## 🚀 **Build Commands for Optimized Version**

```bash
cd /mnt/shared/DEV/repos/d-popov.com/mines/rin/miner/gpu/RinHash-hip

# Compile optimized kernel
/opt/rocm-6.4.3/bin/hipcc -c -O3 -fPIC rinhash.hip.cu -o build/rinhash.o
/opt/rocm-6.4.3/bin/hipcc -c -O3 -fPIC sha3-256.hip.cu -o build/sha3-256.o

# Link optimized library
/opt/rocm-6.4.3/bin/hipcc -shared -O3 build/rinhash.o build/sha3-256.o \
    -o rocm-direct-output/gpu-libs/librinhash_hip.so \
    -L/opt/rocm-6.4.3/lib -lamdhip64

# Install system-wide
sudo cp rocm-direct-output/gpu-libs/librinhash_hip.so /usr/local/lib/
sudo ldconfig
```

## 🔬 **Technical Analysis**

### Why GPU Struggles with RinHash:

1. **Algorithm Characteristics**:
   - **Sequential dependency chain**: Each step needs previous result
   - **Memory-bound operations**: Argon2d requires significant memory bandwidth
   - **Small data sizes**: 80-byte headers don't saturate GPU throughput

2. **GPU Architecture Mismatch**:
   - **GPU Optimal**: Parallel, compute-intensive, large datasets
   - **RinHash Reality**: Sequential, memory-bound, small datasets
   - **CPU Advantage**: Better single-thread performance, lower latency

3. **Overhead vs. Compute Ratio**:
   - **GPU Overhead**: Kernel launch + memory transfers + synchronization, paid on every hash
   - **Actual Compute**: ~100 μs of hash operations
   - **CPU**: Direct function calls, with no launch or transfer overhead

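The overhead-versus-compute point can be made concrete with a simple latency model. This is a rough sketch, not a measurement: it assumes hashes run strictly one at a time, takes the ~100 μs compute figure above at face value, and uses hypothetical function names.

```cpp
// Simple latency model for one-hash-at-a-time execution:
//   rate = 1 second / (fixed overhead + compute time)
// All times are in microseconds.

// Hash rate in H/s for a given per-hash overhead and compute time.
double hashes_per_second(double overhead_us, double compute_us) {
    return 1e6 / (overhead_us + compute_us);
}

// Total time per hash in microseconds implied by a measured hash rate.
double microseconds_per_hash(double rate_hps) {
    return 1e6 / rate_hps;
}
```

At 132 H/s, `microseconds_per_hash(132.0)` gives about 7,576 μs per hash, so with roughly 100 μs of real work nearly all of that time is overhead. With the overhead removed entirely, `hashes_per_second(0.0, 100.0)` gives 10,000 H/s, the same order of magnitude as the 5-15 kH/s estimate.
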
## 💡 **Recommendations**

### For Maximum Performance:
1. **Use CPU mining** (`-a rinhash`) for RinHash algorithm
2. **Reserve GPU** for algorithms with massive parallelization potential
3. **Hybrid approach**: CPU for RinHash, GPU for other algorithms

### When to Use GPU:
- **Batch processing**: Multiple hashes simultaneously
- **Different algorithms**: SHA256, Scrypt, Ethash (more GPU-friendly)
- **Large-scale operations**: When latency isn't critical

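Batch processing means giving each GPU thread its own candidate header instead of splitting one hash across threads. The host-only sketch below shows only the index and memory arithmetic. All three function names are hypothetical. It assumes a 32-bit nonce, and a 1 KiB Argon2 block when sizing the workspace. In a HIP kernel the index arguments would come from `blockIdx.x`, `blockDim.x`, and `threadIdx.x`.

```cpp
#include <cstdint>

// Number of hashes covered by one launch of `blocks` x `threads_per_block`,
// assuming one hash per thread.
std::uint64_t batch_size(std::uint32_t blocks,
                         std::uint32_t threads_per_block) {
    return static_cast<std::uint64_t>(blocks) * threads_per_block;
}

// Nonce assigned to one thread: a flat global index added to the batch's
// starting nonce. Unsigned arithmetic wraps around on overflow.
std::uint32_t nonce_for_thread(std::uint32_t start_nonce,
                               std::uint32_t block_idx,
                               std::uint32_t threads_per_block,
                               std::uint32_t thread_idx) {
    return start_nonce + block_idx * threads_per_block + thread_idx;
}

// Device memory needed when every in-flight hash has its own Argon2
// workspace of `blocks_per_hash` blocks, each `block_bytes` bytes long.
std::uint64_t workspace_bytes(std::uint64_t hashes_in_flight,
                              std::uint64_t blocks_per_hash,
                              std::uint64_t block_bytes) {
    return hashes_in_flight * blocks_per_hash * block_bytes;
}
```

Under these assumptions, a launch of 64 blocks of 256 threads covers 16,384 nonces per call, and `workspace_bytes(16384, 64, 1024)` comes to 1 GiB of Argon2 workspace. Memory capacity and bandwidth, not thread count, limit how far batching can go.
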
The optimized GPU implementation is now **available for testing**, but the CPU remains the better choice for RinHash mining because the algorithm is sequential and memory-bound.