mines/rin/miner/GPU_PERFORMANCE_ANALYSIS.md
Dobromir Popov b475590b61 gpu optimizations
2025-09-06 14:20:19 +03:00

GPU Performance Analysis & Optimization

🔍 Performance Bottleneck Discovery

Initial Problem:

  • CPU Mining: 294 kH/s (4 threads)
  • GPU Mining: 132 H/s (1,024 threads)
  • Performance Gap: GPU is roughly 2,200x slower overall (about 570,000x slower per thread)!
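
The gap can be checked directly from the two measured hash rates. The sketch below is a minimal illustration; the function names are hypothetical and not part of the miner. It separates the overall slowdown from the per-thread slowdown, since the two differ by the ratio of thread counts.

```cpp
#include <cassert>

// Overall slowdown: total CPU hash rate divided by total GPU hash rate.
double overall_slowdown(double cpu_hs, double gpu_hs) {
    assert(cpu_hs > 0 && gpu_hs > 0);
    return cpu_hs / gpu_hs;
}

// Per-thread slowdown: compare the hash rate per launched thread on each side.
double per_thread_slowdown(double cpu_hs, int cpu_threads,
                           double gpu_hs, int gpu_threads) {
    assert(cpu_threads > 0 && gpu_threads > 0);
    return (cpu_hs / cpu_threads) / (gpu_hs / gpu_threads);
}
```

With 294,000 H/s on 4 CPU threads and 132 H/s on 1,024 GPU threads, `overall_slowdown` gives about 2,227 and `per_thread_slowdown` gives about 570,000.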

Root Cause Analysis:

GPU Implementation Issues Found:

  1. Memory Allocation Per Hash

    • GPU was calling hipMalloc()/hipFree() for every single hash
    • Each memory allocation = ~100μs overhead
    • Solution: Implemented memory caching with reuse
  2. Single-Thread GPU Utilization

    • Kernel used only 1 thread out of 1,024 (if (threadIdx.x == 0))
    • 1,023 threads sitting completely idle
    • Solution: Reduced to minimal 32-thread kernel for lower latency
  3. Sequential Algorithm Nature

    • RinHash: BLAKE3 → Argon2d → SHA3 (inherently sequential)
    • Can't parallelize a single hash across multiple threads effectively
    • Reality: GPU isn't optimal for this algorithm type
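
To put the ~100μs allocation figure from item 1 in context, it helps to compare it with the total time per hash. The sketch below is a rough budget only. The number of allocator calls per hash is an assumption (the source does not state it), and the helper names are hypothetical.

```cpp
#include <cassert>

// Time per hash in microseconds at a given hash rate (hashes per second).
double us_per_hash(double hashes_per_second) {
    assert(hashes_per_second > 0);
    return 1e6 / hashes_per_second;
}

// Fraction of the per-hash time spent in allocator calls.
// calls_per_hash is an assumed value; the source does not give it.
double alloc_share(double hashes_per_second, int calls_per_hash,
                   double us_per_call) {
    assert(calls_per_hash >= 0 && us_per_call >= 0);
    return (calls_per_hash * us_per_call) / us_per_hash(hashes_per_second);
}
```

At 132 H/s each hash takes about 7.6 ms. If one assumes six allocator calls per hash (a hipMalloc() and hipFree() for each of three buffers) at ~100μs each, `alloc_share(132.0, 6, 100.0)` comes to roughly 8% of that time.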

Current Optimization Status:

Optimizations Implemented:

  1. Memory Caching

    static uint8_t *d_input_cache = nullptr;  // Reused across calls
    static uint8_t *d_output_cache = nullptr; // No allocation per hash
    static block *d_memory_cache = nullptr;   // Persistent Argon2 memory
    
  2. Minimal Kernel Launch

    dim3 blocks(1);           // Single block
    dim3 threads_per_block(32); // Minimal threads for low latency
    
  3. Reduced Memory Footprint

    hipMalloc(&d_input_cache, 80);        // Fixed 80-byte headers
    hipMalloc(&d_output_cache, 32);       // 32-byte outputs  
    hipMalloc(&d_memory_cache, 64 * sizeof(block)); // Argon2 workspace
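
The caching idea in items 1 and 3 is an allocate-once, reuse-afterwards pattern. The sketch below shows that pattern in host-only C++ so it runs without ROCm: `std::malloc` stands in for `hipMalloc`, and `CachedBuffer` is a hypothetical helper, not a type from the miner.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: allocate a buffer on first use and reuse it afterwards.
// std::malloc/std::free stand in for hipMalloc/hipFree in this host-only sketch.
struct CachedBuffer {
    void*       ptr = nullptr;
    std::size_t size = 0;
    int         allocations = 0;  // counts allocation attempts, for illustration

    CachedBuffer() = default;
    CachedBuffer(const CachedBuffer&) = delete;             // avoid double free
    CachedBuffer& operator=(const CachedBuffer&) = delete;

    // Returns a buffer of at least `bytes` (nullptr on failure);
    // allocates only when the cached buffer is missing or too small.
    void* get(std::size_t bytes) {
        assert(bytes > 0);
        if (ptr == nullptr || size < bytes) {
            std::free(ptr);               // free(nullptr) is a no-op
            ptr = std::malloc(bytes);
            size = ptr ? bytes : 0;
            ++allocations;
        }
        return ptr;
    }

    ~CachedBuffer() { std::free(ptr); }
};
```

Calling `get(80)` repeatedly on the same `CachedBuffer` returns the same pointer and performs a single allocation, which is the behavior the static `d_*_cache` pointers are meant to provide on the device side.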
    

📊 Expected Performance After Optimization

| Configuration | Before | After | Improvement |
|---|---|---|---|
| Memory Alloc | Per-hash | Cached | 100x faster |
| GPU Threads | 1,024 (1 active) | 32 (optimized) | 32x less overhead |
| Kernel Launch | High overhead | Minimal | 10x faster |

Realistic Performance Target:

  • Previous: 132 H/s
  • Optimized: ~5-15 kH/s (estimated)
  • CPU Still Faster: Sequential algorithm favors CPU threads

🚀 Build Commands for Optimized Version

cd /mnt/shared/DEV/repos/d-popov.com/mines/rin/miner/gpu/RinHash-hip

# Compile optimized kernel
/opt/rocm-6.4.3/bin/hipcc -c -O3 -fPIC rinhash.hip.cu -o build/rinhash.o
/opt/rocm-6.4.3/bin/hipcc -c -O3 -fPIC sha3-256.hip.cu -o build/sha3-256.o

# Link optimized library
/opt/rocm-6.4.3/bin/hipcc -shared -O3 build/rinhash.o build/sha3-256.o \
  -o rocm-direct-output/gpu-libs/librinhash_hip.so \
  -L/opt/rocm-6.4.3/lib -lamdhip64

# Install system-wide
sudo cp rocm-direct-output/gpu-libs/librinhash_hip.so /usr/local/lib/
sudo ldconfig

🔬 Technical Analysis

Why GPU Struggles with RinHash:

  1. Algorithm Characteristics:

    • Sequential dependency chain: Each step needs previous result
    • Memory-bound operations: Argon2d requires significant memory bandwidth
    • Small data sizes: 80-byte headers don't saturate GPU throughput
  2. GPU Architecture Mismatch:

    • GPU Optimal: Parallel, compute-intensive, large datasets
    • RinHash Reality: Sequential, memory-bound, small datasets
    • CPU Advantage: Better single-thread performance, lower latency
  3. Overhead vs. Compute Ratio:

    • GPU Overhead: Kernel launch + memory transfers + sync
    • Actual Compute: ~100μs of hash operations
    • CPU: Direct function calls, no kernel launch or transfer overhead
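
The overhead-versus-compute point in item 3 can be written as a simple ratio. This is a minimal sketch with a hypothetical function name. The overhead value is a parameter because the source gives only the ~100μs compute figure.

```cpp
#include <cassert>

// Fraction of wall time per hash that is actual hashing, given a fixed
// per-hash overhead (kernel launch + memory transfers + sync).
double compute_fraction(double compute_us, double overhead_us) {
    assert(compute_us >= 0 && overhead_us >= 0);
    assert(compute_us + overhead_us > 0);
    return compute_us / (compute_us + overhead_us);
}
```

For example, with ~100μs of compute and an assumed 300μs of overhead, `compute_fraction(100.0, 300.0)` is 0.25, meaning three quarters of the time would go to overhead. The 300μs value is illustrative only, not a measurement.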

💡 Recommendations

For Maximum Performance:

  1. Use CPU mining (-a rinhash) for the RinHash algorithm
  2. Reserve GPU for algorithms with massive parallelization potential
  3. Hybrid approach: CPU for RinHash, GPU for other algorithms

When to Use GPU:

  • Batch processing: Multiple hashes simultaneously
  • Different algorithms: SHA256, Scrypt, Ethash (more GPU-friendly)
  • Large-scale operations: When latency isn't critical
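
The batch-processing case can be illustrated with an idealized throughput model in which many hashes share one kernel launch. The sketch below is an assumption-laden estimate, not a measurement. The `lanes` parameter and all numeric values are hypothetical, and the model ignores memory-bandwidth contention between concurrent Argon2d instances.

```cpp
#include <cassert>

// Idealized hashes per second when `batch` hashes share one launch.
//   overhead_us: fixed cost per launch (launch + transfers + sync)
//   compute_us:  time for one hash on one GPU thread
//   lanes:       hashes assumed to run concurrently (hypothetical)
double batched_hash_rate(long batch, long lanes,
                         double overhead_us, double compute_us) {
    assert(batch > 0 && lanes > 0);
    long rounds = (batch + lanes - 1) / lanes;  // ceil(batch / lanes)
    double total_us = overhead_us + rounds * compute_us;
    assert(total_us > 0);
    return batch * 1e6 / total_us;
}
```

With an assumed 300μs overhead and 100μs compute, a batch of 1 yields 2,500 H/s, while 1,024 hashes on 1,024 lanes yield 2.56 MH/s in this model. Each concurrent hash would also need its own Argon2 workspace, so real gains would likely be lower.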

The optimized GPU implementation is now available for testing, but CPU remains the optimal choice for RinHash mining due to algorithmic characteristics.