
Training Effectiveness Fixes

Issues Identified

From the logs, we found several critical issues preventing effective training:

1. Batch Corruption Across Epochs

Problem: Only epoch 1 trains successfully; epochs 2-10 all show a loss of 0.0

Epoch 1/10, Loss: 1.688709, Accuracy: 0.00% (1 batches)  ✅ Training works
Epoch 2/10, Loss: 0.000000, Accuracy: 0.00% (1 batches)  ❌ No training
Epoch 3/10, Loss: 0.000000, Accuracy: 0.00% (1 batches)  ❌ No training
...
WARNING - No timeframe data available for transformer forward pass
WARNING - No 'actions' key in batch - skipping this training step

Root Cause:

  • Batches were being reused across epochs without copying
  • train_step() was modifying the batch dict in-place
  • By epoch 2, the batch tensors were corrupted/missing

Fix Applied:

  1. Batch Generator: Create shallow copy of batch dict for each yield

    # Before: yield batch (same object reused)
    # After:  yield {k: v for k, v in batch.items()} (new dict each time)
    
  2. Train Step: Always create a new batch_on_device dict instead of modifying the input (see the sketch below)

    # Before: batch = batch_gpu (modifies input)
    # After:  batch_on_device = {...} (new dict, preserves input)
    

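A minimal sketch of both fixes; the function names below (batch_generator, move_to_device) are illustrative, not the exact names used in the repo:

    import torch

    def batch_generator(batches):
        """Yield a shallow copy of each batch so train_step() cannot corrupt
        the original dict that is reused on the next epoch."""
        for batch in batches:
            yield {k: v for k, v in batch.items()}

    def move_to_device(batch, device):
        """Build a new batch_on_device dict instead of mutating the input batch."""
        return {k: (v.to(device) if torch.is_tensor(v) else v)
                for k, v in batch.items()}

Note that a shallow copy still shares the underlying tensors across epochs, so in-place changes to tensor values (as opposed to dict keys) can still leak between steps; that is the kind of sharing issue 2 below points at.
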
2. Remaining Inplace Errors ⚠️

Problem: Still seeing occasional inplace operation errors (but recovering)

ERROR - Inplace operation error: [torch.FloatTensor [128, 3]] version 4; expected version 2
ERROR - Inplace operation error: [torch.FloatTensor [256, 256]] version 6; expected version 4

Root Cause:

  • trend_target tensor [128, 3] suggests batching is creating shared tensors
  • Weight matrices [256, 256] being modified during backward pass

Current Status:

  • Errors are caught and training continues (returns 0.0 loss for that step)
  • Not crashing, but losing training opportunities

Potential Additional Fixes (if issues persist):

  1. Ensure trend_target is detached after creation
  2. Add .detach() to intermediate tensors before loss calculation
  3. Use torch.no_grad() for any non-training operations
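
A minimal, self-contained sketch of patterns 1 and 3 (the linear layer and tensors below are stand-ins for the transformer code; trend_target mirrors the [128, 3] shape from the log above):

    import torch
    import torch.nn.functional as F

    model = torch.nn.Linear(8, 3)                    # stand-in for the trend head
    features = torch.randn(128, 8)

    # 1. Detach the target right after it is built so autograd never tracks it.
    trend_target = torch.softmax(torch.randn(128, 3), dim=-1).detach()

    logits = model(features)
    loss = F.cross_entropy(logits, trend_target)     # soft targets (PyTorch >= 1.10)
    loss.backward()

    # 3. Keep non-training work (metrics, logging) out of the autograd graph.
    with torch.no_grad():
        accuracy = (logits.argmax(dim=-1) == trend_target.argmax(dim=-1)).float().mean()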

3. Zero GPU Utilization 🔧

Problem: GPU shows 0.0% utilization and 0.00GB memory

GPU: AMD Radeon 8060S, Util: 0.0%, Mem: 0.00GB/46.97GB

Possible Causes:

  1. ROCm/AMD GPU monitoring issue: The monitoring tool might not support AMD GPUs properly
  2. Computation too fast: Single-sample batches complete before monitoring can measure
  3. CPU fallback: Model might be running on CPU despite GPU being available

Recommendations:

  1. Check if model is actually on GPU: next(model.parameters()).device
  2. Increase batch size for longer GPU operations
  3. Use AMD-specific monitoring tools (rocm-smi) instead of nvidia-smi based tools
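
A quick sanity check for points 1 and 3; on ROCm builds of PyTorch the torch.cuda namespace is the interface to the AMD GPU, so these calls apply to the Radeon card as well:

    import torch

    def report_device(model):
        """Print where the model's parameters live and what PyTorch sees as the GPU."""
        print(f"Model parameters on: {next(model.parameters()).device}")
        print(f"GPU visible to PyTorch: {torch.cuda.is_available()}")  # True on ROCm too
        if torch.cuda.is_available():
            print(f"Device name: {torch.cuda.get_device_name(0)}")
            print(f"Memory allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")

If the first line prints cpu, the model (and each batch) needs an explicit .to('cuda') before training; if it already prints a cuda device, the 0% reading is more likely a monitoring artifact and rocm-smi should confirm it.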

4. Single Sample Batches 📊

Problem: Training with only 1 sample per batch

Total samples: 1
Ready to train on 1 batches

Impact:

  • Poor GPU utilization (GPUs are optimized for parallel processing)
  • Noisy gradients (no batch averaging)
  • Slower training convergence

Recommendations:

  1. Accumulate more training samples before starting training
  2. Use gradient accumulation to simulate larger batches
  3. Collect multiple pivot points before triggering training
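
A sketch of recommendation 2, gradient accumulation over single-sample batches (accum_steps and the function names are placeholders):

    def train_epoch(model, batches, optimizer, loss_fn, accum_steps=8):
        """Accumulate gradients over accum_steps small batches so each optimizer
        update approximates one larger batch."""
        optimizer.zero_grad()
        for i, (inputs, targets) in enumerate(batches):
            loss = loss_fn(model(inputs), targets) / accum_steps  # scale for averaging
            loss.backward()
            if (i + 1) % accum_steps == 0:
                optimizer.step()
                optimizer.zero_grad()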

Files Modified

  1. ANNOTATE/core/real_training_adapter.py

    • Lines 2527-2538: Batch generator now creates shallow copies
  2. NN/models/advanced_transformer_trading.py

    • Lines 1350-1390: Train step creates new batch_on_device dict

Expected Improvements

After these fixes:

  • All epochs should train: Epochs 2-10 will have real loss values, not 0.0
  • Consistent training: No more "No timeframe data" warnings after epoch 1
  • Better convergence: Loss should decrease across epochs
  • Fewer inplace errors: Batch corruption was causing many of these

Testing Checklist

Run realtime training and verify:

  • Epoch 1 trains successfully (already working)
  • Epoch 2 shows non-zero loss (should be fixed now)
  • Epochs 3-10 all train with real loss values
  • No "No timeframe data" warnings after epoch 1
  • Loss decreases over epochs (model is learning)
  • Accuracy increases over epochs
  • Fewer inplace operation errors

Additional Recommendations

Short Term:

  1. Increase training samples: Collect 10-20 pivot points before training
  2. Batch size: Group samples into batches of 8-16 for better GPU utilization
  3. Learning rate: May need adjustment if training is too slow/fast
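
A sketch of point 2, assuming each collected sample is a dict of equally shaped tensors (the helper name and batch layout are hypothetical):

    import torch

    def make_batches(samples, batch_size=16):
        """Stack individual sample dicts into batches of 8-16 for better GPU utilization."""
        for start in range(0, len(samples), batch_size):
            chunk = samples[start:start + batch_size]
            yield {key: torch.stack([s[key] for s in chunk]) for key in chunk[0]}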

Medium Term:

  1. Data augmentation: Generate more training samples from each pivot
  2. Validation set: Split data to monitor overfitting
  3. Early stopping: Stop training when validation loss stops improving
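
A sketch of points 2 and 3 combined, patience-based early stopping on a held-out validation split (train_fn and eval_fn are placeholders for one training epoch and one validation pass):

    def train_with_early_stopping(model, train_fn, eval_fn, max_epochs=50, patience=5):
        """Stop once validation loss has not improved for `patience` epochs."""
        best_loss, stale_epochs = float("inf"), 0
        for _ in range(max_epochs):
            train_fn(model)               # one epoch on the training split
            val_loss = eval_fn(model)     # loss on the validation split
            if val_loss < best_loss:
                best_loss, stale_epochs = val_loss, 0
            else:
                stale_epochs += 1
                if stale_epochs >= patience:
                    break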

Long Term:

  1. Distributed training: Use multiple GPUs if available
  2. Mixed precision: Already enabled, but verify it's working
  3. Model pruning: Remove unused parameters to speed up training
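
For point 2, a quick way to check that autocast is actually producing half-precision activations (the model argument is a stand-in; on ROCm, device_type="cuda" covers the AMD GPU):

    import torch

    def verify_amp(model, sample_input, device="cuda"):
        """Run one forward pass under autocast and report the activation dtype."""
        model = model.to(device)
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            out = model(sample_input.to(device))
        print(f"Output dtype under autocast: {out.dtype}")  # expect torch.float16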