gogo2/REALTIME_TRAINING_FIXES.md
2025-12-08 19:48:46 +02:00

Realtime RL Training Fixes

Issues Identified and Fixed

1. Inplace Operation Errors During Backward Pass

Problem:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

Root Cause:

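  • Tensors that autograd had saved for the backward pass were being modified in place around additions such as x = x + position_emb. That expression allocates a new tensor by itself; the error is raised only when a saved tensor is later changed in place (for example with += or a slice assignment)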
  • The regime detector's weighted sum was creating shared memory references
  • Layer outputs were being reused without cloning

Fix Applied:

  • Added .clone() so that any later in-place update acts on a copy rather than on a tensor the backward pass still needs:
    • x = price_emb.clone() + cob_emb + tech_emb + market_emb
    • x = layer_output['output'].clone()
    • adapted_output = torch.sum(regime_stack * regime_weights, dim=0).clone()
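
The failure mode and the .clone() remedy can be reproduced in isolation. The following is a minimal sketch, not the project's model code: the function names and the use of torch.sigmoid are illustrative assumptions, chosen because sigmoid keeps its output for the backward pass.

```python
import torch


def backward_fails_inplace() -> bool:
    """Return True if an in-place update on a saved tensor breaks backward()."""
    w = torch.ones(3, requires_grad=True)
    y = torch.sigmoid(w)  # sigmoid saves its output y for the backward pass
    y += 1.0              # in-place update changes the saved tensor
    try:
        y.sum().backward()
    except RuntimeError:
        # "... has been modified by an inplace operation"
        return True
    return False


def grad_with_clone() -> torch.Tensor:
    """Same computation, but the in-place update acts on a copy of y."""
    w = torch.ones(3, requires_grad=True)
    y = torch.sigmoid(w)
    z = y.clone()         # differentiable copy; y itself stays untouched
    z += 1.0              # safe: clone's backward needs no saved tensor
    z.sum().backward()
    return w.grad


if __name__ == "__main__":
    print("in-place version fails:", backward_fails_inplace())
    print("gradient with clone:", grad_with_clone())
```

An out-of-place update (z = y + 1.0) avoids the error equally well, since it also leaves y untouched. When the offending operation is hard to find, torch.autograd.set_detect_anomaly(True) makes the error report include the forward-pass traceback of the operation whose backward failed.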

Files Modified:

  • NN/models/advanced_transformer_trading.py (lines 638, 668, 223)

2. Missing 'actions' Key in Batch

Problem:

WARNING - No 'actions' key in batch - skipping this training step
WARNING - No timeframe data available for transformer forward pass

Root Cause:

  • Per-candle training was creating incomplete batches without proper validation
  • Batches were being passed to training even when required data was missing

Fix Applied:

  • Added validation before training to ensure all required keys are present:
required_keys = ['actions', 'price_data_1m', 'price_data_1h', 'price_data_1d']
missing_keys = [k for k in required_keys if k not in batch or batch[k] is None]
if missing_keys:
    logger.warning(f"Per-candle training skipped: Missing required keys: {missing_keys}")
    return
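
The check above can be factored into a small helper that is testable without a model. This is a sketch under assumptions: missing_batch_keys, train_if_complete, and the train_step callback are hypothetical names, not functions from real_training_adapter.py; only the key list is taken from the snippet above.

```python
import logging
from typing import Any, Callable, List, Mapping, Sequence

logger = logging.getLogger(__name__)

# Key names copied from the validation snippet above.
REQUIRED_KEYS = ("actions", "price_data_1m", "price_data_1h", "price_data_1d")


def missing_batch_keys(
    batch: Mapping[str, Any], required: Sequence[str] = REQUIRED_KEYS
) -> List[str]:
    """Return the required keys that are absent from the batch or set to None."""
    return [k for k in required if k not in batch or batch[k] is None]


def train_if_complete(
    batch: Mapping[str, Any], train_step: Callable[[Mapping[str, Any]], Any]
) -> bool:
    """Run train_step only on complete batches; return whether it ran."""
    missing = missing_batch_keys(batch)
    if missing:
        logger.warning(
            "Per-candle training skipped: missing required keys: %s", missing
        )
        return False
    train_step(batch)  # hypothetical callback standing in for the real step
    return True


if __name__ == "__main__":
    incomplete = {"actions": [1], "price_data_1m": None}
    print(missing_batch_keys(incomplete))
```

Returning a boolean instead of a bare return lets the caller count skipped steps, which helps when checking whether the "skipping this training step" warnings actually decline.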

Files Modified:

  • ANNOTATE/core/real_training_adapter.py (lines 3520-3527)

3. Checkpoint File Deletion Race Condition

Problem:

WARNING - Could not remove checkpoint: [Errno 2] No such file or directory

Root Cause:

  • Checkpoint cleanup was running immediately after saving
  • Files were being deleted before they were fully written to disk
  • No existence check before deletion, so a path that was already gone raised [Errno 2]

Fix Applied:

  • Added 0.5 second delay before cleanup to ensure files are fully written
  • Added existence checks before attempting to delete files:
import os
import time

time.sleep(0.5)  # Give the preceding save time to finish writing

# Skip files that are already gone
if os.path.exists(checkpoint['path']):
    os.remove(checkpoint['path'])
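
A fixed delay followed by an existence check narrows the race window but does not close it, because the file can still disappear between os.path.exists and os.remove. The sketch below shows one alternative under stated assumptions, not the change that was applied: atomic_write_bytes and remove_if_present are hypothetical helpers, and the checkpoint is treated as an opaque byte string.

```python
import os
import tempfile


def atomic_write_bytes(path: str, data: bytes) -> None:
    """Write to a temp file in the target directory, then rename into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        # On POSIX the rename is atomic when both paths share a filesystem,
        # so readers see either the old file or the complete new one.
        os.replace(tmp_path, path)
    except BaseException:
        try:
            os.remove(tmp_path)
        except FileNotFoundError:
            pass
        raise


def remove_if_present(path: str) -> bool:
    """Delete path; return False instead of raising if it is already gone."""
    try:
        os.remove(path)
    except FileNotFoundError:
        return False
    return True


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "checkpoint.bin")
    atomic_write_bytes(target, b"weights")
    print(remove_if_present(target), remove_if_present(target))
```

Catching FileNotFoundError makes deletion safe to call twice, which matters if more than one cleanup path can reach the same checkpoint.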

Files Modified:

  • ANNOTATE/core/real_training_adapter.py (lines 2254-2285, 3710-3745)

Expected Results After Fixes

  1. No more inplace operation errors - Gradients will flow correctly during backward pass
  2. Proper training on valid batches - Only batches with complete data will be trained
  3. No checkpoint deletion errors - Files will be fully written before cleanup attempts
  4. Improved training metrics - Loss and accuracy should show meaningful values instead of 0.0

Testing Recommendations

  1. Run the realtime training again and monitor for:

    • Absence of inplace operation errors
    • Reduction in "skipping this training step" warnings
    • No checkpoint deletion errors
    • Non-zero loss and accuracy values
  2. Check GPU utilization:

    • GPU utilization should be non-zero during training (it currently reads 0.0%)
    • Memory usage should increase during forward/backward passes
  3. Monitor training progress:

    • Loss should decrease over epochs
    • Accuracy should increase over epochs
    • Checkpoints should save successfully
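
For the GPU check in step 2, a quick look at where the model's parameters live can save time. A reading of 0.0% is consistent with the model still sitting on the CPU, although that is an assumption and not something the logs here confirm. The helper name device_report is hypothetical; the torch calls are standard.

```python
import torch


def device_report(model: torch.nn.Module) -> dict:
    """Summarize CUDA availability and where the model's parameters live."""
    first_param = next(model.parameters())
    cuda_ok = torch.cuda.is_available()
    return {
        "cuda_available": cuda_ok,
        "param_device": str(first_param.device),
        # Bytes currently held by tensors on the default CUDA device.
        "cuda_bytes_allocated": torch.cuda.memory_allocated() if cuda_ok else 0,
    }


if __name__ == "__main__":
    # A stand-in module; substitute the real transformer when debugging.
    print(device_report(torch.nn.Linear(4, 2)))
```

If param_device reports cpu while cuda_available is True, the model was never moved to the GPU; calling the report before and after a forward/backward pass shows whether allocated memory grows as expected.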

Additional Notes

  • The fixes maintain backward compatibility with existing code
  • No changes to model architecture or training logic
  • Only defensive programming and proper tensor handling added
  • The tensor-handling changes follow PyTorch's guidance on not modifying, in place, tensors that are needed for gradient computation