Training Backpropagation Fix

Problem

Training was failing with two critical errors during the backward pass:

Error 1: Inplace Operation Error

Inplace operation error during backward pass: one of the variables needed for gradient 
computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128, 256]], 
which is output 0 of AsStridedBackward0, is at version 57; expected version 53 instead.

Error 2: Gradient Checkpoint Shape Mismatch

torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Recomputed values for 
the following tensors have different metadata than during the forward pass.

tensor at position 3:
saved metadata: {'shape': torch.Size([200, 1024]), 'dtype': torch.float32, 'device': cuda:0}
recomputed metadata: {'shape': torch.Size([1, 200, 1024]), 'dtype': torch.bool, 'device': cuda:0}

Root Cause

Gradient checkpointing was enabled by default in TradingTransformerConfig:

use_gradient_checkpointing: bool = True  # Trade compute for memory (saves ~30% memory)

Gradient checkpointing saves memory by recomputing activations during the backward pass instead of storing them. However, this causes issues when:

  1. Tensor shapes or dtypes change between the original forward pass and the recomputation (e.g. masks becoming boolean tensors)
  2. Non-deterministic operations produce different results during recomputation
  3. In-place operations modify tensors that checkpointing tries to save
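
For illustration, this is roughly how gradient checkpointing is typically wired into a transformer layer stack with torch.utils.checkpoint (a minimal sketch, not the actual model code; run_layers and its arguments are hypothetical):

import torch
from torch.utils.checkpoint import checkpoint

def run_layers(layers, x, mask, use_gradient_checkpointing):
    for layer in layers:
        if use_gradient_checkpointing and x.requires_grad:
            # Activations are discarded and the layer is re-run during backward;
            # any shape/dtype drift or nondeterminism between the two runs
            # triggers the errors above.
            x = checkpoint(layer, x, mask, use_reentrant=False)
        else:
            # Plain forward: activations are stored, nothing is recomputed.
            x = layer(x, mask)
    return x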

Impact

  • Training failed: Candle Acc: 0.0% consistently
  • Loss became 0.0000 after backward errors
  • Model couldn't learn: Accuracy stayed at 0% despite training
  • Per-candle training broken: Online learning failed completely

Solution

Disabled gradient checkpointing in NN/models/advanced_transformer_trading.py:

# Memory optimization
use_gradient_checkpointing: bool = False  # DISABLED: Causes tensor shape mismatches during backward pass

Memory Impact

This change will increase GPU memory usage slightly:

  • Before: Saves ~30% memory by recomputing activations
  • After: Stores all activations in memory

Current memory usage: 1.63GB / 46.97GB (3.5%)

  • We have plenty of headroom (45GB free!)
  • The memory saving is not needed on this GPU
  • Training stability is more important
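
To double-check the headroom on the training GPU, a quick PyTorch snippet (a minimal sketch, assuming CUDA device 0) reports allocated vs. total memory:

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    allocated = torch.cuda.memory_allocated(0)
    reserved = torch.cuda.memory_reserved(0)
    print(f"Allocated: {allocated / 1e9:.2f} GB, "
          f"Reserved: {reserved / 1e9:.2f} GB, "
          f"Total: {props.total_memory / 1e9:.2f} GB")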

Expected Results After Fix

With gradient checkpointing disabled:

Batch Training

Batch 1/23, Loss: 0.535, Candle Acc: 15-25%, Trend Acc: 45-55%
Batch 5/23, Loss: 0.420, Candle Acc: 20-30%, Trend Acc: 50-60%
Batch 10/23, Loss: 0.350, Candle Acc: 25-35%, Trend Acc: 55-65%

Per-Candle Training

Per-candle training: Loss=0.4231 (avg: 0.4156), Acc=28.50% (avg: 25.32%)
Trained on candle: ETH/USDT 1s @ 2025-11-22 17:03:41+00:00 (change: -0.06%)

Epoch Summary

Epoch 1/10, Loss: 0.385, Accuracy: 26.34% (23 batches)

Files Modified

  • /mnt/shared/DEV/repos/d-popov.com/gogo2/NN/models/advanced_transformer_trading.py
    • Line 64: Changed the default to use_gradient_checkpointing: bool = False

Testing Instructions

  1. Delete old checkpoints (they may have been saved during the broken training runs):

    rm -rf models/checkpoints/transformer/*
    
  2. Restart training:

    • Go to ANNOTATE UI
    • Load Transformer model (will create fresh model)
    • Start "Live Inference + Per-Candle Training"
  3. Monitor logs for improvements:

    • Watch for Candle Acc > 0%
    • Check that Loss decreases over batches
    • Verify there are no more CheckpointError or inplace operation errors
  4. Expected timeline:

    • First few batches: Acc ~15-25%
    • After 1 epoch: Acc ~25-35%
    • After 5-10 epochs: Acc should improve to 40-60%

Additional Notes

Why This Happens

Gradient checkpointing in PyTorch recomputes the forward pass during the backward pass. If:

  • A mask changes from [200, 1024] float to [1, 200, 1024] bool
  • Dropout produces different random values
  • Any operation is non-deterministic

...then the recomputed tensors won't match the saved metadata, causing the error.
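
A minimal, self-contained repro of the same failure mode (illustrative only, not project code): the checkpointed function below produces a float [200, 1024] mask during the forward pass but a bool [1, 200, 1024] mask during recomputation, which raises the same CheckpointError:

import torch
from torch.utils.checkpoint import checkpoint

call_count = {"n": 0}

def flaky(x):
    # Behaves differently between the original forward pass and the
    # recomputation that checkpointing performs during backward.
    call_count["n"] += 1
    if call_count["n"] == 1:
        mask = torch.ones(200, 1024)                       # forward: float mask
    else:
        mask = torch.ones(1, 200, 1024, dtype=torch.bool)  # recompute: bool mask
    return x * mask

x = torch.randn(200, 1024, requires_grad=True)
out = checkpoint(flaky, x, use_reentrant=False)
out.sum().backward()  # raises torch.utils.checkpoint.CheckpointError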

Alternative Solutions (if memory becomes an issue)

If we run out of memory in the future:

  1. Reduce batch size: Currently uses default batch size
  2. Reduce sequence length: Currently 200, could use 100
  3. Use mixed precision more aggressively: Already using AMP
  4. Disable uncertainty estimation: Turn off use_uncertainty_estimation
  5. Reduce model size: Decrease d_model or n_layers

But with 45GB free, we don't need any of these optimizations yet!
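
As a hypothetical illustration only (the values, and any field names beyond use_gradient_checkpointing, use_uncertainty_estimation, d_model, and n_layers, are assumptions rather than the actual config), such a memory-reduction tweak might look like:

from NN.models.advanced_transformer_trading import TradingTransformerConfig

# Illustrative values; verify field names against the actual config class.
config = TradingTransformerConfig(
    use_gradient_checkpointing=False,   # keep disabled for stable backward
    use_uncertainty_estimation=False,   # optional: drop the uncertainty head
    d_model=256,                        # assumed smaller embedding size
    n_layers=4,                         # assumed fewer transformer layers
)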

Status

FIXED - Gradient checkpointing disabled
PENDING - User needs to test with fresh training run