Fix prediction candle updates; fix trend prediction.
# Training Backpropagation Fix

## Problem

Training was failing with two critical errors during the backward pass:

### Error 1: Inplace Operation Error
```
Inplace operation error during backward pass: one of the variables needed for gradient
computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128, 256]],
which is output 0 of AsStridedBackward0, is at version 57; expected version 53 instead.
```
### Error 2: Gradient Checkpoint Shape Mismatch

```
torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Recomputed values for
the following tensors have different metadata than during the forward pass.

tensor at position 3:
saved metadata: {'shape': torch.Size([200, 1024]), 'dtype': torch.float32, 'device': cuda:0}
recomputed metadata: {'shape': torch.Size([1, 200, 1024]), 'dtype': torch.bool, 'device': cuda:0}
```
## Root Cause

**Gradient checkpointing** was enabled by default in `TradingTransformerConfig`:

```python
use_gradient_checkpointing: bool = True  # Trade compute for memory (saves ~30% memory)
```
Gradient checkpointing saves memory by recomputing activations during the backward pass instead of storing them. However, this causes issues when:

1. **Tensor shapes change** between the forward pass and recomputation (masks, boolean tensors)
2. **Non-deterministic operations** produce different results during recomputation
3. **In-place operations** modify tensors that checkpointing tries to save
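For reference, here is a minimal sketch (illustrative only, not the project's model code) of how such a flag is typically wired into a transformer-style forward pass with `torch.utils.checkpoint`; the class and layer choice are assumptions:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TinyEncoder(nn.Module):
    """Illustrative encoder: layers are either run normally or re-run during backward."""

    def __init__(self, d_model: int = 1024, n_layers: int = 4,
                 use_gradient_checkpointing: bool = True):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))
        self.use_gradient_checkpointing = use_gradient_checkpointing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            if self.use_gradient_checkpointing and self.training:
                # Activations inside `layer` are discarded here and recomputed during
                # backward. If `layer` sees a mask whose shape or dtype differs on that
                # recomputation, PyTorch raises the CheckpointError quoted above.
                x = checkpoint(layer, x, use_reentrant=False)
            else:
                x = layer(x)
        return x
```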
## Impact

- **Training failed**: `Candle Acc: 0.0%` consistently
- **Loss became 0.0000** after backward errors
- **Model couldn't learn**: accuracy stayed at 0% despite training
- **Per-candle training broken**: online learning failed completely
## Solution

**Disabled gradient checkpointing** in `NN/models/advanced_transformer_trading.py`:

```python
# Memory optimization
use_gradient_checkpointing: bool = False  # DISABLED: causes tensor shape mismatches during backward pass
```
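A quick way to confirm the new default, assuming the module path matches the file location above (adjust the import if the package layout differs):

```python
# Assumed module path based on NN/models/advanced_transformer_trading.py
from NN.models.advanced_transformer_trading import TradingTransformerConfig

config = TradingTransformerConfig()
assert config.use_gradient_checkpointing is False, "checkpointing should be off by default"
```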
## Memory Impact

This change will increase GPU memory usage slightly:

- **Before**: saves ~30% memory by recomputing activations
- **After**: stores all activations in memory
**Current memory usage**: 1.63 GB / 46.97 GB (3.5%)

- We have **plenty of headroom** (~45 GB free)
- The memory saving is not needed on this GPU
- Training stability is more important
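To check the headroom on your own machine, standard `torch.cuda` calls report allocated versus total device memory; a short sketch:

```python
import torch

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    total = torch.cuda.get_device_properties(device).total_memory
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    print(f"allocated: {allocated / 1e9:.2f} GB, "
          f"reserved: {reserved / 1e9:.2f} GB, "
          f"total: {total / 1e9:.2f} GB")
```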
## Expected Results After Fix

With gradient checkpointing disabled:

### Batch Training
```
Batch 1/23, Loss: 0.535, Candle Acc: 15-25%, Trend Acc: 45-55%
Batch 5/23, Loss: 0.420, Candle Acc: 20-30%, Trend Acc: 50-60%
Batch 10/23, Loss: 0.350, Candle Acc: 25-35%, Trend Acc: 55-65%
```

### Per-Candle Training
```
Per-candle training: Loss=0.4231 (avg: 0.4156), Acc=28.50% (avg: 25.32%)
Trained on candle: ETH/USDT 1s @ 2025-11-22 17:03:41+00:00 (change: -0.06%)
```

### Epoch Summary
```
Epoch 1/10, Loss: 0.385, Accuracy: 26.34% (23 batches)
```
## Files Modified

- `/mnt/shared/DEV/repos/d-popov.com/gogo2/NN/models/advanced_transformer_trading.py`
  - Line 64: changed to `use_gradient_checkpointing: bool = False`
## Testing Instructions

1. **Delete old checkpoints** (they may be corrupted by the failed training runs):
   ```bash
   rm -rf models/checkpoints/transformer/*
   ```

2. **Restart training**:
   - Go to the ANNOTATE UI
   - Load the Transformer model (this creates a fresh model)
   - Start "Live Inference + Per-Candle Training"

3. **Monitor logs for improvements** (see the sketch after this list):
   - Watch for `Candle Acc` > 0%
   - Check that `Loss` decreases over batches
   - Verify there are no more `CheckpointError` or `Inplace operation error` messages

4. **Expected timeline**:
   - First few batches: Acc ~15-25%
   - After 1 epoch: Acc ~25-35%
   - After 5-10 epochs: Acc should improve to 40-60%
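A minimal way to automate step 3, assuming a plain-text training log; the `logs/training.log` path is a placeholder, so adjust it to wherever your run writes logs:

```python
import re
from pathlib import Path

log_text = Path("logs/training.log").read_text()  # placeholder path

errors = re.findall(r"CheckpointError|Inplace operation error", log_text)
candle_accs = [float(v) for v in re.findall(r"Candle Acc: ([\d.]+)", log_text)]
losses = [float(v) for v in re.findall(r"Loss: ([\d.]+)", log_text)]

print(f"backward errors seen: {len(errors)}")
if candle_accs:
    print(f"candle accuracy: first {candle_accs[0]:.1f}% -> latest {candle_accs[-1]:.1f}%")
if losses:
    print(f"loss: first {losses[0]:.3f} -> latest {losses[-1]:.3f}")
```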
## Additional Notes

### Why This Happens

Gradient checkpointing in PyTorch recomputes the forward pass during backward. If:

- a mask changes from a `[200, 1024]` float tensor to a `[1, 200, 1024]` bool tensor,
- dropout produces different random values, or
- any operation is non-deterministic,

...then the recomputed tensors won't match the saved metadata, causing the error.
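For intuition, here is a self-contained repro sketch (illustrative, not the project's code) that forces exactly this kind of mismatch by changing a mask's shape between the original forward pass and the recomputation:

```python
import torch
from torch.utils.checkpoint import checkpoint

calls = {"n": 0}

def layer(x):
    calls["n"] += 1
    # Simulate a mask whose shape differs between the real forward pass (call 1)
    # and the recomputation that checkpointing runs during backward (call 2).
    if calls["n"] == 1:
        mask = torch.ones(200, 1024)
    else:
        mask = torch.ones(1, 200, 1024)
    return x * mask

x = torch.randn(200, 1024, requires_grad=True)
out = checkpoint(layer, x, use_reentrant=False)
out.sum().backward()  # raises torch.utils.checkpoint.CheckpointError: metadata mismatch
```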
### Alternative Solutions (if memory becomes an issue)

If we run out of memory in the future, we can (see the sketch after this list):

1. **Reduce batch size**: currently uses the default batch size
2. **Reduce sequence length**: currently 200; could use 100
3. **Use mixed precision more aggressively**: already using AMP
4. **Disable uncertainty estimation**: turn off `use_uncertainty_estimation`
5. **Reduce model size**: decrease `d_model` or `n_layers`
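A sketch of what those knobs might look like as config overrides; `d_model`, `n_layers`, and `use_uncertainty_estimation` are named in this document, while the other field names (and the import path) are assumptions about `TradingTransformerConfig`:

```python
# Assumed module path (see "Solution" above)
from NN.models.advanced_transformer_trading import TradingTransformerConfig

config = TradingTransformerConfig()
config.batch_size = 16                     # assumed field: smaller batches
config.seq_len = 100                       # assumed field: sequence length 200 -> 100
config.use_uncertainty_estimation = False  # named in this doc
config.d_model = 512                       # named in this doc: narrower model
config.n_layers = 4                        # named in this doc: fewer layers
```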
But with ~45 GB free, we don't need any of these optimizations yet!

## Status

✅ **FIXED** - Gradient checkpointing disabled
⏳ **PENDING** - User needs to test with a fresh training run