fix: Main Problem: Batch Corruption Across Epochs

2025-12-08 20:00:47 +02:00
parent cc555735e8
commit 81a7f27d2d
4 changed files with 205 additions and 15 deletions
--- a/TRAINING_EFFECTIVENESS_FIXES.md
+++ b/TRAINING_EFFECTIVENESS_FIXES.md
@@ -0,0 +1,133 @@
+# Training Effectiveness Fixes
+
+## Issues Identified
+
+From the logs, we found several critical issues preventing effective training:
+
+### 1. **Batch Corruption Across Epochs** ❌
+**Problem**: Only epoch 1 trains successfully; epochs 2-10 all report a loss of 0.0
+```
+Epoch 1/10, Loss: 1.688709, Accuracy: 0.00% (1 batches)  ✅ Training works
+Epoch 2/10, Loss: 0.000000, Accuracy: 0.00% (1 batches)  ❌ No training
+Epoch 3/10, Loss: 0.000000, Accuracy: 0.00% (1 batches)  ❌ No training
+...
+WARNING - No timeframe data available for transformer forward pass
+WARNING - No 'actions' key in batch - skipping this training step
+```
+
+**Root Cause**:
+- Batches were reused across epochs without being copied
+- `train_step()` modified the batch dict in place
+- By epoch 2, the batch tensors were corrupted or missing, which is what the "No 'actions' key" and "No timeframe data" warnings report
+
+**Fix Applied**:
+1. **Batch Generator**: Create a shallow copy of the batch dict for each yield
+   ```python
+   # Before: yield batch (same object reused)
+   # After:  yield {k: v for k, v in batch.items()} (new dict each time)
+   ```
+
+2. **Train Step**: Always create a new `batch_on_device` dict instead of modifying the input
+   ```python
+   # Before: batch = batch_gpu (modifies input)
+   # After:  batch_on_device = {...} (new dict, preserves input)
+   ```
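+
+A minimal sketch of why reusing the same batch dict breaks later epochs, and how a per-yield shallow copy avoids it. `batch_generator`, `destructive_train_step`, `run_epochs`, and the `price_data_1m` key are hypothetical stand-ins for illustration, not the code in `real_training_adapter.py`:
+
+```python
+def batch_generator(batches, copy_each=True):
+    """Yield each stored batch, optionally as a fresh shallow-copied dict."""
+    for batch in batches:
+        yield dict(batch) if copy_each else batch
+
+
+def destructive_train_step(batch):
+    """Hypothetical train step that mutates the dict it receives."""
+    actions = batch.pop("actions", None)  # removes the key from the caller's dict
+    if actions is None:
+        return 0.0  # mirrors the "skipping this training step" path
+    return 1.0
+
+
+def run_epochs(batches, epochs, copy_each):
+    """Run the destructive step over all batches for several epochs."""
+    losses = []
+    for _ in range(epochs):
+        for batch in batch_generator(batches, copy_each):
+            losses.append(destructive_train_step(batch))
+    return losses
+
+
+stored = [{"actions": [1], "price_data_1m": [0.5]}]
+losses_shared = run_epochs([dict(b) for b in stored], epochs=3, copy_each=False)
+losses_copied = run_epochs([dict(b) for b in stored], epochs=3, copy_each=True)
+```
+
+Here `losses_shared` comes out as `[1.0, 0.0, 0.0]`, the same pattern as the logs, while `losses_copied` stays at `[1.0, 1.0, 1.0]`. A shallow copy only protects the dict's keys: the values are still shared objects, so in-place tensor edits would additionally need `.clone()`.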
+
+### 2. **Remaining In-place Errors** ⚠️
+**Problem**: In-place operation errors still occur occasionally, but training recovers from them
+```
+ERROR - Inplace operation error: [torch.FloatTensor [128, 3]] version 4; expected version 2
+ERROR - Inplace operation error: [torch.FloatTensor [256, 256]] version 6; expected version 4
+```
+
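+**Root Cause** (suspected):
+- The `[128, 3]` shape matches the `trend_target` tensor, which suggests batching is creating shared tensors
+- Weight matrices of shape `[256, 256]` are modified in place after autograd saved them for the backward pass, so their version counter no longer matches the expected one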
+
+**Current Status**:
+- Errors are caught and training continues (the affected step returns 0.0 loss)
+- Training does not crash, but every skipped step is a lost update
+
+**Potential Additional Fixes** (if issues persist):
+1. Ensure `trend_target` is detached after creation
+2. Add `.detach()` before the loss calculation to intermediate tensors that should not receive gradients (detaching a tensor on the loss path would block its gradient)
+3. Use `torch.no_grad()` for any non-training operations
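+
+The version mismatch in the errors above can be reproduced in a few lines. This is a sketch with made-up shapes, not the model's code: `torch.sigmoid` saves its output for the backward pass, so editing that output in place invalidates the graph, while the out-of-place form leaves it intact.
+
+```python
+import torch
+
+torch.manual_seed(0)
+w = torch.randn(3, 3, requires_grad=True)
+x = torch.randn(4, 3)
+
+# Failing pattern: in-place edit of a tensor that autograd saved for backward.
+h = torch.sigmoid(x @ w)
+loss = h.sum()
+h.add_(1.0)  # bumps h's version counter after it was saved
+try:
+    loss.backward()
+    error_raised = False
+except RuntimeError:
+    error_raised = True
+
+# Safe pattern: build a new tensor instead of editing the saved one.
+w.grad = None
+h = torch.sigmoid(x @ w)
+shifted = h + 1.0  # out-of-place, h is left untouched
+shifted.sum().backward()
+grad_ok = w.grad is not None
+```
+
+When the offending operation is hard to find, `torch.autograd.set_detect_anomaly(True)` makes the error include a traceback of the forward call that produced the failing node, at the cost of slower training.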
+
+### 3. **Zero GPU Utilization** 🔧
+**Problem**: GPU shows 0.0% utilization and 0.00GB memory
+```
+GPU: AMD Radeon 8060S, Util: 0.0%, Mem: 0.00GB/46.97GB
+```
+
+**Possible Causes**:
+1. **ROCm/AMD GPU monitoring issue**: The monitoring tool might not support AMD GPUs properly
+2. **Computation too fast**: Single-sample batches complete before monitoring can measure
+3. **CPU fallback**: Model might be running on CPU despite GPU being available
+
+**Recommendations**:
+1. Check whether the model is actually on the GPU: `next(model.parameters()).device`
+2. Increase the batch size so GPU operations run long enough to be measured
+3. Use AMD-specific monitoring tools (`rocm-smi`) instead of `nvidia-smi`-based tools
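+
+A small sketch of the device check from recommendation 1, extended to report every device the parameters live on. `describe_placement` is a hypothetical helper and the `nn.Linear` model is only a placeholder. On ROCm builds of PyTorch the `torch.cuda` API is backed by HIP, and `torch.version.hip` is set instead of being `None`.
+
+```python
+import torch
+from torch import nn
+
+
+def describe_placement(model):
+    """Summarize where a model's parameters live and what PyTorch can see."""
+    devices = sorted({str(p.device) for p in model.parameters()})
+    return {
+        "param_devices": devices,
+        "accelerator_available": torch.cuda.is_available(),
+        "hip_version": getattr(torch.version, "hip", None),  # None on non-ROCm builds
+    }
+
+
+model = nn.Linear(8, 2)  # placeholder model, created on the CPU by default
+report = describe_placement(model)
+```
+
+If `param_devices` lists only `cpu` while `accelerator_available` is `True`, the model was never moved to the GPU, which would explain the 0.0% utilization independently of any monitoring problem.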
+
+### 4. **Single Sample Batches** 📊
+**Problem**: Training with only 1 sample per batch
+```
+Total samples: 1
+Ready to train on 1 batches
+```
+
+**Impact**:
+- Poor GPU utilization (GPUs are optimized for parallel processing)
+- Noisy gradients (no batch averaging)
+- Slower training convergence
+
+**Recommendations**:
+1. Accumulate more training samples before starting training
+2. Use gradient accumulation to simulate larger batches
+3. Collect multiple pivot points before triggering training
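+
+Gradient accumulation (recommendation 2) can be sketched as follows. The model, data, learning rate, and `accum_steps` value are placeholders chosen for illustration. Each single-sample loss is divided by `accum_steps` so the summed gradients approximate the gradient of one averaged batch.
+
+```python
+import torch
+from torch import nn
+
+torch.manual_seed(0)
+model = nn.Linear(4, 3)  # placeholder for the real model
+optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
+loss_fn = nn.CrossEntropyLoss()
+
+# Eight single-sample "batches", as in the logs above.
+samples = [(torch.randn(1, 4), torch.tensor([i % 3])) for i in range(8)]
+accum_steps = 4  # number of samples that make up one effective batch
+optimizer_steps = 0
+
+optimizer.zero_grad()
+for i, (x, y) in enumerate(samples, start=1):
+    loss = loss_fn(model(x), y) / accum_steps  # scale so gradients average
+    loss.backward()  # gradients add up in .grad across iterations
+    if i % accum_steps == 0:
+        optimizer.step()  # one update per accum_steps samples
+        optimizer.zero_grad()
+        optimizer_steps += 1
+```
+
+This reduces gradient noise but does not improve GPU utilization, because each forward pass still processes one sample; collecting more samples into real batches addresses both.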
+
+## Files Modified
+
+1. **ANNOTATE/core/real_training_adapter.py**
+   - Lines 2527-2538: Batch generator now creates shallow copies
+
+2. **NN/models/advanced_transformer_trading.py**
+   - Lines 1350-1390: Train step creates new batch_on_device dict
+
+## Expected Improvements
+
+After these fixes:
+
+✅ **All epochs should train**: Epochs 2-10 should report real loss values, not 0.0
+✅ **Consistent training**: No more "No timeframe data" warnings after epoch 1
+✅ **Better convergence**: Loss should decrease across epochs
+✅ **Fewer in-place errors**: Batch corruption was causing many of these
+
+## Testing Checklist
+
+Run realtime training and verify:
+
+- [ ] Epoch 1 trains successfully (already working)
+- [ ] Epoch 2 shows non-zero loss (should be fixed now)
+- [ ] Epochs 3-10 all train with real loss values
+- [ ] No "No timeframe data" warnings after epoch 1
+- [ ] Loss decreases over epochs (model is learning)
+- [ ] Accuracy increases over epochs
+- [ ] Fewer in-place operation errors
+
+## Additional Recommendations
+
+### Short Term:
+1. **Increase training samples**: Collect 10-20 pivot points before training
+2. **Batch size**: Group samples into batches of 8-16 for better GPU utilization
+3. **Learning rate**: May need adjustment if training is too slow/fast
+
+### Medium Term:
+1. **Data augmentation**: Generate more training samples from each pivot
+2. **Validation set**: Split data to monitor overfitting
+3. **Early stopping**: Stop training when validation loss stops improving
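+
+Early stopping can be sketched as a patience counter on the validation loss. The `EarlyStopping` class and the loss values below are illustrative assumptions, not part of the existing training code:
+
+```python
+class EarlyStopping:
+    """Signal a stop after `patience` epochs without validation improvement."""
+
+    def __init__(self, patience=3, min_delta=0.0):
+        self.patience = patience
+        self.min_delta = min_delta
+        self.best = float("inf")
+        self.bad_epochs = 0
+
+    def step(self, val_loss):
+        """Record one epoch's validation loss; return True when training should stop."""
+        if val_loss < self.best - self.min_delta:
+            self.best = val_loss
+            self.bad_epochs = 0
+        else:
+            self.bad_epochs += 1
+        return self.bad_epochs >= self.patience
+
+
+stopper = EarlyStopping(patience=2)
+val_losses = [1.0, 0.8, 0.79, 0.81, 0.80, 0.78]  # made-up values
+stopped_at = None
+for epoch, val_loss in enumerate(val_losses, start=1):
+    if stopper.step(val_loss):
+        stopped_at = epoch
+        break
+```
+
+With these values training stops at epoch 5, after two epochs that fail to beat the best loss of 0.79. A larger `patience` tolerates noisier validation curves, which matters when the validation set is as small as the current sample counts imply.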
+
+### Long Term:
+1. **Distributed training**: Use multiple GPUs if available
+2. **Mixed precision**: Already enabled, but verify it's working
+3. **Model pruning**: Remove unused parameters to speed up training