fix: Main Problem: Batch Corruption Across Epochs
QUICK_ACTION_SUMMARY.md (new file, 52 lines)

# Quick Action Summary - Training Effectiveness
## What Was Wrong

**Only epoch 1 was actually training; epochs 2-10 were skipped with a loss of 0.0.**

The batch dictionaries were being modified in place during training, so by epoch 2 the data they held was already corrupted. The failure mode is sketched below.
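Here is a minimal, self-contained reproduction of that failure mode. The names `make_batches` and `train_step` are hypothetical stand-ins for the real adapter code, and the "loss" is reduced to a flag, but the mechanism is the same: the generator hands out the same dict objects every epoch, and the first epoch's in-place mutation poisons every later one.

```python
def make_batches(grouped_batches):
    for batch in grouped_batches:
        yield batch  # the SAME dict object is handed out every epoch

def train_step(batch):
    # the real step computed a loss, then clobbered keys while preparing the data
    loss = 1.0 if batch.get("price_data") is not None else 0.0
    batch["price_data"] = None  # in-place mutation of the shared dict
    return loss

grouped_batches = [{"price_data": [1.0, 2.0, 3.0]}]

for epoch in range(1, 4):
    losses = [train_step(b) for b in make_batches(grouped_batches)]
    print(f"epoch {epoch}: loss={losses[0]}")
# epoch 1 prints 1.0; epochs 2 and 3 print 0.0 because the shared dict was corrupted
```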
## What Was Fixed

### 1. Batch Generator (ANNOTATE/core/real_training_adapter.py)

```python
# ❌ BEFORE - the same batch object was reused every epoch
for batch in grouped_batches:
    yield batch

# ✅ AFTER - a new dict is yielded each time
for batch in grouped_batches:
    batch_copy = {k: v for k, v in batch.items()}
    yield batch_copy
```
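Note that the dict comprehension is a shallow copy: it protects against keys being reassigned on the yielded dict (which is what broke epochs 2-10), but not against in-place mutation of the values themselves. A quick illustration, with a plain list standing in for tensor data:

```python
import copy

original = {"price_data": [1.0, 2.0, 3.0]}

shallow = {k: v for k, v in original.items()}
shallow["price_data"] = None                  # reassigning a key: original untouched
assert original["price_data"] == [1.0, 2.0, 3.0]

shallow = {k: v for k, v in original.items()}
shallow["price_data"].append(4.0)             # mutating the shared value: original changes too
assert original["price_data"] == [1.0, 2.0, 3.0, 4.0]

deep = copy.deepcopy(original)
deep["price_data"].append(5.0)                # deep copy: original protected either way
assert original["price_data"] == [1.0, 2.0, 3.0, 4.0]
```

If the train step ever starts mutating tensors in place, the values would need to be cloned (or deep-copied) as well.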
### 2. Train Step (NN/models/advanced_transformer_trading.py)

```python
# ❌ BEFORE - modifies the input batch
batch = batch_gpu  # overwrites the caller's dict

# ✅ AFTER - creates a new dict, preserving the input
batch_on_device = {}
for k, v in batch.items():
    batch_on_device[k] = v
```
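The summary above elides the actual device transfer. For context, a non-mutating move-to-device loop typically looks something like the sketch below; the helper name and the `non_blocking` flag are assumptions, not necessarily what the trainer does.

```python
import torch

def move_batch_to_device(batch: dict, device: torch.device) -> dict:
    """Return a new dict with tensors on `device`; the caller's batch is never mutated.

    Hypothetical helper - the trainer's real device-transfer code may differ.
    """
    batch_on_device = {}
    for k, v in batch.items():
        if isinstance(v, torch.Tensor):
            batch_on_device[k] = v.to(device, non_blocking=True)
        else:
            batch_on_device[k] = v  # leave non-tensor entries (ids, metadata) as-is
    return batch_on_device
```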
## Expected Result

- ✅ All 10 epochs should now train with real loss values
- ✅ No more "No timeframe data" warnings after epoch 1
- ✅ Loss should decrease across epochs
- ✅ Model should actually learn
## Still Need to Address

1. **GPU utilization 0%** - might be a monitoring artifact, or a symptom of single-sample batches
2. **Occasional in-place errors** - caught and recovered from, but those training steps are lost
3. **Single-sample batches** - need to accumulate more samples per batch for effective training (see the sketch after this list)
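One way item 3 could be handled is to collate consecutive single-sample batches before yielding them. This is only a sketch: `accumulate_batches` is not the adapter's actual API, and it assumes every batch dict holds tensors with identical keys and a leading batch dimension.

```python
import torch

def accumulate_batches(single_sample_batches, batch_size: int = 8):
    """Collate consecutive single-sample batch dicts into larger batches."""
    buffer = []
    for batch in single_sample_batches:
        buffer.append(batch)
        if len(buffer) == batch_size:
            yield {k: torch.cat([b[k] for b in buffer], dim=0) for k in buffer[0]}
            buffer = []
    if buffer:  # flush whatever is left at the end of the epoch
        yield {k: torch.cat([b[k] for b in buffer], dim=0) for k in buffer[0]}
```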
## Test It

Run your realtime training again and check that:

- Epoch 2 shows a non-zero loss (not 0.000000)
- All epochs train successfully
- Loss decreases over time
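If you want to automate that check, something along these lines works on whatever list of per-epoch average losses the training loop already collects (the variable name `epoch_losses` is an assumption):

```python
def check_training_run(epoch_losses):
    """Sanity-check a list of per-epoch average losses (floats, one per epoch)."""
    zero_epochs = [i + 1 for i, loss in enumerate(epoch_losses) if loss == 0.0]
    if zero_epochs:
        print(f"WARNING: epochs {zero_epochs} reported 0.0 loss - batches may still be corrupted")
    if len(epoch_losses) > 1 and epoch_losses[-1] >= epoch_losses[0]:
        print("WARNING: loss did not decrease from the first epoch to the last")
    return not zero_epochs and len(epoch_losses) > 1 and epoch_losses[-1] < epoch_losses[0]
```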