# Training Effectiveness Fixes

## Issues Identified
From the logs, we found several critical issues preventing effective training:
### 1. Batch Corruption Across Epochs ❌
**Problem:** Only epoch 1 trains successfully; epochs 2-10 all show 0.0 loss.

```
Epoch 1/10, Loss: 1.688709, Accuracy: 0.00% (1 batches)  ✅ Training works
Epoch 2/10, Loss: 0.000000, Accuracy: 0.00% (1 batches)  ❌ No training
Epoch 3/10, Loss: 0.000000, Accuracy: 0.00% (1 batches)  ❌ No training
...
WARNING - No timeframe data available for transformer forward pass
WARNING - No 'actions' key in batch - skipping this training step
```
**Root Cause:**
- Batches were being reused across epochs without copying
- `train_step()` was modifying the batch dict in-place
- By epoch 2, the batch tensors were corrupted or missing
**Fix Applied:**

1. **Batch generator:** create a shallow copy of the batch dict for each yield
   ```python
   # Before: yield batch                              (same object reused)
   # After:  yield {k: v for k, v in batch.items()}   (new dict each time)
   ```
2. **Train step:** always create a new `batch_on_device` dict instead of modifying the input
   ```python
   # Before: batch = batch_gpu         (modifies input)
   # After:  batch_on_device = {...}   (new dict, preserves input)
   ```
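A minimal sketch of both changes, assuming each batch is a plain dict of tensors (the function names below are illustrative, not the exact code in `real_training_adapter.py`):

```python
import torch

def batch_generator(cached_batches):
    """Yield a shallow copy of each cached batch so train_step()
    cannot corrupt the original dict used by later epochs."""
    for batch in cached_batches:
        # New dict per yield; the tensors themselves are still shared (no data copy).
        yield {k: v for k, v in batch.items()}

def move_batch_to_device(batch, device):
    """Build a new dict on the target device instead of mutating the input batch."""
    return {k: (v.to(device) if torch.is_tensor(v) else v) for k, v in batch.items()}
```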
### 2. Remaining Inplace Errors ⚠️
**Problem:** Occasional inplace operation errors still appear (but training recovers).

```
ERROR - Inplace operation error: [torch.FloatTensor [128, 3]] version 4; expected version 2
ERROR - Inplace operation error: [torch.FloatTensor [256, 256]] version 6; expected version 4
```
**Root Cause:**
- The `trend_target` tensor `[128, 3]` suggests batching is creating shared tensors
- Weight matrices `[256, 256]` are being modified during the backward pass
**Current Status:**
- Errors are caught and training continues (the affected step returns 0.0 loss)
- Not crashing, but training opportunities are lost on those steps
**Potential Additional Fixes (if issues persist):**
- Ensure `trend_target` is detached after creation
- Add `.detach()` to intermediate tensors before loss calculation
- Use `torch.no_grad()` for any non-training operations
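If the errors keep appearing, something along these lines could be tried (an illustrative sketch; `build_trend_target` and `compute_accuracy` are hypothetical helpers, not functions from the codebase):

```python
import torch

def build_trend_target(raw_trend: torch.Tensor) -> torch.Tensor:
    # Detach right after creation so a later backward pass cannot bump
    # the tensor's version counter and trigger inplace-operation errors.
    return raw_trend.clone().detach()

def compute_accuracy(predictions: torch.Tensor, targets: torch.Tensor) -> float:
    # Metrics are bookkeeping, not training: keep them out of the autograd graph.
    with torch.no_grad():
        return (predictions.argmax(dim=-1) == targets.argmax(dim=-1)).float().mean().item()
```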
### 3. Zero GPU Utilization 🔧
**Problem:** GPU shows 0.0% utilization and 0.00 GB memory.

```
GPU: AMD Radeon 8060S, Util: 0.0%, Mem: 0.00GB/46.97GB
```
**Possible Causes:**
- **ROCm/AMD GPU monitoring issue:** the monitoring tool might not support AMD GPUs properly
- **Computation too fast:** single-sample batches complete before monitoring can measure them
- **CPU fallback:** the model might be running on CPU despite a GPU being available
**Recommendations:**
- Check whether the model is actually on the GPU: `next(model.parameters()).device`
- Increase the batch size so GPU operations run long enough to be measured
- Use AMD-specific monitoring tools (`rocm-smi`) instead of nvidia-smi based tools
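A quick sanity check along these lines can confirm where the model actually lives (a sketch; PyTorch's ROCm build exposes AMD GPUs through the CUDA API, so `torch.cuda.*` is the right namespace here):

```python
import torch

def report_model_device(model: torch.nn.Module) -> None:
    device = next(model.parameters()).device
    print(f"Model parameters live on: {device}")  # expect 'cuda:0', not 'cpu'
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
    else:
        print("No GPU visible to PyTorch - the model is running on CPU")
```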
### 4. Single Sample Batches 📊
**Problem:** Training runs with only 1 sample per batch.

```
Total samples: 1
Ready to train on 1 batches
```
**Impact:**
- Poor GPU utilization (GPUs are optimized for parallel processing)
- Noisy gradients (no batch averaging)
- Slower training convergence
**Recommendations:**
- Accumulate more training samples before starting training
- Use gradient accumulation to simulate larger batches (see the sketch below)
- Collect multiple pivot points before triggering training
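A minimal gradient-accumulation sketch, assuming each batch is a dict carrying the model inputs and an `actions` target (the key names here are illustrative):

```python
import torch

def train_epoch(model, batches, optimizer, loss_fn, accum_steps: int = 8):
    """Accumulate gradients over several small batches to approximate
    a larger effective batch size before each optimizer step."""
    optimizer.zero_grad()
    for step, batch in enumerate(batches, start=1):
        outputs = model(batch["inputs"])
        # Scale the loss so accumulated gradients average rather than sum.
        loss = loss_fn(outputs, batch["actions"]) / accum_steps
        loss.backward()
        if step % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```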
## Files Modified
- `ANNOTATE/core/real_training_adapter.py`
  - Lines 2527-2538: batch generator now creates shallow copies
- `NN/models/advanced_transformer_trading.py`
  - Lines 1350-1390: train step creates a new `batch_on_device` dict
## Expected Improvements
After these fixes:
- ✅ **All epochs should train:** epochs 2-10 will have real loss values, not 0.0
- ✅ **Consistent training:** no more "No timeframe data" warnings after epoch 1
- ✅ **Better convergence:** loss should decrease across epochs
- ✅ **Fewer inplace errors:** batch corruption was causing many of them
## Testing Checklist
Run realtime training and verify:
- Epoch 1 trains successfully (already working)
- Epoch 2 shows non-zero loss (should be fixed now)
- Epochs 3-10 all train with real loss values
- No "No timeframe data" warnings after epoch 1
- Loss decreases over epochs (model is learning)
- Accuracy increases over epochs
- Fewer inplace operation errors
## Additional Recommendations

**Short Term:**
- **Increase training samples:** collect 10-20 pivot points before training
- **Batch size:** group samples into batches of 8-16 for better GPU utilization (see the sketch below)
- **Learning rate:** may need adjustment if training is too slow or too fast
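For the batch-size recommendation, a standard `DataLoader` can group individual pivot samples into batches of 8-16 (a sketch assuming samples reduce to feature/label tensor pairs):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(features: torch.Tensor, labels: torch.Tensor, batch_size: int = 16) -> DataLoader:
    """Group single pivot-point samples into batches so each GPU step has parallel work."""
    dataset = TensorDataset(features, labels)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=False)
```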
**Medium Term:**
- **Data augmentation:** generate more training samples from each pivot
- **Validation set:** split the data to monitor overfitting
- **Early stopping:** stop training when validation loss stops improving (see the sketch below)
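Early stopping needs only a small wrapper around the existing train/validate calls (a sketch; `train_fn` and `validate_fn` stand in for whatever the training adapter exposes):

```python
def train_with_early_stopping(train_fn, validate_fn, max_epochs: int = 50, patience: int = 5):
    """Stop once validation loss has not improved for `patience` consecutive epochs."""
    best_val, stalled = float("inf"), 0
    for epoch in range(max_epochs):
        train_fn(epoch)
        val_loss = validate_fn(epoch)
        if val_loss < best_val:
            best_val, stalled = val_loss, 0
        else:
            stalled += 1
            if stalled >= patience:
                print(f"Early stopping at epoch {epoch} (best val loss {best_val:.4f})")
                break
```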
**Long Term:**
- **Distributed training:** use multiple GPUs if available
- **Mixed precision:** already enabled, but verify it's actually in effect (see the check below)
- **Model pruning:** remove unused parameters to speed up training
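To verify mixed precision is actually in effect, one option is to check the activation dtype under autocast (a sketch assuming the model sits on the GPU; the ROCm build uses the `cuda` device type for AMD GPUs):

```python
import torch

def check_autocast(model: torch.nn.Module, sample_input: torch.Tensor) -> None:
    """Run one forward pass under autocast and report the output dtype."""
    device = next(model.parameters()).device
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        out = model(sample_input.to(device))
    # Expect torch.float16 if the final layer is autocast-eligible (e.g. a Linear);
    # torch.float32 everywhere suggests autocast is not actually active.
    print(f"Output dtype under autocast: {out.dtype}")
```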