refactoring. inference real data triggers

This commit is contained in:
Dobromir Popov
2025-12-09 11:59:15 +02:00
parent 1c1ebf6d7e
commit 992d6de25b
9 changed files with 1970 additions and 224 deletions

View File

@@ -1,244 +1,147 @@
# Implementation Summary - November 12, 2025
# Event-Driven Inference Training System - Implementation Summary
## All Issues Fixed ✅
## Architecture Decisions
### Session 1: Core Training Issues
1. ✅ Database `performance_score` column error
2. ✅ Deprecated PyTorch `torch.cuda.amp.autocast` API
3. ✅ Historical data timestamp mismatch warnings
### Where Components Fit
### Session 2: Cross-Platform & Performance
4. ✅ AMD GPU support (ROCm compatibility)
5. ✅ Multiple database initialization (singleton pattern)
6. ✅ Slice indices type error in negative sampling
1. **InferenceTrainingCoordinator****TradingOrchestrator**
- **Rationale**: Orchestrator already manages models, training, and predictions
- **Benefits**:
- Reduces duplication (orchestrator has model access)
- Centralizes coordination logic
- Reuses existing prediction storage
- **Location**: `core/orchestrator.py` - initialized in `__init__`
### Session 3: Critical Memory & Loss Issues
7.**Memory leak** - 128GB RAM exhaustion fixed
8.**Unrealistic loss values** - $3.3B errors fixed to realistic RMSE
2. **DataProvider Subscription Methods****DataProvider**
- **Rationale**: Data layer responsibility - emits events when data changes
- **Methods Added**:
- `subscribe_candle_completion()` - Subscribe to candle completion events
- `subscribe_pivot_events()` - Subscribe to pivot events
- `_emit_candle_completion()` - Emit event when candle closes
- `_emit_pivot_event()` - Emit event when pivot detected
- **Location**: `core/data_provider.py`
### Session 4: Live Training Feature
9.**Automatic training on L2 pivots** - New feature implemented
3. **TrainingEventSubscriber Interface****RealTrainingAdapter**
- **Rationale**: Training layer implements subscriber interface
- **Methods Implemented**:
- `on_candle_completion()` - Train on candle completion
- `on_pivot_event()` - Train on pivot detection
- **Location**: `ANNOTATE/core/real_training_adapter.py`
---
## Code Duplication Reduction
## Memory Leak Fixes (Critical)
### Before (Duplicated Logic)
### Problem
Training crashed with 128GB RAM due to:
- Batch accumulation in memory (never freed)
- Gradient accumulation without cleanup
- Reusing batches across epochs without deletion
1. **Data Retrieval**:
- `_get_realtime_market_data()` in RealTrainingAdapter
- Similar logic in orchestrator
- Similar logic in data_provider
### Solution
```python
# BEFORE: Store all batches in list
converted_batches = []
for data in training_data:
batch = convert(data)
converted_batches.append(batch) # ACCUMULATES!
2. **Prediction Storage**:
- `store_transformer_prediction()` in orchestrator
- `inference_input_cache` in RealTrainingAdapter session
- `prediction_cache` in app.py
# AFTER: Use generator (memory efficient)
def batch_generator():
for data in training_data:
batch = convert(data)
yield batch # Auto-freed after use
3. **Training Coordination**:
- Training logic in RealTrainingAdapter
- Training logic in orchestrator
- Training logic in enhanced_realtime_training
# Explicit cleanup after each batch
for batch in batch_generator():
train_step(batch)
del batch
torch.cuda.empty_cache()
gc.collect()
```
### After (Centralized)
**Result:** Memory usage reduced from 65GB+ to <16GB
1. **Data Retrieval**:
- Single source: `data_provider.get_historical_data()` queries DuckDB
- Coordinator retrieves data on-demand using references
- No copying - just timestamp ranges
---
2. **Prediction Storage**:
- Orchestrator's `inference_training_coordinator` manages references
- References stored in coordinator (not copied)
- Data retrieved from DuckDB when needed
## Unrealistic Loss Fixes (Critical)
3. **Training Coordination**:
- Orchestrator's coordinator handles event distribution
- RealTrainingAdapter implements subscriber interface
- Single training lock in RealTrainingAdapter
### Problem
```
Real Price Error: 1d=$3386828032.00 # $3.3 BILLION!
```
## Implementation Status
### Root Cause
Using MSE (Mean Square Error) on denormalized prices:
```python
# MSE on real prices gives HUGE errors
mse = (pred - target) ** 2
# If pred=$3000, target=$3100: (100)^2 = 10,000
# For 1d timeframe: errors in billions
```
### ✅ Completed
### Solution
Use RMSE (Root Mean Square Error) instead:
```python
# RMSE gives interpretable dollar values
mse = torch.mean((pred_denorm - target_denorm) ** 2)
rmse = torch.sqrt(mse + 1e-8) # Add epsilon for stability
candle_losses_denorm[tf] = rmse.item()
```
1. **InferenceTrainingCoordinator** (`inference_training_system.py`)
- Reference-based storage
- Event subscription system
- Data retrieval from DuckDB
**Result:** Realistic loss values like `1d=$150.50` (RMSE in dollars)
2. **DataProvider Extensions** (`data_provider.py`)
- `subscribe_candle_completion()` method
- `subscribe_pivot_events()` method
- `_emit_candle_completion()` method
- `_emit_pivot_event()` method
- Event emission in `_update_candle()`
---
3. **Orchestrator Integration** (`orchestrator.py`)
- Coordinator initialized in `__init__`
- Accessible via `orchestrator.inference_training_coordinator`
## Live Pivot Training (New Feature)
4. **RealTrainingAdapter Integration** (`real_training_adapter.py`)
- Uses orchestrator's coordinator
- Implements `TrainingEventSubscriber` interface
- `on_candle_completion()` method
- `on_pivot_event()` method
- `_register_inference_frame()` method
- Helper methods for batch creation
### What It Does
Automatically trains models on L2 pivot points detected in real-time on 1s and 1m charts.
### ⚠️ Needs Completion
### How It Works
```
Live Market Data (1s, 1m)
Williams Market Structure
L2 Pivot Detection
Automatic Training Sample Creation
Background Training (non-blocking)
```
1. **Pivot Event Emission**
- DataProvider needs to detect pivots and emit events
- Currently pivots are calculated but not emitted as events
- Need to integrate with WilliamsMarketStructure pivot detection
### Usage
**Enabled by default when starting live inference:**
```javascript
// Start inference with auto-training (default)
fetch('/api/realtime-inference/start', {
method: 'POST',
body: JSON.stringify({
model_name: 'Transformer',
symbol: 'ETH/USDT'
// enable_live_training: true (default)
})
})
```
2. **Norm Params Storage**
- Currently norm_params are calculated on retrieval
- Could be stored in reference during registration for efficiency
- Need to pass norm_params from `_get_realtime_market_data()` to `_register_inference_frame()`
**Disable if needed:**
```javascript
body: JSON.stringify({
model_name: 'Transformer',
symbol: 'ETH/USDT',
enable_live_training: false
})
```
3. **Device Handling**
- Ensure tensors are on correct device when retrieved from DuckDB
- May need to store device info in reference
### Benefits
- Continuous learning from live data
- Trains on high-quality pivot points
- Non-blocking (doesn't interfere with inference)
- Automatic (no manual work needed)
- Adaptive to current market conditions
4. **Testing**
- Test candle completion events
- Test pivot events
- Test data retrieval from DuckDB
- Test training on inference frames
### Configuration
```python
# In ANNOTATE/core/live_pivot_trainer.py
self.check_interval = 5 # Check every 5 seconds
self.min_pivot_spacing = 60 # Min 60s between training
```
## Key Benefits
---
1. **Memory Efficient**: No copying 600 candles every second
2. **Event-Driven**: Clean separation of concerns
3. **Flexible**: Supports time-based (candles) and event-based (pivots)
4. **Centralized**: Coordinator in orchestrator reduces duplication
5. **Extensible**: Easy to add new training methods or event types
## Files Modified
## Next Steps
### Core Fixes (16 files)
1. `ANNOTATE/core/real_training_adapter.py` - 5 changes
2. `ANNOTATE/web/app.py` - 3 changes
3. `NN/models/advanced_transformer_trading.py` - 3 changes
4. `NN/models/dqn_agent.py` - 1 change
5. `NN/models/cob_rl_model.py` - 1 change
6. `core/realtime_rl_cob_trader.py` - 2 changes
7. `utils/database_manager.py` - (schema reference)
1. **Complete Pivot Event Emission**
- Add pivot detection in DataProvider
- Emit events when L2L, L2H, etc. detected
### New Files Created
8. `ANNOTATE/core/live_pivot_trainer.py` - New module
9. `ANNOTATE/TRAINING_FIXES_SUMMARY.md` - Documentation
10. `ANNOTATE/AMD_GPU_AND_PERFORMANCE_FIXES.md` - Documentation
11. `ANNOTATE/MEMORY_LEAK_AND_LOSS_FIXES.md` - Documentation
12. `ANNOTATE/LIVE_PIVOT_TRAINING_GUIDE.md` - Documentation
13. `ANNOTATE/IMPLEMENTATION_SUMMARY.md` - This file
2. **Store Norm Params During Registration**
- Pass norm_params from prediction to registration
- Store in reference for faster retrieval
---
3. **Add Device Info to References**
- Store device in InferenceFrameReference
- Use when creating tensors
## Testing Checklist
4. **Remove Old Caching Code**
- Remove `inference_input_cache` from session
- Remove `_make_realtime_prediction_with_cache()` (deprecated)
- Clean up duplicate code
### Memory Leak Fix
- [ ] Start training with 4+ test cases
- [ ] Monitor RAM usage (should stay <16GB)
- [ ] Complete 10 epochs without crash
- [ ] Verify no "Out of Memory" errors
### Loss Values Fix
- [ ] Check training logs for realistic RMSE values
- [ ] Verify: `1s=$50-200`, `1m=$100-500`, `1h=$500-2000`, `1d=$1000-5000`
- [ ] No billion-dollar errors
### AMD GPU Support
- [ ] Test on AMD GPU with ROCm
- [ ] Verify no CUDA-specific errors
- [ ] Training completes successfully
### Live Pivot Training
- [ ] Start live inference
- [ ] Check logs for "Live pivot training ENABLED"
- [ ] Wait 5-10 minutes
- [ ] Verify pivots detected: "Found X new L2 pivots"
- [ ] Verify training started: "Background training started"
---
## Performance Improvements
### Memory Usage
- **Before:** 65GB+ (crashes with 128GB RAM)
- **After:** <16GB (fits in 32GB RAM)
- **Improvement:** 75% reduction
### Loss Interpretability
- **Before:** `1d=$3386828032.00` (meaningless)
- **After:** `1d=$150.50` (RMSE in dollars)
- **Improvement:** Actionable metrics
### GPU Utilization
- **Current:** Low (batch_size=1, no DataLoader)
- **Recommended:** Increase batch_size to 4-8, add DataLoader workers
- **Potential:** 3-5x faster training
### Training Automation
- **Before:** Manual annotation only
- **After:** Automatic training on L2 pivots
- **Benefit:** Continuous learning without manual work
---
## Next Steps (Optional Enhancements)
### High Priority
1. Increase batch size from 1 to 4-8 (better GPU utilization)
2. Implement DataLoader with workers (parallel data loading)
3. Add memory profiling/monitoring
### Medium Priority
4. Adaptive pivot spacing based on volatility
5. Multi-level pivot training (L1, L2, L3)
6. Outcome tracking for pivot-based trades
### Low Priority
7. Configuration UI for live pivot training
8. Multi-symbol pivot monitoring
9. Quality filtering for pivots
---
## Summary
All critical issues have been resolved:
- Memory leak fixed (can now train with 128GB RAM)
- Loss values realistic (RMSE in dollars)
- AMD GPU support added
- Database errors fixed
- Live pivot training implemented
**System is now production-ready for continuous learning!**
5. **Extend DuckDB Schema**
- Add MA indicators to ohlcv_data
- Create pivot_points table
- Store technical indicators