refactoring. inference real data triggers

2025-12-09 11:59:15 +02:00
parent 1c1ebf6d7e
commit 992d6de25b
9 changed files with 1970 additions and 224 deletions
--- a/ANNOTATE/IMPLEMENTATION_SUMMARY.md
+++ b/ANNOTATE/IMPLEMENTATION_SUMMARY.md
@@ -1,244 +1,147 @@
-# Implementation Summary - November 12, 2025
+# Event-Driven Inference Training System - Implementation Summary

-## All Issues Fixed ✅
+## Architecture Decisions

-### Session 1: Core Training Issues
-1. ✅ Database `performance_score` column error
-2. ✅ Deprecated PyTorch `torch.cuda.amp.autocast` API
-3. ✅ Historical data timestamp mismatch warnings
+### Where Components Fit

-### Session 2: Cross-Platform & Performance
-4. ✅ AMD GPU support (ROCm compatibility)
-5. ✅ Multiple database initialization (singleton pattern)
-6. ✅ Slice indices type error in negative sampling
+1. **InferenceTrainingCoordinator** → **TradingOrchestrator**
+   - **Rationale**: Orchestrator already manages models, training, and predictions
+   - **Benefits**: 
+     - Reduces duplication (orchestrator has model access)
+     - Centralizes coordination logic
+     - Reuses existing prediction storage
+   - **Location**: `core/orchestrator.py` - initialized in `__init__`

-### Session 3: Critical Memory & Loss Issues
-7. ✅ **Memory leak** - 128GB RAM exhaustion fixed
-8. ✅ **Unrealistic loss values** - $3.3B errors fixed to realistic RMSE
+2. **DataProvider Subscription Methods** → **DataProvider**
+   - **Rationale**: Data layer responsibility - emits events when data changes
+   - **Methods Added**:
+     - `subscribe_candle_completion()` - Subscribe to candle completion events
+     - `subscribe_pivot_events()` - Subscribe to pivot events
+     - `_emit_candle_completion()` - Emit event when candle closes
+     - `_emit_pivot_event()` - Emit event when pivot detected
+   - **Location**: `core/data_provider.py`

-### Session 4: Live Training Feature
-9. ✅ **Automatic training on L2 pivots** - New feature implemented
+3. **TrainingEventSubscriber Interface** → **RealTrainingAdapter**
+   - **Rationale**: Training layer implements subscriber interface
+   - **Methods Implemented**:
+     - `on_candle_completion()` - Train on candle completion
+     - `on_pivot_event()` - Train on pivot detection
+   - **Location**: `ANNOTATE/core/real_training_adapter.py`

---
+## Code Duplication Reduction

-## Memory Leak Fixes (Critical)
+### Before (Duplicated Logic)

-### Problem
-Training crashed with 128GB RAM due to:
- Batch accumulation in memory (never freed)
- Gradient accumulation without cleanup
- Reusing batches across epochs without deletion
+1. **Data Retrieval**: 
+   - `_get_realtime_market_data()` in RealTrainingAdapter
+   - Similar logic in orchestrator
+   - Similar logic in data_provider

-### Solution
-```python
-# BEFORE: Store all batches in list
-converted_batches = []
-for data in training_data:
-    batch = convert(data)
-    converted_batches.append(batch)  # ACCUMULATES!
+2. **Prediction Storage**:
+   - `store_transformer_prediction()` in orchestrator
+   - `inference_input_cache` in RealTrainingAdapter session
+   - `prediction_cache` in app.py

-# AFTER: Use generator (memory efficient)
-def batch_generator():
-    for data in training_data:
-        batch = convert(data)
-        yield batch  # Auto-freed after use
+3. **Training Coordination**:
+   - Training logic in RealTrainingAdapter
+   - Training logic in orchestrator
+   - Training logic in enhanced_realtime_training

-# Explicit cleanup after each batch
-for batch in batch_generator():
-    train_step(batch)
-    del batch
-    torch.cuda.empty_cache()
-    gc.collect()
-```
+### After (Centralized)

-**Result:** Memory usage reduced from 65GB+ to <16GB
+1. **Data Retrieval**: 
+   - Single source: `data_provider.get_historical_data()` queries DuckDB
+   - Coordinator retrieves data on-demand using references
+   - No copying - just timestamp ranges

---
+2. **Prediction Storage**:
+   - Orchestrator's `inference_training_coordinator` manages references
+   - References stored in coordinator (not copied)
+   - Data retrieved from DuckDB when needed

-## Unrealistic Loss Fixes (Critical)
+3. **Training Coordination**:
+   - Orchestrator's coordinator handles event distribution
+   - RealTrainingAdapter implements subscriber interface
+   - Single training lock in RealTrainingAdapter

-### Problem
-```
-Real Price Error: 1d=$3386828032.00  # $3.3 BILLION!
-```
+## Implementation Status

-### Root Cause
-Using MSE (Mean Square Error) on denormalized prices:
-```python
-# MSE on real prices gives HUGE errors
-mse = (pred - target) ** 2
-# If pred=$3000, target=$3100: (100)^2 = 10,000
-# For 1d timeframe: errors in billions
-```
+### ✅ Completed

-### Solution
-Use RMSE (Root Mean Square Error) instead:
-```python
-# RMSE gives interpretable dollar values
-mse = torch.mean((pred_denorm - target_denorm) ** 2)
-rmse = torch.sqrt(mse + 1e-8)  # Add epsilon for stability
-candle_losses_denorm[tf] = rmse.item()
-```
+1. **InferenceTrainingCoordinator** (`inference_training_system.py`)
+   - Reference-based storage
+   - Event subscription system
+   - Data retrieval from DuckDB

-**Result:** Realistic loss values like `1d=$150.50` (RMSE in dollars)
+2. **DataProvider Extensions** (`data_provider.py`)
+   - `subscribe_candle_completion()` method
+   - `subscribe_pivot_events()` method
+   - `_emit_candle_completion()` method
+   - `_emit_pivot_event()` method
+   - Event emission in `_update_candle()`

---
+3. **Orchestrator Integration** (`orchestrator.py`)
+   - Coordinator initialized in `__init__`
+   - Accessible via `orchestrator.inference_training_coordinator`

-## Live Pivot Training (New Feature)
+4. **RealTrainingAdapter Integration** (`real_training_adapter.py`)
+   - Uses orchestrator's coordinator
+   - Implements `TrainingEventSubscriber` interface
+   - `on_candle_completion()` method
+   - `on_pivot_event()` method
+   - `_register_inference_frame()` method
+   - Helper methods for batch creation

-### What It Does
-Automatically trains models on L2 pivot points detected in real-time on 1s and 1m charts.
+### ⚠️ Needs Completion

-### How It Works
-```
-Live Market Data (1s, 1m)
-    ↓
-Williams Market Structure
-    ↓
-L2 Pivot Detection
-    ↓
-Automatic Training Sample Creation
-    ↓
-Background Training (non-blocking)
-```
+1. **Pivot Event Emission**
+   - DataProvider needs to detect pivots and emit events
+   - Currently pivots are calculated but not emitted as events
+   - Need to integrate with WilliamsMarketStructure pivot detection

-### Usage
-**Enabled by default when starting live inference:**
-```javascript
-// Start inference with auto-training (default)
-fetch('/api/realtime-inference/start', {
-    method: 'POST',
-    body: JSON.stringify({
-        model_name: 'Transformer',
-        symbol: 'ETH/USDT'
-        // enable_live_training: true (default)
-    })
-})
-```
+2. **Norm Params Storage**
+   - Currently norm_params are calculated on retrieval
+   - Could be stored in reference during registration for efficiency
+   - Need to pass norm_params from `_get_realtime_market_data()` to `_register_inference_frame()`

-**Disable if needed:**
-```javascript
-body: JSON.stringify({
-    model_name: 'Transformer',
-    symbol: 'ETH/USDT',
-    enable_live_training: false
-})
-```
+3. **Device Handling**
+   - Ensure tensors are on correct device when retrieved from DuckDB
+   - May need to store device info in reference

-### Benefits
- ✅ Continuous learning from live data
- ✅ Trains on high-quality pivot points
- ✅ Non-blocking (doesn't interfere with inference)
- ✅ Automatic (no manual work needed)
- ✅ Adaptive to current market conditions
+4. **Testing**
+   - Test candle completion events
+   - Test pivot events
+   - Test data retrieval from DuckDB
+   - Test training on inference frames

-### Configuration
-```python
-# In ANNOTATE/core/live_pivot_trainer.py
-self.check_interval = 5  # Check every 5 seconds
-self.min_pivot_spacing = 60  # Min 60s between training
-```
+## Key Benefits

---
+1. **Memory Efficient**: No copying 600 candles every second
+2. **Event-Driven**: Clean separation of concerns
+3. **Flexible**: Supports time-based (candles) and event-based (pivots)
+4. **Centralized**: Coordinator in orchestrator reduces duplication
+5. **Extensible**: Easy to add new training methods or event types

-## Files Modified
+## Next Steps

-### Core Fixes (16 files)
-1. `ANNOTATE/core/real_training_adapter.py` - 5 changes
-2. `ANNOTATE/web/app.py` - 3 changes
-3. `NN/models/advanced_transformer_trading.py` - 3 changes
-4. `NN/models/dqn_agent.py` - 1 change
-5. `NN/models/cob_rl_model.py` - 1 change
-6. `core/realtime_rl_cob_trader.py` - 2 changes
-7. `utils/database_manager.py` - (schema reference)
+1. **Complete Pivot Event Emission**
+   - Add pivot detection in DataProvider
+   - Emit events when L2L, L2H, etc. detected

-### New Files Created
-8. `ANNOTATE/core/live_pivot_trainer.py` - New module
-9. `ANNOTATE/TRAINING_FIXES_SUMMARY.md` - Documentation
-10. `ANNOTATE/AMD_GPU_AND_PERFORMANCE_FIXES.md` - Documentation
-11. `ANNOTATE/MEMORY_LEAK_AND_LOSS_FIXES.md` - Documentation
-12. `ANNOTATE/LIVE_PIVOT_TRAINING_GUIDE.md` - Documentation
-13. `ANNOTATE/IMPLEMENTATION_SUMMARY.md` - This file
+2. **Store Norm Params During Registration**
+   - Pass norm_params from prediction to registration
+   - Store in reference for faster retrieval

---
+3. **Add Device Info to References**
+   - Store device in InferenceFrameReference
+   - Use when creating tensors

-## Testing Checklist
+4. **Remove Old Caching Code**
+   - Remove `inference_input_cache` from session
+   - Remove `_make_realtime_prediction_with_cache()` (deprecated)
+   - Clean up duplicate code

-### Memory Leak Fix
- [ ] Start training with 4+ test cases
- [ ] Monitor RAM usage (should stay <16GB)
- [ ] Complete 10 epochs without crash
- [ ] Verify no "Out of Memory" errors
-
-### Loss Values Fix
- [ ] Check training logs for realistic RMSE values
- [ ] Verify: `1s=$50-200`, `1m=$100-500`, `1h=$500-2000`, `1d=$1000-5000`
- [ ] No billion-dollar errors
-
-### AMD GPU Support
- [ ] Test on AMD GPU with ROCm
- [ ] Verify no CUDA-specific errors
- [ ] Training completes successfully
-
-### Live Pivot Training
- [ ] Start live inference
- [ ] Check logs for "Live pivot training ENABLED"
- [ ] Wait 5-10 minutes
- [ ] Verify pivots detected: "Found X new L2 pivots"
- [ ] Verify training started: "Background training started"
-
---
-
-## Performance Improvements
-
-### Memory Usage
- **Before:** 65GB+ (crashes with 128GB RAM)
- **After:** <16GB (fits in 32GB RAM)
- **Improvement:** 75% reduction
-
-### Loss Interpretability
- **Before:** `1d=$3386828032.00` (meaningless)
- **After:** `1d=$150.50` (RMSE in dollars)
- **Improvement:** Actionable metrics
-
-### GPU Utilization
- **Current:** Low (batch_size=1, no DataLoader)
- **Recommended:** Increase batch_size to 4-8, add DataLoader workers
- **Potential:** 3-5x faster training
-
-### Training Automation
- **Before:** Manual annotation only
- **After:** Automatic training on L2 pivots
- **Benefit:** Continuous learning without manual work
-
---
-
-## Next Steps (Optional Enhancements)
-
-### High Priority
-1. ⚠️ Increase batch size from 1 to 4-8 (better GPU utilization)
-2. ⚠️ Implement DataLoader with workers (parallel data loading)
-3. ⚠️ Add memory profiling/monitoring
-
-### Medium Priority
-4. ⚠️ Adaptive pivot spacing based on volatility
-5. ⚠️ Multi-level pivot training (L1, L2, L3)
-6. ⚠️ Outcome tracking for pivot-based trades
-
-### Low Priority
-7. ⚠️ Configuration UI for live pivot training
-8. ⚠️ Multi-symbol pivot monitoring
-9. ⚠️ Quality filtering for pivots
-
---
-
-## Summary
-
-All critical issues have been resolved:
- ✅ Memory leak fixed (can now train with 128GB RAM)
- ✅ Loss values realistic (RMSE in dollars)
- ✅ AMD GPU support added
- ✅ Database errors fixed
- ✅ Live pivot training implemented
-
-**System is now production-ready for continuous learning!**
+5. **Extend DuckDB Schema**
+   - Add MA indicators to ohlcv_data
+   - Create pivot_points table
+   - Store technical indicators