enhanced training and reward - wip
docs/ENHANCED_REWARD_SYSTEM.md (new file, 349 lines)
# Enhanced Reward System for Reinforcement Learning Training

## Overview

This document describes the implementation of an enhanced reward system for your reinforcement learning trading models. The system uses **mean squared error (MSE) between predictions and empirical outcomes** as the primary reward mechanism, with support for multiple timeframes and comprehensive accuracy tracking.
## Key Features

### ✅ MSE-Based Reward Calculation
- Uses mean squared difference between predicted and actual prices
- Exponential decay function heavily penalizes large prediction errors
- Direction accuracy bonus/penalty system
- Confidence-weighted final rewards

### ✅ Multi-Timeframe Support
- Separate tracking for **1s, 1m, 1h, 1d** timeframes
- Independent accuracy metrics for each timeframe
- Timeframe-specific evaluation timeouts
- Models know which timeframe they're predicting on

### ✅ Prediction History Tracking
- Maintains last **6 predictions per timeframe** per symbol
- Comprehensive prediction records with outcomes
- Historical accuracy analysis
- Memory-efficient with automatic cleanup

### ✅ Real-Time Training
- Training triggered at each inference when outcomes are available
- Separate training batches for each model and timeframe
- Automatic evaluation of predictions after appropriate timeouts
- Integration with existing RL training infrastructure

### ✅ Enhanced Inference Scheduling
- **Continuous inference** every 1-5 seconds on the primary timeframe
- **Hourly multi-timeframe inference** (4 predictions per hour, one for each timeframe)
- Timeframe-aware inference context
- Proper scheduling and coordination
## Architecture

```mermaid
graph TD
    A[Market Data] --> B[Timeframe Inference Coordinator]
    B --> C[Model Inference]
    C --> D[Enhanced Reward Calculator]
    D --> E[Prediction Tracking]
    E --> F[Outcome Evaluation]
    F --> G[MSE Reward Calculation]
    G --> H[Enhanced RL Training Adapter]
    H --> I[Model Training]
    I --> J[Performance Monitoring]
```

## Core Components
### 1. EnhancedRewardCalculator (`core/enhanced_reward_calculator.py`)

**Purpose**: Central reward calculation engine using MSE methodology

**Key Methods**:
- `add_prediction()` - Track new predictions
- `evaluate_predictions()` - Calculate rewards when outcomes available
- `get_accuracy_summary()` - Comprehensive accuracy metrics
- `get_training_data()` - Extract training samples for models
**Reward Formula**:

```python
from math import exp

# MSE calculation
price_error = actual_price - predicted_price
mse = price_error ** 2

# Normalize to a reasonable scale
max_mse = (current_price * 0.1) ** 2  # 10% as max expected error
normalized_mse = min(mse / max_mse, 1.0)

# Exponential decay (heavily penalize large errors)
mse_reward = exp(-5 * normalized_mse)  # Range: [exp(-5), 1]

# Direction bonus/penalty
direction_bonus = 0.5 if direction_correct else -0.5

# Final reward (confidence weighted)
final_reward = (mse_reward + direction_bonus) * confidence
```
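For illustration, here is a self-contained version of the same formula with two worked inputs showing how sharply the exponential term punishes large misses. The function name and signature are illustrative, not the calculator's actual API:

```python
from math import exp

def mse_reward(predicted_price: float, actual_price: float, current_price: float,
               direction_correct: bool, confidence: float) -> float:
    """Illustrative helper implementing the reward formula above."""
    mse = (actual_price - predicted_price) ** 2
    max_mse = (current_price * 0.1) ** 2           # a 10% move treated as max expected error
    normalized_mse = min(mse / max_mse, 1.0)
    reward = exp(-5 * normalized_mse)              # MSE term in [exp(-5), 1]
    reward += 0.5 if direction_correct else -0.5   # direction bonus/penalty
    return reward * confidence                     # confidence weighting

# A near-perfect prediction is rewarded close to 1.5 * confidence...
print(mse_reward(3150.5, 3150.6, 3150.0, True, 0.85))    # ~1.27
# ...while a large miss in the wrong direction goes negative.
print(mse_reward(3150.5, 2900.0, 3150.0, False, 0.85))   # ~-0.39
```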
### 2. TimeframeInferenceCoordinator (`core/timeframe_inference_coordinator.py`)

**Purpose**: Coordinates timeframe-aware model inference with proper scheduling

**Key Features**:
- **Continuous inference loop** for each symbol (every 5 seconds)
- **Hourly multi-timeframe scheduler** (4 predictions per hour)
- **Inference context management** (models know their target timeframe)
- **Automatic reward evaluation** and training triggers

**Scheduling** (see the sketch after this list):
- **Every 5 seconds**: Inference on the primary timeframe (1s)
- **Every hour**: One inference for each timeframe (1s, 1m, 1h, 1d)
- **Evaluation timeouts**: 5s for 1s predictions, 60s for 1m, 300s for 1h, 900s for 1d
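A minimal sketch of how these two cadences could be driven with `asyncio`; the loop structure and the `run_inference` callback are assumptions for illustration, not the coordinator's actual methods:

```python
import asyncio

TIMEFRAMES = ["1s", "1m", "1h", "1d"]

async def continuous_loop(symbol, run_inference, interval_seconds=5.0):
    """Inference on the primary timeframe every few seconds."""
    while True:
        await run_inference(symbol, timeframe="1s")
        await asyncio.sleep(interval_seconds)

async def hourly_loop(symbol, run_inference):
    """Once per hour, one inference for each tracked timeframe."""
    while True:
        for timeframe in TIMEFRAMES:
            await run_inference(symbol, timeframe=timeframe)
        await asyncio.sleep(3600)

async def coordinate(symbol, run_inference):
    """Run both schedules concurrently for a symbol."""
    await asyncio.gather(
        continuous_loop(symbol, run_inference),
        hourly_loop(symbol, run_inference),
    )
```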
### 3. EnhancedRLTrainingAdapter (`core/enhanced_rl_training_adapter.py`)

**Purpose**: Bridge between the new reward system and the existing RL training infrastructure

**Key Features**:
- **Model inference wrappers** for DQN, COB RL, and CNN models
- **Training batch creation** from prediction records and rewards (see the sketch after this list)
- **Real-time training triggers** based on evaluation results
- **Backward compatibility** with existing training systems
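The batching idea can be sketched as grouping evaluated predictions per model and timeframe and training only once enough samples exist; the record fields and helper below are assumptions, not the adapter's actual classes:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List

@dataclass
class EvaluatedPrediction:
    model_name: str      # e.g. 'dqn', 'cob_rl', 'enhanced_cnn'
    timeframe: str       # '1s', '1m', '1h', '1d'
    state: List[float]   # features captured at inference time
    action: int          # -1 (down), 0 (neutral), 1 (up)
    reward: float        # MSE-based reward from the calculator

def build_training_batches(records, min_batch_size=8, max_batch_size=64):
    """Group evaluated predictions by (model, timeframe) and emit batches."""
    grouped = defaultdict(list)
    for record in records:
        grouped[(record.model_name, record.timeframe)].append(record)

    batches = {}
    for key, samples in grouped.items():
        if len(samples) >= min_batch_size:
            batches[key] = samples[-max_batch_size:]  # keep only the most recent samples
    return batches
```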
### 4. EnhancedRewardSystemIntegration (`core/enhanced_reward_system_integration.py`)

**Purpose**: Simple integration point for existing systems

**Key Features**:
- **One-line integration** with the existing TradingOrchestrator
- **Helper functions** for easy prediction tracking
- **Comprehensive monitoring** and statistics
- **Minimal code changes** required
## Integration Guide

### Step 1: Import Required Components

Add to your `orchestrator.py`:

```python
from core.enhanced_reward_system_integration import (
    integrate_enhanced_rewards,
    add_prediction_to_enhanced_rewards
)
```

### Step 2: Initialize in TradingOrchestrator

In your `TradingOrchestrator.__init__()`:

```python
# Add this line after existing initialization
integrate_enhanced_rewards(self, symbols=['ETH/USDT', 'BTC/USDT'])
```

### Step 3: Start the System

In your `TradingOrchestrator.run()` method:

```python
# Add this line after initialization
await self.enhanced_reward_system.start_integration()
```

### Step 4: Track Predictions

In your model inference methods (CNN, DQN, COB RL):

```python
# Example in CNN inference
prediction_id = add_prediction_to_enhanced_rewards(
    self,             # orchestrator instance
    symbol,           # 'ETH/USDT'
    timeframe,        # '1s', '1m', '1h', '1d'
    predicted_price,  # model's price prediction
    direction,        # -1 (down), 0 (neutral), 1 (up)
    confidence,       # 0.0 to 1.0
    current_price,    # current market price
    'enhanced_cnn'    # model name
)
```

### Step 5: Monitor Performance

```python
# Get comprehensive statistics
stats = self.enhanced_reward_system.get_integration_statistics()
accuracy = self.enhanced_reward_system.get_model_accuracy()

# Force evaluation for testing
self.enhanced_reward_system.force_evaluation_and_training('ETH/USDT', '1s')
```
## Usage Example

See `examples/enhanced_reward_system_example.py` for a complete demonstration.

```bash
python examples/enhanced_reward_system_example.py
```
## Performance Benefits

### 🎯 Better Accuracy Measurement
- **MSE rewards** provide more nuanced feedback than simple directional accuracy
- **Price prediction accuracy** measured alongside direction accuracy
- **Confidence-weighted rewards** encourage well-calibrated predictions

### 📊 Multi-Timeframe Intelligence
- **Separate tracking** prevents timeframe confusion
- **Timeframe-specific evaluation** accounts for different market dynamics
- **Comprehensive accuracy picture** across all prediction horizons

### ⚡ Real-Time Learning
- **Immediate training** when prediction outcomes become available
- **No batch delays** - models learn from every prediction
- **Adaptive training frequency** based on prediction evaluation

### 🔄 Enhanced Inference Scheduling
- **Optimal prediction frequency** balances real-time response with computational efficiency
- **Hourly multi-timeframe predictions** provide comprehensive market coverage
- **Context-aware models** make better predictions knowing their target timeframe
## Configuration

### Evaluation Timeouts (Configurable in EnhancedRewardCalculator)

```python
evaluation_timeouts = {
    TimeFrame.SECONDS_1: 5,    # Evaluate 1s predictions after 5 seconds
    TimeFrame.MINUTES_1: 60,   # Evaluate 1m predictions after 1 minute
    TimeFrame.HOURS_1: 300,    # Evaluate 1h predictions after 5 minutes
    TimeFrame.DAYS_1: 900      # Evaluate 1d predictions after 15 minutes
}
```

### Inference Scheduling (Configurable in TimeframeInferenceCoordinator)

```python
schedule = InferenceSchedule(
    continuous_interval_seconds=5.0,  # Continuous inference every 5 seconds
    hourly_timeframes=[TimeFrame.SECONDS_1, TimeFrame.MINUTES_1,
                       TimeFrame.HOURS_1, TimeFrame.DAYS_1]
)
```

### Training Configuration (Configurable in EnhancedRLTrainingAdapter)

```python
min_batch_size = 8               # Minimum samples for training
max_batch_size = 64              # Maximum samples per training batch
training_interval_seconds = 5.0  # Training check frequency
```
## Monitoring and Statistics

### Integration Statistics

```python
stats = enhanced_reward_system.get_integration_statistics()
```

Returns:
- System running status
- Total predictions tracked
- Component status
- Inference and training statistics
- Performance metrics
### Model Accuracy

```python
accuracy = enhanced_reward_system.get_model_accuracy()
```

Returns, for each symbol and timeframe (see the usage sketch below):
- Total predictions made
- Direction accuracy percentage
- Average MSE
- Recent prediction count
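A minimal consumption sketch, assuming the summary is a nested mapping keyed by symbol and timeframe; the field names below are illustrative, not the actual schema:

```python
accuracy = enhanced_reward_system.get_model_accuracy()

# Assumed shape: {symbol: {timeframe: {metric_name: value}}}
for symbol, per_timeframe in accuracy.items():
    for timeframe, stats in per_timeframe.items():
        print(f"{symbol} {timeframe}: "
              f"{stats['total_predictions']} predictions, "
              f"{stats['direction_accuracy']:.1%} direction accuracy, "
              f"avg MSE {stats['avg_mse']:.6f}")
```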
### Real-Time Monitoring

The system provides comprehensive logging at different levels (see the snippet after this list):
- `INFO`: Major system events, training results
- `DEBUG`: Detailed prediction tracking, reward calculations
- `ERROR`: System errors and recovery actions
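During development, the `DEBUG`-level detail can be surfaced with the standard `logging` module, for example:

```python
import logging

# Show detailed prediction tracking and reward calculations during development.
logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s %(levelname)s %(name)s: %(message)s")
```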
## Backward Compatibility

The enhanced reward system is designed to be **fully backward compatible**:

✅ **Existing models continue to work** without modification
✅ **Existing training systems** remain functional
✅ **Existing reward calculations** can run in parallel
✅ **Gradual migration** - enable for specific models incrementally
## Testing and Validation

### Force Evaluation for Testing

```python
# Force immediate evaluation of all predictions
enhanced_reward_system.force_evaluation_and_training()

# Force evaluation for a specific symbol/timeframe
enhanced_reward_system.force_evaluation_and_training('ETH/USDT', '1s')
```

### Manual Prediction Addition

```python
# Add predictions manually for testing
prediction_id = enhanced_reward_system.add_prediction_manually(
    symbol='ETH/USDT',
    timeframe_str='1s',
    predicted_price=3150.50,
    predicted_direction=1,
    confidence=0.85,
    current_price=3150.00,
    model_name='test_model'
)
```
## Memory Management

The system includes automatic memory management (see the sketch after this list):

- **Automatic prediction cleanup** (configurable retention period)
- **Circular buffers** for prediction history (max 100 per timeframe)
- **Price cache management** (max 1000 price points per symbol)
- **Efficient storage** using deques and compressed data structures
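A minimal sketch of the bounded-buffer approach using `collections.deque` with the caps quoted above; the container layout is an assumption for illustration:

```python
from collections import defaultdict, deque

MAX_PREDICTIONS_PER_TIMEFRAME = 100  # prediction records kept per timeframe
MAX_PRICE_POINTS_PER_SYMBOL = 1000   # cached price points kept per symbol

# prediction_history[symbol][timeframe] -> bounded deque of prediction records
prediction_history = defaultdict(
    lambda: defaultdict(lambda: deque(maxlen=MAX_PREDICTIONS_PER_TIMEFRAME))
)

# price_cache[symbol] -> bounded deque of (timestamp, price) tuples
price_cache = defaultdict(lambda: deque(maxlen=MAX_PRICE_POINTS_PER_SYMBOL))

# Appending beyond maxlen silently drops the oldest entry, keeping memory bounded.
prediction_history['ETH/USDT']['1s'].append({'predicted_price': 3150.5, 'confidence': 0.85})
price_cache['ETH/USDT'].append(('2024-01-01T00:00:00Z', 3150.0))
```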
## Future Enhancements

The architecture supports easy extension for:

1. **Additional timeframes** (30s, 5m, 15m, etc.)
2. **Custom reward functions** (Sharpe ratio, maximum drawdown, etc.)
3. **Multi-symbol correlation** rewards
4. **Advanced statistical metrics** (Sortino ratio, Calmar ratio)
5. **Model ensemble** reward aggregation
6. **A/B testing** framework for reward functions
## Conclusion

The Enhanced Reward System provides a comprehensive foundation for improving RL model training through:

- **Precise MSE-based rewards** that accurately measure prediction quality
- **Multi-timeframe intelligence** that prevents confusion between different prediction horizons
- **Real-time learning** that maximizes training opportunities
- **Easy integration** that requires minimal changes to existing code
- **Comprehensive monitoring** that provides insights into model performance

This system addresses the specific requirements you outlined:

✅ MSE-based accuracy calculation
✅ Training at each inference using the last prediction vs. the current outcome
✅ Separate accuracy tracking for up to the last 6 predictions per timeframe
✅ Models know which timeframe they're predicting on
✅ Hourly multi-timeframe inference (4 predictions per hour)
✅ Integration with the existing 1-5 second inference frequency