# Enhanced Reward System for Reinforcement Learning Training

## Overview

This document describes the implementation of an enhanced reward system for your reinforcement learning trading models. The system uses **mean squared error (MSE) between predictions and empirical outcomes** as the primary reward mechanism, with support for multiple timeframes and comprehensive accuracy tracking.

## Key Features

### ✅ MSE-Based Reward Calculation
- Uses mean squared difference between predicted and actual prices
- Exponential decay function heavily penalizes large prediction errors
- Direction accuracy bonus/penalty system
- Confidence-weighted final rewards

### ✅ Multi-Timeframe Support
- Separate tracking for **1s, 1m, 1h, 1d** timeframes
- Independent accuracy metrics for each timeframe
- Timeframe-specific evaluation timeouts
- Models know which timeframe they're predicting on

### ✅ Prediction History Tracking
- Maintains last **6 predictions per timeframe** per symbol
- Comprehensive prediction records with outcomes
- Historical accuracy analysis
- Memory-efficient with automatic cleanup

### ✅ Real-Time Training
- Training triggered at each inference when outcomes are available
- Separate training batches for each model and timeframe
- Automatic evaluation of predictions after appropriate timeouts
- Integration with existing RL training infrastructure

### ✅ Enhanced Inference Scheduling
- **Continuous inference** every 1-5 seconds on the primary timeframe
- **Hourly multi-timeframe inference** (4 predictions per hour, one for each timeframe)
- Timeframe-aware inference context
- Proper scheduling and coordination

## Architecture

```mermaid
graph TD
    A[Market Data] --> B[Timeframe Inference Coordinator]
    B --> C[Model Inference]
    C --> D[Enhanced Reward Calculator]
    D --> E[Prediction Tracking]
    E --> F[Outcome Evaluation]
    F --> G[MSE Reward Calculation]
    G --> H[Enhanced RL Training Adapter]
    H --> I[Model Training]
    I --> J[Performance Monitoring]
```

## Core Components

### 1. EnhancedRewardCalculator (`core/enhanced_reward_calculator.py`)

**Purpose**: Central reward calculation engine using MSE methodology

**Key Methods**:
- `add_prediction()` - Track new predictions
- `evaluate_predictions()` - Calculate rewards when outcomes are available
- `get_accuracy_summary()` - Comprehensive accuracy metrics
- `get_training_data()` - Extract training samples for models

**Reward Formula**:
```python
# MSE calculation
price_error = actual_price - predicted_price
mse = price_error ** 2

# Normalize to a reasonable scale
max_mse = (current_price * 0.1) ** 2  # 10% as max expected error
normalized_mse = min(mse / max_mse, 1.0)

# Exponential decay (heavily penalize large errors)
mse_reward = exp(-5 * normalized_mse)  # Range: [exp(-5), 1]

# Direction bonus/penalty
direction_bonus = 0.5 if direction_correct else -0.5

# Final reward (confidence weighted)
final_reward = (mse_reward + direction_bonus) * confidence
```
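For reference, the formula above reads as one small function. The sketch below is a self-contained restatement of that snippet; the function name and signature are illustrative and not part of the `EnhancedRewardCalculator` API.

```python
import math

def mse_based_reward(predicted_price: float, actual_price: float,
                     current_price: float, direction_correct: bool,
                     confidence: float) -> float:
    """Self-contained restatement of the reward formula above."""
    # Squared price error between prediction and outcome
    mse = (actual_price - predicted_price) ** 2

    # Normalize against a 10% move, the assumed maximum expected error
    max_mse = (current_price * 0.1) ** 2
    normalized_mse = min(mse / max_mse, 1.0)

    # Exponential decay heavily penalizes large errors
    mse_reward = math.exp(-5 * normalized_mse)  # range [exp(-5), 1]

    # Direction bonus/penalty
    direction_bonus = 0.5 if direction_correct else -0.5

    # Confidence-weighted final reward
    return (mse_reward + direction_bonus) * confidence

# A near-perfect, direction-correct prediction at 85% confidence
print(mse_based_reward(3150.5, 3150.8, 3150.0, True, 0.85))
```

A perfect, direction-correct prediction at full confidence yields 1.5; a maximally wrong, direction-incorrect prediction approaches -0.5 scaled by confidence.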
### 2. TimeframeInferenceCoordinator (`core/timeframe_inference_coordinator.py`)

**Purpose**: Coordinates timeframe-aware model inference with proper scheduling

**Key Features**:
- **Continuous inference loop** for each symbol (every 5 seconds)
- **Hourly multi-timeframe scheduler** (4 predictions per hour)
- **Inference context management** (models know the target timeframe)
- **Automatic reward evaluation** and training triggers

**Scheduling**:
- **Every 5 seconds**: Inference on the primary timeframe (1s)
- **Every hour**: One inference for each timeframe (1s, 1m, 1h, 1d)
- **Evaluation timeouts**: 5s for 1s predictions, 60s for 1m, 300s for 1h, 900s for 1d
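To make the interleaving of the continuous and hourly schedules concrete, here is a stripped-down asyncio sketch. The loop structure and the `run_inference` callback are illustrative assumptions, not the coordinator's actual API; the real `TimeframeInferenceCoordinator` also manages inference context, reward evaluation, and training triggers.

```python
import asyncio

async def continuous_loop(symbol: str, run_inference, interval_s: float = 5.0):
    """Run inference on the primary (1s) timeframe every few seconds."""
    while True:
        await run_inference(symbol, '1s')
        await asyncio.sleep(interval_s)

async def hourly_multi_timeframe_loop(symbol: str, run_inference):
    """Once per hour, run one inference for each supported timeframe."""
    while True:
        for timeframe in ('1s', '1m', '1h', '1d'):
            await run_inference(symbol, timeframe)
        await asyncio.sleep(3600)

async def main():
    async def run_inference(symbol, timeframe):
        # Stand-in for a model call; the coordinator would also register
        # the prediction with the reward calculator at this point.
        print(f"inference: {symbol} @ {timeframe}")

    # Both loops run concurrently for the same symbol
    await asyncio.gather(
        continuous_loop('ETH/USDT', run_inference),
        hourly_multi_timeframe_loop('ETH/USDT', run_inference),
    )

if __name__ == '__main__':
    asyncio.run(main())
```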
### 3. EnhancedRLTrainingAdapter (`core/enhanced_rl_training_adapter.py`)

**Purpose**: Bridge between the new reward system and the existing RL training infrastructure

**Key Features**:
- **Model inference wrappers** for DQN, COB RL, and CNN models
- **Training batch creation** from prediction records and rewards
- **Real-time training triggers** based on evaluation results
- **Backward compatibility** with existing training systems

### 4. EnhancedRewardSystemIntegration (`core/enhanced_reward_system_integration.py`)

**Purpose**: Simple integration point for existing systems

**Key Features**:
- **One-line integration** with the existing TradingOrchestrator
- **Helper functions** for easy prediction tracking
- **Comprehensive monitoring** and statistics
- **Minimal code changes** required

## Integration Guide

### Step 1: Import Required Components

Add to your `orchestrator.py`:

```python
from core.enhanced_reward_system_integration import (
    integrate_enhanced_rewards,
    add_prediction_to_enhanced_rewards
)
```

### Step 2: Initialize in TradingOrchestrator

In your `TradingOrchestrator.__init__()`:

```python
# Add this line after existing initialization
integrate_enhanced_rewards(self, symbols=['ETH/USDT', 'BTC/USDT'])
```

### Step 3: Start the System

In your `TradingOrchestrator.run()` method:

```python
# Add this line after initialization
await self.enhanced_reward_system.start_integration()
```

### Step 4: Track Predictions

In your model inference methods (CNN, DQN, COB RL):

```python
# Example in CNN inference
prediction_id = add_prediction_to_enhanced_rewards(
    self,             # orchestrator instance
    symbol,           # 'ETH/USDT'
    timeframe,        # '1s', '1m', '1h', '1d'
    predicted_price,  # model's price prediction
    direction,        # -1 (down), 0 (neutral), 1 (up)
    confidence,       # 0.0 to 1.0
    current_price,    # current market price
    'enhanced_cnn'    # model name
)
```

### Step 5: Monitor Performance

```python
# Get comprehensive statistics
stats = self.enhanced_reward_system.get_integration_statistics()
accuracy = self.enhanced_reward_system.get_model_accuracy()

# Force evaluation for testing
self.enhanced_reward_system.force_evaluation_and_training('ETH/USDT', '1s')
```

## Usage Example

See `examples/enhanced_reward_system_example.py` for a complete demonstration.

```bash
python examples/enhanced_reward_system_example.py
```

## Performance Benefits

### 🎯 Better Accuracy Measurement
- **MSE rewards** provide nuanced feedback vs. simple directional accuracy
- **Price prediction accuracy** measured alongside direction accuracy
- **Confidence-weighted rewards** encourage well-calibrated predictions

### 📊 Multi-Timeframe Intelligence
- **Separate tracking** prevents timeframe confusion
- **Timeframe-specific evaluation** accounts for different market dynamics
- **Comprehensive accuracy picture** across all prediction horizons

### ⚡ Real-Time Learning
- **Immediate training** when prediction outcomes are available
- **No batch delays** - models learn from every prediction
- **Adaptive training frequency** based on prediction evaluation

### 🔄 Enhanced Inference Scheduling
- **Optimal prediction frequency** balances real-time response with computational efficiency
- **Hourly multi-timeframe predictions** provide comprehensive market coverage
- **Context-aware models** make better predictions knowing their target timeframe

## Configuration

### Evaluation Timeouts (Configurable in EnhancedRewardCalculator)

```python
evaluation_timeouts = {
    TimeFrame.SECONDS_1: 5,    # Evaluate 1s predictions after 5 seconds
    TimeFrame.MINUTES_1: 60,   # Evaluate 1m predictions after 1 minute
    TimeFrame.HOURS_1: 300,    # Evaluate 1h predictions after 5 minutes
    TimeFrame.DAYS_1: 900      # Evaluate 1d predictions after 15 minutes
}
```

### Inference Scheduling (Configurable in TimeframeInferenceCoordinator)

```python
schedule = InferenceSchedule(
    continuous_interval_seconds=5.0,  # Continuous inference every 5 seconds
    hourly_timeframes=[TimeFrame.SECONDS_1, TimeFrame.MINUTES_1,
                       TimeFrame.HOURS_1, TimeFrame.DAYS_1]
)
```

### Training Configuration (Configurable in EnhancedRLTrainingAdapter)

```python
min_batch_size = 8               # Minimum samples for training
max_batch_size = 64              # Maximum samples per training batch
training_interval_seconds = 5.0  # Training check frequency
```

## Monitoring and Statistics

### Integration Statistics

```python
stats = enhanced_reward_system.get_integration_statistics()
```

Returns:
- System running status
- Total predictions tracked
- Component status
- Inference and training statistics
- Performance metrics

### Model Accuracy

```python
accuracy = enhanced_reward_system.get_model_accuracy()
```

Returns for each symbol and timeframe:
- Total predictions made
- Direction accuracy percentage
- Average MSE
- Recent prediction count

### Real-Time Monitoring

The system provides comprehensive logging at different levels:
- `INFO`: Major system events, training results
- `DEBUG`: Detailed prediction tracking, reward calculations
- `ERROR`: System errors and recovery actions

## Backward Compatibility

The enhanced reward system is designed to be **fully backward compatible**:

✅ **Existing models continue to work** without modification
✅ **Existing training systems** remain functional
✅ **Existing reward calculations** can run in parallel
✅ **Gradual migration** - enable for specific models incrementally

## Testing and Validation

### Force Evaluation for Testing

```python
# Force immediate evaluation of all predictions
enhanced_reward_system.force_evaluation_and_training()

# Force evaluation for a specific symbol/timeframe
enhanced_reward_system.force_evaluation_and_training('ETH/USDT', '1s')
```

### Manual Prediction Addition

```python
# Add predictions manually for testing
prediction_id = enhanced_reward_system.add_prediction_manually(
    symbol='ETH/USDT',
    timeframe_str='1s',
    predicted_price=3150.50,
    predicted_direction=1,
    confidence=0.85,
    current_price=3150.00,
    model_name='test_model'
)
```
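Combining those two hooks, a quick smoke test might look like the snippet below. It assumes an orchestrator that was wired up via `integrate_enhanced_rewards()` as described in the Integration Guide; no API beyond what is shown above is used.

```python
# Quick smoke test: add a synthetic prediction, force its evaluation,
# then inspect the resulting accuracy snapshot.
enhanced = orchestrator.enhanced_reward_system  # attached by integrate_enhanced_rewards()

prediction_id = enhanced.add_prediction_manually(
    symbol='ETH/USDT',
    timeframe_str='1s',
    predicted_price=3150.50,
    predicted_direction=1,
    confidence=0.85,
    current_price=3150.00,
    model_name='test_model'
)

# Evaluate immediately instead of waiting for the 5s timeout
enhanced.force_evaluation_and_training('ETH/USDT', '1s')

# The snapshot should now reflect the test prediction for ETH/USDT on 1s
print(enhanced.get_model_accuracy())
```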
## Memory Management

The system includes automatic memory management:

- **Automatic prediction cleanup** (configurable retention period)
- **Circular buffers** for prediction history (max 100 per timeframe)
- **Price cache management** (max 1000 price points per symbol)
- **Efficient storage** using deques and compressed data structures

## Future Enhancements

The architecture supports easy extension for:

1. **Additional timeframes** (30s, 5m, 15m, etc.)
2. **Custom reward functions** (Sharpe ratio, maximum drawdown, etc.)
3. **Multi-symbol correlation** rewards
4. **Advanced statistical metrics** (Sortino ratio, Calmar ratio)
5. **Model ensemble** reward aggregation
6. **A/B testing** framework for reward functions

## Conclusion

The Enhanced Reward System provides a comprehensive foundation for improving RL model training through:

- **Precise MSE-based rewards** that accurately measure prediction quality
- **Multi-timeframe intelligence** that prevents confusion between different prediction horizons
- **Real-time learning** that maximizes training opportunities
- **Easy integration** that requires minimal changes to existing code
- **Comprehensive monitoring** that provides insights into model performance

This system addresses the specific requirements you outlined:

✅ MSE-based accuracy calculation
✅ Training at each inference using the last prediction vs. the current outcome
✅ Separate accuracy tracking for up to 6 last predictions per timeframe
✅ Models know which timeframe they're predicting on
✅ Hourly multi-timeframe inference (4 predictions per hour)
✅ Integration with the existing 1-5 second inference frequency