# Enhanced Reward System for Reinforcement Learning Training

## Overview

This document describes the implementation of an enhanced reward system for your reinforcement learning trading models. The system uses mean squared error (MSE) between predictions and empirical outcomes as the primary reward mechanism, with support for multiple timeframes and comprehensive accuracy tracking.
## Key Features

### ✅ MSE-Based Reward Calculation
- Uses mean squared difference between predicted and actual prices
- Exponential decay function heavily penalizes large prediction errors
- Direction accuracy bonus/penalty system
- Confidence-weighted final rewards

### ✅ Multi-Timeframe Support
- Separate tracking for 1s, 1m, 1h, 1d timeframes
- Independent accuracy metrics for each timeframe
- Timeframe-specific evaluation timeouts
- Models know which timeframe they're predicting on

### ✅ Prediction History Tracking
- Maintains last 6 predictions per timeframe per symbol
- Comprehensive prediction records with outcomes
- Historical accuracy analysis
- Memory-efficient with automatic cleanup

### ✅ Real-Time Training
- Training triggered at each inference when outcomes are available
- Separate training batches for each model and timeframe
- Automatic evaluation of predictions after appropriate timeouts
- Integration with existing RL training infrastructure

### ✅ Enhanced Inference Scheduling
- Continuous inference every 1-5 seconds on primary timeframe
- Hourly multi-timeframe inference (4 predictions per hour - one for each timeframe)
- Timeframe-aware inference context
- Proper scheduling and coordination

## Architecture

```mermaid
graph TD
    A[Market Data] --> B[Timeframe Inference Coordinator]
    B --> C[Model Inference]
    C --> D[Enhanced Reward Calculator]
    D --> E[Prediction Tracking]
    E --> F[Outcome Evaluation]
    F --> G[MSE Reward Calculation]
    G --> H[Enhanced RL Training Adapter]
    H --> I[Model Training]
    I --> J[Performance Monitoring]
```

## Core Components

### 1. EnhancedRewardCalculator (`core/enhanced_reward_calculator.py`)

Purpose: Central reward calculation engine using MSE methodology
Key Methods:

- `add_prediction()` - Track new predictions
- `evaluate_predictions()` - Calculate rewards when outcomes are available
- `get_accuracy_summary()` - Comprehensive accuracy metrics
- `get_training_data()` - Extract training samples for models
Reward Formula:

```python
from math import exp

# Inputs: predicted_price, actual_price, current_price (floats),
# direction_correct (bool), confidence (float in [0, 1])

# MSE calculation
price_error = actual_price - predicted_price
mse = price_error ** 2

# Normalize to a reasonable scale
max_mse = (current_price * 0.1) ** 2  # 10% as max expected error
normalized_mse = min(mse / max_mse, 1.0)

# Exponential decay (heavily penalize large errors)
mse_reward = exp(-5 * normalized_mse)  # Range: [exp(-5), 1]

# Direction bonus/penalty
direction_bonus = 0.5 if direction_correct else -0.5

# Final reward (confidence weighted)
final_reward = (mse_reward + direction_bonus) * confidence
```
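
As a quick sanity check, plugging illustrative numbers into the formula shows how the reward behaves (the helper function and values below are examples for this document, not part of the module):

```python
from math import exp

def example_reward(predicted_price, actual_price, current_price,
                   direction_correct, confidence):
    """Mirror of the reward formula above, for a worked example only."""
    normalized_mse = min(((actual_price - predicted_price) ** 2)
                         / ((current_price * 0.1) ** 2), 1.0)
    mse_reward = exp(-5 * normalized_mse)
    direction_bonus = 0.5 if direction_correct else -0.5
    return (mse_reward + direction_bonus) * confidence

# Near-perfect price with correct direction: reward approaches 1.5 * confidence
print(example_reward(3150.5, 3151.0, 3150.0, True, 0.85))   # ~1.27

# Miss larger than 10% of price with wrong direction: strongly negative reward
print(example_reward(3150.5, 2800.0, 3150.0, False, 0.85))  # ~-0.42
```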

### 2. TimeframeInferenceCoordinator (`core/timeframe_inference_coordinator.py`)

Purpose: Coordinates timeframe-aware model inference with proper scheduling
Key Features:
- Continuous inference loop for each symbol (every 5 seconds)
- Hourly multi-timeframe scheduler (4 predictions per hour)
- Inference context management (models know target timeframe)
- Automatic reward evaluation and training triggers
Scheduling:
- Every 5 seconds: Inference on primary timeframe (1s)
- Every hour: One inference for each timeframe (1s, 1m, 1h, 1d)
- Evaluation timeouts: 5s for 1s predictions, 60s for 1m, 300s for 1h, 900s for 1d
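
The coordinator's implementation is not reproduced here, but a minimal sketch of the loop described above might look like the following, assuming an asyncio-based design (names, structure, and intervals are illustrative):

```python
import asyncio
import time

# Hypothetical sketch of the scheduling described above; the real
# TimeframeInferenceCoordinator may be structured differently.
async def inference_loop(symbol, run_inference,
                         timeframes=("1s", "1m", "1h", "1d"),
                         continuous_interval=5.0):
    next_hourly = time.monotonic() + 3600
    while True:
        # Continuous inference on the primary timeframe (1s)
        await run_inference(symbol, timeframes[0])

        # Once per hour, run one inference for every timeframe
        if time.monotonic() >= next_hourly:
            for tf in timeframes:
                await run_inference(symbol, tf)
            next_hourly += 3600

        await asyncio.sleep(continuous_interval)
```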

### 3. EnhancedRLTrainingAdapter (`core/enhanced_rl_training_adapter.py`)

Purpose: Bridge between new reward system and existing RL training infrastructure
Key Features:
- Model inference wrappers for DQN, COB RL, and CNN models
- Training batch creation from prediction records and rewards
- Real-time training triggers based on evaluation results
- Backward compatibility with existing training systems
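
A rough sketch of the batching idea, using the minimum and maximum batch sizes from the configuration section below; the record fields and function names are assumptions for illustration, not the adapter's actual API:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class EvaluatedPrediction:
    # Hypothetical record shape; real prediction records may differ
    features: List[float]      # model input snapshot taken at inference time
    predicted_direction: int   # -1 (down), 0 (neutral), 1 (up)
    reward: float              # MSE-based reward from the calculator

def build_training_batch(records: List[EvaluatedPrediction],
                         min_batch_size: int = 8,
                         max_batch_size: int = 64
                         ) -> Optional[Tuple[list, list, list]]:
    """Return (states, actions, rewards), or None if not enough samples yet."""
    if len(records) < min_batch_size:
        return None
    batch = records[-max_batch_size:]
    states = [r.features for r in batch]
    actions = [r.predicted_direction for r in batch]
    rewards = [r.reward for r in batch]
    return states, actions, rewards
```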

### 4. EnhancedRewardSystemIntegration (`core/enhanced_reward_system_integration.py`)

Purpose: Simple integration point for existing systems
Key Features:
- One-line integration with existing TradingOrchestrator
- Helper functions for easy prediction tracking
- Comprehensive monitoring and statistics
- Minimal code changes required

## Integration Guide

### Step 1: Import Required Components
Add to your `orchestrator.py`:

```python
from core.enhanced_reward_system_integration import (
    integrate_enhanced_rewards,
    add_prediction_to_enhanced_rewards
)
```

### Step 2: Initialize in TradingOrchestrator

In your `TradingOrchestrator.__init__()`:

```python
# Add this line after existing initialization
integrate_enhanced_rewards(self, symbols=['ETH/USDT', 'BTC/USDT'])
```

### Step 3: Start the System

In your `TradingOrchestrator.run()` method:

```python
# Add this line after initialization
await self.enhanced_reward_system.start_integration()
```

### Step 4: Track Predictions
In your model inference methods (CNN, DQN, COB RL):
```python
# Example in CNN inference
prediction_id = add_prediction_to_enhanced_rewards(
    self,             # orchestrator instance
    symbol,           # 'ETH/USDT'
    timeframe,        # '1s', '1m', '1h', '1d'
    predicted_price,  # model's price prediction
    direction,        # -1 (down), 0 (neutral), 1 (up)
    confidence,       # 0.0 to 1.0
    current_price,    # current market price
    'enhanced_cnn'    # model name
)
```

### Step 5: Monitor Performance

```python
# Get comprehensive statistics
stats = self.enhanced_reward_system.get_integration_statistics()
accuracy = self.enhanced_reward_system.get_model_accuracy()

# Force evaluation for testing
self.enhanced_reward_system.force_evaluation_and_training('ETH/USDT', '1s')
```

## Usage Example

See `examples/enhanced_reward_system_example.py` for a complete demonstration:

```bash
python examples/enhanced_reward_system_example.py
```

## Performance Benefits

### 🎯 Better Accuracy Measurement
- MSE rewards provide nuanced feedback vs. simple directional accuracy
- Price prediction accuracy measured alongside direction accuracy
- Confidence-weighted rewards encourage well-calibrated predictions

### 📊 Multi-Timeframe Intelligence
- Separate tracking prevents timeframe confusion
- Timeframe-specific evaluation accounts for different market dynamics
- Comprehensive accuracy picture across all prediction horizons

### ⚡ Real-Time Learning
- Immediate training when prediction outcomes available
- No batch delays - models learn from every prediction
- Adaptive training frequency based on prediction evaluation

### 🔄 Enhanced Inference Scheduling
- Optimal prediction frequency balances real-time response with computational efficiency
- Hourly multi-timeframe predictions provide comprehensive market coverage
- Context-aware models make better predictions knowing their target timeframe

## Configuration

### Evaluation Timeouts (Configurable in EnhancedRewardCalculator)

```python
evaluation_timeouts = {
    TimeFrame.SECONDS_1: 5,    # Evaluate 1s predictions after 5 seconds
    TimeFrame.MINUTES_1: 60,   # Evaluate 1m predictions after 1 minute
    TimeFrame.HOURS_1: 300,    # Evaluate 1h predictions after 5 minutes
    TimeFrame.DAYS_1: 900      # Evaluate 1d predictions after 15 minutes
}
```
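
These timeouts simply gate when a prediction becomes eligible for evaluation. A minimal sketch of that check (a hypothetical helper, not the calculator's actual method) could look like:

```python
import time
from typing import Optional

# Hypothetical helper showing how the timeouts above could gate evaluation;
# the actual EnhancedRewardCalculator logic may differ.
def is_due_for_evaluation(prediction_timestamp: float,
                          timeout_seconds: int,
                          now: Optional[float] = None) -> bool:
    """Return True once the timeframe's timeout has elapsed since the prediction."""
    now = time.time() if now is None else now
    return (now - prediction_timestamp) >= timeout_seconds

# A 1h prediction made 4 minutes ago is not yet due (timeout is 300 seconds)
print(is_due_for_evaluation(time.time() - 240, 300))  # False
```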

### Inference Scheduling (Configurable in TimeframeInferenceCoordinator)

```python
schedule = InferenceSchedule(
    continuous_interval_seconds=5.0,  # Continuous inference every 5 seconds
    hourly_timeframes=[TimeFrame.SECONDS_1, TimeFrame.MINUTES_1,
                       TimeFrame.HOURS_1, TimeFrame.DAYS_1]
)
```

### Training Configuration (Configurable in EnhancedRLTrainingAdapter)

```python
min_batch_size = 8               # Minimum samples for training
max_batch_size = 64              # Maximum samples per training batch
training_interval_seconds = 5.0  # Training check frequency
```

## Monitoring and Statistics

### Integration Statistics

```python
stats = enhanced_reward_system.get_integration_statistics()
```
Returns:
- System running status
- Total predictions tracked
- Component status
- Inference and training statistics
- Performance metrics

### Model Accuracy

```python
accuracy = enhanced_reward_system.get_model_accuracy()
```
Returns for each symbol and timeframe:
- Total predictions made
- Direction accuracy percentage
- Average MSE
- Recent prediction count

### Real-Time Monitoring
The system provides comprehensive logging at different levels:
- `INFO`: Major system events, training results
- `DEBUG`: Detailed prediction tracking, reward calculations
- `ERROR`: System errors and recovery actions
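
A standard Python logging setup can surface the detailed reward calculations during development; the logger name below is an assumption, so adjust it to the project's actual module path:

```python
import logging

# Verbose output while debugging reward calculations; INFO is usually enough in production
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(name)s: %(message)s")

# Hypothetical logger name; use whatever module path the project actually logs under
logging.getLogger("core.enhanced_reward_calculator").setLevel(logging.DEBUG)
```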

## Backward Compatibility

The enhanced reward system is designed to be fully backward compatible:
- ✅ Existing models continue to work without modification
- ✅ Existing training systems remain functional
- ✅ Existing reward calculations can run in parallel
- ✅ Gradual migration - enable for specific models incrementally

## Testing and Validation

### Force Evaluation for Testing

```python
# Force immediate evaluation of all predictions
enhanced_reward_system.force_evaluation_and_training()

# Force evaluation for a specific symbol/timeframe
enhanced_reward_system.force_evaluation_and_training('ETH/USDT', '1s')
```

### Manual Prediction Addition

```python
# Add predictions manually for testing
prediction_id = enhanced_reward_system.add_prediction_manually(
    symbol='ETH/USDT',
    timeframe_str='1s',
    predicted_price=3150.50,
    predicted_direction=1,
    confidence=0.85,
    current_price=3150.00,
    model_name='test_model'
)
```

## Memory Management

The system includes automatic memory management:
- Automatic prediction cleanup (configurable retention period)
- Circular buffers for prediction history (max 100 per timeframe)
- Price cache management (max 1000 price points per symbol)
- Efficient storage using deques and compressed data structures
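
The storage classes themselves are not shown here, but the bounded buffers described above map naturally onto `collections.deque` with a `maxlen`; the container layout below is illustrative only:

```python
from collections import deque, defaultdict

# Illustrative containers mirroring the limits described above
MAX_PREDICTIONS_PER_TIMEFRAME = 100
MAX_PRICE_POINTS_PER_SYMBOL = 1000

# symbol -> timeframe -> bounded prediction history (oldest entries drop off automatically)
prediction_history = defaultdict(
    lambda: defaultdict(lambda: deque(maxlen=MAX_PREDICTIONS_PER_TIMEFRAME))
)

# symbol -> bounded price cache of (timestamp, price) tuples
price_cache = defaultdict(lambda: deque(maxlen=MAX_PRICE_POINTS_PER_SYMBOL))

prediction_history['ETH/USDT']['1s'].append({'predicted_price': 3150.5, 'confidence': 0.85})
price_cache['ETH/USDT'].append((1700000000.0, 3150.0))
```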

## Future Enhancements

The architecture supports easy extension for:
- Additional timeframes (30s, 5m, 15m, etc.)
- Custom reward functions (Sharpe ratio, maximum drawdown, etc.)
- Multi-symbol correlation rewards
- Advanced statistical metrics (Sortino ratio, Calmar ratio)
- Model ensemble reward aggregation
- A/B testing framework for reward functions

## Conclusion

The Enhanced Reward System provides a comprehensive foundation for improving RL model training through:
- Precise MSE-based rewards that accurately measure prediction quality
- Multi-timeframe intelligence that prevents confusion between different prediction horizons
- Real-time learning that maximizes training opportunities
- Easy integration that requires minimal changes to existing code
- Comprehensive monitoring that provides insights into model performance
This system addresses the specific requirements you outlined:

- ✅ MSE-based accuracy calculation
- ✅ Training at each inference using the last prediction vs. the current outcome
- ✅ Separate accuracy tracking for up to 6 last predictions per timeframe
- ✅ Models know which timeframe they're predicting on
- ✅ Hourly multi-timeframe inference (4 predictions per hour)
- ✅ Integration with the existing 1-5 second inference frequency