# Enhanced Reward System for Reinforcement Learning Training

## Overview

This document describes the implementation of an enhanced reward system for your reinforcement learning trading models. The system uses **mean squared error (MSE) between predictions and empirical outcomes** as the primary reward mechanism, with support for multiple timeframes and comprehensive accuracy tracking.

## Key Features

### ✅ MSE-Based Reward Calculation

- Uses mean squared difference between predicted and actual prices
- Exponential decay function heavily penalizes large prediction errors
- Direction accuracy bonus/penalty system
- Confidence-weighted final rewards

### ✅ Multi-Timeframe Support

- Separate tracking for **1s, 1m, 1h, 1d** timeframes (see the `TimeFrame` sketch below)
- Independent accuracy metrics for each timeframe
- Timeframe-specific evaluation timeouts
- Models know which timeframe they're predicting on

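The configuration examples later in this document reference a `TimeFrame` enum with members such as `TimeFrame.SECONDS_1` and `TimeFrame.DAYS_1`. A minimal sketch of what such an enum and a per-timeframe tracking key could look like is shown below; the actual definition in the codebase may differ.

```python
from enum import Enum

class TimeFrame(Enum):
    """Prediction horizons tracked by the reward system (illustrative sketch)."""
    SECONDS_1 = "1s"
    MINUTES_1 = "1m"
    HOURS_1 = "1h"
    DAYS_1 = "1d"

# Accuracy is tracked independently per (symbol, timeframe) pair,
# so a tracking key can simply be a tuple such as:
key = ("ETH/USDT", TimeFrame.SECONDS_1)
```
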
### ✅ Prediction History Tracking

- Maintains last **6 predictions per timeframe** per symbol (see the sketch below)
- Comprehensive prediction records with outcomes
- Historical accuracy analysis
- Memory-efficient with automatic cleanup

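A minimal sketch of how this bounded history could be kept, using only the standard library; the field names here are illustrative and not necessarily the exact attributes of the real prediction records:

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class PredictionRecord:
    # Illustrative fields only; the real record in core/enhanced_reward_calculator.py may differ
    symbol: str
    timeframe: str            # '1s', '1m', '1h', '1d'
    predicted_price: float
    predicted_direction: int  # -1 (down), 0 (neutral), 1 (up)
    confidence: float
    actual_price: Optional[float] = None
    reward: Optional[float] = None

# deque(maxlen=6) automatically discards the oldest record,
# which keeps memory bounded without explicit cleanup code
history = defaultdict(lambda: deque(maxlen=6))
history[("ETH/USDT", "1s")].append(
    PredictionRecord("ETH/USDT", "1s", 3150.5, 1, 0.85)
)
```
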
### ✅ Real-Time Training

- Training triggered at each inference when outcomes are available
- Separate training batches for each model and timeframe
- Automatic evaluation of predictions after appropriate timeouts
- Integration with existing RL training infrastructure

### ✅ Enhanced Inference Scheduling

- **Continuous inference** every 1-5 seconds on the primary timeframe
- **Hourly multi-timeframe inference** (4 predictions per hour, one for each timeframe)
- Timeframe-aware inference context
- Proper scheduling and coordination

## Architecture

```mermaid
graph TD
    A[Market Data] --> B[Timeframe Inference Coordinator]
    B --> C[Model Inference]
    C --> D[Enhanced Reward Calculator]
    D --> E[Prediction Tracking]
    E --> F[Outcome Evaluation]
    F --> G[MSE Reward Calculation]
    G --> H[Enhanced RL Training Adapter]
    H --> I[Model Training]
    I --> J[Performance Monitoring]
```

## Core Components

### 1. EnhancedRewardCalculator (`core/enhanced_reward_calculator.py`)

**Purpose**: Central reward calculation engine using MSE methodology

**Key Methods**:
- `add_prediction()` - Track new predictions
- `evaluate_predictions()` - Calculate rewards when outcomes are available
- `get_accuracy_summary()` - Comprehensive accuracy metrics
- `get_training_data()` - Extract training samples for models

**Reward Formula**:

```python
from math import exp

# MSE calculation
price_error = actual_price - predicted_price
mse = price_error ** 2

# Normalize to a reasonable scale
max_mse = (current_price * 0.1) ** 2  # 10% of price treated as the maximum expected error
normalized_mse = min(mse / max_mse, 1.0)

# Exponential decay (heavily penalizes large errors)
mse_reward = exp(-5 * normalized_mse)  # Range: [exp(-5), 1]

# Direction bonus/penalty
direction_bonus = 0.5 if direction_correct else -0.5

# Final reward (confidence weighted)
final_reward = (mse_reward + direction_bonus) * confidence
```

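To make the formula concrete, here is the same computation wrapped in a self-contained function with two worked examples; the function name and signature are illustrative, not the calculator's actual API:

```python
from math import exp

def mse_reward_sketch(predicted_price: float, actual_price: float,
                      current_price: float, direction_correct: bool,
                      confidence: float) -> float:
    """Illustrative restatement of the reward formula above."""
    mse = (actual_price - predicted_price) ** 2
    max_mse = (current_price * 0.1) ** 2
    normalized_mse = min(mse / max_mse, 1.0)
    mse_reward = exp(-5 * normalized_mse)
    direction_bonus = 0.5 if direction_correct else -0.5
    return (mse_reward + direction_bonus) * confidence

# A near-perfect, confident prediction earns close to 1.5 * confidence ...
print(mse_reward_sketch(3150.5, 3151.0, 3150.0, True, 0.85))   # ~1.27
# ... while a large error with the wrong direction is pushed toward -0.5 * confidence
print(mse_reward_sketch(3150.5, 2850.0, 3150.0, False, 0.85))  # ~-0.42
```
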
### 2. TimeframeInferenceCoordinator (`core/timeframe_inference_coordinator.py`)

**Purpose**: Coordinates timeframe-aware model inference with proper scheduling

**Key Features**:
- **Continuous inference loop** for each symbol (every 5 seconds)
- **Hourly multi-timeframe scheduler** (4 predictions per hour)
- **Inference context management** (models know the target timeframe)
- **Automatic reward evaluation** and training triggers

**Scheduling** (see the loop sketch below):
- **Every 5 seconds**: Inference on the primary timeframe (1s)
- **Every hour**: One inference for each timeframe (1s, 1m, 1h, 1d)
- **Evaluation timeouts**: 5s for 1s predictions, 60s for 1m, 300s for 1h, 900s for 1d

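The following is a minimal, self-contained sketch of how these two scheduling loops could be structured with `asyncio`; the coordinator's real implementation and the `run_inference` callback are assumptions made for illustration:

```python
import asyncio

TIMEFRAMES = ["1s", "1m", "1h", "1d"]

async def run_inference(symbol: str, timeframe: str) -> None:
    """Placeholder for a model inference call with an explicit timeframe context."""
    print(f"inference: {symbol} @ {timeframe}")

async def continuous_loop(symbol: str, interval_seconds: float = 5.0) -> None:
    # Continuous inference on the primary (1s) timeframe
    while True:
        await run_inference(symbol, "1s")
        await asyncio.sleep(interval_seconds)

async def hourly_multi_timeframe_loop(symbol: str) -> None:
    # Once per hour, produce one prediction per timeframe (4 predictions per hour)
    while True:
        for timeframe in TIMEFRAMES:
            await run_inference(symbol, timeframe)
        await asyncio.sleep(3600)

async def main() -> None:
    await asyncio.gather(
        continuous_loop("ETH/USDT"),
        hourly_multi_timeframe_loop("ETH/USDT"),
    )

# asyncio.run(main())  # run both loops concurrently
```
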
### 3. EnhancedRLTrainingAdapter (`core/enhanced_rl_training_adapter.py`)

**Purpose**: Bridge between the new reward system and the existing RL training infrastructure

**Key Features**:
- **Model inference wrappers** for DQN, COB RL, and CNN models
- **Training batch creation** from prediction records and rewards (see the sketch below)
- **Real-time training triggers** based on evaluation results
- **Backward compatibility** with existing training systems

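As a rough illustration of batch creation, the sketch below collects evaluated (record, reward) pairs per model and timeframe and only releases a batch once the minimum size is reached, mirroring the `min_batch_size`/`max_batch_size` settings shown in the Configuration section; the helper names are hypothetical:

```python
from collections import defaultdict

MIN_BATCH_SIZE = 8   # matches the training configuration shown later
MAX_BATCH_SIZE = 64

# Pending training samples, grouped per (model_name, timeframe)
pending = defaultdict(list)

def on_prediction_evaluated(model_name: str, timeframe: str, record, reward: float) -> None:
    """Hypothetical callback invoked once a prediction's reward is known."""
    pending[(model_name, timeframe)].append((record, reward))

def maybe_build_batch(model_name: str, timeframe: str):
    """Return a training batch once enough evaluated samples are available."""
    samples = pending[(model_name, timeframe)]
    if len(samples) < MIN_BATCH_SIZE:
        return None
    batch = samples[:MAX_BATCH_SIZE]
    pending[(model_name, timeframe)] = samples[MAX_BATCH_SIZE:]
    return batch
```
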
### 4. EnhancedRewardSystemIntegration (`core/enhanced_reward_system_integration.py`)

**Purpose**: Simple integration point for existing systems

**Key Features**:
- **One-line integration** with existing TradingOrchestrator
- **Helper functions** for easy prediction tracking
- **Comprehensive monitoring** and statistics
- **Minimal code changes** required

## Integration Guide

### Step 1: Import Required Components

Add to your `orchestrator.py`:

```python
from core.enhanced_reward_system_integration import (
    integrate_enhanced_rewards,
    add_prediction_to_enhanced_rewards
)
```

### Step 2: Initialize in TradingOrchestrator

In your `TradingOrchestrator.__init__()`:

```python
# Add this line after existing initialization
integrate_enhanced_rewards(self, symbols=['ETH/USDT', 'BTC/USDT'])
```

### Step 3: Start the System

In your `TradingOrchestrator.run()` method:

```python
# Add this line after initialization
await self.enhanced_reward_system.start_integration()
```

### Step 4: Track Predictions

In your model inference methods (CNN, DQN, COB RL):

```python
# Example in CNN inference
prediction_id = add_prediction_to_enhanced_rewards(
    self,              # orchestrator instance
    symbol,            # 'ETH/USDT'
    timeframe,         # '1s', '1m', '1h', '1d'
    predicted_price,   # model's price prediction
    direction,         # -1 (down), 0 (neutral), 1 (up)
    confidence,        # 0.0 to 1.0
    current_price,     # current market price
    'enhanced_cnn'     # model name
)
```

### Step 5: Monitor Performance

```python
# Get comprehensive statistics
stats = self.enhanced_reward_system.get_integration_statistics()
accuracy = self.enhanced_reward_system.get_model_accuracy()

# Force evaluation for testing
self.enhanced_reward_system.force_evaluation_and_training('ETH/USDT', '1s')
```

## Usage Example

See `examples/enhanced_reward_system_example.py` for a complete demonstration.

```bash
python examples/enhanced_reward_system_example.py
```

## Performance Benefits

### 🎯 Better Accuracy Measurement
- **MSE rewards** provide more nuanced feedback than simple directional accuracy
- **Price prediction accuracy** is measured alongside direction accuracy
- **Confidence-weighted rewards** encourage well-calibrated predictions

### 📊 Multi-Timeframe Intelligence
- **Separate tracking** prevents timeframe confusion
- **Timeframe-specific evaluation** accounts for different market dynamics
- **Comprehensive accuracy picture** across all prediction horizons

### ⚡ Real-Time Learning
- **Immediate training** when prediction outcomes become available
- **No batch delays**: models learn from every prediction
- **Adaptive training frequency** based on prediction evaluation

### 🔄 Enhanced Inference Scheduling
- **Optimal prediction frequency** balances real-time responsiveness with computational efficiency
- **Hourly multi-timeframe predictions** provide comprehensive market coverage
- **Context-aware models** make better predictions knowing their target timeframe

## Configuration

### Evaluation Timeouts (Configurable in EnhancedRewardCalculator)

```python
evaluation_timeouts = {
    TimeFrame.SECONDS_1: 5,    # Evaluate 1s predictions after 5 seconds
    TimeFrame.MINUTES_1: 60,   # Evaluate 1m predictions after 1 minute
    TimeFrame.HOURS_1: 300,    # Evaluate 1h predictions after 5 minutes
    TimeFrame.DAYS_1: 900      # Evaluate 1d predictions after 15 minutes
}
```

### Inference Scheduling (Configurable in TimeframeInferenceCoordinator)

```python
schedule = InferenceSchedule(
    continuous_interval_seconds=5.0,  # Continuous inference every 5 seconds
    hourly_timeframes=[TimeFrame.SECONDS_1, TimeFrame.MINUTES_1,
                       TimeFrame.HOURS_1, TimeFrame.DAYS_1]
)
```

### Training Configuration (Configurable in EnhancedRLTrainingAdapter)

```python
min_batch_size = 8                # Minimum samples for training
max_batch_size = 64               # Maximum samples per training batch
training_interval_seconds = 5.0   # Training check frequency
```

## Monitoring and Statistics

### Integration Statistics

```python
stats = enhanced_reward_system.get_integration_statistics()
```

Returns (a quick way to inspect the full payload is shown below):
- System running status
- Total predictions tracked
- Component status
- Inference and training statistics
- Performance metrics

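Since the exact keys are not listed here, the simplest way to see what the statistics payload contains is to pretty-print it, assuming it is a plain dictionary:

```python
import json

stats = enhanced_reward_system.get_integration_statistics()
print(json.dumps(stats, indent=2, default=str))  # default=str handles timestamps/enums
```
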
### Model Accuracy

```python
accuracy = enhanced_reward_system.get_model_accuracy()
```

Returns for each symbol and timeframe (see the iteration sketch below):
- Total predictions made
- Direction accuracy percentage
- Average MSE
- Recent prediction count

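A sketch of iterating that result, assuming a nested `symbol -> timeframe -> metrics` dictionary; the metric key names below are illustrative assumptions, not the exact keys returned:

```python
accuracy = enhanced_reward_system.get_model_accuracy()

for symbol, timeframes in accuracy.items():
    for timeframe, metrics in timeframes.items():
        # Key names are assumptions for illustration only
        print(f"{symbol} {timeframe}: "
              f"{metrics.get('total_predictions', 0)} predictions, "
              f"{metrics.get('direction_accuracy', 0.0):.1f}% direction accuracy, "
              f"avg MSE {metrics.get('average_mse', 0.0):.6f}")
```
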
### Real-Time Monitoring

The system provides comprehensive logging at different levels (a minimal logging setup is shown below):
- `INFO`: Major system events, training results
- `DEBUG`: Detailed prediction tracking, reward calculations
- `ERROR`: System errors and recovery actions

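For example, a standard-library logging setup that surfaces the detailed reward calculations during development might look like this:

```python
import logging

# DEBUG shows per-prediction tracking and reward calculations;
# switch to INFO in production to keep only major events and training results
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
```
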
## Backward Compatibility

The enhanced reward system is designed to be **fully backward compatible**:

✅ **Existing models continue to work** without modification
✅ **Existing training systems** remain functional
✅ **Existing reward calculations** can run in parallel
✅ **Gradual migration** - enable for specific models incrementally

## Testing and Validation

### Force Evaluation for Testing

```python
# Force immediate evaluation of all predictions
enhanced_reward_system.force_evaluation_and_training()

# Force evaluation for specific symbol/timeframe
enhanced_reward_system.force_evaluation_and_training('ETH/USDT', '1s')
```

### Manual Prediction Addition

```python
# Add predictions manually for testing
prediction_id = enhanced_reward_system.add_prediction_manually(
    symbol='ETH/USDT',
    timeframe_str='1s',
    predicted_price=3150.50,
    predicted_direction=1,
    confidence=0.85,
    current_price=3150.00,
    model_name='test_model'
)
```

## Memory Management

The system includes automatic memory management:

- **Automatic prediction cleanup** (configurable retention period)
- **Circular buffers** for prediction history (max 100 per timeframe)
- **Price cache management** (max 1000 price points per symbol)
- **Efficient storage** using deques and compressed data structures (see the sketch below)

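A minimal sketch of those bounded buffers using only the standard library; the real storage layout may differ:

```python
from collections import defaultdict, deque

# Circular buffer: keeps at most the last 100 prediction records per (symbol, timeframe)
prediction_history = defaultdict(lambda: deque(maxlen=100))

# Price cache: keeps at most the last 1000 (timestamp, price) points per symbol
price_cache = defaultdict(lambda: deque(maxlen=1000))

price_cache["ETH/USDT"].append((1_700_000_000.0, 3150.0))
prediction_history[("ETH/USDT", "1s")].append({"predicted_price": 3150.5, "confidence": 0.85})
```
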
## Future Enhancements

The architecture supports easy extension for:

1. **Additional timeframes** (30s, 5m, 15m, etc.)
2. **Custom reward functions** (Sharpe ratio, maximum drawdown, etc.)
3. **Multi-symbol correlation** rewards
4. **Advanced statistical metrics** (Sortino ratio, Calmar ratio)
5. **Model ensemble** reward aggregation
6. **A/B testing** framework for reward functions

## Conclusion

The Enhanced Reward System provides a comprehensive foundation for improving RL model training through:

- **Precise MSE-based rewards** that accurately measure prediction quality
- **Multi-timeframe intelligence** that prevents confusion between different prediction horizons
- **Real-time learning** that maximizes training opportunities
- **Easy integration** that requires minimal changes to existing code
- **Comprehensive monitoring** that provides insights into model performance

This system addresses the specific requirements you outlined:

✅ MSE-based accuracy calculation
✅ Training at each inference using the last prediction vs. the current outcome
✅ Separate accuracy tracking for the last 6 predictions per timeframe
✅ Models know which timeframe they're predicting on
✅ Hourly multi-timeframe inference (4 predictions per hour)
✅ Integration with the existing 1-5 second inference frequency