enhanced training and reward - wip
docs/ENHANCED_REWARD_SYSTEM.md (new file, 349 lines)
# Enhanced Reward System for Reinforcement Learning Training

## Overview

This document describes the implementation of an enhanced reward system for your reinforcement learning trading models. The system uses **mean squared error (MSE) between predictions and empirical outcomes** as the primary reward mechanism, with support for multiple timeframes and comprehensive accuracy tracking.
## Key Features

### ✅ MSE-Based Reward Calculation
- Uses mean squared difference between predicted and actual prices
- Exponential decay function heavily penalizes large prediction errors
- Direction accuracy bonus/penalty system
- Confidence-weighted final rewards

### ✅ Multi-Timeframe Support
- Separate tracking for **1s, 1m, 1h, 1d** timeframes
- Independent accuracy metrics for each timeframe
- Timeframe-specific evaluation timeouts
- Models know which timeframe they're predicting on

### ✅ Prediction History Tracking
- Maintains last **6 predictions per timeframe** per symbol
- Comprehensive prediction records with outcomes
- Historical accuracy analysis
- Memory-efficient with automatic cleanup

### ✅ Real-Time Training
- Training triggered at each inference when outcomes are available
- Separate training batches for each model and timeframe
- Automatic evaluation of predictions after appropriate timeouts
- Integration with existing RL training infrastructure

### ✅ Enhanced Inference Scheduling
- **Continuous inference** every 1-5 seconds on the primary timeframe
- **Hourly multi-timeframe inference** (4 predictions per hour, one for each timeframe)
- Timeframe-aware inference context
- Proper scheduling and coordination
## Architecture

```mermaid
graph TD
    A[Market Data] --> B[Timeframe Inference Coordinator]
    B --> C[Model Inference]
    C --> D[Enhanced Reward Calculator]
    D --> E[Prediction Tracking]
    E --> F[Outcome Evaluation]
    F --> G[MSE Reward Calculation]
    G --> H[Enhanced RL Training Adapter]
    H --> I[Model Training]
    I --> J[Performance Monitoring]
```

## Core Components
### 1. EnhancedRewardCalculator (`core/enhanced_reward_calculator.py`)

**Purpose**: Central reward calculation engine using MSE methodology

**Key Methods**:
- `add_prediction()` - Track new predictions
- `evaluate_predictions()` - Calculate rewards when outcomes available
- `get_accuracy_summary()` - Comprehensive accuracy metrics
- `get_training_data()` - Extract training samples for models
**Reward Formula**:

```python
from math import exp

# MSE calculation
price_error = actual_price - predicted_price
mse = price_error ** 2

# Normalize to a reasonable scale
max_mse = (current_price * 0.1) ** 2  # 10% as max expected error
normalized_mse = min(mse / max_mse, 1.0)

# Exponential decay (heavily penalize large errors)
mse_reward = exp(-5 * normalized_mse)  # Range: [exp(-5), 1]

# Direction bonus/penalty
direction_bonus = 0.5 if direction_correct else -0.5

# Final reward (confidence weighted)
final_reward = (mse_reward + direction_bonus) * confidence
```
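For illustration, here is a self-contained version of the same formula with two worked inputs showing how sharply the exponential term punishes large misses. The function name and signature are illustrative, not the calculator's actual API:

```python
from math import exp

def mse_reward(predicted_price: float, actual_price: float, current_price: float,
               direction_correct: bool, confidence: float) -> float:
    """Illustrative helper implementing the reward formula above."""
    mse = (actual_price - predicted_price) ** 2
    max_mse = (current_price * 0.1) ** 2           # a 10% move treated as max expected error
    normalized_mse = min(mse / max_mse, 1.0)
    reward = exp(-5 * normalized_mse)              # MSE term in [exp(-5), 1]
    reward += 0.5 if direction_correct else -0.5   # direction bonus/penalty
    return reward * confidence                     # confidence weighting

# A near-perfect prediction is rewarded close to 1.5 * confidence...
print(mse_reward(3150.5, 3150.6, 3150.0, True, 0.85))    # ~1.27
# ...while a large miss in the wrong direction goes negative.
print(mse_reward(3150.5, 2900.0, 3150.0, False, 0.85))   # ~-0.39
```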
### 2. TimeframeInferenceCoordinator (`core/timeframe_inference_coordinator.py`)

**Purpose**: Coordinates timeframe-aware model inference with proper scheduling

**Key Features**:
- **Continuous inference loop** for each symbol (every 5 seconds)
- **Hourly multi-timeframe scheduler** (4 predictions per hour)
- **Inference context management** (models know their target timeframe)
- **Automatic reward evaluation** and training triggers

**Scheduling** (see the sketch after this list):
- **Every 5 seconds**: Inference on the primary timeframe (1s)
- **Every hour**: One inference for each timeframe (1s, 1m, 1h, 1d)
- **Evaluation timeouts**: 5s for 1s predictions, 60s for 1m, 300s for 1h, 900s for 1d
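A minimal sketch of how these two cadences could be driven with `asyncio`; the loop structure and the `run_inference` callback are assumptions for illustration, not the coordinator's actual methods:

```python
import asyncio

TIMEFRAMES = ["1s", "1m", "1h", "1d"]

async def continuous_loop(symbol, run_inference, interval_seconds=5.0):
    """Inference on the primary timeframe every few seconds."""
    while True:
        await run_inference(symbol, timeframe="1s")
        await asyncio.sleep(interval_seconds)

async def hourly_loop(symbol, run_inference):
    """Once per hour, one inference for each tracked timeframe."""
    while True:
        for timeframe in TIMEFRAMES:
            await run_inference(symbol, timeframe=timeframe)
        await asyncio.sleep(3600)

async def coordinate(symbol, run_inference):
    """Run both schedules concurrently for a symbol."""
    await asyncio.gather(
        continuous_loop(symbol, run_inference),
        hourly_loop(symbol, run_inference),
    )
```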
### 3. EnhancedRLTrainingAdapter (`core/enhanced_rl_training_adapter.py`)

**Purpose**: Bridge between the new reward system and the existing RL training infrastructure

**Key Features**:
- **Model inference wrappers** for DQN, COB RL, and CNN models
- **Training batch creation** from prediction records and rewards (see the sketch after this list)
- **Real-time training triggers** based on evaluation results
- **Backward compatibility** with existing training systems
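The batching idea can be sketched as grouping evaluated predictions per model and timeframe and training only once enough samples exist; the record fields and helper below are assumptions, not the adapter's actual classes:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List

@dataclass
class EvaluatedPrediction:
    model_name: str      # e.g. 'dqn', 'cob_rl', 'enhanced_cnn'
    timeframe: str       # '1s', '1m', '1h', '1d'
    state: List[float]   # features captured at inference time
    action: int          # -1 (down), 0 (neutral), 1 (up)
    reward: float        # MSE-based reward from the calculator

def build_training_batches(records, min_batch_size=8, max_batch_size=64):
    """Group evaluated predictions by (model, timeframe) and emit batches."""
    grouped = defaultdict(list)
    for record in records:
        grouped[(record.model_name, record.timeframe)].append(record)

    batches = {}
    for key, samples in grouped.items():
        if len(samples) >= min_batch_size:
            batches[key] = samples[-max_batch_size:]  # keep only the most recent samples
    return batches
```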
### 4. EnhancedRewardSystemIntegration (`core/enhanced_reward_system_integration.py`)

**Purpose**: Simple integration point for existing systems

**Key Features**:
- **One-line integration** with the existing TradingOrchestrator
- **Helper functions** for easy prediction tracking
- **Comprehensive monitoring** and statistics
- **Minimal code changes** required
## Integration Guide

### Step 1: Import Required Components

Add to your `orchestrator.py`:

```python
from core.enhanced_reward_system_integration import (
    integrate_enhanced_rewards,
    add_prediction_to_enhanced_rewards
)
```

### Step 2: Initialize in TradingOrchestrator

In your `TradingOrchestrator.__init__()`:

```python
# Add this line after existing initialization
integrate_enhanced_rewards(self, symbols=['ETH/USDT', 'BTC/USDT'])
```

### Step 3: Start the System

In your `TradingOrchestrator.run()` method:

```python
# Add this line after initialization
await self.enhanced_reward_system.start_integration()
```

### Step 4: Track Predictions

In your model inference methods (CNN, DQN, COB RL):

```python
# Example in CNN inference
prediction_id = add_prediction_to_enhanced_rewards(
    self,             # orchestrator instance
    symbol,           # 'ETH/USDT'
    timeframe,        # '1s', '1m', '1h', '1d'
    predicted_price,  # model's price prediction
    direction,        # -1 (down), 0 (neutral), 1 (up)
    confidence,       # 0.0 to 1.0
    current_price,    # current market price
    'enhanced_cnn'    # model name
)
```

### Step 5: Monitor Performance

```python
# Get comprehensive statistics
stats = self.enhanced_reward_system.get_integration_statistics()
accuracy = self.enhanced_reward_system.get_model_accuracy()

# Force evaluation for testing
self.enhanced_reward_system.force_evaluation_and_training('ETH/USDT', '1s')
```
## Usage Example

See `examples/enhanced_reward_system_example.py` for a complete demonstration.

```bash
python examples/enhanced_reward_system_example.py
```
## Performance Benefits

### 🎯 Better Accuracy Measurement
- **MSE rewards** provide more nuanced feedback than simple directional accuracy
- **Price prediction accuracy** measured alongside direction accuracy
- **Confidence-weighted rewards** encourage well-calibrated predictions

### 📊 Multi-Timeframe Intelligence
- **Separate tracking** prevents timeframe confusion
- **Timeframe-specific evaluation** accounts for different market dynamics
- **Comprehensive accuracy picture** across all prediction horizons

### ⚡ Real-Time Learning
- **Immediate training** when prediction outcomes become available
- **No batch delays** - models learn from every prediction
- **Adaptive training frequency** based on prediction evaluation

### 🔄 Enhanced Inference Scheduling
- **Optimal prediction frequency** balances real-time response with computational efficiency
- **Hourly multi-timeframe predictions** provide comprehensive market coverage
- **Context-aware models** make better predictions knowing their target timeframe
## Configuration

### Evaluation Timeouts (Configurable in EnhancedRewardCalculator)

```python
evaluation_timeouts = {
    TimeFrame.SECONDS_1: 5,    # Evaluate 1s predictions after 5 seconds
    TimeFrame.MINUTES_1: 60,   # Evaluate 1m predictions after 1 minute
    TimeFrame.HOURS_1: 300,    # Evaluate 1h predictions after 5 minutes
    TimeFrame.DAYS_1: 900      # Evaluate 1d predictions after 15 minutes
}
```

### Inference Scheduling (Configurable in TimeframeInferenceCoordinator)

```python
schedule = InferenceSchedule(
    continuous_interval_seconds=5.0,  # Continuous inference every 5 seconds
    hourly_timeframes=[TimeFrame.SECONDS_1, TimeFrame.MINUTES_1,
                       TimeFrame.HOURS_1, TimeFrame.DAYS_1]
)
```

### Training Configuration (Configurable in EnhancedRLTrainingAdapter)

```python
min_batch_size = 8               # Minimum samples for training
max_batch_size = 64              # Maximum samples per training batch
training_interval_seconds = 5.0  # Training check frequency
```
## Monitoring and Statistics

### Integration Statistics

```python
stats = enhanced_reward_system.get_integration_statistics()
```

Returns:
- System running status
- Total predictions tracked
- Component status
- Inference and training statistics
- Performance metrics
### Model Accuracy

```python
accuracy = enhanced_reward_system.get_model_accuracy()
```

Returns, for each symbol and timeframe (see the usage sketch below):
- Total predictions made
- Direction accuracy percentage
- Average MSE
- Recent prediction count
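A minimal consumption sketch, assuming the summary is a nested mapping keyed by symbol and timeframe; the field names below are illustrative, not the actual schema:

```python
accuracy = enhanced_reward_system.get_model_accuracy()

# Assumed shape: {symbol: {timeframe: {metric_name: value}}}
for symbol, per_timeframe in accuracy.items():
    for timeframe, stats in per_timeframe.items():
        print(f"{symbol} {timeframe}: "
              f"{stats['total_predictions']} predictions, "
              f"{stats['direction_accuracy']:.1%} direction accuracy, "
              f"avg MSE {stats['avg_mse']:.6f}")
```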
### Real-Time Monitoring

The system provides comprehensive logging at different levels (see the snippet after this list):
- `INFO`: Major system events, training results
- `DEBUG`: Detailed prediction tracking, reward calculations
- `ERROR`: System errors and recovery actions
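During development, the `DEBUG`-level detail can be surfaced with the standard `logging` module, for example:

```python
import logging

# Show detailed prediction tracking and reward calculations during development.
logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s %(levelname)s %(name)s: %(message)s")
```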
## Backward Compatibility

The enhanced reward system is designed to be **fully backward compatible**:

✅ **Existing models continue to work** without modification
✅ **Existing training systems** remain functional
✅ **Existing reward calculations** can run in parallel
✅ **Gradual migration** - enable for specific models incrementally
## Testing and Validation

### Force Evaluation for Testing

```python
# Force immediate evaluation of all predictions
enhanced_reward_system.force_evaluation_and_training()

# Force evaluation for a specific symbol/timeframe
enhanced_reward_system.force_evaluation_and_training('ETH/USDT', '1s')
```

### Manual Prediction Addition

```python
# Add predictions manually for testing
prediction_id = enhanced_reward_system.add_prediction_manually(
    symbol='ETH/USDT',
    timeframe_str='1s',
    predicted_price=3150.50,
    predicted_direction=1,
    confidence=0.85,
    current_price=3150.00,
    model_name='test_model'
)
```
## Memory Management

The system includes automatic memory management (see the sketch after this list):

- **Automatic prediction cleanup** (configurable retention period)
- **Circular buffers** for prediction history (max 100 per timeframe)
- **Price cache management** (max 1000 price points per symbol)
- **Efficient storage** using deques and compressed data structures
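A minimal sketch of the bounded-buffer approach using `collections.deque` with the caps quoted above; the container layout is an assumption for illustration:

```python
from collections import defaultdict, deque

MAX_PREDICTIONS_PER_TIMEFRAME = 100  # prediction records kept per timeframe
MAX_PRICE_POINTS_PER_SYMBOL = 1000   # cached price points kept per symbol

# prediction_history[symbol][timeframe] -> bounded deque of prediction records
prediction_history = defaultdict(
    lambda: defaultdict(lambda: deque(maxlen=MAX_PREDICTIONS_PER_TIMEFRAME))
)

# price_cache[symbol] -> bounded deque of (timestamp, price) tuples
price_cache = defaultdict(lambda: deque(maxlen=MAX_PRICE_POINTS_PER_SYMBOL))

# Appending beyond maxlen silently drops the oldest entry, keeping memory bounded.
prediction_history['ETH/USDT']['1s'].append({'predicted_price': 3150.5, 'confidence': 0.85})
price_cache['ETH/USDT'].append(('2024-01-01T00:00:00Z', 3150.0))
```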
## Future Enhancements

The architecture supports easy extension for:

1. **Additional timeframes** (30s, 5m, 15m, etc.)
2. **Custom reward functions** (Sharpe ratio, maximum drawdown, etc.)
3. **Multi-symbol correlation** rewards
4. **Advanced statistical metrics** (Sortino ratio, Calmar ratio)
5. **Model ensemble** reward aggregation
6. **A/B testing** framework for reward functions
## Conclusion

The Enhanced Reward System provides a comprehensive foundation for improving RL model training through:

- **Precise MSE-based rewards** that accurately measure prediction quality
- **Multi-timeframe intelligence** that prevents confusion between different prediction horizons
- **Real-time learning** that maximizes training opportunities
- **Easy integration** that requires minimal changes to existing code
- **Comprehensive monitoring** that provides insights into model performance

This system addresses the specific requirements you outlined:

✅ MSE-based accuracy calculation
✅ Training at each inference using the last prediction vs. the current outcome
✅ Separate accuracy tracking for up to the last 6 predictions per timeframe
✅ Models know which timeframe they're predicting on
✅ Hourly multi-timeframe inference (4 predictions per hour)
✅ Integration with the existing 1-5 second inference frequency