enhanced training and reward - wip

Dobromir Popov
2025-08-23 01:07:05 +03:00
parent 10199e4171
commit 9992b226ea
6 changed files with 2471 additions and 0 deletions


@@ -0,0 +1,349 @@
# Enhanced Reward System for Reinforcement Learning Training
## Overview
This document describes the implementation of an enhanced reward system for your reinforcement learning trading models. The system uses **mean squared error (MSE) between predictions and empirical outcomes** as the primary reward mechanism, with support for multiple timeframes and comprehensive accuracy tracking.
## Key Features
### ✅ MSE-Based Reward Calculation
- Uses mean squared difference between predicted and actual prices
- Exponential decay function heavily penalizes large prediction errors
- Direction accuracy bonus/penalty system
- Confidence-weighted final rewards
### ✅ Multi-Timeframe Support
- Separate tracking for **1s, 1m, 1h, 1d** timeframes
- Independent accuracy metrics for each timeframe
- Timeframe-specific evaluation timeouts
- Models know which timeframe they're predicting on
### ✅ Prediction History Tracking
- Maintains last **6 predictions per timeframe** per symbol
- Comprehensive prediction records with outcomes
- Historical accuracy analysis
- Memory-efficient with automatic cleanup
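A minimal sketch of what a per-timeframe prediction record and the 6-deep history could look like. The `PredictionRecord` fields and container names here are illustrative assumptions, not the exact definitions in `core/enhanced_reward_calculator.py`:
```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional
import time

# Hypothetical record shape; the real fields live in core/enhanced_reward_calculator.py
@dataclass
class PredictionRecord:
    symbol: str
    timeframe: str            # '1s', '1m', '1h', '1d'
    predicted_price: float
    predicted_direction: int  # -1 down, 0 neutral, 1 up
    confidence: float
    current_price: float
    model_name: str
    timestamp: float = field(default_factory=time.time)
    outcome_price: Optional[float] = None
    reward: Optional[float] = None

# Accuracy tracking keeps only the last 6 predictions per (symbol, timeframe)
history: dict[tuple[str, str], deque] = {}

def track(record: PredictionRecord) -> None:
    key = (record.symbol, record.timeframe)
    history.setdefault(key, deque(maxlen=6)).append(record)
```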
### ✅ Real-Time Training
- Training triggered at each inference when outcomes are available
- Separate training batches for each model and timeframe
- Automatic evaluation of predictions after appropriate timeouts
- Integration with existing RL training infrastructure
### ✅ Enhanced Inference Scheduling
- **Continuous inference** every 1-5 seconds on primary timeframe
- **Hourly multi-timeframe inference** (4 predictions per hour - one for each timeframe)
- Timeframe-aware inference context
- Proper scheduling and coordination
## Architecture
```mermaid
graph TD
A[Market Data] --> B[Timeframe Inference Coordinator]
B --> C[Model Inference]
C --> D[Enhanced Reward Calculator]
D --> E[Prediction Tracking]
E --> F[Outcome Evaluation]
F --> G[MSE Reward Calculation]
G --> H[Enhanced RL Training Adapter]
H --> I[Model Training]
I --> J[Performance Monitoring]
```
## Core Components
### 1. EnhancedRewardCalculator (`core/enhanced_reward_calculator.py`)
**Purpose**: Central reward calculation engine using MSE methodology
**Key Methods**:
- `add_prediction()` - Track new predictions
- `evaluate_predictions()` - Calculate rewards when outcomes available
- `get_accuracy_summary()` - Comprehensive accuracy metrics
- `get_training_data()` - Extract training samples for models
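A hedged sketch of how these methods fit together in an inference/training loop. The constructor and argument names below are assumptions inferred from the descriptions above, not the exact signatures:
```python
from core.enhanced_reward_calculator import EnhancedRewardCalculator

# Illustrative flow only; actual signatures may differ
calculator = EnhancedRewardCalculator(symbols=['ETH/USDT', 'BTC/USDT'])

# 1. At inference time, register the model's prediction
prediction_id = calculator.add_prediction(
    symbol='ETH/USDT', timeframe='1s',
    predicted_price=3150.50, predicted_direction=1,
    confidence=0.85, current_price=3150.00,
    model_name='enhanced_cnn',
)

# 2. Later, once the evaluation timeout has elapsed, compute MSE-based rewards
rewards = calculator.evaluate_predictions(symbol='ETH/USDT')

# 3. Pull rewarded samples for training and inspect accuracy
samples = calculator.get_training_data(model_name='enhanced_cnn')
summary = calculator.get_accuracy_summary()
```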
**Reward Formula**:
```python
from math import exp

def calculate_mse_reward(predicted_price: float, actual_price: float,
                         current_price: float, direction_correct: bool,
                         confidence: float) -> float:
    # MSE between predicted and actual price
    price_error = actual_price - predicted_price
    mse = price_error ** 2
    # Normalize to a reasonable scale (10% of current price as max expected error)
    max_mse = (current_price * 0.1) ** 2
    normalized_mse = min(mse / max_mse, 1.0)
    # Exponential decay heavily penalizes large errors; range: [exp(-5), 1]
    mse_reward = exp(-5 * normalized_mse)
    # Direction bonus/penalty
    direction_bonus = 0.5 if direction_correct else -0.5
    # Final reward, weighted by model confidence
    return (mse_reward + direction_bonus) * confidence
```
### 2. TimeframeInferenceCoordinator (`core/timeframe_inference_coordinator.py`)
**Purpose**: Coordinates timeframe-aware model inference with proper scheduling
**Key Features**:
- **Continuous inference loop** for each symbol (every 5 seconds)
- **Hourly multi-timeframe scheduler** (4 predictions per hour)
- **Inference context management** (models know target timeframe)
- **Automatic reward evaluation** and training triggers
**Scheduling**:
- **Every 5 seconds**: Inference on primary timeframe (1s)
- **Every hour**: One inference for each timeframe (1s, 1m, 1h, 1d)
- **Evaluation timeouts**: 5s for 1s predictions, 60s for 1m, 300s for 1h, 900s for 1d
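As a rough sketch, the continuous loop and the hourly multi-timeframe pass could be structured as two asyncio tasks per symbol. The coordinator's real method names may differ; `run_model_inference` is a placeholder:
```python
import asyncio

PRIMARY_TIMEFRAME = '1s'
HOURLY_TIMEFRAMES = ['1s', '1m', '1h', '1d']

async def continuous_inference(symbol: str, run_model_inference):
    """Every 5 seconds: one inference on the primary timeframe."""
    while True:
        await run_model_inference(symbol, PRIMARY_TIMEFRAME)
        await asyncio.sleep(5.0)

async def hourly_multi_timeframe(symbol: str, run_model_inference):
    """Every hour: one inference per timeframe (4 predictions per hour)."""
    while True:
        for tf in HOURLY_TIMEFRAMES:
            await run_model_inference(symbol, tf)  # model sees its target timeframe
        await asyncio.sleep(3600.0)
```
Both loops would typically be started together, e.g. with `asyncio.gather(...)`, for each tracked symbol.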
### 3. EnhancedRLTrainingAdapter (`core/enhanced_rl_training_adapter.py`)
**Purpose**: Bridge between new reward system and existing RL training infrastructure
**Key Features**:
- **Model inference wrappers** for DQN, COB RL, and CNN models
- **Training batch creation** from prediction records and rewards
- **Real-time training triggers** based on evaluation results
- **Backward compatibility** with existing training systems
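A simplified sketch of how the adapter might group evaluated predictions into per-model training batches. The function and thresholds are illustrative (the thresholds mirror the training configuration shown later); the real logic lives in `core/enhanced_rl_training_adapter.py`:
```python
MIN_BATCH_SIZE = 8
MAX_BATCH_SIZE = 64

def build_training_batches(evaluated_predictions):
    """Group (record, reward) pairs by model and return batches large enough to train on."""
    grouped = {}
    for record, reward in evaluated_predictions:
        grouped.setdefault(record.model_name, []).append((record, reward))

    ready = {}
    for model_name, samples in grouped.items():
        if len(samples) >= MIN_BATCH_SIZE:
            # Cap at the most recent MAX_BATCH_SIZE samples
            ready[model_name] = samples[-MAX_BATCH_SIZE:]
    return ready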
### 4. EnhancedRewardSystemIntegration (`core/enhanced_reward_system_integration.py`)
**Purpose**: Simple integration point for existing systems
**Key Features**:
- **One-line integration** with existing TradingOrchestrator
- **Helper functions** for easy prediction tracking
- **Comprehensive monitoring** and statistics
- **Minimal code changes** required
## Integration Guide
### Step 1: Import Required Components
Add to your `orchestrator.py`:
```python
from core.enhanced_reward_system_integration import (
integrate_enhanced_rewards,
add_prediction_to_enhanced_rewards
)
```
### Step 2: Initialize in TradingOrchestrator
In your `TradingOrchestrator.__init__()`:
```python
# Add this line after existing initialization
integrate_enhanced_rewards(self, symbols=['ETH/USDT', 'BTC/USDT'])
```
### Step 3: Start the System
In your `TradingOrchestrator.run()` method:
```python
# Add this line after initialization
await self.enhanced_reward_system.start_integration()
```
### Step 4: Track Predictions
In your model inference methods (CNN, DQN, COB RL):
```python
# Example in CNN inference
prediction_id = add_prediction_to_enhanced_rewards(
self, # orchestrator instance
symbol, # 'ETH/USDT'
timeframe, # '1s', '1m', '1h', '1d'
predicted_price, # model's price prediction
direction, # -1 (down), 0 (neutral), 1 (up)
confidence, # 0.0 to 1.0
current_price, # current market price
'enhanced_cnn' # model name
)
```
### Step 5: Monitor Performance
```python
# Get comprehensive statistics
stats = self.enhanced_reward_system.get_integration_statistics()
accuracy = self.enhanced_reward_system.get_model_accuracy()
# Force evaluation for testing
self.enhanced_reward_system.force_evaluation_and_training('ETH/USDT', '1s')
```
## Usage Example
See `examples/enhanced_reward_system_example.py` for a complete demonstration.
```bash
python examples/enhanced_reward_system_example.py
```
## Performance Benefits
### 🎯 Better Accuracy Measurement
- **MSE rewards** provide nuanced feedback vs. simple directional accuracy
- **Price prediction accuracy** measured alongside direction accuracy
- **Confidence-weighted rewards** encourage well-calibrated predictions
### 📊 Multi-Timeframe Intelligence
- **Separate tracking** prevents timeframe confusion
- **Timeframe-specific evaluation** accounts for different market dynamics
- **Comprehensive accuracy picture** across all prediction horizons
### ⚡ Real-Time Learning
- **Immediate training** when prediction outcomes available
- **No batch delays** - models learn from every prediction
- **Adaptive training frequency** based on prediction evaluation
### 🔄 Enhanced Inference Scheduling
- **Optimal prediction frequency** balances real-time response with computational efficiency
- **Hourly multi-timeframe predictions** provide comprehensive market coverage
- **Context-aware models** make better predictions knowing their target timeframe
## Configuration
### Evaluation Timeouts (Configurable in EnhancedRewardCalculator)
```python
evaluation_timeouts = {
TimeFrame.SECONDS_1: 5, # Evaluate 1s predictions after 5 seconds
TimeFrame.MINUTES_1: 60, # Evaluate 1m predictions after 1 minute
TimeFrame.HOURS_1: 300, # Evaluate 1h predictions after 5 minutes
TimeFrame.DAYS_1: 900 # Evaluate 1d predictions after 15 minutes
}
```
### Inference Scheduling (Configurable in TimeframeInferenceCoordinator)
```python
schedule = InferenceSchedule(
continuous_interval_seconds=5.0, # Continuous inference every 5 seconds
hourly_timeframes=[TimeFrame.SECONDS_1, TimeFrame.MINUTES_1,
TimeFrame.HOURS_1, TimeFrame.DAYS_1]
)
```
### Training Configuration (Configurable in EnhancedRLTrainingAdapter)
```python
min_batch_size = 8 # Minimum samples for training
max_batch_size = 64 # Maximum samples per training batch
training_interval_seconds = 5.0 # Training check frequency
```
## Monitoring and Statistics
### Integration Statistics
```python
stats = enhanced_reward_system.get_integration_statistics()
```
Returns:
- System running status
- Total predictions tracked
- Component status
- Inference and training statistics
- Performance metrics
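For example, a periodic monitoring hook might log a few of these numbers. The dictionary keys used below are assumptions for illustration; adjust them to the actual structure returned by the integration:
```python
import logging

logger = logging.getLogger(__name__)

def log_integration_stats(enhanced_reward_system):
    stats = enhanced_reward_system.get_integration_statistics()
    # Key names below are assumed, not confirmed
    logger.info("Enhanced rewards running=%s, predictions tracked=%s",
                stats.get('running'), stats.get('total_predictions'))
```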
### Model Accuracy
```python
accuracy = enhanced_reward_system.get_model_accuracy()
```
Returns for each symbol and timeframe:
- Total predictions made
- Direction accuracy percentage
- Average MSE
- Recent prediction count
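Assuming the summary is keyed by symbol and then timeframe, it could be reported like this (the nesting and metric names are assumptions for illustration):
```python
accuracy = enhanced_reward_system.get_model_accuracy()
for symbol, per_timeframe in accuracy.items():
    for timeframe, metrics in per_timeframe.items():
        # Metric keys are assumed names
        print(f"{symbol} {timeframe}: "
              f"direction_accuracy={metrics.get('direction_accuracy')}, "
              f"avg_mse={metrics.get('avg_mse')}, "
              f"predictions={metrics.get('total_predictions')}")
```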
### Real-Time Monitoring
The system provides comprehensive logging at different levels:
- `INFO`: Major system events, training results
- `DEBUG`: Detailed prediction tracking, reward calculations
- `ERROR`: System errors and recovery actions
## Backward Compatibility
The enhanced reward system is designed to be **fully backward compatible**:
- **Existing models continue to work** without modification
- **Existing training systems** remain functional
- **Existing reward calculations** can run in parallel
- **Gradual migration** - enable for specific models incrementally
## Testing and Validation
### Force Evaluation for Testing
```python
# Force immediate evaluation of all predictions
enhanced_reward_system.force_evaluation_and_training()
# Force evaluation for specific symbol/timeframe
enhanced_reward_system.force_evaluation_and_training('ETH/USDT', '1s')
```
### Manual Prediction Addition
```python
# Add predictions manually for testing
prediction_id = enhanced_reward_system.add_prediction_manually(
symbol='ETH/USDT',
timeframe_str='1s',
predicted_price=3150.50,
predicted_direction=1,
confidence=0.85,
current_price=3150.00,
model_name='test_model'
)
```
## Memory Management
The system includes automatic memory management:
- **Automatic prediction cleanup** (configurable retention period)
- **Circular buffers** for prediction history (max 100 per timeframe)
- **Price cache management** (max 1000 price points per symbol)
- **Efficient storage** using deques and compressed data structures
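A minimal sketch of the circular-buffer idea using `deque` with `maxlen`. The retention limits match the numbers above; the container names are illustrative:
```python
from collections import deque

# Prediction history: at most 100 records per (symbol, timeframe)
prediction_history = {
    ('ETH/USDT', '1s'): deque(maxlen=100),
}

# Price cache: at most 1000 recent prices per symbol
price_cache = {
    'ETH/USDT': deque(maxlen=1000),
}

price_cache['ETH/USDT'].append(3150.00)  # oldest entry is evicted once the cap is hit
```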
## Future Enhancements
The architecture supports easy extension for:
1. **Additional timeframes** (30s, 5m, 15m, etc.)
2. **Custom reward functions** (Sharpe ratio, maximum drawdown, etc.)
3. **Multi-symbol correlation** rewards
4. **Advanced statistical metrics** (Sortino ratio, Calmar ratio)
5. **Model ensemble** reward aggregation
6. **A/B testing** framework for reward functions
## Conclusion
The Enhanced Reward System provides a comprehensive foundation for improving RL model training through:
- **Precise MSE-based rewards** that accurately measure prediction quality
- **Multi-timeframe intelligence** that prevents confusion between different prediction horizons
- **Real-time learning** that maximizes training opportunities
- **Easy integration** that requires minimal changes to existing code
- **Comprehensive monitoring** that provides insights into model performance
This system addresses the specific requirements you outlined:
- ✅ MSE-based accuracy calculation
- ✅ Training at each inference using last prediction vs. current outcome
- ✅ Separate accuracy tracking for up to 6 last predictions per timeframe
- ✅ Models know which timeframe they're predicting on
- ✅ Hourly multi-timeframe inference (4 predictions per hour)
- ✅ Integration with existing 1-5 second inference frequency