
Enhanced Reward System for Reinforcement Learning Training

Overview

This document describes the implementation of an enhanced reward system for your reinforcement learning trading models. The system uses mean squared error (MSE) between predictions and empirical outcomes as the primary reward mechanism, with support for multiple timeframes and comprehensive accuracy tracking.

Key Features

MSE-Based Reward Calculation

  • Uses mean squared difference between predicted and actual prices
  • Exponential decay function heavily penalizes large prediction errors
  • Direction accuracy bonus/penalty system
  • Confidence-weighted final rewards

Multi-Timeframe Support

  • Separate tracking for 1s, 1m, 1h, 1d timeframes
  • Independent accuracy metrics for each timeframe
  • Timeframe-specific evaluation timeouts
  • Models know which timeframe they're predicting on

Prediction History Tracking

  • Maintains last 6 predictions per timeframe per symbol
  • Comprehensive prediction records with outcomes
  • Historical accuracy analysis
  • Memory-efficient with automatic cleanup

Real-Time Training

  • Training triggered at each inference when outcomes are available
  • Separate training batches for each model and timeframe
  • Automatic evaluation of predictions after appropriate timeouts
  • Integration with existing RL training infrastructure

Enhanced Inference Scheduling

  • Continuous inference every 1-5 seconds on primary timeframe
  • Hourly multi-timeframe inference (4 predictions per hour - one for each timeframe)
  • Timeframe-aware inference context
  • Proper scheduling and coordination

Architecture

graph TD
    A[Market Data] --> B[Timeframe Inference Coordinator]
    B --> C[Model Inference]
    C --> D[Enhanced Reward Calculator]
    D --> E[Prediction Tracking]
    E --> F[Outcome Evaluation]
    F --> G[MSE Reward Calculation]
    G --> H[Enhanced RL Training Adapter]
    H --> I[Model Training]
    I --> J[Performance Monitoring]

Core Components

1. EnhancedRewardCalculator (core/enhanced_reward_calculator.py)

Purpose: Central reward calculation engine using MSE methodology

Key Methods:

  • add_prediction() - Track new predictions
  • evaluate_predictions() - Calculate rewards when outcomes available
  • get_accuracy_summary() - Comprehensive accuracy metrics
  • get_training_data() - Extract training samples for models

Reward Formula:

from math import exp

# Inputs: predicted_price, actual_price, current_price, confidence, and
# direction_correct come from the tracked prediction record and its outcome.

# MSE calculation
price_error = actual_price - predicted_price
mse = price_error ** 2

# Normalize to a reasonable scale
max_mse = (current_price * 0.1) ** 2  # 10% as max expected error
normalized_mse = min(mse / max_mse, 1.0)

# Exponential decay (heavily penalize large errors)
mse_reward = exp(-5 * normalized_mse)  # Range: [exp(-5), 1]

# Direction bonus/penalty
direction_bonus = 0.5 if direction_correct else -0.5

# Final reward (confidence weighted)
final_reward = (mse_reward + direction_bonus) * confidence
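
As a worked example (all values illustrative): a model predicts 3155.00, the actual price turns out to be 3152.00, the price at prediction time was 3150.00, the predicted direction (up) was correct, and confidence was 0.8:

from math import exp

price_error = 3152.00 - 3155.00           # -3.00
mse = price_error ** 2                    # 9.00
max_mse = (3150.00 * 0.1) ** 2            # 99225.0
normalized_mse = min(mse / max_mse, 1.0)  # ~9.1e-05
mse_reward = exp(-5 * normalized_mse)     # ~0.9995
direction_bonus = 0.5                     # predicted up, price rose
final_reward = (mse_reward + direction_bonus) * 0.8  # ~1.20

Small price errors therefore leave the MSE term near 1.0, and the direction bonus and confidence weighting dominate the final reward.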

2. TimeframeInferenceCoordinator (core/timeframe_inference_coordinator.py)

Purpose: Coordinates timeframe-aware model inference with proper scheduling

Key Features:

  • Continuous inference loop for each symbol (every 5 seconds)
  • Hourly multi-timeframe scheduler (4 predictions per hour)
  • Inference context management (models know target timeframe)
  • Automatic reward evaluation and training triggers

Scheduling:

  • Every 5 seconds: Inference on primary timeframe (1s)
  • Every hour: One inference for each timeframe (1s, 1m, 1h, 1d)
  • Evaluation timeouts: 5s for 1s predictions, 60s for 1m, 300s for 1h, 900s for 1d
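
A simplified sketch of this schedule as two asyncio loops (function names and the injected run_inference callable are illustrative; the actual coordinator also manages inference context and error recovery):

import asyncio

CONTINUOUS_INTERVAL_S = 5.0
HOURLY_TIMEFRAMES = ["1s", "1m", "1h", "1d"]

async def continuous_loop(symbol: str, run_inference):
    # Primary-timeframe inference every 5 seconds
    while True:
        await run_inference(symbol, "1s")
        await asyncio.sleep(CONTINUOUS_INTERVAL_S)

async def hourly_loop(symbol: str, run_inference):
    # One inference per timeframe each hour (4 predictions per hour)
    while True:
        for timeframe in HOURLY_TIMEFRAMES:
            await run_inference(symbol, timeframe)
        await asyncio.sleep(3600)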

3. EnhancedRLTrainingAdapter (core/enhanced_rl_training_adapter.py)

Purpose: Bridge between new reward system and existing RL training infrastructure

Key Features:

  • Model inference wrappers for DQN, COB RL, and CNN models
  • Training batch creation from prediction records and rewards
  • Real-time training triggers based on evaluation results
  • Backward compatibility with existing training systems
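
For illustration, batch creation from evaluated predictions might look like the sketch below (the sample layout and helper name are assumptions; the real adapter works with its own prediction-record type):

MIN_BATCH_SIZE = 8  # matches the training configuration shown later

def build_training_batch(samples, model_name: str, timeframe: str):
    """Group evaluated (prediction, reward) samples into a per-model, per-timeframe batch.

    `samples` is assumed to be an iterable of dicts carrying 'model_name',
    'timeframe', 'state', 'action', and 'reward' keys.
    """
    batch = [s for s in samples
             if s["model_name"] == model_name and s["timeframe"] == timeframe]
    if len(batch) < MIN_BATCH_SIZE:
        return None  # wait until enough evaluated predictions accumulate
    return batch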

4. EnhancedRewardSystemIntegration (core/enhanced_reward_system_integration.py)

Purpose: Simple integration point for existing systems

Key Features:

  • One-line integration with existing TradingOrchestrator
  • Helper functions for easy prediction tracking
  • Comprehensive monitoring and statistics
  • Minimal code changes required

Integration Guide

Step 1: Import Required Components

Add to your orchestrator.py:

from core.enhanced_reward_system_integration import (
    integrate_enhanced_rewards, 
    add_prediction_to_enhanced_rewards
)

Step 2: Initialize in TradingOrchestrator

In your TradingOrchestrator.__init__():

# Add this line after existing initialization
integrate_enhanced_rewards(self, symbols=['ETH/USDT', 'BTC/USDT'])

Step 3: Start the System

In your TradingOrchestrator.run() method:

# Add this line after initialization
await self.enhanced_reward_system.start_integration()

Step 4: Track Predictions

In your model inference methods (CNN, DQN, COB RL):

# Example in CNN inference
prediction_id = add_prediction_to_enhanced_rewards(
    self,  # orchestrator instance
    symbol,  # 'ETH/USDT'
    timeframe,  # '1s', '1m', '1h', '1d'
    predicted_price,  # model's price prediction
    direction,  # -1 (down), 0 (neutral), 1 (up)  
    confidence,  # 0.0 to 1.0
    current_price,  # current market price
    'enhanced_cnn'  # model name
)

Step 5: Monitor Performance

# Get comprehensive statistics
stats = self.enhanced_reward_system.get_integration_statistics()
accuracy = self.enhanced_reward_system.get_model_accuracy()

# Force evaluation for testing
self.enhanced_reward_system.force_evaluation_and_training('ETH/USDT', '1s')

Usage Example

See examples/enhanced_reward_system_example.py for a complete demonstration.

python examples/enhanced_reward_system_example.py

Performance Benefits

🎯 Better Accuracy Measurement

  • MSE rewards provide nuanced feedback vs. simple directional accuracy
  • Price prediction accuracy measured alongside direction accuracy
  • Confidence-weighted rewards encourage well-calibrated predictions

📊 Multi-Timeframe Intelligence

  • Separate tracking prevents timeframe confusion
  • Timeframe-specific evaluation accounts for different market dynamics
  • Comprehensive accuracy picture across all prediction horizons

Real-Time Learning

  • Immediate training when prediction outcomes available
  • No batch delays - models learn from every prediction
  • Adaptive training frequency based on prediction evaluation

🔄 Enhanced Inference Scheduling

  • Optimal prediction frequency balances real-time response with computational efficiency
  • Hourly multi-timeframe predictions provide comprehensive market coverage
  • Context-aware models make better predictions knowing their target timeframe

Configuration

Evaluation Timeouts (Configurable in EnhancedRewardCalculator)

evaluation_timeouts = {
    TimeFrame.SECONDS_1: 5,    # Evaluate 1s predictions after 5 seconds
    TimeFrame.MINUTES_1: 60,   # Evaluate 1m predictions after 1 minute  
    TimeFrame.HOURS_1: 300,    # Evaluate 1h predictions after 5 minutes
    TimeFrame.DAYS_1: 900      # Evaluate 1d predictions after 15 minutes
}

Inference Scheduling (Configurable in TimeframeInferenceCoordinator)

schedule = InferenceSchedule(
    continuous_interval_seconds=5.0,  # Continuous inference every 5 seconds
    hourly_timeframes=[TimeFrame.SECONDS_1, TimeFrame.MINUTES_1, 
                      TimeFrame.HOURS_1, TimeFrame.DAYS_1]
)

Training Configuration (Configurable in EnhancedRLTrainingAdapter)

min_batch_size = 8  # Minimum samples for training
max_batch_size = 64  # Maximum samples per training batch
training_interval_seconds = 5.0  # Training check frequency

Monitoring and Statistics

Integration Statistics

stats = enhanced_reward_system.get_integration_statistics()

Returns:

  • System running status
  • Total predictions tracked
  • Component status
  • Inference and training statistics
  • Performance metrics

Model Accuracy

accuracy = enhanced_reward_system.get_model_accuracy()

Returns for each symbol and timeframe:

  • Total predictions made
  • Direction accuracy percentage
  • Average MSE
  • Recent prediction count
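
A hedged example of iterating that per-symbol, per-timeframe summary (the exact key names are assumptions; adapt them to the structure actually returned):

accuracy = enhanced_reward_system.get_model_accuracy()

for symbol, timeframes in accuracy.items():
    for timeframe, metrics in timeframes.items():
        # Key names below are illustrative, not guaranteed by the API
        print(f"{symbol} {timeframe}: "
              f"{metrics.get('total_predictions', 0)} predictions, "
              f"{metrics.get('direction_accuracy', 0.0):.1f}% direction accuracy, "
              f"avg MSE {metrics.get('avg_mse', 0.0):.4f}")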

Real-Time Monitoring

The system provides comprehensive logging at different levels:

  • INFO: Major system events, training results
  • DEBUG: Detailed prediction tracking, reward calculations
  • ERROR: System errors and recovery actions
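
For example, standard Python logging can be used to surface the DEBUG-level reward details during development (the logger name shown is an assumption based on the module path):

import logging

# INFO for normal operation; DEBUG exposes per-prediction reward calculations
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logging.getLogger("core.enhanced_reward_calculator").setLevel(logging.DEBUG)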

Backward Compatibility

The enhanced reward system is designed to be fully backward compatible:

  • Existing models continue to work without modification
  • Existing training systems remain functional
  • Existing reward calculations can run in parallel
  • Gradual migration - enable for specific models incrementally

Testing and Validation

Force Evaluation for Testing

# Force immediate evaluation of all predictions
enhanced_reward_system.force_evaluation_and_training()

# Force evaluation for specific symbol/timeframe
enhanced_reward_system.force_evaluation_and_training('ETH/USDT', '1s')

Manual Prediction Addition

# Add predictions manually for testing
prediction_id = enhanced_reward_system.add_prediction_manually(
    symbol='ETH/USDT',
    timeframe_str='1s',
    predicted_price=3150.50,
    predicted_direction=1,
    confidence=0.85,
    current_price=3150.00,
    model_name='test_model'
)

Memory Management

The system includes automatic memory management:

  • Automatic prediction cleanup (configurable retention period)
  • Circular buffers for prediction history (max 100 per timeframe)
  • Price cache management (max 1000 price points per symbol)
  • Efficient storage using deques and compressed data structures
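
A minimal sketch of these bounded structures using deques (names are illustrative; the limits match the figures above):

from collections import deque, defaultdict

MAX_PREDICTIONS_PER_TIMEFRAME = 100
MAX_PRICE_POINTS_PER_SYMBOL = 1000

# Per symbol -> per timeframe -> bounded prediction history
prediction_history = defaultdict(
    lambda: defaultdict(lambda: deque(maxlen=MAX_PREDICTIONS_PER_TIMEFRAME))
)

# Per symbol -> bounded price cache; oldest points are discarded automatically
price_cache = defaultdict(lambda: deque(maxlen=MAX_PRICE_POINTS_PER_SYMBOL))

prediction_history["ETH/USDT"]["1s"].append({"predicted_price": 3150.5, "confidence": 0.85})
price_cache["ETH/USDT"].append((1_692_000_000.0, 3150.0))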

Future Enhancements

The architecture supports easy extension for:

  1. Additional timeframes (30s, 5m, 15m, etc.)
  2. Custom reward functions (Sharpe ratio, maximum drawdown, etc.)
  3. Multi-symbol correlation rewards
  4. Advanced statistical metrics (Sortino ratio, Calmar ratio)
  5. Model ensemble reward aggregation
  6. A/B testing framework for reward functions
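
For instance, a custom reward function (item 2 above) could swap the MSE term for a risk-adjusted metric. A sketch only, nothing like this ships with the system today:

import numpy as np

def sharpe_style_reward(returns: np.ndarray, risk_free_rate: float = 0.0) -> float:
    """Sharpe-style reward over a window of realized returns (illustrative only)."""
    excess = returns - risk_free_rate
    std = excess.std()
    if std == 0:
        return 0.0
    return float(excess.mean() / std)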

Conclusion

The Enhanced Reward System provides a comprehensive foundation for improving RL model training through:

  • Precise MSE-based rewards that accurately measure prediction quality
  • Multi-timeframe intelligence that prevents confusion between different prediction horizons
  • Real-time learning that maximizes training opportunities
  • Easy integration that requires minimal changes to existing code
  • Comprehensive monitoring that provides insights into model performance

This system addresses the specific requirements you outlined:

  • MSE-based accuracy calculation
  • Training at each inference using last prediction vs. current outcome
  • Separate accuracy tracking for up to 6 last predictions per timeframe
  • Models know which timeframe they're predicting on
  • Hourly multi-timeframe inference (4 predictions per hour)
  • Integration with existing 1-5 second inference frequency