
# Trading System Training Fix Implementation

**Date:** September 30, 2025
**Status:** In Progress


## Critical Issues Identified

### 1. Division by Zero (FIXED)

**Problem:** The trading executor crashed when the price was 0 or otherwise invalid.
**Solution:** Added price validation before any division in `core/trading_executor.py`:

```python
if current_price <= 0:
    logger.error(f"Invalid price {current_price} for {symbol}")
    return False
```
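
In context, the guard sits at the top of the execution path, before any size/price division. A sketch only; the `_execute_signal` and `_submit_order` names are illustrative, not the actual methods:

```python
def _execute_signal(self, symbol: str, usd_size: float, current_price: float) -> bool:
    """Execute a signal only when the quoted price is usable."""
    if current_price <= 0:
        logger.error(f"Invalid price {current_price} for {symbol}")
        return False
    quantity = usd_size / current_price      # safe: current_price is validated above
    return self._submit_order(symbol, quantity)  # hypothetical order helper
```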

### 2. Mock Predictions (FIXED)

**Problem:** The system fell back to "mock predictions" when the training system was unavailable (a policy violation).
**Solution:** Removed the mock fallback; the system now fails gracefully instead of fabricating predictions:

```python
logger.error("CRITICAL: Enhanced training system not available - predictions disabled. NEVER use mock data.")
```
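
As a sketch of what "fails gracefully" means here (the `_get_training_predictions` method and `self.training_system` attribute are illustrative names, not necessarily the real ones in the codebase):

```python
def _get_training_predictions(self, symbol: str):
    """Return real model predictions, or nothing at all - never synthetic data."""
    if not self.training_system:
        logger.error(
            "CRITICAL: Enhanced training system not available - "
            "predictions disabled. NEVER use mock data."
        )
        return None  # Callers must treat None as "no signal", never as a tradable prediction
    return self.training_system.get_predictions(symbol)
```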

### 3. Torch Import (ALREADY FIXED)

**Problem:** "cannot access local variable 'torch'" error when the PyTorch import failed.
**Status:** The module already falls back to a `None` placeholder when the import fails.
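
The pattern being referred to is the usual guarded import, roughly as follows (a sketch; the exact module and log message may differ):

```python
import logging

logger = logging.getLogger(__name__)

try:
    import torch
except ImportError:
    # Module-level placeholder so later "if torch is None" checks work instead of
    # raising "cannot access local variable 'torch'".
    torch = None
    logger.warning("PyTorch not available - torch-dependent features are disabled")
```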


## Training Loop Issues

### Current State (BROKEN)

1. **Immediate Training on Next Tick** (see the short illustration after this list)
   - Training happens on `next_price - current_price` (≈ 0.00)
   - No actual position tracking
   - Rewards are meaningless noise
2. **No Position Close Training**
   - Positions open and close, but no training is triggered
   - Real PnL is calculated but never used for training
   - Models never learn from actual trade outcomes
3. **Manual Trades Only**
   - Only manual trades trigger model training
   - Automated trades don't train the models
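
To make item 1 concrete, an illustrative (made-up) example of why a next-tick reward carries no signal:

```python
# Broken pattern: "reward" is just the price change to the very next tick
current_price = 2650.00
next_price = 2650.01             # one tick later
reward = next_price - current_price
print(f"reward = {reward:.4f}")  # reward = 0.0100 -- spread/latency noise, not a trade outcome
```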

## Proper Training Loop Implementation

Required components:

### 1. Signal-Position Linking

```python
from datetime import datetime


class SignalPositionTracker:
    """Links trading signals to positions for outcome-based training"""

    def __init__(self):
        self.active_trades = {}  # position_id -> signal_data

    def register_signal(self, signal_id, signal_data, position_id):
        """Store signal context when position opens"""
        self.active_trades[position_id] = {
            'signal_id': signal_id,
            'signal': signal_data,
            'entry_time': datetime.now(),
            'market_state': signal_data.get('market_state'),
            'models_used': {
                'cnn': signal_data.get('cnn_contribution', 0),
                'dqn': signal_data.get('dqn_contribution', 0),
                'cob_rl': signal_data.get('cob_contribution', 0)
            }
        }

    def get_signal_for_position(self, position_id):
        """Retrieve and remove the stored signal when the position closes"""
        return self.active_trades.pop(position_id, None)
```
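
A usage sketch of the tracker's lifecycle (IDs and signal fields below are illustrative):

```python
tracker = SignalPositionTracker()

# When a signal is executed and a position is opened:
tracker.register_signal(
    signal_id="sig-001",
    signal_data={
        'market_state': {'price': 2650.0},   # illustrative market snapshot
        'cnn_contribution': 0.35,
        'dqn_contribution': 0.45,
        'cob_contribution': 0.20,
    },
    position_id="pos-001",
)

# When that position later closes:
signal_data = tracker.get_signal_for_position("pos-001")  # stored context, or None if unknown
```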

### 2. Position Close Hook

```python
# In core/trading_executor.py after trade_record is created:

def _on_position_close(self, trade_record, position):
    """Called when position closes - trigger training"""

    # Get original signal
    signal_data = self.position_tracker.get_signal_for_position(position.id)
    if not signal_data:
        logger.warning(f"No signal found for position {position.id}")
        return

    # Calculate comprehensive reward
    reward = self._calculate_training_reward(trade_record, signal_data)

    # Train all models that contributed
    if self.orchestrator:
        self.orchestrator.train_on_trade_outcome(
            signal_data=signal_data,
            trade_record=trade_record,
            reward=reward
        )
```
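
For this hook to run, the executor's close path has to call it once the trade record exists. A sketch under the assumption that a close method and `self.position_tracker` exist; the `_close_position` and `_create_trade_record` names are hypothetical:

```python
def _close_position(self, symbol, position, exit_price):
    """Close the position, record the trade, then hand it to the training hook."""
    trade_record = self._create_trade_record(position, exit_price)  # hypothetical helper
    self.trade_history.append(trade_record)

    # NEW: trigger outcome-based training on every close (manual or automated)
    self._on_position_close(trade_record, position)
    return trade_record
```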

### 3. Comprehensive Reward Function

```python
# numpy is assumed to be imported as np at module level

def _calculate_training_reward(self, trade_record, signal_data):
    """Calculate sophisticated reward from closed trade"""

    # Base PnL (already includes fees)
    pnl = trade_record.pnl

    # Time penalty (encourage faster trades)
    hold_time_minutes = trade_record.hold_time_seconds / 60
    time_penalty = -0.001 * max(0, hold_time_minutes - 5)  # Penalty after 5min

    # Risk-adjusted reward
    position_risk = trade_record.quantity * trade_record.entry_price / self.balance
    risk_adjusted = pnl / (position_risk + 0.01)

    # Consistency bonus/penalty
    recent_pnls = [t.pnl for t in self.trade_history[-20:]]
    if len(recent_pnls) > 1:
        pnl_std = np.std(recent_pnls)
        consistency = pnl / (pnl_std + 0.001)
    else:
        consistency = 0

    # Win/loss streak adjustment
    if pnl > 0:
        streak_bonus = min(0.1, self.winning_streak * 0.02)
    else:
        streak_bonus = -min(0.2, self.losing_streak * 0.05)

    # Final reward (scaled for model learning)
    final_reward = (
        pnl * 10.0 +               # Base PnL (scaled)
        time_penalty +             # Efficiency
        risk_adjusted * 2.0 +      # Risk management
        consistency * 0.5 +        # Volatility
        streak_bonus               # Consistency
    )

    logger.info(f"REWARD CALC: PnL={pnl:.4f}, Time={time_penalty:.4f}, "
                f"Risk={risk_adjusted:.4f}, Final={final_reward:.4f}")

    return final_reward
```
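
A rough worked example with made-up inputs, just to show the expected magnitude of each term:

```python
# Illustrative inputs, not real trade data
pnl = 0.50                                             # net PnL in USD (fees included)
time_penalty = -0.001 * max(0, 10 - 5)                 # 10-minute hold -> -0.005
position_risk = (0.01 * 2500.0) / 500.0                # notional / balance = 0.05
risk_adjusted = pnl / (position_risk + 0.01)           # ~8.33
consistency = pnl / (0.40 + 0.001)                     # recent PnL std assumed 0.40 -> ~1.25
streak_bonus = min(0.1, 3 * 0.02)                      # 3-trade winning streak -> 0.06

final_reward = pnl * 10.0 + time_penalty + risk_adjusted * 2.0 + consistency * 0.5 + streak_bonus
print(round(final_reward, 2))                          # ~22.35
```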

### 4. Multi-Model Training

```python
# In core/orchestrator.py

def train_on_trade_outcome(self, signal_data, trade_record, reward):
    """Train all models that contributed to the signal"""

    market_state = signal_data.get('market_state')
    action = self._action_to_index(trade_record.side)  # BUY=0, SELL=1

    # Train CNN
    if self.cnn_model and signal_data['models_used']['cnn'] > 0:
        weight = signal_data['models_used']['cnn']
        self._train_cnn_on_outcome(market_state, action, reward, weight)
        logger.info(f"CNN trained with weight {weight:.2f}")

    # Train DQN
    if self.dqn_agent and signal_data['models_used']['dqn'] > 0:
        weight = signal_data['models_used']['dqn']
        next_state = self._extract_current_state()
        self.dqn_agent.remember(market_state, action, reward * weight, next_state, done=True)

        if len(self.dqn_agent.memory) > 32:
            loss = self.dqn_agent.replay(batch_size=32)
            logger.info(f"DQN trained with weight {weight:.2f}, loss={loss:.4f}")

    # Train COB RL
    if self.cob_rl_model and signal_data['models_used']['cob_rl'] > 0:
        weight = signal_data['models_used']['cob_rl']
        cob_data = signal_data.get('cob_data', {})
        self._train_cob_on_outcome(cob_data, action, reward, weight)
        logger.info(f"COB RL trained with weight {weight:.2f}")

    logger.info(f"✅ TRAINED ALL MODELS: PnL=${trade_record.pnl:.2f}, Reward={reward:.4f}")
```
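
`_action_to_index` is referenced above but not defined in this plan; a minimal sketch of the assumed mapping:

```python
def _action_to_index(self, side: str) -> int:
    """Map a trade side to the model's action index (BUY=0, SELL=1)."""
    mapping = {'BUY': 0, 'LONG': 0, 'SELL': 1, 'SHORT': 1}
    try:
        return mapping[side.upper()]
    except KeyError:
        raise ValueError(f"Unknown trade side: {side!r}")
```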

## Implementation Steps

### Phase 1: Core Infrastructure (Priority 1)

- Fix division by zero
- Remove mock predictions
- Fix torch imports

### Phase 2: Training Loop (Priority 2) - IN PROGRESS

- Create `SignalPositionTracker` class
- Add position close hook in trading_executor
- Implement comprehensive reward function
- Add `train_on_trade_outcome` to orchestrator
- Remove immediate training on next-tick

### Phase 3: Reward Improvements (Priority 3)

- Multi-timeframe rewards (1m, 5m, 15m outcomes)
- Selective training (skip tiny movements; see the sketch after this list)
- Better feature engineering
- Prioritized experience replay
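
A minimal sketch of the selective-training idea, using the trade record fields assumed elsewhere in this plan; the 0.05% threshold is illustrative, not a tuned value:

```python
MIN_ABS_PNL_PCT = 0.0005  # skip trades whose outcome is below 0.05% of notional

def should_train_on_trade(trade_record) -> bool:
    """Only train on trades whose outcome is large enough to carry a learning signal."""
    notional = trade_record.quantity * trade_record.entry_price
    if notional <= 0:
        return False
    return abs(trade_record.pnl) / notional >= MIN_ABS_PNL_PCT
```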

### Phase 4: Testing & Validation

- Test with paper trading
- Validate rewards are non-zero
- Confirm models are training
- Monitor training metrics

## Expected Improvements

| Aspect | Before | After |
|---|---|---|
| Rewards | ~0.00 (next-tick noise) | Real PnL-based (roughly -$5 to +$10) |
| Training trigger | Only on next-tick price | On actual position close |
| Learning | Models see no real outcomes | Models see real trade results |
| Effectiveness | 1/10 | 9/10 |

## Files to Modify

1. `core/trading_executor.py`
   - Add position close hook
   - Create `SignalPositionTracker`
   - Implement reward calculation
2. `core/orchestrator.py`
   - Add `train_on_trade_outcome` method
   - Implement multi-model training
3. `web/clean_dashboard.py`
   - Remove immediate training
   - Add signal registration on execution
   - Link signals to positions
4. `core/training_integration.py` (optional)
   - May need updates for consistency

## Monitoring & Validation

### Log Messages to Watch

```text
✅ TRAINED ALL MODELS: PnL=$2.35, Reward=25.40
REWARD CALC: PnL=0.0235, Time=-0.002, Risk=1.15, Final=25.40
CNN trained with weight 0.35
DQN trained with weight 0.45, loss=0.0123
COB RL trained with weight 0.20
```

### Metrics to Track

- Average reward per trade (should be >> 0.01)
- Training frequency (should match trade close frequency)
- Model convergence (loss decreasing over time)
- Win rate improvement (should increase with training)
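
A small sketch of how these could be computed from the executor's recent history (field names follow the assumptions used earlier in this plan):

```python
def training_health(trade_history, rewards, window=100):
    """Summarize recent training health from closed trades and their rewards."""
    closed = trade_history[-window:]
    recent_rewards = rewards[-window:]
    wins = sum(1 for t in closed if t.pnl > 0)
    return {
        'avg_reward': sum(recent_rewards) / max(1, len(recent_rewards)),  # should be >> 0.01
        'win_rate': wins / max(1, len(closed)),
        'trained_per_close': len(recent_rewards) / max(1, len(closed)),   # should be ~1.0
    }
```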

## Next Steps

1. Implement `SignalPositionTracker`
2. Add position close hook
3. Create reward calculation
4. Test with 10 manual trades
5. Validate rewards are meaningful
6. Deploy to automated trading

**Status:** Phase 1 Complete, Phase 2 In Progress