training - changed incentive to penalize wrong HOLD

2025-12-10 16:07:51 +02:00
parent 5349e23563
commit 71153c04c2
2 changed files with 220 additions and 3 deletions
--- a/HOLD_AVOIDANCE_INCENTIVE_SYSTEM.md
+++ b/HOLD_AVOIDANCE_INCENTIVE_SYSTEM.md
@@ -0,0 +1,146 @@
 # HOLD Avoidance Incentive System - Complete
 ## Problem Identified
 The model was getting stuck in HOLD mode because:
 1. **HOLD is "safe"** - never gets penalized for wrong predictions
 2. **No incentive for action-taking** - only profit/loss matters
 3. **Missing opportunity cost** - no penalty for missing profitable moves
 4. **Overconfident HOLD** - model becomes confident in doing nothing
 ## Solution: Future Price Alignment Reward System
 ### 1. ✅ Future Price Alignment Rewards
 **Incentivizes actions that align with future price movement:**
 ```python
 # BUY Actions
 if price_goes_up > 0.1%:
    reward += up_to_5%_bonus
 elif price_goes_down < -0.1%:
    reward -= up_to_5%_penalty
 # SELL Actions  
 if price_goes_down < -0.1%:
    reward += up_to_5%_bonus
 elif price_goes_up > 0.1%:
    reward -= up_to_5%_penalty
 # HOLD Actions
 if price_stays_flat < 0.5%:
    reward += small_bonus
 else:
    reward -= missed_opportunity_penalty
 ```
 ### 2. ✅ Action Diversity Penalty
 **Discourages excessive HOLD actions:**
 ```python
 # Every HOLD action gets small constant penalty
 if action == 'HOLD':
    reward -= 0.005  # Encourages action-taking
 ```
 ### 3. ✅ Missed Opportunity Penalties
 **Penalizes HOLD when significant price moves occur:**
 ```python
 if action == 'HOLD' and abs(price_change) > 0.5%:
    penalty = -min(abs(price_change) / 100, 0.1)  # Up to 10% penalty
    reward += penalty
 ```
 ### 4. ✅ Confidence-Based Adjustments
 **Higher penalties for overconfident wrong predictions:**
 ```python
 if confidence > 0.8 and wrong_prediction:
    penalty *= (1 + confidence)  # Amplify penalty for overconfident mistakes
 elif confidence > 0.8 and correct_prediction:
    reward *= 1.2  # Small bonus for confident correct predictions
 ```
 ## Reward Structure Examples
 ### Scenario 1: BUY before 2% price increase
 - **Base reward**: +2% (profit)
 - **Alignment bonus**: +2% (correct direction)
 - **Total**: +4% (strong positive reinforcement)
 ### Scenario 2: SELL before 2% price increase (wrong)
 - **Base reward**: -2% (loss)
 - **Alignment penalty**: -2% (wrong direction)
 - **Confidence penalty**: -1% (if 90% confident)
 - **Total**: -5% (strong negative reinforcement)
 ### Scenario 3: HOLD during 3% price increase (missed opportunity)
 - **Base reward**: 0% (no trade)
 - **Missed opportunity**: -3% (could have profited)
 - **Diversity penalty**: -0.5% (discourages HOLD)
 - **Total**: -3.5% (teaches to take action)
 ### Scenario 4: HOLD during 0.2% price change (correct)
 - **Base reward**: 0% (no trade)
 - **Correct HOLD bonus**: +0.5% (price stayed flat)
 - **Diversity penalty**: -0.5% (constant HOLD penalty)
 - **Total**: 0% (neutral, but not penalized)
 ## Expected Behavioral Changes
 ### 1. Reduced HOLD Bias
 - **Before**: Model defaults to HOLD (safe option)
 - **After**: Model considers opportunity cost of inaction
 ### 2. Better Action Timing
 - **Before**: Random BUY/SELL timing
 - **After**: Actions align with future price movements
 ### 3. Confidence Calibration
 - **Before**: Overconfident in HOLD decisions
 - **After**: Penalized for overconfident wrong predictions
 ### 4. Opportunity Recognition
 - **Before**: Ignores profitable opportunities
 - **After**: Learns to recognize and act on price movements
 ## Implementation Details
 ### Training Data Enhancement
 Each training sample now includes:
 - `entry_price` and `exit_price` for alignment calculation
 - `confidence` level for confidence-based adjustments
 - Future price movement analysis
 - Opportunity cost calculations
 ### Reward Calculation Flow
 1. **Calculate base reward** from actual profit/loss
 2. **Add alignment reward** based on action vs future price
 3. **Apply diversity penalty** for HOLD actions
 4. **Adjust for confidence** level and correctness
 5. **Combine all components** for final reward
 ### Logging and Debugging
 ```
 Alignment reward: SELL with +1.50% move, conf=0.93 = -0.0245
 Alignment reward: BUY with +2.10% move, conf=0.87 = +0.0252
 Alignment reward: HOLD with +3.20% move, conf=0.65 = -0.0370
 ```
 ## Expected Results
 ### Short Term (1-2 hours)
 - ✅ **Reduced HOLD frequency** - Model takes more actions
 - ✅ **Better action timing** - Actions align with price movements
 - ✅ **Improved confidence** - Less overconfident HOLD decisions
 ### Medium Term (1-2 days)
 - ✅ **Higher profitability** - Better action selection
 - ✅ **Reduced missed opportunities** - Acts on significant moves
 - ✅ **Balanced action distribution** - Not stuck in HOLD mode
 ### Long Term (1+ weeks)
 - ✅ **Adaptive behavior** - Learns market patterns
 - ✅ **Risk-adjusted actions** - Considers opportunity costs
 - ✅ **Optimal action frequency** - Right balance of action vs patience
 The system now provides strong incentives for taking profitable actions while penalizing both wrong actions AND missed opportunities, breaking the HOLD bias!
--- a/core/real_training_adapter.py
+++ b/core/real_training_adapter.py
@@ -1640,8 +1640,14 @@ class RealTrainingAdapter:
        # Add experiences to replay buffer
        for data in training_data:
-            # Calculate reward from profit/loss
+            # Calculate reward from profit/loss + future price alignment
-            reward = data['profit_loss_pct'] / 100.0 if data.get('profit_loss_pct') else 0.0
+            base_reward = data['profit_loss_pct'] / 100.0 if data.get('profit_loss_pct') else 0.0
            # FUTURE PRICE ALIGNMENT REWARD: Incentivize actions that align with future price movement
            alignment_reward = self._calculate_future_price_alignment_reward(data)
            # Combine rewards: base profit + alignment bonus
            reward = base_reward + alignment_reward
            # Add to memory if agent has remember method
            if hasattr(agent, 'remember'):
@@ -1664,6 +1670,68 @@ class RealTrainingAdapter:
        # Accuracy calculated from actual training metrics, not synthetic
        session.accuracy = None  # Will be set by training loop if available
    def _calculate_future_price_alignment_reward(self, data):
        """
        Calculate reward based on how well the action aligns with future price movement
        This incentivizes:
        - BUY actions when price goes up
        - SELL actions when price goes down  
        - HOLD actions when price stays flat
        - Penalizes wrong actions (BUY before price drop, SELL before price rise)
        """
        try:
            action = data.get('action', 'HOLD')
            entry_price = data.get('entry_price', 0)
            exit_price = data.get('exit_price', 0)
            if not entry_price or not exit_price or entry_price == exit_price:
                return 0.0  # No alignment reward if no price movement
            # Calculate actual price movement
            price_change_pct = ((exit_price - entry_price) / entry_price) * 100
            # Define alignment rewards
            alignment_reward = 0.0
            if action == 'BUY':
                if price_change_pct > 0.1:  # Price went up significantly
                    alignment_reward = min(price_change_pct / 100, 0.05)  # Up to 5% bonus
                elif price_change_pct < -0.1:  # Price went down - wrong action
                    alignment_reward = max(price_change_pct / 100, -0.05)  # Up to 5% penalty
            elif action == 'SELL':
                if price_change_pct < -0.1:  # Price went down - correct action
                    alignment_reward = min(abs(price_change_pct) / 100, 0.05)  # Up to 5% bonus
                elif price_change_pct > 0.1:  # Price went up - wrong action
                    alignment_reward = -min(price_change_pct / 100, 0.05)  # Up to 5% penalty
            elif action == 'HOLD':
                if abs(price_change_pct) < 0.5:  # Price stayed relatively flat - correct
                    alignment_reward = 0.005  # Small bonus for correct HOLD (reduced)
                else:  # Price moved significantly - missed opportunity
                    # Larger penalty for missing significant moves
                    missed_opportunity_penalty = -min(abs(price_change_pct) / 100, 0.1)  # Up to 10% penalty
                    alignment_reward = missed_opportunity_penalty
                # DIVERSITY PENALTY: Discourage excessive HOLD actions
                # Add small penalty to HOLD to encourage action-taking
                alignment_reward -= 0.005  # Small constant penalty for HOLD
            # CONFIDENCE ADJUSTMENT: Higher penalties for confident wrong predictions
            confidence = data.get('confidence', 0.5)
            if confidence > 0.8 and alignment_reward < 0:  # High confidence but wrong
                alignment_reward *= (1 + confidence)  # Amplify penalty for overconfident mistakes
            elif confidence > 0.8 and alignment_reward > 0:  # High confidence and right
                alignment_reward *= 1.2  # Small bonus for confident correct predictions
            logger.debug(f"Alignment reward: {action} with {price_change_pct:.2f}% move, conf={confidence:.2f} = {alignment_reward:.4f}")
            return alignment_reward
        except Exception as e:
            logger.debug(f"Error calculating alignment reward: {e}")
            return 0.0
    def _build_state_from_data(self, data: Dict, agent: Any) -> List[float]:
        """Build proper state representation from training data"""
        try:
@@ -3177,7 +3245,10 @@ class RealTrainingAdapter:
        # Similar to DQN training
        for data in training_data:
-            reward = data['profit_loss_pct'] / 100.0 if data.get('profit_loss_pct') else 0.0
+            # Calculate reward from profit/loss + future price alignment
            base_reward = data['profit_loss_pct'] / 100.0 if data.get('profit_loss_pct') else 0.0
            alignment_reward = self._calculate_future_price_alignment_reward(data)
            reward = base_reward + alignment_reward
            if hasattr(agent, 'remember'):
                state = [0.0] * agent.state_size if hasattr(agent, 'state_size') else []