training - changed incentive to penalize wrong HOLD
HOLD_AVOIDANCE_INCENTIVE_SYSTEM.md (new file, 146 lines)

@@ -0,0 +1,146 @@
# HOLD Avoidance Incentive System - Complete

## Problem Identified

The model was getting stuck in HOLD mode because:

1. **HOLD is "safe"** - it never gets penalized for wrong predictions
2. **No incentive for action-taking** - only realized profit/loss matters
3. **Missing opportunity cost** - no penalty for missing profitable moves
4. **Overconfident HOLD** - the model becomes confident in doing nothing
## Solution: Future Price Alignment Reward System

### 1. ✅ Future Price Alignment Rewards

**Incentivizes actions that align with future price movement:**
```python
# price_change_pct = future price move in percent (exit vs entry price)

# BUY actions
if action == 'BUY':
    if price_change_pct > 0.1:        # price went up
        reward += min(price_change_pct / 100, 0.05)       # up to 5% bonus
    elif price_change_pct < -0.1:     # price went down - wrong action
        reward += max(price_change_pct / 100, -0.05)      # up to 5% penalty

# SELL actions
elif action == 'SELL':
    if price_change_pct < -0.1:       # price went down - correct action
        reward += min(abs(price_change_pct) / 100, 0.05)  # up to 5% bonus
    elif price_change_pct > 0.1:      # price went up - wrong action
        reward -= min(price_change_pct / 100, 0.05)       # up to 5% penalty

# HOLD actions
elif action == 'HOLD':
    if abs(price_change_pct) < 0.5:   # price stayed relatively flat - correct
        reward += 0.005               # small bonus
    else:                             # significant move missed
        reward -= min(abs(price_change_pct) / 100, 0.1)   # missed-opportunity penalty, up to 10%
```
### 2. ✅ Action Diversity Penalty

**Discourages excessive HOLD actions:**
```python
# Every HOLD action gets small constant penalty
if action == 'HOLD':
    reward -= 0.005  # Encourages action-taking
```
### 3. ✅ Missed Opportunity Penalties

**Penalizes HOLD when significant price moves occur:**
```python
if action == 'HOLD' and abs(price_change_pct) > 0.5:   # move of more than 0.5%
    penalty = -min(abs(price_change_pct) / 100, 0.1)   # up to 10% penalty
    reward += penalty
```
### 4. ✅ Confidence-Based Adjustments

**Higher penalties for overconfident wrong predictions:**
```python
# alignment_reward is the signed alignment value computed above
if confidence > 0.8 and alignment_reward < 0:    # high confidence, wrong prediction
    alignment_reward *= (1 + confidence)         # amplify penalty for overconfident mistakes
elif confidence > 0.8 and alignment_reward > 0:  # high confidence, correct prediction
    alignment_reward *= 1.2                      # small bonus for confident correct predictions
```
## Reward Structure Examples

### Scenario 1: BUY before 2% price increase
- **Base reward**: +2% (profit)
- **Alignment bonus**: +2% (correct direction)
- **Total**: +4% (strong positive reinforcement)

### Scenario 2: SELL before 2% price increase (wrong)
- **Base reward**: -2% (loss)
- **Alignment penalty**: -2% (wrong direction)
- **Confidence penalty**: roughly an extra -1.8% (the alignment penalty is amplified ×1.9 at 90% confidence)
- **Total**: about -5.8% (strong negative reinforcement)

### Scenario 3: HOLD during 3% price increase (missed opportunity)
- **Base reward**: 0% (no trade)
- **Missed opportunity**: -3% (could have profited)
- **Diversity penalty**: -0.5% (discourages HOLD)
- **Total**: -3.5% (teaches to take action)

### Scenario 4: HOLD during 0.2% price change (correct)
- **Base reward**: 0% (no trade)
- **Correct HOLD bonus**: +0.5% (price stayed flat)
- **Diversity penalty**: -0.5% (constant HOLD penalty)
- **Total**: 0% (neutral, but not penalized)
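
As a sanity check, here is a minimal standalone sketch that reproduces these scenario totals. The `alignment` helper name is hypothetical; it mirrors the rules above, with rewards expressed as fractions (0.02 = 2%):

```python
def alignment(action: str, move_pct: float, confidence: float = 0.5) -> float:
    """Alignment + diversity + confidence adjustments, as reward fractions."""
    r = 0.0
    if action == 'BUY':
        if move_pct > 0.1:
            r = min(move_pct / 100, 0.05)
        elif move_pct < -0.1:
            r = max(move_pct / 100, -0.05)
    elif action == 'SELL':
        if move_pct < -0.1:
            r = min(abs(move_pct) / 100, 0.05)
        elif move_pct > 0.1:
            r = -min(move_pct / 100, 0.05)
    elif action == 'HOLD':
        r = 0.005 if abs(move_pct) < 0.5 else -min(abs(move_pct) / 100, 0.1)
        r -= 0.005  # diversity penalty applied to every HOLD
    if confidence > 0.8:
        r *= (1 + confidence) if r < 0 else 1.2  # confidence adjustment
    return r

print(0.02 + alignment('BUY', 2.0))         # Scenario 1: +0.04  (+4%)
print(-0.02 + alignment('SELL', 2.0, 0.9))  # Scenario 2: about -0.058 (-5.8%)
print(0.0 + alignment('HOLD', 3.0))         # Scenario 3: -0.035 (-3.5%)
print(0.0 + alignment('HOLD', 0.2))         # Scenario 4:  0.0
```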
## Expected Behavioral Changes

### 1. Reduced HOLD Bias
- **Before**: Model defaults to HOLD (safe option)
- **After**: Model considers opportunity cost of inaction

### 2. Better Action Timing
- **Before**: Random BUY/SELL timing
- **After**: Actions align with future price movements

### 3. Confidence Calibration
- **Before**: Overconfident in HOLD decisions
- **After**: Penalized for overconfident wrong predictions

### 4. Opportunity Recognition
- **Before**: Ignores profitable opportunities
- **After**: Learns to recognize and act on price movements
## Implementation Details

### Training Data Enhancement
Each training sample now includes (see the example record below):
- `entry_price` and `exit_price` for alignment calculation
- `confidence` level for confidence-based adjustments
- Future price movement analysis
- Opportunity cost calculations
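
For illustration, a training record carrying these fields might look like the following sketch. The field names match those read by `_calculate_future_price_alignment_reward` in the diff below; the values themselves are hypothetical:

```python
# Hypothetical training record (values are illustrative only)
sample = {
    'action': 'SELL',         # action taken at entry
    'entry_price': 3120.50,   # price when the action was taken
    'exit_price': 3089.20,    # price at the future evaluation point (about a -1% move)
    'profit_loss_pct': 1.0,   # realized P/L of the resulting trade, in percent
    'confidence': 0.87,       # model confidence at decision time
}
```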
### Reward Calculation Flow
1. **Calculate base reward** from actual profit/loss
2. **Add alignment reward** based on action vs future price
3. **Apply diversity penalty** for HOLD actions
4. **Adjust for confidence** level and correctness
5. **Combine all components** into the final reward (see the sketch below)
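
A minimal sketch of how these steps compose, mirroring the training-loop change in the adapter diff further down. The `combine_rewards` wrapper name is hypothetical; the real code inlines these lines:

```python
def combine_rewards(adapter, data: dict) -> float:
    """Sketch of the reward composition used when filling the replay buffer."""
    # Step 1: base reward from realized profit/loss (percent -> fraction)
    base_reward = data['profit_loss_pct'] / 100.0 if data.get('profit_loss_pct') else 0.0

    # Steps 2-4: alignment, HOLD diversity penalty and confidence adjustment
    # are all handled inside the helper added by this commit
    alignment_reward = adapter._calculate_future_price_alignment_reward(data)

    # Step 5: combine into the final reward
    return base_reward + alignment_reward
```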
### Logging and Debugging
```
Alignment reward: SELL with +1.50% move, conf=0.93 = -0.0245
Alignment reward: BUY with +2.10% move, conf=0.87 = +0.0252
Alignment reward: HOLD with +3.20% move, conf=0.65 = -0.0370
```
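
These lines come from a `logger.debug` call (visible in the diff below), so they only appear when debug logging is enabled. A minimal way to turn it on, assuming the standard `logging` module is used; the logger name below is an assumption:

```python
import logging

# Show DEBUG-level output, including the alignment-reward lines above.
logging.basicConfig(level=logging.DEBUG)
# To limit output to the adapter, target its module logger by name
# (hypothetical name; substitute the adapter's actual module name).
logging.getLogger('real_training_adapter').setLevel(logging.DEBUG)
```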
## Expected Results

### Short Term (1-2 hours)
- ✅ **Reduced HOLD frequency** - Model takes more actions
- ✅ **Better action timing** - Actions align with price movements
- ✅ **Improved confidence** - Less overconfident HOLD decisions

### Medium Term (1-2 days)
- ✅ **Higher profitability** - Better action selection
- ✅ **Reduced missed opportunities** - Acts on significant moves
- ✅ **Balanced action distribution** - Not stuck in HOLD mode

### Long Term (1+ weeks)
- ✅ **Adaptive behavior** - Learns market patterns
- ✅ **Risk-adjusted actions** - Considers opportunity costs
- ✅ **Optimal action frequency** - Right balance of action vs patience

The system now provides strong incentives for taking profitable actions while penalizing both wrong actions AND missed opportunities, breaking the HOLD bias!
@@ -1640,8 +1640,14 @@ class RealTrainingAdapter:
 
             # Add experiences to replay buffer
             for data in training_data:
-                # Calculate reward from profit/loss
-                reward = data['profit_loss_pct'] / 100.0 if data.get('profit_loss_pct') else 0.0
+                # Calculate reward from profit/loss + future price alignment
+                base_reward = data['profit_loss_pct'] / 100.0 if data.get('profit_loss_pct') else 0.0
+
+                # FUTURE PRICE ALIGNMENT REWARD: Incentivize actions that align with future price movement
+                alignment_reward = self._calculate_future_price_alignment_reward(data)
+
+                # Combine rewards: base profit + alignment bonus
+                reward = base_reward + alignment_reward
 
                 # Add to memory if agent has remember method
                 if hasattr(agent, 'remember'):
@@ -1664,6 +1670,68 @@ class RealTrainingAdapter:
             # Accuracy calculated from actual training metrics, not synthetic
             session.accuracy = None  # Will be set by training loop if available
 
+    def _calculate_future_price_alignment_reward(self, data):
+        """
+        Calculate reward based on how well the action aligns with future price movement
+
+        This incentivizes:
+        - BUY actions when price goes up
+        - SELL actions when price goes down
+        - HOLD actions when price stays flat
+        - Penalizes wrong actions (BUY before price drop, SELL before price rise)
+        """
+        try:
+            action = data.get('action', 'HOLD')
+            entry_price = data.get('entry_price', 0)
+            exit_price = data.get('exit_price', 0)
+
+            if not entry_price or not exit_price or entry_price == exit_price:
+                return 0.0  # No alignment reward if no price movement
+
+            # Calculate actual price movement
+            price_change_pct = ((exit_price - entry_price) / entry_price) * 100
+
+            # Define alignment rewards
+            alignment_reward = 0.0
+
+            if action == 'BUY':
+                if price_change_pct > 0.1:  # Price went up significantly
+                    alignment_reward = min(price_change_pct / 100, 0.05)  # Up to 5% bonus
+                elif price_change_pct < -0.1:  # Price went down - wrong action
+                    alignment_reward = max(price_change_pct / 100, -0.05)  # Up to 5% penalty
+
+            elif action == 'SELL':
+                if price_change_pct < -0.1:  # Price went down - correct action
+                    alignment_reward = min(abs(price_change_pct) / 100, 0.05)  # Up to 5% bonus
+                elif price_change_pct > 0.1:  # Price went up - wrong action
+                    alignment_reward = -min(price_change_pct / 100, 0.05)  # Up to 5% penalty
+
+            elif action == 'HOLD':
+                if abs(price_change_pct) < 0.5:  # Price stayed relatively flat - correct
+                    alignment_reward = 0.005  # Small bonus for correct HOLD (reduced)
+                else:  # Price moved significantly - missed opportunity
+                    # Larger penalty for missing significant moves
+                    missed_opportunity_penalty = -min(abs(price_change_pct) / 100, 0.1)  # Up to 10% penalty
+                    alignment_reward = missed_opportunity_penalty
+
+                # DIVERSITY PENALTY: Discourage excessive HOLD actions
+                # Add small penalty to HOLD to encourage action-taking
+                alignment_reward -= 0.005  # Small constant penalty for HOLD
+
+            # CONFIDENCE ADJUSTMENT: Higher penalties for confident wrong predictions
+            confidence = data.get('confidence', 0.5)
+            if confidence > 0.8 and alignment_reward < 0:  # High confidence but wrong
+                alignment_reward *= (1 + confidence)  # Amplify penalty for overconfident mistakes
+            elif confidence > 0.8 and alignment_reward > 0:  # High confidence and right
+                alignment_reward *= 1.2  # Small bonus for confident correct predictions
+
+            logger.debug(f"Alignment reward: {action} with {price_change_pct:.2f}% move, conf={confidence:.2f} = {alignment_reward:.4f}")
+            return alignment_reward
+
+        except Exception as e:
+            logger.debug(f"Error calculating alignment reward: {e}")
+            return 0.0
+
     def _build_state_from_data(self, data: Dict, agent: Any) -> List[float]:
         """Build proper state representation from training data"""
         try:
@@ -3177,7 +3245,10 @@ class RealTrainingAdapter:
 
             # Similar to DQN training
             for data in training_data:
-                reward = data['profit_loss_pct'] / 100.0 if data.get('profit_loss_pct') else 0.0
+                # Calculate reward from profit/loss + future price alignment
+                base_reward = data['profit_loss_pct'] / 100.0 if data.get('profit_loss_pct') else 0.0
+                alignment_reward = self._calculate_future_price_alignment_reward(data)
+                reward = base_reward + alignment_reward
 
                 if hasattr(agent, 'remember'):
                     state = [0.0] * agent.state_size if hasattr(agent, 'state_size') else []