training - changed incentive to penalize wrong HOLD
HOLD_AVOIDANCE_INCENTIVE_SYSTEM.md (new file, 146 lines)

@@ -0,0 +1,146 @@
# HOLD Avoidance Incentive System - Complete

## Problem Identified

The model was getting stuck in HOLD mode because:

1. **HOLD is "safe"** - it never gets penalized for wrong predictions
2. **No incentive for action-taking** - only realized profit/loss matters
3. **Missing opportunity cost** - no penalty for missing profitable moves
4. **Overconfident HOLD** - the model becomes confident in doing nothing
## Solution: Future Price Alignment Reward System

### 1. ✅ Future Price Alignment Rewards

**Incentivizes actions that align with future price movement:**
```python
# price_change_pct = future price move in percent (exit vs entry price)

# BUY actions
if action == 'BUY':
    if price_change_pct > 0.1:        # price went up
        reward += min(price_change_pct / 100, 0.05)       # up to 5% bonus
    elif price_change_pct < -0.1:     # price went down - wrong action
        reward += max(price_change_pct / 100, -0.05)      # up to 5% penalty

# SELL actions
elif action == 'SELL':
    if price_change_pct < -0.1:       # price went down - correct action
        reward += min(abs(price_change_pct) / 100, 0.05)  # up to 5% bonus
    elif price_change_pct > 0.1:      # price went up - wrong action
        reward -= min(price_change_pct / 100, 0.05)       # up to 5% penalty

# HOLD actions
elif action == 'HOLD':
    if abs(price_change_pct) < 0.5:   # price stayed relatively flat - correct
        reward += 0.005               # small bonus
    else:                             # significant move missed
        reward -= min(abs(price_change_pct) / 100, 0.1)   # missed-opportunity penalty, up to 10%
```
### 2. ✅ Action Diversity Penalty

**Discourages excessive HOLD actions:**
```python
# Every HOLD action gets small constant penalty
if action == 'HOLD':
    reward -= 0.005  # Encourages action-taking
```
### 3. ✅ Missed Opportunity Penalties

**Penalizes HOLD when significant price moves occur:**
```python
if action == 'HOLD' and abs(price_change_pct) > 0.5:   # move of more than 0.5%
    penalty = -min(abs(price_change_pct) / 100, 0.1)   # up to 10% penalty
    reward += penalty
```
### 4. ✅ Confidence-Based Adjustments

**Higher penalties for overconfident wrong predictions:**
```python
# alignment_reward is the signed alignment value computed above
if confidence > 0.8 and alignment_reward < 0:    # high confidence, wrong prediction
    alignment_reward *= (1 + confidence)         # amplify penalty for overconfident mistakes
elif confidence > 0.8 and alignment_reward > 0:  # high confidence, correct prediction
    alignment_reward *= 1.2                      # small bonus for confident correct predictions
```
## Reward Structure Examples

### Scenario 1: BUY before 2% price increase
- **Base reward**: +2% (profit)
- **Alignment bonus**: +2% (correct direction)
- **Total**: +4% (strong positive reinforcement)

### Scenario 2: SELL before 2% price increase (wrong)
- **Base reward**: -2% (loss)
- **Alignment penalty**: -2% (wrong direction)
- **Confidence penalty**: roughly an extra -1.8% (the alignment penalty is amplified ×1.9 at 90% confidence)
- **Total**: about -5.8% (strong negative reinforcement)

### Scenario 3: HOLD during 3% price increase (missed opportunity)
- **Base reward**: 0% (no trade)
- **Missed opportunity**: -3% (could have profited)
- **Diversity penalty**: -0.5% (discourages HOLD)
- **Total**: -3.5% (teaches to take action)

### Scenario 4: HOLD during 0.2% price change (correct)
- **Base reward**: 0% (no trade)
- **Correct HOLD bonus**: +0.5% (price stayed flat)
- **Diversity penalty**: -0.5% (constant HOLD penalty)
- **Total**: 0% (neutral, but not penalized)
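
As a sanity check, here is a minimal standalone sketch that reproduces these scenario totals. The `alignment` helper name is hypothetical; it mirrors the rules above, with rewards expressed as fractions (0.02 = 2%):

```python
def alignment(action: str, move_pct: float, confidence: float = 0.5) -> float:
    """Alignment + diversity + confidence adjustments, as reward fractions."""
    r = 0.0
    if action == 'BUY':
        if move_pct > 0.1:
            r = min(move_pct / 100, 0.05)
        elif move_pct < -0.1:
            r = max(move_pct / 100, -0.05)
    elif action == 'SELL':
        if move_pct < -0.1:
            r = min(abs(move_pct) / 100, 0.05)
        elif move_pct > 0.1:
            r = -min(move_pct / 100, 0.05)
    elif action == 'HOLD':
        r = 0.005 if abs(move_pct) < 0.5 else -min(abs(move_pct) / 100, 0.1)
        r -= 0.005  # diversity penalty applied to every HOLD
    if confidence > 0.8:
        r *= (1 + confidence) if r < 0 else 1.2  # confidence adjustment
    return r

print(0.02 + alignment('BUY', 2.0))         # Scenario 1: +0.04  (+4%)
print(-0.02 + alignment('SELL', 2.0, 0.9))  # Scenario 2: about -0.058 (-5.8%)
print(0.0 + alignment('HOLD', 3.0))         # Scenario 3: -0.035 (-3.5%)
print(0.0 + alignment('HOLD', 0.2))         # Scenario 4:  0.0
```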
## Expected Behavioral Changes

### 1. Reduced HOLD Bias
- **Before**: Model defaults to HOLD (safe option)
- **After**: Model considers opportunity cost of inaction

### 2. Better Action Timing
- **Before**: Random BUY/SELL timing
- **After**: Actions align with future price movements

### 3. Confidence Calibration
- **Before**: Overconfident in HOLD decisions
- **After**: Penalized for overconfident wrong predictions

### 4. Opportunity Recognition
- **Before**: Ignores profitable opportunities
- **After**: Learns to recognize and act on price movements
## Implementation Details

### Training Data Enhancement
Each training sample now includes (see the example record below):
- `entry_price` and `exit_price` for alignment calculation
- `confidence` level for confidence-based adjustments
- Future price movement analysis
- Opportunity cost calculations
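
For illustration, a training record carrying these fields might look like the following sketch. The field names match those read by `_calculate_future_price_alignment_reward` in the diff below; the values themselves are hypothetical:

```python
# Hypothetical training record (values are illustrative only)
sample = {
    'action': 'SELL',         # action taken at entry
    'entry_price': 3120.50,   # price when the action was taken
    'exit_price': 3089.20,    # price at the future evaluation point (about a -1% move)
    'profit_loss_pct': 1.0,   # realized P/L of the resulting trade, in percent
    'confidence': 0.87,       # model confidence at decision time
}
```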
### Reward Calculation Flow
1. **Calculate base reward** from actual profit/loss
2. **Add alignment reward** based on action vs future price
3. **Apply diversity penalty** for HOLD actions
4. **Adjust for confidence** level and correctness
5. **Combine all components** into the final reward (see the sketch below)
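
A minimal sketch of how these steps compose, mirroring the training-loop change in the adapter diff further down. The `combine_rewards` wrapper name is hypothetical; the real code inlines these lines:

```python
def combine_rewards(adapter, data: dict) -> float:
    """Sketch of the reward composition used when filling the replay buffer."""
    # Step 1: base reward from realized profit/loss (percent -> fraction)
    base_reward = data['profit_loss_pct'] / 100.0 if data.get('profit_loss_pct') else 0.0

    # Steps 2-4: alignment, HOLD diversity penalty and confidence adjustment
    # are all handled inside the helper added by this commit
    alignment_reward = adapter._calculate_future_price_alignment_reward(data)

    # Step 5: combine into the final reward
    return base_reward + alignment_reward
```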
### Logging and Debugging
```
Alignment reward: SELL with +1.50% move, conf=0.93 = -0.0245
Alignment reward: BUY with +2.10% move, conf=0.87 = +0.0252
Alignment reward: HOLD with +3.20% move, conf=0.65 = -0.0370
```
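
These lines come from a `logger.debug` call (visible in the diff below), so they only appear when debug logging is enabled. A minimal way to turn it on, assuming the standard `logging` module is used; the logger name below is an assumption:

```python
import logging

# Show DEBUG-level output, including the alignment-reward lines above.
logging.basicConfig(level=logging.DEBUG)
# To limit output to the adapter, target its module logger by name
# (hypothetical name; substitute the adapter's actual module name).
logging.getLogger('real_training_adapter').setLevel(logging.DEBUG)
```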
## Expected Results

### Short Term (1-2 hours)
- ✅ **Reduced HOLD frequency** - Model takes more actions
- ✅ **Better action timing** - Actions align with price movements
- ✅ **Improved confidence** - Less overconfident HOLD decisions

### Medium Term (1-2 days)
- ✅ **Higher profitability** - Better action selection
- ✅ **Reduced missed opportunities** - Acts on significant moves
- ✅ **Balanced action distribution** - Not stuck in HOLD mode

### Long Term (1+ weeks)
- ✅ **Adaptive behavior** - Learns market patterns
- ✅ **Risk-adjusted actions** - Considers opportunity costs
- ✅ **Optimal action frequency** - Right balance of action vs patience

The system now provides strong incentives for taking profitable actions while penalizing both wrong actions AND missed opportunities, breaking the HOLD bias!
@@ -1640,8 +1640,14 @@ class RealTrainingAdapter:
 
             # Add experiences to replay buffer
             for data in training_data:
-                # Calculate reward from profit/loss
-                reward = data['profit_loss_pct'] / 100.0 if data.get('profit_loss_pct') else 0.0
+                # Calculate reward from profit/loss + future price alignment
+                base_reward = data['profit_loss_pct'] / 100.0 if data.get('profit_loss_pct') else 0.0
+
+                # FUTURE PRICE ALIGNMENT REWARD: Incentivize actions that align with future price movement
+                alignment_reward = self._calculate_future_price_alignment_reward(data)
+
+                # Combine rewards: base profit + alignment bonus
+                reward = base_reward + alignment_reward
 
                 # Add to memory if agent has remember method
                 if hasattr(agent, 'remember'):
@@ -1664,6 +1670,68 @@ class RealTrainingAdapter:
             # Accuracy calculated from actual training metrics, not synthetic
             session.accuracy = None  # Will be set by training loop if available
 
+    def _calculate_future_price_alignment_reward(self, data):
+        """
+        Calculate reward based on how well the action aligns with future price movement
+
+        This incentivizes:
+        - BUY actions when price goes up
+        - SELL actions when price goes down
+        - HOLD actions when price stays flat
+        - Penalizes wrong actions (BUY before price drop, SELL before price rise)
+        """
+        try:
+            action = data.get('action', 'HOLD')
+            entry_price = data.get('entry_price', 0)
+            exit_price = data.get('exit_price', 0)
+
+            if not entry_price or not exit_price or entry_price == exit_price:
+                return 0.0  # No alignment reward if no price movement
+
+            # Calculate actual price movement
+            price_change_pct = ((exit_price - entry_price) / entry_price) * 100
+
+            # Define alignment rewards
+            alignment_reward = 0.0
+
+            if action == 'BUY':
+                if price_change_pct > 0.1:  # Price went up significantly
+                    alignment_reward = min(price_change_pct / 100, 0.05)  # Up to 5% bonus
+                elif price_change_pct < -0.1:  # Price went down - wrong action
+                    alignment_reward = max(price_change_pct / 100, -0.05)  # Up to 5% penalty
+
+            elif action == 'SELL':
+                if price_change_pct < -0.1:  # Price went down - correct action
+                    alignment_reward = min(abs(price_change_pct) / 100, 0.05)  # Up to 5% bonus
+                elif price_change_pct > 0.1:  # Price went up - wrong action
+                    alignment_reward = -min(price_change_pct / 100, 0.05)  # Up to 5% penalty
+
+            elif action == 'HOLD':
+                if abs(price_change_pct) < 0.5:  # Price stayed relatively flat - correct
+                    alignment_reward = 0.005  # Small bonus for correct HOLD (reduced)
+                else:  # Price moved significantly - missed opportunity
+                    # Larger penalty for missing significant moves
+                    missed_opportunity_penalty = -min(abs(price_change_pct) / 100, 0.1)  # Up to 10% penalty
+                    alignment_reward = missed_opportunity_penalty
+
+                # DIVERSITY PENALTY: Discourage excessive HOLD actions
+                # Add small penalty to HOLD to encourage action-taking
+                alignment_reward -= 0.005  # Small constant penalty for HOLD
+
+            # CONFIDENCE ADJUSTMENT: Higher penalties for confident wrong predictions
+            confidence = data.get('confidence', 0.5)
+            if confidence > 0.8 and alignment_reward < 0:  # High confidence but wrong
+                alignment_reward *= (1 + confidence)  # Amplify penalty for overconfident mistakes
+            elif confidence > 0.8 and alignment_reward > 0:  # High confidence and right
+                alignment_reward *= 1.2  # Small bonus for confident correct predictions
+
+            logger.debug(f"Alignment reward: {action} with {price_change_pct:.2f}% move, conf={confidence:.2f} = {alignment_reward:.4f}")
+            return alignment_reward
+
+        except Exception as e:
+            logger.debug(f"Error calculating alignment reward: {e}")
+            return 0.0
+
     def _build_state_from_data(self, data: Dict, agent: Any) -> List[float]:
         """Build proper state representation from training data"""
         try:
@@ -3177,7 +3245,10 @@ class RealTrainingAdapter:
 
             # Similar to DQN training
             for data in training_data:
-                reward = data['profit_loss_pct'] / 100.0 if data.get('profit_loss_pct') else 0.0
+                # Calculate reward from profit/loss + future price alignment
+                base_reward = data['profit_loss_pct'] / 100.0 if data.get('profit_loss_pct') else 0.0
+                alignment_reward = self._calculate_future_price_alignment_reward(data)
+                reward = base_reward + alignment_reward
 
                 if hasattr(agent, 'remember'):
                     state = [0.0] * agent.state_size if hasattr(agent, 'state_size') else []