training: conviction-aware reward shaping

Dobromir Popov
2025-08-10 13:23:29 +03:00
parent 6861d0f20b
commit b3c5076e37
2 changed files with 69 additions and 10 deletions


@@ -101,4 +101,32 @@ also, adjust our bybit api so we trade with usdt futures - where we can have up
3. we don't calculate the COB imbalance correctly - we have MAs over 4 time windows (see the sketch after this list)
4. we have some more work on the model statistics and overview, but we can focus on that later once we fix the other issues
5. audit and backtest if calculate_williams_pivot_points works correctly. show pivot points on the dash on the 1m candlesticks
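Regarding item 3, here is a minimal sketch of one conventional way to compute order-book imbalance and smooth it with moving averages over several time windows. It assumes COB means the consolidated order book, that imbalance is the signed bid/ask volume ratio in [-1, 1], and that the four window lengths are placeholders; treat it as a reference point for the audit rather than the project's actual formula.

from collections import deque
from dataclasses import dataclass
import time

# Assumed convention: imbalance = (bid_vol - ask_vol) / (bid_vol + ask_vol), in [-1, 1],
# then smoothed with simple moving averages over four time windows (window lengths assumed).
WINDOWS_S = (5, 15, 60, 300)

@dataclass
class CobSnapshot:
    ts: float
    bid_volume: float   # total bid size within the tracked depth
    ask_volume: float   # total ask size within the tracked depth

class CobImbalanceMA:
    def __init__(self, windows_s=WINDOWS_S):
        self.windows_s = windows_s
        self.history = deque()  # (timestamp, imbalance) pairs

    @staticmethod
    def imbalance(snap: CobSnapshot) -> float:
        total = snap.bid_volume + snap.ask_volume
        return 0.0 if total <= 0 else (snap.bid_volume - snap.ask_volume) / total

    def update(self, snap: CobSnapshot) -> dict:
        imb = self.imbalance(snap)
        self.history.append((snap.ts, imb))
        # Drop entries older than the longest window
        cutoff = snap.ts - max(self.windows_s)
        while self.history and self.history[0][0] < cutoff:
            self.history.popleft()
        # One moving average per window, computed over the retained snapshots
        out = {"imbalance": imb}
        for w in self.windows_s:
            vals = [v for ts, v in self.history if ts >= snap.ts - w]
            out[f"ma_{w}s"] = sum(vals) / len(vals) if vals else 0.0
        return out

# Example: a bid-heavy book gives a positive imbalance (0.2 here) across all windows
tracker = CobImbalanceMA()
print(tracker.update(CobSnapshot(ts=time.time(), bid_volume=120.0, ask_volume=80.0)))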
Can we enhance our RL reward/punish to promote closing losing trades and keeping winning ones, taking into account the predicted price direction and conviction? For example, the more an open position is losing, the more we should be biased toward closing it; but if the models predict with high certainty that there will be a big move up, we will be more tolerant of a drawdown. And the opposite: we should be inclined to close winning trades, but keep them as long as the price goes up and we project more upside. Do you think there is a smart way to implement that in the current RL and other training pipelines?
I want this to be part of a proper reward-function bias rather than an algorithmic calculation in post-signal processing, because I prefer that this is a behaviour the model learns and adapts to current conditions without hard boundaries (see the sketch after this note).
THINK REALLY HARD
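For reference, the orchestrator change later in this commit wires exactly this bias into the reward function. The standalone sketch below just replays those shaping terms (loss_norm, tolerance, run_bonus, with the 1% PnL and 2% expected-move caps from the diff) so the trade-off is visible in isolation; it assumes PnL and the expected move are expressed in percent, as the comments in the diff suggest, and the example numbers are purely illustrative.

def conviction_shaping(current_position_pnl: float,
                       conviction: float,
                       expected_move_pct: float) -> float:
    """Reward-shaping term mirroring the orchestrator change in this commit.

    Losses are penalized less when conviction * expected move is high
    (tolerate drawdown ahead of a projected move); gains earn a small
    'let winners run' bonus that grows with conviction.
    """
    conviction = max(0.0, min(1.0, conviction))
    expected_move_norm = max(0.0, min(1.0, abs(expected_move_pct) / 2.0))  # 2% move caps to 1.0
    if current_position_pnl < 0:
        loss_norm = max(0.0, min(1.0, abs(current_position_pnl) / 1.0))    # 1% loss caps to 1.0
        tolerance = 1.0 - min(0.9, conviction * expected_move_norm)
        return -loss_norm * tolerance
    gain_norm = max(0.0, min(1.0, current_position_pnl / 1.0))
    return 0.2 * gain_norm * (0.5 + 0.5 * conviction)

# Same 0.8% drawdown, very different pressure to close depending on conviction:
print(conviction_shaping(-0.8, conviction=0.2, expected_move_pct=0.5))  # ~ -0.76 (close it)
print(conviction_shaping(-0.8, conviction=0.9, expected_move_pct=2.0))  # ~ -0.08 (tolerate it)
# Winning position: higher conviction gives a slightly bigger hold bonus
print(conviction_shaping(0.6, conviction=0.9, expected_move_pct=1.0))   # ~  0.114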
Do we evaluate and reward/punish each model at each inference? We lost track of our model training metrics. In the dash we show:
Models & Training Progress
Loaded Models (5)
DQN_AGENT - ACTIVE (0) [CKPT]
Inf
Trn
Route
Last: NONE (0.0%) @ N/A
Loss: N/A
Rate: 0.00/s | 24h: 0
Last Inf: None | Train: None
ENHANCED_CNN - ACTIVE (0) [CKPT]
Inf
Trn
Route
Last: NONE (0.0%) @ N/A
Loss: 2133105152.0000 | Best: 34.2300
Rate: 0.00/s | 24h: 0
Last Inf: None | Train: None
DQN_AGENT and ENHANCED_CNN were the models whose training was working well. We had to include the others, but it seems we still haven't, or at least we don't store their metrics and best checkpoints.
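One hedged way to stop losing these numbers would be a small per-model record that every inference/training call updates and that the checkpoint logic persists. The class and field names below are hypothetical, chosen only to mirror the dashboard fields quoted above (last inference/training time, loss, best loss, 24h counts, best checkpoint); it is a sketch, not the project's existing API.

from dataclasses import dataclass, asdict
from typing import Optional
import json
import time

@dataclass
class ModelTrainingMetrics:
    """Hypothetical per-model record mirroring the dashboard fields above."""
    model_name: str                           # e.g. "DQN_AGENT", "ENHANCED_CNN"
    last_inference_ts: Optional[float] = None
    last_training_ts: Optional[float] = None
    last_prediction: Optional[str] = None     # "BUY" / "SELL" / "HOLD"
    last_confidence: float = 0.0
    current_loss: Optional[float] = None
    best_loss: Optional[float] = None
    inferences_24h: int = 0
    trainings_24h: int = 0
    best_checkpoint_path: Optional[str] = None

    def record_inference(self, prediction: str, confidence: float) -> None:
        # Called once per inference so the "Last Inf" / rate fields are never N/A
        self.last_inference_ts = time.time()
        self.last_prediction = prediction
        self.last_confidence = confidence
        self.inferences_24h += 1

    def record_training(self, loss: float, checkpoint_path: Optional[str] = None) -> None:
        # Keep the best checkpoint whenever the loss improves
        self.last_training_ts = time.time()
        self.current_loss = loss
        self.trainings_24h += 1
        if self.best_loss is None or loss < self.best_loss:
            self.best_loss = loss
            if checkpoint_path:
                self.best_checkpoint_path = checkpoint_path

    def save(self, path: str) -> None:
        # Persist alongside the checkpoint so the dash can always reload it
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)

m = ModelTrainingMetrics("DQN_AGENT")
m.record_inference("BUY", confidence=0.62)
m.record_training(loss=0.042, checkpoint_path="checkpoints/dqn_agent_best.pt")
m.save("dqn_agent_metrics.json")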


@@ -3834,17 +3834,48 @@ class TradingOrchestrator:
base_reward = -0.1 * prediction_confidence
logger.debug(f"NOISE INCORRECT: Wrong direction on noise movement = {base_reward:.2f}")
# POSITION-AWARE ADJUSTMENTS (conviction-aware; learned bias via reward shaping)
if has_position:
# Derive conviction from prediction_confidence (0..1)
conviction = max(0.0, min(1.0, float(prediction_confidence)))
# Estimate expected move magnitude if provided by vector; else 0
expected_move_pct = 0.0
try:
if predicted_price_vector and isinstance(predicted_price_vector, dict):
# Accept either a normalized magnitude or compute from price fields if present
if 'expected_move_pct' in predicted_price_vector:
expected_move_pct = float(predicted_price_vector.get('expected_move_pct', 0.0))
elif 'predicted_price' in predicted_price_vector and 'current_price' in predicted_price_vector:
cp = float(predicted_price_vector.get('current_price') or 0.0)
pp = float(predicted_price_vector.get('predicted_price') or 0.0)
if cp > 0 and pp > 0:
expected_move_pct = ((pp - cp) / cp) * 100.0
except Exception:
expected_move_pct = 0.0
# Normalize expected move impact into [0,1]
expected_move_norm = max(0.0, min(1.0, abs(expected_move_pct) / 2.0)) # 2% move caps to 1.0
# Conviction-tolerant drawdown penalty (cut losers early unless strong conviction for recovery)
if current_position_pnl < 0:
pnl_loss = abs(current_position_pnl)
# Scale negative PnL into [0,1] using a soft scale (1% -> 1.0 cap)
loss_norm = max(0.0, min(1.0, pnl_loss / 1.0))
tolerance = (1.0 - min(0.9, conviction * expected_move_norm)) # high conviction reduces penalty
penalty = loss_norm * tolerance
base_reward -= 1.0 * penalty
logger.debug(
f"CONVICTION DRAWdown: pnl={current_position_pnl:.3f}, conv={conviction:.2f}, exp={expected_move_norm:.2f}, penalty={penalty:.3f}"
)
else:
# Let winners run when conviction supports it
gain = max(0.0, current_position_pnl)
gain_norm = max(0.0, min(1.0, gain / 1.0))
run_bonus = 0.2 * gain_norm * (0.5 + 0.5 * conviction)
# Small nudge to keep holding if directionally correct
if predicted_action == "HOLD" and price_change_pct > 0:
base_reward += 0.5 # Bonus for holding profitable position during uptrend
logger.debug(f"POSITION BONUS: Holding profitable position during uptrend = +0.5")
base_reward += run_bonus
logger.debug(f"RUN BONUS: gain={gain:.3f}, conv={conviction:.2f}, bonus={run_bonus:.3f}")
# PRICE VECTOR BONUS (if available)
if predicted_price_vector and isinstance(predicted_price_vector, dict):