training: conviction-aware reward shaping

Dobromir Popov
2025-08-10 13:23:29 +03:00
parent 6861d0f20b
commit b3c5076e37
2 changed files with 69 additions and 10 deletions


@@ -101,4 +101,32 @@ also, adjust our bybit api so we trade with usdt futures - where we can have up
3. we don't calculate the COB imbalance correctly - we have MAs over 4 time windows (see the sketch after this list)
4. we have some more work on the model statistics and overview, but we can focus on that later once we fix the other issues
5. audit and backtest if calculate_williams_pivot_points works correctly. show pivot points on the dash on the 1m candlesticks
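Regarding item 3, here is a minimal sketch of one conventional way to compute order-book imbalance and smooth it with moving averages over several time windows. It assumes COB means the consolidated order book, that imbalance is the signed bid/ask volume ratio in [-1, 1], and that the four window lengths are placeholders; treat it as a reference point for the audit rather than the project's actual formula.

from collections import deque
from dataclasses import dataclass
import time

# Assumed convention: imbalance = (bid_vol - ask_vol) / (bid_vol + ask_vol), in [-1, 1],
# then smoothed with simple moving averages over four time windows (window lengths assumed).
WINDOWS_S = (5, 15, 60, 300)

@dataclass
class CobSnapshot:
    ts: float
    bid_volume: float   # total bid size within the tracked depth
    ask_volume: float   # total ask size within the tracked depth

class CobImbalanceMA:
    def __init__(self, windows_s=WINDOWS_S):
        self.windows_s = windows_s
        self.history = deque()  # (timestamp, imbalance) pairs

    @staticmethod
    def imbalance(snap: CobSnapshot) -> float:
        total = snap.bid_volume + snap.ask_volume
        return 0.0 if total <= 0 else (snap.bid_volume - snap.ask_volume) / total

    def update(self, snap: CobSnapshot) -> dict:
        imb = self.imbalance(snap)
        self.history.append((snap.ts, imb))
        # Drop entries older than the longest window
        cutoff = snap.ts - max(self.windows_s)
        while self.history and self.history[0][0] < cutoff:
            self.history.popleft()
        # One moving average per window, computed over the retained snapshots
        out = {"imbalance": imb}
        for w in self.windows_s:
            vals = [v for ts, v in self.history if ts >= snap.ts - w]
            out[f"ma_{w}s"] = sum(vals) / len(vals) if vals else 0.0
        return out

# Example: a bid-heavy book gives a positive imbalance (0.2 here) across all windows
tracker = CobImbalanceMA()
print(tracker.update(CobSnapshot(ts=time.time(), bid_volume=120.0, ask_volume=80.0)))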
Can we enhance our RL reward/punish to promote closing losing trades and keeping winning ones, taking into account the predicted price direction and conviction? For example, the more an open position is losing, the more we should be biased toward closing it; but if the models predict with high certainty that there will be a big move up, we will be more tolerant of a drawdown. And the opposite: we should be inclined to close winning trades, but keep them as long as the price goes up and we project more upside. Do you think there is a smart way to implement that in the current RL and other training pipelines?
I want this to be part of a proper reward-function bias rather than an algorithmic calculation in post-signal processing, because I prefer that this is a behaviour the model learns and adapts to current conditions without hard boundaries (see the sketch after this note).
THINK REALLY HARD
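For reference, the orchestrator change later in this commit wires exactly this bias into the reward function. The standalone sketch below just replays those shaping terms (loss_norm, tolerance, run_bonus, with the 1% PnL and 2% expected-move caps from the diff) so the trade-off is visible in isolation; it assumes PnL and the expected move are expressed in percent, as the comments in the diff suggest, and the example numbers are purely illustrative.

def conviction_shaping(current_position_pnl: float,
                       conviction: float,
                       expected_move_pct: float) -> float:
    """Reward-shaping term mirroring the orchestrator change in this commit.

    Losses are penalized less when conviction * expected move is high
    (tolerate drawdown ahead of a projected move); gains earn a small
    'let winners run' bonus that grows with conviction.
    """
    conviction = max(0.0, min(1.0, conviction))
    expected_move_norm = max(0.0, min(1.0, abs(expected_move_pct) / 2.0))  # 2% move caps to 1.0
    if current_position_pnl < 0:
        loss_norm = max(0.0, min(1.0, abs(current_position_pnl) / 1.0))    # 1% loss caps to 1.0
        tolerance = 1.0 - min(0.9, conviction * expected_move_norm)
        return -loss_norm * tolerance
    gain_norm = max(0.0, min(1.0, current_position_pnl / 1.0))
    return 0.2 * gain_norm * (0.5 + 0.5 * conviction)

# Same 0.8% drawdown, very different pressure to close depending on conviction:
print(conviction_shaping(-0.8, conviction=0.2, expected_move_pct=0.5))  # ~ -0.76 (close it)
print(conviction_shaping(-0.8, conviction=0.9, expected_move_pct=2.0))  # ~ -0.08 (tolerate it)
# Winning position: higher conviction gives a slightly bigger hold bonus
print(conviction_shaping(0.6, conviction=0.9, expected_move_pct=1.0))   # ~  0.114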
Do we evaluate and reward/punish each model at each inference? We lost track of our model training metrics. In the dash we show:
Models & Training Progress
Loaded Models (5)
DQN_AGENT - ACTIVE (0) [CKPT]
Inf
Trn
Route
Last: NONE (0.0%) @ N/A
Loss: N/A
Rate: 0.00/s | 24h: 0
Last Inf: None | Train: None
ENHANCED_CNN - ACTIVE (0) [CKPT]
Inf
Trn
Route
Last: NONE (0.0%) @ N/A
Loss: 2133105152.0000 | Best: 34.2300
Rate: 0.00/s | 24h: 0
Last Inf: None | Train: None
DQN_AGENT and ENHANCED_CNN were the models whose training was working well. We had to include the others, but it seems we still haven't, or at least we don't store their metrics and best checkpoints.
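One hedged way to stop losing these numbers would be a small per-model record that every inference/training call updates and that the checkpoint logic persists. The class and field names below are hypothetical, chosen only to mirror the dashboard fields quoted above (last inference/training time, loss, best loss, 24h counts, best checkpoint); it is a sketch, not the project's existing API.

from dataclasses import dataclass, asdict
from typing import Optional
import json
import time

@dataclass
class ModelTrainingMetrics:
    """Hypothetical per-model record mirroring the dashboard fields above."""
    model_name: str                           # e.g. "DQN_AGENT", "ENHANCED_CNN"
    last_inference_ts: Optional[float] = None
    last_training_ts: Optional[float] = None
    last_prediction: Optional[str] = None     # "BUY" / "SELL" / "HOLD"
    last_confidence: float = 0.0
    current_loss: Optional[float] = None
    best_loss: Optional[float] = None
    inferences_24h: int = 0
    trainings_24h: int = 0
    best_checkpoint_path: Optional[str] = None

    def record_inference(self, prediction: str, confidence: float) -> None:
        # Called once per inference so the "Last Inf" / rate fields are never N/A
        self.last_inference_ts = time.time()
        self.last_prediction = prediction
        self.last_confidence = confidence
        self.inferences_24h += 1

    def record_training(self, loss: float, checkpoint_path: Optional[str] = None) -> None:
        # Keep the best checkpoint whenever the loss improves
        self.last_training_ts = time.time()
        self.current_loss = loss
        self.trainings_24h += 1
        if self.best_loss is None or loss < self.best_loss:
            self.best_loss = loss
            if checkpoint_path:
                self.best_checkpoint_path = checkpoint_path

    def save(self, path: str) -> None:
        # Persist alongside the checkpoint so the dash can always reload it
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)

m = ModelTrainingMetrics("DQN_AGENT")
m.record_inference("BUY", confidence=0.62)
m.record_training(loss=0.042, checkpoint_path="checkpoints/dqn_agent_best.pt")
m.save("dqn_agent_metrics.json")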


@@ -3834,17 +3834,48 @@ class TradingOrchestrator:
base_reward = -0.1 * prediction_confidence
logger.debug(f"NOISE INCORRECT: Wrong direction on noise movement = {base_reward:.2f}")
# POSITION-AWARE ADJUSTMENTS (conviction-aware; learned bias via reward shaping)
if has_position:
# Derive conviction from prediction_confidence (0..1)
conviction = max(0.0, min(1.0, float(prediction_confidence)))
# Estimate expected move magnitude if provided by vector; else 0
expected_move_pct = 0.0
try:
if predicted_price_vector and isinstance(predicted_price_vector, dict):
# Accept either a normalized magnitude or compute from price fields if present
if 'expected_move_pct' in predicted_price_vector:
expected_move_pct = float(predicted_price_vector.get('expected_move_pct', 0.0))
elif 'predicted_price' in predicted_price_vector and 'current_price' in predicted_price_vector:
cp = float(predicted_price_vector.get('current_price') or 0.0)
pp = float(predicted_price_vector.get('predicted_price') or 0.0)
if cp > 0 and pp > 0:
expected_move_pct = ((pp - cp) / cp) * 100.0
except Exception:
expected_move_pct = 0.0
# Normalize expected move impact into [0,1]
expected_move_norm = max(0.0, min(1.0, abs(expected_move_pct) / 2.0)) # 2% move caps to 1.0
# Conviction-tolerant drawdown penalty (cut losers early unless strong conviction for recovery)
if current_position_pnl < 0:
pnl_loss = abs(current_position_pnl)
# Scale negative PnL into [0,1] using a soft scale (1% -> 1.0 cap)
loss_norm = max(0.0, min(1.0, pnl_loss / 1.0))
tolerance = (1.0 - min(0.9, conviction * expected_move_norm)) # high conviction reduces penalty
penalty = loss_norm * tolerance
base_reward -= 1.0 * penalty
logger.debug(
f"CONVICTION DRAWdown: pnl={current_position_pnl:.3f}, conv={conviction:.2f}, exp={expected_move_norm:.2f}, penalty={penalty:.3f}"
)
else:
# Let winners run when conviction supports it
gain = max(0.0, current_position_pnl)
gain_norm = max(0.0, min(1.0, gain / 1.0))
run_bonus = 0.2 * gain_norm * (0.5 + 0.5 * conviction)
# Small nudge to keep holding if directionally correct
if predicted_action == "HOLD" and price_change_pct > 0:
base_reward += 0.5 # Bonus for holding profitable position during uptrend
logger.debug(f"POSITION BONUS: Holding profitable position during uptrend = +0.5")
base_reward += run_bonus
logger.debug(f"RUN BONUS: gain={gain:.3f}, conv={conviction:.2f}, bonus={run_bonus:.3f}")
# PRICE VECTOR BONUS (if available)
if predicted_price_vector and isinstance(predicted_price_vector, dict):