training: conviction-aware reward shaping

Dobromir Popov
2025-08-10 13:23:29 +03:00
parent 6861d0f20b
commit b3c5076e37
2 changed files with 69 additions and 10 deletions


@@ -101,4 +101,32 @@ also, adjust our bybit api so we trade with usdt futures - where we can have up
3. we don't calculate the COB imbalance correctly - we have MAs over 4 time windows (see the sketch after this list).
4. we have some more work on the model statistics and overview, but we can focus on that later once we fix the other issues.
5. audit and backtest whether calculate_williams_pivot_points works correctly. show the pivot points on the dash on the 1m candlesticks.
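For item 3, a minimal sketch of what a multi-window COB imbalance could look like: a signed (bid - ask) / (bid + ask) imbalance smoothed per window. The window lengths and the plain-SMA smoothing are assumptions for illustration, not the repo's actual config.

```python
from collections import deque

# Hypothetical window sizes (e.g. seconds of COB snapshots), not the real config.
WINDOWS = (5, 15, 60, 300)

class COBImbalance:
    """Signed order-book imbalance in [-1, 1], smoothed over 4 MA windows."""

    def __init__(self, windows=WINDOWS):
        self.buffers = {w: deque(maxlen=w) for w in windows}

    def update(self, bid_volume: float, ask_volume: float) -> dict:
        total = bid_volume + ask_volume
        raw = (bid_volume - ask_volume) / total if total > 0 else 0.0
        out = {}
        for w, buf in self.buffers.items():
            buf.append(raw)
            out[w] = sum(buf) / len(buf)  # simple MA per window
        return out
```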
can we enhance our RL reward/punish to promote closing losing trades and keeping winning ones, taking into account the predicted price direction and conviction? For example, the more a position is losing, the more we should be biased toward closing it; but if the models predict with high certainty that a big move up is coming, we should be more tolerant of the drawdown. And the opposite: we should be inclined to close winning trades, but keep them as long as the price goes up and we project more upside. Do you think there is a smart way to implement that in the current RL and other training pipelines?
I want this to be part of a proper reward-function bias rather than an algorithmic calculation in post-signal processing, because I prefer that this be a behaviour the model learns and adapts to current conditions, without hard boundaries.
THINK REALLY HARD
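A minimal sketch of how that bias could live inside the reward function itself, as requested above. Every name and coefficient here (shaped_reward, k_drawdown, k_conviction, the direction/conviction inputs) is hypothetical, not the pipeline's existing API; it only illustrates the continuous, no-hard-boundary shaping idea.

```python
import math

def shaped_reward(base_pnl: float,
                  unrealized_pnl: float,
                  action_closed: bool,
                  predicted_direction: float,  # -1..1, sign relative to the position's side
                  conviction: float,           # 0..1, model confidence
                  k_drawdown: float = 1.0,
                  k_conviction: float = 2.0) -> float:
    """Hypothetical shaping: the deeper a losing position, the stronger the bias
    toward closing it, softened when a high-conviction forecast favors the
    position; symmetric logic keeps winners open while upside is projected.
    Everything is continuous, so the agent faces no hard boundaries."""
    reward = base_pnl
    # alignment > 0 means the forecast favors the open position recovering/extending
    alignment = max(predicted_direction * conviction, 0.0)
    if unrealized_pnl < 0:
        pressure = k_drawdown * abs(unrealized_pnl)  # grows with the drawdown
        tolerance = k_conviction * alignment         # conviction buys patience
        reward += (pressure - tolerance) if action_closed else (tolerance - pressure)
    else:
        lock_in = 0.5 * k_drawdown * unrealized_pnl                    # mild bias to bank gains
        upside = k_conviction * alignment * math.tanh(unrealized_pnl)  # projected continuation, saturating
        reward += (lock_in - upside) if action_closed else (upside - lock_in)
    return reward
```

Because the term lives in the reward itself rather than in post-signal rules, the agent learns when drawdown tolerance pays off; k_drawdown and k_conviction only set how strong the bias is.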
do we evaluate and reward/punish each model at each inference? we lost track of our model training metrics. in the dash we show:
Models & Training Progress
Loaded Models (5)

DQN_AGENT - ACTIVE (0) [CKPT]
  Inf | Trn | Route
  Last: NONE (0.0%) @ N/A
  Loss: N/A
  Rate: 0.00/s | 24h: 0
  Last Inf: None | Train: None

ENHANCED_CNN - ACTIVE (0) [CKPT]
  Inf | Trn | Route
  Last: NONE (0.0%) @ N/A
  Loss: 2133105152.0000 | Best: 34.2300
  Rate: 0.00/s | 24h: 0
  Last Inf: None | Train: None
DQN_AGENT and ENHANCED_CNN were the models whose training worked well. We had to include the others, but it seems we still haven't wired them up, or at least we don't store their metrics and best checkpoints.
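A minimal sketch of the per-model bookkeeping that would back those dashboard fields and the per-inference question above. The ModelStats fields, registry, and file path are placeholders for illustration, not the dashboard's actual internals.

```python
import json
import time
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ModelStats:
    """Hypothetical per-model record behind the dashboard fields shown above."""
    name: str
    inferences: int = 0
    trainings: int = 0
    last_inference_ts: Optional[float] = None
    last_training_ts: Optional[float] = None
    last_loss: Optional[float] = None
    best_loss: float = float("inf")
    best_checkpoint: Optional[str] = None

    def on_inference(self) -> None:
        self.inferences += 1
        self.last_inference_ts = time.time()

    def on_training(self, loss: float, checkpoint_path: str) -> None:
        self.trainings += 1
        self.last_training_ts = time.time()
        self.last_loss = loss
        if loss < self.best_loss:  # keep only improving checkpoints as "best"
            self.best_loss = loss
            self.best_checkpoint = checkpoint_path

# One entry per loaded model, so the newer models get the same bookkeeping
# as DQN_AGENT and ENHANCED_CNN instead of silently going untracked.
registry = {name: ModelStats(name) for name in ("DQN_AGENT", "ENHANCED_CNN")}

def dump_stats(path: str = "model_stats.json") -> None:
    """Persist the registry so metrics survive restarts."""
    with open(path, "w") as f:
        json.dump({n: asdict(s) for n, s in registry.items()}, f, indent=2)
```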