Training/Inference Modes Analysis

Overview

This document explains the training/inference modes available in the system and how they work, and validates their implementations.

The Concurrent Training Problem (Fixed)

Root Cause: Two concurrent training threads were accessing the same model simultaneously:

  1. Batch training (_execute_real_training) - runs epochs over pre-loaded batches
  2. Per-candle training (_realtime_inference_loop → _train_on_new_candle() → _train_transformer_on_sample()) - trains on each new candle

When both threads called trainer.train_step() at the same time, they both modified the model's weight tensors during the backward pass, corrupting each other's computation graphs. This manifested as "tensor version mismatch" / "inplace operation" errors.

Fix Applied: Added a _training_lock mutex in RealTrainingAdapter.__init__() that serializes all training operations.


Available Modes

1. Start Live Inference (No Training) - training_mode: 'none'

Button: start-inference-btn (green "Start Live Inference (No Training)")

What it does:

  • Starts real-time inference loop (_realtime_inference_loop)
  • Makes predictions on each new candle
  • NO training - model weights remain unchanged
  • Displays predictions, signals, and PnL tracking
  • Updates chart with predictions and ghost candles

Implementation:

  • Frontend: training_panel.html:793 → calls startInference('none')
  • Backend: app.py:2440 → sets training_strategy.mode = 'none'
  • Inference loop: real_training_adapter.py:3919 → _realtime_inference_loop()
  • Training check: real_training_adapter.py:4082 → the if training_strategy.mode != 'none' guard skips training in this mode

Status: WORKING - No training occurs, only inference
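
Since the same gate decides whether to train in all three live modes, a minimal sketch may help. This is illustrative only, not the actual real_training_adapter.py code: the candle dict shape and the predict/train_on_candle callables are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class TrainingStrategy:
    mode: str = 'none'  # 'none' | 'pivots_only' | 'every_candle'

    def should_train_on_candle(self, candle: dict) -> bool:
        if self.mode == 'none':
            return False            # mode 1: inference only, never train
        if self.mode == 'every_candle':
            return True             # mode 3: train on every completed candle
        return bool(candle.get('is_pivot'))  # mode 2: pivot candles only

def realtime_inference_loop(candles: Iterable[dict], strategy: TrainingStrategy,
                            predict: Callable[[dict], str],
                            train_on_candle: Callable[[dict], None]) -> None:
    """Predict on every completed candle; train only when the mode allows it."""
    for candle in candles:
        signal = predict(candle)              # inference always runs
        print(f"candle={candle.get('ts')} signal={signal}")
        if strategy.should_train_on_candle(candle):
            train_on_candle(candle)           # lock-protected training path
```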


2. Live Inference + Pivot Training - training_mode: 'pivots_only'

Button: start-inference-pivot-btn (blue "Live Inference + Pivot Training")

What it does:

  • Starts real-time inference loop
  • Makes predictions on each new candle
  • Trains ONLY on pivot candles:
    • BUY at L pivots (low points - support levels)
    • SELL at H pivots (high points - resistance levels)
  • Uses TrainingStrategyManager to detect pivot points
  • Training uses the _training_lock to prevent concurrent access

Implementation:

  • Frontend: training_panel.html:797 → calls startInference('pivots_only')
  • Backend: app.py:2440 → sets training_strategy.mode = 'pivots_only'
  • Strategy: app.py:539 → _is_pivot_candle() checks for pivot markers
  • Training trigger: real_training_adapter.py:4099 → should_train_on_candle() returns True only for pivots
  • Training execution: real_training_adapter.py:4108 → _train_on_new_candle() → _train_transformer_on_sample()
  • Lock protection: real_training_adapter.py:3589 → with self._training_lock: wraps training

Pivot Detection Logic:

  • app.py:582 → _is_pivot_candle() checks the pivot_markers dict
  • L pivots (lows): candle_pivots['lows'] → action = 'BUY'
  • H pivots (highs): candle_pivots['highs'] → action = 'SELL'
  • Pivot markers come from the dashboard's _get_pivot_markers_for_timeframe(); see the sketch below
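
A hedged sketch of this mapping, assuming pivot markers are keyed by candle timestamp in a {'lows': [...], 'highs': [...]} dict; the real _is_pivot_candle() in app.py may use a different shape:

```python
from typing import Optional

def action_for_pivot(candle_ts: int, pivot_markers: dict) -> Optional[str]:
    """Return 'BUY' for an L pivot, 'SELL' for an H pivot, None otherwise."""
    if candle_ts in pivot_markers.get('lows', []):
        return 'BUY'    # L pivot: local low / support level
    if candle_ts in pivot_markers.get('highs', []):
        return 'SELL'   # H pivot: local high / resistance level
    return None         # not a pivot candle, so no pivot-based training
```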

Status: WORKING - Training only on pivot candles, protected by lock


3. Live Inference + Per-Candle Training - training_mode: 'every_candle'

Button: start-inference-candle-btn (primary "Live Inference + Per-Candle Training")

What it does:

  • Starts real-time inference loop
  • Makes predictions on each new candle
  • Trains on EVERY completed candle:
    • Determines action from price movement or pivots
    • Uses _get_action_for_candle() to decide BUY/SELL/HOLD
    • Trains the model on each new candle completion
  • Training uses the _training_lock to prevent concurrent access

Implementation:

  • Frontend: training_panel.html:801 → calls startInference('every_candle')
  • Backend: app.py:2440 → sets training_strategy.mode = 'every_candle'
  • Strategy: app.py:534 → should_train_on_candle() always returns True for every candle
  • Action determination: app.py:549 → _get_action_for_candle() (see the sketch below):
    • If pivot candle → uses pivot action (BUY at L, SELL at H)
    • If not pivot → uses price movement (BUY if price going up >0.05%, SELL if down <-0.05%, else HOLD)
  • Training execution: real_training_adapter.py:4108 → _train_on_new_candle() → _train_transformer_on_sample()
  • Lock protection: real_training_adapter.py:3589 → with self._training_lock: wraps training
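
An illustrative version of that action rule, using the 0.05% threshold quoted above; the function and parameter names are assumptions, not the actual app.py implementation:

```python
from typing import Optional

PRICE_MOVE_THRESHOLD = 0.0005  # 0.05%, per the rule above

def get_action_for_candle(candle: dict, pivot_action: Optional[str]) -> str:
    """Pivot action wins; otherwise classify by relative price movement."""
    if pivot_action is not None:
        return pivot_action  # BUY at L pivots, SELL at H pivots
    # Open-to-close move; the real measurement basis is an assumption.
    move = (candle['close'] - candle['open']) / candle['open']
    if move > PRICE_MOVE_THRESHOLD:
        return 'BUY'
    if move < -PRICE_MOVE_THRESHOLD:
        return 'SELL'
    return 'HOLD'
```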

Status: WORKING - Training on every candle, protected by lock


4. Backtest on Visible Chart - Separate mode

Button: start-backtest-btn (yellow "Backtest Visible Chart")

What it does:

  • Runs backtest on visible chart data (time range from chart x-axis)
  • Uses loaded model to make predictions on historical data
  • Simulates trading and calculates PnL, win rate, trades
  • NO training - only inference on historical data
  • Displays results: PnL, trades, win rate, progress

Implementation:

  • Frontend: training_panel.html:891 → calls /api/backtest endpoint
  • Backend: app.py:2123 → start_backtest() uses BacktestRunner
  • Backtest runner: Runs in background thread, processes candles sequentially
  • No training lock needed - backtest only does inference, no model weight updates

Status: WORKING - Backtest runs inference only, no training
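
A minimal sketch of such an inference-only backtest loop; the candle shape, signal values, and the flip-on-opposite-signal exit rule are illustrative assumptions, not BacktestRunner's actual logic:

```python
from typing import Callable, Iterable, Optional, Tuple

def run_backtest(candles: Iterable[dict], predict: Callable[[dict], str]) -> dict:
    """Walk candles sequentially, trade on signals, accumulate PnL stats."""
    pnl, wins, trades = 0.0, 0, 0
    position: Optional[Tuple[str, float]] = None  # (side, entry_price)
    for candle in candles:
        signal = predict(candle)          # inference only; weights never change
        price = candle['close']
        if position is not None and signal in ('BUY', 'SELL') and signal != position[0]:
            side, entry = position        # opposite signal closes the position
            gain = price - entry if side == 'BUY' else entry - price
            pnl += gain
            wins += int(gain > 0)
            trades += 1
            position = None
        if position is None and signal in ('BUY', 'SELL'):
            position = (signal, price)    # open (or flip into) a new position
    return {'pnl': pnl, 'trades': trades,
            'win_rate': wins / trades if trades else 0.0}
```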


5. Train Model (Batch Training) - Separate mode

Button: train-model-btn (primary "Train Model")

What it does:

  • Runs batch training on all annotations
  • Trains for multiple epochs over pre-loaded batches
  • Uses _execute_real_training() in background thread
  • Lock protection: real_training_adapter.py:2625 → with self._training_lock: wraps batch training

Implementation:

  • Frontend: training_panel.html:547 → calls /api/train-model endpoint
  • Backend: app.py:2089 → train_model() starts _execute_real_training() in a thread
  • Training loop: real_training_adapter.py:2605 → iterates batches, calls trainer.train_step()
  • Lock protection: real_training_adapter.py:2625 → with self._training_lock: wraps each batch

Status: WORKING - Batch training protected by lock


6. Manual Training - training_mode: 'manual'

Button: manual-train-btn (warning "Train on Current Candle (Manual)")

What it does:

  • Only visible when inference is running in 'manual' mode
  • User manually triggers training by clicking button
  • Prompts user for action (BUY/SELL/HOLD)
  • Trains model on current candle with specified action
  • Uses the _training_lock to prevent concurrent access

Implementation:

  • Frontend: training_panel.html:851 → calls /api/realtime-inference/train-manual
  • Backend: app.py:2701 → manual training endpoint (detailed under Potential Issues & Recommendations below); a sketch of the flow follows
  • Lock protection: uses the same _train_transformer_on_sample(), which holds the lock

Status: WORKING - Manual training fully implemented with endpoint
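
A hedged sketch of the endpoint flow, assuming a Flask route (the route path comes from the frontend call above; the session shape and wiring are assumptions):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)
_session = {'pending_action': None}  # stand-in for the real inference session

def _train_on_new_candle(session: dict) -> None:
    """Stand-in for the adapter call; the real one holds _training_lock."""
    print(f"training on current candle, action={session['pending_action']}")

@app.route('/api/realtime-inference/train-manual', methods=['POST'])
def train_manual():
    action = (request.get_json(silent=True) or {}).get('action')
    if action not in ('BUY', 'SELL', 'HOLD'):
        return jsonify({'error': 'invalid action'}), 400
    _session['pending_action'] = action   # consumed by the training path
    _train_on_new_candle(_session)
    return jsonify({'status': 'ok', 'action': action})
```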


Training Lock Protection

Location: real_training_adapter.py:167

```python
self._training_lock = threading.Lock()
```

Protected Operations:

  1. Batch Training (real_training_adapter.py:2625):

```python
with self._training_lock:
    result = trainer.train_step(batch, accumulate_gradients=False)
```

  2. Per-Candle Training (real_training_adapter.py:3589):

```python
with self._training_lock:
    with torch.enable_grad():
        trainer.model.train()
        result = trainer.train_step(batch, accumulate_gradients=False)
```

How it prevents the bug:

  • Only ONE thread can acquire the lock at a time
  • If batch training is running, per-candle training waits
  • If per-candle training is running, batch training waits
  • This serializes all model weight updates, preventing concurrent modifications
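
A toy, self-contained demonstration of this serialization (train_step here is a stand-in for trainer.train_step(), with a sleep simulating the backward pass):

```python
import threading
import time

_training_lock = threading.Lock()
weights = [0.0]  # shared stand-in for model weights

def train_step(source: str) -> None:
    with _training_lock:          # only one thread may update at a time
        before = weights[0]
        time.sleep(0.01)          # simulate the backward pass
        weights[0] = before + 1   # without the lock, updates could be lost
        print(f"{source}: {before} -> {weights[0]}")

batch = threading.Thread(target=lambda: [train_step('batch') for _ in range(3)])
candle = threading.Thread(target=lambda: [train_step('per-candle') for _ in range(3)])
batch.start(); candle.start()
batch.join(); candle.join()
assert weights[0] == 6.0          # all six updates applied, none interleaved
```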

Validation Summary

| Mode | Training? | Lock Protected? | Status |
|------|-----------|-----------------|--------|
| Live Inference (No Training) | No | N/A | Working |
| Live Inference + Pivot Training | Yes (pivots only) | Yes | Working |
| Live Inference + Per-Candle Training | Yes (every candle) | Yes | Working |
| Backtest | No | N/A | Working |
| Batch Training | Yes | Yes | Working |
| Manual Training | Yes (on demand) | Yes | Working |

Potential Issues & Recommendations

1. Manual Training Endpoint

  • Fully implemented at app.py:2701
  • Sets session['pending_action'] and calls _train_on_new_candle()
  • Protected by training lock via _train_transformer_on_sample()

2. Training Lock Timeout

  • Current lock has no timeout - if one training operation hangs, others wait indefinitely
  • Recommendation: Consider adding a timeout or deadlock detection; see the sketch below
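
One possible shape for that recommendation, as a sketch; the timeout value and wrapper name are assumptions:

```python
import threading

_training_lock = threading.Lock()
TRAINING_LOCK_TIMEOUT_S = 120.0  # assumption: generous bound for one train step

def locked_train_step(trainer, batch):
    """Acquire the lock with a deadline instead of waiting forever."""
    if not _training_lock.acquire(timeout=TRAINING_LOCK_TIMEOUT_S):
        raise TimeoutError("training lock held too long; a training op may be hung")
    try:
        return trainer.train_step(batch, accumulate_gradients=False)
    finally:
        _training_lock.release()
```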

3. Training Strategy State

  • TrainingStrategyManager.mode is set per inference session
  • If multiple inference sessions run simultaneously, they share the same strategy manager
  • Recommendation: Consider per-session strategy managers, as sketched below
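
A sketch of that per-session shape; TrainingStrategyManager's constructor and the session registry are assumptions:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class TrainingStrategyManager:
    mode: str = 'none'

@dataclass
class InferenceSession:
    session_id: str
    strategy: TrainingStrategyManager = field(default_factory=TrainingStrategyManager)

sessions: Dict[str, InferenceSession] = {}

def start_session(session_id: str, mode: str) -> InferenceSession:
    """Each session gets its own strategy, so concurrent sessions cannot clash."""
    session = InferenceSession(session_id, TrainingStrategyManager(mode=mode))
    sessions[session_id] = session
    return session
```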

4. Backtest Training

  • Backtest currently does NOT train the model
  • Could add option to train during backtest (using lock)

Conclusion

All training/inference modes are properly implemented, and every mode that updates model weights is protected by the training lock. The concurrent access issue has been resolved. All modes are working correctly:

  • Live Inference (No Training) - Inference only, no training
  • Live Inference + Pivot Training - Trains on pivot candles only
  • Live Inference + Per-Candle Training - Trains on every candle
  • Backtest - Inference only on historical data
  • Batch Training - Full epoch training on annotations
  • Manual Training - On-demand training with user-specified action

The training lock (_training_lock) ensures that only one training operation can modify model weights at a time, preventing the "inplace operation" errors that occurred when batch training and per-candle training ran concurrently.