
# Training Improvements Summary

## What Changed

### 1. Extended Data Fetching Window

**Before:**

```python
from datetime import timedelta

context_window = 5  # only ±5 minutes of context
start_time = timestamp - timedelta(minutes=context_window)
end_time = timestamp + timedelta(minutes=context_window)
```

**After:**

```python
context_window = 5
negative_samples_window = 15  # ±15 candles
extended_window = max(context_window, negative_samples_window + 10)  # = 25 minutes

start_time = timestamp - timedelta(minutes=extended_window)
end_time = timestamp + timedelta(minutes=extended_window)
```

**Impact:** Fetches enough data to create the ±15-candle negative samples.


### 2. Dynamic Candle Limits

**Before:**

```python
limit = 200  # fixed for all timeframes
```

**After:**

```python
if timeframe == '1s':
    limit = extended_window_minutes * 60 * 2 + 100  # ~3100
elif timeframe == '1m':
    limit = extended_window_minutes * 2 + 50        # ~100
elif timeframe == '1h':
    limit = max(200, extended_window_minutes // 30) # 200+
elif timeframe == '1d':
    limit = 200
```

**Impact:** Requests an appropriate amount of data for each timeframe.
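
For illustration, the same branching can live in a small helper. This is a minimal sketch; `candle_limit` and its default window are names assumed here, not the adapter's actual API:

```python
def candle_limit(timeframe: str, extended_window_minutes: int = 25) -> int:
    """Return how many candles to request for the extended window (sketch)."""
    if timeframe == '1s':
        return extended_window_minutes * 60 * 2 + 100   # ~3100 one-second candles
    if timeframe == '1m':
        return extended_window_minutes * 2 + 50         # ~100 one-minute candles
    if timeframe == '1h':
        return max(200, extended_window_minutes // 30)  # at least 200 hourly candles
    return 200                                          # '1d' and anything else

assert candle_limit('1s') == 3100 and candle_limit('1m') == 100
```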


### 3. Improved Logging

**Before:**

```
DEBUG - Added 30 negative samples
```

**After:**

```
INFO - Test case 1: ENTRY sample - LONG @ 2500.0
INFO - Test case 1: Added 30 HOLD samples (during position)
INFO - Test case 1: EXIT sample @ 2562.5 (2.50%)
INFO - Test case 1: Added 30 NO_TRADE samples (±15 candles)
INFO -    → 15 before signal, 15 after signal
```

**Impact:** Clear visibility into the training data composition.


### 4. Historical Data Priority

**Before:**

```python
df = data_provider.get_historical_data(limit=100)  # latest data only
```

**After:**

```python
# Try DuckDB first (historical candles at the specific timestamp)
df = duckdb_storage.get_ohlcv_data(
    start_time=start_time,
    end_time=end_time
)

# Fall back to replaying the provider's history
if df is None:
    df = data_provider.get_historical_data_replay(
        start_time=start_time,
        end_time=end_time
    )

# Last resort: latest data (with a warning)
if df is None:
    logger.warning("Using latest data as fallback")
    df = data_provider.get_historical_data(limit=limit)
```

**Impact:** Trains on the correct historical data around the annotation timestamp, not on whatever data is current.
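
Taken together, the three-tier lookup behaves like the helper below. This is a minimal sketch assuming the `duckdb_storage` and `data_provider` objects shown above; `fetch_training_ohlcv` is a hypothetical name:

```python
import logging

logger = logging.getLogger(__name__)

def fetch_training_ohlcv(duckdb_storage, data_provider, start_time, end_time, limit):
    # 1. DuckDB: historical candles stored around the annotation timestamp
    df = duckdb_storage.get_ohlcv_data(start_time=start_time, end_time=end_time)
    if df is not None:
        return df
    # 2. Replay: reconstruct the window from the provider's history
    df = data_provider.get_historical_data_replay(start_time=start_time, end_time=end_time)
    if df is not None:
        return df
    # 3. Last resort: latest data, which may not match the annotation window
    logger.warning("Using latest data as fallback")
    return data_provider.get_historical_data(limit=limit)
```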


## Training Data Composition

### Per Annotation

| Sample Type | Count | Repetitions | Total Batches |
|-------------|-------|-------------|---------------|
| ENTRY       | 1     | 100         | 100           |
| HOLD        | ~30   | 25          | 750           |
| EXIT        | 1     | 100         | 100           |
| NO_TRADE    | ~30   | 50          | 1,500         |
| **Total**   | ~62   | -           | **~2,450**    |

### 5 Annotations

| Sample Type | Count | Total Batches |
|-------------|-------|---------------|
| ENTRY       | 5     | 500           |
| HOLD        | ~150  | 3,750         |
| EXIT        | 5     | 500           |
| NO_TRADE    | ~150  | 7,500         |
| **Total**   | ~310  | **~12,250**   |

**Key Ratio:** 1:30 (entry:no_trade), which teaches the model to be selective.
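
The batch totals can be sanity-checked in a few lines; the numbers are copied straight from the tables above:

```python
# (samples per annotation, training repetitions) for each sample type
per_annotation = {
    'ENTRY':    (1, 100),
    'HOLD':     (30, 25),
    'EXIT':     (1, 100),
    'NO_TRADE': (30, 50),
}
batches = sum(count * reps for count, reps in per_annotation.values())
print(batches)      # 2450 batches per annotation
print(batches * 5)  # 12250 batches for 5 annotations
print(per_annotation['NO_TRADE'][0] // per_annotation['ENTRY'][0])  # 30 -> 1:30 ratio
```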


## What This Achieves

### 1. Continuous Data Training

- Trains on every candle within ±15 candles of each signal
- Not just on isolated entry/exit points
- Learns from continuous price action

### 2. Negative Sampling

- 30 NO_TRADE samples per annotation
- 15 before the signal (don't enter too early)
- 15 after the signal (don't chase); see the sketch below
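
A minimal sketch of the idea (hypothetical helper, not the adapter's actual code): walk the ±15 candles around the signal index and label every candle except the signal itself as NO_TRADE.

```python
def make_no_trade_samples(df, signal_idx: int, window: int = 15):
    """Label the candles around a signal as NO_TRADE negative samples."""
    samples = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # the signal candle itself is the ENTRY sample
        idx = signal_idx + offset
        if 0 <= idx < len(df):
            samples.append({
                'timestamp': df.index[idx],
                'label': 'NO_TRADE',
                'offset': offset,  # negative = before signal, positive = after
            })
    return samples  # up to 30 samples: 15 before + 15 after
```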

### 3. Context Learning

- The model sees what happened before the signal
- The model sees what happened after the signal
- It learns timing and context

### 4. Selective Trading

- A high ratio of NO_TRADE samples
- Teaches the model to wait for quality setups
- Reduces false signals

## Example Training Output

```
Starting REAL training with 5 test cases for model Transformer

Preparing training data from 5 test cases...
   Negative sampling: +/-15 candles around signals
   Training repetitions: 100x per sample

   Fetching market state dynamically for test case 1...
   Fetching HISTORICAL market state for ETH/USDT at 2025-10-27 14:00
   Timeframes: ['1s', '1m', '1h', '1d'], Extended window: ±25 minutes
   (Includes ±15 candles for negative sampling)
       1m: 100 candles from DuckDB (historical)
       1h: 200 candles from DuckDB (historical)
       1d: 200 candles from DuckDB (historical)
    Fetched market state with 3 timeframes

   Test case 1: ENTRY sample - LONG @ 2500.0
   Test case 1: Added 30 HOLD samples (during position)
   Test case 1: EXIT sample @ 2562.5 (2.50%)
   Test case 1: Added 30 NO_TRADE samples (±15 candles)
      → 15 before signal, 15 after signal

 Prepared 310 training samples from 5 test cases
   ENTRY samples: 5
   HOLD samples: 150
   EXIT samples: 5
   NO_TRADE samples: 150
   Ratio: 1:30.0 (entry:no_trade)

 Starting Transformer training...
    Converting annotation data to transformer format...
     Converted 310 samples to 12,250 training batches
```

## Files Modified

1. `ANNOTATE/core/real_training_adapter.py`
   - Extended data fetching window
   - Dynamic candle limits
   - Improved logging
   - Historical data priority

## New Documentation

1. `ANNOTATE/CONTINUOUS_DATA_TRAINING_STRATEGY.md`
   - Detailed explanation of the training strategy
   - Sample composition breakdown
   - Configuration guidelines
   - Monitoring tips
2. `ANNOTATE/DATA_LOADING_ARCHITECTURE.md`
   - Data storage architecture
   - Dynamic loading strategy
   - Troubleshooting guide
3. `MODEL_INPUTS_OUTPUTS_REFERENCE.md`
   - All model inputs/outputs
   - Shape specifications
   - Integration examples

## Next Steps

1. **Test Training**
   - Run training with 5+ annotations
   - Verify that NO_TRADE samples are created
   - Check the logs for data fetching
2. **Monitor Ratios**
   - Ideal: 1:20 to 1:40 (entry:no_trade)
   - Adjust `negative_samples_window` if needed
3. **Verify Data**
   - Ensure DuckDB has historical data
   - Check for "fallback" warnings
   - Confirm timestamps match annotations
4. **Tune Parameters**
   - Adjust `extended_window_minutes` if needed
   - Modify repetitions based on dataset size
   - Balance training time against accuracy