# Continuous Data Training Strategy

## Overview

The ANNOTATE system trains models on **continuous OHLCV data** from the database, not just on annotated signals. This teaches the model **when to act AND when NOT to act**.

## Training Data Composition

For each annotation, the system creates multiple training samples:

### 1. ENTRY Sample (1 per annotation)

- **Label**: `ENTRY`
- **Action**: `BUY` or `SELL`
- **Purpose**: Teach model to recognize entry signals
- **Repetitions**: 100x (configurable)

```python
{
    'label': 'ENTRY',
    'action': 'BUY',
    'direction': 'LONG',
    'timestamp': '2025-10-27 14:00',
    'entry_price': 2500.0,
    'repetitions': 100
}
```

### 2. HOLD Samples (N per annotation)

- **Label**: `HOLD`
- **Action**: `HOLD`
- **Purpose**: Teach model to maintain position
- **Count**: Every candle between entry and exit
- **Repetitions**: 25x (1/4 of entry reps)

```python
# For a 30-minute trade with 1m candles = 30 HOLD samples
{
    'label': 'HOLD',
    'action': 'HOLD',
    'in_position': True,
    'timestamp': '2025-10-27 14:05',  # During position
    'repetitions': 25
}
```

### 3. EXIT Sample (1 per annotation)

- **Label**: `EXIT`
- **Action**: `CLOSE`
- **Purpose**: Teach model to recognize exit signals
- **Repetitions**: 100x

```python
{
    'label': 'EXIT',
    'action': 'CLOSE',
    'timestamp': '2025-10-27 14:30',
    'exit_price': 2562.5,
    'profit_loss_pct': 2.5,
    'repetitions': 100
}
```

### 4. NO_TRADE Samples (±15 candles per annotation)

- **Label**: `NO_TRADE`
- **Action**: `HOLD`
- **Purpose**: Teach model when NOT to trade
- **Count**: Up to 30 samples (15 before + 15 after signal)
- **Repetitions**: 50x (1/2 of entry reps)

```python
# 15 candles BEFORE entry signal
{
    'label': 'NO_TRADE',
    'action': 'HOLD',
    'timestamp': '2025-10-27 13:45',  # 15 min before entry
    'direction': 'NONE',
    'repetitions': 50
}

# 15 candles AFTER entry signal
{
    'label': 'NO_TRADE',
    'action': 'HOLD',
    'timestamp': '2025-10-27 14:15',  # 15 min after entry
    'direction': 'NONE',
    'repetitions': 50
}
```

## Data Fetching Strategy

### Extended Time Window

To support negative sampling (±15 candles), the system fetches an **extended time window**:

```python
from datetime import timedelta

# Configuration
context_window_minutes = 5      # Base context
negative_samples_window = 15    # ±15 candles
extended_window_minutes = max(context_window_minutes, negative_samples_window + 10)  # = 25 minutes

# Time range around the annotated entry (entry_timestamp = datetime of the entry signal)
start_time = entry_timestamp - timedelta(minutes=extended_window_minutes)
end_time = entry_timestamp + timedelta(minutes=extended_window_minutes)
```

### Candle Limits by Timeframe

```python
# 1s timeframe: 25 min × 60 sec × 2 + buffer = ~3100 candles
# 1m timeframe: 25 min × 2 + buffer = ~100 candles
# 1h timeframe: 200 candles (fixed)
# 1d timeframe: 200 candles (fixed)
```

## Training Sample Distribution

### Example: Single Annotation

```
Annotation: LONG entry at 14:00, exit at 14:30 (30 min hold)

Training Samples Created:
├── 1 ENTRY sample @ 14:00 (×100 reps) = 100 batches
├── 30 HOLD samples @ 14:01-14:29 (×25 reps) = 750 batches
├── 1 EXIT sample @ 14:30 (×100 reps) = 100 batches
└── 30 NO_TRADE samples @ 13:45-13:59 & 14:01-14:15 (×50 reps) = 1,500 batches

Total: 62 unique samples → 2,450 training batches
```

### Example: 5 Annotations

```
5 annotations with similar structure:

Training Samples:
├── ENTRY: 5 samples (×100 reps) = 500 batches
├── HOLD: ~150 samples (×25 reps) = 3,750 batches
├── EXIT: 5 samples (×100 reps) = 500 batches
└── NO_TRADE: ~150 samples (×50 reps) = 7,500 batches

Total: ~310 unique samples → 12,250 training batches
Ratio: 1:30 (entry:no_trade) - teaches model to be selective!
```
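To make the per-annotation expansion concrete, the sketch below reproduces the composition rules and batch arithmetic above. It is a minimal illustration, assuming 1m candles and the default repetition counts; `expand_annotation` and its simplified sample dicts are hypothetical, while the real logic lives in `_prepare_training_data()`.

```python
from datetime import datetime, timedelta

def expand_annotation(entry_ts, exit_ts, direction,
                      candle_minutes=1, window=15, reps=100):
    """Expand one annotation into ENTRY / HOLD / EXIT / NO_TRADE samples (sketch)."""
    step = timedelta(minutes=candle_minutes)

    # 1 ENTRY sample at the annotated entry (100x reps)
    samples = [{'label': 'ENTRY', 'timestamp': entry_ts,
                'direction': direction, 'repetitions': reps}]

    # One HOLD sample per candle of the hold period (25x reps each)
    hold_candles = int((exit_ts - entry_ts) / step)
    for i in range(1, hold_candles + 1):
        samples.append({'label': 'HOLD', 'timestamp': entry_ts + i * step,
                        'in_position': True, 'repetitions': reps // 4})

    # 1 EXIT sample at the annotated exit (100x reps)
    samples.append({'label': 'EXIT', 'timestamp': exit_ts, 'repetitions': reps})

    # ±window NO_TRADE samples around the entry signal (50x reps each)
    for i in range(1, window + 1):
        for ts in (entry_ts - i * step, entry_ts + i * step):
            samples.append({'label': 'NO_TRADE', 'timestamp': ts,
                            'direction': 'NONE', 'repetitions': reps // 2})
    return samples

samples = expand_annotation(datetime(2025, 10, 27, 14, 0),
                            datetime(2025, 10, 27, 14, 30), 'LONG')
print(len(samples))                             # 62 unique samples
print(sum(s['repetitions'] for s in samples))   # 2450 training batches
```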
## Why This Works

### 1. Reduces False Positives

By training on NO_TRADE samples around signals, the model learns:
- Not every price movement is a signal
- Context matters (what happened before/after)
- Patience is important (wait for the right moment)

### 2. Improves Timing

By training on continuous data, the model learns:
- Gradual buildup to entry signals
- How market conditions evolve
- Difference between "almost" and "ready"

### 3. Teaches Position Management

By training on HOLD samples, the model learns:
- When to stay in position
- Not to exit early
- How to ride trends

### 4. Balanced Training

The repetition strategy ensures balanced learning:
- ENTRY: 100 reps (high importance)
- EXIT: 100 reps (high importance)
- NO_TRADE: 50 reps (moderate importance)
- HOLD: 25 reps (lower importance, but many samples)

## Database Requirements

### Continuous OHLCV Storage

The system requires **continuous historical data** in DuckDB:

```sql
-- Example: Check data availability
SELECT
    symbol,
    timeframe,
    COUNT(*) AS candle_count,
    MIN(timestamp) AS first_candle,
    MAX(timestamp) AS last_candle
FROM ohlcv_data
WHERE symbol = 'ETH/USDT'
GROUP BY symbol, timeframe;
```

### Data Gaps

If there are gaps in the data:
- Negative samples will be fewer (< 30)
- Model still trains but with less context
- Warning logged: "Could not create full negative sample set"

## Configuration

### Adjustable Parameters

```python
# In _prepare_training_data()
negative_samples_window = 15   # ±15 candles (default)
training_repetitions = 100     # 100x per sample (default)

# Derived repetitions
hold_repetitions = training_repetitions // 4      # 25x
no_trade_repetitions = training_repetitions // 2  # 50x
```

### Tuning Guidelines

| Parameter | Small Dataset | Large Dataset | High Precision |
|-----------|---------------|---------------|----------------|
| `negative_samples_window` | 10 | 20 | 15 |
| `training_repetitions` | 50 | 200 | 100 |
| `extended_window_minutes` | 15 | 30 | 25 |

## Monitoring

### Training Logs

Look for these log messages:

```
✅ Good:
"Fetching HISTORICAL market state for ETH/USDT at 2025-10-27 14:00"
"Extended window: ±25 minutes (Includes ±15 candles for negative sampling)"
"1m: 100 candles from DuckDB (historical)"
"Added 30 NO_TRADE samples (±15 candles)"
"→ 15 before signal, 15 after signal"

⚠️ Warning:
"No historical data found, using latest data as fallback"
"Could not create full negative sample set (only 8 samples)"
"Market data has 50 timestamps from ... to ..." (insufficient data)
```

### Sample Distribution

Check the final distribution:

```
INFO - Prepared 310 training samples from 5 test cases
INFO - ENTRY samples: 5
INFO - HOLD samples: 150
INFO - EXIT samples: 5
INFO - NO_TRADE samples: 150
INFO - Ratio: 1:30.0 (entry:no_trade)
```

**Ideal Ratio**: 1:20 to 1:40 (entry:no_trade)
- Too low (< 1:10): Model may overtrade
- Too high (> 1:50): Model may undertrade
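The ratio check can also be scripted against the logged sample counts. The helper below is a hypothetical sketch (not part of the ANNOTATE codebase) that applies the 1:10 and 1:50 thresholds described above.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def check_entry_no_trade_ratio(entry_count: int, no_trade_count: int) -> float:
    """Warn when the entry:no_trade ratio drifts outside the acceptable range."""
    ratio = no_trade_count / max(entry_count, 1)
    logger.info("Ratio: 1:%.1f (entry:no_trade)", ratio)
    if ratio < 10:
        logger.warning("Ratio below 1:10 - model may overtrade; consider widening negative_samples_window")
    elif ratio > 50:
        logger.warning("Ratio above 1:50 - model may undertrade; consider narrowing negative_samples_window")
    return ratio

check_entry_no_trade_ratio(entry_count=5, no_trade_count=150)   # logs "Ratio: 1:30.0"
```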
## Benefits

### 1. Realistic Training
- Trains on actual market conditions
- Includes noise and false signals
- Learns from continuous price action

### 2. Better Generalization
- Not just memorizing entry points
- Understands context and timing
- Reduces overfitting

### 3. Selective Trading
- High ratio of NO_TRADE samples
- Learns to wait for quality setups
- Reduces false signals in production

### 4. Efficient Use of Data
- One annotation → 60+ training samples
- Leverages continuous database storage
- No manual labeling of negative samples

## Example Training Session

```
Starting REAL training with 5 test cases for model Transformer
Preparing training data from 5 test cases...
Negative sampling: +/-15 candles around signals
Training repetitions: 100x per sample

Fetching market state dynamically for test case 1...
Fetching HISTORICAL market state for ETH/USDT at 2025-10-27 14:00
Timeframes: ['1s', '1m', '1h', '1d'], Extended window: ±25 minutes (Includes ±15 candles for negative sampling)
1m: 100 candles from DuckDB (historical)
1h: 200 candles from DuckDB (historical)
1d: 200 candles from DuckDB (historical)
Fetched market state with 3 timeframes
Test case 1: ENTRY sample - LONG @ 2500.0
Test case 1: Added 30 HOLD samples (during position)
Test case 1: EXIT sample @ 2562.5 (2.50%)
Test case 1: Added 30 NO_TRADE samples (±15 candles)
→ 15 before signal, 15 after signal

[... repeat for test cases 2-5 ...]

Prepared 310 training samples from 5 test cases
ENTRY samples: 5
HOLD samples: 150
EXIT samples: 5
NO_TRADE samples: 150
Ratio: 1:30.0 (entry:no_trade)

Starting Transformer training...
Converting annotation data to transformer format...
Converted 310 samples to 12,250 training batches
Training batch 1/12250: loss=0.523
Training batch 100/12250: loss=0.412
Training batch 200/12250: loss=0.356
...
```

## Summary

- ✅ Trains on **continuous OHLCV data** from database
- ✅ Creates **±15 candle negative samples** automatically
- ✅ Teaches model **when to act AND when NOT to act**
- ✅ Uses **extended time window** to fetch sufficient data
- ✅ Balanced training with **1:30 entry:no_trade ratio**
- ✅ Efficient: **1 annotation → 60+ training samples**
- ✅ Realistic: Includes noise, false signals, and context