# Continuous Data Training Strategy
## Overview

The ANNOTATE system trains models on **continuous OHLCV data** from the database, not just on annotated signals. This teaches the model **when to act AND when NOT to act**.

## Training Data Composition

For each annotation, the system creates multiple training samples:
### 1. ENTRY Sample (1 per annotation)

- **Label**: `ENTRY`
- **Action**: `BUY` or `SELL`
- **Purpose**: Teach model to recognize entry signals
- **Repetitions**: 100x (configurable)

```python
{
    'label': 'ENTRY',
    'action': 'BUY',
    'direction': 'LONG',
    'timestamp': '2025-10-27 14:00',
    'entry_price': 2500.0,
    'repetitions': 100
}
```
### 2. HOLD Samples (N per annotation)

- **Label**: `HOLD`
- **Action**: `HOLD`
- **Purpose**: Teach model to maintain position
- **Count**: Every candle between entry and exit
- **Repetitions**: 25x (1/4 of entry reps)

```python
# For a 30-minute trade with 1m candles = 30 HOLD samples
{
    'label': 'HOLD',
    'action': 'HOLD',
    'in_position': True,
    'timestamp': '2025-10-27 14:05',  # During position
    'repetitions': 25
}
```
### 3. EXIT Sample (1 per annotation)

- **Label**: `EXIT`
- **Action**: `CLOSE`
- **Purpose**: Teach model to recognize exit signals
- **Repetitions**: 100x

```python
{
    'label': 'EXIT',
    'action': 'CLOSE',
    'timestamp': '2025-10-27 14:30',
    'exit_price': 2562.5,
    'profit_loss_pct': 2.5,
    'repetitions': 100
}
```
### 4. NO_TRADE Samples (±15 candles per annotation)

- **Label**: `NO_TRADE`
- **Action**: `HOLD`
- **Purpose**: Teach model when NOT to trade
- **Count**: Up to 30 samples (15 before + 15 after signal)
- **Repetitions**: 50x (1/2 of entry reps)

```python
# 15 candles BEFORE entry signal
{
    'label': 'NO_TRADE',
    'action': 'HOLD',
    'timestamp': '2025-10-27 13:45',  # 15 min before entry
    'direction': 'NONE',
    'repetitions': 50
}

# 15 candles AFTER entry signal
{
    'label': 'NO_TRADE',
    'action': 'HOLD',
    'timestamp': '2025-10-27 14:15',  # 15 min after entry
    'direction': 'NONE',
    'repetitions': 50
}
```
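The four sample types above can be sketched as one expansion routine. This is an illustrative sketch, not the ANNOTATE implementation: the annotation dict fields and the boundary handling for HOLD candles (here, one per 1m candle strictly between entry and exit) are assumptions; the repetition ratios follow the defaults described above.

```python
from datetime import datetime, timedelta

def create_training_samples(annotation, negative_window=15, base_reps=100):
    """Expand one annotation into ENTRY, HOLD, EXIT and NO_TRADE samples."""
    entry_ts = annotation['entry_timestamp']
    exit_ts = annotation['exit_timestamp']
    candle = timedelta(minutes=1)  # assumes 1m candles

    samples = [{
        'label': 'ENTRY', 'action': annotation['action'],
        'direction': annotation['direction'], 'timestamp': entry_ts,
        'entry_price': annotation['entry_price'], 'repetitions': base_reps,
    }]

    # HOLD: one sample per candle strictly between entry and exit (1/4 reps)
    ts = entry_ts + candle
    while ts < exit_ts:
        samples.append({'label': 'HOLD', 'action': 'HOLD', 'in_position': True,
                        'timestamp': ts, 'repetitions': base_reps // 4})
        ts += candle

    samples.append({
        'label': 'EXIT', 'action': 'CLOSE', 'timestamp': exit_ts,
        'exit_price': annotation['exit_price'], 'repetitions': base_reps,
    })

    # NO_TRADE: ±negative_window candles around the entry signal (1/2 reps)
    for i in range(1, negative_window + 1):
        for ts in (entry_ts - i * candle, entry_ts + i * candle):
            samples.append({'label': 'NO_TRADE', 'action': 'HOLD',
                            'direction': 'NONE', 'timestamp': ts,
                            'repetitions': base_reps // 2})
    return samples
```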
## Data Fetching Strategy

### Extended Time Window

To support negative sampling (±15 candles), the system fetches an **extended time window**:

```python
from datetime import datetime, timedelta

# Configuration
context_window_minutes = 5    # Base context
negative_samples_window = 15  # ±15 candles
extended_window = max(context_window_minutes, negative_samples_window + 10)  # = 25 minutes

# Time range around the annotation's entry timestamp
entry_timestamp = datetime(2025, 10, 27, 14, 0)
start_time = entry_timestamp - timedelta(minutes=extended_window)
end_time = entry_timestamp + timedelta(minutes=extended_window)
```
### Candle Limits by Timeframe

```python
# 1s timeframe: 25 min × 60 sec × 2 + buffer = ~3100 candles
# 1m timeframe: 25 min × 2 + buffer = ~100 candles
# 1h timeframe: 200 candles (fixed)
# 1d timeframe: 200 candles (fixed)
```
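These heuristics can be expressed as a small helper. The buffer sizes (100 candles for 1s, 50 for 1m) are back-calculated from the totals above and are assumptions:

```python
def candle_limit(timeframe, window_minutes=25):
    """Approximate candle budget for the extended fetch window."""
    if timeframe == '1s':
        return window_minutes * 60 * 2 + 100  # 25 min -> ~3100 candles
    if timeframe == '1m':
        return window_minutes * 2 + 50        # 25 min -> ~100 candles
    return 200  # '1h' and '1d' use a fixed 200-candle context
```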
## Training Sample Distribution

### Example: Single Annotation

```
Annotation: LONG entry at 14:00, exit at 14:30 (30 min hold)

Training Samples Created:
├── 1 ENTRY sample @ 14:00 (×100 reps) = 100 batches
├── 30 HOLD samples @ 14:01-14:29 (×25 reps) = 750 batches
├── 1 EXIT sample @ 14:30 (×100 reps) = 100 batches
└── 30 NO_TRADE samples @ 13:45-13:59 & 14:01-14:15 (×50 reps) = 1500 batches

Total: 62 unique samples → 2,450 training batches
```
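The batch arithmetic above can be reproduced in a few lines; the repetition ratios are the defaults described earlier:

```python
def batch_counts(n_hold, n_no_trade, base_reps=100):
    """Batches produced by one annotation under the default repetition ratios."""
    return {
        'ENTRY': 1 * base_reps,                     # 100
        'HOLD': n_hold * (base_reps // 4),          # n_hold × 25
        'EXIT': 1 * base_reps,                      # 100
        'NO_TRADE': n_no_trade * (base_reps // 2),  # n_no_trade × 50
    }

counts = batch_counts(n_hold=30, n_no_trade=30)
total = sum(counts.values())  # 100 + 750 + 100 + 1500 = 2450
```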
### Example: 5 Annotations

```
5 annotations with similar structure:

Training Samples:
├── ENTRY: 5 samples (×100 reps) = 500 batches
├── HOLD: ~150 samples (×25 reps) = 3,750 batches
├── EXIT: 5 samples (×100 reps) = 500 batches
└── NO_TRADE: ~150 samples (×50 reps) = 7,500 batches

Total: ~310 unique samples → 12,250 training batches

Ratio: 1:30 (entry:no_trade) - teaches model to be selective!
```
## Why This Works

### 1. Reduces False Positives

By training on NO_TRADE samples around signals, the model learns:
- Not every price movement is a signal
- Context matters (what happened before and after)
- Patience is important (wait for the right moment)

### 2. Improves Timing

By training on continuous data, the model learns:
- The gradual buildup to entry signals
- How market conditions evolve
- The difference between "almost" and "ready"

### 3. Teaches Position Management

By training on HOLD samples, the model learns:
- When to stay in a position
- Not to exit early
- How to ride trends

### 4. Balanced Training

The repetition strategy ensures balanced learning:
- ENTRY: 100 reps (high importance)
- EXIT: 100 reps (high importance)
- NO_TRADE: 50 reps (moderate importance)
- HOLD: 25 reps (lower importance, but many samples)
## Database Requirements

### Continuous OHLCV Storage

The system requires **continuous historical data** in DuckDB:

```sql
-- Example: Check data availability
SELECT
    symbol,
    timeframe,
    COUNT(*) AS candle_count,
    MIN(timestamp) AS first_candle,
    MAX(timestamp) AS last_candle
FROM ohlcv_data
WHERE symbol = 'ETH/USDT'
GROUP BY symbol, timeframe;
```
### Data Gaps

If there are gaps in the data:
- Negative samples will be fewer (< 30)
- Model still trains but with less context
- Warning logged: "Could not create full negative sample set"
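A gap check of this kind can be sketched over the candle timestamps; `expected_step` should match the timeframe (1 minute for 1m candles):

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected_step=timedelta(minutes=1)):
    """Return (before, after) timestamp pairs where candles are missing."""
    return [(prev, curr)
            for prev, curr in zip(timestamps, timestamps[1:])
            if curr - prev > expected_step]
```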
## Configuration

### Adjustable Parameters

```python
# In _prepare_training_data()
negative_samples_window = 15  # ±15 candles (default)
training_repetitions = 100    # 100x per sample (default)

# Derived repetitions
hold_repetitions = training_repetitions // 4      # 25x
no_trade_repetitions = training_repetitions // 2  # 50x
```
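Downstream, each prepared sample is repeated `repetitions` times when the batch list is built. A minimal sketch of that expansion (the sample dicts follow the shapes shown earlier; the function name is illustrative):

```python
def expand_to_batches(samples):
    """Repeat each sample `repetitions` times to form the flat batch list."""
    batches = []
    for sample in samples:
        batches.extend([sample] * sample['repetitions'])
    return batches

samples = [{'label': 'ENTRY', 'repetitions': 100},
           {'label': 'HOLD', 'repetitions': 25}]
```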
### Tuning Guidelines

| Parameter | Small Dataset | Large Dataset | High Precision |
|-----------|---------------|---------------|----------------|
| `negative_samples_window` | 10 | 20 | 15 |
| `training_repetitions` | 50 | 200 | 100 |
| `extended_window_minutes` | 15 | 30 | 25 |
## Monitoring

### Training Logs

Look for these log messages:

```
✅ Good:
"Fetching HISTORICAL market state for ETH/USDT at 2025-10-27 14:00"
"Extended window: ±25 minutes (Includes ±15 candles for negative sampling)"
"1m: 100 candles from DuckDB (historical)"
"Added 30 NO_TRADE samples (±15 candles)"
"→ 15 before signal, 15 after signal"

⚠️ Warning:
"No historical data found, using latest data as fallback"
"Could not create full negative sample set (only 8 samples)"
"Market data has 50 timestamps from ... to ..." (insufficient data)
```
### Sample Distribution

Check the final distribution:

```
INFO - Prepared 310 training samples from 5 test cases
INFO - ENTRY samples: 5
INFO - HOLD samples: 150
INFO - EXIT samples: 5
INFO - NO_TRADE samples: 150
INFO - Ratio: 1:30.0 (entry:no_trade)
```

**Ideal Ratio**: 1:20 to 1:40 (entry:no_trade)
- Too low (< 1:10): Model may overtrade
- Too high (> 1:50): Model may undertrade
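A quick sanity check on the ratio can be automated; the 1:10 and 1:50 thresholds follow the warning bands above (the function name is illustrative):

```python
def check_entry_no_trade_ratio(n_entry, n_no_trade, low=10, high=50):
    """Classify the entry:no_trade ratio against the recommended band."""
    ratio = n_no_trade / n_entry
    if ratio < low:
        return f"1:{ratio:.1f} - too low, model may overtrade"
    if ratio > high:
        return f"1:{ratio:.1f} - too high, model may undertrade"
    return f"1:{ratio:.1f} - ok"
```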
## Benefits

### 1. Realistic Training
- Trains on actual market conditions
- Includes noise and false signals
- Learns from continuous price action

### 2. Better Generalization
- Not just memorizing entry points
- Understands context and timing
- Reduces overfitting

### 3. Selective Trading
- High ratio of NO_TRADE samples
- Learns to wait for quality setups
- Reduces false signals in production

### 4. Efficient Use of Data
- One annotation → 60+ training samples
- Leverages continuous database storage
- No manual labeling of negative samples
## Example Training Session

```
Starting REAL training with 5 test cases for model Transformer

Preparing training data from 5 test cases...
Negative sampling: +/-15 candles around signals
Training repetitions: 100x per sample

Fetching market state dynamically for test case 1...
Fetching HISTORICAL market state for ETH/USDT at 2025-10-27 14:00
Timeframes: ['1s', '1m', '1h', '1d'], Extended window: ±25 minutes
(Includes ±15 candles for negative sampling)
1m: 100 candles from DuckDB (historical)
1h: 200 candles from DuckDB (historical)
1d: 200 candles from DuckDB (historical)
Fetched market state with 3 timeframes

Test case 1: ENTRY sample - LONG @ 2500.0
Test case 1: Added 30 HOLD samples (during position)
Test case 1: EXIT sample @ 2562.5 (2.50%)
Test case 1: Added 30 NO_TRADE samples (±15 candles)
→ 15 before signal, 15 after signal

[... repeat for test cases 2-5 ...]

Prepared 310 training samples from 5 test cases
ENTRY samples: 5
HOLD samples: 150
EXIT samples: 5
NO_TRADE samples: 150
Ratio: 1:30.0 (entry:no_trade)

Starting Transformer training...
Converting annotation data to transformer format...
Converted 310 samples to 12,250 training batches

Training batch 1/12250: loss=0.523
Training batch 100/12250: loss=0.412
Training batch 200/12250: loss=0.356
...
```
## Summary

- ✅ Trains on **continuous OHLCV data** from database
- ✅ Creates **±15 candle negative samples** automatically
- ✅ Teaches model **when to act AND when NOT to act**
- ✅ Uses **extended time window** to fetch sufficient data
- ✅ Balanced training with **1:30 entry:no_trade ratio**
- ✅ Efficient: **1 annotation → 60+ training samples**
- ✅ Realistic: Includes noise, false signals, and context