# Training Improvements Summary

## What Changed

### 1. Extended Data Fetching Window ✅

**Before:**

```python
from datetime import timedelta

context_window = 5  # only ±5 minutes of context
start_time = timestamp - timedelta(minutes=context_window)
end_time = timestamp + timedelta(minutes=context_window)
```

**After:**

```python
from datetime import timedelta

context_window = 5
negative_samples_window = 15  # ±15 candles for negative sampling
extended_window = max(context_window, negative_samples_window + 10)  # = 25 minutes

start_time = timestamp - timedelta(minutes=extended_window)
end_time = timestamp + timedelta(minutes=extended_window)
```

**Impact**: Fetches enough data to build the ±15-candle negative samples around each signal.
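As a quick sanity check (a standalone sketch, not code from the adapter; the `timestamp` value is the hypothetical annotation time from the example output below), the 25-minute window comfortably covers ±15 one-minute candles plus headroom:

```python
from datetime import datetime, timedelta

context_window = 5
negative_samples_window = 15  # candles on each side of the signal
extended_window = max(context_window, negative_samples_window + 10)  # 25

timestamp = datetime(2025, 10, 27, 14, 0)  # hypothetical annotation time
start_time = timestamp - timedelta(minutes=extended_window)
end_time = timestamp + timedelta(minutes=extended_window)

# On the 1m timeframe there is one candle per minute, so ±25 minutes
# yields 25 candles on each side, covering the 15 negative-sample
# candles with room to spare.
assert end_time - start_time == timedelta(minutes=2 * extended_window)
print(start_time, "->", end_time)  # 13:35 -> 14:25
```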
---

### 2. Dynamic Candle Limits ✅

**Before:**

```python
limit = 200  # fixed for all timeframes
```

**After:**

```python
if timeframe == '1s':
    limit = extended_window_minutes * 60 * 2 + 100  # ~3100
elif timeframe == '1m':
    limit = extended_window_minutes * 2 + 50  # ~100
elif timeframe == '1h':
    limit = max(200, extended_window_minutes // 30)  # 200+
elif timeframe == '1d':
    limit = 200
```

**Impact**: Requests an appropriate amount of data for each timeframe, as the helper sketch below illustrates.
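Wrapped as a helper (a sketch under the same branching logic; the actual code in `real_training_adapter.py` may be structured differently), the limits are easy to unit-test:

```python
def candle_limit(timeframe: str, extended_window_minutes: int = 25) -> int:
    """Return how many candles to request for a given timeframe."""
    if timeframe == '1s':
        return extended_window_minutes * 60 * 2 + 100   # seconds on both sides + buffer
    if timeframe == '1m':
        return extended_window_minutes * 2 + 50          # minutes on both sides + buffer
    if timeframe == '1h':
        return max(200, extended_window_minutes // 30)   # at least 200 hourly candles
    return 200                                           # '1d' and anything else

# Quick checks against the comments above (25-minute extended window)
assert candle_limit('1s') == 3100
assert candle_limit('1m') == 100
assert candle_limit('1h') == 200
assert candle_limit('1d') == 200
```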
---

### 3. Improved Logging ✅

**Before:**

```
DEBUG - Added 30 negative samples
```

**After:**

```
INFO - Test case 1: ENTRY sample - LONG @ 2500.0
INFO - Test case 1: Added 30 HOLD samples (during position)
INFO - Test case 1: EXIT sample @ 2562.5 (2.50%)
INFO - Test case 1: Added 30 NO_TRADE samples (±15 candles)
INFO - → 15 before signal, 15 after signal
```

**Impact**: Clear visibility into training data composition
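Messages in this shape can be produced with the standard `logging` module. A minimal sketch (the variable names such as `case_idx` are illustrative, not taken from the adapter):

```python
import logging

logging.basicConfig(level=logging.INFO, format='%(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

case_idx, direction, entry_price = 1, 'LONG', 2500.0
exit_price, pnl_pct, n_hold, n_no_trade = 2562.5, 2.50, 30, 30

logger.info("Test case %d: ENTRY sample - %s @ %s", case_idx, direction, entry_price)
logger.info("Test case %d: Added %d HOLD samples (during position)", case_idx, n_hold)
logger.info("Test case %d: EXIT sample @ %s (%.2f%%)", case_idx, exit_price, pnl_pct)
logger.info("Test case %d: Added %d NO_TRADE samples (±15 candles)", case_idx, n_no_trade)
```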
---

### 4. Historical Data Priority ✅

**Before:**

```python
df = data_provider.get_historical_data(limit=100)  # latest data
```

**After:**

```python
# Try DuckDB first (historical data at the specific timestamp)
df = duckdb_storage.get_ohlcv_data(
    start_time=start_time,
    end_time=end_time
)

# Fall back to replay
if df is None:
    df = data_provider.get_historical_data_replay(
        start_time=start_time,
        end_time=end_time
    )

# Last resort: latest data (with a warning)
if df is None:
    logger.warning("Using latest data as fallback")
    df = data_provider.get_historical_data(limit=limit)
```

**Impact**: Trains on the historical data at the annotation timestamp, not on whatever data is current.
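The chain can also be factored into one helper that reports which source was used, which makes the "fallback" warnings mentioned under Next Steps easy to audit. A sketch only; `fetch_ohlcv_window` is a hypothetical name, and only the calls shown above are assumed to exist:

```python
import logging

logger = logging.getLogger(__name__)

def fetch_ohlcv_window(duckdb_storage, data_provider, start_time, end_time, limit):
    """Try DuckDB, then replay, then latest data. Returns (df, source)."""
    df = duckdb_storage.get_ohlcv_data(start_time=start_time, end_time=end_time)
    if df is not None:
        return df, 'duckdb'

    df = data_provider.get_historical_data_replay(start_time=start_time,
                                                  end_time=end_time)
    if df is not None:
        return df, 'replay'

    logger.warning("Using latest data as fallback")
    return data_provider.get_historical_data(limit=limit), 'latest'
```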
---

## Training Data Composition

### Per Annotation

| Sample Type | Count   | Repetitions | Total Batches |
|-------------|---------|-------------|---------------|
| ENTRY       | 1       | 100         | 100           |
| HOLD        | ~30     | 25          | 750           |
| EXIT        | 1       | 100         | 100           |
| NO_TRADE    | ~30     | 50          | 1,500         |
| **Total**   | **~62** | **-**       | **~2,450**    |

### 5 Annotations

| Sample Type | Count    | Total Batches |
|-------------|----------|---------------|
| ENTRY       | 5        | 500           |
| HOLD        | ~150     | 3,750         |
| EXIT        | 5        | 500           |
| NO_TRADE    | ~150     | 7,500         |
| **Total**   | **~310** | **~12,250**   |

**Key Ratio**: 1:30 (entry:no_trade) - the model learns to be selective!
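A quick back-of-the-envelope check of the batch counts (pure arithmetic mirroring the tables above):

```python
# Per annotation: (samples, repetitions) for each sample type
composition = {
    'ENTRY':    (1, 100),
    'HOLD':     (30, 25),
    'EXIT':     (1, 100),
    'NO_TRADE': (30, 50),
}

per_annotation = sum(n * reps for n, reps in composition.values())
print(per_annotation)       # 2450 batches per annotation
print(5 * per_annotation)   # 12250 batches for 5 annotations
```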
---

## What This Achieves

### 1. Continuous Data Training ✅
- Trains on every candle within ±15 of each signal
- Not just isolated entry/exit points
- Learns from continuous price action

### 2. Negative Sampling ✅
- 30 NO_TRADE samples per annotation
- 15 before the signal (don't enter too early)
- 15 after the signal (don't chase), as in the sketch below
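A minimal sketch of how the ±15 negative-sample offsets can be generated around a signal candle (illustrative only; `signal_idx` is a hypothetical index into the fetched window):

```python
negative_samples_window = 15

signal_idx = 100  # index of the signal candle within the fetched window
no_trade_idxs = [signal_idx + off
                 for off in range(-negative_samples_window,
                                  negative_samples_window + 1)
                 if off != 0]  # skip the signal candle itself

assert len(no_trade_idxs) == 30  # 15 before the signal, 15 after
```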
### 3. Context Learning ✅
- Model sees what happened before the signal
- Model sees what happened after the signal
- Learns timing and context

### 4. Selective Trading ✅
- High ratio of NO_TRADE samples
- Teaches the model to wait for quality setups
- Reduces false signals

---
## Example Training Output

```
Starting REAL training with 5 test cases for model Transformer

Preparing training data from 5 test cases...
Negative sampling: +/-15 candles around signals
Training repetitions: 100x per sample

Fetching market state dynamically for test case 1...
Fetching HISTORICAL market state for ETH/USDT at 2025-10-27 14:00
Timeframes: ['1s', '1m', '1h', '1d'], Extended window: ±25 minutes
(Includes ±15 candles for negative sampling)
1m: 100 candles from DuckDB (historical)
1h: 200 candles from DuckDB (historical)
1d: 200 candles from DuckDB (historical)
Fetched market state with 3 timeframes

Test case 1: ENTRY sample - LONG @ 2500.0
Test case 1: Added 30 HOLD samples (during position)
Test case 1: EXIT sample @ 2562.5 (2.50%)
Test case 1: Added 30 NO_TRADE samples (±15 candles)
→ 15 before signal, 15 after signal

Prepared 310 training samples from 5 test cases
ENTRY samples: 5
HOLD samples: 150
EXIT samples: 5
NO_TRADE samples: 150
Ratio: 1:30.0 (entry:no_trade)

Starting Transformer training...
Converting annotation data to transformer format...
Converted 310 samples to 12,250 training batches
```

---
## Files Modified

1. `ANNOTATE/core/real_training_adapter.py`
   - Extended data fetching window
   - Dynamic candle limits
   - Improved logging
   - Historical data priority

---
## New Documentation

1. `ANNOTATE/CONTINUOUS_DATA_TRAINING_STRATEGY.md`
   - Detailed explanation of the training strategy
   - Sample composition breakdown
   - Configuration guidelines
   - Monitoring tips

2. `ANNOTATE/DATA_LOADING_ARCHITECTURE.md`
   - Data storage architecture
   - Dynamic loading strategy
   - Troubleshooting guide

3. `MODEL_INPUTS_OUTPUTS_REFERENCE.md`
   - All model inputs/outputs
   - Shape specifications
   - Integration examples

---
## Next Steps

1. **Test Training**
   - Run training with 5+ annotations
   - Verify NO_TRADE samples are created
   - Check the logs for data fetching

2. **Monitor Ratios**
   - Ideal: 1:20 to 1:40 (entry:no_trade); see the check after this list
   - Adjust `negative_samples_window` if needed

3. **Verify Data**
   - Ensure DuckDB has historical data
   - Check for "fallback" warnings
   - Confirm timestamps match annotations

4. **Tune Parameters**
   - Adjust `extended_window_minutes` if needed
   - Modify repetitions based on dataset size
   - Balance training time against accuracy
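A tiny ratio check (a sketch; in practice the counts would come from the prepared-dataset summary log shown above):

```python
entry_count, no_trade_count = 5, 150  # from the training-data summary log
ratio = no_trade_count / entry_count

# Warn when outside the recommended 1:20 to 1:40 band
if not 20 <= ratio <= 40:
    print(f"WARNING: entry:no_trade ratio 1:{ratio:.1f} outside 1:20-1:40")
else:
    print(f"entry:no_trade ratio 1:{ratio:.1f} looks healthy")
```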