fetching data from the DB to train
# Training Improvements Summary

## What Changed

### 1. Extended Data Fetching Window ✅

**Before:**
```python
context_window = 5  # only ±5 minutes of context
start_time = timestamp - timedelta(minutes=5)
end_time = timestamp + timedelta(minutes=5)
```

**After:**
```python
context_window = 5
negative_samples_window = 15  # ±15 candles
extended_window = max(context_window, negative_samples_window + 10)  # = 25 minutes

start_time = timestamp - timedelta(minutes=extended_window)
end_time = timestamp + timedelta(minutes=extended_window)
```

**Impact**: Fetches enough data to build the ±15-candle negative samples.

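The same arithmetic as a self-contained sketch (the helper name `compute_fetch_window` is illustrative; only the `negative_samples_window + 10` buffer comes from the adapter code above):

```python
from datetime import datetime, timedelta

def compute_fetch_window(timestamp: datetime,
                         context_window: int = 5,
                         negative_samples_window: int = 15,
                         buffer_minutes: int = 10) -> tuple[datetime, datetime]:
    """Return a (start_time, end_time) range wide enough for negative sampling."""
    extended = max(context_window, negative_samples_window + buffer_minutes)
    return (timestamp - timedelta(minutes=extended),
            timestamp + timedelta(minutes=extended))

start, end = compute_fetch_window(datetime(2025, 10, 27, 14, 0))
assert end - start == timedelta(minutes=50)  # ±25 minutes around the signal
```
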
---

### 2. Dynamic Candle Limits ✅

**Before:**
```python
limit = 200  # fixed for all timeframes
```

**After:**
```python
if timeframe == '1s':
    limit = extended_window_minutes * 60 * 2 + 100  # ~3100
elif timeframe == '1m':
    limit = extended_window_minutes * 2 + 50  # ~100
elif timeframe == '1h':
    limit = max(200, extended_window_minutes // 30)  # 200+
elif timeframe == '1d':
    limit = 200
```

**Impact**: Requests an appropriate amount of data for each timeframe.

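Wrapped as a standalone helper (the name `candle_limit` is hypothetical), the branching is easy to sanity-check against the comments above:

```python
def candle_limit(timeframe: str, extended_window_minutes: int = 25) -> int:
    """Candles to request per timeframe for a ±extended_window_minutes fetch."""
    if timeframe == '1s':
        return extended_window_minutes * 60 * 2 + 100
    if timeframe == '1m':
        return extended_window_minutes * 2 + 50
    if timeframe == '1h':
        return max(200, extended_window_minutes // 30)
    return 200  # '1d' and anything coarser

assert candle_limit('1s') == 3100
assert candle_limit('1m') == 100
assert candle_limit('1h') == 200
```
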
---

### 3. Improved Logging ✅

**Before:**
```
DEBUG - Added 30 negative samples
```

**After:**
```
INFO - Test case 1: ENTRY sample - LONG @ 2500.0
INFO - Test case 1: Added 30 HOLD samples (during position)
INFO - Test case 1: EXIT sample @ 2562.5 (2.50%)
INFO - Test case 1: Added 30 NO_TRADE samples (±15 candles)
INFO - → 15 before signal, 15 after signal
```

**Impact**: Clear visibility into training data composition.

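A minimal sketch of what could emit those lines, assuming the standard `logging` module; the message text follows the sample output above, while the function and variable names are illustrative:

```python
import logging

logger = logging.getLogger(__name__)

def log_test_case(i: int, direction: str, entry: float, exit_price: float,
                  pnl_pct: float, n_hold: int, n_no_trade: int, window: int) -> None:
    # One INFO line per sample group, instead of a single DEBUG count
    logger.info("Test case %d: ENTRY sample - %s @ %s", i, direction, entry)
    logger.info("Test case %d: Added %d HOLD samples (during position)", i, n_hold)
    logger.info("Test case %d: EXIT sample @ %s (%.2f%%)", i, exit_price, pnl_pct)
    logger.info("Test case %d: Added %d NO_TRADE samples (±%d candles)", i, n_no_trade, window)
    logger.info("→ %d before signal, %d after signal", n_no_trade // 2, n_no_trade // 2)
```
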
---

### 4. Historical Data Priority ✅

**Before:**
```python
df = data_provider.get_historical_data(limit=100)  # latest data, regardless of annotation time
```

**After:**
```python
# Try DuckDB first (historical data at the specific timestamp)
df = duckdb_storage.get_ohlcv_data(
    start_time=start_time,
    end_time=end_time
)

# Fall back to replay
if df is None:
    df = data_provider.get_historical_data_replay(
        start_time=start_time,
        end_time=end_time
    )

# Last resort: latest data (with a warning)
if df is None:
    logger.warning("Using latest data as fallback")
    df = data_provider.get_historical_data(limit=limit)
```

**Impact**: Trains on the historical data at the annotation's timestamp, not on whatever is current.

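The same priority chain can be factored into a reusable helper. This is a sketch assuming each loader returns a pandas DataFrame or `None`; the helper name and labels are illustrative, not the adapter's actual API:

```python
import logging
from typing import Callable, Optional, Sequence, Tuple

import pandas as pd

logger = logging.getLogger(__name__)

DataLoader = Callable[[], Optional[pd.DataFrame]]

def load_with_fallback(sources: Sequence[Tuple[str, DataLoader]]) -> Optional[pd.DataFrame]:
    """Try each (label, loader) in priority order, warning whenever we fall back."""
    for i, (label, loader) in enumerate(sources):
        df = loader()
        if df is not None and not df.empty:
            if i > 0:
                logger.warning("Using %s as a fallback data source", label)
            return df
    return None

# Usage mirroring the chain above (the lambdas wrap the real calls):
# df = load_with_fallback([
#     ("DuckDB", lambda: duckdb_storage.get_ohlcv_data(start_time=start_time, end_time=end_time)),
#     ("replay", lambda: data_provider.get_historical_data_replay(start_time=start_time, end_time=end_time)),
#     ("latest data", lambda: data_provider.get_historical_data(limit=limit)),
# ])
```
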
---

## Training Data Composition

### Per Annotation

| Sample Type | Count   | Repetitions | Total Batches |
|-------------|---------|-------------|---------------|
| ENTRY       | 1       | 100         | 100           |
| HOLD        | ~30     | 25          | 750           |
| EXIT        | 1       | 100         | 100           |
| NO_TRADE    | ~30     | 50          | 1,500         |
| **Total**   | **~62** | **-**       | **~2,450**    |

### 5 Annotations

| Sample Type | Count    | Total Batches |
|-------------|----------|---------------|
| ENTRY       | 5        | 500           |
| HOLD        | ~150     | 3,750         |
| EXIT        | 5        | 500           |
| NO_TRADE    | ~150     | 7,500         |
| **Total**   | **~310** | **~12,250**   |

**Key Ratio**: 1:30 (entry:no_trade) - the model learns to be selective!

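The batch totals follow directly from count × repetitions; a quick check confirms the arithmetic in both tables:

```python
per_annotation = {"ENTRY": 1, "HOLD": 30, "EXIT": 1, "NO_TRADE": 30}    # sample counts
repetitions = {"ENTRY": 100, "HOLD": 25, "EXIT": 100, "NO_TRADE": 50}

batches = sum(per_annotation[k] * repetitions[k] for k in repetitions)
assert batches == 2450       # per annotation
assert batches * 5 == 12250  # five annotations
```
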
---

## What This Achieves

### 1. Continuous Data Training ✅
- Trains on every candle within ±15 of a signal
- Not just isolated entry/exit points
- Learns from continuous price action

### 2. Negative Sampling ✅
- 30 NO_TRADE samples per annotation (index selection sketched below)
- 15 before the signal (don't enter too early)
- 15 after the signal (don't chase)

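A minimal sketch of how those ±15 indices could be selected (hypothetical helper; the adapter's actual logic may differ, e.g. around series boundaries):

```python
def negative_sample_indices(signal_idx: int, window: int = 15) -> list[int]:
    """Candle indices for NO_TRADE samples around a signal, excluding the signal itself."""
    before = range(signal_idx - window, signal_idx)
    after = range(signal_idx + 1, signal_idx + window + 1)
    return [i for i in (*before, *after) if i >= 0]

assert len(negative_sample_indices(100)) == 30  # 15 before + 15 after
```
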

### 3. Context Learning ✅
- Model sees what happened before the signal
- Model sees what happened after the signal
- Learns timing and context

### 4. Selective Trading ✅
- High ratio of NO_TRADE samples
- Teaches the model to wait for quality setups
- Reduces false signals

---

## Example Training Output

```
Starting REAL training with 5 test cases for model Transformer

Preparing training data from 5 test cases...
Negative sampling: +/-15 candles around signals
Training repetitions: 100x per sample

Fetching market state dynamically for test case 1...
Fetching HISTORICAL market state for ETH/USDT at 2025-10-27 14:00
Timeframes: ['1s', '1m', '1h', '1d'], Extended window: ±25 minutes
(Includes ±15 candles for negative sampling)
1m: 100 candles from DuckDB (historical)
1h: 200 candles from DuckDB (historical)
1d: 200 candles from DuckDB (historical)
Fetched market state with 3 timeframes

Test case 1: ENTRY sample - LONG @ 2500.0
Test case 1: Added 30 HOLD samples (during position)
Test case 1: EXIT sample @ 2562.5 (2.50%)
Test case 1: Added 30 NO_TRADE samples (±15 candles)
→ 15 before signal, 15 after signal

Prepared 310 training samples from 5 test cases
ENTRY samples: 5
HOLD samples: 150
EXIT samples: 5
NO_TRADE samples: 150
Ratio: 1:30.0 (entry:no_trade)

Starting Transformer training...
Converting annotation data to transformer format...
Converted 310 samples to 12,250 training batches
```

---

## Files Modified

1. `ANNOTATE/core/real_training_adapter.py`
   - Extended data fetching window
   - Dynamic candle limits
   - Improved logging
   - Historical data priority

---

## New Documentation

1. `ANNOTATE/CONTINUOUS_DATA_TRAINING_STRATEGY.md`
   - Detailed explanation of the training strategy
   - Sample composition breakdown
   - Configuration guidelines
   - Monitoring tips

2. `ANNOTATE/DATA_LOADING_ARCHITECTURE.md`
   - Data storage architecture
   - Dynamic loading strategy
   - Troubleshooting guide

3. `MODEL_INPUTS_OUTPUTS_REFERENCE.md`
   - All model inputs/outputs
   - Shape specifications
   - Integration examples

---

## Next Steps

1. **Test Training**
   - Run training with 5+ annotations
   - Verify that NO_TRADE samples are created
   - Check the logs for data-fetching messages

2. **Monitor Ratios**
   - Ideal: 1:20 to 1:40 (entry:no_trade)
   - Adjust `negative_samples_window` if needed (see the sketch below)

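   A tiny guard like the following (hypothetical helper; the 1:20-1:40 band comes from the guideline above) can flag a drifting ratio:

   ```python
   def check_entry_no_trade_ratio(n_entry: int, n_no_trade: int) -> None:
       """Warn when the entry:no_trade ratio leaves the suggested 1:20-1:40 band."""
       ratio = n_no_trade / max(n_entry, 1)
       if not 20 <= ratio <= 40:
           print(f"entry:no_trade is 1:{ratio:.1f}; consider tuning negative_samples_window")

   check_entry_no_trade_ratio(5, 150)  # 1:30.0, inside the band: prints nothing
   ```
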
3. **Verify Data**
   - Ensure DuckDB actually contains historical data
   - Check the logs for "fallback" warnings
   - Confirm timestamps match the annotations

4. **Tune Parameters**
   - Adjust `extended_window_minutes` if needed
   - Modify repetitions based on dataset size
   - Balance training time against accuracy