fetching data from the DB to train
# Training Improvements Summary

## What Changed

### 1. Extended Data Fetching Window ✅

**Before:**
```python
context_window = 5  # only ±5 minutes of context
start_time = timestamp - timedelta(minutes=5)
end_time = timestamp + timedelta(minutes=5)
```

**After:**
```python
context_window = 5
negative_samples_window = 15  # ±15 candles
extended_window = max(context_window, negative_samples_window + 10)  # = 25 minutes

start_time = timestamp - timedelta(minutes=extended_window)
end_time = timestamp + timedelta(minutes=extended_window)
```

**Impact**: Fetches enough data to build the ±15-candle negative samples.

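The same arithmetic as a self-contained sketch (the helper name `compute_fetch_window` is illustrative; only the `negative_samples_window + 10` buffer comes from the adapter code above):

```python
from datetime import datetime, timedelta

def compute_fetch_window(timestamp: datetime,
                         context_window: int = 5,
                         negative_samples_window: int = 15,
                         buffer_minutes: int = 10) -> tuple[datetime, datetime]:
    """Return a (start_time, end_time) range wide enough for negative sampling."""
    extended = max(context_window, negative_samples_window + buffer_minutes)
    return (timestamp - timedelta(minutes=extended),
            timestamp + timedelta(minutes=extended))

start, end = compute_fetch_window(datetime(2025, 10, 27, 14, 0))
assert end - start == timedelta(minutes=50)  # ±25 minutes around the signal
```
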
---

### 2. Dynamic Candle Limits ✅

**Before:**
```python
limit = 200  # fixed for all timeframes
```

**After:**
```python
if timeframe == '1s':
    limit = extended_window_minutes * 60 * 2 + 100  # ~3100
elif timeframe == '1m':
    limit = extended_window_minutes * 2 + 50  # ~100
elif timeframe == '1h':
    limit = max(200, extended_window_minutes // 30)  # 200+
elif timeframe == '1d':
    limit = 200
```

**Impact**: Requests an appropriate amount of data for each timeframe.

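Wrapped as a standalone helper (the name `candle_limit` is hypothetical), the branching is easy to sanity-check against the comments above:

```python
def candle_limit(timeframe: str, extended_window_minutes: int = 25) -> int:
    """Candles to request per timeframe for a ±extended_window_minutes fetch."""
    if timeframe == '1s':
        return extended_window_minutes * 60 * 2 + 100
    if timeframe == '1m':
        return extended_window_minutes * 2 + 50
    if timeframe == '1h':
        return max(200, extended_window_minutes // 30)
    return 200  # '1d' and anything coarser

assert candle_limit('1s') == 3100
assert candle_limit('1m') == 100
assert candle_limit('1h') == 200
```
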
---

### 3. Improved Logging ✅

**Before:**
```
DEBUG - Added 30 negative samples
```

**After:**
```
INFO - Test case 1: ENTRY sample - LONG @ 2500.0
INFO - Test case 1: Added 30 HOLD samples (during position)
INFO - Test case 1: EXIT sample @ 2562.5 (2.50%)
INFO - Test case 1: Added 30 NO_TRADE samples (±15 candles)
INFO - → 15 before signal, 15 after signal
```

**Impact**: Clear visibility into training data composition.

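A minimal sketch of what could emit those lines, assuming the standard `logging` module; the message text follows the sample output above, while the function and variable names are illustrative:

```python
import logging

logger = logging.getLogger(__name__)

def log_test_case(i: int, direction: str, entry: float, exit_price: float,
                  pnl_pct: float, n_hold: int, n_no_trade: int, window: int) -> None:
    # One INFO line per sample group, instead of a single DEBUG count
    logger.info("Test case %d: ENTRY sample - %s @ %s", i, direction, entry)
    logger.info("Test case %d: Added %d HOLD samples (during position)", i, n_hold)
    logger.info("Test case %d: EXIT sample @ %s (%.2f%%)", i, exit_price, pnl_pct)
    logger.info("Test case %d: Added %d NO_TRADE samples (±%d candles)", i, n_no_trade, window)
    logger.info("→ %d before signal, %d after signal", n_no_trade // 2, n_no_trade // 2)
```
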
---

### 4. Historical Data Priority ✅

**Before:**
```python
df = data_provider.get_historical_data(limit=100)  # latest data, regardless of annotation time
```

**After:**
```python
# Try DuckDB first (historical data at the specific timestamp)
df = duckdb_storage.get_ohlcv_data(
    start_time=start_time,
    end_time=end_time
)

# Fall back to replay
if df is None:
    df = data_provider.get_historical_data_replay(
        start_time=start_time,
        end_time=end_time
    )

# Last resort: latest data (with a warning)
if df is None:
    logger.warning("Using latest data as fallback")
    df = data_provider.get_historical_data(limit=limit)
```

**Impact**: Trains on the historical data at the annotation's timestamp, not on whatever is current.

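The same priority chain can be factored into a reusable helper. This is a sketch assuming each loader returns a pandas DataFrame or `None`; the helper name and labels are illustrative, not the adapter's actual API:

```python
import logging
from typing import Callable, Optional, Sequence, Tuple

import pandas as pd

logger = logging.getLogger(__name__)

DataLoader = Callable[[], Optional[pd.DataFrame]]

def load_with_fallback(sources: Sequence[Tuple[str, DataLoader]]) -> Optional[pd.DataFrame]:
    """Try each (label, loader) in priority order, warning whenever we fall back."""
    for i, (label, loader) in enumerate(sources):
        df = loader()
        if df is not None and not df.empty:
            if i > 0:
                logger.warning("Using %s as a fallback data source", label)
            return df
    return None

# Usage mirroring the chain above (the lambdas wrap the real calls):
# df = load_with_fallback([
#     ("DuckDB", lambda: duckdb_storage.get_ohlcv_data(start_time=start_time, end_time=end_time)),
#     ("replay", lambda: data_provider.get_historical_data_replay(start_time=start_time, end_time=end_time)),
#     ("latest data", lambda: data_provider.get_historical_data(limit=limit)),
# ])
```
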
---

## Training Data Composition

### Per Annotation

| Sample Type | Count   | Repetitions | Total Batches |
|-------------|---------|-------------|---------------|
| ENTRY       | 1       | 100         | 100           |
| HOLD        | ~30     | 25          | 750           |
| EXIT        | 1       | 100         | 100           |
| NO_TRADE    | ~30     | 50          | 1,500         |
| **Total**   | **~62** | **-**       | **~2,450**    |

### 5 Annotations

| Sample Type | Count    | Total Batches |
|-------------|----------|---------------|
| ENTRY       | 5        | 500           |
| HOLD        | ~150     | 3,750         |
| EXIT        | 5        | 500           |
| NO_TRADE    | ~150     | 7,500         |
| **Total**   | **~310** | **~12,250**   |

**Key Ratio**: 1:30 (entry:no_trade) - the model learns to be selective!

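The batch totals follow directly from count × repetitions; a quick check confirms the arithmetic in both tables:

```python
per_annotation = {"ENTRY": 1, "HOLD": 30, "EXIT": 1, "NO_TRADE": 30}    # sample counts
repetitions = {"ENTRY": 100, "HOLD": 25, "EXIT": 100, "NO_TRADE": 50}

batches = sum(per_annotation[k] * repetitions[k] for k in repetitions)
assert batches == 2450       # per annotation
assert batches * 5 == 12250  # five annotations
```
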
---

## What This Achieves

### 1. Continuous Data Training ✅
- Trains on every candle within ±15 of a signal
- Not just isolated entry/exit points
- Learns from continuous price action

### 2. Negative Sampling ✅
- 30 NO_TRADE samples per annotation (index selection sketched below)
- 15 before the signal (don't enter too early)
- 15 after the signal (don't chase)

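A minimal sketch of how those ±15 indices could be selected (hypothetical helper; the adapter's actual logic may differ, e.g. around series boundaries):

```python
def negative_sample_indices(signal_idx: int, window: int = 15) -> list[int]:
    """Candle indices for NO_TRADE samples around a signal, excluding the signal itself."""
    before = range(signal_idx - window, signal_idx)
    after = range(signal_idx + 1, signal_idx + window + 1)
    return [i for i in (*before, *after) if i >= 0]

assert len(negative_sample_indices(100)) == 30  # 15 before + 15 after
```
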

### 3. Context Learning ✅
- Model sees what happened before the signal
- Model sees what happened after the signal
- Learns timing and context

### 4. Selective Trading ✅
- High ratio of NO_TRADE samples
- Teaches the model to wait for quality setups
- Reduces false signals

---

## Example Training Output

```
Starting REAL training with 5 test cases for model Transformer

Preparing training data from 5 test cases...
Negative sampling: +/-15 candles around signals
Training repetitions: 100x per sample

Fetching market state dynamically for test case 1...
Fetching HISTORICAL market state for ETH/USDT at 2025-10-27 14:00
Timeframes: ['1s', '1m', '1h', '1d'], Extended window: ±25 minutes
(Includes ±15 candles for negative sampling)
1m: 100 candles from DuckDB (historical)
1h: 200 candles from DuckDB (historical)
1d: 200 candles from DuckDB (historical)
Fetched market state with 3 timeframes

Test case 1: ENTRY sample - LONG @ 2500.0
Test case 1: Added 30 HOLD samples (during position)
Test case 1: EXIT sample @ 2562.5 (2.50%)
Test case 1: Added 30 NO_TRADE samples (±15 candles)
→ 15 before signal, 15 after signal

Prepared 310 training samples from 5 test cases
ENTRY samples: 5
HOLD samples: 150
EXIT samples: 5
NO_TRADE samples: 150
Ratio: 1:30.0 (entry:no_trade)

Starting Transformer training...
Converting annotation data to transformer format...
Converted 310 samples to 12,250 training batches
```

---

## Files Modified

1. `ANNOTATE/core/real_training_adapter.py`
   - Extended data fetching window
   - Dynamic candle limits
   - Improved logging
   - Historical data priority

---

## New Documentation

1. `ANNOTATE/CONTINUOUS_DATA_TRAINING_STRATEGY.md`
   - Detailed explanation of the training strategy
   - Sample composition breakdown
   - Configuration guidelines
   - Monitoring tips

2. `ANNOTATE/DATA_LOADING_ARCHITECTURE.md`
   - Data storage architecture
   - Dynamic loading strategy
   - Troubleshooting guide

3. `MODEL_INPUTS_OUTPUTS_REFERENCE.md`
   - All model inputs/outputs
   - Shape specifications
   - Integration examples

---

## Next Steps

1. **Test Training**
   - Run training with 5+ annotations
   - Verify that NO_TRADE samples are created
   - Check the logs for data-fetching messages

2. **Monitor Ratios**
   - Ideal: 1:20 to 1:40 (entry:no_trade)
   - Adjust `negative_samples_window` if needed (see the sketch below)

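   A tiny guard like the following (hypothetical helper; the 1:20-1:40 band comes from the guideline above) can flag a drifting ratio:

   ```python
   def check_entry_no_trade_ratio(n_entry: int, n_no_trade: int) -> None:
       """Warn when the entry:no_trade ratio leaves the suggested 1:20-1:40 band."""
       ratio = n_no_trade / max(n_entry, 1)
       if not 20 <= ratio <= 40:
           print(f"entry:no_trade is 1:{ratio:.1f}; consider tuning negative_samples_window")

   check_entry_no_trade_ratio(5, 150)  # 1:30.0, inside the band: prints nothing
   ```
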
3. **Verify Data**
   - Ensure DuckDB actually contains historical data
   - Check the logs for "fallback" warnings
   - Confirm timestamps match the annotations

4. **Tune Parameters**
   - Adjust `extended_window_minutes` if needed
   - Modify repetitions based on dataset size
   - Balance training time against accuracy