fetching data from the DB to train

This commit is contained in:
Dobromir Popov
2025-10-31 03:14:35 +02:00
parent 07150fd019
commit 6ac324289c
6 changed files with 1113 additions and 46 deletions

@@ -0,0 +1,247 @@
# Final Data Structure Implementation Summary
## What Was Implemented
### ✅ 5 Batches of 600 Candles Each
**Primary Symbol** (e.g., ETH/USDT):
- 1s timeframe: 600 candles (10 minutes of data)
- 1m timeframe: 600 candles (10 hours of data)
- 1h timeframe: 600 candles (25 days of data)
- 1d timeframe: 600 candles (~1.6 years of data)
**Secondary Symbol** (BTC/USDT or ETH/USDT):
- 1m timeframe: 600 candles (10 hours of data)
**Total**: 3,000 candles per annotation
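The candle arithmetic behind these batches, as a minimal sketch (the constant names are illustrative, not the adapter's actual identifiers):

```python
CANDLES_PER_BATCH = 600
PRIMARY_TIMEFRAMES = ['1s', '1m', '1h', '1d']   # primary symbol
SECONDARY_TIMEFRAMES = ['1m']                   # secondary symbol

batches = len(PRIMARY_TIMEFRAMES) + len(SECONDARY_TIMEFRAMES)
total_candles = batches * CANDLES_PER_BATCH
print(batches, total_candles)  # 5 3000
```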
---
## Symbol Pairing Logic
```python
def _get_secondary_symbol(primary_symbol):
    """
    ETH/USDT → BTC/USDT
    SOL/USDT → BTC/USDT
    BTC/USDT → ETH/USDT
    """
    if 'BTC' in primary_symbol:
        return 'ETH/USDT'
    else:
        return 'BTC/USDT'
```
---
## Data Structure
```python
market_state = {
    'symbol': 'ETH/USDT',
    'timestamp': '2025-10-27 14:00:00',

    # Primary symbol: 4 timeframes × 600 candles
    'timeframes': {
        '1s': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1m': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1h': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1d': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]}
    },

    'secondary_symbol': 'BTC/USDT',

    # Secondary symbol: 1 timeframe × 600 candles
    'secondary_timeframes': {
        '1m': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]}
    }
}
```
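Each timeframe entry can be densified into an array for model input. A sketch of one way to do this; the helper name and the `(N, 5)` layout are assumptions, not the project's actual code:

```python
import numpy as np

def timeframe_to_array(tf: dict) -> np.ndarray:
    """Stack one timeframe's OHLCV lists into an (N, 5) float64 array.

    Hypothetical helper; the real adapter may use a different layout.
    """
    return np.stack(
        [tf['open'], tf['high'], tf['low'], tf['close'], tf['volume']],
        axis=1,
    ).astype(np.float64)

# Toy example with 3 candles instead of 600
tf = {'timestamps': [1, 2, 3],
      'open': [1.0, 2.0, 3.0], 'high': [1.5, 2.5, 3.5],
      'low': [0.5, 1.5, 2.5], 'close': [1.2, 2.2, 3.2],
      'volume': [10.0, 20.0, 30.0]}
arr = timeframe_to_array(tf)
print(arr.shape)  # (3, 5)
```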
---
## Key Features
### 1. Fixed Candle Count ✅
- Always fetches 600 candles per batch
- Configurable via `candles_per_timeframe` parameter
- Consistent data structure for all models
### 2. Historical Data Fetching ✅
- Fetches data at annotation timestamp (not current)
- Uses DuckDB for historical queries
- Fallback to replay and latest data
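The fallback chain (DuckDB → replay → latest) could be sketched like this; every function here is a hypothetical stand-in for the adapter's own loaders, with stubs simulating a DuckDB miss:

```python
import logging

logger = logging.getLogger(__name__)

# Stub loaders standing in for the real DuckDB / replay / latest-data sources.
def load_from_duckdb(symbol, timeframe, timestamp, limit):
    return []  # pretend DuckDB has no rows for this window

def load_from_replay(symbol, timeframe, timestamp, limit):
    return [{'close': 2500.0}] * limit  # pretend the replay source succeeds

def load_latest(symbol, timeframe, timestamp, limit):
    return [{'close': 2500.0}] * limit

def fetch_candles(symbol, timeframe, timestamp, limit=600):
    """Try historical data first, then replay, then the latest candles."""
    for loader in (load_from_duckdb, load_from_replay, load_latest):
        candles = loader(symbol, timeframe, timestamp, limit)
        if candles:
            return candles
        logger.warning("%s returned nothing for %s %s, trying next source",
                       loader.__name__, symbol, timeframe)
    return []

candles = fetch_candles('ETH/USDT', '1m', '2025-10-27 14:00:00')
print(len(candles))  # 600
```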
### 3. Multi-Symbol Support ✅
- Primary symbol: All timeframes
- Secondary symbol: 1m only (for correlation)
- Automatic symbol pairing
### 4. Time Window Calculation ✅
```python
# Window covered by 600 candles per timeframe
time_windows = {
    '1s': '600 seconds = 10 minutes',
    '1m': '600 minutes = 10 hours',
    '1h': '600 hours = 25 days',
    '1d': '600 days = ~1.6 years',
}
```
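The window start for each batch follows directly from the timeframe step. A sketch (the real adapter may compute this differently):

```python
from datetime import datetime, timedelta

STEP = {'1s': timedelta(seconds=1), '1m': timedelta(minutes=1),
        '1h': timedelta(hours=1), '1d': timedelta(days=1)}

def window_start(annotation_ts: datetime, timeframe: str,
                 candles: int = 600) -> datetime:
    """Earliest timestamp needed to cover `candles` bars ending at the annotation."""
    return annotation_ts - candles * STEP[timeframe]

ts = datetime(2025, 10, 27, 14, 0)
print(window_start(ts, '1h'))  # 2025-10-02 14:00:00 (600 hours = 25 days earlier)
```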
---
## Example Training Log
```
Fetching HISTORICAL market state for ETH/USDT at 2025-10-27 14:00:00
Primary symbol: ETH/USDT - Timeframes: ['1s', '1m', '1h', '1d']
Secondary symbol: BTC/USDT - Timeframe: 1m
Candles per batch: 600
Fetching primary symbol data: ETH/USDT
ETH/USDT 1s: 600 candles
ETH/USDT 1m: 600 candles
ETH/USDT 1h: 600 candles
ETH/USDT 1d: 600 candles
Fetching secondary symbol data: BTC/USDT (1m)
BTC/USDT 1m: 600 candles
✓ Fetched 4 primary timeframes (2400 total candles)
✓ Fetched 1 secondary timeframes (600 total candles)
Test case 1: ENTRY sample - LONG @ 2500.0
Test case 1: Added 30 HOLD samples (during position)
Test case 1: Added 30 NO_TRADE samples (±15 candles)
→ 15 before signal, 15 after signal
```
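The ±15-candle NO_TRADE sampling in the log could be produced by something like this (purely illustrative; the real sampler lives in the training adapter):

```python
def no_trade_indices(signal_idx: int, window: int = 15) -> list[int]:
    """Indices of NO_TRADE samples: `window` candles before and after the signal."""
    before = list(range(signal_idx - window, signal_idx))
    after = list(range(signal_idx + 1, signal_idx + 1 + window))
    return before + after

idx = no_trade_indices(100)
print(len(idx))  # 30 samples: 15 before, 15 after
```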
---
## Memory & Storage
### Per Annotation
- **Values**: 18,000 (3,000 candles × 6 fields: timestamp + OHLCV)
- **Memory**: ~144 KB (float64)
- **Disk**: Minimal (metadata only, data fetched from DuckDB)
### 100 Annotations
- **Memory**: ~14.4 MB
- **Training batches**: ~12,250 (with repetitions)
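The memory figures above follow directly from float64 sizing:

```python
candles_per_annotation = 3_000  # 5 batches × 600 candles
fields = 6                      # timestamp + OHLCV
bytes_per_value = 8             # float64

per_annotation = candles_per_annotation * fields * bytes_per_value
print(per_annotation)        # 144000 bytes ≈ 144 KB
print(per_annotation * 100)  # 14400000 bytes ≈ 14.4 MB for 100 annotations
```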
---
## Integration Points
### 1. Annotation Manager
```python
# Saves lightweight metadata only
test_case = {
    'symbol': 'ETH/USDT',
    'timestamp': '2025-10-27 14:00',
    'training_config': {
        'timeframes': ['1s', '1m', '1h', '1d'],
        'candles_per_timeframe': 600
    }
}
```
### 2. Real Training Adapter
```python
# Fetches full OHLCV data dynamically
market_state = _fetch_market_state_for_test_case(test_case)
# Returns 3,000 candles (5 batches × 600)
```
### 3. Model Training
```python
# Converts to model input format
batch = _convert_annotation_to_transformer_batch(training_sample)
# Uses all 3,000 candles for context
```
---
## Configuration
### Default Settings
```python
candles_per_timeframe = 600
timeframes = ['1s', '1m', '1h', '1d']
```
### Adjustable
```python
# Reduce for faster training
candles_per_timeframe = 300
# Increase for more context
candles_per_timeframe = 1000
# Limit timeframes
timeframes = ['1m', '1h']
```
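Note that scaling `candles_per_timeframe` scales every time window proportionally; for the 1m batch, for example:

```python
# Context window of the 1m batch at different candles_per_timeframe settings
for candles_per_timeframe in (300, 600, 1000):
    hours = candles_per_timeframe / 60  # one candle per minute
    print(f"{candles_per_timeframe} candles -> {hours:g} hours of 1m data")
```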
---
## Validation
### Data Quality Checks
- ✅ Minimum 500 candles per batch (83% threshold)
- ✅ Continuous timestamps (no large gaps)
- ✅ Valid OHLCV values (no NaN/Inf)
- ✅ Secondary symbol data available
### Warning Conditions
```python
if len(candles) < 500:
    logger.warning("Insufficient data")
if len(candles) < 300:
    logger.error("Critical: skipping batch")
```
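The "no NaN/Inf" quality check could be sketched as follows (the helper name is hypothetical):

```python
import math

def ohlcv_is_finite(tf: dict) -> bool:
    """True if every OHLCV value in one timeframe dict is a finite number."""
    for field in ('open', 'high', 'low', 'close', 'volume'):
        if any(not math.isfinite(v) for v in tf[field]):
            return False
    return True

good = {'open': [1.0], 'high': [1.5], 'low': [0.5], 'close': [1.2], 'volume': [10.0]}
bad = dict(good, close=[float('nan')])
print(ohlcv_is_finite(good), ohlcv_is_finite(bad))  # True False
```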
---
## Files Modified
1. **ANNOTATE/core/real_training_adapter.py**
- Added `_get_secondary_symbol()` method
- Updated `_fetch_market_state_for_test_case()` to fetch 5 batches
- Fixed candle count to 600 per batch
- Added secondary symbol fetching
---
## Documentation Created
1. **ANNOTATE/DATA_STRUCTURE_SPECIFICATION.md**
- Complete data structure specification
- Symbol pairing rules
- Time window calculations
- Integration guide
2. **ANNOTATE/CONTINUOUS_DATA_TRAINING_STRATEGY.md**
- Training strategy explanation
- Negative sampling details
- Sample distribution
3. **ANNOTATE/DATA_LOADING_ARCHITECTURE.md**
- Storage architecture
- Dynamic loading strategy
- Troubleshooting guide
---
## Summary
- **5 batches** of 600 candles each
- **Primary symbol**: 4 timeframes (1s, 1m, 1h, 1d)
- **Secondary symbol**: 1 timeframe (1m) - BTC or ETH
- **3,000 total candles** per annotation
- **Historical data** from DuckDB at annotation timestamp
- **Automatic symbol pairing** (ETH→BTC, BTC→ETH)
- **Fallback strategy** for missing data
- **~144 KB memory** per annotation
- **Continuous training** with negative sampling
The system now properly fetches and structures data according to the BaseDataInput specification!