fetching data from the DB to train

This commit is contained in:
Dobromir Popov
2025-10-31 03:14:35 +02:00
parent 07150fd019
commit 6ac324289c
6 changed files with 1113 additions and 46 deletions

@@ -0,0 +1,247 @@
# Final Data Structure Implementation Summary
## What Was Implemented
### ✅ 5 Batches of 600 Candles Each
**Primary Symbol** (e.g., ETH/USDT):
- 1s timeframe: 600 candles (10 minutes of data)
- 1m timeframe: 600 candles (10 hours of data)
- 1h timeframe: 600 candles (25 days of data)
- 1d timeframe: 600 candles (~1.6 years of data)
**Secondary Symbol** (BTC/USDT or ETH/USDT):
- 1m timeframe: 600 candles (10 hours of data)
**Total**: 3,000 candles per annotation
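The candle arithmetic behind these batches, as a minimal sketch (the constant names are illustrative, not the adapter's actual identifiers):

```python
CANDLES_PER_BATCH = 600
PRIMARY_TIMEFRAMES = ['1s', '1m', '1h', '1d']   # primary symbol
SECONDARY_TIMEFRAMES = ['1m']                   # secondary symbol

batches = len(PRIMARY_TIMEFRAMES) + len(SECONDARY_TIMEFRAMES)
total_candles = batches * CANDLES_PER_BATCH
print(batches, total_candles)  # 5 3000
```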
---
## Symbol Pairing Logic
```python
def _get_secondary_symbol(primary_symbol):
    """
    ETH/USDT → BTC/USDT
    SOL/USDT → BTC/USDT
    BTC/USDT → ETH/USDT
    """
    if 'BTC' in primary_symbol:
        return 'ETH/USDT'
    else:
        return 'BTC/USDT'
```
---
## Data Structure
```python
market_state = {
    'symbol': 'ETH/USDT',
    'timestamp': '2025-10-27 14:00:00',

    # Primary symbol: 4 timeframes × 600 candles
    'timeframes': {
        '1s': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1m': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1h': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1d': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]}
    },

    'secondary_symbol': 'BTC/USDT',

    # Secondary symbol: 1 timeframe × 600 candles
    'secondary_timeframes': {
        '1m': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]}
    }
}
```
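Each timeframe entry can be densified into an array for model input. A sketch of one way to do this; the helper name and the `(N, 5)` layout are assumptions, not the project's actual code:

```python
import numpy as np

def timeframe_to_array(tf: dict) -> np.ndarray:
    """Stack one timeframe's OHLCV lists into an (N, 5) float64 array.

    Hypothetical helper; the real adapter may use a different layout.
    """
    return np.stack(
        [tf['open'], tf['high'], tf['low'], tf['close'], tf['volume']],
        axis=1,
    ).astype(np.float64)

# Toy example with 3 candles instead of 600
tf = {'timestamps': [1, 2, 3],
      'open': [1.0, 2.0, 3.0], 'high': [1.5, 2.5, 3.5],
      'low': [0.5, 1.5, 2.5], 'close': [1.2, 2.2, 3.2],
      'volume': [10.0, 20.0, 30.0]}
arr = timeframe_to_array(tf)
print(arr.shape)  # (3, 5)
```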
---
## Key Features
### 1. Fixed Candle Count ✅
- Always fetches 600 candles per batch
- Configurable via `candles_per_timeframe` parameter
- Consistent data structure for all models
### 2. Historical Data Fetching ✅
- Fetches data at annotation timestamp (not current)
- Uses DuckDB for historical queries
- Fallback to replay and latest data
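The fallback chain (DuckDB → replay → latest) could be sketched like this; every function here is a hypothetical stand-in for the adapter's own loaders, with stubs simulating a DuckDB miss:

```python
import logging

logger = logging.getLogger(__name__)

# Stub loaders standing in for the real DuckDB / replay / latest-data sources.
def load_from_duckdb(symbol, timeframe, timestamp, limit):
    return []  # pretend DuckDB has no rows for this window

def load_from_replay(symbol, timeframe, timestamp, limit):
    return [{'close': 2500.0}] * limit  # pretend the replay source succeeds

def load_latest(symbol, timeframe, timestamp, limit):
    return [{'close': 2500.0}] * limit

def fetch_candles(symbol, timeframe, timestamp, limit=600):
    """Try historical data first, then replay, then the latest candles."""
    for loader in (load_from_duckdb, load_from_replay, load_latest):
        candles = loader(symbol, timeframe, timestamp, limit)
        if candles:
            return candles
        logger.warning("%s returned nothing for %s %s, trying next source",
                       loader.__name__, symbol, timeframe)
    return []

candles = fetch_candles('ETH/USDT', '1m', '2025-10-27 14:00:00')
print(len(candles))  # 600
```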
### 3. Multi-Symbol Support ✅
- Primary symbol: All timeframes
- Secondary symbol: 1m only (for correlation)
- Automatic symbol pairing
### 4. Time Window Calculation ✅
```python
# Window covered by 600 candles per timeframe
time_windows = {
    '1s': '600 seconds = 10 minutes',
    '1m': '600 minutes = 10 hours',
    '1h': '600 hours = 25 days',
    '1d': '600 days = ~1.6 years',
}
```
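The window start for each batch follows directly from the timeframe step. A sketch (the real adapter may compute this differently):

```python
from datetime import datetime, timedelta

STEP = {'1s': timedelta(seconds=1), '1m': timedelta(minutes=1),
        '1h': timedelta(hours=1), '1d': timedelta(days=1)}

def window_start(annotation_ts: datetime, timeframe: str,
                 candles: int = 600) -> datetime:
    """Earliest timestamp needed to cover `candles` bars ending at the annotation."""
    return annotation_ts - candles * STEP[timeframe]

ts = datetime(2025, 10, 27, 14, 0)
print(window_start(ts, '1h'))  # 2025-10-02 14:00:00 (600 hours = 25 days earlier)
```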
---
## Example Training Log
```
Fetching HISTORICAL market state for ETH/USDT at 2025-10-27 14:00:00
Primary symbol: ETH/USDT - Timeframes: ['1s', '1m', '1h', '1d']
Secondary symbol: BTC/USDT - Timeframe: 1m
Candles per batch: 600
Fetching primary symbol data: ETH/USDT
ETH/USDT 1s: 600 candles
ETH/USDT 1m: 600 candles
ETH/USDT 1h: 600 candles
ETH/USDT 1d: 600 candles
Fetching secondary symbol data: BTC/USDT (1m)
BTC/USDT 1m: 600 candles
✓ Fetched 4 primary timeframes (2400 total candles)
✓ Fetched 1 secondary timeframes (600 total candles)
Test case 1: ENTRY sample - LONG @ 2500.0
Test case 1: Added 30 HOLD samples (during position)
Test case 1: Added 30 NO_TRADE samples (±15 candles)
→ 15 before signal, 15 after signal
```
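The ±15-candle NO_TRADE sampling in the log could be produced by something like this (purely illustrative; the real sampler lives in the training adapter):

```python
def no_trade_indices(signal_idx: int, window: int = 15) -> list[int]:
    """Indices of NO_TRADE samples: `window` candles before and after the signal."""
    before = list(range(signal_idx - window, signal_idx))
    after = list(range(signal_idx + 1, signal_idx + 1 + window))
    return before + after

idx = no_trade_indices(100)
print(len(idx))  # 30 samples: 15 before, 15 after
```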
---
## Memory & Storage
### Per Annotation
- **Values**: 18,000 (3,000 candles × 6 fields: timestamp + OHLCV)
- **Memory**: ~144 KB (float64)
- **Disk**: Minimal (metadata only, data fetched from DuckDB)
### 100 Annotations
- **Memory**: ~14.4 MB
- **Training batches**: ~12,250 (with repetitions)
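The memory figures above follow directly from float64 sizing:

```python
candles_per_annotation = 3_000  # 5 batches × 600 candles
fields = 6                      # timestamp + OHLCV
bytes_per_value = 8             # float64

per_annotation = candles_per_annotation * fields * bytes_per_value
print(per_annotation)        # 144000 bytes ≈ 144 KB
print(per_annotation * 100)  # 14400000 bytes ≈ 14.4 MB for 100 annotations
```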
---
## Integration Points
### 1. Annotation Manager
```python
# Saves lightweight metadata only
test_case = {
    'symbol': 'ETH/USDT',
    'timestamp': '2025-10-27 14:00',
    'training_config': {
        'timeframes': ['1s', '1m', '1h', '1d'],
        'candles_per_timeframe': 600
    }
}
```
### 2. Real Training Adapter
```python
# Fetches full OHLCV data dynamically
market_state = _fetch_market_state_for_test_case(test_case)
# Returns 3,000 candles (5 batches × 600)
```
### 3. Model Training
```python
# Converts to model input format
batch = _convert_annotation_to_transformer_batch(training_sample)
# Uses all 3,000 candles for context
```
---
## Configuration
### Default Settings
```python
candles_per_timeframe = 600
timeframes = ['1s', '1m', '1h', '1d']
```
### Adjustable
```python
# Reduce for faster training
candles_per_timeframe = 300
# Increase for more context
candles_per_timeframe = 1000
# Limit timeframes
timeframes = ['1m', '1h']
```
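Note that scaling `candles_per_timeframe` scales every time window proportionally; for the 1m batch, for example:

```python
# Context window of the 1m batch at different candles_per_timeframe settings
for candles_per_timeframe in (300, 600, 1000):
    hours = candles_per_timeframe / 60  # one candle per minute
    print(f"{candles_per_timeframe} candles -> {hours:g} hours of 1m data")
```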
---
## Validation
### Data Quality Checks
- ✅ Minimum 500 candles per batch (83% threshold)
- ✅ Continuous timestamps (no large gaps)
- ✅ Valid OHLCV values (no NaN/Inf)
- ✅ Secondary symbol data available
### Warning Conditions
```python
if len(candles) < 500:
    logger.warning("Insufficient data")
if len(candles) < 300:
    logger.error("Critical: skipping batch")
```
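The "no NaN/Inf" quality check could be sketched as follows (the helper name is hypothetical):

```python
import math

def ohlcv_is_finite(tf: dict) -> bool:
    """True if every OHLCV value in one timeframe dict is a finite number."""
    for field in ('open', 'high', 'low', 'close', 'volume'):
        if any(not math.isfinite(v) for v in tf[field]):
            return False
    return True

good = {'open': [1.0], 'high': [1.5], 'low': [0.5], 'close': [1.2], 'volume': [10.0]}
bad = dict(good, close=[float('nan')])
print(ohlcv_is_finite(good), ohlcv_is_finite(bad))  # True False
```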
---
## Files Modified
1. **ANNOTATE/core/real_training_adapter.py**
- Added `_get_secondary_symbol()` method
- Updated `_fetch_market_state_for_test_case()` to fetch 5 batches
- Fixed candle count to 600 per batch
- Added secondary symbol fetching
---
## Documentation Created
1. **ANNOTATE/DATA_STRUCTURE_SPECIFICATION.md**
- Complete data structure specification
- Symbol pairing rules
- Time window calculations
- Integration guide
2. **ANNOTATE/CONTINUOUS_DATA_TRAINING_STRATEGY.md**
- Training strategy explanation
- Negative sampling details
- Sample distribution
3. **ANNOTATE/DATA_LOADING_ARCHITECTURE.md**
- Storage architecture
- Dynamic loading strategy
- Troubleshooting guide
---
## Summary
- **5 batches** of 600 candles each
- **Primary symbol**: 4 timeframes (1s, 1m, 1h, 1d)
- **Secondary symbol**: 1 timeframe (1m) - BTC or ETH
- **3,000 total candles** per annotation
- **Historical data** from DuckDB at annotation timestamp
- **Automatic symbol pairing** (ETH→BTC, BTC→ETH)
- **Fallback strategy** for missing data
- **~144 KB memory** per annotation
- **Continuous training** with negative sampling
The system now properly fetches and structures data according to the BaseDataInput specification!