fetching data from the DB to train
247
ANNOTATE/FINAL_DATA_STRUCTURE_SUMMARY.md
Normal file
@@ -0,0 +1,247 @@
# Final Data Structure Implementation Summary

## What Was Implemented

### ✅ 5 Batches of 600 Candles Each

**Primary Symbol** (e.g., ETH/USDT):
- 1s timeframe: 600 candles (10 minutes of data)
- 1m timeframe: 600 candles (10 hours of data)
- 1h timeframe: 600 candles (25 days of data)
- 1d timeframe: 600 candles (~1.6 years of data)

**Secondary Symbol** (BTC/USDT or ETH/USDT):
- 1m timeframe: 600 candles (10 hours of data)

**Total**: 3,000 candles per annotation
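
The total follows directly from the batch layout: four primary timeframes plus one secondary timeframe, each holding the same number of candles. A minimal sketch of that arithmetic, where the constant and function names are hypothetical and not taken from the codebase:

```python
# Hypothetical constants mirroring the batch layout listed above.
CANDLES_PER_TIMEFRAME = 600
PRIMARY_TIMEFRAMES = ["1s", "1m", "1h", "1d"]
SECONDARY_TIMEFRAMES = ["1m"]


def total_candles(candles_per_timeframe=CANDLES_PER_TIMEFRAME):
    """Return (batch_count, total_candles) for one annotation."""
    batches = len(PRIMARY_TIMEFRAMES) + len(SECONDARY_TIMEFRAMES)
    return batches, batches * candles_per_timeframe


batches, total = total_candles()
print(batches, total)  # 5 3000
```

Halving `candles_per_timeframe` to 300 would give 1,500 candles per annotation under the same layout.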

---

## Symbol Pairing Logic

```python
def _get_secondary_symbol(primary_symbol):
    """
    Return the secondary symbol paired with the primary one.

    ETH/USDT → BTC/USDT
    SOL/USDT → BTC/USDT
    BTC/USDT → ETH/USDT
    """
    # Any BTC pair is paired with ETH; everything else is paired with BTC.
    if 'BTC' in primary_symbol:
        return 'ETH/USDT'
    else:
        return 'BTC/USDT'
```
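
Note that `'BTC' in primary_symbol` is a substring test, so a symbol such as `WBTC/USDT` would also be paired with ETH. Whether that is intended is not stated here. If exact matching on the base asset is preferred, a variant could look like the sketch below; `get_secondary_symbol` is a hypothetical name and assumes symbols are written as `BASE/QUOTE`:

```python
def get_secondary_symbol(primary_symbol: str) -> str:
    """Pair BTC with ETH and everything else with BTC, matching the base asset exactly."""
    # Assumes the "BASE/QUOTE" format used throughout this document.
    base = primary_symbol.split("/")[0].upper()
    return "ETH/USDT" if base == "BTC" else "BTC/USDT"


for symbol in ("ETH/USDT", "SOL/USDT", "BTC/USDT", "WBTC/USDT"):
    print(symbol, "->", get_secondary_symbol(symbol))
```

With this variant `WBTC/USDT` is paired with `BTC/USDT`, while the three documented pairings are unchanged.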

---

## Data Structure

```python
market_state = {
    'symbol': 'ETH/USDT',
    'timestamp': '2025-10-27 14:00:00',

    # Primary symbol: 4 timeframes × 600 candles
    'timeframes': {
        '1s': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1m': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1h': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1d': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]}
    },

    'secondary_symbol': 'BTC/USDT',

    # Secondary symbol: 1 timeframe × 600 candles
    'secondary_timeframes': {
        '1m': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]}
    }
}
```

---

## Key Features

### 1. Fixed Candle Count ✅
- Fetches 600 candles per batch by default
- Configurable via the `candles_per_timeframe` parameter
- Consistent data structure for all models

### 2. Historical Data Fetching ✅
- Fetches data as of the annotation timestamp, not the current time
- Uses DuckDB for historical queries
- Falls back to replay and latest data when historical data is missing
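
The fallback behaviour can be pictured as an ordered list of data sources tried one after another. The sketch below assumes the order DuckDB, then replay, then latest data, which is the order the bullets list them in. All function names are hypothetical stubs and no real DuckDB call is made:

```python
def fetch_with_fallback(symbol, timeframe, timestamp, limit, sources):
    """Try each (name, fetcher) in order; return (name, candles) from the first with data."""
    for name, fetch in sources:
        candles = fetch(symbol, timeframe, timestamp, limit)
        if candles:
            return name, candles
    return None, []


# Hypothetical stub fetchers standing in for the real data sources.
def from_duckdb(symbol, timeframe, timestamp, limit):
    return []  # pretend the historical query found nothing


def from_replay(symbol, timeframe, timestamp, limit):
    return [{"close": 2500.0}] * limit  # synthetic candles


def from_latest(symbol, timeframe, timestamp, limit):
    return [{"close": 2501.0}] * limit  # synthetic candles


source, candles = fetch_with_fallback(
    "ETH/USDT", "1m", "2025-10-27 14:00:00", 600,
    [("duckdb", from_duckdb), ("replay", from_replay), ("latest", from_latest)],
)
print(source, len(candles))  # replay 600
```

Returning the source name alongside the candles makes it possible to log when training data did not come from the historical store.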

### 3. Multi-Symbol Support ✅
- Primary symbol: All timeframes
- Secondary symbol: 1m only (for correlation)
- Automatic symbol pairing

### 4. Time Window Calculation ✅
```python
from datetime import timedelta

# Time span covered by 600 candles at each timeframe
time_windows = {
    '1s': timedelta(seconds=600),  # 10 minutes
    '1m': timedelta(minutes=600),  # 10 hours
    '1h': timedelta(hours=600),    # 25 days
    '1d': timedelta(days=600),     # ~1.6 years
}
```
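
Given an annotation timestamp, these spans translate into query bounds. The sketch below assumes the window ends at the annotation timestamp. This document does not say whether the real adapter anchors the window that way, and `window_for` is a hypothetical helper:

```python
from datetime import datetime, timedelta

TIMEFRAME_SECONDS = {"1s": 1, "1m": 60, "1h": 3600, "1d": 86400}


def window_for(timestamp: str, timeframe: str, candles: int = 600):
    """Return (start, end) of a window of `candles` candles ending at `timestamp`."""
    end = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    start = end - timedelta(seconds=TIMEFRAME_SECONDS[timeframe] * candles)
    return start, end


for tf in TIMEFRAME_SECONDS:
    start, end = window_for("2025-10-27 14:00:00", tf)
    print(tf, start, "->", end)
```

For the 1m timeframe this yields a window starting at 2025-10-27 04:00:00, ten hours before the annotation.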

---

## Example Training Log

```
Fetching HISTORICAL market state for ETH/USDT at 2025-10-27 14:00:00
Primary symbol: ETH/USDT - Timeframes: ['1s', '1m', '1h', '1d']
Secondary symbol: BTC/USDT - Timeframe: 1m
Candles per batch: 600

Fetching primary symbol data: ETH/USDT
ETH/USDT 1s: 600 candles
ETH/USDT 1m: 600 candles
ETH/USDT 1h: 600 candles
ETH/USDT 1d: 600 candles

Fetching secondary symbol data: BTC/USDT (1m)
BTC/USDT 1m: 600 candles

✓ Fetched 4 primary timeframes (2400 total candles)
✓ Fetched 1 secondary timeframes (600 total candles)

Test case 1: ENTRY sample - LONG @ 2500.0
Test case 1: Added 30 HOLD samples (during position)
Test case 1: Added 30 NO_TRADE samples (±15 candles)
→ 15 before signal, 15 after signal
```
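
The last two log lines describe negative sampling around the entry signal: 15 candles before and 15 after, 30 in total. One way such indices could be chosen is sketched below. This is an illustration inferred from the log only, not the adapter's actual logic, and `no_trade_indices` is a hypothetical name:

```python
def no_trade_indices(signal_index, total_candles, window=15):
    """Candle indices within ±window of the signal, excluding the signal itself."""
    lo = max(0, signal_index - window)
    hi = min(total_candles - 1, signal_index + window)
    return [i for i in range(lo, hi + 1) if i != signal_index]


idx = no_trade_indices(signal_index=300, total_candles=600)
print(len(idx), idx[0], idx[-1])  # 30 285 315
```

Near either end of the series the clamping yields fewer than 30 indices, which a real implementation would need to account for.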

---

## Memory & Storage

### Per Annotation
- **Values**: 18,000 (3,000 candles × 6 fields: timestamp plus OHLCV)
- **Memory**: ~144 KB (18,000 values × 8 bytes as float64)
- **Disk**: Minimal (metadata only, data fetched from DuckDB)

### 100 Annotations
- **Memory**: ~14.4 MB
- **Training batches**: ~12,250 (with repetitions)
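
The memory figures can be reproduced with a few multiplications. The sketch below uses decimal units (1 KB = 1,000 bytes) and counts only the raw float64 payload. Container overhead, such as Python list objects, is ignored. The function name is hypothetical:

```python
def annotation_memory_bytes(batches=5, candles=600, fields=6, bytes_per_value=8):
    """Raw payload size of one annotation, assuming every value is stored as float64."""
    return batches * candles * fields * bytes_per_value


per_annotation = annotation_memory_bytes()
print(per_annotation / 1000)               # 144.0 (KB)
print(100 * per_annotation / 1_000_000)    # 14.4 (MB)
```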

---

## Integration Points

### 1. Annotation Manager
```python
# Saves lightweight metadata only
test_case = {
    'symbol': 'ETH/USDT',
    'timestamp': '2025-10-27 14:00',
    'training_config': {
        'timeframes': ['1s', '1m', '1h', '1d'],
        'candles_per_timeframe': 600
    }
}
```

### 2. Real Training Adapter
```python
# Fetches full OHLCV data dynamically
market_state = _fetch_market_state_for_test_case(test_case)
# Returns 3,000 candles (5 batches × 600)
```

### 3. Model Training
```python
# Converts to model input format
batch = _convert_annotation_to_transformer_batch(training_sample)
# Uses all 3,000 candles for context
```

---

## Configuration

### Default Settings
```python
candles_per_timeframe = 600
timeframes = ['1s', '1m', '1h', '1d']
```

### Adjustable
```python
# Reduce for faster training
candles_per_timeframe = 300

# Increase for more context
candles_per_timeframe = 1000

# Limit timeframes
timeframes = ['1m', '1h']
```

---

## Validation

### Data Quality Checks
- ✅ Minimum 500 candles per batch (83% of the 600 requested)
- ✅ Continuous timestamps (no large gaps)
- ✅ Valid OHLCV values (no NaN/Inf)
- ✅ Secondary symbol data available

### Warning Conditions
```python
import logging

logger = logging.getLogger(__name__)

# `candles` is the list of candles fetched for one batch
if len(candles) < 500:
    logger.warning("Insufficient data")

if len(candles) < 300:
    logger.error("Critical: skipping batch")
```
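
The quality checks listed above could be combined into a single helper that reports every problem found in a batch. In the sketch below, timestamps are assumed to be epoch seconds, and `max_gap_steps=5` is an arbitrary illustrative threshold because this document does not define what counts as a "large" gap. `check_batch` is a hypothetical name:

```python
import math


def check_batch(timestamps, values, expected_step, min_candles=500, max_gap_steps=5):
    """Return a list of problems found in one batch; an empty list means it passed."""
    problems = []
    if len(timestamps) < min_candles:
        problems.append(f"only {len(timestamps)} candles (< {min_candles})")
    if any(not math.isfinite(v) for v in values):
        problems.append("non-finite value (NaN/Inf)")
    # Compare each timestamp with its predecessor to spot gaps.
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > expected_step * max_gap_steps:
            problems.append(f"gap of {cur - prev}s after t={prev}")
    return problems


ts = list(range(0, 600 * 60, 60))      # 600 synthetic one-minute timestamps
closes = [2500.0] * 600                # synthetic close prices
print(check_batch(ts, closes, expected_step=60))  # []
```

Returning a list instead of raising lets the caller decide whether a given problem warrants a warning or skipping the batch.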

---

## Files Modified

1. **ANNOTATE/core/real_training_adapter.py**
   - Added `_get_secondary_symbol()` method
   - Updated `_fetch_market_state_for_test_case()` to fetch 5 batches
   - Fixed candle count to 600 per batch
   - Added secondary symbol fetching

---

## Documentation Created

1. **ANNOTATE/DATA_STRUCTURE_SPECIFICATION.md**
   - Complete data structure specification
   - Symbol pairing rules
   - Time window calculations
   - Integration guide

2. **ANNOTATE/CONTINUOUS_DATA_TRAINING_STRATEGY.md**
   - Training strategy explanation
   - Negative sampling details
   - Sample distribution

3. **ANNOTATE/DATA_LOADING_ARCHITECTURE.md**
   - Storage architecture
   - Dynamic loading strategy
   - Troubleshooting guide

---

## Summary

- ✅ **5 batches** of 600 candles each
- ✅ **Primary symbol**: 4 timeframes (1s, 1m, 1h, 1d)
- ✅ **Secondary symbol**: 1 timeframe (1m) - BTC or ETH
- ✅ **3,000 total candles** per annotation
- ✅ **Historical data** from DuckDB at annotation timestamp
- ✅ **Automatic symbol pairing** (ETH→BTC, BTC→ETH)
- ✅ **Fallback strategy** for missing data
- ✅ **144 KB memory** per annotation
- ✅ **Continuous training** with negative sampling

The system now properly fetches and structures data according to the `BaseDataInput` specification!