Final Data Structure Implementation Summary
What Was Implemented
✅ 5 Batches of 600 Candles Each
Primary Symbol (e.g., ETH/USDT):
- 1s timeframe: 600 candles (10 minutes of data)
- 1m timeframe: 600 candles (10 hours of data)
- 1h timeframe: 600 candles (25 days of data)
- 1d timeframe: 600 candles (~1.6 years of data)
Secondary Symbol (BTC/USDT or ETH/USDT):
- 1m timeframe: 600 candles (10 hours of data)
Total: 3,000 candles per annotation
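A minimal sketch of this five-batch layout (build_fetch_plan is illustrative, not part of the adapter):

# Illustrative sketch of the five (symbol, timeframe) batches fetched per annotation
CANDLES_PER_BATCH = 600
PRIMARY_TIMEFRAMES = ['1s', '1m', '1h', '1d']

def build_fetch_plan(primary_symbol: str, secondary_symbol: str):
    """Return the list of (symbol, timeframe, count) batches for one annotation."""
    plan = [(primary_symbol, tf, CANDLES_PER_BATCH) for tf in PRIMARY_TIMEFRAMES]
    plan.append((secondary_symbol, '1m', CANDLES_PER_BATCH))
    return plan

plan = build_fetch_plan('ETH/USDT', 'BTC/USDT')
assert sum(count for _, _, count in plan) == 3000  # 5 batches × 600 candles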
Symbol Pairing Logic
def _get_secondary_symbol(primary_symbol):
    """
    ETH/USDT → BTC/USDT
    SOL/USDT → BTC/USDT
    BTC/USDT → ETH/USDT
    """
    if 'BTC' in primary_symbol:
        return 'ETH/USDT'
    else:
        return 'BTC/USDT'
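For example, the pairing resolves as:

_get_secondary_symbol('ETH/USDT')  # -> 'BTC/USDT'
_get_secondary_symbol('SOL/USDT')  # -> 'BTC/USDT'
_get_secondary_symbol('BTC/USDT')  # -> 'ETH/USDT'

Any primary symbol without BTC in its name pairs with BTC/USDT, so new pairs need no code changes.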
Data Structure
market_state = {
    'symbol': 'ETH/USDT',
    'timestamp': '2025-10-27 14:00:00',
    # Primary symbol: 4 timeframes × 600 candles
    'timeframes': {
        '1s': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1m': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1h': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1d': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]}
    },
    'secondary_symbol': 'BTC/USDT',
    # Secondary symbol: 1 timeframe × 600 candles
    'secondary_timeframes': {
        '1m': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]}
    }
}
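Each timeframe entry is a dict of parallel arrays of equal length, so it maps directly onto tabular form. A minimal sketch assuming pandas (timeframe_to_df is illustrative, not part of the adapter):

import pandas as pd

def timeframe_to_df(tf_data: dict) -> pd.DataFrame:
    """Turn one timeframe's parallel OHLCV lists into a DataFrame indexed by timestamp."""
    return pd.DataFrame({
        'open': tf_data['open'],
        'high': tf_data['high'],
        'low': tf_data['low'],
        'close': tf_data['close'],
        'volume': tf_data['volume'],
    }, index=pd.to_datetime(tf_data['timestamps']))

df_1m = timeframe_to_df(market_state['timeframes']['1m'])  # 600 rows once populated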
Key Features
1. Fixed Candle Count ✅
- Always fetches 600 candles per batch
- Configurable via the candles_per_timeframe parameter
- Consistent data structure for all models
2. Historical Data Fetching ✅
- Fetches data at annotation timestamp (not current)
- Uses DuckDB for historical queries
- Falls back to replay data, then to the latest available data (see the fetch sketch after this feature list)
3. Multi-Symbol Support ✅
- Primary symbol: All timeframes
- Secondary symbol: 1m only (for correlation)
- Automatic symbol pairing
4. Time Window Calculation ✅
time_windows = {
    '1s': '10 minutes',   # 600 × 1 second
    '1m': '10 hours',     # 600 × 1 minute
    '1h': '25 days',      # 600 × 1 hour
    '1d': '~1.6 years'    # 600 × 1 day
}
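Combining features 2 and 4, each batch covers the window ending at the annotation timestamp. A minimal sketch of the window calculation and a DuckDB lookup, assuming an open connection and a hypothetical ohlcv table (the real query and fallbacks live in _fetch_market_state_for_test_case()):

from datetime import timedelta

TIMEFRAME_DELTAS = {
    '1s': timedelta(seconds=1),
    '1m': timedelta(minutes=1),
    '1h': timedelta(hours=1),
    '1d': timedelta(days=1),
}

def window_for(timestamp, timeframe, candles=600):
    """Start/end of the lookback window ending at the annotation timestamp."""
    return timestamp - candles * TIMEFRAME_DELTAS[timeframe], timestamp

def fetch_historical(conn, symbol, timeframe, timestamp, candles=600):
    """Query DuckDB for candles ending at `timestamp`; the ohlcv table name is illustrative."""
    start, end = window_for(timestamp, timeframe, candles)
    rows = conn.execute(
        "SELECT timestamp, open, high, low, close, volume "
        "FROM ohlcv WHERE symbol = ? AND timeframe = ? "
        "AND timestamp BETWEEN ? AND ? ORDER BY timestamp DESC LIMIT ?",
        [symbol, timeframe, start, end, candles],
    ).fetchall()
    return list(reversed(rows))  # oldest → newest; caller falls back to replay/latest if empty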
Example Training Log
Fetching HISTORICAL market state for ETH/USDT at 2025-10-27 14:00:00
Primary symbol: ETH/USDT - Timeframes: ['1s', '1m', '1h', '1d']
Secondary symbol: BTC/USDT - Timeframe: 1m
Candles per batch: 600
Fetching primary symbol data: ETH/USDT
ETH/USDT 1s: 600 candles
ETH/USDT 1m: 600 candles
ETH/USDT 1h: 600 candles
ETH/USDT 1d: 600 candles
Fetching secondary symbol data: BTC/USDT (1m)
BTC/USDT 1m: 600 candles
✓ Fetched 4 primary timeframes (2400 total candles)
✓ Fetched 1 secondary timeframes (600 total candles)
Test case 1: ENTRY sample - LONG @ 2500.0
Test case 1: Added 30 HOLD samples (during position)
Test case 1: Added 30 NO_TRADE samples (±15 candles)
→ 15 before signal, 15 after signal
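The HOLD and NO_TRADE counts come from sampling candle indices around each annotated signal; a minimal sketch of that selection (the function name and the position-window simplification are illustrative):

def negative_sample_indices(signal_idx: int, hold_candles: int = 30, no_trade_window: int = 15):
    """Return (hold, no_trade) candle indices around an annotated entry signal."""
    # HOLD samples: candles while the position is open (simplified here as the candles after entry)
    hold = list(range(signal_idx + 1, signal_idx + 1 + hold_candles))
    # NO_TRADE samples: ±15 candles around the signal, excluding the signal itself
    no_trade = [signal_idx + offset
                for offset in range(-no_trade_window, no_trade_window + 1)
                if offset != 0]
    return hold, no_trade

hold, no_trade = negative_sample_indices(signal_idx=400)
assert len(hold) == 30 and len(no_trade) == 30  # matches the log above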
Memory & Storage
Per Annotation
- Values: 18,000 (3,000 candles × 6 fields: timestamp + OHLCV)
- Memory: ~144 KB (float64)
- Disk: Minimal (metadata only, data fetched from DuckDB)
100 Annotations
- Memory: ~14.4 MB
- Training batches: ~12,250 (with repetitions)
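These figures follow from simple arithmetic:

candles_per_annotation = 5 * 600           # 5 batches × 600 candles
values = candles_per_annotation * 6        # timestamp + OHLCV fields per candle
bytes_per_annotation = values * 8          # float64
print(values)                              # 18000 values
print(bytes_per_annotation / 1e3)          # 144.0 KB per annotation
print(100 * bytes_per_annotation / 1e6)    # 14.4 MB for 100 annotations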
Integration Points
1. Annotation Manager
# Saves lightweight metadata only
test_case = {
    'symbol': 'ETH/USDT',
    'timestamp': '2025-10-27 14:00',
    'training_config': {
        'timeframes': ['1s', '1m', '1h', '1d'],
        'candles_per_timeframe': 600
    }
}
2. Real Training Adapter
# Fetches full OHLCV data dynamically
market_state = _fetch_market_state_for_test_case(test_case)
# Returns 3,000 candles (5 batches × 600)
3. Model Training
# Converts to model input format
batch = _convert_annotation_to_transformer_batch(training_sample)
# Uses all 3,000 candles for context
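As an illustration of the shapes involved, a minimal sketch assuming NumPy and the market_state layout above (the array names and layout are assumptions, not the transformer's actual input contract):

import numpy as np

def stack_timeframes(market_state: dict) -> dict:
    """Stack each timeframe's OHLCV lists into a (600, 5) float32 array."""
    batch = {}
    for tf, data in market_state['timeframes'].items():
        batch[tf] = np.stack(
            [data['open'], data['high'], data['low'], data['close'], data['volume']],
            axis=-1,
        ).astype(np.float32)           # shape: (600, 5)
    # Secondary symbol contributes one more (600, 5) array for correlation context
    sec = market_state['secondary_timeframes']['1m']
    batch['secondary_1m'] = np.stack(
        [sec['open'], sec['high'], sec['low'], sec['close'], sec['volume']], axis=-1
    ).astype(np.float32)
    return batch                       # 5 arrays × 600 candles = 3,000 candles of context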
Configuration
Default Settings
candles_per_timeframe = 600
timeframes = ['1s', '1m', '1h', '1d']
Adjustable
# Reduce for faster training
candles_per_timeframe = 300
# Increase for more context
candles_per_timeframe = 1000
# Limit timeframes
timeframes = ['1m', '1h']
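Changing candles_per_timeframe scales every lookback window proportionally, for example:

from datetime import timedelta

deltas = {'1s': timedelta(seconds=1), '1m': timedelta(minutes=1),
          '1h': timedelta(hours=1), '1d': timedelta(days=1)}

for candles in (300, 600, 1000):
    spans = {tf: candles * d for tf, d in deltas.items()}
    print(candles, spans['1m'], spans['1d'])
# 300  -> 5:00:00 of 1m data, 300 days of 1d data
# 600  -> 10:00:00 of 1m data, 600 days (~1.6 years) of 1d data
# 1000 -> 16:40:00 of 1m data, 1000 days (~2.7 years) of 1d data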
Validation
Data Quality Checks
- ✅ Minimum 500 candles per batch (83% threshold)
- ✅ Continuous timestamps (no large gaps)
- ✅ Valid OHLCV values (no NaN/Inf)
- ✅ Secondary symbol data available
Warning Conditions
if len(candles) < 500:
    logger.warning("Insufficient data")
if len(candles) < 300:
    logger.error("Critical: skipping batch")
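The count thresholds above cover only part of the quality checks; a sketch of how the NaN/Inf and timestamp-gap checks could look with NumPy/pandas (the adapter's actual checks may differ):

import numpy as np
import pandas as pd

def validate_batch(tf_data: dict, timeframe_seconds: int, min_candles: int = 500) -> bool:
    """Return True if one batch passes the count, NaN/Inf, and gap checks."""
    closes = np.asarray(tf_data['close'], dtype=np.float64)
    if len(closes) < min_candles:
        return False                              # insufficient data (<83% of 600)
    if not np.isfinite(closes).all():
        return False                              # NaN/Inf values present
    ts = pd.to_datetime(tf_data['timestamps'])
    gaps = np.diff(ts.asi8) / 1e9                 # gaps between candles, in seconds
    if (gaps > 3 * timeframe_seconds).any():
        return False                              # large gap (threshold is illustrative)
    return True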
Files Modified
- ANNOTATE/core/real_training_adapter.py
  - Added _get_secondary_symbol() method
  - Updated _fetch_market_state_for_test_case() to fetch 5 batches
  - Fixed candle count to 600 per batch
  - Added secondary symbol fetching
Documentation Created
- ANNOTATE/DATA_STRUCTURE_SPECIFICATION.md
  - Complete data structure specification
  - Symbol pairing rules
  - Time window calculations
  - Integration guide
- ANNOTATE/CONTINUOUS_DATA_TRAINING_STRATEGY.md
  - Training strategy explanation
  - Negative sampling details
  - Sample distribution
- ANNOTATE/DATA_LOADING_ARCHITECTURE.md
  - Storage architecture
  - Dynamic loading strategy
  - Troubleshooting guide
Summary
✅ 5 batches of 600 candles each
✅ Primary symbol: 4 timeframes (1s, 1m, 1h, 1d)
✅ Secondary symbol: 1 timeframe (1m) - BTC or ETH
✅ 3,000 total candles per annotation
✅ Historical data from DuckDB at annotation timestamp
✅ Automatic symbol pairing (ETH→BTC, BTC→ETH)
✅ Fallback strategy for missing data
✅ 144 KB memory per annotation
✅ Continuous training with negative sampling
The system now properly fetches and structures data according to the BaseDataInput specification!