# Final Data Structure Implementation Summary

## What Was Implemented

### ✅ 5 Batches of 600 Candles Each

**Primary Symbol** (e.g., ETH/USDT):
- 1s timeframe: 600 candles (10 minutes of data)
- 1m timeframe: 600 candles (10 hours of data)
- 1h timeframe: 600 candles (25 days of data)
- 1d timeframe: 600 candles (~1.6 years of data)

**Secondary Symbol** (BTC/USDT or ETH/USDT):
- 1m timeframe: 600 candles (10 hours of data)

**Total**: 3,000 candles per annotation

---

## Symbol Pairing Logic

```python
def _get_secondary_symbol(primary_symbol):
    """
    Return the secondary (correlation) symbol for a primary symbol.

    ETH/USDT → BTC/USDT
    SOL/USDT → BTC/USDT
    BTC/USDT → ETH/USDT
    """
    # BTC pairs are matched with ETH; every other symbol is matched with BTC
    if 'BTC' in primary_symbol:
        return 'ETH/USDT'
    else:
        return 'BTC/USDT'
```

---

## Data Structure

```python
market_state = {
    'symbol': 'ETH/USDT',
    'timestamp': '2025-10-27 14:00:00',

    # Primary symbol: 4 timeframes × 600 candles
    'timeframes': {
        '1s': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1m': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1h': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1d': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]}
    },

    'secondary_symbol': 'BTC/USDT',

    # Secondary symbol: 1 timeframe × 600 candles
    'secondary_timeframes': {
        '1m': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]}
    }
}
```

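
As a quick sanity check on this layout, the sketch below counts the candles held in a `market_state` dictionary and confirms the 3,000 total. The helper names `make_series` and `count_candles` and the synthetic placeholder values are illustrative assumptions, not part of the project code.

```python
def make_series(n):
    """Build a placeholder series with n candles (synthetic values, assumption)."""
    return {
        'timestamps': list(range(n)),
        'open': [1.0] * n,
        'high': [1.0] * n,
        'low': [1.0] * n,
        'close': [1.0] * n,
        'volume': [0.0] * n,
    }


def count_candles(market_state):
    """Return the total number of candles across primary and secondary timeframes."""
    total = 0
    for key in ('timeframes', 'secondary_timeframes'):
        for series in market_state.get(key, {}).values():
            total += len(series['close'])
    return total


# Synthetic state with the same shape as the structure described above
market_state = {
    'symbol': 'ETH/USDT',
    'timestamp': '2025-10-27 14:00:00',
    'timeframes': {tf: make_series(600) for tf in ('1s', '1m', '1h', '1d')},
    'secondary_symbol': 'BTC/USDT',
    'secondary_timeframes': {'1m': make_series(600)},
}

print(count_candles(market_state))  # 5 batches × 600 candles = 3000
```
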
---

## Key Features

### 1. Fixed Candle Count ✅
- Fetches 600 candles per batch by default
- Configurable via the `candles_per_timeframe` parameter
- Consistent data structure for all models

### 2. Historical Data Fetching ✅
- Fetches data as of the annotation timestamp, not the current time
- Uses DuckDB for historical queries
- Falls back to replay data and latest data when the historical query returns nothing

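
The fallback behaviour listed above can be sketched as an ordered chain of data sources. This is a minimal illustration only: `fetch_with_fallback` and the three stub fetchers are hypothetical names, and the source order (DuckDB, then replay, then latest) is an assumption based on the order in which the sources are listed.

```python
def fetch_with_fallback(fetchers, symbol, timeframe, timestamp, limit=600):
    """Try each (name, fetcher) pair in order and return the first non-empty result.

    Returns (source_name, candles), or (None, []) when every source is empty.
    """
    for name, fetcher in fetchers:
        candles = fetcher(symbol, timeframe, timestamp, limit)
        if candles:
            return name, candles
    return None, []


# Hypothetical stand-ins for the real data sources (assumptions for illustration)
def from_duckdb(symbol, timeframe, timestamp, limit):
    return None  # simulate a historical query that finds no rows


def from_replay(symbol, timeframe, timestamp, limit):
    return [{'close': 2500.0}] * limit  # simulate replay data being available


def from_latest(symbol, timeframe, timestamp, limit):
    return [{'close': 2600.0}] * limit


sources = [('duckdb', from_duckdb), ('replay', from_replay), ('latest', from_latest)]
source, candles = fetch_with_fallback(sources, 'ETH/USDT', '1m', '2025-10-27 14:00:00')
print(source, len(candles))
```

Keeping the sources in a plain ordered list makes the priority explicit and lets a test swap in stubs without touching the fetch logic.
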
### 3. Multi-Symbol Support ✅
- Primary symbol: All timeframes
- Secondary symbol: 1m only (for correlation)
- Automatic symbol pairing

### 4. Time Window Calculation ✅
```python
from datetime import timedelta

# Span covered by 600 candles at each timeframe
time_windows = {
    '1s': timedelta(seconds=600),  # 10 minutes
    '1m': timedelta(minutes=600),  # 10 hours
    '1h': timedelta(hours=600),    # 25 days
    '1d': timedelta(days=600),     # ~1.6 years
}
```

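
To show how these spans translate into a query range, the sketch below derives the start and end of each window from an annotation timestamp. The names `TIMEFRAME_SECONDS` and `window_bounds` are illustrative, and the sketch assumes the window ends at the annotation timestamp; the actual adapter may align the window differently.

```python
from datetime import datetime, timedelta

# Seconds per candle for each supported timeframe
TIMEFRAME_SECONDS = {'1s': 1, '1m': 60, '1h': 3600, '1d': 86400}


def window_bounds(annotation_ts, timeframe, candles=600):
    """Return (start, end) datetimes for a window that ends at the annotation timestamp."""
    end = datetime.strptime(annotation_ts, '%Y-%m-%d %H:%M:%S')
    start = end - timedelta(seconds=TIMEFRAME_SECONDS[timeframe] * candles)
    return start, end


for tf in TIMEFRAME_SECONDS:
    start, end = window_bounds('2025-10-27 14:00:00', tf)
    print(f'{tf}: {start} -> {end}')
```
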
---

## Example Training Log

```
Fetching HISTORICAL market state for ETH/USDT at 2025-10-27 14:00:00
Primary symbol: ETH/USDT - Timeframes: ['1s', '1m', '1h', '1d']
Secondary symbol: BTC/USDT - Timeframe: 1m
Candles per batch: 600

Fetching primary symbol data: ETH/USDT
ETH/USDT 1s: 600 candles
ETH/USDT 1m: 600 candles
ETH/USDT 1h: 600 candles
ETH/USDT 1d: 600 candles

Fetching secondary symbol data: BTC/USDT (1m)
BTC/USDT 1m: 600 candles

✓ Fetched 4 primary timeframes (2400 total candles)
✓ Fetched 1 secondary timeframes (600 total candles)

Test case 1: ENTRY sample - LONG @ 2500.0
Test case 1: Added 30 HOLD samples (during position)
Test case 1: Added 30 NO_TRADE samples (±15 candles)
→ 15 before signal, 15 after signal
```

---

## Memory & Storage

### Per Annotation
- **Values**: 18,000 (3,000 candles × 6 fields: timestamp, open, high, low, close, volume)
- **Memory**: ~144 KB (18,000 values × 8 bytes as float64)
- **Disk**: Minimal (metadata only, data fetched from DuckDB)

### 100 Annotations
- **Memory**: ~14.4 MB
- **Training batches**: ~12,250 (with repetitions)

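
The memory figures above follow from simple arithmetic, reproduced in the sketch below. It assumes every field, including the timestamp, is stored as an 8-byte value, uses decimal units (1 KB = 1,000 bytes), and ignores container overhead; the constant names are illustrative.

```python
CANDLES_PER_BATCH = 600
BATCHES = 5          # 4 primary timeframes + 1 secondary timeframe
FIELDS = 6           # timestamp, open, high, low, close, volume
BYTES_PER_VALUE = 8  # float64

values = CANDLES_PER_BATCH * BATCHES * FIELDS
bytes_per_annotation = values * BYTES_PER_VALUE

print(values)                                   # values per annotation
print(bytes_per_annotation / 1_000)             # KB per annotation
print(100 * bytes_per_annotation / 1_000_000)   # MB for 100 annotations
```
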
---

## Integration Points

### 1. Annotation Manager
```python
# Saves lightweight metadata only
test_case = {
    'symbol': 'ETH/USDT',
    'timestamp': '2025-10-27 14:00',
    'training_config': {
        'timeframes': ['1s', '1m', '1h', '1d'],
        'candles_per_timeframe': 600
    }
}
```

### 2. Real Training Adapter
```python
# Fetches full OHLCV data dynamically
market_state = _fetch_market_state_for_test_case(test_case)
# Returns 3,000 candles (5 batches × 600)
```

### 3. Model Training
```python
# Converts to model input format
batch = _convert_annotation_to_transformer_batch(training_sample)
# Uses all 3,000 candles for context
```

---

## Configuration

### Default Settings
```python
candles_per_timeframe = 600
timeframes = ['1s', '1m', '1h', '1d']
```

### Adjustable
```python
# Reduce for faster training
candles_per_timeframe = 300

# Increase for more context
candles_per_timeframe = 1000

# Limit timeframes
timeframes = ['1m', '1h']
```

---

## Validation

### Data Quality Checks
- ✅ Minimum 500 candles per batch (83% threshold)
- ✅ Continuous timestamps (no large gaps)
- ✅ Valid OHLCV values (no NaN/Inf)
- ✅ Secondary symbol data available

### Warning Conditions
```python
if len(candles) < 500:
    logger.warning("Insufficient data")

if len(candles) < 300:
    logger.error("Critical: skipping batch")
```

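
The quality checks listed above could be combined into a single helper along the following lines. This is a sketch under stated assumptions: `validate_batch` is a hypothetical name, timestamps are taken to be epoch seconds, and a "large gap" is assumed to mean more than twice the expected candle spacing, since the document does not define the threshold.

```python
import math


def validate_batch(candles, expected_step, min_candles=500, max_gap_factor=2):
    """Return a list of problems found in one batch (an empty list means it passed).

    candles: dict with 'timestamps' (epoch seconds, assumption) and OHLCV lists.
    expected_step: seconds between consecutive candles, e.g. 60 for 1m.
    """
    problems = []
    ts = candles['timestamps']

    # Minimum candle count (500 of 600 by default)
    if len(ts) < min_candles:
        problems.append(f'only {len(ts)} candles (minimum {min_candles})')

    # Continuity: report the first gap larger than the assumed threshold
    for prev, cur in zip(ts, ts[1:]):
        if cur - prev > expected_step * max_gap_factor:
            problems.append(f'gap of {cur - prev}s after timestamp {prev}')
            break

    # No NaN or Inf in any OHLCV field
    for field in ('open', 'high', 'low', 'close', 'volume'):
        if any(not math.isfinite(v) for v in candles[field]):
            problems.append(f'non-finite value in {field}')

    return problems


fields = ('open', 'high', 'low', 'close', 'volume')
good = {'timestamps': [i * 60 for i in range(600)], **{f: [1.0] * 600 for f in fields}}
bad = {'timestamps': [0, 60, 600], **{f: [1.0] * 3 for f in fields}}
bad['open'][1] = float('nan')

print(validate_batch(good, expected_step=60))
print(validate_batch(bad, expected_step=60))
```
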
---

## Files Modified

1. **ANNOTATE/core/real_training_adapter.py**
   - Added `_get_secondary_symbol()` method
   - Updated `_fetch_market_state_for_test_case()` to fetch 5 batches
   - Set the candle count to 600 per batch
   - Added secondary symbol fetching

---

## Documentation Created

1. **ANNOTATE/DATA_STRUCTURE_SPECIFICATION.md**
   - Complete data structure specification
   - Symbol pairing rules
   - Time window calculations
   - Integration guide

2. **ANNOTATE/CONTINUOUS_DATA_TRAINING_STRATEGY.md**
   - Training strategy explanation
   - Negative sampling details
   - Sample distribution

3. **ANNOTATE/DATA_LOADING_ARCHITECTURE.md**
   - Storage architecture
   - Dynamic loading strategy
   - Troubleshooting guide

---

## Summary

- ✅ **5 batches** of 600 candles each
- ✅ **Primary symbol**: 4 timeframes (1s, 1m, 1h, 1d)
- ✅ **Secondary symbol**: 1 timeframe (1m) - BTC or ETH
- ✅ **3,000 total candles** per annotation
- ✅ **Historical data** from DuckDB at annotation timestamp
- ✅ **Automatic symbol pairing** (ETH→BTC, BTC→ETH)
- ✅ **Fallback strategy** for missing data
- ✅ **144 KB memory** per annotation
- ✅ **Continuous training** with negative sampling

The system now properly fetches and structures data according to the BaseDataInput specification!