popov/gogo2

Dobromir Popov 6ac324289c fetching data from the DB to train

2025-10-31 03:14:35 +02:00

Final Data Structure Implementation Summary

What Was Implemented

✅ 5 Batches of 600 Candles Each

Primary Symbol (e.g., ETH/USDT):

1s timeframe: 600 candles (10 minutes of data)
1m timeframe: 600 candles (10 hours of data)
1h timeframe: 600 candles (25 days of data)
1d timeframe: 600 candles (~1.6 years of data)

Secondary Symbol (BTC/USDT or ETH/USDT):

1m timeframe: 600 candles (10 hours of data)

Total: 3,000 candles per annotation

Symbol Pairing Logic

def _get_secondary_symbol(primary_symbol):
    """
    ETH/USDT → BTC/USDT
    SOL/USDT → BTC/USDT
    BTC/USDT → ETH/USDT
    """
    if 'BTC' in primary_symbol:
        return 'ETH/USDT'
    else:
        return 'BTC/USDT'
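
The substring test `'BTC' in primary_symbol` would also match a pair that is merely quoted in BTC, such as `ETH/BTC`. Below is a minimal sketch of a stricter variant that looks only at the base asset, assuming symbols follow the `BASE/QUOTE` form shown above. The name `get_secondary_symbol_strict` is hypothetical and not part of the repository.

```python
def get_secondary_symbol_strict(primary_symbol: str) -> str:
    """Pair BTC with ETH/USDT and every other base asset with BTC/USDT."""
    # Hypothetical helper: split 'BASE/QUOTE' and inspect only the base asset.
    base = primary_symbol.split('/')[0].upper()
    return 'ETH/USDT' if base == 'BTC' else 'BTC/USDT'


for symbol in ('ETH/USDT', 'SOL/USDT', 'BTC/USDT'):
    print(symbol, '->', get_secondary_symbol_strict(symbol))
```

For the three pairs listed in the docstring above, this behaves the same as the original function.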

Data Structure

market_state = {
    'symbol': 'ETH/USDT',
    'timestamp': '2025-10-27 14:00:00',
    
    # Primary symbol: 4 timeframes × 600 candles
    'timeframes': {
        '1s': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1m': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1h': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1d': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]}
    },
    
    'secondary_symbol': 'BTC/USDT',
    
    # Secondary symbol: 1 timeframe × 600 candles
    'secondary_timeframes': {
        '1m': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]}
    }
}
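
To check that a `market_state` of this shape really holds 3,000 candles, one can sum the series lengths. The sketch below uses synthetic placeholder values; `count_candles` and `make_series` are hypothetical helpers, not functions from the repository.

```python
def count_candles(market_state: dict) -> int:
    """Sum the number of candles across primary and secondary timeframes."""
    total = 0
    for key in ('timeframes', 'secondary_timeframes'):
        for series in market_state.get(key, {}).values():
            total += len(series['close'])
    return total


def make_series(n: int) -> dict:
    """Build a placeholder series of n candles (synthetic zeros, not market data)."""
    fields = ('timestamps', 'open', 'high', 'low', 'close', 'volume')
    return {field: [0.0] * n for field in fields}


example_state = {
    'symbol': 'ETH/USDT',
    'timeframes': {tf: make_series(600) for tf in ('1s', '1m', '1h', '1d')},
    'secondary_symbol': 'BTC/USDT',
    'secondary_timeframes': {'1m': make_series(600)},
}
print(count_candles(example_state))  # 3000
```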

Key Features

1. Fixed Candle Count ✅

Fetches 600 candles per batch by default
Configurable via the candles_per_timeframe parameter
Consistent data structure for all models

2. Historical Data Fetching ✅

Fetches data as of the annotation timestamp, not the current time
Uses DuckDB for historical queries
Falls back to replay and latest data when historical data is missing
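
The fallback behaviour can be pictured as a chain of data sources tried in order. The sketch below is illustrative only: the order (DuckDB, then replay, then latest) is an assumption read from the list above, and `fetch_with_fallback` and the three stand-in fetchers are hypothetical names that do not call any real database.

```python
from typing import Callable, List, Optional, Sequence

# A fetcher takes (symbol, timeframe, limit) and returns candles or None.
Fetcher = Callable[[str, str, int], Optional[List[dict]]]


def fetch_with_fallback(symbol: str, timeframe: str, limit: int,
                        fetchers: Sequence[Fetcher]) -> Optional[List[dict]]:
    """Return candles from the first fetcher that yields a non-empty result."""
    for fetch in fetchers:
        candles = fetch(symbol, timeframe, limit)
        if candles:
            return candles
    return None


# Hypothetical stand-ins for the three sources, for illustration only.
def from_duckdb(symbol, timeframe, limit):
    return None  # pretend the historical query found nothing


def from_replay(symbol, timeframe, limit):
    return [{'close': 2500.0} for _ in range(limit)]


def from_latest(symbol, timeframe, limit):
    return [{'close': 2501.0} for _ in range(limit)]


candles = fetch_with_fallback('ETH/USDT', '1m', 600,
                              [from_duckdb, from_replay, from_latest])
print(len(candles))  # 600
```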

3. Multi-Symbol Support ✅

Primary symbol: All timeframes
Secondary symbol: 1m only (for correlation)
Automatic symbol pairing

4. Time Window Calculation ✅

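from datetime import timedelta

# Time span covered by 600 candles at each timeframe
time_windows = {
    '1s': timedelta(seconds=600),  # 10 minutes
    '1m': timedelta(minutes=600),  # 10 hours
    '1h': timedelta(hours=600),    # 25 days
    '1d': timedelta(days=600),     # about 1.6 years
}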
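
Because data is fetched as of the annotation timestamp, each window ends at that timestamp and starts one window length earlier. A minimal sketch of that arithmetic follows; `TIMEFRAME_SECONDS` and `window_start` are hypothetical names, and the real adapter may compute its query bounds differently.

```python
from datetime import datetime, timedelta

# Seconds per candle for each timeframe (hypothetical lookup table).
TIMEFRAME_SECONDS = {'1s': 1, '1m': 60, '1h': 3600, '1d': 86400}


def window_start(annotation_time: datetime, timeframe: str,
                 candles: int = 600) -> datetime:
    """Return the start of the window that ends at annotation_time."""
    span = timedelta(seconds=TIMEFRAME_SECONDS[timeframe] * candles)
    return annotation_time - span


annotation_time = datetime(2025, 10, 27, 14, 0, 0)
for tf in TIMEFRAME_SECONDS:
    print(tf, window_start(annotation_time, tf))
```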

Example Training Log

Fetching HISTORICAL market state for ETH/USDT at 2025-10-27 14:00:00
   Primary symbol: ETH/USDT - Timeframes: ['1s', '1m', '1h', '1d']
   Secondary symbol: BTC/USDT - Timeframe: 1m
   Candles per batch: 600

   Fetching primary symbol data: ETH/USDT
       ETH/USDT 1s: 600 candles
       ETH/USDT 1m: 600 candles
       ETH/USDT 1h: 600 candles
       ETH/USDT 1d: 600 candles

   Fetching secondary symbol data: BTC/USDT (1m)
       BTC/USDT 1m: 600 candles

    ✓ Fetched 4 primary timeframes (2400 total candles)
    ✓ Fetched 1 secondary timeframes (600 total candles)

   Test case 1: ENTRY sample - LONG @ 2500.0
   Test case 1: Added 30 HOLD samples (during position)
   Test case 1: Added 30 NO_TRADE samples (±15 candles)
      → 15 before signal, 15 after signal
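
The "±15 candles" line in the log corresponds to choosing 15 candle positions before the signal and 15 after it. The sketch below shows one way to pick those indices; `no_trade_indices` is a hypothetical helper, and the repository's actual sampling logic may differ.

```python
def no_trade_indices(signal_index: int, total_candles: int,
                     window: int = 15) -> list:
    """Return up to `window` candle indices on each side of the signal."""
    # Clamp both ranges so they never run off either end of the series.
    before = range(max(0, signal_index - window), signal_index)
    after = range(signal_index + 1,
                  min(total_candles, signal_index + window + 1))
    return list(before) + list(after)


indices = no_trade_indices(signal_index=300, total_candles=600)
print(len(indices))  # 30
```

Near the edges of the series the function returns fewer than 30 indices, since there are not enough candles on one side.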

Memory & Storage

Per Annotation

Values: 18,000 (3,000 candles × 6 fields: timestamp plus OHLCV)
Memory: ~144 KB (18,000 values × 8 bytes as float64)
Disk: Minimal (metadata only, data fetched from DuckDB)

100 Annotations

Memory: ~14.4 MB
Training batches: ~12,250 (with repetitions)
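
The memory figures above follow from simple arithmetic, sketched here under the assumption that every value, including timestamps, is stored as an 8-byte float64. The function name `annotation_memory_bytes` is hypothetical.

```python
def annotation_memory_bytes(timeframes: int = 4, secondary_timeframes: int = 1,
                            candles: int = 600, fields: int = 6,
                            bytes_per_value: int = 8) -> int:
    """Estimate the raw size of one annotation's candle data in bytes."""
    total_candles = (timeframes + secondary_timeframes) * candles
    return total_candles * fields * bytes_per_value


per_annotation = annotation_memory_bytes()
print(per_annotation)        # 144000 bytes, about 144 KB
print(per_annotation * 100)  # 14400000 bytes, about 14.4 MB
```

This counts only the numeric payload; Python list and dict overhead would add to the real footprint.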

Integration Points

1. Annotation Manager

# Saves lightweight metadata only
test_case = {
    'symbol': 'ETH/USDT',
    'timestamp': '2025-10-27 14:00',
    'training_config': {
        'timeframes': ['1s', '1m', '1h', '1d'],
        'candles_per_timeframe': 600
    }
}

2. Real Training Adapter

# Fetches full OHLCV data dynamically
market_state = _fetch_market_state_for_test_case(test_case)
# Returns 3,000 candles (5 batches × 600)

3. Model Training

# Converts to model input format
batch = _convert_annotation_to_transformer_batch(training_sample)
# Uses all 3,000 candles for context

Configuration

Default Settings

candles_per_timeframe = 600
timeframes = ['1s', '1m', '1h', '1d']

Adjustable

# Reduce for faster training
candles_per_timeframe = 300

# Increase for more context
candles_per_timeframe = 1000

# Limit timeframes
timeframes = ['1m', '1h']

Validation

Data Quality Checks

✅ Minimum 500 candles per batch (83% threshold)
✅ Continuous timestamps (no large gaps)
✅ Valid OHLCV values (no NaN/Inf)
✅ Secondary symbol data available

Warning Conditions

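# Below the 500-candle threshold: log a warning
if len(candles) < 500:
    logger.warning("Insufficient data")

# Below 300 candles: log an error and skip the batch
if len(candles) < 300:
    logger.error("Critical: skipping batch")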
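
The data quality checks listed above can be combined into a single validation pass. The following is a sketch under stated assumptions: timestamps are epoch seconds in ascending order, and a "large gap" means more than `max_gap_intervals` candle intervals, a threshold chosen here for illustration because the document does not quantify it. `validate_series` is a hypothetical name.

```python
import math


def validate_series(series: dict, interval_seconds: float,
                    min_candles: int = 500,
                    max_gap_intervals: int = 5) -> list:
    """Return a list of problems; an empty list means the series passed."""
    problems = []
    n = len(series['close'])
    if n < min_candles:
        problems.append(f'only {n} candles, expected at least {min_candles}')
    # Every OHLCV value must be a finite number (no NaN or Inf).
    for field in ('open', 'high', 'low', 'close', 'volume'):
        if any(not math.isfinite(v) for v in series[field]):
            problems.append(f'non-finite value in {field}')
    # Assumed: timestamps are epoch seconds, sorted ascending.
    ts = series['timestamps']
    limit = interval_seconds * max_gap_intervals
    if any(later - earlier > limit for earlier, later in zip(ts, ts[1:])):
        problems.append('timestamp gap exceeds limit')
    return problems


# Synthetic 1m series: 600 candles spaced 60 seconds apart.
good = {f: [1.0] * 600 for f in ('open', 'high', 'low', 'close', 'volume')}
good['timestamps'] = [60 * i for i in range(600)]
bad = dict(good, close=[float('nan')] * 600)

print(validate_series(good, interval_seconds=60))  # []
print(validate_series(bad, interval_seconds=60))
```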

Files Modified

ANNOTATE/core/real_training_adapter.py
- Added _get_secondary_symbol() method
- Updated _fetch_market_state_for_test_case() to fetch 5 batches
- Set a fixed candle count of 600 per batch
- Added secondary symbol fetching

Documentation Created

ANNOTATE/DATA_STRUCTURE_SPECIFICATION.md
- Complete data structure specification
- Symbol pairing rules
- Time window calculations
- Integration guide

ANNOTATE/CONTINUOUS_DATA_TRAINING_STRATEGY.md
- Training strategy explanation
- Negative sampling details
- Sample distribution

ANNOTATE/DATA_LOADING_ARCHITECTURE.md
- Storage architecture
- Dynamic loading strategy
- Troubleshooting guide

Summary

✅ 5 batches of 600 candles each
✅ Primary symbol: 4 timeframes (1s, 1m, 1h, 1d)
✅ Secondary symbol: 1 timeframe (1m) - BTC or ETH
✅ 3,000 total candles per annotation
✅ Historical data from DuckDB at annotation timestamp
✅ Automatic symbol pairing (ETH→BTC, BTC→ETH)
✅ Fallback strategy for missing data
✅ ~144 KB memory per annotation
✅ Continuous training with negative sampling

The system now properly fetches and structures data according to the BaseDataInput specification!
