gogo2/ANNOTATE/FINAL_DATA_STRUCTURE_SUMMARY.md
2025-10-31 03:14:35 +02:00

Final Data Structure Implementation Summary

What Was Implemented

5 Batches of 600 Candles Each

Primary Symbol (e.g., ETH/USDT):

  • 1s timeframe: 600 candles (10 minutes of data)
  • 1m timeframe: 600 candles (10 hours of data)
  • 1h timeframe: 600 candles (25 days of data)
  • 1d timeframe: 600 candles (~1.6 years of data)

Secondary Symbol (BTC/USDT or ETH/USDT):

  • 1m timeframe: 600 candles (10 hours of data)

Total: 3,000 candles per annotation


Symbol Pairing Logic

def _get_secondary_symbol(primary_symbol):
    """
    ETH/USDT → BTC/USDT
    SOL/USDT → BTC/USDT
    BTC/USDT → ETH/USDT
    """
    if 'BTC' in primary_symbol:
        return 'ETH/USDT'
    else:
        return 'BTC/USDT'
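One caveat with the substring test above: `'BTC' in primary_symbol` is also true for any symbol that merely contains "BTC" (a wrapped token such as WBTC/USDT, for instance). A minimal sketch of a stricter variant follows, assuming symbols always use the BASE/QUOTE form. The name `get_secondary_symbol_strict` is hypothetical and does not exist in the codebase.

```python
def get_secondary_symbol_strict(primary_symbol: str) -> str:
    """Pair BTC with ETH, and every other base asset with BTC.

    Hypothetical helper: compares the base asset exactly instead of
    searching for the substring 'BTC' anywhere in the symbol.
    """
    base = primary_symbol.split('/', 1)[0].upper()
    return 'ETH/USDT' if base == 'BTC' else 'BTC/USDT'


for symbol in ('ETH/USDT', 'SOL/USDT', 'BTC/USDT'):
    print(symbol, '->', get_secondary_symbol_strict(symbol))
```

For the three pairs listed in the docstring, this behaves the same as the original function.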

Data Structure

market_state = {
    'symbol': 'ETH/USDT',
    'timestamp': '2025-10-27 14:00:00',
    
    # Primary symbol: 4 timeframes × 600 candles
    'timeframes': {
        '1s': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1m': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1h': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]},
        '1d': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]}
    },
    
    'secondary_symbol': 'BTC/USDT',
    
    # Secondary symbol: 1 timeframe × 600 candles
    'secondary_timeframes': {
        '1m': {'timestamps': [...], 'open': [...], 'high': [...], 'low': [...], 'close': [...], 'volume': [...]}
    }
}
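To make the shape above concrete, the following sketch builds a placeholder `market_state` and counts its candles. It assumes every field list within a timeframe has the same length. `OHLCV_FIELDS`, `make_series`, and `count_candles` are illustrative names, not part of the adapter.

```python
OHLCV_FIELDS = ('timestamps', 'open', 'high', 'low', 'close', 'volume')


def make_series(n: int) -> dict:
    """Build a placeholder series with n zero-valued candles."""
    return {field: [0.0] * n for field in OHLCV_FIELDS}


def count_candles(market_state: dict) -> int:
    """Count candles across primary and secondary timeframes.

    Raises ValueError if the field lists of a timeframe differ in length.
    """
    total = 0
    for group in ('timeframes', 'secondary_timeframes'):
        for tf, series in market_state.get(group, {}).items():
            lengths = {len(series[field]) for field in OHLCV_FIELDS}
            if len(lengths) != 1:
                raise ValueError(f"{group}[{tf!r}] has misaligned fields")
            total += lengths.pop()
    return total


example_state = {
    'symbol': 'ETH/USDT',
    'timeframes': {tf: make_series(600) for tf in ('1s', '1m', '1h', '1d')},
    'secondary_symbol': 'BTC/USDT',
    'secondary_timeframes': {'1m': make_series(600)},
}
print(count_candles(example_state))  # 3000
```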

Key Features

1. Fixed Candle Count

  • Fetches 600 candles per batch by default
  • Count is configurable via the candles_per_timeframe parameter
  • Consistent data structure for all models

2. Historical Data Fetching

  • Fetches data as of the annotation timestamp, not the current time
  • Uses DuckDB for historical queries
  • Falls back to replay data and latest data when historical data is missing
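The fallback behaviour can be sketched as an ordered chain of data sources. This is only an illustration: the order (DuckDB, then replay, then latest) is an assumption, and `fetch_with_fallback` and the three stub fetchers are hypothetical stand-ins for the adapter's real lookups.

```python
def fetch_with_fallback(sources, symbol, timeframe, limit=600):
    """Try each (name, fetch) pair in order; return the first non-empty result."""
    for name, fetch in sources:
        candles = fetch(symbol, timeframe, limit)
        if candles:
            return name, candles
    return None, []


# Stub fetchers standing in for DuckDB, replay, and latest-data lookups.
def from_duckdb(symbol, timeframe, limit):
    return []  # pretend the historical query found nothing


def from_replay(symbol, timeframe, limit):
    return [{'close': 2500.0}] * limit


def from_latest(symbol, timeframe, limit):
    return [{'close': 2510.0}] * limit


sources = [('duckdb', from_duckdb), ('replay', from_replay), ('latest', from_latest)]
source, candles = fetch_with_fallback(sources, 'ETH/USDT', '1m')
print(source, len(candles))  # replay 600
```

Returning the source name alongside the candles lets the caller log which fallback was used.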

3. Multi-Symbol Support

  • Primary symbol: All timeframes
  • Secondary symbol: 1m only (for correlation)
  • Automatic symbol pairing

4. Time Window Calculation

# Window length in seconds covered by 600 candles
time_windows = {
    '1s': 600 * 1,       # 600 seconds = 10 minutes
    '1m': 600 * 60,      # 600 minutes = 10 hours
    '1h': 600 * 3600,    # 600 hours = 25 days
    '1d': 600 * 86400,   # 600 days ≈ 1.6 years
}
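The same arithmetic can be turned into concrete start and end times. The sketch below assumes each window ends at the annotation timestamp, which this summary does not state explicitly. `TIMEFRAME_SECONDS` and `window_for` are illustrative names.

```python
from datetime import datetime, timedelta

TIMEFRAME_SECONDS = {'1s': 1, '1m': 60, '1h': 3600, '1d': 86400}


def window_for(timeframe: str, end: datetime, candles: int = 600):
    """Return (start, end) of the span covered by `candles` candles ending at `end`."""
    span = timedelta(seconds=TIMEFRAME_SECONDS[timeframe] * candles)
    return end - span, end


annotation_time = datetime(2025, 10, 27, 14, 0, 0)
for tf in TIMEFRAME_SECONDS:
    start, end = window_for(tf, annotation_time)
    print(f"{tf}: {start} -> {end}")
```

For the 1m timeframe this gives a window starting at 2025-10-27 04:00:00, ten hours before the annotation.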

Example Training Log

Fetching HISTORICAL market state for ETH/USDT at 2025-10-27 14:00:00
   Primary symbol: ETH/USDT - Timeframes: ['1s', '1m', '1h', '1d']
   Secondary symbol: BTC/USDT - Timeframe: 1m
   Candles per batch: 600

   Fetching primary symbol data: ETH/USDT
       ETH/USDT 1s: 600 candles
       ETH/USDT 1m: 600 candles
       ETH/USDT 1h: 600 candles
       ETH/USDT 1d: 600 candles

   Fetching secondary symbol data: BTC/USDT (1m)
       BTC/USDT 1m: 600 candles

    ✓ Fetched 4 primary timeframes (2400 total candles)
    ✓ Fetched 1 secondary timeframes (600 total candles)

   Test case 1: ENTRY sample - LONG @ 2500.0
   Test case 1: Added 30 HOLD samples (during position)
   Test case 1: Added 30 NO_TRADE samples (±15 candles)
      → 15 before signal, 15 after signal
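The NO_TRADE lines in the log describe 15 candles on each side of the signal. One way to pick those positions is sketched below. It is not the adapter's actual sampling code: `no_trade_indices` is a hypothetical helper, and the clipping at the series edges is an assumption.

```python
def no_trade_indices(signal_index: int, total_candles: int, window: int = 15):
    """Return candle indices within ±window of the signal, excluding the signal itself.

    Indices that would fall outside [0, total_candles) are dropped.
    """
    before = range(signal_index - window, signal_index)
    after = range(signal_index + 1, signal_index + window + 1)
    return [i for i in (*before, *after) if 0 <= i < total_candles]


indices = no_trade_indices(signal_index=300, total_candles=600)
print(len(indices))              # 30
print(indices[0], indices[-1])   # 285 315
```

A signal near either end of the series yields fewer than 30 samples, because out-of-range indices are dropped.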

Memory & Storage

Per Annotation

  • Values: 18,000 (3,000 candles × 6 fields: timestamp plus OHLCV)
  • Memory: ~144 KB (18,000 values × 8 bytes as float64)
  • Disk: Minimal (metadata only; candle data is fetched from DuckDB)

100 Annotations

  • Memory: ~14.4 MB
  • Training batches: ~12,250 (with repetitions)
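The memory figures follow from simple arithmetic, reproduced here as a check. It assumes the values are held as packed float64 arrays. Plain Python lists of floats would use considerably more memory.

```python
BATCHES = 5              # 4 primary timeframes + 1 secondary
CANDLES_PER_BATCH = 600
FIELDS_PER_CANDLE = 6    # timestamp, open, high, low, close, volume
BYTES_PER_FLOAT64 = 8

values = BATCHES * CANDLES_PER_BATCH * FIELDS_PER_CANDLE
bytes_per_annotation = values * BYTES_PER_FLOAT64

print(values)                                   # 18000
print(bytes_per_annotation / 1_000)             # 144.0 (KB)
print(100 * bytes_per_annotation / 1_000_000)   # 14.4 (MB for 100 annotations)
```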

Integration Points

1. Annotation Manager

# Saves lightweight metadata only
test_case = {
    'symbol': 'ETH/USDT',
    'timestamp': '2025-10-27 14:00',
    'training_config': {
        'timeframes': ['1s', '1m', '1h', '1d'],
        'candles_per_timeframe': 600
    }
}

2. Real Training Adapter

# Fetches full OHLCV data dynamically
market_state = _fetch_market_state_for_test_case(test_case)
# Returns 3,000 candles (5 batches × 600)

3. Model Training

# Converts to model input format
batch = _convert_annotation_to_transformer_batch(training_sample)
# Uses all 3,000 candles for context

Configuration

Default Settings

candles_per_timeframe = 600
timeframes = ['1s', '1m', '1h', '1d']

Adjustable

# Reduce for faster training
candles_per_timeframe = 300

# Increase for more context
candles_per_timeframe = 1000

# Limit timeframes
timeframes = ['1m', '1h']
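To see how these settings change the amount of data per annotation, here is a small sketch. It assumes the secondary symbol always contributes one 1m batch regardless of the primary timeframes chosen. `TrainingConfig` is a hypothetical container, not a class from the codebase.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingConfig:
    """Hypothetical container for the two settings described above."""
    candles_per_timeframe: int = 600
    timeframes: list = field(default_factory=lambda: ['1s', '1m', '1h', '1d'])
    secondary_timeframes: list = field(default_factory=lambda: ['1m'])

    def total_candles(self) -> int:
        """Candles fetched per annotation: one batch per primary and secondary timeframe."""
        batches = len(self.timeframes) + len(self.secondary_timeframes)
        return batches * self.candles_per_timeframe


print(TrainingConfig().total_candles())                           # 3000
print(TrainingConfig(candles_per_timeframe=300).total_candles())  # 1500
print(TrainingConfig(timeframes=['1m', '1h']).total_candles())    # 1800
```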

Validation

Data Quality Checks

  • Minimum 500 candles per batch (83% threshold)
  • Continuous timestamps (no large gaps)
  • Valid OHLCV values (no NaN/Inf)
  • Secondary symbol data available

Warning Conditions

# Check the stricter threshold first so each batch logs only one message
if len(candles) < 300:
    logger.error("Critical: skipping batch")
elif len(candles) < 500:
    logger.warning("Insufficient data")
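The data quality checks listed above could be combined into a single function, sketched below under some assumptions: timestamps are epoch seconds, and a "large gap" means more than `max_gap_steps` candle intervals. That threshold is not defined in this summary, and `validate_batch` is a hypothetical helper.

```python
import math

OHLCV = ('open', 'high', 'low', 'close', 'volume')


def validate_batch(series, step_seconds, min_candles=500, max_gap_steps=3):
    """Return a list of problems found in one batch; an empty list means it passed."""
    problems = []
    ts = series['timestamps']
    if len(ts) < min_candles:
        problems.append(f"only {len(ts)} candles (< {min_candles})")
    # Report the first gap larger than the allowed number of candle intervals.
    for prev, cur in zip(ts, ts[1:]):
        if cur - prev > max_gap_steps * step_seconds:
            problems.append(f"gap of {cur - prev}s after t={prev}")
            break
    # Reject NaN and +/-Inf in any price or volume field.
    for name in OHLCV:
        if any(not math.isfinite(v) for v in series[name]):
            problems.append(f"non-finite value in {name!r}")
    return problems


timestamps = [i * 60 for i in range(600)]
good = {'timestamps': timestamps, **{name: [1.0] * 600 for name in OHLCV}}
print(validate_batch(good, step_seconds=60))  # []
```

Returning a list of problems rather than a boolean makes it easier to log exactly why a batch was rejected.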

Files Modified

  1. ANNOTATE/core/real_training_adapter.py
    • Added _get_secondary_symbol() method
    • Updated _fetch_market_state_for_test_case() to fetch 5 batches
    • Set the candle count to 600 per batch
    • Added secondary symbol fetching

Documentation Created

  1. ANNOTATE/DATA_STRUCTURE_SPECIFICATION.md
    • Complete data structure specification
    • Symbol pairing rules
    • Time window calculations
    • Integration guide
  2. ANNOTATE/CONTINUOUS_DATA_TRAINING_STRATEGY.md
    • Training strategy explanation
    • Negative sampling details
    • Sample distribution
  3. ANNOTATE/DATA_LOADING_ARCHITECTURE.md
    • Storage architecture
    • Dynamic loading strategy
    • Troubleshooting guide

Summary

  • 5 batches of 600 candles each
  • Primary symbol: 4 timeframes (1s, 1m, 1h, 1d)
  • Secondary symbol: 1 timeframe (1m) - BTC or ETH
  • 3,000 total candles per annotation
  • Historical data from DuckDB at annotation timestamp
  • Automatic symbol pairing (ETH→BTC, BTC→ETH)
  • Fallback strategy for missing data
  • 144 KB memory per annotation
  • Continuous training with negative sampling

The system now properly fetches and structures data according to the BaseDataInput specification!