gogo2/ANNOTATE/TRAINING_DATA_FORMAT.md
2025-10-18 23:44:02 +03:00

ANNOTATE - Training Data Format

🎯 Overview

The ANNOTATE system generates training data that includes ±5 minutes of market data around each trade signal. This allows models to learn:

  • WHERE to generate signals (at entry/exit points)
  • WHERE NOT to generate signals (before entry, after exit)
  • Context around the signal (what led to the trade)

📦 Test Case Structure

Complete Format

{
  "test_case_id": "annotation_uuid",
  "symbol": "ETH/USDT",
  "timestamp": "2024-01-15T10:30:00Z",
  "action": "BUY",
  
  "market_state": {
    "ohlcv_1s": {
      "timestamps": [...],  // ±5 minutes of 1s candles (~600 candles)
      "open": [...],
      "high": [...],
      "low": [...],
      "close": [...],
      "volume": [...]
    },
    "ohlcv_1m": {
      "timestamps": [...],  // ±5 minutes of 1m candles (~10 candles)
      "open": [...],
      "high": [...],
      "low": [...],
      "close": [...],
      "volume": [...]
    },
    "ohlcv_1h": {
      "timestamps": [...],  // ±5 minutes of 1h candles (usually 1 candle)
      "open": [...],
      "high": [...],
      "low": [...],
      "close": [...],
      "volume": [...]
    },
    "ohlcv_1d": {
      "timestamps": [...],  // ±5 minutes of 1d candles (usually 1 candle)
      "open": [...],
      "high": [...],
      "low": [...],
      "close": [...],
      "volume": [...]
    },
    
    "training_labels": {
      "labels_1m": [0, 0, 0, 1, 2, 2, 3, 0, 0, 0],  // Label for each 1m candle
      "direction": "LONG",
      "entry_timestamp": "2024-01-15T10:30:00",
      "exit_timestamp": "2024-01-15T10:35:00"
    }
  },
  
  "expected_outcome": {
    "direction": "LONG",
    "profit_loss_pct": 2.5,
    "entry_price": 2400.50,
    "exit_price": 2460.75,
    "holding_period_seconds": 300
  },
  
  "annotation_metadata": {
    "annotator": "manual",
    "confidence": 1.0,
    "notes": "",
    "created_at": "2024-01-15T11:00:00Z",
    "timeframe": "1m"
  }
}
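A structure like the one above can be sanity-checked before training. The sketch below is illustrative only: `validate_test_case` is a hypothetical helper, not part of ANNOTATE, and it assumes that every OHLCV block carries six equal-length arrays and that `labels_1m` aligns one-to-one with the `ohlcv_1m` timestamps.

```python
from typing import Any

OHLCV_FIELDS = ("timestamps", "open", "high", "low", "close", "volume")
TIMEFRAMES = ("1s", "1m", "1h", "1d")


def validate_test_case(tc: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems: list[str] = []
    market_state = tc.get("market_state", {})

    for tf in TIMEFRAMES:
        block = market_state.get(f"ohlcv_{tf}")
        if block is None:
            problems.append(f"missing ohlcv_{tf}")
            continue
        # All six arrays in a block should describe the same candles.
        lengths = {name: len(block.get(name, [])) for name in OHLCV_FIELDS}
        if len(set(lengths.values())) != 1:
            problems.append(f"ohlcv_{tf} arrays differ in length: {lengths}")

    # Assumption: one label per 1m candle.
    labels = market_state.get("training_labels", {}).get("labels_1m", [])
    n_1m = len(market_state.get("ohlcv_1m", {}).get("timestamps", []))
    if len(labels) != n_1m:
        problems.append(f"labels_1m has {len(labels)} entries for {n_1m} 1m candles")

    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every defect in a batch of generated test cases at once.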

🏷️ Training Labels

Label System

Each timestamp in the ±5 minute window is labeled:

| Label | Meaning | Description |
|-------|---------|-------------|
| 0 | NO SIGNAL | Before entry or after exit - model should NOT signal |
| 1 | ENTRY SIGNAL | At entry time - model SHOULD signal BUY/SELL |
| 2 | HOLD | Between entry and exit - model should maintain position |
| 3 | EXIT SIGNAL | At exit time - model SHOULD signal close position |
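In code, the four label values can be given names so that training and verification scripts do not rely on bare integers. This is a minimal sketch; `SignalLabel` and `describe` are hypothetical names chosen for illustration.

```python
from enum import IntEnum


class SignalLabel(IntEnum):
    """Integer labels as stored in labels_1m."""
    NO_SIGNAL = 0
    ENTRY = 1
    HOLD = 2
    EXIT = 3


def describe(labels: list[int]) -> list[str]:
    """Translate raw label integers into readable names."""
    return [SignalLabel(value).name for value in labels]


print(describe([0, 1, 2, 3]))
```

Because `IntEnum` members compare equal to plain integers, existing arrays such as `labels_1m` can be used with it unchanged.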

Example Timeline

Time:    10:25  10:26  10:27  10:28  10:29  10:30  10:31  10:32  10:33  10:34  10:35  10:36  10:37
Label:     0      0      0      0      0      1      2      2      2      2      3      0      0
Action:   NO     NO     NO     NO     NO    ENTRY  HOLD   HOLD   HOLD   HOLD   EXIT    NO     NO

Why This Matters

  • Negative Examples: Model learns NOT to signal at random times
  • Context: Model sees what happens before/after the signal
  • Precision: Model learns exact timing, not just "buy somewhere"

📊 Data Window

Time Window: ±5 Minutes

Entry Time: 10:30:00
Window Start: 10:25:00 (5 minutes before)
Window End: 10:35:00 (5 minutes after)

Candle Counts by Timeframe

| Timeframe | Candles in ±5min | Purpose |
|-----------|------------------|---------|
| 1s | ~600 candles | Micro-structure, order flow |
| 1m | ~10 candles | Short-term patterns |
| 1h | ~1 candle | Trend context |
| 1d | ~1 candle | Market regime |
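The approximate counts in the table follow from dividing the 10-minute window by the candle length. The sketch below reproduces that arithmetic; `approx_candles` and `TIMEFRAME_SECONDS` are illustrative names, not part of the ANNOTATE code.

```python
from datetime import timedelta

# Candle length in seconds for each timeframe in the test case.
TIMEFRAME_SECONDS = {"1s": 1, "1m": 60, "1h": 3600, "1d": 86400}


def approx_candles(timeframe: str,
                   half_window: timedelta = timedelta(minutes=5)) -> int:
    """Approximate number of candles in a window of +/- half_window."""
    window_seconds = 2 * half_window.total_seconds()
    # A window shorter than one candle still overlaps at least one candle.
    return max(1, int(window_seconds // TIMEFRAME_SECONDS[timeframe]))


for tf in TIMEFRAME_SECONDS:
    print(tf, approx_candles(tf))
```

When both window endpoints are included, the actual count can be one higher: the 1m timestamps from 10:25 through 10:35 number 11, not 10, which is why the table gives approximate figures.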

🎓 Training Strategy

Positive Examples (Signal Points)

  • Entry Point (Label 1): Model learns to recognize entry conditions
  • Exit Point (Label 3): Model learns to recognize exit conditions

Negative Examples (Non-Signal Points)

  • Before Entry (Label 0): Model learns NOT to signal too early
  • After Exit (Label 0): Model learns NOT to signal too late
  • During Hold (Label 2): Model learns to maintain position

Balanced Training

For each annotation:

  • 1 entry signal (Label 1)
  • 1 exit signal (Label 3)
  • ~3-5 hold periods (Label 2)
  • ~5-8 no-signal periods (Label 0)

With about 2 signal labels among roughly 10 candles, signal points are the minority class. The model therefore learns:

  • When TO act (~20% of candles)
  • When NOT to act (~80% of candles)
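The label shares can be measured per test case, and the measured imbalance can be offset with class weights if desired. Inverse-frequency weighting is a common general technique; nothing in this document states that ANNOTATE uses it. Both function names below are hypothetical.

```python
from collections import Counter


def label_shares(labels: list[int]) -> dict[int, float]:
    """Fraction of candles carrying each label 0..3."""
    counts = Counter(labels)
    return {lab: counts.get(lab, 0) / len(labels) for lab in range(4)}


def inverse_frequency_weights(labels: list[int]) -> dict[int, float]:
    """Weight each label that occurs by total / (n_classes * count)."""
    counts = Counter(labels)
    n_classes = len(counts)
    return {lab: len(labels) / (n_classes * c) for lab, c in counts.items()}


# Eleven 1m candles: 5 before entry, entry, 4 hold, exit.
labels = [0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 3]
shares = label_shares(labels)
print(f"signal share: {shares[1] + shares[3]:.2f}")
print(inverse_frequency_weights(labels))
```

Such weights could be passed to a weighted cross-entropy loss so that the rare entry and exit labels are not drowned out by the more frequent no-signal and hold labels.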

🔧 Implementation Details

Data Fetching

from datetime import timedelta

# Get ±5 minutes around entry (the entry timestamp must support timedelta arithmetic)
entry_time = annotation.entry['timestamp']
start_time = entry_time - timedelta(minutes=5)
end_time = entry_time + timedelta(minutes=5)

# Fetch data for the timeframe (limit caps the number of candles returned)
df = data_provider.get_historical_data(
    symbol=symbol,
    timeframe=timeframe,
    limit=1000
)

# Filter to window; both endpoints are inclusive
df_window = df[(df.index >= start_time) & (df.index <= end_time)]
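Because `limit=1000` bounds how many candles come back, the fetched data may not reach all the way to `start_time`, depending on how `get_historical_data` selects candles (this document does not specify that). The sketch below shows one way to detect a truncated window, using only the standard library; `slice_window` is a hypothetical helper and assumes the timestamps are sorted in ascending order.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta


def slice_window(timestamps: list[datetime], start: datetime, end: datetime):
    """Return (timestamps inside [start, end], True if the data spans the window)."""
    lo = bisect_left(timestamps, start)    # first index with ts >= start
    hi = bisect_right(timestamps, end)     # one past the last index with ts <= end
    covered = bool(timestamps) and timestamps[0] <= start and timestamps[-1] >= end
    return timestamps[lo:hi], covered


entry = datetime(2024, 1, 15, 10, 30)
start, end = entry - timedelta(minutes=5), entry + timedelta(minutes=5)

# Simulated fetch that only reaches back to 10:27, so the window is cut short.
fetched = [entry + timedelta(minutes=m) for m in range(-3, 20)]
window, covered = slice_window(fetched, start, end)
print(len(window), covered)
```

A test case built from a truncated window would carry fewer pre-entry candles than expected, so flagging `covered == False` before saving is a cheap safeguard.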

Label Generation

labels = []
for timestamp in timestamps:
    if near_entry(timestamp):
        label = 1  # ENTRY SIGNAL
    elif near_exit(timestamp):
        label = 3  # EXIT SIGNAL
    elif between_entry_and_exit(timestamp):
        label = 2  # HOLD
    else:
        label = 0  # NO SIGNAL
    labels.append(label)  # one label per candle timestamp
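The helpers `near_entry`, `near_exit`, and `between_entry_and_exit` are not defined in this document. A runnable sketch under one plausible reading follows: "near" is taken to mean within half a candle of the entry or exit time, and the 30-second `tolerance` is an assumption suited to 1m candles, not a documented ANNOTATE setting.

```python
from datetime import datetime, timedelta


def generate_labels(timestamps, entry_time, exit_time,
                    tolerance=timedelta(seconds=30)):
    """Label each candle timestamp as 0 (no signal), 1 (entry), 2 (hold) or 3 (exit)."""
    labels = []
    for ts in timestamps:
        if abs(ts - entry_time) <= tolerance:
            labels.append(1)   # ENTRY SIGNAL
        elif abs(ts - exit_time) <= tolerance:
            labels.append(3)   # EXIT SIGNAL
        elif entry_time < ts < exit_time:
            labels.append(2)   # HOLD
        else:
            labels.append(0)   # NO SIGNAL
    return labels


entry = datetime(2024, 1, 15, 10, 30)
exit_ = datetime(2024, 1, 15, 10, 35)
# 1m candle timestamps from 10:25 through 10:35 inclusive.
candles = [entry + timedelta(minutes=m) for m in range(-5, 6)]
print(generate_labels(candles, entry, exit_))
# → [0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 3]
```

The order of the checks matters: entry and exit are tested before the hold range so that a candle at the boundary is never labeled HOLD.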

📈 Model Training Usage

CNN Training

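# Input: OHLCV data for ±5 minutes
# Output: Probability distribution over labels [0, 1, 2, 3]
# `optimizer` is assumed to wrap the model's parameters

for timestamp, label in zip(timestamps, labels):
    features = extract_features(ohlcv_data, timestamp)
    optimizer.zero_grad()  # clear gradients left over from the previous step
    prediction = model(features)
    loss = cross_entropy(prediction, label)
    loss.backward()
    optimizer.step()  # apply the update computed by backward()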
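To make the loss in this loop concrete, the following sketch computes softmax cross-entropy over the four labels for a single timestamp, using only the standard library. It assumes the model emits four raw scores (logits), one per label; the actual model output format is not specified in this document.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: list[float], label: int) -> float:
    """Negative log-probability assigned to the true label."""
    return -math.log(softmax(logits)[label])


# A model that strongly favours label 1 (ENTRY):
logits = [0.1, 3.0, 0.2, 0.1]
print(cross_entropy(logits, 1))   # small loss: prediction matches the label
print(cross_entropy(logits, 0))   # large loss: the candle was actually NO SIGNAL
```

The loss is smallest when the probability mass sits on the labeled class, which is what pushes the model toward signaling only at the annotated entry and exit candles.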

DQN Training

# State: Current market state
# Action: BUY/SELL/HOLD
# Reward: Based on label and outcome

for timestamp, label in zip(timestamps, labels):
    state = get_state(ohlcv_data, timestamp)
    action = agent.select_action(state)

    if label == 1:  # Should signal entry (BUY shown for a LONG trade)
        reward = +1 if action == BUY else -1
    elif label == 0:  # Should NOT signal
        reward = +1 if action == HOLD else -1
    else:  # Labels 2 (HOLD) and 3 (EXIT): placeholder, reward not specified here
        reward = 0
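The loop above defines rewards only for labels 1 and 0. One way to extend it to all four labels and to both trade directions is sketched below. The choices made here are assumptions for illustration, not the documented reward scheme: a position is closed with the opposite action, HOLD is the target during labels 0 and 2, and rewards are a flat +1 or -1.

```python
BUY, SELL, HOLD = "BUY", "SELL", "HOLD"


def label_reward(label: int, action: str, direction: str = "LONG") -> int:
    """Return +1 if the action matches what the label calls for, else -1."""
    entry_action = BUY if direction == "LONG" else SELL
    exit_action = SELL if direction == "LONG" else BUY   # assumed: close via opposite action

    if label == 1:        # ENTRY SIGNAL
        target = entry_action
    elif label == 3:      # EXIT SIGNAL
        target = exit_action
    else:                 # 0 (NO SIGNAL) and 2 (HOLD): do nothing
        target = HOLD
    return 1 if action == target else -1


print(label_reward(1, BUY))            # correct LONG entry
print(label_reward(1, BUY, "SHORT"))   # wrong side for a SHORT entry
print(label_reward(2, HOLD))           # holding during the trade
```

A reward tied to `expected_outcome.profit_loss_pct` instead of a flat value would be another option, since the header comment mentions rewards based on "label and outcome".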

🎯 Benefits

1. Precision Training

  • Model learns exact timing of signals
  • Not just "buy somewhere in this range"
  • Reduces false positives

2. Negative Examples

  • Model learns when NOT to trade
  • Critical for avoiding bad signals
  • Improves precision/recall balance

3. Context Awareness

  • Model sees what led to the signal
  • Understands market conditions before entry
  • Better pattern recognition

4. Realistic Scenarios

  • Includes normal market noise
  • Not just "perfect" entry points
  • Model learns to filter noise

📊 Example Use Case

Scenario: Breakout Trade

Annotation:

  • Entry: 10:30:00 @ $2400 (breakout)
  • Exit: 10:35:00 @ $2460 (+2.5%)

Training Data Generated:

10:25 - 10:29: NO SIGNAL (consolidation before breakout)
10:30:        ENTRY SIGNAL (breakout confirmed)
10:31 - 10:34: HOLD (price moving up)
10:35:        EXIT SIGNAL (target reached)
10:36 - 10:40: NO SIGNAL (after exit)

Model Learns:

  • Don't signal during consolidation
  • Signal at breakout confirmation
  • Hold during profitable move
  • Exit at target
  • Don't signal after exit
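The outcome figures for this scenario can be recomputed from the annotation itself. The sketch below derives the percentage gain and the holding period; both function names are hypothetical, and negating the price change for SHORT trades is an assumed convention.

```python
from datetime import datetime


def profit_loss_pct(entry_price: float, exit_price: float,
                    direction: str = "LONG") -> float:
    """Percentage gain of the trade; negated for SHORT (assumed convention)."""
    change = (exit_price - entry_price) / entry_price * 100
    return change if direction == "LONG" else -change


def holding_period_seconds(entry_ts: str, exit_ts: str) -> int:
    """Seconds between two ISO-8601 timestamps such as '2024-01-15T10:30:00'."""
    delta = datetime.fromisoformat(exit_ts) - datetime.fromisoformat(entry_ts)
    return int(delta.total_seconds())


print(round(profit_loss_pct(2400, 2460), 2))
print(holding_period_seconds("2024-01-15T10:30:00", "2024-01-15T10:35:00"))
```

For the breakout trade, this gives a gain of 2.5% over a 300-second hold, matching the `profit_loss_pct` and `holding_period_seconds` fields shown in the test case format.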

🔍 Verification

Check Test Case Quality

import json

# Load test case
with open('test_case.json') as f:
    tc = json.load(f)

# Verify data completeness
assert 'market_state' in tc
assert 'ohlcv_1m' in tc['market_state']
assert 'training_labels' in tc['market_state']

# Check label distribution
labels = tc['market_state']['training_labels']['labels_1m']
print(f"NO_SIGNAL: {labels.count(0)}")
print(f"ENTRY: {labels.count(1)}")
print(f"HOLD: {labels.count(2)}")
print(f"EXIT: {labels.count(3)}")
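Beyond counting labels, the order of labels can be checked as well. The sketch below assumes each test case holds exactly one entry and one exit inside its window, as described under Balanced Training; if an exit can fall outside the window, the count checks would need relaxing. `check_label_sequence` is a hypothetical helper.

```python
def check_label_sequence(labels: list[int]) -> list[str]:
    """Return a list of ordering problems; an empty list means the sequence is valid."""
    problems = []
    if labels.count(1) != 1:
        problems.append(f"expected 1 ENTRY label, found {labels.count(1)}")
    if labels.count(3) != 1:
        problems.append(f"expected 1 EXIT label, found {labels.count(3)}")
    if problems:
        return problems

    entry_idx, exit_idx = labels.index(1), labels.index(3)
    if exit_idx < entry_idx:
        return ["EXIT appears before ENTRY"]
    # Everything strictly between entry and exit should be HOLD.
    if any(v != 2 for v in labels[entry_idx + 1:exit_idx]):
        problems.append("non-HOLD label between ENTRY and EXIT")
    # Everything outside the trade should be NO SIGNAL.
    if any(v != 0 for v in labels[:entry_idx] + labels[exit_idx + 1:]):
        problems.append("non-zero label outside the ENTRY..EXIT span")
    return problems


print(check_label_sequence([0, 0, 0, 1, 2, 2, 3, 0, 0, 0]))
# → []
```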

🚀 Summary

The ANNOTATE system generates production-ready training data with:

  • ±5 minutes of context around each signal
  • Training labels for each timestamp
  • Negative examples (where NOT to signal)
  • Positive examples (where TO signal)
  • All 4 timeframes (1s, 1m, 1h, 1d)
  • Complete market state (OHLCV data)

This enables models to learn:

  • Precise timing of entry/exit signals
  • When NOT to trade (avoiding false positives)
  • Context awareness (what leads to signals)
  • Realistic scenarios (including market noise)

Intended result: models that signal with higher precision and produce fewer false signals. 🎯