gogo2/ANNOTATE/TRAINING_DATA_FORMAT.md
2025-10-25 16:35:08 +03:00


# ANNOTATE - Training Data Format
## 🎯 Overview
The ANNOTATE system generates training data that includes **±5 minutes of market data** around each trade signal. This allows models to learn:
- **WHERE to generate signals** (at entry/exit points)
- **WHERE NOT to generate signals** (before entry, after exit)
- **Context around the signal** (what led to the trade)
---
## 📦 Test Case Structure
### Complete Format
```json
{
  "test_case_id": "annotation_uuid",
  "symbol": "ETH/USDT",
  "timestamp": "2024-01-15T10:30:00Z",
  "action": "BUY",
  "market_state": {
    "ohlcv_1s": {
      "timestamps": [...],  // ±5 minutes of 1s candles (~600 candles)
      "open": [...],
      "high": [...],
      "low": [...],
      "close": [...],
      "volume": [...]
    },
    "ohlcv_1m": {
      "timestamps": [...],  // ±5 minutes of 1m candles (~10 candles)
      "open": [...],
      "high": [...],
      "low": [...],
      "close": [...],
      "volume": [...]
    },
    "ohlcv_1h": {
      "timestamps": [...],  // ±5 minutes of 1h candles (usually 1 candle)
      "open": [...],
      "high": [...],
      "low": [...],
      "close": [...],
      "volume": [...]
    },
    "ohlcv_1d": {
      "timestamps": [...],  // ±5 minutes of 1d candles (usually 1 candle)
      "open": [...],
      "high": [...],
      "low": [...],
      "close": [...],
      "volume": [...]
    },
    "training_labels": {
      "labels_1m": [0, 0, 0, 1, 2, 2, 3, 0, 0, 0],  // Label for each 1m candle
      "direction": "LONG",
      "entry_timestamp": "2024-01-15T10:30:00",
      "exit_timestamp": "2024-01-15T10:35:00"
    }
  },
  "expected_outcome": {
    "direction": "LONG",
    "profit_loss_pct": 2.5,
    "entry_price": 2400.50,
    "exit_price": 2460.75,
    "holding_period_seconds": 300
  },
  "annotation_metadata": {
    "annotator": "manual",
    "confidence": 1.0,
    "notes": "",
    "created_at": "2024-01-15T11:00:00Z",
    "timeframe": "1m"
  }
}
```
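The `expected_outcome` fields follow directly from the annotated entry and exit. The sketch below shows one way to derive them; the function name `compute_expected_outcome` and the sign convention for SHORT trades are assumptions made for illustration, not part of the ANNOTATE code.

```python
from datetime import datetime


def compute_expected_outcome(direction, entry_price, exit_price,
                             entry_timestamp, exit_timestamp):
    """Build an expected_outcome dict from an annotated entry and exit.

    Hypothetical helper: assumes ISO 8601 timestamps without a zone suffix
    and that a SHORT trade profits when the price falls.
    """
    change_pct = (exit_price - entry_price) / entry_price * 100.0
    if direction == "SHORT":
        change_pct = -change_pct

    entry_dt = datetime.fromisoformat(entry_timestamp)
    exit_dt = datetime.fromisoformat(exit_timestamp)

    return {
        "direction": direction,
        "profit_loss_pct": round(change_pct, 2),
        "entry_price": entry_price,
        "exit_price": exit_price,
        "holding_period_seconds": int((exit_dt - entry_dt).total_seconds()),
    }


outcome = compute_expected_outcome(
    "LONG", 2400.50, 2460.75,
    "2024-01-15T10:30:00", "2024-01-15T10:35:00",
)
print(outcome["profit_loss_pct"], outcome["holding_period_seconds"])
```

With the example prices this gives roughly 2.51% over 300 seconds, which the structure above shows rounded to 2.5.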
---
## 🏷️ Training Labels
### Label System
Each timestamp in the ±5 minute window is labeled:
| Label | Meaning | Description |
|-------|---------|-------------|
| **0** | NO SIGNAL | Before entry or after exit - model should NOT signal |
| **1** | ENTRY SIGNAL | At entry time - model SHOULD signal BUY/SELL |
| **2** | HOLD | Between entry and exit - model should maintain position |
| **3** | EXIT SIGNAL | At exit time - model SHOULD signal close position |
### Example Timeline
```
Time:   10:25  10:26  10:27  10:28  10:29  10:30  10:31  10:32  10:33  10:34  10:35  10:36  10:37
Label:    0      0      0      0      0      1      2      2      2      2      3      0      0
Action:   NO     NO     NO     NO     NO   ENTRY   HOLD   HOLD   HOLD   HOLD   EXIT    NO     NO
```
### Why This Matters
- **Negative Examples**: Model learns NOT to signal at random times
- **Context**: Model sees what happens before/after the signal
- **Precision**: Model learns exact timing, not just "buy somewhere"
---
## 📊 Data Window
### Time Window: ±5 Minutes
- **Entry Time**: 10:30:00
- **Window Start**: 10:25:00 (5 minutes before entry)
- **Window End**: 10:35:00 (5 minutes after entry)
### Candle Counts by Timeframe
| Timeframe | Candles in ±5min | Purpose |
|-----------|------------------|---------|
| **1s** | ~600 candles | Micro-structure, order flow |
| **1m** | ~10 candles | Short-term patterns |
| **1h** | ~1 candle | Trend context |
| **1d** | ~1 candle | Market regime |
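The counts in the table can be reproduced with a little arithmetic. The sketch below counts candles whose time span overlaps the window, assuming candles are aligned to multiples of the timeframe since the Unix epoch. Both the overlap rule and the helper name `candles_overlapping_window` are assumptions; a filter on candle open time alone can return fewer candles for the 1h and 1d timeframes.

```python
from datetime import datetime, timedelta, timezone

TIMEFRAME_SECONDS = {"1s": 1, "1m": 60, "1h": 3600, "1d": 86400}


def candles_overlapping_window(entry_time, timeframe,
                               half_window=timedelta(minutes=5)):
    """Count candles overlapping [entry - half_window, entry + half_window].

    Assumes candles are aligned to multiples of the timeframe since the epoch.
    """
    step = TIMEFRAME_SECONDS[timeframe]
    start = int((entry_time - half_window).timestamp())
    end = int((entry_time + half_window).timestamp())
    # Index of the candle containing each boundary; the count is inclusive.
    return end // step - start // step + 1


entry = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
counts = {tf: candles_overlapping_window(entry, tf) for tf in TIMEFRAME_SECONDS}
print(counts)
```

For the 10:30 entry this yields 601, 11, 1, and 1 candles. The inclusive window boundaries account for the extra candle on the 1s and 1m timeframes relative to the approximate figures in the table.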
---
## 🎓 Training Strategy
### Positive Examples (Signal Points)
- **Entry Point** (Label 1): Model learns to recognize entry conditions
- **Exit Point** (Label 3): Model learns to recognize exit conditions
### Negative Examples (Non-Signal Points)
- **Before Entry** (Label 0): Model learns NOT to signal too early
- **After Exit** (Label 0): Model learns NOT to signal too late
- **During Hold** (Label 2): Model learns to maintain position
### Balanced Training
For each annotation:
- **1 entry signal** (Label 1)
- **1 exit signal** (Label 3)
- **~3-5 hold periods** (Label 2)
- **~5-8 no-signal periods** (Label 0)
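This gives 10 to 15 labeled candles per annotation, with signal candles in the minority, so the model learns:
- When TO act (entry and exit, roughly 13-20% of candles)
- When NOT to act (hold and no-signal, roughly 80-87% of candles)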
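To see the split for a concrete annotation, the label shares can be computed directly. The sketch below also derives inverse-frequency class weights, a common way to compensate for rare classes during training. The weighting is an assumption for illustration; nothing in ANNOTATE is stated to apply it, and the helper names are hypothetical.

```python
from collections import Counter

LABEL_NAMES = {0: "NO_SIGNAL", 1: "ENTRY", 2: "HOLD", 3: "EXIT"}


def label_distribution(labels):
    """Return the fraction of candles carrying each label."""
    counts = Counter(labels)
    total = len(labels)
    return {LABEL_NAMES[k]: counts.get(k, 0) / total for k in LABEL_NAMES}


def inverse_frequency_weights(labels):
    """Weight each label by total / (num_classes * count); absent labels get 0.0."""
    counts = Counter(labels)
    total = len(labels)
    return {
        k: (total / (len(LABEL_NAMES) * counts[k])) if counts.get(k) else 0.0
        for k in LABEL_NAMES
    }


# Labels from the example timeline (10:25 through 10:37).
labels = [0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 3, 0, 0]
dist = label_distribution(labels)
signal_share = dist["ENTRY"] + dist["EXIT"]
weights = inverse_frequency_weights(labels)
print(f"signal share: {signal_share:.1%}")
```

For the 13-candle example timeline, entry and exit together cover 2 of 13 candles, so the two signal labels receive the largest weights.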
---
## 🔧 Implementation Details
### Data Fetching
```python
from datetime import timedelta

# Get ±5 minutes around entry
entry_time = annotation.entry['timestamp']
start_time = entry_time - timedelta(minutes=5)
end_time = entry_time + timedelta(minutes=5)

# Fetch data for window
df = data_provider.get_historical_data(
    symbol=symbol,
    timeframe=timeframe,
    limit=1000
)

# Filter to window (assumes df is indexed by candle timestamp)
df_window = df[(df.index >= start_time) & (df.index <= end_time)]
```
### Label Generation
```python
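labels = []
for timestamp in timestamps:
    # near_entry, near_exit and between_entry_and_exit compare the candle
    # timestamp against the annotated entry and exit times
    if near_entry(timestamp):
        label = 1  # ENTRY SIGNAL
    elif near_exit(timestamp):
        label = 3  # EXIT SIGNAL
    elif between_entry_and_exit(timestamp):
        label = 2  # HOLD
    else:
        label = 0  # NO SIGNAL
    labels.append(label)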
```
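The labeling loop above can be made concrete as follows. This is a minimal sketch: the `tolerance` parameter is an assumed definition of "near" (the source does not specify one), and `generate_labels` is a hypothetical name.

```python
from datetime import datetime, timedelta


def generate_labels(timestamps, entry_time, exit_time,
                    tolerance=timedelta(seconds=30)):
    """Assign 0 (NO SIGNAL), 1 (ENTRY), 2 (HOLD) or 3 (EXIT) to each timestamp.

    `tolerance` is an assumed definition of "near": a candle within this
    distance of the entry or exit time carries the corresponding label.
    """
    labels = []
    for ts in timestamps:
        if abs(ts - entry_time) <= tolerance:
            labels.append(1)
        elif abs(ts - exit_time) <= tolerance:
            labels.append(3)
        elif entry_time < ts < exit_time:
            labels.append(2)
        else:
            labels.append(0)
    return labels


entry = datetime(2024, 1, 15, 10, 30)
exit_ = datetime(2024, 1, 15, 10, 35)
# One timestamp per 1m candle from 10:25 to 10:37, as in the example timeline.
timestamps = [datetime(2024, 1, 15, 10, 25) + timedelta(minutes=i)
              for i in range(13)]
labels = generate_labels(timestamps, entry, exit_)
print(labels)
```

Run on the 10:25 to 10:37 range, this reproduces the label row of the example timeline: five 0s, one 1, four 2s, one 3, and two 0s.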
---
## 📈 Model Training Usage
### CNN Training
```python
# Input: OHLCV data for ±5 minutes
# Output: Probability distribution over labels [0, 1, 2, 3]
for timestamp, label in zip(timestamps, labels):
    features = extract_features(ohlcv_data, timestamp)
    prediction = model(features)
    loss = cross_entropy(prediction, label)
    optimizer.zero_grad()  # clear gradients from the previous step
    loss.backward()
    optimizer.step()       # apply the update (assumes a PyTorch-style optimizer)
```
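For reference, the cross-entropy loss over the four labels can be written out in plain Python. This sketch is framework-independent and only illustrates what the loss measures; the actual training code presumably relies on a deep learning library's implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true label."""
    return -math.log(softmax(logits)[label])


# Four scores, one per label: NO SIGNAL, ENTRY, HOLD, EXIT.
uncertain = cross_entropy([0.0, 0.0, 0.0, 0.0], label=1)
confident = cross_entropy([0.0, 5.0, 0.0, 0.0], label=1)
print(f"{uncertain:.3f} {confident:.3f}")
```

A model that spreads its probability evenly over the four labels incurs a loss of ln 4, while a model that concentrates on the correct label drives the loss toward zero.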
### DQN Training
```python
# State: Current market state
# Action: BUY/SELL/HOLD
# Reward: Based on label and outcome
for timestamp, label in zip(timestamps, labels):
    state = get_state(ohlcv_data, timestamp)
    action = agent.select_action(state)
    if label == 1:  # Should signal entry (BUY for a LONG annotation)
        reward = +1 if action == BUY else -1
    elif label == 0:  # Should NOT signal
        reward = +1 if action == HOLD else -1
    # Rewards for labels 2 (HOLD) and 3 (EXIT) are not shown here
```
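The reward rule can be extended to all four labels. The sketch below keeps the behavior shown above for labels 0 and 1 on a LONG annotation; the handling of labels 2 and 3 and of SHORT annotations is an assumption, and `expected_action` and `reward_for` are hypothetical names.

```python
BUY, SELL, HOLD = "BUY", "SELL", "HOLD"


def expected_action(label, direction="LONG"):
    """Map a training label to the action the agent is expected to take.

    Labels 0 and 1 follow the snippet above; the handling of labels 2 and 3
    and of SHORT trades is an assumption made for this sketch.
    """
    opens_with = BUY if direction == "LONG" else SELL
    closes_with = SELL if direction == "LONG" else BUY
    if label == 1:   # ENTRY SIGNAL: open the position
        return opens_with
    if label == 3:   # EXIT SIGNAL: close the position
        return closes_with
    return HOLD      # NO SIGNAL (0) and HOLD (2): take no trading action


def reward_for(label, action, direction="LONG"):
    """Return +1 when the action matches the label, otherwise -1."""
    return 1 if action == expected_action(label, direction) else -1


print(reward_for(1, BUY), reward_for(0, BUY))
```

Keeping the label-to-action mapping in one function makes it easy to swap in a different reward scheme, for example one that scales the entry reward by `profit_loss_pct`.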
---
## 🎯 Benefits
### 1. Precision Training
- Model learns **exact timing** of signals
- Not just "buy somewhere in this range"
- Reduces false positives
### 2. Negative Examples
- Model learns when **NOT** to trade
- Critical for avoiding bad signals
- Improves precision/recall balance
### 3. Context Awareness
- Model sees **what led to the signal**
- Understands market conditions before entry
- Better pattern recognition
### 4. Realistic Scenarios
- Includes normal market noise
- Not just "perfect" entry points
- Model learns to filter noise
---
## 📊 Example Use Case
### Scenario: Breakout Trade
**Annotation:**
- Entry: 10:30:00 @ $2400 (breakout)
- Exit: 10:35:00 @ $2460 (+2.5%)
**Training Data Generated:**
```
10:25 - 10:29: NO SIGNAL (consolidation before breakout)
10:30: ENTRY SIGNAL (breakout confirmed)
10:31 - 10:34: HOLD (price moving up)
10:35: EXIT SIGNAL (target reached)
10:36 - 10:40: NO SIGNAL (after exit)
```
**Model Learns:**
- Don't signal during consolidation
- Signal at breakout confirmation
- Hold during profitable move
- Exit at target
- Don't signal after exit
---
## 🔍 Verification
### Check Test Case Quality
```python
import json

# Load test case
with open('test_case.json') as f:
    tc = json.load(f)

# Verify data completeness
assert 'market_state' in tc
assert 'ohlcv_1m' in tc['market_state']
assert 'training_labels' in tc['market_state']

# Check label distribution
labels = tc['market_state']['training_labels']['labels_1m']
print(f"NO_SIGNAL: {labels.count(0)}")
print(f"ENTRY: {labels.count(1)}")
print(f"HOLD: {labels.count(2)}")
print(f"EXIT: {labels.count(3)}")
```
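Beyond checking that the keys exist, a few structural checks can catch malformed test cases. The function below is a hypothetical sketch, not part of ANNOTATE. It assumes one label per 1m candle and exactly one entry and one exit label per annotation, which may not hold if the exit falls outside the data window.

```python
def validate_test_case(tc):
    """Return a list of problems found in a test case dict (empty if none).

    Hypothetical checks: they assume one label per 1m candle and exactly
    one entry and one exit label per annotation.
    """
    problems = []
    market_state = tc.get("market_state", {})
    candles = market_state.get("ohlcv_1m", {})
    labels = market_state.get("training_labels", {}).get("labels_1m", [])

    # Every OHLCV column should have one value per timestamp.
    n = len(candles.get("timestamps", []))
    for column in ("open", "high", "low", "close", "volume"):
        if len(candles.get(column, [])) != n:
            problems.append(f"ohlcv_1m.{column} length differs from timestamps")

    if len(labels) != n:
        problems.append("labels_1m length differs from ohlcv_1m timestamps")
    if labels.count(1) != 1:
        problems.append("expected exactly one ENTRY label")
    if labels.count(3) != 1:
        problems.append("expected exactly one EXIT label")
    if 1 in labels and 3 in labels and labels.index(1) > labels.index(3):
        problems.append("EXIT label appears before ENTRY label")
    return problems
```

Calling `validate_test_case(tc)` on a loaded test case returns an empty list when all checks pass, so it can be used in a batch script to report problems without stopping at the first failure.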
---
## Summary
The ANNOTATE system generates **production-ready training data** with:
- **±5 minutes of context** around each signal
- **Training labels** for each timestamp
- **Negative examples** (where NOT to signal)
- **Positive examples** (where TO signal)
- **All 4 timeframes** (1s, 1m, 1h, 1d)
- **Complete market state** (OHLCV data)
This enables models to learn:
- **Precise timing** of entry/exit signals
- **When NOT to trade** (avoiding false positives)
- **Context awareness** (what leads to signals)
- **Realistic scenarios** (including market noise)
**Result**: Training data designed to produce models with higher precision and fewer false signals. 🎯