# BaseDataInput Specification

## Overview

`BaseDataInput` is the **unified, standardized data structure** used across all models in the trading system for both inference and training. It ensures consistency, extensibility, and proper feature engineering across CNN, RL, LSTM, Transformer, and Orchestrator models.

**Location:** `core/data_models.py`

---

## Design Principles

1. **Single Source of Truth**: All models receive identical input structure
2. **Fixed Feature Size**: `get_feature_vector()` always returns exactly 7,850 features
3. **Extensibility**: New features can be added without breaking existing models
4. **No Synthetic Data**: All features must come from real market data or be zero-padded
5. **Multi-Timeframe**: Supports multiple timeframes for comprehensive market analysis
6. **Cross-Model Feeding**: Includes predictions from other models for ensemble approaches

---

## Data Structure

### Core Fields

```python
@dataclass
class BaseDataInput:
    symbol: str          # Primary trading symbol (e.g., 'ETH/USDT')
    timestamp: datetime  # Current timestamp
```

### Multi-Timeframe OHLCV Data (Primary Symbol - ETH)

```python
ohlcv_1s: List[OHLCVBar]  # 300 frames of 1-second bars
ohlcv_1m: List[OHLCVBar]  # 300 frames of 1-minute bars
ohlcv_1h: List[OHLCVBar]  # 300 frames of 1-hour bars
ohlcv_1d: List[OHLCVBar]  # 300 frames of 1-day bars
```

**OHLCVBar Structure:**
```python
@dataclass
class OHLCVBar:
    symbol: str
    timestamp: datetime
    open: float
    high: float
    low: float
    close: float
    volume: float
    timeframe: str
    indicators: Dict[str, float] = field(default_factory=dict)

    # Enhanced TA properties (computed on-demand)
    @property
    def body_size(self) -> float: ...
    @property
    def upper_wick(self) -> float: ...
    @property
    def lower_wick(self) -> float: ...
    @property
    def total_range(self) -> float: ...
    @property
    def is_bullish(self) -> bool: ...
    @property
    def is_bearish(self) -> bool: ...
    @property
    def is_doji(self) -> bool: ...

    # Enhanced TA methods
    def get_body_to_range_ratio(self) -> float: ...
    def get_upper_wick_ratio(self) -> float: ...
    def get_lower_wick_ratio(self) -> float: ...
    def get_relative_size(self, reference_bars, method='avg') -> float: ...
    def get_candle_pattern(self) -> str: ...
    def get_ta_features(self, reference_bars=None) -> Dict[str, float]: ...
```

**See**: `docs/CANDLE_TA_FEATURES_REFERENCE.md` for complete TA feature documentation
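
The property bodies are elided above. As a rough illustration of the intended candle anatomy, here is a minimal sketch; the `_BarSketch` class and its formulas are assumptions for illustration only, and the authoritative definitions live in `core/data_models.py`:

```python
from dataclasses import dataclass

@dataclass
class _BarSketch:
    """Hypothetical stand-in for OHLCVBar showing plausible TA semantics."""
    open: float
    high: float
    low: float
    close: float

    @property
    def body_size(self) -> float:
        return abs(self.close - self.open)             # height of the candle body

    @property
    def upper_wick(self) -> float:
        return self.high - max(self.open, self.close)  # wick above the body

    @property
    def lower_wick(self) -> float:
        return min(self.open, self.close) - self.low   # wick below the body

    @property
    def total_range(self) -> float:
        return self.high - self.low

    @property
    def is_bullish(self) -> bool:
        return self.close > self.open

bar = _BarSketch(open=100.0, high=104.0, low=99.0, close=103.0)
assert bar.body_size == 3.0 and bar.upper_wick == 1.0 and bar.lower_wick == 1.0
```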

### Reference Symbol Data (BTC)

```python
btc_ohlcv_1s: List[OHLCVBar]  # 300 seconds of 1-second BTC bars
```

Used for correlation analysis and market-wide context.
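
For instance, a rolling correlation between the two 1s close series can be computed directly from these fields; `eth_btc_correlation` below is a hypothetical helper sketch, not part of `core/data_models.py`:

```python
import numpy as np

def eth_btc_correlation(eth_bars, btc_bars, window: int = 300) -> float:
    """Correlation of ETH vs. BTC closes over the shared trailing window."""
    n = min(len(eth_bars), len(btc_bars), window)
    if n < 2:
        return 0.0  # not enough overlapping bars
    eth = np.array([b.close for b in eth_bars[-n:]], dtype=float)
    btc = np.array([b.close for b in btc_bars[-n:]], dtype=float)
    if eth.std() == 0.0 or btc.std() == 0.0:
        return 0.0  # correlation is undefined for a flat series
    return float(np.corrcoef(eth, btc)[0, 1])

# rho = eth_btc_correlation(base_data.ohlcv_1s, base_data.btc_ohlcv_1s)
```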

### Consolidated Order Book (COB) Data

```python
cob_data: Optional[COBData]  # Real-time order book snapshot
```

**COBData Structure:**
```python
@dataclass
class COBData:
    symbol: str
    timestamp: datetime
    current_price: float
    bucket_size: float                            # $1 for ETH, $10 for BTC
    price_buckets: Dict[float, Dict[str, float]]  # ±20 buckets around current price
    bid_ask_imbalance: Dict[float, float]         # Imbalance ratio per bucket
    volume_weighted_prices: Dict[float, float]    # VWAP within each bucket
    order_flow_metrics: Dict[str, float]          # Order flow indicators

    # Moving averages of COB imbalance for ±5 buckets
    ma_1s_imbalance: Dict[float, float]   # 1-second MA
    ma_5s_imbalance: Dict[float, float]   # 5-second MA
    ma_15s_imbalance: Dict[float, float]  # 15-second MA
    ma_60s_imbalance: Dict[float, float]  # 60-second MA
```

**Price Bucket Details:**
Each bucket contains:
- `bid_volume`: Total bid volume in USD
- `ask_volume`: Total ask volume in USD
- `total_volume`: Combined volume
- `imbalance`: (bid_volume - ask_volume) / total_volume
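
A worked example of the imbalance formula (values invented): a bucket holding $150k of bids and $50k of asks has imbalance +0.5; +1.0 means all bids, -1.0 all asks.

```python
# Worked example of the bucket imbalance formula above (values invented).
bucket = {"bid_volume": 150_000.0, "ask_volume": 50_000.0}
bucket["total_volume"] = bucket["bid_volume"] + bucket["ask_volume"]
bucket["imbalance"] = (bucket["bid_volume"] - bucket["ask_volume"]) / bucket["total_volume"]
assert bucket["imbalance"] == 0.5  # bid-heavy bucket
```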

### COB Heatmap (Time-Series)

```python
cob_heatmap_times: List[datetime]      # Timestamps for each snapshot
cob_heatmap_prices: List[float]        # Price levels tracked
cob_heatmap_values: List[List[float]]  # 2D array: time × price buckets
```

Provides temporal evolution of order book liquidity and imbalance.
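
The feature extractor (see `get_feature_vector()` below) folds this 2D series into per-bucket means. Standalone, that aggregation looks roughly like the following sketch, a hypothetical helper mirroring the in-method logic:

```python
from typing import List

import numpy as np

def heatmap_bucket_means(values: List[List[float]], window: int = 300) -> List[float]:
    """Collapse the time × price heatmap into one mean liquidity value per bucket."""
    z = np.array(values, dtype=float)
    if z.ndim != 2 or z.size == 0:
        return []
    recent = z[-window:]                     # last `window` snapshots
    recent = np.nan_to_num(recent, nan=0.0)  # treat gaps as zero liquidity
    return recent.mean(axis=0).tolist()      # one mean per price bucket
```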

### Technical Indicators

```python
technical_indicators: Dict[str, float]  # Calculated indicators
```

Common indicators include:
- `sma_5`, `sma_20`, `sma_50`, `sma_200`: Simple moving averages
- `ema_12`, `ema_26`: Exponential moving averages
- `rsi`: Relative Strength Index
- `macd`, `macd_signal`, `macd_hist`: MACD components
- `bb_upper`, `bb_middle`, `bb_lower`: Bollinger Bands
- `atr`: Average True Range
- `volatility`: Historical volatility
- `volume_ratio`: Current volume vs. average volume
- `price_change_5m`, `price_change_15m`, `price_change_1h`: Price change over the named window
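
Note that `get_feature_vector()` serializes this dict via `list(self.technical_indicators.values())`, so insertion order determines which feature slot each indicator occupies. Building the dict in one canonical order keeps slots stable across samples; `INDICATOR_ORDER` and `ordered_indicators` below are a hypothetical convention, not part of the current implementation:

```python
from typing import Dict

# Canonical indicator order so feature slots stay stable across samples.
INDICATOR_ORDER = [
    "sma_5", "sma_20", "sma_50", "sma_200",
    "ema_12", "ema_26", "rsi",
    "macd", "macd_signal", "macd_hist",
    "bb_upper", "bb_middle", "bb_lower",
    "atr", "volatility", "volume_ratio",
]

def ordered_indicators(raw: Dict[str, float]) -> Dict[str, float]:
    # Missing indicators default to 0.0 instead of shifting later slots.
    return {name: raw.get(name, 0.0) for name in INDICATOR_ORDER}
```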

### Pivot Points

```python
pivot_points: List[PivotPoint]  # Williams Market Structure pivots
```

**PivotPoint Structure:**
```python
@dataclass
class PivotPoint:
    symbol: str
    timestamp: datetime
    price: float
    type: str          # 'high' or 'low'
    level: int         # Pivot level (1, 2, 3, etc.)
    confidence: float  # Confidence score (0.0 to 1.0)
```

### Cross-Model Predictions

```python
last_predictions: Dict[str, ModelOutput]  # Previous predictions from all models
```

Enables ensemble approaches and cross-model feeding. Keys are model names (e.g., 'cnn_v1', 'rl_agent', 'transformer').

### Market Microstructure

```python
market_microstructure: Dict[str, Any]  # Additional market state data
```

May include:
- Spread metrics
- Liquidity depth
- Order arrival rates
- Trade flow toxicity
- Market impact estimates

### Position Information

```python
position_info: Dict[str, Any]  # Current trading position state
```

Contains (see the example below):
- `has_position`: Boolean indicating whether a position is open
- `position_pnl`: Current profit/loss of the open position
- `position_size`: Size of the position
- `entry_price`: Entry price of the position
- `time_in_position_minutes`: How long the position has been open, in minutes
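
A populated example matching the five feature slots consumed in step 6 of `get_feature_vector()` (values invented):

```python
# Exactly these five keys are read when building the 5 position features.
position_info = {
    "has_position": True,
    "position_pnl": 12.5,              # unrealized PnL in quote currency
    "position_size": 0.25,             # e.g., 0.25 ETH
    "entry_price": 3150.0,
    "time_in_position_minutes": 42.0,
}
```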

---

## Feature Vector Conversion

The `get_feature_vector()` method converts the rich `BaseDataInput` structure into a **fixed-size numpy array** suitable for neural network input.

**Key Features:**
- **Automatic Normalization**: All OHLCV data normalized to the 0-1 range by default
- **Independent Normalization**: Primary symbol and BTC normalized separately
- **Daily Range**: Uses the daily (longest timeframe) min/max for the widest coverage
- **Cached Bounds**: Normalization boundaries cached for performance and denormalization
- **Fixed Size**: 7,850 features (standard) or 22,850 features (with candle TA)

### Feature Vector Breakdown

| Component | Features | Description |
|-----------|----------|-------------|
| **OHLCV ETH (4 timeframes)** | 6,000 | 300 frames × 4 timeframes × 5 values (OHLCV) |
| **OHLCV BTC (1s)** | 1,500 | 300 frames × 5 values (OHLCV) |
| **COB Features** | 200 | Price buckets + MAs + heatmap aggregates |
| **Technical Indicators** | 100 | Calculated indicators |
| **Last Predictions** | 45 | Cross-model predictions (9 models × 5 features) |
| **Position Info** | 5 | Position state |
| **TOTAL** | **7,850** | Fixed size |
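
The component counts sum exactly to the fixed size, which a quick check confirms:

```python
# Sanity check: component feature counts sum to the fixed vector size.
components = {
    "ohlcv_eth": 300 * 4 * 5,   # 6,000
    "ohlcv_btc": 300 * 5,       # 1,500
    "cob": 200,
    "indicators": 100,
    "last_predictions": 9 * 5,  # 45
    "position": 5,
}
assert sum(components.values()) == 7850
```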

### Normalization

#### NormalizationBounds Class

```python
@dataclass
class NormalizationBounds:
    """Normalization boundaries for price and volume data"""
    price_min: float
    price_max: float
    volume_min: float
    volume_max: float
    symbol: str
    timeframe: str = 'all'

    def normalize_price(self, price: float) -> float:
        """Normalize price to 0-1 range"""
        return (price - self.price_min) / (self.price_max - self.price_min)

    def denormalize_price(self, normalized: float) -> float:
        """Denormalize price from 0-1 range back to original"""
        return normalized * (self.price_max - self.price_min) + self.price_min

    def normalize_volume(self, volume: float) -> float:
        """Normalize volume to 0-1 range"""
        return (volume - self.volume_min) / (self.volume_max - self.volume_min)

    def denormalize_volume(self, normalized: float) -> float:
        """Denormalize volume from 0-1 range back to original"""
        return normalized * (self.volume_max - self.volume_min) + self.volume_min
```

#### How Normalization Works

1. **Primary Symbol (ETH)**: Uses daily (1d) timeframe data to compute min/max
   - Ensures all shorter timeframes (1s, 1m, 1h) fit within the 0-1 range
   - The daily timeframe has the widest price range, so all intraday prices normalize properly

2. **Reference Symbol (BTC)**: Uses its own 1s data to compute an independent min/max
   - BTC and ETH have different price scales
   - Independent normalization ensures both are in the 0-1 range

3. **Caching**: Bounds are computed once and cached for performance
   - Access via `get_normalization_bounds()` and `get_btc_normalization_bounds()`
   - Useful for denormalizing model predictions back to actual prices

#### Usage Examples

```python
# Get feature vector with normalization (default)
features = base_data.get_feature_vector(normalize=True)
# All OHLCV values are now in 0-1 range

# Get raw features without normalization
features_raw = base_data.get_feature_vector(normalize=False)
# OHLCV values are in original price/volume units

# Access normalization bounds for denormalization
bounds = base_data.get_normalization_bounds()
print(f"Price range: {bounds.price_min:.2f} - {bounds.price_max:.2f}")

# Denormalize a model prediction
predicted_normalized = 0.75  # Model output
predicted_price = bounds.denormalize_price(predicted_normalized)
print(f"Predicted price: ${predicted_price:.2f}")

# BTC bounds (independent)
btc_bounds = base_data.get_btc_normalization_bounds()
print(f"BTC range: {btc_bounds.price_min:.2f} - {btc_bounds.price_max:.2f}")
```

### Feature Vector Implementation

```python
def get_feature_vector(self, include_candle_ta: bool = False, normalize: bool = True) -> np.ndarray:
    """
    Convert BaseDataInput to standardized feature vector for models

    Args:
        include_candle_ta: If True, include enhanced candle TA features
        normalize: If True, normalize OHLCV to 0-1 range (default: True)

    Returns:
        np.ndarray: FIXED SIZE standardized feature vector (7850 or 22850 features)
    """
    FIXED_FEATURE_SIZE = 22850 if include_candle_ta else 7850
    features = []

    # Get normalization bounds (cached)
    if normalize:
        norm_bounds = self._compute_normalization_bounds()
        btc_norm_bounds = self._compute_btc_normalization_bounds()

    # 1. OHLCV features for ETH (6000 features, normalized to 0-1)
    for ohlcv_list in [self.ohlcv_1s, self.ohlcv_1m, self.ohlcv_1h, self.ohlcv_1d]:
        ohlcv_frames = ohlcv_list[-300:] if len(ohlcv_list) >= 300 else ohlcv_list
        for bar in ohlcv_frames:
            if normalize:
                features.extend([
                    norm_bounds.normalize_price(bar.open),
                    norm_bounds.normalize_price(bar.high),
                    norm_bounds.normalize_price(bar.low),
                    norm_bounds.normalize_price(bar.close),
                    norm_bounds.normalize_volume(bar.volume)
                ])
            else:
                features.extend([bar.open, bar.high, bar.low, bar.close, bar.volume])
        # Zero-pad any missing frames for this timeframe
        frames_needed = 300 - len(ohlcv_frames)
        if frames_needed > 0:
            features.extend([0.0] * (frames_needed * 5))

    # 2. BTC OHLCV features (1500 features, normalized independently)
    btc_frames = self.btc_ohlcv_1s[-300:] if len(self.btc_ohlcv_1s) >= 300 else self.btc_ohlcv_1s
    for bar in btc_frames:
        if normalize:
            features.extend([
                btc_norm_bounds.normalize_price(bar.open),
                btc_norm_bounds.normalize_price(bar.high),
                btc_norm_bounds.normalize_price(bar.low),
                btc_norm_bounds.normalize_price(bar.close),
                btc_norm_bounds.normalize_volume(bar.volume)
            ])
        else:
            features.extend([bar.open, bar.high, bar.low, bar.close, bar.volume])
    btc_frames_needed = 300 - len(btc_frames)
    if btc_frames_needed > 0:
        features.extend([0.0] * (btc_frames_needed * 5))

    # 3. COB features (200 features)
    cob_features = []
    if self.cob_data:
        # Price bucket features (up to 160 features: 40 buckets × 4 metrics)
        price_keys = sorted(self.cob_data.price_buckets.keys())[:40]
        for price in price_keys:
            bucket_data = self.cob_data.price_buckets[price]
            cob_features.extend([
                bucket_data.get('bid_volume', 0.0),
                bucket_data.get('ask_volume', 0.0),
                bucket_data.get('total_volume', 0.0),
                bucket_data.get('imbalance', 0.0)
            ])

        # Moving averages (up to 10 features)
        ma_features = []
        for ma_dict in [self.cob_data.ma_1s_imbalance, self.cob_data.ma_5s_imbalance]:
            # Sort before slicing so bucket selection is deterministic
            for price in sorted(ma_dict.keys())[:5]:
                ma_features.append(ma_dict[price])
                if len(ma_features) >= 10:
                    break
            if len(ma_features) >= 10:
                break
        cob_features.extend(ma_features)

        # Heatmap aggregates (remaining space)
        if self.cob_heatmap_values and self.cob_heatmap_prices:
            z = np.array(self.cob_heatmap_values, dtype=float)
            if z.ndim == 2 and z.size > 0:
                window_rows = z[-300:] if z.shape[0] >= 300 else z
                window_rows = np.nan_to_num(window_rows, nan=0.0)
                per_bucket_mean = window_rows.mean(axis=0).tolist()
                space_left = 200 - len(cob_features)
                if space_left > 0:
                    cob_features.extend(per_bucket_mean[:space_left])

    # Pad COB features to exactly 200
    cob_features.extend([0.0] * (200 - len(cob_features)))
    features.extend(cob_features[:200])

    # 4. Technical indicators (100 features)
    indicator_values = list(self.technical_indicators.values())
    features.extend(indicator_values[:100])
    features.extend([0.0] * max(0, 100 - len(indicator_values)))

    # 5. Last predictions (45 features)
    prediction_features = []
    for model_output in self.last_predictions.values():
        prediction_features.extend([
            model_output.confidence,
            model_output.predictions.get('buy_probability', 0.0),
            model_output.predictions.get('sell_probability', 0.0),
            model_output.predictions.get('hold_probability', 0.0),
            model_output.predictions.get('expected_reward', 0.0)
        ])
    features.extend(prediction_features[:45])
    features.extend([0.0] * max(0, 45 - len(prediction_features)))

    # 6. Position info (5 features)
    position_features = [
        1.0 if self.position_info.get('has_position', False) else 0.0,
        self.position_info.get('position_pnl', 0.0),
        self.position_info.get('position_size', 0.0),
        self.position_info.get('entry_price', 0.0),
        self.position_info.get('time_in_position_minutes', 0.0)
    ]
    features.extend(position_features)

    # Ensure exactly FIXED_FEATURE_SIZE
    if len(features) > FIXED_FEATURE_SIZE:
        features = features[:FIXED_FEATURE_SIZE]
    elif len(features) < FIXED_FEATURE_SIZE:
        features.extend([0.0] * (FIXED_FEATURE_SIZE - len(features)))

    assert len(features) == FIXED_FEATURE_SIZE
    return np.array(features, dtype=np.float32)
```

---

## Extensibility

### Adding New Features

The `BaseDataInput` structure is designed for extensibility. To add new features:

#### 1. Add New Field to BaseDataInput

```python
@dataclass
class BaseDataInput:
    # ... existing fields ...

    # NEW: Add your new feature
    sentiment_data: Dict[str, float] = field(default_factory=dict)
```

#### 2. Update get_feature_vector()

**Option A: Add to existing feature slots (if space available)**

```python
def get_feature_vector(self) -> np.ndarray:
    # ... existing code ...

    # Add sentiment features to the technical indicators section
    sentiment_features = [
        self.sentiment_data.get('twitter_sentiment', 0.0),
        self.sentiment_data.get('news_sentiment', 0.0),
        self.sentiment_data.get('fear_greed_index', 0.0)
    ]
    indicator_values.extend(sentiment_features)
    # ... rest of code ...
```

**Option B: Increase FIXED_FEATURE_SIZE (requires model retraining)**

```python
def get_feature_vector(self) -> np.ndarray:
    FIXED_FEATURE_SIZE = 7900  # Increased from 7850

    # ... existing features (7850) ...

    # NEW: Sentiment features (50 features)
    sentiment_features = []
    for key in sorted(self.sentiment_data.keys())[:50]:
        sentiment_features.append(self.sentiment_data[key])
    features.extend(sentiment_features[:50])
    features.extend([0.0] * max(0, 50 - len(sentiment_features)))

    # ... ensure FIXED_FEATURE_SIZE ...
```

#### 3. Update Data Provider

Ensure your data provider populates the new field:

```python
def build_base_data_input(self, symbol: str) -> BaseDataInput:
    # ... existing code ...

    # NEW: Add sentiment data
    sentiment_data = self._get_sentiment_data(symbol)

    return BaseDataInput(
        # ... existing fields ...
        sentiment_data=sentiment_data
    )
```

### Best Practices for Extension

1. **Maintain Fixed Size**: If adding features, either:
   - Use existing padding space
   - Increase `FIXED_FEATURE_SIZE` and retrain all models

2. **Zero Padding**: Always pad missing data with zeros, never synthetic data

3. **Validation**: Update `validate()` method if new fields are required

4. **Documentation**: Update this document with new feature descriptions

5. **Backward Compatibility**: Consider versioning if making breaking changes

---

## Current Usage Status

### Models Using BaseDataInput

✅ **StandardizedCNN** (`NN/models/standardized_cnn.py`)
- Uses `get_feature_vector()` directly
- Expected input: 7,834 features (close to 7,850)

✅ **Orchestrator** (`core/orchestrator.py`)
- Builds BaseDataInput via `data_provider.build_base_data_input()`
- Passes it to all models

✅ **UnifiedTrainingManager** (`core/unified_training_manager_v2.py`)
- Converts BaseDataInput to a DQN state via `get_feature_vector()`

✅ **Dashboard** (`web/clean_dashboard.py`)
- Creates BaseDataInput for CNN predictions
- Uses `get_feature_vector()` for feature extraction

### Alternative Implementations Found

⚠️ **ModelInputData** (`core/unified_model_data_interface.py`)
- **Status**: Legacy/alternative interface
- **Usage**: Limited, primarily for model-specific preprocessing
- **Recommendation**: Migrate to BaseDataInput for consistency

⚠️ **MockBaseDataInput** (`COBY/integration/orchestrator_adapter.py`)
- **Status**: Temporary adapter for COBY integration
- **Usage**: Provides the BaseDataInput interface for COBY data
- **Recommendation**: Replace with proper BaseDataInput construction

### Models NOT Using BaseDataInput

❌ **RealtimeRLCOBTrader** (`core/realtime_rl_cob_trader.py`)
- Uses a custom `_extract_features()` method
- **Recommendation**: Migrate to BaseDataInput

❌ **Some legacy models** may use direct feature extraction
- **Recommendation**: Audit and migrate to BaseDataInput

---

## Validation

The `validate()` method ensures data quality:

```python
def validate(self) -> bool:
    """
    Validate that the BaseDataInput contains required data

    Returns:
        bool: True if valid, False otherwise
    """
    # Check minimum OHLCV data
    if len(self.ohlcv_1s) < 100:
        return False
    if len(self.btc_ohlcv_1s) < 100:
        return False

    # Check timestamp
    if not self.timestamp:
        return False

    # Check symbol format
    if not self.symbol or '/' not in self.symbol:
        return False

    return True
```

---

## Related Classes

### ModelOutput

Output structure for model predictions:

```python
@dataclass
class ModelOutput:
    model_type: str                          # 'cnn', 'rl', 'lstm', 'transformer'
    model_name: str                          # Specific model identifier
    symbol: str
    timestamp: datetime
    confidence: float
    predictions: Dict[str, Any]              # Model-specific predictions
    hidden_states: Optional[Dict[str, Any]]  # For cross-model feeding
    metadata: Dict[str, Any]                 # Additional info
```
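
For cross-model feeding, the `predictions` keys read back out in `get_feature_vector()` (step 5) are `buy_probability`, `sell_probability`, `hold_probability`, and `expected_reward`, plus `confidence`. A hypothetical entry (values invented):

```python
from datetime import datetime, timezone

cnn_output = ModelOutput(
    model_type='cnn',
    model_name='cnn_v1',
    symbol='ETH/USDT',
    timestamp=datetime.now(timezone.utc),
    confidence=0.72,
    predictions={
        'buy_probability': 0.61,
        'sell_probability': 0.14,
        'hold_probability': 0.25,
        'expected_reward': 0.8,
    },
    hidden_states=None,
    metadata={},
)
# base_data.last_predictions['cnn_v1'] = cnn_output
```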

### COBSnapshot

Raw consolidated order book data (transformed into COBData):

```python
@dataclass
class COBSnapshot:
    symbol: str
    timestamp: datetime
    consolidated_bids: List[ConsolidatedOrderBookLevel]
    consolidated_asks: List[ConsolidatedOrderBookLevel]
    exchanges_active: List[str]
    volume_weighted_mid: float
    total_bid_liquidity: float
    total_ask_liquidity: float
    spread_bps: float
    liquidity_imbalance: float
    price_buckets: Dict[str, Dict[str, float]]
```

### PredictionSnapshot

Stores predictions with inputs for future training:

```python
@dataclass
class PredictionSnapshot:
    prediction_id: str
    symbol: str
    prediction_time: datetime
    target_horizon_minutes: int
    target_time: datetime
    current_price: float
    predicted_min_price: float
    predicted_max_price: float
    confidence: float
    model_inputs: Dict[str, Any]  # Includes BaseDataInput features
    market_state: Dict[str, Any]
    technical_indicators: Dict[str, Any]
    pivot_analysis: Dict[str, Any]
    actual_min_price: Optional[float]
    actual_max_price: Optional[float]
    outcome_known: bool
```

---

## Migration Guide

### For Models Not Using BaseDataInput

1. **Identify the current input method**
   ```python
   # OLD
   features = self._extract_features(symbol, data)
   ```

2. **Update to use BaseDataInput**
   ```python
   # NEW
   base_data = self.data_provider.build_base_data_input(symbol)
   if base_data and base_data.validate():
       features = base_data.get_feature_vector()
   ```

3. **Update the model interface**
   ```python
   # OLD
   def predict(self, features: np.ndarray) -> Dict: ...

   # NEW
   def predict(self, base_input: BaseDataInput) -> ModelOutput:
       features = base_input.get_feature_vector()
       # ... prediction logic ...
   ```

4. **Test thoroughly**
   - Verify the feature vector size matches expectations
   - Check for NaN or infinite values
   - Validate that predictions are reasonable

---

## Performance Considerations

### Memory Usage

- **BaseDataInput object**: ~2-5 MB per instance
- **Feature vector**: 7,850 × 4 bytes = 31.4 KB
- **Recommendation**: Cache BaseDataInput for 1-2 seconds and regenerate feature vectors as needed (a caching sketch follows below)
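
A minimal TTL-cache sketch for that recommendation; `BaseDataInputCache` is a hypothetical helper, not an existing class:

```python
import time
from typing import Dict, Optional, Tuple

class BaseDataInputCache:
    """Reuse a BaseDataInput per symbol for a short time-to-live."""

    def __init__(self, ttl_seconds: float = 1.5):
        self.ttl = ttl_seconds
        self._cache: Dict[str, Tuple[float, object]] = {}

    def get(self, symbol: str, provider):
        now = time.monotonic()
        hit = self._cache.get(symbol)
        if hit and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: reuse the cached instance
        base_data = provider.build_base_data_input(symbol)
        if base_data is not None:
            self._cache[symbol] = (now, base_data)
        return base_data
```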

### Computation Time

- **Building BaseDataInput**: ~5-10 ms
- **get_feature_vector()**: ~1-2 ms
- **Total overhead**: Negligible for real-time trading

### Optimization Tips

1. **Reuse OHLCV data**: Cache OHLCV bars across multiple BaseDataInput instances
2. **Lazy evaluation**: Only compute features when `get_feature_vector()` is called
3. **Batch processing**: Process multiple symbols in parallel
4. **Avoid deep copies**: Use references where possible

---

## Testing

### Unit Tests

```python
def test_base_data_input_feature_vector():
    """Test that the feature vector has the correct size"""
    base_data = create_test_base_data_input()
    features = base_data.get_feature_vector()

    assert len(features) == 7850
    assert features.dtype == np.float32
    assert not np.isnan(features).any()
    assert not np.isinf(features).any()

def test_base_data_input_validation():
    """Test validation logic"""
    base_data = create_test_base_data_input()
    assert base_data.validate()

    # Test with insufficient data
    base_data.ohlcv_1s = []
    assert not base_data.validate()
```

### Integration Tests

```python
def test_model_with_base_data_input():
    """Test model prediction with BaseDataInput"""
    orchestrator = create_test_orchestrator()
    base_data = orchestrator.data_provider.build_base_data_input('ETH/USDT')

    assert base_data is not None
    assert base_data.validate()

    # Test CNN prediction
    cnn_output = orchestrator.cnn_model.predict_from_base_input(base_data)
    assert isinstance(cnn_output, ModelOutput)
    assert 0.0 <= cnn_output.confidence <= 1.0
```

---

## Future Enhancements

### Planned Features

1. **Multi-Symbol Support**: Extend to support multiple correlated symbols
2. **Alternative Data**: Add social sentiment, on-chain metrics, macro indicators
3. **Feature Importance**: Track which features contribute most to predictions
4. **Compression**: Implement feature compression for faster transmission
5. **Versioning**: Add version field for backward compatibility

### Research Directions

1. **Adaptive Feature Selection**: Dynamically select relevant features per market regime
2. **Hierarchical Features**: Group related features for better model interpretability
3. **Temporal Attention**: Weight recent data more heavily than historical
4. **Cross-Asset Features**: Include correlations with other asset classes

---

## Conclusion

`BaseDataInput` is the cornerstone of the multi-modal trading system, providing:

- ✅ **Consistency**: All models use the same input format
- ✅ **Extensibility**: Easy to add new features without breaking existing code
- ✅ **Performance**: Fixed-size feature vectors enable efficient computation
- ✅ **Quality**: Validation ensures data integrity
- ✅ **Flexibility**: Supports multiple timeframes, order book data, and cross-model feeding

**All new models MUST use BaseDataInput** to ensure system-wide consistency and maintainability.

---

## References

- **Implementation**: `core/data_models.py`
- **Data Provider**: `core/standardized_data_provider.py`
- **Model Example**: `NN/models/standardized_cnn.py`
- **Training**: `core/unified_training_manager_v2.py`
- **FIFO Queue System**: `docs/fifo_queue_system.md`