# BaseDataInput Normalization Guide

## Overview

By default, all OHLCV data in `BaseDataInput` is normalized to the 0-1 range, which keeps model training and inference consistent across different price scales and timeframes.

**Key Benefits:**
- ✅ Consistent input scale for neural networks
- ✅ Prevents gradient issues from large price values
- ✅ Enables transfer learning across different symbols
- ✅ Simplifies model architecture (no need for input scaling layers)
- ✅ Easy denormalization for predictions


---

## How It Works

### 1. Normalization Strategy

**Primary Symbol (e.g., ETH/USDT)**:
- Uses **daily (1d) timeframe** to compute min/max bounds
- The daily window has the widest price range, so prices from all shorter timeframes fit within 0-1
- All timeframes (1s, 1m, 1h, 1d) are normalized using the same bounds

**Reference Symbol (BTC/USDT)**:
- Uses **its own 1s data** to compute independent min/max bounds
- ETH and BTC have different price scales (e.g., $2000 vs $40000)
- Independent normalization ensures both are properly scaled to 0-1

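The strategy above can be sketched in a few lines. This is only an illustration, not the code in `core/data_models.py`: the `Candle` type and the `compute_bounds` helper are hypothetical names, and taking the price bounds from the lowest `low` and highest `high` is an assumption about how "min/max" is derived.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candle:
    """Hypothetical OHLCV bar; the project's real bar type may differ."""
    open: float
    high: float
    low: float
    close: float
    volume: float


def compute_bounds(candles: List[Candle]) -> Tuple[float, float, float, float]:
    """Return (price_min, price_max, volume_min, volume_max) for a candle list."""
    if not candles:
        raise ValueError("need at least one candle to compute bounds")
    price_min = min(c.low for c in candles)    # lowest low in the window (assumption)
    price_max = max(c.high for c in candles)   # highest high in the window (assumption)
    volume_min = min(c.volume for c in candles)
    volume_max = max(c.volume for c in candles)
    return price_min, price_max, volume_min, volume_max


# Primary symbol: bounds come from the daily candles
eth_daily = [
    Candle(2100.0, 2300.0, 2000.0, 2250.0, 5000.0),
    Candle(2250.0, 2500.0, 2200.0, 2400.0, 10000.0),
]
# Reference symbol: bounds come from its own 1s candles
btc_1s = [
    Candle(40000.0, 40010.0, 39990.0, 40005.0, 1.5),
    Candle(40005.0, 40020.0, 40000.0, 40015.0, 0.5),
]

eth_bounds = compute_bounds(eth_daily)
btc_bounds = compute_bounds(btc_1s)
print(eth_bounds)  # (2000.0, 2500.0, 5000.0, 10000.0)
print(btc_bounds)  # (39990.0, 40020.0, 0.5, 1.5)
```

The same helper serves both symbols; only the candle list passed in changes, which is what keeps the two sets of bounds independent.
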
### 2. Normalization Formula

```python
# Price normalization
normalized_price = (price - price_min) / (price_max - price_min)

# Volume normalization
normalized_volume = (volume - volume_min) / (volume_max - volume_min)

# Result: 0.0 to 1.0 range
# 0.0 = minimum price/volume in dataset
# 1.0 = maximum price/volume in dataset
```

### 3. Denormalization Formula

```python
# Price denormalization
original_price = normalized_price * (price_max - price_min) + price_min

# Volume denormalization
original_volume = normalized_volume * (volume_max - volume_min) + volume_min
```


---

## NormalizationBounds Class

### Structure

```python
from dataclasses import dataclass


@dataclass
class NormalizationBounds:
    """Normalization boundaries for price and volume data"""
    price_min: float   # Minimum price in dataset
    price_max: float   # Maximum price in dataset
    volume_min: float  # Minimum volume in dataset
    volume_max: float  # Maximum volume in dataset
    symbol: str        # Symbol these bounds apply to
    timeframe: str     # Timeframe used ('all' for multi-timeframe)
```

### Methods

```python
# Example values assume price bounds of $2000 - $2500

# Normalize price to 0-1
normalized = bounds.normalize_price(2375.0)  # Returns: 0.75

# Denormalize back to original
original = bounds.denormalize_price(0.75)  # Returns: 2375.0

# Normalize volume
normalized_vol = bounds.normalize_volume(1000.0)

# Denormalize volume
original_vol = bounds.denormalize_volume(0.5)

# Get ranges
price_range = bounds.get_price_range()    # price_max - price_min
volume_range = bounds.get_volume_range()  # volume_max - volume_min
```

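To make the method list concrete, here is a minimal stand-in that applies the formulas from "How It Works" to the fields shown under "Structure". It is a sketch, not the project's implementation: the class name `BoundsSketch` and the default `timeframe="all"` are assumptions, and returning 0.5 for a zero-width range follows the behaviour described under "Edge Cases".

```python
from dataclasses import dataclass


@dataclass
class BoundsSketch:
    """Illustrative stand-in for NormalizationBounds (not the project's code)."""
    price_min: float
    price_max: float
    volume_min: float
    volume_max: float
    symbol: str
    timeframe: str = "all"  # assumed default, for brevity in the demo

    def get_price_range(self) -> float:
        return self.price_max - self.price_min

    def get_volume_range(self) -> float:
        return self.volume_max - self.volume_min

    def normalize_price(self, price: float) -> float:
        price_range = self.get_price_range()
        if price_range == 0:
            return 0.5  # flat market: return the midpoint instead of dividing by zero
        return (price - self.price_min) / price_range

    def denormalize_price(self, normalized: float) -> float:
        return normalized * self.get_price_range() + self.price_min

    def normalize_volume(self, volume: float) -> float:
        volume_range = self.get_volume_range()
        if volume_range == 0:
            return 0.5  # constant volume: same midpoint convention
        return (volume - self.volume_min) / volume_range

    def denormalize_volume(self, normalized: float) -> float:
        return normalized * self.get_volume_range() + self.volume_min


bounds = BoundsSketch(2000.0, 2500.0, 100.0, 10000.0, symbol="ETH/USDT")
print(bounds.normalize_price(2375.0))   # 0.75
print(bounds.denormalize_price(0.75))   # 2375.0
print(bounds.normalize_volume(5050.0))  # 0.5
```
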

---

## Usage Examples

### Basic Usage

```python
from core.data_models import BaseDataInput

# Build BaseDataInput
base_data = data_provider.build_base_data_input('ETH/USDT')

# Get normalized features (default)
features = base_data.get_feature_vector(normalize=True)
# All OHLCV values are now 0.0 to 1.0

# Get raw features (no normalization)
features_raw = base_data.get_feature_vector(normalize=False)
# OHLCV values are in original units ($, volume)
```

### Accessing Normalization Bounds

```python
# Get bounds for primary symbol
bounds = base_data.get_normalization_bounds()

print(f"Symbol: {bounds.symbol}")
print(f"Price range: ${bounds.price_min:.2f} - ${bounds.price_max:.2f}")
print(f"Volume range: {bounds.volume_min:.2f} - {bounds.volume_max:.2f}")

# Example output:
# Symbol: ETH/USDT
# Price range: $2000.00 - $2500.00
# Volume range: 100.00 - 10000.00

# Get bounds for BTC (independent)
btc_bounds = base_data.get_btc_normalization_bounds()
print(f"BTC range: ${btc_bounds.price_min:.2f} - ${btc_bounds.price_max:.2f}")

# Example output:
# BTC range: $38000.00 - $42000.00
```

### Denormalizing Model Predictions

```python
# Model predicts normalized price
model_output = model.predict(features)  # Returns: 0.75 (normalized)

# Denormalize to actual price
bounds = base_data.get_normalization_bounds()
predicted_price = bounds.denormalize_price(model_output)

print(f"Model output (normalized): {model_output:.4f}")
print(f"Predicted price: ${predicted_price:.2f}")

# Example output:
# Model output (normalized): 0.7500
# Predicted price: $2375.00
```

### Training with Normalized Data

```python
# Training loop
for epoch in range(num_epochs):
    base_data = data_provider.build_base_data_input('ETH/USDT')

    # Get normalized features
    features = base_data.get_feature_vector(normalize=True)

    # Get normalized target.
    # NOTE: ohlcv_1m[-1] is the latest candle already in the input window;
    # a true "next close" target must come from the candle that follows it.
    bounds = base_data.get_normalization_bounds()
    target_price = base_data.ohlcv_1m[-1].close
    target_normalized = bounds.normalize_price(target_price)

    # Train model
    loss = model.train_step(features, target_normalized)

    # Denormalize prediction for logging
    prediction_normalized = model.predict(features)
    prediction_price = bounds.denormalize_price(prediction_normalized)

    print(f"Epoch {epoch}: Loss={loss:.4f}, Predicted=${prediction_price:.2f}")
```

### Inference with Denormalization

```python
def predict_next_price(symbol: str) -> float:
    """Predict next price and return in original units"""

    # Get current data
    base_data = data_provider.build_base_data_input(symbol)

    # Get normalized features
    features = base_data.get_feature_vector(normalize=True)

    # Model prediction (normalized)
    prediction_normalized = model.predict(features)

    # Denormalize to actual price
    bounds = base_data.get_normalization_bounds()
    prediction_price = bounds.denormalize_price(prediction_normalized)

    return prediction_price

# Usage
next_price = predict_next_price('ETH/USDT')
print(f"Predicted next price: ${next_price:.2f}")
```


---

## Why Daily Timeframe for Bounds?

### Problem: Different Timeframes, Different Ranges

```
1s timeframe: $2100 - $2110 (range: $10)
1m timeframe: $2095 - $2115 (range: $20)
1h timeframe: $2050 - $2150 (range: $100)
1d timeframe: $2000 - $2500 (range: $500)  ← Widest range
```

### Solution: Use Daily Min/Max

By using daily (longest timeframe) min/max:
- All shorter timeframes fit within 0-1 range
- No clipping or out-of-range values
- Consistent normalization across all timeframes

```python
# Daily bounds: $2000 - $2500

# 1s candle: close = $2100
normalized = (2100 - 2000) / (2500 - 2000)  # 0.20 ✓

# 1m candle: close = $2250
normalized = (2250 - 2000) / (2500 - 2000)  # 0.50 ✓

# 1h candle: close = $2400
normalized = (2400 - 2000) / (2500 - 2000)  # 0.80 ✓

# 1d candle: close = $2500
normalized = (2500 - 2000) / (2500 - 2000)  # 1.00 ✓
```

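A quick way to see why the widest window matters is to normalize every timeframe's range twice: once with the daily bounds and once with the 1s bounds. The sketch below uses the illustrative ranges from the "Problem" table; `normalize` and `normalized_span` are hypothetical helpers, and no clipping is applied so that out-of-range results stay visible.

```python
# Price ranges per timeframe, taken from the illustrative table in this section
ranges = {
    "1s": (2100.0, 2110.0),
    "1m": (2095.0, 2115.0),
    "1h": (2050.0, 2150.0),
    "1d": (2000.0, 2500.0),
}


def normalize(price: float, lo: float, hi: float) -> float:
    """Plain min-max scaling without clipping."""
    return (price - lo) / (hi - lo)


def normalized_span(bounds_tf: str) -> dict:
    """Normalize every timeframe's (low, high) using the bounds of `bounds_tf`."""
    lo, hi = ranges[bounds_tf]
    return {
        tf: (normalize(a, lo, hi), normalize(b, lo, hi))
        for tf, (a, b) in ranges.items()
    }


with_daily = normalized_span("1d")
with_1s = normalized_span("1s")

print(with_daily["1h"])  # (0.1, 0.3)   -> inside 0-1
print(with_1s["1h"])     # (-5.0, 5.0)  -> far outside 0-1
```
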

---

## Independent BTC Normalization

### Why Independent?

ETH and BTC have vastly different price scales:

```
ETH: $2000 - $2500 (range: $500)
BTC: $38000 - $42000 (range: $4000)
```

If both symbols shared one set of bounds ($2000 - $42000):
- ETH would be compressed into roughly the 0.00 - 0.01 range (bad!)
- BTC would only use the 0.90 - 1.00 range (bad!)

### Solution: Independent Bounds

```python
# ETH bounds
eth_bounds = base_data.get_normalization_bounds()
# price_min: $2000, price_max: $2500

# BTC bounds (independent)
btc_bounds = base_data.get_btc_normalization_bounds()
# price_min: $38000, price_max: $42000

# Both normalized to full 0-1 range
eth_normalized = eth_bounds.normalize_price(2250)   # 0.50
btc_normalized = btc_bounds.normalize_price(40000)  # 0.50
```

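The compression caused by shared bounds is easy to verify with the example ranges from this section. The snippet below is a worked calculation only; `normalize` is a hypothetical helper implementing the min-max formula, and the shared bounds are assumed to be the overall minimum and maximum of the two symbols.

```python
def normalize(price: float, lo: float, hi: float) -> float:
    """Min-max scaling, as in the normalization formula."""
    return (price - lo) / (hi - lo)


eth = (2000.0, 2500.0)    # ETH price range from the example
btc = (38000.0, 42000.0)  # BTC price range from the example

# Shared bounds: one min/max covering both symbols
shared_lo = min(eth[0], btc[0])
shared_hi = max(eth[1], btc[1])
eth_shared = tuple(normalize(p, shared_lo, shared_hi) for p in eth)
btc_shared = tuple(normalize(p, shared_lo, shared_hi) for p in btc)
print(eth_shared)  # (0.0, 0.0125)
print(btc_shared)  # (0.9, 1.0)

# Independent bounds: each symbol spans the full 0-1 range
eth_own = tuple(normalize(p, *eth) for p in eth)
btc_own = tuple(normalize(p, *btc) for p in btc)
print(eth_own, btc_own)  # (0.0, 1.0) (0.0, 1.0)
```
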

---

## Caching for Performance

Normalization bounds are computed once and cached:

```python
# First call: computes bounds
bounds = base_data.get_normalization_bounds()  # ~1-2 ms

# Subsequent calls: returns cached bounds
bounds = base_data.get_normalization_bounds()  # ~0.001 ms (roughly 1000x faster)
```

**Implementation:**
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BaseDataInput:
    # Cached bounds (computed on first access)
    _normalization_bounds: Optional[NormalizationBounds] = None
    _btc_normalization_bounds: Optional[NormalizationBounds] = None

    def get_normalization_bounds(self) -> NormalizationBounds:
        """Get bounds (cached)"""
        if self._normalization_bounds is None:
            self._normalization_bounds = self._compute_normalization_bounds()
        return self._normalization_bounds
```

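The lazy-cache pattern can be exercised in isolation. In this sketch the class `CachedBoundsSketch`, its `closes` list, and the `compute_calls` counter are all made up for the demonstration; the counter exists only to show that the scan runs once. Note that a cache like this would not pick up candles appended after the first access.

```python
from typing import List, Optional, Tuple


class CachedBoundsSketch:
    """Illustrative lazy cache; not the project's BaseDataInput."""

    def __init__(self, closes: List[float]) -> None:
        self.closes = closes
        self.compute_calls = 0  # demo-only counter
        self._bounds: Optional[Tuple[float, float]] = None

    def _compute_bounds(self) -> Tuple[float, float]:
        self.compute_calls += 1  # count how often the scan actually runs
        return min(self.closes), max(self.closes)

    def get_bounds(self) -> Tuple[float, float]:
        if self._bounds is None:  # first access: scan the data
            self._bounds = self._compute_bounds()
        return self._bounds       # later accesses: reuse the stored result


data = CachedBoundsSketch([2100.0, 2000.0, 2500.0])
first = data.get_bounds()
second = data.get_bounds()
print(first, data.compute_calls)  # (2000.0, 2500.0) 1
print(first is second)            # True
```
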

---

## Edge Cases

### 1. No Price Movement (price_min == price_max)

```python
# All prices are $2000
price_min = 2000.0
price_max = 2000.0

# Normalization returns 0.5 (middle)
normalized = bounds.normalize_price(2000.0)  # Returns: 0.5
```

### 2. Zero Volume

```python
# All volumes are 0
volume_min = 0.0
volume_max = 0.0

# Normalization returns 0.5
normalized = bounds.normalize_volume(0.0)  # Returns: 0.5
```

### 3. Insufficient Data

```python
# Fewer than 100 candles
if len(base_data.ohlcv_1s) < 100:
    # BaseDataInput.validate() returns False
    # Don't use for training/inference
    pass
```


---

## Best Practices

### ✅ DO

1. **Always use normalized features for training**
   ```python
   features = base_data.get_feature_vector(normalize=True)
   ```

2. **Store bounds with model checkpoints**
   ```python
   checkpoint = {
       'model_state': model.state_dict(),
       'normalization_bounds': {
           'price_min': bounds.price_min,
           'price_max': bounds.price_max,
           'volume_min': bounds.volume_min,
           'volume_max': bounds.volume_max
       }
   }
   ```

3. **Denormalize predictions for display/trading**
   ```python
   prediction_price = bounds.denormalize_price(model_output)
   ```

4. **Use same bounds for training and inference**
   ```python
   # Training
   bounds = base_data.get_normalization_bounds()
   save_bounds(bounds)

   # Inference (later)
   bounds = load_bounds()
   prediction = bounds.denormalize_price(model_output)
   ```

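The guide calls `save_bounds` and `load_bounds` without defining them. One possible shape for such helpers, assuming the bounds are a plain dataclass and JSON on disk is acceptable, is sketched below. `BoundsSketch`, the explicit `path` parameter, and the file name are all assumptions made for this example, not the project's API.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class BoundsSketch:
    """Stand-in with the same fields as NormalizationBounds."""
    price_min: float
    price_max: float
    volume_min: float
    volume_max: float
    symbol: str
    timeframe: str


def save_bounds(bounds: BoundsSketch, path: str) -> None:
    """Write the bounds as JSON (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(asdict(bounds), fh)


def load_bounds(path: str) -> BoundsSketch:
    """Read bounds previously written by save_bounds (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as fh:
        return BoundsSketch(**json.load(fh))


original = BoundsSketch(2000.0, 2500.0, 100.0, 10000.0, "ETH/USDT", "1d")
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bounds.json")
    save_bounds(original, path)
    restored = load_bounds(path)

print(restored == original)  # True
```

Storing the symbol and timeframe alongside the four numbers makes it harder to apply one symbol's bounds to another symbol's predictions by accident.
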
### ❌ DON'T

1. **Don't mix normalized and raw features**
   ```python
   # BAD: Inconsistent
   features_norm = base_data.get_feature_vector(normalize=True)
   features_raw = base_data.get_feature_vector(normalize=False)
   combined = np.concatenate([features_norm, features_raw])  # DON'T DO THIS
   ```

2. **Don't use different bounds for training vs inference**
   ```python
   # BAD: Different bounds
   # Training
   bounds_train = base_data_train.get_normalization_bounds()

   # Inference (different data, different bounds!)
   bounds_infer = base_data_infer.get_normalization_bounds()  # WRONG!
   ```

3. **Don't forget to denormalize predictions**
   ```python
   # BAD: Normalized prediction used directly
   prediction = model.predict(features)  # 0.75
   place_order(price=prediction)  # WRONG! Should be $2375, not $0.75
   ```


---

## Testing Normalization

### Unit Tests

```python
import numpy as np

from core.data_models import NormalizationBounds


def test_normalization():
    """Test normalization and denormalization"""
    bounds = NormalizationBounds(
        price_min=2000.0,
        price_max=2500.0,
        volume_min=100.0,
        volume_max=1000.0,
        symbol='ETH/USDT',
        timeframe='all'  # passed explicitly; the dataclass above shows no default
    )

    # Test price normalization
    assert bounds.normalize_price(2000.0) == 0.0
    assert bounds.normalize_price(2500.0) == 1.0
    assert bounds.normalize_price(2250.0) == 0.5

    # Test price denormalization
    assert bounds.denormalize_price(0.0) == 2000.0
    assert bounds.denormalize_price(1.0) == 2500.0
    assert bounds.denormalize_price(0.5) == 2250.0

    # Test round-trip
    original = 2375.0
    normalized = bounds.normalize_price(original)
    denormalized = bounds.denormalize_price(normalized)
    assert abs(denormalized - original) < 0.01


def test_feature_vector_normalization():
    """Test feature vector normalization"""
    base_data = create_test_base_data_input()

    # Get normalized features
    features_norm = base_data.get_feature_vector(normalize=True)

    # Check all OHLCV values are in 0-1 range
    ohlcv_features = features_norm[:7500]  # First 7500 are OHLCV
    assert np.all(ohlcv_features >= 0.0)
    assert np.all(ohlcv_features <= 1.0)

    # Get raw features
    features_raw = base_data.get_feature_vector(normalize=False)

    # Raw features should be > 1.0 (actual prices)
    assert np.any(features_raw[:7500] > 1.0)
```


---

## Performance

### Computation Time

| Operation | Time | Notes |
|-----------|------|-------|
| Compute bounds (first time) | ~1-2 ms | Scans all OHLCV data |
| Get cached bounds | ~0.001 ms | Returns cached object |
| Normalize single value | ~0.0001 ms | Simple arithmetic |
| Normalize 7850 features | ~0.5 ms | Vectorized operations |

### Memory Usage

| Item | Size | Notes |
|------|------|-------|
| NormalizationBounds object | ~100 bytes | 4 floats + 2 strings |
| Cached in BaseDataInput | ~200 bytes | 2 bounds objects |
| Total overhead | <1 KB | Negligible, per BaseDataInput instance |

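The timings in the tables depend on hardware and Python version, so it can be worth measuring on the target machine. The following is a rough sketch using the standard-library `timeit` module; the `normalize` function is a hypothetical stand-in for `normalize_price`, and no particular result is implied.

```python
import timeit


def normalize(value: float, lo: float, hi: float) -> float:
    """Stand-in for a single-value min-max normalization."""
    return (value - lo) / (hi - lo)


calls = 100_000
# Total wall-clock seconds for `calls` invocations
total_s = timeit.timeit(lambda: normalize(2250.0, 2000.0, 2500.0), number=calls)
per_call_ms = total_s / calls * 1000.0
print(f"normalize: {per_call_ms:.6f} ms per call")
```
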

---

## Summary

- ✅ **Automatic**: Normalization happens by default
- ✅ **Consistent**: Same bounds across all timeframes
- ✅ **Independent**: ETH and BTC normalized separately
- ✅ **Cached**: Bounds computed once, reused
- ✅ **Reversible**: Easy denormalization for predictions
- ✅ **Fast**: <1 ms overhead once bounds are cached

**Result**: Clean 0-1 range inputs for neural networks, with easy conversion back to real prices for trading.


---

## References

- **Implementation**: `core/data_models.py` - `NormalizationBounds` and `BaseDataInput`
- **Specification**: `docs/BASE_DATA_INPUT_SPECIFICATION.md`
- **Usage Guide**: `docs/BASE_DATA_INPUT_USAGE_AUDIT.md`