# BaseDataInput Normalization Guide
## Overview
All OHLCV data in `BaseDataInput` is automatically normalized to the 0-1 range to ensure consistent model training and inference across different price scales and timeframes.

**Key Benefits:**
- ✅ Consistent input scale for neural networks
- ✅ Prevents gradient issues from large price values
- ✅ Enables transfer learning across different symbols
- ✅ Simplifies model architecture (no need for input scaling layers)
- ✅ Easy denormalization for predictions
---
## How It Works
### 1. Normalization Strategy
**Primary Symbol (e.g., ETH/USDT)**:
- Uses **daily (1d) timeframe** to compute min/max bounds
- Daily has the widest price range, ensuring all shorter timeframes fit within 0-1
- All timeframes (1s, 1m, 1h, 1d) are normalized using the same bounds

**Reference Symbol (BTC/USDT)**:
- Uses **its own 1s data** to compute independent min/max bounds
- BTC and ETH have different price scales (e.g., $2000 vs $40000)
- Independent normalization ensures both are properly scaled to 0-1
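
For illustration, here is a minimal sketch of how such bounds could be derived. This is not the actual implementation: the candle attributes `high`/`low`/`volume` and the field names `ohlcv_1d` / `btc_ohlcv_1s` are assumptions, and `NormalizationBounds` is the container described later in this guide.

```python
def compute_bounds_from_candles(candles, symbol: str, timeframe: str) -> NormalizationBounds:
    """Derive min/max price and volume bounds from a list of OHLCV candles."""
    return NormalizationBounds(
        price_min=min(c.low for c in candles),
        price_max=max(c.high for c in candles),
        volume_min=min(c.volume for c in candles),
        volume_max=max(c.volume for c in candles),
        symbol=symbol,
        timeframe=timeframe,
    )

# Primary symbol: bounds from the daily candles, reused for every timeframe
eth_bounds = compute_bounds_from_candles(base_data.ohlcv_1d, 'ETH/USDT', 'all')

# Reference symbol: independent bounds from BTC's own 1s candles
btc_bounds = compute_bounds_from_candles(base_data.btc_ohlcv_1s, 'BTC/USDT', '1s')
```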
### 2. Normalization Formula
```python
# Price normalization
normalized_price = (price - price_min) / (price_max - price_min)
# Volume normalization
normalized_volume = (volume - volume_min) / (volume_max - volume_min)
# Result: 0.0 to 1.0 range
# 0.0 = minimum price/volume in dataset
# 1.0 = maximum price/volume in dataset
```
### 3. Denormalization Formula
```python
# Price denormalization
original_price = normalized_price * (price_max - price_min) + price_min
# Volume denormalization
original_volume = normalized_volume * (volume_max - volume_min) + volume_min
```
---
## NormalizationBounds Class
### Structure
```python
from dataclasses import dataclass

@dataclass
class NormalizationBounds:
    """Normalization boundaries for price and volume data"""
    price_min: float    # Minimum price in dataset
    price_max: float    # Maximum price in dataset
    volume_min: float   # Minimum volume in dataset
    volume_max: float   # Maximum volume in dataset
    symbol: str         # Symbol these bounds apply to
    timeframe: str      # Timeframe used ('all' for multi-timeframe)
```
### Methods
```python
# Normalize price to 0-1
normalized = bounds.normalize_price(2500.0) # Returns: 0.75 (example)
# Denormalize back to original
original = bounds.denormalize_price(0.75) # Returns: 2500.0
# Normalize volume
normalized_vol = bounds.normalize_volume(1000.0)
# Denormalize volume
original_vol = bounds.denormalize_volume(0.5)
# Get ranges
price_range = bounds.get_price_range() # price_max - price_min
volume_range = bounds.get_volume_range() # volume_max - volume_min
```
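
For reference, a minimal sketch of how the price methods might look (not the actual implementation from `core/data_models.py`; the volume methods are analogous, and the 0.5 fallback for a zero range matches the Edge Cases section below):

```python
class NormalizationBounds:
    ...  # fields as shown above

    def normalize_price(self, price: float) -> float:
        """Map a price into 0-1; fall back to 0.5 when there is no price range."""
        price_range = self.price_max - self.price_min
        if price_range == 0:
            return 0.5
        return (price - self.price_min) / price_range

    def denormalize_price(self, normalized: float) -> float:
        """Map a 0-1 value back to the original price scale."""
        return normalized * (self.price_max - self.price_min) + self.price_min

    def get_price_range(self) -> float:
        """Width of the price interval used for normalization."""
        return self.price_max - self.price_min
```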
---
## Usage Examples
### Basic Usage
```python
from core.data_models import BaseDataInput
# Build BaseDataInput
base_data = data_provider.build_base_data_input('ETH/USDT')
# Get normalized features (default)
features = base_data.get_feature_vector(normalize=True)
# All OHLCV values are now 0.0 to 1.0
# Get raw features (no normalization)
features_raw = base_data.get_feature_vector(normalize=False)
# OHLCV values are in original units ($, volume)
```
### Accessing Normalization Bounds
```python
# Get bounds for primary symbol
bounds = base_data.get_normalization_bounds()
print(f"Symbol: {bounds.symbol}")
print(f"Price range: ${bounds.price_min:.2f} - ${bounds.price_max:.2f}")
print(f"Volume range: {bounds.volume_min:.2f} - {bounds.volume_max:.2f}")
# Example output:
# Symbol: ETH/USDT
# Price range: $2000.00 - $2500.00
# Volume range: 100.00 - 10000.00
# Get bounds for BTC (independent)
btc_bounds = base_data.get_btc_normalization_bounds()
print(f"BTC range: ${btc_bounds.price_min:.2f} - ${btc_bounds.price_max:.2f}")
# Example output:
# BTC range: $38000.00 - $42000.00
```
### Denormalizing Model Predictions
```python
# Model predicts normalized price
model_output = model.predict(features) # Returns: 0.75 (normalized)
# Denormalize to actual price
bounds = base_data.get_normalization_bounds()
predicted_price = bounds.denormalize_price(model_output)
print(f"Model output (normalized): {model_output:.4f}")
print(f"Predicted price: ${predicted_price:.2f}")
# Example output:
# Model output (normalized): 0.7500
# Predicted price: $2375.00
```
### Training with Normalized Data
```python
# Training loop
for epoch in range(num_epochs):
    base_data = data_provider.build_base_data_input('ETH/USDT')

    # Get normalized features
    features = base_data.get_feature_vector(normalize=True)

    # Get normalized target (latest 1m close used here as a simple example target)
    bounds = base_data.get_normalization_bounds()
    target_price = base_data.ohlcv_1m[-1].close
    target_normalized = bounds.normalize_price(target_price)

    # Train model
    loss = model.train_step(features, target_normalized)

    # Denormalize prediction for logging
    prediction_normalized = model.predict(features)
    prediction_price = bounds.denormalize_price(prediction_normalized)

    print(f"Epoch {epoch}: Loss={loss:.4f}, Predicted=${prediction_price:.2f}")
```
### Inference with Denormalization
```python
def predict_next_price(symbol: str) -> float:
    """Predict next price and return it in original units"""
    # Get current data
    base_data = data_provider.build_base_data_input(symbol)

    # Get normalized features
    features = base_data.get_feature_vector(normalize=True)

    # Model prediction (normalized)
    prediction_normalized = model.predict(features)

    # Denormalize to actual price
    bounds = base_data.get_normalization_bounds()
    prediction_price = bounds.denormalize_price(prediction_normalized)

    return prediction_price

# Usage
next_price = predict_next_price('ETH/USDT')
print(f"Predicted next price: ${next_price:.2f}")
```
---
## Why Daily Timeframe for Bounds?
### Problem: Different Timeframes, Different Ranges
```
1s timeframe: $2100 - $2110 (range: $10)
1m timeframe: $2095 - $2115 (range: $20)
1h timeframe: $2050 - $2150 (range: $100)
1d timeframe: $2000 - $2500 (range: $500) ← Widest range
```
### Solution: Use Daily Min/Max
By using daily (longest timeframe) min/max:
- All shorter timeframes fit within 0-1 range
- No clipping or out-of-range values
- Consistent normalization across all timeframes
```python
# Daily bounds: $2000 - $2500

# 1s candle: close = $2100
normalized = (2100 - 2000) / (2500 - 2000)  # 0.20

# 1m candle: close = $2250
normalized = (2250 - 2000) / (2500 - 2000)  # 0.50

# 1h candle: close = $2400
normalized = (2400 - 2000) / (2500 - 2000)  # 0.80

# 1d candle: close = $2500
normalized = (2500 - 2000) / (2500 - 2000)  # 1.00
```
---
## Independent BTC Normalization
### Why Independent?
ETH and BTC have vastly different price scales:
```
ETH: $2000 - $2500 (range: $500)
BTC: $38000 - $42000 (range: $4000)
```
If we used the same bounds for both symbols (e.g., $2000 - $42000; see the worked example below):
- ETH would be compressed into roughly the 0.00 - 0.01 range (bad!)
- BTC would only ever use the 0.90 - 1.00 range (bad!)
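
With hypothetical shared bounds of $2000 - $42000 (taken from the ETH and BTC ranges listed above), the arithmetic works out as follows:

```python
# Hypothetical shared bounds spanning both symbols
shared_min, shared_max = 2000.0, 42000.0
shared_range = shared_max - shared_min  # 40000

# ETH ($2000 - $2500) is squeezed into a tiny slice of the 0-1 range
eth_low = (2000.0 - shared_min) / shared_range   # 0.0
eth_high = (2500.0 - shared_min) / shared_range  # 0.0125

# BTC ($38000 - $42000) only ever occupies the top of the range
btc_low = (38000.0 - shared_min) / shared_range   # 0.9
btc_high = (42000.0 - shared_min) / shared_range  # 1.0
```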
### Solution: Independent Bounds
```python
# ETH bounds
eth_bounds = base_data.get_normalization_bounds()
# price_min: $2000, price_max: $2500
# BTC bounds (independent)
btc_bounds = base_data.get_btc_normalization_bounds()
# price_min: $38000, price_max: $42000
# Both normalized to full 0-1 range
eth_normalized = eth_bounds.normalize_price(2250) # 0.50
btc_normalized = btc_bounds.normalize_price(40000) # 0.50
```
---
## Caching for Performance
Normalization bounds are computed once and cached:
```python
# First call: computes bounds
bounds = base_data.get_normalization_bounds() # ~1-2 ms
# Subsequent calls: returns cached bounds
bounds = base_data.get_normalization_bounds() # ~0.001 ms (1000x faster!)
```
**Implementation:**
```python
@dataclass
class BaseDataInput:
    # Cached bounds (computed on first access)
    _normalization_bounds: Optional[NormalizationBounds] = None
    _btc_normalization_bounds: Optional[NormalizationBounds] = None

    def get_normalization_bounds(self) -> NormalizationBounds:
        """Get bounds (cached)"""
        if self._normalization_bounds is None:
            self._normalization_bounds = self._compute_normalization_bounds()
        return self._normalization_bounds
```
---
## Edge Cases
### 1. No Price Movement (price_min == price_max)
```python
# All prices are $2000
price_min = 2000.0
price_max = 2000.0
# Normalization returns 0.5 (middle)
normalized = bounds.normalize_price(2000.0) # Returns: 0.5
```
### 2. Zero Volume
```python
# All volumes are 0
volume_min = 0.0
volume_max = 0.0
# Normalization returns 0.5
normalized = bounds.normalize_volume(0.0) # Returns: 0.5
```
### 3. Insufficient Data
```python
# Less than 100 candles
if len(base_data.ohlcv_1s) < 100:
    # BaseDataInput.validate() returns False
    # Don't use this input for training/inference
    pass
```
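
A small usage pattern for this check (a sketch only; the helper name and the defensive `None` check are illustrative, while `validate()` and `build_base_data_input()` are as referenced above):

```python
def safe_feature_vector(symbol: str):
    """Return normalized features only when the input passes validation."""
    base_data = data_provider.build_base_data_input(symbol)
    if base_data is None or not base_data.validate():
        # Fewer than 100 candles (or another validation failure):
        # don't use this input for training or inference
        return None
    return base_data.get_feature_vector(normalize=True)
```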
---
## Best Practices
### ✅ DO
1. **Always use normalized features for training**
```python
features = base_data.get_feature_vector(normalize=True)
```
2. **Store bounds with model checkpoints**
```python
checkpoint = {
    'model_state': model.state_dict(),
    'normalization_bounds': {
        'price_min': bounds.price_min,
        'price_max': bounds.price_max,
        'volume_min': bounds.volume_min,
        'volume_max': bounds.volume_max
    }
}
```
3. **Denormalize predictions for display/trading**
```python
prediction_price = bounds.denormalize_price(model_output)
```
4. **Use same bounds for training and inference**
```python
# Training
bounds = base_data.get_normalization_bounds()
save_bounds(bounds)
# Inference (later)
bounds = load_bounds()
prediction = bounds.denormalize_price(model_output)
```
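
`save_bounds` / `load_bounds` above are placeholders; a minimal sketch of one way to persist bounds as JSON (assuming the `NormalizationBounds` dataclass shown earlier; the file path and helper names are illustrative):

```python
import json
from dataclasses import asdict

def save_bounds(bounds: NormalizationBounds, path: str = 'normalization_bounds.json') -> None:
    """Persist the bounds used during training next to the model checkpoint."""
    with open(path, 'w') as f:
        json.dump(asdict(bounds), f)

def load_bounds(path: str = 'normalization_bounds.json') -> NormalizationBounds:
    """Restore the exact bounds for inference-time denormalization."""
    with open(path) as f:
        return NormalizationBounds(**json.load(f))
```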
### ❌ DON'T
1. **Don't mix normalized and raw features**
```python
# BAD: Inconsistent
features_norm = base_data.get_feature_vector(normalize=True)
features_raw = base_data.get_feature_vector(normalize=False)
combined = np.concatenate([features_norm, features_raw]) # DON'T DO THIS
```
2. **Don't use different bounds for training vs inference**
```python
# BAD: Different bounds
# Training
bounds_train = base_data_train.get_normalization_bounds()
# Inference (different data, different bounds!)
bounds_infer = base_data_infer.get_normalization_bounds() # WRONG!
```
3. **Don't forget to denormalize predictions**
```python
# BAD: Normalized prediction used directly
prediction = model.predict(features) # 0.75
place_order(price=prediction) # WRONG! Should be $2375, not $0.75
```
---
## Testing Normalization
### Unit Tests
```python
import numpy as np

from core.data_models import NormalizationBounds


def test_normalization():
    """Test normalization and denormalization"""
    bounds = NormalizationBounds(
        price_min=2000.0,
        price_max=2500.0,
        volume_min=100.0,
        volume_max=1000.0,
        symbol='ETH/USDT',
        timeframe='all'
    )

    # Test price normalization
    assert bounds.normalize_price(2000.0) == 0.0
    assert bounds.normalize_price(2500.0) == 1.0
    assert bounds.normalize_price(2250.0) == 0.5

    # Test price denormalization
    assert bounds.denormalize_price(0.0) == 2000.0
    assert bounds.denormalize_price(1.0) == 2500.0
    assert bounds.denormalize_price(0.5) == 2250.0

    # Test round-trip
    original = 2375.0
    normalized = bounds.normalize_price(original)
    denormalized = bounds.denormalize_price(normalized)
    assert abs(denormalized - original) < 0.01


def test_feature_vector_normalization():
    """Test feature vector normalization"""
    base_data = create_test_base_data_input()

    # Get normalized features
    features_norm = base_data.get_feature_vector(normalize=True)

    # Check all OHLCV values are in 0-1 range
    ohlcv_features = features_norm[:7500]  # First 7500 are OHLCV
    assert np.all(ohlcv_features >= 0.0)
    assert np.all(ohlcv_features <= 1.0)

    # Get raw features
    features_raw = base_data.get_feature_vector(normalize=False)

    # Raw features should be > 1.0 (actual prices)
    assert np.any(features_raw[:7500] > 1.0)
```
---
## Performance
### Computation Time
| Operation | Time | Notes |
|-----------|------|-------|
| Compute bounds (first time) | ~1-2 ms | Scans all OHLCV data |
| Get cached bounds | ~0.001 ms | Returns cached object |
| Normalize single value | ~0.0001 ms | Simple arithmetic |
| Normalize 7850 features | ~0.5 ms | Vectorized operations |
### Memory Usage
| Item | Size | Notes |
|------|------|-------|
| NormalizationBounds object | ~100 bytes | 4 floats + 2 strings |
| Cached in BaseDataInput | ~200 bytes | 2 bounds objects |
| Negligible overhead | <1 KB | Per BaseDataInput instance |
---
## Summary
- ✅ **Automatic**: Normalization happens by default
- ✅ **Consistent**: Same bounds across all timeframes
- ✅ **Independent**: ETH and BTC normalized separately
- ✅ **Cached**: Bounds computed once, reused
- ✅ **Reversible**: Easy denormalization for predictions
- ✅ **Fast**: <1 ms overhead

**Result**: Clean 0-1 range inputs for neural networks, with easy conversion back to real prices for trading.
---
## References
- **Implementation**: `core/data_models.py` - `NormalizationBounds` and `BaseDataInput`
- **Specification**: `docs/BASE_DATA_INPUT_SPECIFICATION.md`
- **Usage Guide**: `docs/BASE_DATA_INPUT_USAGE_AUDIT.md`