# BaseDataInput Normalization Guide

## Overview
All OHLCV data in BaseDataInput is automatically normalized to the 0-1 range to ensure consistent model training and inference across different price scales and timeframes.
**Key Benefits:**
- ✅ Consistent input scale for neural networks
- ✅ Prevents gradient issues from large price values
- ✅ Enables transfer learning across different symbols
- ✅ Simplifies model architecture (no need for input scaling layers)
- ✅ Easy denormalization for predictions
## How It Works

### 1. Normalization Strategy
**Primary Symbol (e.g., ETH/USDT):**
- Uses the daily (1d) timeframe to compute min/max bounds
- The daily series has the widest price range, ensuring all shorter timeframes fit within 0-1
- All timeframes (1s, 1m, 1h, 1d) are normalized using the same bounds
**Reference Symbol (BTC/USDT):**
- Uses its own 1s data to compute independent min/max bounds
- BTC and ETH have very different price scales (e.g., ~$40,000 vs ~$2,000)
- Independent normalization ensures both are properly scaled to 0-1
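A minimal sketch of how these bounds might be derived from the candle lists (illustrative only: `compute_bounds_from_candles` and the `btc_ohlcv_1s` attribute are assumed names, not the actual implementation in `core/data_models.py`):

```python
def compute_bounds_from_candles(candles, symbol: str, timeframe: str) -> NormalizationBounds:
    """Scan a list of OHLCV candles for min/max price and volume."""
    # Candle lows bound the price range from below, highs from above
    return NormalizationBounds(
        price_min=min(c.low for c in candles),
        price_max=max(c.high for c in candles),
        volume_min=min(c.volume for c in candles),
        volume_max=max(c.volume for c in candles),
        symbol=symbol,
        timeframe=timeframe,
    )

# Primary symbol: bounds come from the daily series and are shared by all timeframes
eth_bounds = compute_bounds_from_candles(base_data.ohlcv_1d, 'ETH/USDT', 'all')

# Reference symbol: independent bounds from its own 1s series
# (btc_ohlcv_1s is an assumed attribute name)
btc_bounds = compute_bounds_from_candles(base_data.btc_ohlcv_1s, 'BTC/USDT', '1s')
```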
### 2. Normalization Formula
```python
# Price normalization
normalized_price = (price - price_min) / (price_max - price_min)

# Volume normalization
normalized_volume = (volume - volume_min) / (volume_max - volume_min)

# Result: 0.0 to 1.0 range
# 0.0 = minimum price/volume in dataset
# 1.0 = maximum price/volume in dataset
```
### 3. Denormalization Formula
```python
# Price denormalization
original_price = normalized_price * (price_max - price_min) + price_min

# Volume denormalization
original_volume = normalized_volume * (volume_max - volume_min) + volume_min
```
## NormalizationBounds Class

### Structure
```python
@dataclass
class NormalizationBounds:
    """Normalization boundaries for price and volume data"""
    price_min: float    # Minimum price in dataset
    price_max: float    # Maximum price in dataset
    volume_min: float   # Minimum volume in dataset
    volume_max: float   # Maximum volume in dataset
    symbol: str         # Symbol these bounds apply to
    timeframe: str      # Timeframe used ('all' for multi-timeframe)
```
### Methods
```python
# Normalize price to 0-1 (example bounds: $2000 - $2500)
normalized = bounds.normalize_price(2375.0)  # Returns: 0.75

# Denormalize back to original
original = bounds.denormalize_price(0.75)  # Returns: 2375.0

# Normalize volume
normalized_vol = bounds.normalize_volume(1000.0)

# Denormalize volume
original_vol = bounds.denormalize_volume(0.5)

# Get ranges
price_range = bounds.get_price_range()    # price_max - price_min
volume_range = bounds.get_volume_range()  # volume_max - volume_min
```
## Usage Examples

### Basic Usage
```python
from core.data_models import BaseDataInput

# Build BaseDataInput
base_data = data_provider.build_base_data_input('ETH/USDT')

# Get normalized features (default)
features = base_data.get_feature_vector(normalize=True)
# All OHLCV values are now 0.0 to 1.0

# Get raw features (no normalization)
features_raw = base_data.get_feature_vector(normalize=False)
# OHLCV values are in original units ($, volume)
```
### Accessing Normalization Bounds
```python
# Get bounds for primary symbol
bounds = base_data.get_normalization_bounds()
print(f"Symbol: {bounds.symbol}")
print(f"Price range: ${bounds.price_min:.2f} - ${bounds.price_max:.2f}")
print(f"Volume range: {bounds.volume_min:.2f} - {bounds.volume_max:.2f}")

# Example output:
# Symbol: ETH/USDT
# Price range: $2000.00 - $2500.00
# Volume range: 100.00 - 10000.00

# Get bounds for BTC (independent)
btc_bounds = base_data.get_btc_normalization_bounds()
print(f"BTC range: ${btc_bounds.price_min:.2f} - ${btc_bounds.price_max:.2f}")

# Example output:
# BTC range: $38000.00 - $42000.00
```
### Denormalizing Model Predictions
```python
# Model predicts normalized price
model_output = model.predict(features)  # Returns: 0.75 (normalized)

# Denormalize to actual price
bounds = base_data.get_normalization_bounds()
predicted_price = bounds.denormalize_price(model_output)

print(f"Model output (normalized): {model_output:.4f}")
print(f"Predicted price: ${predicted_price:.2f}")

# Example output:
# Model output (normalized): 0.7500
# Predicted price: $2375.00
```
### Training with Normalized Data
```python
# Training loop
for epoch in range(num_epochs):
    base_data = data_provider.build_base_data_input('ETH/USDT')

    # Get normalized features
    features = base_data.get_feature_vector(normalize=True)

    # Get normalized target (latest close price)
    bounds = base_data.get_normalization_bounds()
    target_price = base_data.ohlcv_1m[-1].close
    target_normalized = bounds.normalize_price(target_price)

    # Train model
    loss = model.train_step(features, target_normalized)

    # Denormalize prediction for logging
    prediction_normalized = model.predict(features)
    prediction_price = bounds.denormalize_price(prediction_normalized)
    print(f"Epoch {epoch}: Loss={loss:.4f}, Predicted=${prediction_price:.2f}")
```
### Inference with Denormalization
```python
def predict_next_price(symbol: str) -> float:
    """Predict next price and return in original units"""
    # Get current data
    base_data = data_provider.build_base_data_input(symbol)

    # Get normalized features
    features = base_data.get_feature_vector(normalize=True)

    # Model prediction (normalized)
    prediction_normalized = model.predict(features)

    # Denormalize to actual price
    bounds = base_data.get_normalization_bounds()
    prediction_price = bounds.denormalize_price(prediction_normalized)

    return prediction_price

# Usage
next_price = predict_next_price('ETH/USDT')
print(f"Predicted next price: ${next_price:.2f}")
```
## Why Daily Timeframe for Bounds?

### Problem: Different Timeframes, Different Ranges
```
1s timeframe: $2100 - $2110  (range: $10)
1m timeframe: $2095 - $2115  (range: $20)
1h timeframe: $2050 - $2150  (range: $100)
1d timeframe: $2000 - $2500  (range: $500)  ← Widest range
```
### Solution: Use Daily Min/Max

By using the daily (longest-timeframe) min/max:
- All shorter timeframes fit within 0-1 range
- No clipping or out-of-range values
- Consistent normalization across all timeframes
```python
# Daily bounds: $2000 - $2500

# 1s candle: close = $2100
normalized = (2100 - 2000) / (2500 - 2000)  # = 0.20 ✓

# 1m candle: close = $2250
normalized = (2250 - 2000) / (2500 - 2000)  # = 0.50 ✓

# 1h candle: close = $2400
normalized = (2400 - 2000) / (2500 - 2000)  # = 0.80 ✓

# 1d candle: close = $2500
normalized = (2500 - 2000) / (2500 - 2000)  # = 1.00 ✓
```
## Independent BTC Normalization

### Why Independent?
ETH and BTC have vastly different price scales:
```
ETH: $2000 - $2500    (range: $500)
BTC: $38000 - $42000  (range: $4000)
```
If we used one shared set of bounds ($2000 - $42000) for both:
- ETH would be compressed into the 0.00 - 0.01 range (bad!)
- BTC would only use the 0.90 - 1.00 range (bad!)
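The arithmetic makes the compression concrete. A quick check with the example ranges above and one shared pair of bounds:

```python
shared_min, shared_max = 2000.0, 42000.0  # one set of bounds covering both symbols
shared_range = shared_max - shared_min    # 40000

# ETH collapses into a thin slice at the bottom
eth_low  = (2000 - shared_min) / shared_range   # 0.0000
eth_high = (2500 - shared_min) / shared_range   # 0.0125

# BTC crowds the top tenth of the range
btc_low  = (38000 - shared_min) / shared_range  # 0.9000
btc_high = (42000 - shared_min) / shared_range  # 1.0000
```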
### Solution: Independent Bounds
```python
# ETH bounds
eth_bounds = base_data.get_normalization_bounds()
# price_min: $2000, price_max: $2500

# BTC bounds (independent)
btc_bounds = base_data.get_btc_normalization_bounds()
# price_min: $38000, price_max: $42000

# Both normalized to full 0-1 range
eth_normalized = eth_bounds.normalize_price(2250)   # 0.50
btc_normalized = btc_bounds.normalize_price(40000)  # 0.50
```
## Caching for Performance
Normalization bounds are computed once and cached:
```python
# First call: computes bounds
bounds = base_data.get_normalization_bounds()  # ~1-2 ms

# Subsequent calls: return the cached bounds
bounds = base_data.get_normalization_bounds()  # ~0.001 ms (1000x faster!)
```
**Implementation:**
```python
@dataclass
class BaseDataInput:
    # Cached bounds (computed on first access)
    _normalization_bounds: Optional[NormalizationBounds] = None
    _btc_normalization_bounds: Optional[NormalizationBounds] = None

    def get_normalization_bounds(self) -> NormalizationBounds:
        """Get bounds (cached)"""
        if self._normalization_bounds is None:
            self._normalization_bounds = self._compute_normalization_bounds()
        return self._normalization_bounds
```
## Edge Cases

### 1. No Price Movement (price_min == price_max)
```python
# All prices are $2000
price_min = 2000.0
price_max = 2000.0

# Normalization returns 0.5 (middle of the range)
normalized = bounds.normalize_price(2000.0)  # Returns: 0.5
```
### 2. Zero Volume
```python
# All volumes are 0
volume_min = 0.0
volume_max = 0.0

# Normalization returns 0.5
normalized = bounds.normalize_volume(0.0)  # Returns: 0.5
```
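Both cases fall out of a simple guard in the normalize methods. A minimal sketch of how `normalize_price` might implement it (the actual code lives in `core/data_models.py`):

```python
def normalize_price(self, price: float) -> float:
    """Map a price into 0-1; a degenerate range maps to the midpoint."""
    price_range = self.price_max - self.price_min
    if price_range == 0:
        # No movement in the dataset: return the middle of the range
        return 0.5
    return (price - self.price_min) / price_range
```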
### 3. Insufficient Data
```python
# Fewer than 100 candles
if len(base_data.ohlcv_1s) < 100:
    # BaseDataInput.validate() returns False;
    # don't use this input for training/inference
    pass
```
## Best Practices

### ✅ DO
- **Always use normalized features for training**
  ```python
  features = base_data.get_feature_vector(normalize=True)
  ```
- **Store bounds with model checkpoints**
  ```python
  checkpoint = {
      'model_state': model.state_dict(),
      'normalization_bounds': {
          'price_min': bounds.price_min,
          'price_max': bounds.price_max,
          'volume_min': bounds.volume_min,
          'volume_max': bounds.volume_max,
      }
  }
  ```
- **Denormalize predictions for display/trading**
  ```python
  prediction_price = bounds.denormalize_price(model_output)
  ```
- **Use the same bounds for training and inference** (a sketch of the save/load helpers follows this list)
  ```python
  # Training
  bounds = base_data.get_normalization_bounds()
  save_bounds(bounds)

  # Inference (later)
  bounds = load_bounds()
  prediction = bounds.denormalize_price(model_output)
  ```
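The `save_bounds`/`load_bounds` helpers above are placeholders, not library functions; a minimal JSON-based sketch might look like this:

```python
import json
from dataclasses import asdict

def save_bounds(bounds: NormalizationBounds, path: str = "bounds.json") -> None:
    """Persist the training-time bounds next to the model checkpoint."""
    with open(path, "w") as f:
        json.dump(asdict(bounds), f)

def load_bounds(path: str = "bounds.json") -> NormalizationBounds:
    """Restore the exact bounds that were used during training."""
    with open(path) as f:
        return NormalizationBounds(**json.load(f))
```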
### ❌ DON'T
- **Don't mix normalized and raw features**
  ```python
  # BAD: Inconsistent
  features_norm = base_data.get_feature_vector(normalize=True)
  features_raw = base_data.get_feature_vector(normalize=False)
  combined = np.concatenate([features_norm, features_raw])  # DON'T DO THIS
  ```
- **Don't use different bounds for training vs inference**
  ```python
  # BAD: Different bounds
  # Training
  bounds_train = base_data_train.get_normalization_bounds()
  # Inference (different data, different bounds!)
  bounds_infer = base_data_infer.get_normalization_bounds()  # WRONG!
  ```
- **Don't forget to denormalize predictions**
  ```python
  # BAD: Normalized prediction used directly
  prediction = model.predict(features)  # 0.75
  place_order(price=prediction)  # WRONG! Should be $2375, not $0.75
  ```
## Testing Normalization

### Unit Tests
```python
def test_normalization():
    """Test normalization and denormalization"""
    bounds = NormalizationBounds(
        price_min=2000.0,
        price_max=2500.0,
        volume_min=100.0,
        volume_max=1000.0,
        symbol='ETH/USDT',
        timeframe='all'
    )

    # Test price normalization
    assert bounds.normalize_price(2000.0) == 0.0
    assert bounds.normalize_price(2500.0) == 1.0
    assert bounds.normalize_price(2250.0) == 0.5

    # Test price denormalization
    assert bounds.denormalize_price(0.0) == 2000.0
    assert bounds.denormalize_price(1.0) == 2500.0
    assert bounds.denormalize_price(0.5) == 2250.0

    # Test round-trip
    original = 2375.0
    normalized = bounds.normalize_price(original)
    denormalized = bounds.denormalize_price(normalized)
    assert abs(denormalized - original) < 0.01

def test_feature_vector_normalization():
    """Test feature vector normalization"""
    base_data = create_test_base_data_input()

    # Get normalized features
    features_norm = base_data.get_feature_vector(normalize=True)

    # Check all OHLCV values are in 0-1 range
    ohlcv_features = features_norm[:7500]  # First 7500 are OHLCV
    assert np.all(ohlcv_features >= 0.0)
    assert np.all(ohlcv_features <= 1.0)

    # Get raw features
    features_raw = base_data.get_feature_vector(normalize=False)

    # Raw features should be > 1.0 (actual prices)
    assert np.any(features_raw[:7500] > 1.0)
```
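The suite above does not exercise the degenerate-range edge cases; a small additional test, assuming the 0.5-midpoint convention described earlier:

```python
def test_degenerate_bounds():
    """Flat price and zero volume should both normalize to the midpoint."""
    bounds = NormalizationBounds(
        price_min=2000.0, price_max=2000.0,
        volume_min=0.0, volume_max=0.0,
        symbol='ETH/USDT', timeframe='all'
    )
    assert bounds.normalize_price(2000.0) == 0.5
    assert bounds.normalize_volume(0.0) == 0.5
```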
## Performance

### Computation Time
| Operation | Time | Notes |
|---|---|---|
| Compute bounds (first time) | ~1-2 ms | Scans all OHLCV data |
| Get cached bounds | ~0.001 ms | Returns cached object |
| Normalize single value | ~0.0001 ms | Simple arithmetic |
| Normalize 7850 features | ~0.5 ms | Vectorized operations |
### Memory Usage
| Item | Size | Notes |
|---|---|---|
| NormalizationBounds object | ~100 bytes | 4 floats + 2 strings |
| Cached in BaseDataInput | ~200 bytes | 2 bounds objects |
| Negligible overhead | <1 KB | Per BaseDataInput instance |
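To sanity-check the caching figures on your own hardware, a quick micro-benchmark (illustrative; actual numbers vary with data size):

```python
import time

base_data = data_provider.build_base_data_input('ETH/USDT')

t0 = time.perf_counter()
base_data.get_normalization_bounds()       # first call: computes bounds
t1 = time.perf_counter()

for _ in range(1000):
    base_data.get_normalization_bounds()   # subsequent calls: cached
t2 = time.perf_counter()

print(f"First call:        {(t1 - t0) * 1e3:.3f} ms")
print(f"1000 cached calls: {(t2 - t1) * 1e3:.3f} ms total")
```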
## Summary

- ✅ **Automatic**: Normalization happens by default
- ✅ **Consistent**: Same bounds across all timeframes
- ✅ **Independent**: ETH and BTC normalized separately
- ✅ **Cached**: Bounds computed once, reused
- ✅ **Reversible**: Easy denormalization for predictions
- ✅ **Fast**: <1 ms overhead

**Result**: Clean 0-1 range inputs for neural networks, with easy conversion back to real prices for trading.
## References
- **Implementation**: `core/data_models.py` (`NormalizationBounds` and `BaseDataInput`)
- **Specification**: `docs/BASE_DATA_INPUT_SPECIFICATION.md`
- **Usage Guide**: `docs/BASE_DATA_INPUT_USAGE_AUDIT.md`