# BaseDataInput Normalization Guide

## Overview

All OHLCV data in `BaseDataInput` is automatically normalized to the 0-1 range to ensure consistent model training and inference across different price scales and timeframes.

**Key Benefits:**
- ✅ Consistent input scale for neural networks
- ✅ Prevents gradient issues from large price values
- ✅ Enables transfer learning across different symbols
- ✅ Simplifies model architecture (no need for input scaling layers)
- ✅ Easy denormalization for predictions

---

## How It Works

### 1. Normalization Strategy

**Primary Symbol (e.g., ETH/USDT)**:
- Uses the **daily (1d) timeframe** to compute min/max bounds
- Daily has the widest price range, so all shorter timeframes fit within 0-1 as long as the daily candles cover the same period
- All timeframes (1s, 1m, 1h, 1d) are normalized using the same bounds

**Reference Symbol (BTC/USDT)**:
- Uses **its own 1s data** to compute independent min/max bounds
- BTC and ETH have different price scales (e.g., $2000 vs $40000)
- Independent normalization ensures both are properly scaled to 0-1

### 2. Normalization Formula

```python
# Price normalization
normalized_price = (price - price_min) / (price_max - price_min)

# Volume normalization
normalized_volume = (volume - volume_min) / (volume_max - volume_min)

# Result: 0.0 to 1.0 range
# 0.0 = minimum price/volume in dataset
# 1.0 = maximum price/volume in dataset
```

### 3. Denormalization Formula

```python
# Price denormalization
original_price = normalized_price * (price_max - price_min) + price_min

# Volume denormalization
original_volume = normalized_volume * (volume_max - volume_min) + volume_min
```

---

## NormalizationBounds Class

### Structure

```python
@dataclass
class NormalizationBounds:
    """Normalization boundaries for price and volume data"""
    price_min: float   # Minimum price in dataset
    price_max: float   # Maximum price in dataset
    volume_min: float  # Minimum volume in dataset
    volume_max: float  # Maximum volume in dataset
    symbol: str        # Symbol these bounds apply to
    timeframe: str     # Timeframe used ('all' for multi-timeframe)
```

### Methods

```python
# Normalize price to 0-1
normalized = bounds.normalize_price(2500.0)  # Returns: 0.75 (example)

# Denormalize back to original
original = bounds.denormalize_price(0.75)  # Returns: 2500.0

# Normalize volume
normalized_vol = bounds.normalize_volume(1000.0)

# Denormalize volume
original_vol = bounds.denormalize_volume(0.5)

# Get ranges
price_range = bounds.get_price_range()    # price_max - price_min
volume_range = bounds.get_volume_range()  # volume_max - volume_min
```

---

## Usage Examples

### Basic Usage

```python
from core.data_models import BaseDataInput

# Build BaseDataInput
base_data = data_provider.build_base_data_input('ETH/USDT')

# Get normalized features (default)
features = base_data.get_feature_vector(normalize=True)
# All OHLCV values are now 0.0 to 1.0

# Get raw features (no normalization)
features_raw = base_data.get_feature_vector(normalize=False)
# OHLCV values are in original units ($, volume)
```

### Accessing Normalization Bounds

```python
# Get bounds for primary symbol
bounds = base_data.get_normalization_bounds()

print(f"Symbol: {bounds.symbol}")
print(f"Price range: ${bounds.price_min:.2f} - ${bounds.price_max:.2f}")
print(f"Volume range: {bounds.volume_min:.2f} - {bounds.volume_max:.2f}")

# Example output:
# Symbol: ETH/USDT
# Price range: $2000.00 - $2500.00
# Volume range: 100.00 - 10000.00

# Get bounds for BTC (independent)
btc_bounds = base_data.get_btc_normalization_bounds()
print(f"BTC range: ${btc_bounds.price_min:.2f} - ${btc_bounds.price_max:.2f}")

# Example output:
# BTC range: $38000.00 - $42000.00
```

### Denormalizing Model Predictions

```python
# Model predicts normalized price
model_output = model.predict(features)  # Returns: 0.75 (normalized)

# Denormalize to actual price
bounds = base_data.get_normalization_bounds()
predicted_price = bounds.denormalize_price(model_output)

print(f"Model output (normalized): {model_output:.4f}")
print(f"Predicted price: ${predicted_price:.2f}")

# Example output:
# Model output (normalized): 0.7500
# Predicted price: $2375.00
```

### Training with Normalized Data

```python
# Training loop
for epoch in range(num_epochs):
    base_data = data_provider.build_base_data_input('ETH/USDT')

    # Get normalized features
    features = base_data.get_feature_vector(normalize=True)

    # Get normalized target (latest 1m close, standing in for the next close price)
    bounds = base_data.get_normalization_bounds()
    target_price = base_data.ohlcv_1m[-1].close
    target_normalized = bounds.normalize_price(target_price)

    # Train model
    loss = model.train_step(features, target_normalized)

    # Denormalize prediction for logging
    prediction_normalized = model.predict(features)
    prediction_price = bounds.denormalize_price(prediction_normalized)

    print(f"Epoch {epoch}: Loss={loss:.4f}, Predicted=${prediction_price:.2f}")
```

### Inference with Denormalization

```python
def predict_next_price(symbol: str) -> float:
    """Predict next price and return in original units"""
    # Get current data
    base_data = data_provider.build_base_data_input(symbol)

    # Get normalized features
    features = base_data.get_feature_vector(normalize=True)

    # Model prediction (normalized)
    prediction_normalized = model.predict(features)

    # Denormalize to actual price
    bounds = base_data.get_normalization_bounds()
    prediction_price = bounds.denormalize_price(prediction_normalized)

    return prediction_price

# Usage
next_price = predict_next_price('ETH/USDT')
print(f"Predicted next price: ${next_price:.2f}")
```

---

## Why Daily Timeframe for Bounds?

### Problem: Different Timeframes, Different Ranges

```
1s timeframe: $2100 - $2110 (range: $10)
1m timeframe: $2095 - $2115 (range: $20)
1h timeframe: $2050 - $2150 (range: $100)
1d timeframe: $2000 - $2500 (range: $500) ← Widest range
```

### Solution: Use Daily Min/Max

By using daily (longest timeframe) min/max:
- All shorter timeframes fit within the 0-1 range
- No clipping or out-of-range values
- Consistent normalization across all timeframes

```python
# Daily bounds: $2000 - $2500

# 1s candle: close = $2100
normalized = (2100 - 2000) / (2500 - 2000)  # 0.20 ✓

# 1m candle: close = $2250
normalized = (2250 - 2000) / (2500 - 2000)  # 0.50 ✓

# 1h candle: close = $2400
normalized = (2400 - 2000) / (2500 - 2000)  # 0.80 ✓

# 1d candle: close = $2500
normalized = (2500 - 2000) / (2500 - 2000)  # 1.00 ✓
```

---

## Independent BTC Normalization

### Why Independent?

ETH and BTC have vastly different price scales:

```
ETH: $2000 - $2500 (range: $500)
BTC: $38000 - $42000 (range: $4000)
```

If both symbols shared one set of bounds (here $2000 - $42000):
- ETH would be compressed into roughly the 0.00 - 0.01 range (bad!)
- BTC would occupy only the 0.90 - 1.00 range (bad!)

### Solution: Independent Bounds

```python
# ETH bounds
eth_bounds = base_data.get_normalization_bounds()
# price_min: $2000, price_max: $2500

# BTC bounds (independent)
btc_bounds = base_data.get_btc_normalization_bounds()
# price_min: $38000, price_max: $42000

# Both normalized to full 0-1 range
eth_normalized = eth_bounds.normalize_price(2250)   # 0.50
btc_normalized = btc_bounds.normalize_price(40000)  # 0.50
```

---

## Caching for Performance

Normalization bounds are computed once and cached:

```python
# First call: computes bounds
bounds = base_data.get_normalization_bounds()  # ~1-2 ms

# Subsequent calls: returns cached bounds
bounds = base_data.get_normalization_bounds()  # ~0.001 ms (roughly 1000x faster!)
```

**Implementation:**

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BaseDataInput:
    # Cached bounds (computed on first access)
    _normalization_bounds: Optional[NormalizationBounds] = None
    _btc_normalization_bounds: Optional[NormalizationBounds] = None

    def get_normalization_bounds(self) -> NormalizationBounds:
        """Get bounds (cached)"""
        if self._normalization_bounds is None:
            self._normalization_bounds = self._compute_normalization_bounds()
        return self._normalization_bounds
```

---

## Edge Cases

### 1. No Price Movement (price_min == price_max)

```python
# All prices are $2000
price_min = 2000.0
price_max = 2000.0

# Normalization returns 0.5 (middle)
normalized = bounds.normalize_price(2000.0)  # Returns: 0.5
```

### 2. Zero Volume

```python
# All volumes are 0
volume_min = 0.0
volume_max = 0.0

# Normalization returns 0.5
normalized = bounds.normalize_volume(0.0)  # Returns: 0.5
```

### 3. Insufficient Data

```python
# Fewer than 100 candles
if len(base_data.ohlcv_1s) < 100:
    # BaseDataInput.validate() returns False
    # Don't use for training/inference
    pass
```

---

## Best Practices

### ✅ DO

1. **Always use normalized features for training**
   ```python
   features = base_data.get_feature_vector(normalize=True)
   ```

2. **Store bounds with model checkpoints**
   ```python
   checkpoint = {
       'model_state': model.state_dict(),
       'normalization_bounds': {
           'price_min': bounds.price_min,
           'price_max': bounds.price_max,
           'volume_min': bounds.volume_min,
           'volume_max': bounds.volume_max
       }
   }
   ```

3. **Denormalize predictions for display/trading**
   ```python
   prediction_price = bounds.denormalize_price(model_output)
   ```

4. **Use same bounds for training and inference**
   ```python
   # Training
   bounds = base_data.get_normalization_bounds()
   save_bounds(bounds)

   # Inference (later)
   bounds = load_bounds()
   prediction = bounds.denormalize_price(model_output)
   ```

### ❌ DON'T

1. **Don't mix normalized and raw features**
   ```python
   # BAD: Inconsistent
   features_norm = base_data.get_feature_vector(normalize=True)
   features_raw = base_data.get_feature_vector(normalize=False)
   combined = np.concatenate([features_norm, features_raw])  # DON'T DO THIS
   ```

2. **Don't use different bounds for training vs inference**
   ```python
   # BAD: Different bounds
   # Training
   bounds_train = base_data_train.get_normalization_bounds()

   # Inference (different data, different bounds!)
   bounds_infer = base_data_infer.get_normalization_bounds()  # WRONG!
   ```

3. **Don't forget to denormalize predictions**
   ```python
   # BAD: Normalized prediction used directly
   prediction = model.predict(features)  # 0.75
   place_order(price=prediction)  # WRONG! Should be $2375, not $0.75
   ```

---

## Testing Normalization

### Unit Tests

```python
import numpy as np

def test_normalization():
    """Test normalization and denormalization"""
    bounds = NormalizationBounds(
        price_min=2000.0,
        price_max=2500.0,
        volume_min=100.0,
        volume_max=1000.0,
        symbol='ETH/USDT',
        timeframe='all'
    )

    # Test price normalization
    assert bounds.normalize_price(2000.0) == 0.0
    assert bounds.normalize_price(2500.0) == 1.0
    assert bounds.normalize_price(2250.0) == 0.5

    # Test price denormalization
    assert bounds.denormalize_price(0.0) == 2000.0
    assert bounds.denormalize_price(1.0) == 2500.0
    assert bounds.denormalize_price(0.5) == 2250.0

    # Test round-trip
    original = 2375.0
    normalized = bounds.normalize_price(original)
    denormalized = bounds.denormalize_price(normalized)
    assert abs(denormalized - original) < 0.01

def test_feature_vector_normalization():
    """Test feature vector normalization"""
    base_data = create_test_base_data_input()

    # Get normalized features
    features_norm = base_data.get_feature_vector(normalize=True)

    # Check all OHLCV values are in 0-1 range
    ohlcv_features = features_norm[:7500]  # First 7500 are OHLCV
    assert np.all(ohlcv_features >= 0.0)
    assert np.all(ohlcv_features <= 1.0)

    # Get raw features
    features_raw = base_data.get_feature_vector(normalize=False)

    # Raw features should contain values > 1.0 (actual prices)
    assert np.any(features_raw[:7500] > 1.0)
```

---

## Performance

### Computation Time

| Operation | Time | Notes |
|-----------|------|-------|
| Compute bounds (first time) | ~1-2 ms | Scans all OHLCV data |
| Get cached bounds | ~0.001 ms | Returns cached object |
| Normalize single value | ~0.0001 ms | Simple arithmetic |
| Normalize 7850 features | ~0.5 ms | Vectorized operations |

### Memory Usage

| Item | Size | Notes |
|------|------|-------|
| NormalizationBounds object | ~100 bytes | 4 floats + 2 strings |
| Cached in BaseDataInput | ~200 bytes | 2 bounds objects |
| Total overhead | <1 KB | Negligible, per BaseDataInput instance |

---

## Summary

- ✅ **Automatic**: Normalization happens by default
- ✅ **Consistent**: Same bounds across all timeframes
- ✅ **Independent**: ETH and BTC normalized separately
- ✅ **Cached**: Bounds computed once, reused
- ✅ **Reversible**: Easy denormalization for predictions
- ✅ **Fast**: <1 ms overhead once bounds are cached

**Result**: Clean 0-1 range inputs for neural networks, with easy conversion back to real prices for trading.

---

## References

- **Implementation**: `core/data_models.py` - `NormalizationBounds` and `BaseDataInput`
- **Specification**: `docs/BASE_DATA_INPUT_SPECIFICATION.md`
- **Usage Guide**: `docs/BASE_DATA_INPUT_USAGE_AUDIT.md`
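
---

## Appendix: NormalizationBounds Sketch

The formulas, the degenerate-range edge cases, and the "store bounds with model checkpoints" practice described in this guide can be tied together in one small, self-contained sketch. This is not the code in `core/data_models.py`; it is a minimal illustration under stated assumptions. The `timeframe` default of `'all'` and the helper names `bounds_to_json` / `bounds_from_json` are hypothetical and were introduced only for this example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class NormalizationBounds:
    """Illustrative sketch only; the real class lives in core/data_models.py."""
    price_min: float
    price_max: float
    volume_min: float
    volume_max: float
    symbol: str
    timeframe: str = "all"  # default value is an assumption of this sketch

    def get_price_range(self) -> float:
        return self.price_max - self.price_min

    def get_volume_range(self) -> float:
        return self.volume_max - self.volume_min

    def normalize_price(self, price: float) -> float:
        price_range = self.get_price_range()
        if price_range == 0:
            return 0.5  # no price movement: return the middle, as in "Edge Cases"
        return (price - self.price_min) / price_range

    def denormalize_price(self, normalized: float) -> float:
        return normalized * self.get_price_range() + self.price_min

    def normalize_volume(self, volume: float) -> float:
        volume_range = self.get_volume_range()
        if volume_range == 0:
            return 0.5  # constant (e.g. zero) volume: return the middle
        return (volume - self.volume_min) / volume_range

    def denormalize_volume(self, normalized: float) -> float:
        return normalized * self.get_volume_range() + self.volume_min


def bounds_to_json(bounds: NormalizationBounds) -> str:
    """Hypothetical helper: serialize bounds so they can travel with a checkpoint."""
    return json.dumps(asdict(bounds))


def bounds_from_json(payload: str) -> NormalizationBounds:
    """Hypothetical helper: restore the exact bounds that were used for training."""
    return NormalizationBounds(**json.loads(payload))


eth = NormalizationBounds(2000.0, 2500.0, 100.0, 10000.0, "ETH/USDT")
print(eth.normalize_price(2250.0))    # 0.5
print(eth.denormalize_price(0.75))    # 2375.0

flat = NormalizationBounds(2000.0, 2000.0, 0.0, 0.0, "ETH/USDT")
print(flat.normalize_price(2000.0))   # 0.5
print(flat.normalize_volume(0.0))     # 0.5

restored = bounds_from_json(bounds_to_json(eth))
print(restored == eth)                # True
```

Serializing the whole dataclass, including `symbol` and `timeframe`, keeps a record of which data the bounds came from, which helps when checking that inference reuses the training bounds.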