update requirements

This commit is contained in:
Dobromir Popov
2025-10-09 15:22:49 +03:00
parent a86e07f556
commit 6cf4d902df
3 changed files with 450 additions and 57 deletions


@@ -2,30 +2,150 @@
## Introduction
The Multi-Modal Trading System is an advanced algorithmic trading platform that combines Convolutional Neural Networks (CNN) and Reinforcement Learning (RL) models orchestrated by a decision-making module. The system processes multi-timeframe and multi-symbol market data (primarily ETH and BTC) to generate trading actions.
**Current System Architecture:**
- **COBY System**: Standalone multi-exchange data aggregation system with TimescaleDB storage, Redis caching, and WebSocket distribution
- **Core Data Provider**: Unified data provider (`core/data_provider.py`) with automatic data maintenance, Williams Market Structure pivot points, and COB integration
- **Enhanced COB WebSocket**: Real-time order book streaming (`core/enhanced_cob_websocket.py`) with multiple Binance streams (depth, ticker, aggTrade)
- **Standardized Data Provider**: Extension layer (`core/standardized_data_provider.py`) providing unified BaseDataInput format for all models
- **Model Output Manager**: Centralized storage for cross-model feeding with extensible ModelOutput format
- **Orchestrator**: Central coordination hub managing data subscriptions, model inference, and training pipelines
The system is designed to adapt to current market conditions through continuous learning from past experiences, with the CNN module trained on historical data to predict pivot points and the RL module optimizing trading decisions based on these predictions and market data.
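A minimal sketch of how a model consumer is expected to obtain data through these layers is shown below. `StandardizedDataProvider` and `get_base_data_input()` are named in this document; the constructor arguments and printed fields are illustrative assumptions, not the actual API.
```python
# Illustrative wiring only: class and method names appear in this spec,
# constructor arguments and field access are assumptions.
from core.standardized_data_provider import StandardizedDataProvider

# StandardizedDataProvider extends the core DataProvider, so a single instance
# covers both raw caching and the unified BaseDataInput interface.
provider = StandardizedDataProvider(symbols=["ETH/USDT", "BTC/USDT"])  # assumed args

base_input = provider.get_base_data_input("ETH/USDT")  # documented unified entry point
if base_input is None:
    # Missing data surfaces as None -- the system never substitutes synthetic data.
    print("data incomplete; models are not invoked")
else:
    print(base_input.symbol, base_input.timestamp)      # fields per Requirement 1.1
```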
## Requirements
### Requirement 1: Data Collection and Processing Backbone
**User Story:** As a trader, I want a robust, multi-layered data collection system that provides real-time and historical market data from multiple sources, so that the models have comprehensive, reliable market information for making accurate trading decisions.
#### Current Implementation Status
**IMPLEMENTED:**
- ✅ Core DataProvider with automatic data maintenance (1500 candles cached per symbol/timeframe; a cache sketch follows this status section)
- ✅ Multi-exchange COB integration via EnhancedCOBWebSocket (Binance depth@100ms, ticker, aggTrade streams)
- ✅ Williams Market Structure pivot point calculation with monthly data analysis
- ✅ Pivot-based normalization system with PivotBounds caching
- ✅ Real-time tick aggregation with RealTimeTickAggregator
- ✅ COB 1s aggregation with price buckets ($1 for ETH, $10 for BTC)
- ✅ Multi-timeframe imbalance calculations (1s, 5s, 15s, 60s MA)
- ✅ Centralized data distribution with subscriber management
- ✅ COBY standalone system with TimescaleDB storage and Redis caching
**PARTIALLY IMPLEMENTED:**
- ⚠️ COB raw tick storage (30 min buffer) - implemented but needs validation
- ⚠️ Training data collection callbacks - structure exists but needs integration
- ⚠️ Cross-exchange COB consolidation - COBY system separate from core
**NEEDS ENHANCEMENT:**
- ❌ Unified integration between COBY and core DataProvider
- ❌ Configurable price range for COB imbalance (currently hardcoded $5 ETH, $50 BTC)
- ❌ COB heatmap matrix generation for model inputs
- ❌ Validation of 600-bar caching for backtesting support
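A minimal sketch of the rolling candle cache and its half-candle maintenance worker referenced in the checklist above, assuming a plain dict-of-deques layout. `CandleCache`, `fetch_latest_candle`, and the bar dictionary shape are hypothetical; only the 1500-candle limit, the timeframes, and the half-candle update cadence come from this document.
```python
import threading
import time
from collections import deque

TIMEFRAME_SECONDS = {"1s": 1, "1m": 60, "1h": 3600, "1d": 86400}

class CandleCache:
    """Rolling cache of 1500 candles per (symbol, timeframe), refreshed by a worker."""

    def __init__(self, symbols, max_candles=1500):
        self._lock = threading.Lock()                       # thread-safe access
        self._candles = {
            (sym, tf): deque(maxlen=max_candles)            # oldest bars drop automatically
            for sym in symbols for tf in TIMEFRAME_SECONDS
        }

    def upsert(self, symbol, timeframe, bar):
        """Append a new bar, or replace the in-progress bar with the same timestamp."""
        with self._lock:
            series = self._candles[(symbol, timeframe)]
            if series and series[-1]["timestamp"] == bar["timestamp"]:
                series[-1] = bar
            else:
                series.append(bar)

    def get(self, symbol, timeframe, n=300):
        with self._lock:
            return list(self._candles[(symbol, timeframe)])[-n:]

    def start_maintenance(self, fetch_latest_candle):
        """Spawn one daemon worker per series, polling every half candle period."""
        def worker(symbol, timeframe, period):
            while True:
                bar = fetch_latest_candle(symbol, timeframe)   # hypothetical fetch hook
                if bar is not None:
                    self.upsert(symbol, timeframe, bar)
                time.sleep(period / 2)                         # half-candle update cadence
        for symbol, timeframe in self._candles:
            threading.Thread(
                target=worker,
                args=(symbol, timeframe, TIMEFRAME_SECONDS[timeframe]),
                daemon=True,
            ).start()
```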
#### Acceptance Criteria
0. NEVER USE GENERATED/SYNTHETIC DATA or mock implementations and UI. If something is not implemented yet, it should be obvious.
1. WHEN the system starts THEN it SHALL initialize both core DataProvider and COBY system for comprehensive data coverage.
2. WHEN collecting data THEN the system SHALL maintain in DataProvider:
- 1500 candles of OHLCV data per timeframe (1s, 1m, 1h, 1d) for ETH and BTC
- 300 seconds (5 min) of COB 1s aggregated data with price buckets
- 180,000 raw COB ticks (30 min buffer at ~100 ticks/second)
- Williams Market Structure pivot points with 5 levels
- Technical indicators calculated on all timeframes
3. WHEN collecting COB data THEN the system SHALL use EnhancedCOBWebSocket with:
- Binance depth@100ms stream for high-frequency order book updates
- Binance ticker stream for 24hr statistics and volume
- Binance aggTrade stream for large order detection
- Automatic reconnection with exponential backoff
- Proper order book synchronization with REST API snapshots
4. WHEN aggregating COB data THEN the system SHALL create 1s buckets (see the aggregation sketch after this list) with:
- ±20 price buckets around current price ($1 for ETH, $10 for BTC)
- Bid/ask volumes and imbalances per bucket
- Multi-timeframe MA of imbalances (1s, 5s, 15s, 60s) for ±5 buckets
- Volume-weighted prices within buckets
5. WHEN processing data THEN the system SHALL calculate Williams Market Structure pivot points using:
- Recursive pivot detection with configurable min_pivot_distance
- 5 levels of trend analysis
- Monthly 1s data for comprehensive analysis
- Pivot-based normalization bounds for model inputs
6. WHEN new data arrives THEN the system SHALL update caches in real-time with:
- Automatic data maintenance worker updating every half-candle period
- Thread-safe access to cached data
- Subscriber notification system for real-time distribution
7. WHEN normalizing data THEN the system SHALL use pivot-based normalization (see the sketch after this list):
- PivotBounds derived from Williams Market Structure
- Price normalization using pivot support/resistance levels
- Distance calculations to nearest support/resistance
8. WHEN storing data THEN the system SHALL cache 1500 bars (not 600) to support:
- Model inputs (300 bars)
- Backtesting with 3x historical context
- Prediction outcome validation
9. WHEN distributing data THEN the system SHALL provide centralized access via:
- StandardizedDataProvider.get_base_data_input() for unified model inputs
- Subscriber callbacks for real-time updates
- ModelOutputManager for cross-model feeding
10. WHEN integrating COBY THEN the system SHALL maintain separation:
- COBY as standalone multi-exchange aggregation system
- Core DataProvider for real-time trading operations
- Future: unified interface for accessing both systems
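A minimal sketch of the 1s COB aggregation described in criteria 3-4, assuming the raw feed has already been reduced to `(price, bid_volume, ask_volume)` levels for one second. The $1/$10 bucket sizes and the ±20 range come from the text; function names and the output layout are illustrative. The spec additionally keeps the 1s/5s/15s/60s moving averages per bucket for the ±5 buckets around the mid price; the second helper shows only the windowing on a single imbalance series.
```python
from collections import defaultdict

def aggregate_cob_1s(levels, mid_price, bucket_size=1.0, n_buckets=20):
    """Aggregate one second of order-book levels into +/-20 fixed-size price buckets.

    levels: iterable of (price, bid_volume, ask_volume) tuples.
    bucket_size: $1 for ETH, $10 for BTC per the requirement.
    Returns {bucket_price: {"bid": ..., "ask": ..., "imbalance": ...}}.
    """
    buckets = defaultdict(lambda: {"bid": 0.0, "ask": 0.0})
    lo = mid_price - n_buckets * bucket_size
    hi = mid_price + n_buckets * bucket_size
    for price, bid_vol, ask_vol in levels:
        if not (lo <= price <= hi):
            continue                                   # keep only the +/-20 bucket window
        bucket = round((price - mid_price) / bucket_size) * bucket_size + mid_price
        buckets[bucket]["bid"] += bid_vol
        buckets[bucket]["ask"] += ask_vol
    for b in buckets.values():
        total = b["bid"] + b["ask"]
        b["imbalance"] = (b["bid"] - b["ask"]) / total if total else 0.0
    return dict(buckets)

def imbalance_mas(history, windows=(1, 5, 15, 60)):
    """Moving averages of an imbalance series over the last N seconds (newest last)."""
    return {
        f"{w}s": sum(history[-w:]) / min(w, len(history)) if history else 0.0
        for w in windows
    }
```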
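A minimal sketch of the pivot-based normalization in criterion 7, assuming `PivotBounds` exposes one support and one resistance level derived from Williams Market Structure; the field and function names are assumptions.
```python
from dataclasses import dataclass

@dataclass
class PivotBounds:
    """Assumed shape: bounds derived from Williams Market Structure pivots."""
    pivot_support: float      # lowest relevant support level
    pivot_resistance: float   # highest relevant resistance level

def normalize_price(price, bounds: PivotBounds) -> float:
    """Map price into [0, 1] relative to the pivot-derived range."""
    span = bounds.pivot_resistance - bounds.pivot_support
    if span <= 0:
        return 0.5                                    # degenerate range -> neutral value
    return (price - bounds.pivot_support) / span

def distance_to_levels(price, supports, resistances):
    """Distances to the nearest support/resistance (criterion 7, third bullet)."""
    nearest_support = max((s for s in supports if s <= price), default=None)
    nearest_resistance = min((r for r in resistances if r >= price), default=None)
    return {
        "to_support": price - nearest_support if nearest_support is not None else None,
        "to_resistance": nearest_resistance - price if nearest_resistance is not None else None,
    }
```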
### Requirement 1.1: Standardized Data Provider Architecture
**User Story:** As a model developer, I want a standardized data provider that delivers consistent, validated input data in a unified format, so that all models receive the same high-quality data structure and can be easily extended.
#### Current Implementation Status
**IMPLEMENTED:**
- ✅ StandardizedDataProvider extending core DataProvider
- ✅ BaseDataInput dataclass with comprehensive fields (see the dataclass sketch after this status section)
- ✅ OHLCVBar, COBData, PivotPoint, ModelOutput dataclasses
- ✅ ModelOutputManager for extensible cross-model feeding
- ✅ COB moving average calculation with thread-safe access
- ✅ Input validation before model inference
- ✅ Live price fetching with multiple fallbacks
**NEEDS ENHANCEMENT:**
- ❌ COB heatmap matrix integration in BaseDataInput
- ❌ Comprehensive data completeness validation
- ❌ Automatic data quality scoring
- ❌ Missing data interpolation strategies
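A minimal sketch of the dataclasses named above. Field names are taken from the acceptance criteria where they are stated (e.g. the ModelOutput fields and the BaseDataInput contents) and are assumptions elsewhere; the actual definitions in the codebase may differ.
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional

@dataclass
class OHLCVBar:
    timestamp: datetime
    open: float
    high: float
    low: float
    close: float
    volume: float

@dataclass
class COBData:
    mid_price: float
    price_buckets: Dict[float, Dict[str, float]]   # +/-20 buckets: bid/ask volume, imbalance
    imbalance_ma: Dict[str, Dict[float, float]]    # 1s/5s/15s/60s MA for +/-5 buckets

@dataclass
class PivotPoint:
    timestamp: datetime
    price: float
    level: int                                     # 1..5 Williams Market Structure level
    kind: str                                      # assumed: "high" or "low"

@dataclass
class ModelOutput:
    model_type: str
    model_name: str
    symbol: str
    timestamp: datetime
    predictions: Dict[str, Any]                    # model-specific predictions
    hidden_states: Optional[Dict[str, Any]] = None # optional cross-model feeding
    metadata: Dict[str, Any] = field(default_factory=dict)

@dataclass
class BaseDataInput:
    symbol: str
    timestamp: datetime
    ohlcv: Dict[str, List[OHLCVBar]]               # 300 bars per timeframe (1s/1m/1h/1d)
    btc_ohlcv_1s: List[OHLCVBar]                   # 300 bars of BTC reference data
    cob: Optional[COBData]
    indicators: Dict[str, float]
    pivot_points: List[PivotPoint]
    last_model_outputs: Dict[str, ModelOutput]     # keyed by model_name
    microstructure: Dict[str, Any]                 # order flow metrics etc.
```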
#### Acceptance Criteria
1. WHEN a model requests data THEN StandardizedDataProvider SHALL return BaseDataInput containing:
- 300 frames of OHLCV for each timeframe (1s, 1m, 1h, 1d) for primary symbol
- 300 frames of 1s OHLCV for BTC reference symbol
- COBData with ±20 price buckets and MA (1s, 5s, 15s, 60s) for ±5 buckets
- Technical indicators dictionary
- List of PivotPoint objects from Williams Market Structure
- Dictionary of last predictions from all models (ModelOutput format)
- Market microstructure data including order flow metrics
2. WHEN BaseDataInput is created THEN it SHALL validate (see the validation sketch after this list):
- Minimum 100 frames of data for each required timeframe
- Non-null COB data with valid price buckets
- Valid timestamp and symbol
- Data completeness score > 0.8
3. WHEN COB data is processed THEN the system SHALL calculate:
- Bid/ask imbalance for each price bucket
- Moving averages (1s, 5s, 15s, 60s) of imbalance for ±5 buckets around current price
- Volume-weighted prices within buckets
- Order flow metrics (aggressive buy/sell ratios)
4. WHEN models output predictions THEN ModelOutputManager SHALL store (see the store sketch after this list):
- Standardized ModelOutput with model_type, model_name, symbol, timestamp
- Model-specific predictions dictionary
- Hidden states for cross-model feeding (optional)
- Metadata for extensibility
5. WHEN retrieving model outputs THEN the system SHALL provide:
- Current outputs for all models by symbol
- Historical outputs with configurable retention (default 1000)
- Efficient query by model_name, symbol, timestamp
6. WHEN data is unavailable THEN the system SHALL:
- Return None instead of synthetic data
- Log specific missing components
- Provide data completeness metrics
- NOT proceed with model inference on incomplete data
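A minimal sketch of the completeness gate in criteria 2 and 6, re-using the field names from the dataclass sketch above. The 100-frame minimum and the 0.8 completeness threshold come from the text; the scoring formula itself is an assumption.
```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

REQUIRED_TIMEFRAMES = ("1s", "1m", "1h", "1d")
MIN_FRAMES = 100
MIN_COMPLETENESS = 0.8

def completeness_score(base_input) -> float:
    """Fraction of required components that are present and sufficiently long."""
    checks = [len(base_input.ohlcv.get(tf, [])) >= MIN_FRAMES for tf in REQUIRED_TIMEFRAMES]
    checks.append(base_input.cob is not None and bool(base_input.cob.price_buckets))
    checks.append(base_input.symbol is not None and base_input.timestamp is not None)
    return sum(checks) / len(checks)

def validate_for_inference(base_input) -> Optional[object]:
    """Return the input only if it is complete enough; never substitute synthetic data."""
    if base_input is None:
        logger.warning("BaseDataInput unavailable; skipping inference")
        return None
    score = completeness_score(base_input)
    if score < MIN_COMPLETENESS:
        missing = [tf for tf in REQUIRED_TIMEFRAMES
                   if len(base_input.ohlcv.get(tf, [])) < MIN_FRAMES]
        logger.warning("Incomplete data (score=%.2f, short timeframes=%s); skipping inference",
                       score, missing)
        return None
    return base_input
```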
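A minimal sketch of the output store behaviour in criteria 4-5, assuming an in-memory dict-of-deques; only the 1000-entry retention default and the query dimensions (model_name, symbol, timestamp) come from the text, everything else is illustrative.
```python
from collections import defaultdict, deque

class ModelOutputManager:
    """Sketch of the cross-model output store; the real API may differ."""

    def __init__(self, retention: int = 1000):
        self._history = defaultdict(lambda: deque(maxlen=retention))  # (model_name, symbol) -> outputs
        self._current = {}                                            # (model_name, symbol) -> latest

    def store(self, output) -> None:
        key = (output.model_name, output.symbol)
        self._history[key].append(output)
        self._current[key] = output

    def get_current(self, symbol):
        """Latest output of every model for a symbol, used for cross-model feeding."""
        return {name: out for (name, sym), out in self._current.items() if sym == symbol}

    def query(self, model_name, symbol, since=None):
        """Historical outputs for one model/symbol, optionally filtered by timestamp."""
        outs = self._history.get((model_name, symbol), ())
        return [o for o in outs if since is None or o.timestamp >= since]
```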
### Requirement 2: CNN Model Implementation