Requirements Document
Introduction
This feature aims to unify all data storage and retrieval methods across the trading system into a single, coherent interface. Currently, the system uses multiple storage approaches (Parquet files, pickle files, in-memory caches, TimescaleDB) and has fragmented data access patterns. This creates complexity, inconsistency, and performance issues.
The unified data storage system will provide a single endpoint for retrieving inference data, supporting both real-time streaming data and historical backtesting/annotation scenarios. It will consolidate storage methods into the most efficient approach and ensure all components use consistent data access patterns.
Requirements
Requirement 1: Unified Data Retrieval Interface
User Story: As a developer, I want a single method to retrieve inference data regardless of whether I need real-time or historical data, so that I can simplify my code and ensure consistency.
Acceptance Criteria
- WHEN a component requests inference data THEN the system SHALL provide a unified `get_inference_data()` method that accepts a timestamp parameter
- WHEN timestamp is None or "latest" THEN the system SHALL return the most recent cached real-time data
- WHEN timestamp is a specific datetime THEN the system SHALL return historical data from local storage at that timestamp
- WHEN requesting inference data THEN the system SHALL return data in a standardized format with all required features (OHLCV, technical indicators, COB data, order book imbalances)
- WHEN the requested timestamp is not available THEN the system SHALL return the nearest available data point with a warning
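A non-normative sketch of what this entry point could look like in Python. The `DataProvider` and `InferenceData` names, the field list, and the helper methods are illustrative assumptions, not a mandated design:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional, Union

import pandas as pd


@dataclass
class InferenceData:
    """Standardized payload returned for every inference request."""
    symbol: str
    timestamp: datetime
    ohlcv: Dict[str, pd.DataFrame]   # one DataFrame per timeframe
    indicators: pd.DataFrame         # technical indicators
    cob: pd.DataFrame                # consolidated order book data
    imbalances: pd.DataFrame         # order book imbalance metrics


class DataProvider:
    def get_inference_data(
        self,
        symbol: str,
        timestamp: Optional[Union[datetime, str]] = None,
    ) -> InferenceData:
        """Single entry point for real-time and historical inference data."""
        if timestamp is None or timestamp == "latest":
            return self._from_realtime_cache(symbol)
        # Historical path: serve the nearest stored data point and log a
        # warning when the exact timestamp is unavailable.
        return self._from_storage(symbol, timestamp)

    def _from_realtime_cache(self, symbol: str) -> InferenceData:
        ...  # served from the in-memory cache (Requirement 5)

    def _from_storage(self, symbol: str, timestamp: datetime) -> InferenceData:
        ...  # nearest-match TimescaleDB lookup (Requirement 6)
```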
Requirement 2: Consolidated Storage Backend
User Story: As a system architect, I want all market data stored using a single, optimized storage method, so that I can reduce complexity and improve performance.
Acceptance Criteria
- WHEN storing candlestick data THEN the system SHALL use TimescaleDB as the primary storage backend
- WHEN storing raw order book ticks THEN the system SHALL use TimescaleDB with appropriate compression
- WHEN storing aggregated 1s/1m data THEN the system SHALL use TimescaleDB hypertables for efficient time-series queries
- WHEN the system starts THEN it SHALL migrate existing Parquet and pickle files to TimescaleDB
- WHEN data is written THEN the system SHALL ensure atomic writes with proper error handling
- WHEN querying data THEN the system SHALL leverage TimescaleDB's time-series optimizations for fast retrieval
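A minimal sketch of the storage setup, assuming a candlestick hypertable; the table name, columns, connection string, and 7-day compression window are placeholder values:

```python
import psycopg2

# Illustrative schema; names and the compression window are assumptions.
DDL = """
CREATE TABLE IF NOT EXISTS candles (
    symbol    TEXT             NOT NULL,
    timeframe TEXT             NOT NULL,
    ts        TIMESTAMPTZ      NOT NULL,
    open      DOUBLE PRECISION NOT NULL,
    high      DOUBLE PRECISION NOT NULL,
    low       DOUBLE PRECISION NOT NULL,
    close     DOUBLE PRECISION NOT NULL,
    volume    DOUBLE PRECISION NOT NULL,
    PRIMARY KEY (symbol, timeframe, ts)
);
-- Partition on the timestamp column for time-series queries.
SELECT create_hypertable('candles', 'ts', if_not_exists => TRUE);
-- Compress chunks older than 7 days to cut storage.
ALTER TABLE candles SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'symbol, timeframe'
);
SELECT add_compression_policy('candles', INTERVAL '7 days', if_not_exists => TRUE);
"""

with psycopg2.connect("dbname=trading") as conn, conn.cursor() as cur:
    cur.execute(DDL)
```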
Requirement 3: Multi-Timeframe Data Storage
User Story: As a trading model, I need access to multiple timeframes (1s, 1m, 5m, 15m, 1h, 1d) of candlestick data, so that I can perform multi-timeframe analysis.
Acceptance Criteria
- WHEN storing candlestick data THEN the system SHALL store all configured timeframes (1s, 1m, 5m, 15m, 1h, 1d)
- WHEN aggregating data THEN the system SHALL use TimescaleDB continuous aggregates to automatically generate higher timeframes from 1s data
- WHEN requesting multi-timeframe data THEN the system SHALL return aligned timestamps across all timeframes
- WHEN a timeframe is missing data THEN the system SHALL generate it from lower timeframes if available
- WHEN storing timeframe data THEN the system SHALL maintain at least 1500 candles per timeframe for each symbol
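A sketch of the continuous-aggregate approach for one step of the rollup chain (1s to 1m); the view name, policy intervals, and connection string are assumptions:

```python
import psycopg2

# Roll 1s candles up to 1m automatically; higher timeframes follow the
# same pattern, each built from the timeframe below it.
CREATE_CAGG = """
CREATE MATERIALIZED VIEW IF NOT EXISTS candles_1m
WITH (timescaledb.continuous) AS
SELECT symbol,
       time_bucket(INTERVAL '1 minute', ts) AS ts,
       first(open, ts) AS open,
       max(high)       AS high,
       min(low)        AS low,
       last(close, ts) AS close,
       sum(volume)     AS volume
FROM candles
WHERE timeframe = '1s'
GROUP BY symbol, time_bucket(INTERVAL '1 minute', ts);
"""
ADD_POLICY = """
SELECT add_continuous_aggregate_policy('candles_1m',
    start_offset      => INTERVAL '1 hour',
    end_offset        => INTERVAL '1 minute',
    schedule_interval => INTERVAL '1 minute');
"""

conn = psycopg2.connect("dbname=trading")
conn.autocommit = True  # continuous aggregates cannot be created in a transaction
with conn.cursor() as cur:
    cur.execute(CREATE_CAGG)
    cur.execute(ADD_POLICY)
```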
Requirement 4: Raw Order Book and Trade Data Storage
User Story: As a machine learning model, I need access to raw 1s and 1m aggregated order book and trade book data, so that I can analyze market microstructure.
Acceptance Criteria
- WHEN receiving order book updates THEN the system SHALL store raw ticks in TimescaleDB with full bid/ask depth
- WHEN aggregating order book data THEN the system SHALL create 1s aggregations with $1 price buckets
- WHEN aggregating order book data THEN the system SHALL create 1m aggregations with $10 price buckets
- WHEN storing trade data THEN the system SHALL store individual trades with price, size, side, and timestamp
- WHEN storing order book data THEN the system SHALL maintain 30 minutes of raw data and 24 hours of aggregated data
- WHEN querying order book data THEN the system SHALL provide efficient access to imbalance metrics across multiple timeframes (1s, 5s, 15s, 60s)
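A sketch of the bucketed aggregation query; the tick table and column names are assumptions carried over from the sketches above:

```python
# 1s order book aggregation with $1 price buckets (illustrative only).
BUCKET_1S = """
SELECT symbol,
       time_bucket(INTERVAL '1 second', ts) AS ts,
       floor(price)                         AS price_bucket,  -- $1 buckets
       side,
       sum(size)                            AS total_size
FROM order_book_ticks
GROUP BY symbol, time_bucket(INTERVAL '1 second', ts), floor(price), side;
"""
# The 1m aggregation differs only in the bucket width:
#   floor(price / 10) * 10 AS price_bucket  -- $10 buckets
```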
Requirement 5: Real-Time Data Caching
User Story: As a real-time trading system, I need low-latency access to the latest market data, so that I can make timely trading decisions.
Acceptance Criteria
- WHEN receiving real-time data THEN the system SHALL maintain an in-memory cache of the last 5 minutes of data
- WHEN requesting latest data THEN the system SHALL serve from cache with <10ms latency
- WHEN cache is updated THEN the system SHALL asynchronously persist to TimescaleDB without blocking
- WHEN cache reaches capacity THEN the system SHALL evict oldest data while maintaining continuity
- WHEN system restarts THEN the system SHALL rebuild cache from TimescaleDB automatically
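A minimal sketch of the cache behavior, assuming a thread-based design; the class and parameter names are illustrative, and the persist callback stands in for the batched TimescaleDB writer of Requirement 9:

```python
import queue
import threading
import time
from collections import deque


class RealTimeCache:
    """Rolling 5-minute window in memory; persistence runs off the hot path."""

    WINDOW_S = 300  # last 5 minutes

    def __init__(self, persist_fn):
        self._data = deque()           # (epoch_seconds, update) pairs
        self._lock = threading.Lock()
        self._queue = queue.Queue()    # hands updates to the writer thread
        self._persist_fn = persist_fn  # e.g. a batched TimescaleDB insert
        threading.Thread(target=self._writer, daemon=True).start()

    def append(self, update):
        now = time.time()
        with self._lock:
            self._data.append((now, update))
            # Evict anything older than the window while keeping continuity.
            while self._data and now - self._data[0][0] > self.WINDOW_S:
                self._data.popleft()
        self._queue.put(update)        # never blocks the ingest path

    def latest(self):
        with self._lock:
            return self._data[-1][1] if self._data else None

    def _writer(self):
        while True:  # drain the queue and persist without blocking ingestion
            self._persist_fn(self._queue.get())
```

On restart, the cache can be re-seeded with a single time-range query against TimescaleDB covering the last 5 minutes.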
Requirement 6: Historical Data Access for Backtesting
User Story: As a backtesting system, I need efficient access to historical data at any timestamp, so that I can simulate trading strategies accurately.
Acceptance Criteria
- WHEN requesting historical data THEN the system SHALL query TimescaleDB with timestamp-based indexing
- WHEN requesting a time range THEN the system SHALL return all data points within that range efficiently
- WHEN requesting data with context window THEN the system SHALL return ±N minutes of surrounding data
- WHEN backtesting THEN the system SHALL support sequential data access without loading entire dataset into memory
- WHEN querying historical data THEN the system SHALL return results in <100ms for typical queries (single timestamp, single symbol)
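A sketch of the context-window lookup, assuming the candles schema from Requirement 2; function and parameter names are illustrative:

```python
from datetime import datetime, timedelta

import pandas as pd

QUERY = """
SELECT ts, open, high, low, close, volume
FROM candles
WHERE symbol = %s AND timeframe = %s AND ts BETWEEN %s AND %s
ORDER BY ts;
"""


def get_window(conn, symbol: str, timeframe: str,
               at: datetime, context_minutes: int = 5) -> pd.DataFrame:
    """Return +/- context_minutes of data around a timestamp."""
    delta = timedelta(minutes=context_minutes)
    return pd.read_sql(QUERY, conn,
                       params=(symbol, timeframe, at - delta, at + delta))
```

Sequential backtesting access can reuse the same query with a server-side cursor so the full dataset never has to fit in memory.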
Requirement 7: Data Annotation Support
User Story: As a data annotator, I need to retrieve historical market data at specific timestamps to manually label trading signals, so that I can create training datasets.
Acceptance Criteria
- WHEN annotating data THEN the system SHALL provide the same `get_inference_data()` interface with a timestamp parameter
- WHEN retrieving annotation data THEN the system SHALL include ±5 minutes of context data
- WHEN loading annotation sessions THEN the system SHALL support efficient random access to any timestamp
- WHEN displaying charts THEN the system SHALL provide multi-timeframe data aligned to the annotation timestamp
- WHEN saving annotations THEN the system SHALL link annotations to exact timestamps in the database
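Hypothetical usage from annotation tooling, reusing the DataProvider sketch from Requirement 1 (the symbol and timestamp are placeholder values):

```python
from datetime import datetime, timezone

provider = DataProvider()
snapshot = provider.get_inference_data(
    "ETH/USDT",  # placeholder symbol
    timestamp=datetime(2024, 1, 15, 14, 30, tzinfo=timezone.utc),
)
# The snapshot carries +/-5 minutes of aligned multi-timeframe context,
# so the annotation UI renders charts from the same payload the models see.
```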
Requirement 8: Data Migration and Backward Compatibility
User Story: As a system administrator, I want existing data migrated to the new storage system without data loss, so that I can maintain historical continuity.
Acceptance Criteria
- WHEN migration starts THEN the system SHALL detect existing Parquet files in cache directory
- WHEN migrating Parquet data THEN the system SHALL import all data into TimescaleDB with proper timestamps
- WHEN migration completes THEN the system SHALL verify data integrity by comparing record counts
- WHEN migration fails THEN the system SHALL rollback changes and preserve original files
- WHEN migration succeeds THEN the system SHALL optionally archive old Parquet files
- WHEN accessing data during migration THEN the system SHALL continue serving from existing storage
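A sketch of the per-file migration flow under these rules, assuming the candles schema above; the Parquet column layout and the integrity check are simplifications:

```python
import pandas as pd
from psycopg2.extras import execute_values

INSERT = """
INSERT INTO candles (symbol, timeframe, ts, open, high, low, close, volume)
VALUES %s
ON CONFLICT DO NOTHING;
"""

COLS = ["symbol", "timeframe", "ts", "open", "high", "low", "close", "volume"]


def migrate_file(conn, path: str) -> bool:
    """Import one legacy Parquet file; commit only if the counts check out."""
    df = pd.read_parquet(path)  # assumes columns already match the schema
    rows = list(df[COLS].itertuples(index=False, name=None))
    try:
        with conn.cursor() as cur:
            execute_values(cur, INSERT, rows)
            # Simplified integrity check: the covered range must hold at
            # least as many rows as the file before it may be archived.
            cur.execute(
                "SELECT count(*) FROM candles WHERE ts BETWEEN %s AND %s",
                (df["ts"].min(), df["ts"].max()),
            )
            if cur.fetchone()[0] < len(df):
                raise RuntimeError(f"row count mismatch for {path}")
        conn.commit()    # success: the Parquet file may now be archived
        return True
    except Exception:
        conn.rollback()  # failure: nothing persisted, original file untouched
        return False
```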
Requirement 9: Performance and Scalability
User Story: As a system operator, I need the data storage system to handle high-frequency data ingestion and queries efficiently, so that the system remains responsive under load.
Acceptance Criteria
- WHEN ingesting real-time data THEN the system SHALL handle at least 1000 updates per second per symbol
- WHEN querying data THEN the system SHALL return single-timestamp queries in <100ms
- WHEN querying time ranges THEN the system SHALL return 1 hour of 1s data in <500ms
- WHEN storing data THEN the system SHALL use batch writes to optimize database performance
- WHEN database grows THEN the system SHALL use TimescaleDB compression to reduce storage size by 80%+
- WHEN running multiple queries THEN the system SHALL support concurrent access without performance degradation
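A sketch of the batch-write pattern behind the ingestion target; the class name, flush size, and tick table are assumptions:

```python
from psycopg2.extras import execute_values


class BatchWriter:
    """Buffer rows and flush them in one round trip instead of row-by-row."""

    def __init__(self, conn, flush_size: int = 500):
        self._conn = conn
        self._buffer = []
        self._flush_size = flush_size

    def write(self, row: tuple) -> None:
        self._buffer.append(row)
        if len(self._buffer) >= self._flush_size:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        with self._conn.cursor() as cur:
            execute_values(
                cur,
                "INSERT INTO order_book_ticks (symbol, ts, price, size, side) "
                "VALUES %s",
                self._buffer,
            )
        self._conn.commit()
        self._buffer.clear()
```

At 1000 updates per second and a flush size of 500, this amounts to roughly two inserts per second per symbol rather than a thousand.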
Requirement 10: Data Consistency and Validation
User Story: As a trading system, I need to ensure all data is consistent and validated, so that models receive accurate information.
Acceptance Criteria
- WHEN storing data THEN the system SHALL validate timestamps are in UTC timezone
- WHEN storing OHLCV data THEN the system SHALL validate that high >= low, that high >= both open and close, and that low <= both open and close
- WHEN storing order book data THEN the system SHALL validate that the best bid price is below the best ask price
- WHEN detecting invalid data THEN the system SHALL log warnings and reject the data point
- WHEN querying data THEN the system SHALL ensure all timeframes are properly aligned
- WHEN data gaps exist THEN the system SHALL identify and log missing periods
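A sketch of the validation checks; function names are illustrative, and the caller is expected to log a warning whenever False is returned:

```python
from datetime import datetime, timedelta


def validate_candle(ts: datetime, o: float, h: float, l: float, c: float) -> bool:
    """OHLCV invariants from Requirement 10 (sketch)."""
    if ts.tzinfo is None or ts.utcoffset() != timedelta(0):
        return False  # timestamp must be timezone-aware UTC
    if h < max(o, c, l):
        return False  # high must be the highest of open/low/close
    if l > min(o, c):
        return False  # low must not exceed open or close
    return True


def validate_book(best_bid: float, best_ask: float) -> bool:
    return best_bid < best_ask  # a crossed book indicates bad data
```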