Unified data storage
.kiro/specs/unified-data-storage/requirements.md (new file, +134 lines)

# Requirements Document
## Introduction

This feature aims to unify all data storage and retrieval methods across the trading system into a single, coherent interface. Currently, the system uses multiple storage approaches (Parquet files, pickle files, in-memory caches, TimescaleDB) and has fragmented data access patterns. This creates complexity, inconsistency, and performance issues.

The unified data storage system will provide a single endpoint for retrieving inference data, supporting both real-time streaming data and historical backtesting/annotation scenarios. It will consolidate storage methods into the most efficient approach and ensure all components use consistent data access patterns.
## Requirements

### Requirement 1: Unified Data Retrieval Interface

**User Story:** As a developer, I want a single method to retrieve inference data regardless of whether I need real-time or historical data, so that I can simplify my code and ensure consistency.

#### Acceptance Criteria

1. WHEN a component requests inference data THEN the system SHALL provide a unified `get_inference_data()` method that accepts a timestamp parameter
2. WHEN timestamp is None or "latest" THEN the system SHALL return the most recent cached real-time data
3. WHEN timestamp is a specific datetime THEN the system SHALL return historical data from local storage at that timestamp
4. WHEN requesting inference data THEN the system SHALL return data in a standardized format with all required features (OHLCV, technical indicators, COB data, order book imbalances)
5. WHEN the requested timestamp is not available THEN the system SHALL return the nearest available data point with a warning
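
A minimal sketch of the interface these criteria describe, assuming a Python data provider; the helper methods `_latest_from_cache` and `_query_historical` are hypothetical stand-ins for the two retrieval paths:

```python
from datetime import datetime
from typing import Optional, Union


class DataProvider:
    """Unified access point for inference data (illustrative sketch)."""

    def get_inference_data(
        self,
        symbol: str,
        timestamp: Optional[Union[datetime, str]] = None,
    ) -> dict:
        """Return standardized inference data for one symbol.

        None or "latest" serves the most recent cached real-time data;
        a datetime serves historical data at (or nearest to) that time.
        """
        if timestamp is None or timestamp == "latest":
            return self._latest_from_cache(symbol)
        return self._query_historical(symbol, timestamp)

    def _latest_from_cache(self, symbol: str) -> dict:
        raise NotImplementedError  # backed by the real-time cache (Requirement 5)

    def _query_historical(self, symbol: str, ts: datetime) -> dict:
        raise NotImplementedError  # backed by TimescaleDB (Requirement 6)
```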
### Requirement 2: Consolidated Storage Backend

**User Story:** As a system architect, I want all market data stored using a single, optimized storage method, so that I can reduce complexity and improve performance.

#### Acceptance Criteria

1. WHEN storing candlestick data THEN the system SHALL use TimescaleDB as the primary storage backend
2. WHEN storing raw order book ticks THEN the system SHALL use TimescaleDB with appropriate compression
3. WHEN storing aggregated 1s/1m data THEN the system SHALL use TimescaleDB hypertables for efficient time-series queries
4. WHEN the system starts THEN it SHALL migrate existing Parquet and pickle files to TimescaleDB
5. WHEN data is written THEN the system SHALL ensure atomic writes with proper error handling
6. WHEN querying data THEN the system SHALL leverage TimescaleDB's time-series optimizations for fast retrieval
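
One way the hypertable setup behind criteria 1-3 might look, using psycopg2; the table and column names are assumptions, not the project's actual schema:

```python
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS ohlcv (
    ts        TIMESTAMPTZ      NOT NULL,
    symbol    TEXT             NOT NULL,
    timeframe TEXT             NOT NULL,
    open      DOUBLE PRECISION,
    high      DOUBLE PRECISION,
    low       DOUBLE PRECISION,
    close     DOUBLE PRECISION,
    volume    DOUBLE PRECISION
);
-- Partition on the time column for fast time-range queries.
SELECT create_hypertable('ohlcv', 'ts', if_not_exists => TRUE);
-- Enable native compression, segmenting by symbol for better ratios.
ALTER TABLE ohlcv SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'symbol'
);
"""

# The `with` block commits on success and rolls back on error,
# which covers the atomic-write criterion for this DDL.
with psycopg2.connect("dbname=trading") as conn, conn.cursor() as cur:
    cur.execute(DDL)
```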
### Requirement 3: Multi-Timeframe Data Storage

**User Story:** As a trading model, I need access to multiple timeframes (1s, 1m, 5m, 15m, 1h, 1d) of candlestick data, so that I can perform multi-timeframe analysis.

#### Acceptance Criteria

1. WHEN storing candlestick data THEN the system SHALL store all configured timeframes (1s, 1m, 5m, 15m, 1h, 1d)
2. WHEN aggregating data THEN the system SHALL use TimescaleDB continuous aggregates to automatically generate higher timeframes from 1s data
3. WHEN requesting multi-timeframe data THEN the system SHALL return aligned timestamps across all timeframes
4. WHEN a timeframe is missing data THEN the system SHALL generate it from lower timeframes if available
5. WHEN storing timeframe data THEN the system SHALL maintain at least 1500 candles per timeframe for each symbol
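
Criterion 2 maps directly onto a TimescaleDB continuous aggregate. A sketch deriving 1m candles from 1s data, reusing the illustrative `ohlcv` schema from above (this statement must run outside an explicit transaction):

```python
# TimescaleDB's first()/last() pick open/close by timestamp order.
CAGG_1M = """
CREATE MATERIALIZED VIEW IF NOT EXISTS ohlcv_1m
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 minute', ts) AS bucket,
    symbol,
    first(open, ts) AS open,
    max(high)       AS high,
    min(low)        AS low,
    last(close, ts) AS close,
    sum(volume)     AS volume
FROM ohlcv
WHERE timeframe = '1s'
GROUP BY bucket, symbol;
"""
```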
### Requirement 4: Raw Order Book and Trade Data Storage

**User Story:** As a machine learning model, I need access to raw order book ticks and to 1s/1m aggregated order book and trade data, so that I can analyze market microstructure.

#### Acceptance Criteria

1. WHEN receiving order book updates THEN the system SHALL store raw ticks in TimescaleDB with full bid/ask depth
2. WHEN aggregating order book data THEN the system SHALL create 1s aggregations with $1 price buckets
3. WHEN aggregating order book data THEN the system SHALL create 1m aggregations with $10 price buckets
4. WHEN storing trade data THEN the system SHALL store individual trades with price, size, side, and timestamp
5. WHEN storing order book data THEN the system SHALL maintain 30 minutes of raw data and 24 hours of aggregated data
6. WHEN querying order book data THEN the system SHALL provide efficient access to imbalance metrics across multiple timeframes (1s, 5s, 15s, 60s)
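
A sketch of the fixed-width price bucketing behind criteria 2 and 3, with the function name and input shape assumed for illustration:

```python
from collections import defaultdict


def bucket_levels(levels, bucket_size=1.0):
    """Aggregate (price, size) book levels into fixed-width price buckets."""
    buckets = defaultdict(float)
    for price, size in levels:
        # Floor each price to its bucket boundary, e.g. 3052.37 -> 3052.0
        floor = (price // bucket_size) * bucket_size
        buckets[floor] += size
    return dict(buckets)
```

The same function would serve both aggregation levels: `bucket_size=1.0` for the 1s rows and `bucket_size=10.0` for the 1m rows.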
### Requirement 5: Real-Time Data Caching

**User Story:** As a real-time trading system, I need low-latency access to the latest market data, so that I can make timely trading decisions.

#### Acceptance Criteria

1. WHEN receiving real-time data THEN the system SHALL maintain an in-memory cache of the last 5 minutes of data
2. WHEN requesting latest data THEN the system SHALL serve from cache with <10ms latency
3. WHEN cache is updated THEN the system SHALL asynchronously persist to TimescaleDB without blocking
4. WHEN cache reaches capacity THEN the system SHALL evict oldest data while maintaining continuity
5. WHEN system restarts THEN the system SHALL rebuild cache from TimescaleDB automatically
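
A rolling five-minute cache satisfying criteria 1 and 4 could be a time-bounded deque; the class below is a sketch, and persistence (criterion 3) would be handed to a background task so `append()` never blocks:

```python
import time
from collections import deque


class RealtimeCache:
    """In-memory rolling window over the most recent market data."""

    def __init__(self, window_seconds: float = 300.0):
        self.window = window_seconds
        self._entries = deque()  # (arrival_time, data), oldest first

    def append(self, data) -> None:
        now = time.monotonic()
        self._entries.append((now, data))
        # Evict entries older than the window while keeping continuity:
        # eviction only trims the oldest end, never punches holes.
        while self._entries and now - self._entries[0][0] > self.window:
            self._entries.popleft()

    def latest(self):
        """Most recent entry, served directly from memory."""
        return self._entries[-1][1] if self._entries else None
```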
### Requirement 6: Historical Data Access for Backtesting

**User Story:** As a backtesting system, I need efficient access to historical data at any timestamp, so that I can simulate trading strategies accurately.

#### Acceptance Criteria

1. WHEN requesting historical data THEN the system SHALL query TimescaleDB with timestamp-based indexing
2. WHEN requesting a time range THEN the system SHALL return all data points within that range efficiently
3. WHEN requesting data with context window THEN the system SHALL return ±N minutes of surrounding data
4. WHEN backtesting THEN the system SHALL support sequential data access without loading entire dataset into memory
5. WHEN querying historical data THEN the system SHALL return results in <100ms for typical queries (single timestamp, single symbol)
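
Criterion 4 suggests streaming rather than bulk loading; with psycopg2, a named (server-side) cursor fetches rows in batches. Table and column names again follow the illustrative schema:

```python
def iter_range(conn, symbol, start, end, batch=10_000):
    """Yield 1s rows in [start, end) without materializing them all."""
    # Naming the cursor makes it server-side: rows arrive in
    # `itersize` batches instead of being loaded into memory at once.
    with conn.cursor(name="backtest_stream") as cur:
        cur.itersize = batch
        cur.execute(
            "SELECT ts, open, high, low, close, volume FROM ohlcv "
            "WHERE symbol = %s AND timeframe = '1s' "
            "AND ts >= %s AND ts < %s ORDER BY ts",
            (symbol, start, end),
        )
        yield from cur
```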
### Requirement 7: Data Annotation Support

**User Story:** As a data annotator, I need to retrieve historical market data at specific timestamps to manually label trading signals, so that I can create training datasets.

#### Acceptance Criteria

1. WHEN annotating data THEN the system SHALL provide the same `get_inference_data()` interface with timestamp parameter
2. WHEN retrieving annotation data THEN the system SHALL include ±5 minutes of context data
3. WHEN loading annotation sessions THEN the system SHALL support efficient random access to any timestamp
4. WHEN displaying charts THEN the system SHALL provide multi-timeframe data aligned to the annotation timestamp
5. WHEN saving annotations THEN the system SHALL link annotations to exact timestamps in the database
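
Annotation reuses the unified interface from Requirement 1; a sketch of the ±5 minute context fetch, where `get_range` is a hypothetical range query on the provider:

```python
from datetime import datetime, timedelta


def get_annotation_window(provider, symbol, ts: datetime, minutes: int = 5):
    """Fetch the annotated point plus surrounding context for charting."""
    point = provider.get_inference_data(symbol, timestamp=ts)
    context = provider.get_range(
        symbol, ts - timedelta(minutes=minutes), ts + timedelta(minutes=minutes)
    )
    return point, context
```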
### Requirement 8: Data Migration and Backward Compatibility

**User Story:** As a system administrator, I want existing data migrated to the new storage system without data loss, so that I can maintain historical continuity.

#### Acceptance Criteria

1. WHEN migration starts THEN the system SHALL detect existing Parquet files in cache directory
2. WHEN migrating Parquet data THEN the system SHALL import all data into TimescaleDB with proper timestamps
3. WHEN migration completes THEN the system SHALL verify data integrity by comparing record counts
4. WHEN migration fails THEN the system SHALL rollback changes and preserve original files
5. WHEN migration succeeds THEN the system SHALL optionally archive old Parquet files
6. WHEN accessing data during migration THEN the system SHALL continue serving from existing storage
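
The Parquet leg of the migration might look like the sketch below, assuming pandas plus hypothetical `write_batch` and `count_rows` helpers; a failed count check raises before any originals are touched (criteria 3 and 4):

```python
from pathlib import Path

import pandas as pd


def migrate_parquet(cache_dir: str, write_batch, count_rows) -> None:
    """Import every cached Parquet file into TimescaleDB, then verify."""
    for path in Path(cache_dir).glob("*.parquet"):
        df = pd.read_parquet(path)
        write_batch(df)                 # batched INSERT into TimescaleDB
        stored = count_rows(path.stem)  # record count after import
        if stored != len(df):
            raise RuntimeError(f"integrity check failed for {path.name}")
        # Originals stay in place; archiving them is a separate,
        # optional step once the whole migration has been verified.
```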
### Requirement 9: Performance and Scalability

**User Story:** As a system operator, I need the data storage system to handle high-frequency data ingestion and queries efficiently, so that the system remains responsive under load.

#### Acceptance Criteria

1. WHEN ingesting real-time data THEN the system SHALL handle at least 1000 updates per second per symbol
2. WHEN querying data THEN the system SHALL return single-timestamp queries in <100ms
3. WHEN querying time ranges THEN the system SHALL return 1 hour of 1s data in <500ms
4. WHEN storing data THEN the system SHALL use batch writes to optimize database performance
5. WHEN database grows THEN the system SHALL use TimescaleDB compression to reduce storage size by 80%+
6. WHEN running multiple queries THEN the system SHALL support concurrent access without performance degradation
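
Batched writes (criterion 4) and a compression policy (criterion 5) could be wired up as follows with psycopg2's `execute_values`; the policy interval is an assumption:

```python
from psycopg2.extras import execute_values


def write_candles(conn, rows):
    """Insert many OHLCV rows in a single round trip."""
    with conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO ohlcv "
            "(ts, symbol, timeframe, open, high, low, close, volume) "
            "VALUES %s",
            rows,
        )

# Compress chunks older than one day; TimescaleDB schedules the job.
ADD_POLICY = "SELECT add_compression_policy('ohlcv', INTERVAL '1 day');"
```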
### Requirement 10: Data Consistency and Validation

**User Story:** As a trading system, I need to ensure all data is consistent and validated, so that models receive accurate information.

#### Acceptance Criteria

1. WHEN storing data THEN the system SHALL validate timestamps are in UTC timezone
2. WHEN storing OHLCV data THEN the system SHALL validate that high >= max(open, close) and low <= min(open, close)
3. WHEN storing order book data THEN the system SHALL validate that the best bid is below the best ask (the book is never crossed)
4. WHEN detecting invalid data THEN the system SHALL log warnings and reject the data point
5. WHEN querying data THEN the system SHALL ensure all timeframes are properly aligned
6. WHEN data gaps exist THEN the system SHALL identify and log missing periods
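
A compact version of the checks in criteria 1, 2, and 4, with the function name chosen for illustration:

```python
import logging
from datetime import datetime

logger = logging.getLogger(__name__)


def validate_candle(ts: datetime, open_, high, low, close) -> bool:
    """Return True if a candle passes basic consistency checks;
    invalid candles are logged and rejected by the caller."""
    if ts.tzinfo is None or ts.utcoffset().total_seconds() != 0:
        logger.warning("rejecting candle: timestamp %s is not UTC", ts)
        return False
    if not (high >= max(open_, close) and low <= min(open_, close)):
        logger.warning("rejecting candle at %s: OHLC bounds violated", ts)
        return False
    return True
```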