Unified data storage
.kiro/specs/unified-data-storage/requirements.md (new file, +134 lines)

# Requirements Document
## Introduction

This feature aims to unify all data storage and retrieval methods across the trading system into a single, coherent interface. Currently, the system uses multiple storage approaches (Parquet files, pickle files, in-memory caches, TimescaleDB) and has fragmented data access patterns. This creates complexity, inconsistency, and performance issues.

The unified data storage system will provide a single endpoint for retrieving inference data, supporting both real-time streaming data and historical backtesting/annotation scenarios. It will consolidate storage methods into the most efficient approach and ensure all components use consistent data access patterns.
## Requirements

### Requirement 1: Unified Data Retrieval Interface

**User Story:** As a developer, I want a single method to retrieve inference data regardless of whether I need real-time or historical data, so that I can simplify my code and ensure consistency.

#### Acceptance Criteria

1. WHEN a component requests inference data THEN the system SHALL provide a unified `get_inference_data()` method that accepts a timestamp parameter
2. WHEN timestamp is None or "latest" THEN the system SHALL return the most recent cached real-time data
3. WHEN timestamp is a specific datetime THEN the system SHALL return historical data from local storage at that timestamp
4. WHEN requesting inference data THEN the system SHALL return data in a standardized format with all required features (OHLCV, technical indicators, COB data, order book imbalances)
5. WHEN the requested timestamp is not available THEN the system SHALL return the nearest available data point with a warning
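
A minimal sketch of the interface these criteria describe, assuming a Python data provider; the helper methods `_latest_from_cache` and `_query_historical` are hypothetical stand-ins for the two retrieval paths:

```python
from datetime import datetime
from typing import Optional, Union


class DataProvider:
    """Unified access point for inference data (illustrative sketch)."""

    def get_inference_data(
        self,
        symbol: str,
        timestamp: Optional[Union[datetime, str]] = None,
    ) -> dict:
        """Return standardized inference data for one symbol.

        None or "latest" serves the most recent cached real-time data;
        a datetime serves historical data at (or nearest to) that time.
        """
        if timestamp is None or timestamp == "latest":
            return self._latest_from_cache(symbol)
        return self._query_historical(symbol, timestamp)

    def _latest_from_cache(self, symbol: str) -> dict:
        raise NotImplementedError  # backed by the real-time cache (Requirement 5)

    def _query_historical(self, symbol: str, ts: datetime) -> dict:
        raise NotImplementedError  # backed by TimescaleDB (Requirement 6)
```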
### Requirement 2: Consolidated Storage Backend

**User Story:** As a system architect, I want all market data stored using a single, optimized storage method, so that I can reduce complexity and improve performance.

#### Acceptance Criteria

1. WHEN storing candlestick data THEN the system SHALL use TimescaleDB as the primary storage backend
2. WHEN storing raw order book ticks THEN the system SHALL use TimescaleDB with appropriate compression
3. WHEN storing aggregated 1s/1m data THEN the system SHALL use TimescaleDB hypertables for efficient time-series queries
4. WHEN the system starts THEN it SHALL migrate existing Parquet and pickle files to TimescaleDB
5. WHEN data is written THEN the system SHALL ensure atomic writes with proper error handling
6. WHEN querying data THEN the system SHALL leverage TimescaleDB's time-series optimizations for fast retrieval
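
One way the hypertable setup behind criteria 1-3 might look, using psycopg2; the table and column names are assumptions, not the project's actual schema:

```python
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS ohlcv (
    ts        TIMESTAMPTZ      NOT NULL,
    symbol    TEXT             NOT NULL,
    timeframe TEXT             NOT NULL,
    open      DOUBLE PRECISION,
    high      DOUBLE PRECISION,
    low       DOUBLE PRECISION,
    close     DOUBLE PRECISION,
    volume    DOUBLE PRECISION
);
-- Partition on the time column for fast time-range queries.
SELECT create_hypertable('ohlcv', 'ts', if_not_exists => TRUE);
-- Enable native compression, segmenting by symbol for better ratios.
ALTER TABLE ohlcv SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'symbol'
);
"""

# The `with` block commits on success and rolls back on error,
# which covers the atomic-write criterion for this DDL.
with psycopg2.connect("dbname=trading") as conn, conn.cursor() as cur:
    cur.execute(DDL)
```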
### Requirement 3: Multi-Timeframe Data Storage

**User Story:** As a trading model, I need access to multiple timeframes (1s, 1m, 5m, 15m, 1h, 1d) of candlestick data, so that I can perform multi-timeframe analysis.

#### Acceptance Criteria

1. WHEN storing candlestick data THEN the system SHALL store all configured timeframes (1s, 1m, 5m, 15m, 1h, 1d)
2. WHEN aggregating data THEN the system SHALL use TimescaleDB continuous aggregates to automatically generate higher timeframes from 1s data
3. WHEN requesting multi-timeframe data THEN the system SHALL return aligned timestamps across all timeframes
4. WHEN a timeframe is missing data THEN the system SHALL generate it from lower timeframes if available
5. WHEN storing timeframe data THEN the system SHALL maintain at least 1500 candles per timeframe for each symbol
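
Criterion 2 maps directly onto a TimescaleDB continuous aggregate. A sketch deriving 1m candles from 1s data, reusing the illustrative `ohlcv` schema from above (this statement must run outside an explicit transaction):

```python
# TimescaleDB's first()/last() pick open/close by timestamp order.
CAGG_1M = """
CREATE MATERIALIZED VIEW IF NOT EXISTS ohlcv_1m
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 minute', ts) AS bucket,
    symbol,
    first(open, ts) AS open,
    max(high)       AS high,
    min(low)        AS low,
    last(close, ts) AS close,
    sum(volume)     AS volume
FROM ohlcv
WHERE timeframe = '1s'
GROUP BY bucket, symbol;
"""
```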
### Requirement 4: Raw Order Book and Trade Data Storage

**User Story:** As a machine learning model, I need access to raw order book ticks and to 1s/1m aggregated order book and trade data, so that I can analyze market microstructure.

#### Acceptance Criteria

1. WHEN receiving order book updates THEN the system SHALL store raw ticks in TimescaleDB with full bid/ask depth
2. WHEN aggregating order book data THEN the system SHALL create 1s aggregations with $1 price buckets
3. WHEN aggregating order book data THEN the system SHALL create 1m aggregations with $10 price buckets
4. WHEN storing trade data THEN the system SHALL store individual trades with price, size, side, and timestamp
5. WHEN storing order book data THEN the system SHALL maintain 30 minutes of raw data and 24 hours of aggregated data
6. WHEN querying order book data THEN the system SHALL provide efficient access to imbalance metrics across multiple timeframes (1s, 5s, 15s, 60s)
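
A sketch of the fixed-width price bucketing behind criteria 2 and 3, with the function name and input shape assumed for illustration:

```python
from collections import defaultdict


def bucket_levels(levels, bucket_size=1.0):
    """Aggregate (price, size) book levels into fixed-width price buckets."""
    buckets = defaultdict(float)
    for price, size in levels:
        # Floor each price to its bucket boundary, e.g. 3052.37 -> 3052.0
        floor = (price // bucket_size) * bucket_size
        buckets[floor] += size
    return dict(buckets)
```

The same function would serve both aggregation levels: `bucket_size=1.0` for the 1s rows and `bucket_size=10.0` for the 1m rows.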
### Requirement 5: Real-Time Data Caching

**User Story:** As a real-time trading system, I need low-latency access to the latest market data, so that I can make timely trading decisions.

#### Acceptance Criteria

1. WHEN receiving real-time data THEN the system SHALL maintain an in-memory cache of the last 5 minutes of data
2. WHEN requesting latest data THEN the system SHALL serve from cache with <10ms latency
3. WHEN cache is updated THEN the system SHALL asynchronously persist to TimescaleDB without blocking
4. WHEN cache reaches capacity THEN the system SHALL evict oldest data while maintaining continuity
5. WHEN system restarts THEN the system SHALL rebuild cache from TimescaleDB automatically
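
A rolling five-minute cache satisfying criteria 1 and 4 could be a time-bounded deque; the class below is a sketch, and persistence (criterion 3) would be handed to a background task so `append()` never blocks:

```python
import time
from collections import deque


class RealtimeCache:
    """In-memory rolling window over the most recent market data."""

    def __init__(self, window_seconds: float = 300.0):
        self.window = window_seconds
        self._entries = deque()  # (arrival_time, data), oldest first

    def append(self, data) -> None:
        now = time.monotonic()
        self._entries.append((now, data))
        # Evict entries older than the window while keeping continuity:
        # eviction only trims the oldest end, never punches holes.
        while self._entries and now - self._entries[0][0] > self.window:
            self._entries.popleft()

    def latest(self):
        """Most recent entry, served directly from memory."""
        return self._entries[-1][1] if self._entries else None
```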
### Requirement 6: Historical Data Access for Backtesting

**User Story:** As a backtesting system, I need efficient access to historical data at any timestamp, so that I can simulate trading strategies accurately.

#### Acceptance Criteria

1. WHEN requesting historical data THEN the system SHALL query TimescaleDB with timestamp-based indexing
2. WHEN requesting a time range THEN the system SHALL return all data points within that range efficiently
3. WHEN requesting data with context window THEN the system SHALL return ±N minutes of surrounding data
4. WHEN backtesting THEN the system SHALL support sequential data access without loading entire dataset into memory
5. WHEN querying historical data THEN the system SHALL return results in <100ms for typical queries (single timestamp, single symbol)
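
Criterion 4 suggests streaming rather than bulk loading; with psycopg2, a named (server-side) cursor fetches rows in batches. Table and column names again follow the illustrative schema:

```python
def iter_range(conn, symbol, start, end, batch=10_000):
    """Yield 1s rows in [start, end) without materializing them all."""
    # Naming the cursor makes it server-side: rows arrive in
    # `itersize` batches instead of being loaded into memory at once.
    with conn.cursor(name="backtest_stream") as cur:
        cur.itersize = batch
        cur.execute(
            "SELECT ts, open, high, low, close, volume FROM ohlcv "
            "WHERE symbol = %s AND timeframe = '1s' "
            "AND ts >= %s AND ts < %s ORDER BY ts",
            (symbol, start, end),
        )
        yield from cur
```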
### Requirement 7: Data Annotation Support

**User Story:** As a data annotator, I need to retrieve historical market data at specific timestamps to manually label trading signals, so that I can create training datasets.

#### Acceptance Criteria

1. WHEN annotating data THEN the system SHALL provide the same `get_inference_data()` interface with timestamp parameter
2. WHEN retrieving annotation data THEN the system SHALL include ±5 minutes of context data
3. WHEN loading annotation sessions THEN the system SHALL support efficient random access to any timestamp
4. WHEN displaying charts THEN the system SHALL provide multi-timeframe data aligned to the annotation timestamp
5. WHEN saving annotations THEN the system SHALL link annotations to exact timestamps in the database
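
Annotation reuses the unified interface from Requirement 1; a sketch of the ±5 minute context fetch, where `get_range` is a hypothetical range query on the provider:

```python
from datetime import datetime, timedelta


def get_annotation_window(provider, symbol, ts: datetime, minutes: int = 5):
    """Fetch the annotated point plus surrounding context for charting."""
    point = provider.get_inference_data(symbol, timestamp=ts)
    context = provider.get_range(
        symbol, ts - timedelta(minutes=minutes), ts + timedelta(minutes=minutes)
    )
    return point, context
```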
### Requirement 8: Data Migration and Backward Compatibility

**User Story:** As a system administrator, I want existing data migrated to the new storage system without data loss, so that I can maintain historical continuity.

#### Acceptance Criteria

1. WHEN migration starts THEN the system SHALL detect existing Parquet files in cache directory
2. WHEN migrating Parquet data THEN the system SHALL import all data into TimescaleDB with proper timestamps
3. WHEN migration completes THEN the system SHALL verify data integrity by comparing record counts
4. WHEN migration fails THEN the system SHALL rollback changes and preserve original files
5. WHEN migration succeeds THEN the system SHALL optionally archive old Parquet files
6. WHEN accessing data during migration THEN the system SHALL continue serving from existing storage
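
The Parquet leg of the migration might look like the sketch below, assuming pandas plus hypothetical `write_batch` and `count_rows` helpers; a failed count check raises before any originals are touched (criteria 3 and 4):

```python
from pathlib import Path

import pandas as pd


def migrate_parquet(cache_dir: str, write_batch, count_rows) -> None:
    """Import every cached Parquet file into TimescaleDB, then verify."""
    for path in Path(cache_dir).glob("*.parquet"):
        df = pd.read_parquet(path)
        write_batch(df)                 # batched INSERT into TimescaleDB
        stored = count_rows(path.stem)  # record count after import
        if stored != len(df):
            raise RuntimeError(f"integrity check failed for {path.name}")
        # Originals stay in place; archiving them is a separate,
        # optional step once the whole migration has been verified.
```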
### Requirement 9: Performance and Scalability

**User Story:** As a system operator, I need the data storage system to handle high-frequency data ingestion and queries efficiently, so that the system remains responsive under load.

#### Acceptance Criteria

1. WHEN ingesting real-time data THEN the system SHALL handle at least 1000 updates per second per symbol
2. WHEN querying data THEN the system SHALL return single-timestamp queries in <100ms
3. WHEN querying time ranges THEN the system SHALL return 1 hour of 1s data in <500ms
4. WHEN storing data THEN the system SHALL use batch writes to optimize database performance
5. WHEN database grows THEN the system SHALL use TimescaleDB compression to reduce storage size by 80%+
6. WHEN running multiple queries THEN the system SHALL support concurrent access without performance degradation
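
Batched writes (criterion 4) and a compression policy (criterion 5) could be wired up as follows with psycopg2's `execute_values`; the policy interval is an assumption:

```python
from psycopg2.extras import execute_values


def write_candles(conn, rows):
    """Insert many OHLCV rows in a single round trip."""
    with conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO ohlcv "
            "(ts, symbol, timeframe, open, high, low, close, volume) "
            "VALUES %s",
            rows,
        )

# Compress chunks older than one day; TimescaleDB schedules the job.
ADD_POLICY = "SELECT add_compression_policy('ohlcv', INTERVAL '1 day');"
```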
### Requirement 10: Data Consistency and Validation

**User Story:** As a trading system, I need to ensure all data is consistent and validated, so that models receive accurate information.

#### Acceptance Criteria

1. WHEN storing data THEN the system SHALL validate timestamps are in UTC timezone
2. WHEN storing OHLCV data THEN the system SHALL validate that high >= max(open, close) and low <= min(open, close)
3. WHEN storing order book data THEN the system SHALL validate that the best bid is below the best ask (the book is never crossed)
4. WHEN detecting invalid data THEN the system SHALL log warnings and reject the data point
5. WHEN querying data THEN the system SHALL ensure all timeframes are properly aligned
6. WHEN data gaps exist THEN the system SHALL identify and log missing periods
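
A compact version of the checks in criteria 1, 2, and 4, with the function name chosen for illustration:

```python
import logging
from datetime import datetime

logger = logging.getLogger(__name__)


def validate_candle(ts: datetime, open_, high, low, close) -> bool:
    """Return True if a candle passes basic consistency checks;
    invalid candles are logged and rejected by the caller."""
    if ts.tzinfo is None or ts.utcoffset().total_seconds() != 0:
        logger.warning("rejecting candle: timestamp %s is not UTC", ts)
        return False
    if not (high >= max(open_, close) and low <= min(open_, close)):
        logger.warning("rejecting candle at %s: OHLC bounds violated", ts)
        return False
    return True
```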