# Requirements Document

## Introduction

This feature aims to unify all data storage and retrieval methods across the trading system into a single, coherent interface. Currently, the system uses multiple storage approaches (Parquet files, pickle files, in-memory caches, TimescaleDB) and has fragmented data access patterns. This creates complexity, inconsistency, and performance issues.

The unified data storage system will provide a single endpoint for retrieving inference data, supporting both real-time streaming data and historical backtesting/annotation scenarios. It will consolidate storage methods into the most efficient approach and ensure all components use consistent data access patterns.

## Requirements

### Requirement 1: Unified Data Retrieval Interface

**User Story:** As a developer, I want a single method to retrieve inference data regardless of whether I need real-time or historical data, so that I can simplify my code and ensure consistency.

#### Acceptance Criteria

  1. WHEN a component requests inference data THEN the system SHALL provide a unified get_inference_data() method that accepts a timestamp parameter
  2. WHEN timestamp is None or "latest" THEN the system SHALL return the most recent cached real-time data
  3. WHEN timestamp is a specific datetime THEN the system SHALL return historical data from local storage at that timestamp
  4. WHEN requesting inference data THEN the system SHALL return data in a standardized format with all required features (OHLCV, technical indicators, COB data, order book imbalances)
  5. WHEN the requested timestamp is not available THEN the system SHALL return the nearest available data point with a warning
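
A minimal sketch of how criteria 1-3 and 5 could fit together. Apart from `get_inference_data()` itself, every name here (`UnifiedDataProvider`, `InferenceData`, the injected cache and storage collaborators) is an illustrative assumption, not part of this spec:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Union
import logging

logger = logging.getLogger(__name__)

@dataclass
class InferenceData:
    """Standardized payload (criterion 4): OHLCV, indicators, COB, imbalances."""
    timestamp: datetime
    ohlcv: dict        # candles per timeframe
    indicators: dict   # technical indicators
    cob: dict          # consolidated order book features
    imbalances: dict   # order book imbalance metrics

class UnifiedDataProvider:
    def __init__(self, cache, storage):
        self.cache = cache      # in-memory real-time cache (Requirement 5)
        self.storage = storage  # TimescaleDB backend (Requirement 2)

    def get_inference_data(
        self, symbol: str, timestamp: Optional[Union[datetime, str]] = None
    ) -> InferenceData:
        # None or "latest" -> most recent cached real-time data (criterion 2)
        if timestamp is None or timestamp == "latest":
            return self.cache.latest(symbol)
        # Specific datetime -> historical lookup in TimescaleDB (criterion 3)
        data = self.storage.read_at(symbol, timestamp)
        if data.timestamp != timestamp:
            # Nearest available point, surfaced with a warning (criterion 5)
            logger.warning("no data at %s for %s; returning nearest point %s",
                           timestamp, symbol, data.timestamp)
        return data
```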

### Requirement 2: Consolidated Storage Backend

**User Story:** As a system architect, I want all market data stored using a single, optimized storage method, so that I can reduce complexity and improve performance.

#### Acceptance Criteria

  1. WHEN storing candlestick data THEN the system SHALL use TimescaleDB as the primary storage backend
  2. WHEN storing raw order book ticks THEN the system SHALL use TimescaleDB with appropriate compression
  3. WHEN storing aggregated 1s/1m data THEN the system SHALL use TimescaleDB hypertables for efficient time-series queries
  4. WHEN the system starts THEN it SHALL migrate existing Parquet and pickle files to TimescaleDB
  5. WHEN data is written THEN the system SHALL ensure atomic writes with proper error handling
  6. WHEN querying data THEN the system SHALL leverage TimescaleDB's time-series optimizations for fast retrieval
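
A sketch of the storage-side DDL these criteria imply, using standard TimescaleDB primitives (`create_hypertable`, native compression); the table and column names are assumptions for illustration:

```python
import psycopg2

# Illustrative schema; table and column names are not prescribed by the spec.
DDL = """
CREATE TABLE IF NOT EXISTS ohlcv (
    ts        TIMESTAMPTZ NOT NULL,
    symbol    TEXT        NOT NULL,
    timeframe TEXT        NOT NULL,
    open      DOUBLE PRECISION,
    high      DOUBLE PRECISION,
    low       DOUBLE PRECISION,
    close     DOUBLE PRECISION,
    volume    DOUBLE PRECISION,
    PRIMARY KEY (symbol, timeframe, ts)
);
-- Hypertable partitioning for fast time-series queries (criterion 6)
SELECT create_hypertable('ohlcv', 'ts', if_not_exists => TRUE);
-- Native compression for older chunks (criterion 2, Requirement 9)
ALTER TABLE ohlcv SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'symbol, timeframe'
);
SELECT add_compression_policy('ohlcv', INTERVAL '7 days', if_not_exists => TRUE);
"""

# psycopg2 wraps the block in a single transaction: it commits fully or not
# at all, which is the atomic-write behavior criterion 5 asks for.
with psycopg2.connect("dbname=trading") as conn, conn.cursor() as cur:
    cur.execute(DDL)
```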

### Requirement 3: Multi-Timeframe Data Storage

**User Story:** As a trading model, I need access to multiple timeframes (1s, 1m, 5m, 15m, 1h, 1d) of candlestick data, so that I can perform multi-timeframe analysis.

#### Acceptance Criteria

  1. WHEN storing candlestick data THEN the system SHALL store all configured timeframes (1s, 1m, 5m, 15m, 1h, 1d)
  2. WHEN aggregating data THEN the system SHALL use TimescaleDB continuous aggregates to automatically generate higher timeframes from 1s data
  3. WHEN requesting multi-timeframe data THEN the system SHALL return aligned timestamps across all timeframes
  4. WHEN a timeframe is missing data THEN the system SHALL generate it from lower timeframes if available
  5. WHEN storing timeframe data THEN the system SHALL maintain at least 1500 candles per timeframe for each symbol
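
Criterion 2 maps directly onto TimescaleDB continuous aggregates. A sketch of a 1m rollup built from an assumed `ohlcv_1s` hypertable; `first()`/`last()` are TimescaleDB hyperfunctions, and higher timeframes would chain the same way:

```python
# Roll 1s candles up to 1m automatically (criterion 2).
CAGG_1M = """
CREATE MATERIALIZED VIEW IF NOT EXISTS ohlcv_1m
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) AS ts,
       symbol,
       first(open, ts) AS open,
       max(high)       AS high,
       min(low)        AS low,
       last(close, ts) AS close,
       sum(volume)     AS volume
FROM ohlcv_1s
GROUP BY 1, 2;
"""

# A refresh policy keeps the 1m view current as 1s data arrives.
REFRESH_POLICY = """
SELECT add_continuous_aggregate_policy('ohlcv_1m',
    start_offset      => INTERVAL '1 hour',
    end_offset        => INTERVAL '1 minute',
    schedule_interval => INTERVAL '1 minute');
"""
```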

### Requirement 4: Raw Order Book and Trade Data Storage

**User Story:** As a machine learning model, I need access to raw order book data and 1s/1m aggregations of order book and trade data, so that I can analyze market microstructure.

#### Acceptance Criteria

  1. WHEN receiving order book updates THEN the system SHALL store raw ticks in TimescaleDB with full bid/ask depth
  2. WHEN aggregating order book data THEN the system SHALL create 1s aggregations with $1 price buckets
  3. WHEN aggregating order book data THEN the system SHALL create 1m aggregations with $10 price buckets
  4. WHEN storing trade data THEN the system SHALL store individual trades with price, size, side, and timestamp
  5. WHEN storing order book data THEN the system SHALL maintain 30 minutes of raw data and 24 hours of aggregated data
  6. WHEN querying order book data THEN the system SHALL provide efficient access to imbalance metrics across multiple timeframes (1s, 5s, 15s, 60s)
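
A sketch of the fixed-width price bucketing that criteria 2-3 describe; the function name is hypothetical:

```python
import math
from collections import defaultdict

def bucket_levels(levels, bucket_size: float) -> dict:
    """Aggregate raw (price, size) levels into fixed-width price buckets.

    bucket_size=1.0 gives the $1 buckets of the 1s aggregation (criterion 2);
    bucket_size=10.0 gives the $10 buckets of the 1m aggregation (criterion 3).
    """
    buckets = defaultdict(float)
    for price, size in levels:
        # Snap each level to the lower edge of its bucket
        edge = math.floor(price / bucket_size) * bucket_size
        buckets[edge] += size
    return dict(buckets)

# Example: three raw bid levels collapse into two $1 buckets
bids = [(2449.85, 1.2), (2449.40, 0.8), (2448.95, 2.0)]
print(bucket_levels(bids, 1.0))  # {2449.0: 2.0, 2448.0: 2.0}
```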

### Requirement 5: Real-Time Data Caching

**User Story:** As a real-time trading system, I need low-latency access to the latest market data, so that I can make timely trading decisions.

#### Acceptance Criteria

  1. WHEN receiving real-time data THEN the system SHALL maintain an in-memory cache of the last 5 minutes of data
  2. WHEN requesting latest data THEN the system SHALL serve from cache with <10ms latency
  3. WHEN cache is updated THEN the system SHALL asynchronously persist to TimescaleDB without blocking
  4. WHEN cache reaches capacity THEN the system SHALL evict oldest data while maintaining continuity
  5. WHEN system restarts THEN the system SHALL rebuild cache from TimescaleDB automatically
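
One way the cache could satisfy the eviction and non-blocking persistence criteria, sketched with a `deque` and an `asyncio.Queue`; all names are illustrative:

```python
import asyncio
from collections import deque
from datetime import datetime, timedelta, timezone

class RealTimeCache:
    """In-memory rolling window of recent market data (hypothetical sketch).

    Keeps ~5 minutes of entries (criterion 1) and hands persistence off to
    an asyncio queue so TimescaleDB writes never block the hot path
    (criterion 3).
    """
    WINDOW = timedelta(minutes=5)

    def __init__(self, persist_queue: asyncio.Queue):
        self._entries = deque()            # (timestamp, payload), oldest first
        self._persist_queue = persist_queue

    def append(self, payload) -> None:
        now = datetime.now(timezone.utc)
        self._entries.append((now, payload))
        # Evict entries older than the window (criterion 4)
        cutoff = now - self.WINDOW
        while self._entries and self._entries[0][0] < cutoff:
            self._entries.popleft()
        # Non-blocking hand-off; a background task drains to TimescaleDB
        self._persist_queue.put_nowait(payload)

    def latest(self):
        return self._entries[-1][1] if self._entries else None
```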

### Requirement 6: Historical Data Access for Backtesting

**User Story:** As a backtesting system, I need efficient access to historical data at any timestamp, so that I can simulate trading strategies accurately.

#### Acceptance Criteria

  1. WHEN requesting historical data THEN the system SHALL query TimescaleDB with timestamp-based indexing
  2. WHEN requesting a time range THEN the system SHALL return all data points within that range efficiently
  3. WHEN requesting data with context window THEN the system SHALL return ±N minutes of surrounding data
  4. WHEN backtesting THEN the system SHALL support sequential data access without loading entire dataset into memory
  5. WHEN querying historical data THEN the system SHALL return results in <100ms for typical queries (single timestamp, single symbol)
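
A sketch of the timestamp-indexed range query with a ±N-minute context window (criteria 1-3); table and column names are assumptions:

```python
from datetime import datetime, timedelta

RANGE_QUERY = """
SELECT ts, open, high, low, close, volume
FROM ohlcv_1s
WHERE symbol = %s AND ts BETWEEN %s AND %s
ORDER BY ts
"""

def read_with_context(cur, symbol: str, at: datetime, context_minutes: int = 5):
    """Return all rows within +/- context_minutes of the requested timestamp."""
    window = timedelta(minutes=context_minutes)
    cur.execute(RANGE_QUERY, (symbol, at - window, at + window))
    return cur.fetchall()
```

For criterion 4, the same query could run through a psycopg2 named (server-side) cursor, which streams rows to the backtester instead of materializing the whole range in memory.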

### Requirement 7: Data Annotation Support

**User Story:** As a data annotator, I need to retrieve historical market data at specific timestamps to manually label trading signals, so that I can create training datasets.

#### Acceptance Criteria

  1. WHEN annotating data THEN the system SHALL provide the same get_inference_data() interface with timestamp parameter
  2. WHEN retrieving annotation data THEN the system SHALL include ±5 minutes of context data
  3. WHEN loading annotation sessions THEN the system SHALL support efficient random access to any timestamp
  4. WHEN displaying charts THEN the system SHALL provide multi-timeframe data aligned to the annotation timestamp
  5. WHEN saving annotations THEN the system SHALL link annotations to exact timestamps in the database
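
Hypothetical usage from an annotation session, reusing the unified interface from Requirement 1; the symbol and `provider` instance are assumptions:

```python
from datetime import datetime, timezone

# Same entry point as live trading (criterion 1); the explicit timestamp pins
# the snapshot, and retrieval includes +/- 5 minutes of context (criterion 2).
ts = datetime(2025, 10, 15, 14, 30, tzinfo=timezone.utc)
snapshot = provider.get_inference_data("ETH/USDT", timestamp=ts)
```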

### Requirement 8: Data Migration and Backward Compatibility

**User Story:** As a system administrator, I want existing data migrated to the new storage system without data loss, so that I can maintain historical continuity.

#### Acceptance Criteria

  1. WHEN migration starts THEN the system SHALL detect existing Parquet files in the cache directory
  2. WHEN migrating Parquet data THEN the system SHALL import all data into TimescaleDB with proper timestamps
  3. WHEN migration completes THEN the system SHALL verify data integrity by comparing record counts
  4. WHEN migration fails THEN the system SHALL roll back changes and preserve the original files
  5. WHEN migration succeeds THEN the system SHALL optionally archive old Parquet files
  6. WHEN accessing data during migration THEN the system SHALL continue serving from existing storage
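
A sketch of a per-file migration step with the count verification and rollback behavior of criteria 2-4; the schema and column names are assumptions:

```python
import pandas as pd
from psycopg2.extras import execute_values

COLS = ["ts", "symbol", "open", "high", "low", "close", "volume"]

def migrate_parquet(conn, path: str, table: str = "ohlcv_1s") -> None:
    """Import one Parquet file into TimescaleDB inside a transaction.

    Commit only if the stored count over the file's time span covers the
    source (criterion 3); any mismatch raises, psycopg2 rolls back, and the
    original file is left untouched (criterion 4).
    """
    df = pd.read_parquet(path)
    rows = list(df[COLS].itertuples(index=False, name=None))
    with conn, conn.cursor() as cur:
        execute_values(
            cur,
            f"INSERT INTO {table} ({', '.join(COLS)}) VALUES %s",
            rows,
        )
        cur.execute(
            f"SELECT count(*) FROM {table} WHERE ts BETWEEN %s AND %s",
            (df["ts"].min(), df["ts"].max()),
        )
        (stored,) = cur.fetchone()
        if stored < len(df):
            raise RuntimeError(f"integrity check failed: {stored} < {len(df)}")
```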

### Requirement 9: Performance and Scalability

**User Story:** As a system operator, I need the data storage system to handle high-frequency data ingestion and queries efficiently, so that the system remains responsive under load.

#### Acceptance Criteria

  1. WHEN ingesting real-time data THEN the system SHALL handle at least 1000 updates per second per symbol
  2. WHEN querying data THEN the system SHALL return single-timestamp queries in <100ms
  3. WHEN querying time ranges THEN the system SHALL return 1 hour of 1s data in <500ms
  4. WHEN storing data THEN the system SHALL use batch writes to optimize database performance
  5. WHEN database grows THEN the system SHALL use TimescaleDB compression to reduce storage size by 80%+
  6. WHEN running multiple queries THEN the system SHALL support concurrent access without performance degradation
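
A sketch of the batched-write path criterion 4 calls for, using psycopg2's `execute_values` for bulk inserts; the thresholds and names are illustrative:

```python
import time
from psycopg2.extras import execute_values

class BatchWriter:
    """Buffer incoming rows and flush in batches (hypothetical sketch).

    Amortizes per-statement overhead so 1000+ updates/sec per symbol
    (criterion 1) become a few bulk INSERTs per second (criterion 4).
    """
    def __init__(self, conn, table: str, cols, max_rows: int = 500,
                 max_age_s: float = 0.25):
        self.conn, self.table, self.cols = conn, table, list(cols)
        self.max_rows, self.max_age_s = max_rows, max_age_s
        self._buf, self._first_ts = [], None

    def add(self, row) -> None:
        if not self._buf:
            self._first_ts = time.monotonic()
        self._buf.append(row)
        # Flush when the buffer is full or the oldest row is getting stale
        if (len(self._buf) >= self.max_rows
                or time.monotonic() - self._first_ts >= self.max_age_s):
            self.flush()

    def flush(self) -> None:
        if not self._buf:
            return
        with self.conn, self.conn.cursor() as cur:
            execute_values(
                cur,
                f"INSERT INTO {self.table} ({', '.join(self.cols)}) VALUES %s",
                self._buf,
            )
        self._buf.clear()
```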

### Requirement 10: Data Consistency and Validation

**User Story:** As a trading system, I need to ensure all data is consistent and validated, so that models receive accurate information.

#### Acceptance Criteria

  1. WHEN storing data THEN the system SHALL validate that timestamps are in UTC
  2. WHEN storing OHLCV data THEN the system SHALL validate that high >= max(open, close) and low <= min(open, close)
  3. WHEN storing order book data THEN the system SHALL validate that the best bid is below the best ask (no crossed book)
  4. WHEN detecting invalid data THEN the system SHALL log warnings and reject the data point
  5. WHEN querying data THEN the system SHALL ensure all timeframes are properly aligned
  6. WHEN data gaps exist THEN the system SHALL identify and log missing periods
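
A sketch of the validation gates in criteria 1-4; any candle or book snapshot that fails is logged and rejected rather than stored:

```python
import logging
from datetime import timedelta

logger = logging.getLogger(__name__)

def validate_candle(ts, open_, high, low, close) -> bool:
    """Reject malformed candles before they reach storage (criteria 1, 2, 4)."""
    if ts.tzinfo is None or ts.utcoffset() != timedelta(0):
        logger.warning("rejecting candle: timestamp %s is not UTC", ts)
        return False
    if not (high >= low and high >= open_ and high >= close
            and low <= open_ and low <= close):
        logger.warning("rejecting candle at %s: inconsistent OHLC", ts)
        return False
    return True

def validate_book(best_bid: float, best_ask: float, ts) -> bool:
    """A crossed book (bid >= ask) indicates bad data (criteria 3, 4)."""
    if best_bid >= best_ask:
        logger.warning("rejecting book at %s: bid %s >= ask %s",
                       ts, best_bid, best_ask)
        return False
    return True
```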