# Requirements Document

## Introduction

This feature aims to unify all data storage and retrieval methods across the trading system into a single, coherent interface. Currently, the system uses multiple storage approaches (Parquet files, pickle files, in-memory caches, TimescaleDB) and has fragmented data access patterns. This creates complexity, inconsistency, and performance issues.

The unified data storage system will provide a single endpoint for retrieving inference data, supporting both real-time streaming data and historical backtesting/annotation scenarios. It will consolidate storage methods into the most efficient approach and ensure all components use consistent data access patterns.

## Requirements

### Requirement 1: Unified Data Retrieval Interface

**User Story:** As a developer, I want a single method to retrieve inference data regardless of whether I need real-time or historical data, so that I can simplify my code and ensure consistency.

#### Acceptance Criteria

  1. WHEN a component requests inference data THEN the system SHALL provide a unified get_inference_data() method that accepts a timestamp parameter
  2. WHEN timestamp is None or "latest" THEN the system SHALL return the most recent cached real-time data
  3. WHEN timestamp is a specific datetime THEN the system SHALL return historical data from local storage at that timestamp
  4. WHEN requesting inference data THEN the system SHALL return data in a standardized format with all required features (OHLCV, technical indicators, COB data, order book imbalances)
  5. WHEN the requested timestamp is not available THEN the system SHALL return the nearest available data point with a warning
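
A minimal sketch of how criteria 1-3 and 5 could fit together. Apart from `get_inference_data()` itself, every name here (`UnifiedDataProvider`, `InferenceData`, the injected cache and storage collaborators) is an illustrative assumption, not part of this spec:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Union
import logging

logger = logging.getLogger(__name__)

@dataclass
class InferenceData:
    """Standardized payload (criterion 4): OHLCV, indicators, COB, imbalances."""
    timestamp: datetime
    ohlcv: dict        # candles per timeframe
    indicators: dict   # technical indicators
    cob: dict          # consolidated order book features
    imbalances: dict   # order book imbalance metrics

class UnifiedDataProvider:
    def __init__(self, cache, storage):
        self.cache = cache      # in-memory real-time cache (Requirement 5)
        self.storage = storage  # TimescaleDB backend (Requirement 2)

    def get_inference_data(
        self, symbol: str, timestamp: Optional[Union[datetime, str]] = None
    ) -> InferenceData:
        # None or "latest" -> most recent cached real-time data (criterion 2)
        if timestamp is None or timestamp == "latest":
            return self.cache.latest(symbol)
        # Specific datetime -> historical lookup in TimescaleDB (criterion 3)
        data = self.storage.read_at(symbol, timestamp)
        if data.timestamp != timestamp:
            # Nearest available point, surfaced with a warning (criterion 5)
            logger.warning("no data at %s for %s; returning nearest point %s",
                           timestamp, symbol, data.timestamp)
        return data
```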

### Requirement 2: Consolidated Storage Backend

**User Story:** As a system architect, I want all market data stored using a single, optimized storage method, so that I can reduce complexity and improve performance.

#### Acceptance Criteria

  1. WHEN storing candlestick data THEN the system SHALL use TimescaleDB as the primary storage backend
  2. WHEN storing raw order book ticks THEN the system SHALL use TimescaleDB with appropriate compression
  3. WHEN storing aggregated 1s/1m data THEN the system SHALL use TimescaleDB hypertables for efficient time-series queries
  4. WHEN the system starts THEN it SHALL migrate existing Parquet and pickle files to TimescaleDB
  5. WHEN data is written THEN the system SHALL ensure atomic writes with proper error handling
  6. WHEN querying data THEN the system SHALL leverage TimescaleDB's time-series optimizations for fast retrieval
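
A sketch of the storage-side DDL these criteria imply, using standard TimescaleDB primitives (`create_hypertable`, native compression); the table and column names are assumptions for illustration:

```python
import psycopg2

# Illustrative schema; table and column names are not prescribed by the spec.
DDL = """
CREATE TABLE IF NOT EXISTS ohlcv (
    ts        TIMESTAMPTZ NOT NULL,
    symbol    TEXT        NOT NULL,
    timeframe TEXT        NOT NULL,
    open      DOUBLE PRECISION,
    high      DOUBLE PRECISION,
    low       DOUBLE PRECISION,
    close     DOUBLE PRECISION,
    volume    DOUBLE PRECISION,
    PRIMARY KEY (symbol, timeframe, ts)
);
-- Hypertable partitioning for fast time-series queries (criterion 6)
SELECT create_hypertable('ohlcv', 'ts', if_not_exists => TRUE);
-- Native compression for older chunks (criterion 2, Requirement 9)
ALTER TABLE ohlcv SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'symbol, timeframe'
);
SELECT add_compression_policy('ohlcv', INTERVAL '7 days', if_not_exists => TRUE);
"""

# psycopg2 wraps the block in a single transaction: it commits fully or not
# at all, which is the atomic-write behavior criterion 5 asks for.
with psycopg2.connect("dbname=trading") as conn, conn.cursor() as cur:
    cur.execute(DDL)
```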

### Requirement 3: Multi-Timeframe Data Storage

**User Story:** As a trading model, I need access to multiple timeframes (1s, 1m, 5m, 15m, 1h, 1d) of candlestick data, so that I can perform multi-timeframe analysis.

#### Acceptance Criteria

  1. WHEN storing candlestick data THEN the system SHALL store all configured timeframes (1s, 1m, 5m, 15m, 1h, 1d)
  2. WHEN aggregating data THEN the system SHALL use TimescaleDB continuous aggregates to automatically generate higher timeframes from 1s data
  3. WHEN requesting multi-timeframe data THEN the system SHALL return aligned timestamps across all timeframes
  4. WHEN a timeframe is missing data THEN the system SHALL generate it from lower timeframes if available
  5. WHEN storing timeframe data THEN the system SHALL maintain at least 1500 candles per timeframe for each symbol
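
Criterion 2 maps directly onto TimescaleDB continuous aggregates. A sketch of a 1m rollup built from an assumed `ohlcv_1s` hypertable; `first()`/`last()` are TimescaleDB hyperfunctions, and higher timeframes would chain the same way:

```python
# Roll 1s candles up to 1m automatically (criterion 2).
CAGG_1M = """
CREATE MATERIALIZED VIEW IF NOT EXISTS ohlcv_1m
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) AS ts,
       symbol,
       first(open, ts) AS open,
       max(high)       AS high,
       min(low)        AS low,
       last(close, ts) AS close,
       sum(volume)     AS volume
FROM ohlcv_1s
GROUP BY 1, 2;
"""

# A refresh policy keeps the 1m view current as 1s data arrives.
REFRESH_POLICY = """
SELECT add_continuous_aggregate_policy('ohlcv_1m',
    start_offset      => INTERVAL '1 hour',
    end_offset        => INTERVAL '1 minute',
    schedule_interval => INTERVAL '1 minute');
"""
```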

### Requirement 4: Raw Order Book and Trade Data Storage

**User Story:** As a machine learning model, I need access to raw order book data and 1s/1m aggregations of order book and trade data, so that I can analyze market microstructure.

#### Acceptance Criteria

  1. WHEN receiving order book updates THEN the system SHALL store raw ticks in TimescaleDB with full bid/ask depth
  2. WHEN aggregating order book data THEN the system SHALL create 1s aggregations with $1 price buckets
  3. WHEN aggregating order book data THEN the system SHALL create 1m aggregations with $10 price buckets
  4. WHEN storing trade data THEN the system SHALL store individual trades with price, size, side, and timestamp
  5. WHEN storing order book data THEN the system SHALL maintain 30 minutes of raw data and 24 hours of aggregated data
  6. WHEN querying order book data THEN the system SHALL provide efficient access to imbalance metrics across multiple timeframes (1s, 5s, 15s, 60s)
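
A sketch of the fixed-width price bucketing that criteria 2-3 describe; the function name is hypothetical:

```python
import math
from collections import defaultdict

def bucket_levels(levels, bucket_size: float) -> dict:
    """Aggregate raw (price, size) levels into fixed-width price buckets.

    bucket_size=1.0 gives the $1 buckets of the 1s aggregation (criterion 2);
    bucket_size=10.0 gives the $10 buckets of the 1m aggregation (criterion 3).
    """
    buckets = defaultdict(float)
    for price, size in levels:
        # Snap each level to the lower edge of its bucket
        edge = math.floor(price / bucket_size) * bucket_size
        buckets[edge] += size
    return dict(buckets)

# Example: three raw bid levels collapse into two $1 buckets
bids = [(2449.85, 1.2), (2449.40, 0.8), (2448.95, 2.0)]
print(bucket_levels(bids, 1.0))  # {2449.0: 2.0, 2448.0: 2.0}
```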

### Requirement 5: Real-Time Data Caching

**User Story:** As a real-time trading system, I need low-latency access to the latest market data, so that I can make timely trading decisions.

#### Acceptance Criteria

  1. WHEN receiving real-time data THEN the system SHALL maintain an in-memory cache of the last 5 minutes of data
  2. WHEN requesting latest data THEN the system SHALL serve from cache with <10ms latency
  3. WHEN cache is updated THEN the system SHALL asynchronously persist to TimescaleDB without blocking
  4. WHEN cache reaches capacity THEN the system SHALL evict oldest data while maintaining continuity
  5. WHEN system restarts THEN the system SHALL rebuild cache from TimescaleDB automatically
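
One way the cache could satisfy the eviction and non-blocking persistence criteria, sketched with a `deque` and an `asyncio.Queue`; all names are illustrative:

```python
import asyncio
from collections import deque
from datetime import datetime, timedelta, timezone

class RealTimeCache:
    """In-memory rolling window of recent market data (hypothetical sketch).

    Keeps ~5 minutes of entries (criterion 1) and hands persistence off to
    an asyncio queue so TimescaleDB writes never block the hot path
    (criterion 3).
    """
    WINDOW = timedelta(minutes=5)

    def __init__(self, persist_queue: asyncio.Queue):
        self._entries = deque()            # (timestamp, payload), oldest first
        self._persist_queue = persist_queue

    def append(self, payload) -> None:
        now = datetime.now(timezone.utc)
        self._entries.append((now, payload))
        # Evict entries older than the window (criterion 4)
        cutoff = now - self.WINDOW
        while self._entries and self._entries[0][0] < cutoff:
            self._entries.popleft()
        # Non-blocking hand-off; a background task drains to TimescaleDB
        self._persist_queue.put_nowait(payload)

    def latest(self):
        return self._entries[-1][1] if self._entries else None
```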

### Requirement 6: Historical Data Access for Backtesting

**User Story:** As a backtesting system, I need efficient access to historical data at any timestamp, so that I can simulate trading strategies accurately.

#### Acceptance Criteria

  1. WHEN requesting historical data THEN the system SHALL query TimescaleDB with timestamp-based indexing
  2. WHEN requesting a time range THEN the system SHALL return all data points within that range efficiently
  3. WHEN requesting data with context window THEN the system SHALL return ±N minutes of surrounding data
  4. WHEN backtesting THEN the system SHALL support sequential data access without loading entire dataset into memory
  5. WHEN querying historical data THEN the system SHALL return results in <100ms for typical queries (single timestamp, single symbol)
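
A sketch of the timestamp-indexed range query with a ±N-minute context window (criteria 1-3); table and column names are assumptions:

```python
from datetime import datetime, timedelta

RANGE_QUERY = """
SELECT ts, open, high, low, close, volume
FROM ohlcv_1s
WHERE symbol = %s AND ts BETWEEN %s AND %s
ORDER BY ts
"""

def read_with_context(cur, symbol: str, at: datetime, context_minutes: int = 5):
    """Return all rows within +/- context_minutes of the requested timestamp."""
    window = timedelta(minutes=context_minutes)
    cur.execute(RANGE_QUERY, (symbol, at - window, at + window))
    return cur.fetchall()
```

For criterion 4, the same query could run through a psycopg2 named (server-side) cursor, which streams rows to the backtester instead of materializing the whole range in memory.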

### Requirement 7: Data Annotation Support

**User Story:** As a data annotator, I need to retrieve historical market data at specific timestamps to manually label trading signals, so that I can create training datasets.

#### Acceptance Criteria

  1. WHEN annotating data THEN the system SHALL provide the same get_inference_data() interface with timestamp parameter
  2. WHEN retrieving annotation data THEN the system SHALL include ±5 minutes of context data
  3. WHEN loading annotation sessions THEN the system SHALL support efficient random access to any timestamp
  4. WHEN displaying charts THEN the system SHALL provide multi-timeframe data aligned to the annotation timestamp
  5. WHEN saving annotations THEN the system SHALL link annotations to exact timestamps in the database
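
Hypothetical usage from an annotation session, reusing the unified interface from Requirement 1; the symbol and `provider` instance are assumptions:

```python
from datetime import datetime, timezone

# Same entry point as live trading (criterion 1); the explicit timestamp pins
# the snapshot, and retrieval includes +/- 5 minutes of context (criterion 2).
ts = datetime(2025, 10, 15, 14, 30, tzinfo=timezone.utc)
snapshot = provider.get_inference_data("ETH/USDT", timestamp=ts)
```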

### Requirement 8: Data Migration and Backward Compatibility

**User Story:** As a system administrator, I want existing data migrated to the new storage system without data loss, so that I can maintain historical continuity.

#### Acceptance Criteria

  1. WHEN migration starts THEN the system SHALL detect existing Parquet files in the cache directory
  2. WHEN migrating Parquet data THEN the system SHALL import all data into TimescaleDB with proper timestamps
  3. WHEN migration completes THEN the system SHALL verify data integrity by comparing record counts
  4. WHEN migration fails THEN the system SHALL roll back changes and preserve the original files
  5. WHEN migration succeeds THEN the system SHALL optionally archive old Parquet files
  6. WHEN accessing data during migration THEN the system SHALL continue serving from existing storage
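
A sketch of a per-file migration step with the count verification and rollback behavior of criteria 2-4; the schema and column names are assumptions:

```python
import pandas as pd
from psycopg2.extras import execute_values

COLS = ["ts", "symbol", "open", "high", "low", "close", "volume"]

def migrate_parquet(conn, path: str, table: str = "ohlcv_1s") -> None:
    """Import one Parquet file into TimescaleDB inside a transaction.

    Commit only if the stored count over the file's time span covers the
    source (criterion 3); any mismatch raises, psycopg2 rolls back, and the
    original file is left untouched (criterion 4).
    """
    df = pd.read_parquet(path)
    rows = list(df[COLS].itertuples(index=False, name=None))
    with conn, conn.cursor() as cur:
        execute_values(
            cur,
            f"INSERT INTO {table} ({', '.join(COLS)}) VALUES %s",
            rows,
        )
        cur.execute(
            f"SELECT count(*) FROM {table} WHERE ts BETWEEN %s AND %s",
            (df["ts"].min(), df["ts"].max()),
        )
        (stored,) = cur.fetchone()
        if stored < len(df):
            raise RuntimeError(f"integrity check failed: {stored} < {len(df)}")
```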

### Requirement 9: Performance and Scalability

**User Story:** As a system operator, I need the data storage system to handle high-frequency data ingestion and queries efficiently, so that the system remains responsive under load.

#### Acceptance Criteria

  1. WHEN ingesting real-time data THEN the system SHALL handle at least 1000 updates per second per symbol
  2. WHEN querying data THEN the system SHALL return single-timestamp queries in <100ms
  3. WHEN querying time ranges THEN the system SHALL return 1 hour of 1s data in <500ms
  4. WHEN storing data THEN the system SHALL use batch writes to optimize database performance
  5. WHEN database grows THEN the system SHALL use TimescaleDB compression to reduce storage size by 80%+
  6. WHEN running multiple queries THEN the system SHALL support concurrent access without performance degradation
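
A sketch of the batched-write path criterion 4 calls for, using psycopg2's `execute_values` for bulk inserts; the thresholds and names are illustrative:

```python
import time
from psycopg2.extras import execute_values

class BatchWriter:
    """Buffer incoming rows and flush in batches (hypothetical sketch).

    Amortizes per-statement overhead so 1000+ updates/sec per symbol
    (criterion 1) become a few bulk INSERTs per second (criterion 4).
    """
    def __init__(self, conn, table: str, cols, max_rows: int = 500,
                 max_age_s: float = 0.25):
        self.conn, self.table, self.cols = conn, table, list(cols)
        self.max_rows, self.max_age_s = max_rows, max_age_s
        self._buf, self._first_ts = [], None

    def add(self, row) -> None:
        if not self._buf:
            self._first_ts = time.monotonic()
        self._buf.append(row)
        # Flush when the buffer is full or the oldest row is getting stale
        if (len(self._buf) >= self.max_rows
                or time.monotonic() - self._first_ts >= self.max_age_s):
            self.flush()

    def flush(self) -> None:
        if not self._buf:
            return
        with self.conn, self.conn.cursor() as cur:
            execute_values(
                cur,
                f"INSERT INTO {self.table} ({', '.join(self.cols)}) VALUES %s",
                self._buf,
            )
        self._buf.clear()
```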

### Requirement 10: Data Consistency and Validation

**User Story:** As a trading system, I need to ensure all data is consistent and validated, so that models receive accurate information.

#### Acceptance Criteria

  1. WHEN storing data THEN the system SHALL validate that timestamps are in UTC
  2. WHEN storing OHLCV data THEN the system SHALL validate that high >= max(open, close) and low <= min(open, close)
  3. WHEN storing order book data THEN the system SHALL validate that the best bid is below the best ask (no crossed book)
  4. WHEN detecting invalid data THEN the system SHALL log warnings and reject the data point
  5. WHEN querying data THEN the system SHALL ensure all timeframes are properly aligned
  6. WHEN data gaps exist THEN the system SHALL identify and log missing periods
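
A sketch of the validation gates in criteria 1-4; any candle or book snapshot that fails is logged and rejected rather than stored:

```python
import logging
from datetime import timedelta

logger = logging.getLogger(__name__)

def validate_candle(ts, open_, high, low, close) -> bool:
    """Reject malformed candles before they reach storage (criteria 1, 2, 4)."""
    if ts.tzinfo is None or ts.utcoffset() != timedelta(0):
        logger.warning("rejecting candle: timestamp %s is not UTC", ts)
        return False
    if not (high >= low and high >= open_ and high >= close
            and low <= open_ and low <= close):
        logger.warning("rejecting candle at %s: inconsistent OHLC", ts)
        return False
    return True

def validate_book(best_bid: float, best_ask: float, ts) -> bool:
    """A crossed book (bid >= ask) indicates bad data (criteria 3, 4)."""
    if best_bid >= best_ask:
        logger.warning("rejecting book at %s: bid %s >= ask %s",
                       ts, best_bid, best_ask)
        return False
    return True
```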