# Unified Data Storage Setup Guide

## Overview

The unified data storage system consolidates all market data storage into a single TimescaleDB backend, replacing fragmented Parquet files, pickle files, and in-memory caches.

## Prerequisites

### 1. PostgreSQL with TimescaleDB

You need PostgreSQL 12+ with the TimescaleDB extension installed.

#### Installation Options

**Option A: Docker (Recommended)**

```bash
docker run -d --name timescaledb \
  -p 5432:5432 \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=trading_data \
  timescale/timescaledb:latest-pg14
```

**Option B: Local Installation**

- Follow the TimescaleDB installation guide: https://docs.timescale.com/install/latest/
- Create the database: `createdb trading_data`

### 2. Python Dependencies

Ensure you have the required Python packages:

```bash
pip install asyncpg
```

## Database Configuration

Update your `config.yaml` with database connection details:

```yaml
database:
  host: localhost
  port: 5432
  name: trading_data
  user: postgres
  password: postgres
  pool_size: 20
```

## Setup Process

### Step 1: Run Setup Script

```bash
python scripts/setup_unified_storage.py
```

This script will:

1. Connect to the database
2. Verify the TimescaleDB extension
3. Create all required tables
4. Convert the tables to hypertables
5. Create indexes for performance
6. Set up continuous aggregates
7. Configure compression policies
8. Configure retention policies
9. Verify the setup
10. Run basic operation tests

### Step 2: Verify Setup

The setup script will display schema information:

```
=== Schema Information ===
Migrations applied: 8
Tables created: 5
Hypertables: 5
Continuous aggregates: 5

=== Table Sizes ===
ohlcv_data: 8192 bytes
order_book_snapshots: 8192 bytes
order_book_1s_agg: 8192 bytes
order_book_imbalances: 8192 bytes
trade_events: 8192 bytes

=== Hypertables ===
ohlcv_data: 0 chunks, compression=enabled
order_book_snapshots: 0 chunks, compression=enabled
order_book_1s_agg: 0 chunks, compression=enabled
order_book_imbalances: 0 chunks, compression=enabled
trade_events: 0 chunks, compression=enabled

=== Continuous Aggregates ===
ohlcv_1m_continuous: 8192 bytes
ohlcv_5m_continuous: 8192 bytes
ohlcv_15m_continuous: 8192 bytes
ohlcv_1h_continuous: 8192 bytes
ohlcv_1d_continuous: 8192 bytes
```
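You can also verify connectivity and the extension programmatically. A minimal sketch, assuming the `config.yaml` layout shown above and that PyYAML is installed (the file path and key names are assumptions):

```python
import asyncio

import asyncpg
import yaml  # PyYAML, assumed available alongside asyncpg


async def check_database() -> None:
    # Load connection details from config.yaml (layout assumed as above).
    with open("config.yaml") as f:
        db = yaml.safe_load(f)["database"]

    conn = await asyncpg.connect(
        host=db["host"],
        port=db["port"],
        user=db["user"],
        password=db["password"],
        database=db["name"],
    )
    try:
        # Report the installed TimescaleDB extension version, if any.
        version = await conn.fetchval(
            "SELECT extversion FROM pg_extension WHERE extname = 'timescaledb'"
        )
        print(f"TimescaleDB version: {version or 'NOT INSTALLED'}")
    finally:
        await conn.close()


asyncio.run(check_database())
```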
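Under the hood, steps 3 to 5 reduce to plain DDL plus TimescaleDB calls. A sketch for `ohlcv_data` (the column list is abbreviated and the index name is an assumption; the full schema is documented below):

```sql
-- Sketch only: the real table also carries the indicator columns described below.
CREATE TABLE IF NOT EXISTS ohlcv_data (
    timestamp   TIMESTAMPTZ NOT NULL,
    symbol      VARCHAR     NOT NULL,
    timeframe   VARCHAR     NOT NULL,
    open_price  DECIMAL     NOT NULL,
    high_price  DECIMAL     NOT NULL,
    low_price   DECIMAL     NOT NULL,
    close_price DECIMAL     NOT NULL,
    volume      DECIMAL     NOT NULL,
    trade_count INTEGER,
    PRIMARY KEY (timestamp, symbol, timeframe)
);

-- Convert the plain table into a hypertable partitioned on timestamp.
SELECT create_hypertable('ohlcv_data', 'timestamp', if_not_exists => TRUE);

-- One of the composite indexes for the symbol + timestamp query pattern.
CREATE INDEX IF NOT EXISTS idx_ohlcv_symbol_time
    ON ohlcv_data (symbol, timestamp DESC);
```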
## Database Schema

### Tables

#### 1. ohlcv_data

Stores candlestick data for all timeframes with pre-calculated technical indicators.

**Columns:**

- `timestamp` (TIMESTAMPTZ): Candle timestamp
- `symbol` (VARCHAR): Trading pair (e.g., 'ETH/USDT')
- `timeframe` (VARCHAR): Timeframe (1s, 1m, 5m, 15m, 1h, 1d)
- `open_price`, `high_price`, `low_price`, `close_price` (DECIMAL): OHLC prices
- `volume` (DECIMAL): Trading volume
- `trade_count` (INTEGER): Number of trades
- Technical indicators: `rsi_14`, `macd`, `macd_signal`, `bb_upper`, `bb_middle`, `bb_lower`, etc.

**Primary Key:** `(timestamp, symbol, timeframe)`

#### 2. order_book_snapshots

Stores raw order book snapshots.

**Columns:**

- `timestamp` (TIMESTAMPTZ): Snapshot timestamp
- `symbol` (VARCHAR): Trading pair
- `exchange` (VARCHAR): Exchange name
- `bids` (JSONB): Bid levels (top 50)
- `asks` (JSONB): Ask levels (top 50)
- `mid_price`, `spread`, `bid_volume`, `ask_volume` (DECIMAL): Calculated metrics

**Primary Key:** `(timestamp, symbol, exchange)`

#### 3. order_book_1s_agg

Stores 1-second aggregated order book data with $1 price buckets.

**Columns:**

- `timestamp` (TIMESTAMPTZ): Aggregation timestamp
- `symbol` (VARCHAR): Trading pair
- `price_bucket` (DECIMAL): Price bucket ($1 increments)
- `bid_volume`, `ask_volume` (DECIMAL): Aggregated volumes
- `bid_count`, `ask_count` (INTEGER): Number of orders
- `imbalance` (DECIMAL): Order book imbalance

**Primary Key:** `(timestamp, symbol, price_bucket)`

#### 4. order_book_imbalances

Stores multi-timeframe order book imbalance metrics.

**Columns:**

- `timestamp` (TIMESTAMPTZ): Calculation timestamp
- `symbol` (VARCHAR): Trading pair
- `imbalance_1s`, `imbalance_5s`, `imbalance_15s`, `imbalance_60s` (DECIMAL): Imbalances
- `volume_imbalance_1s`, `volume_imbalance_5s`, etc. (DECIMAL): Volume-weighted imbalances
- `price_range` (DECIMAL): Price range used for calculation

**Primary Key:** `(timestamp, symbol)`

#### 5. trade_events

Stores individual trade events.

**Columns:**

- `timestamp` (TIMESTAMPTZ): Trade timestamp
- `symbol` (VARCHAR): Trading pair
- `exchange` (VARCHAR): Exchange name
- `price` (DECIMAL): Trade price
- `size` (DECIMAL): Trade size
- `side` (VARCHAR): Trade side ('buy' or 'sell')
- `trade_id` (VARCHAR): Unique trade identifier

**Primary Key:** `(timestamp, symbol, exchange, trade_id)`

### Continuous Aggregates

Continuous aggregates automatically generate higher-timeframe data from lower timeframes:

1. **ohlcv_1m_continuous**: 1-minute candles from 1-second data
2. **ohlcv_5m_continuous**: 5-minute candles from 1-minute data
3. **ohlcv_15m_continuous**: 15-minute candles from 5-minute data
4. **ohlcv_1h_continuous**: 1-hour candles from 15-minute data
5. **ohlcv_1d_continuous**: 1-day candles from 1-hour data

### Compression Policies

Data is automatically compressed to save storage:

- **ohlcv_data**: Compress after 7 days
- **order_book_snapshots**: Compress after 1 day
- **order_book_1s_agg**: Compress after 2 days
- **order_book_imbalances**: Compress after 2 days
- **trade_events**: Compress after 7 days

Expected compression ratio: **>80%**

### Retention Policies

Old data is automatically deleted:

- **ohlcv_data**: Retain for 2 years
- **order_book_snapshots**: Retain for 30 days
- **order_book_1s_agg**: Retain for 60 days
- **order_book_imbalances**: Retain for 60 days
- **trade_events**: Retain for 90 days

## Performance Optimization

### Indexes

All tables have optimized indexes for common query patterns:

- Symbol + timestamp queries
- Timeframe-specific queries
- Exchange-specific queries
- Multi-column composite indexes

### Query Performance Targets

- **Cache reads**: <10ms
- **Single timestamp queries**: <100ms
- **Time range queries (1 hour)**: <500ms
- **Ingestion throughput**: >1000 ops/sec

### Best Practices

1. **Use time_bucket for aggregations**:

   ```sql
   SELECT time_bucket('1 minute', timestamp) AS bucket,
          symbol,
          avg(close_price) AS avg_price
   FROM ohlcv_data
   WHERE symbol = 'ETH/USDT'
     AND timestamp >= NOW() - INTERVAL '1 hour'
   GROUP BY bucket, symbol;
   ```

2. **Query specific timeframes**:

   ```sql
   SELECT *
   FROM ohlcv_data
   WHERE symbol = 'ETH/USDT'
     AND timeframe = '1m'
     AND timestamp >= NOW() - INTERVAL '1 day'
   ORDER BY timestamp DESC;
   ```
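   From Python, the same query can go through asyncpg with placeholder parameters instead of string interpolation. A minimal sketch, assuming `pool` is an asyncpg pool created from the `config.yaml` settings above (`pool` is a hypothetical variable):

   ```python
   # Sketch: fetch the last day of 1m candles using asyncpg placeholders.
   rows = await pool.fetch(
       """
       SELECT *
       FROM ohlcv_data
       WHERE symbol = $1
         AND timeframe = $2
         AND timestamp >= NOW() - INTERVAL '1 day'
       ORDER BY timestamp DESC
       """,
       "ETH/USDT",
       "1m",
   )
   ```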
3. **Use continuous aggregates for historical data**:

   ```sql
   SELECT *
   FROM ohlcv_1h_continuous
   WHERE symbol = 'ETH/USDT'
     AND timestamp >= NOW() - INTERVAL '7 days'
   ORDER BY timestamp DESC;
   ```

## Monitoring

The queries below use the TimescaleDB 2.x `timescaledb_information` views and sizing functions.

### Check Database Size

```sql
-- Total on-disk size of each hypertable.
SELECT hypertable_name,
       pg_size_pretty(hypertable_size(format('%I.%I', hypertable_schema, hypertable_name)::regclass)) AS total_size
FROM timescaledb_information.hypertables
WHERE hypertable_schema = 'public';

-- Compression savings for a single hypertable.
SELECT pg_size_pretty(before_compression_total_bytes) AS uncompressed,
       pg_size_pretty(after_compression_total_bytes) AS compressed,
       ROUND((1 - after_compression_total_bytes::numeric
                  / before_compression_total_bytes::numeric) * 100, 2) AS compression_pct
FROM hypertable_compression_stats('ohlcv_data');
```

### Check Chunk Information

```sql
SELECT hypertable_name, num_chunks, compression_enabled
FROM timescaledb_information.hypertables
WHERE hypertable_schema = 'public';
```

Per-table compressed-chunk counts are reported by `hypertable_compression_stats('<table>')` (columns `total_chunks` and `number_compressed_chunks`).

### Check Continuous Aggregate Status

```sql
SELECT view_name,
       materialization_hypertable_name,
       pg_size_pretty(hypertable_size(
           format('%I.%I', materialization_hypertable_schema,
                  materialization_hypertable_name)::regclass)) AS size
FROM timescaledb_information.continuous_aggregates
WHERE view_schema = 'public';
```

## Troubleshooting

### TimescaleDB Extension Not Found

If you see "TimescaleDB extension not found":

1. Ensure TimescaleDB is installed
2. Connect to the database and run: `CREATE EXTENSION timescaledb;`
3. Re-run the setup script

### Connection Refused

If you see "connection refused":

1. Check that PostgreSQL is running: `pg_isready`
2. Verify the connection details in `config.yaml`
3. Check firewall settings

### Permission Denied

If you see "permission denied":

1. Ensure the database user has CREATE privileges
2. Grant privileges: `GRANT ALL PRIVILEGES ON DATABASE trading_data TO postgres;`

### Slow Queries

If queries are slow:

1. Check that indexes exist: `\di` in psql
2. Analyze the query plan by prefixing the query with `EXPLAIN ANALYZE`
3. Ensure compression is enabled
4. Consider adding more specific indexes

## Next Steps

After setup is complete:

1. **Implement data models** (Task 2)
2. **Implement cache layer** (Task 3)
3. **Implement database connection layer** (Task 4)
4. **Start data migration** from Parquet files (Task 7)

## Support

For issues or questions:

- Check the TimescaleDB docs: https://docs.timescale.com/
- Review the PostgreSQL logs: `tail -f /var/log/postgresql/postgresql-*.log`
- Enable debug logging in the setup script (see the sketch below)
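If the setup script logs through Python's standard `logging` module (an assumption about its internals), debug output can be enabled by configuring the root logger before the script's own setup runs, for example near the top of `scripts/setup_unified_storage.py`:

```python
import logging

# Assumption: the setup script uses the standard logging module.
# Configuring the root logger at DEBUG surfaces its diagnostic output.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
```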