Unified Data Storage Setup Guide

Overview

The unified data storage system consolidates all market data storage into a single TimescaleDB backend, replacing fragmented Parquet files, pickle files, and in-memory caches.

Prerequisites

1. PostgreSQL with TimescaleDB

You need PostgreSQL 12+ with the TimescaleDB extension installed.

Installation Options

Option A: Docker (Recommended)

docker run -d --name timescaledb \
  -p 5432:5432 \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=trading_data \
  timescale/timescaledb:latest-pg14
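
Once the container is up, you can confirm the extension is available from psql:

SELECT extname, extversion FROM pg_extension WHERE extname = 'timescaledb';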

Option B: Local Installation

Install PostgreSQL 12+ and the TimescaleDB extension with your platform's package manager, following the TimescaleDB installation docs (https://docs.timescale.com/).

2. Python Dependencies

Ensure you have the required Python packages:

pip install asyncpg

Database Configuration

Update your config.yaml with database connection details:

database:
  host: localhost
  port: 5432
  name: trading_data
  user: postgres
  password: postgres
  pool_size: 20

Setup Process

Step 1: Run Setup Script

python scripts/setup_unified_storage.py

This script will:

  1. Connect to the database
  2. Verify TimescaleDB extension
  3. Create all required tables
  4. Convert tables to hypertables (see the sketch after this list)
  5. Create indexes for performance
  6. Set up continuous aggregates
  7. Configure compression policies
  8. Configure retention policies
  9. Verify the setup
  10. Run basic operation tests
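
For reference, step 4 boils down to one create_hypertable call per table. A minimal sketch of the kind of statement involved (the script's actual chunk intervals may differ):

SELECT create_hypertable('ohlcv_data', 'timestamp',
                         chunk_time_interval => INTERVAL '1 day',
                         if_not_exists => TRUE);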

Step 2: Verify Setup

The setup script will display schema information:

=== Schema Information ===
Migrations applied: 8
Tables created: 5
Hypertables: 5
Continuous aggregates: 5

=== Table Sizes ===
  ohlcv_data: 8192 bytes
  order_book_snapshots: 8192 bytes
  order_book_1s_agg: 8192 bytes
  order_book_imbalances: 8192 bytes
  trade_events: 8192 bytes

=== Hypertables ===
  ohlcv_data: 0 chunks, compression=enabled
  order_book_snapshots: 0 chunks, compression=enabled
  order_book_1s_agg: 0 chunks, compression=enabled
  order_book_imbalances: 0 chunks, compression=enabled
  trade_events: 0 chunks, compression=enabled

=== Continuous Aggregates ===
  ohlcv_1m_continuous: 8192 bytes
  ohlcv_5m_continuous: 8192 bytes
  ohlcv_15m_continuous: 8192 bytes
  ohlcv_1h_continuous: 8192 bytes
  ohlcv_1d_continuous: 8192 bytes

Database Schema

Tables

1. ohlcv_data

Stores candlestick data for all timeframes with pre-calculated technical indicators.

Columns:

  • timestamp (TIMESTAMPTZ): Candle timestamp
  • symbol (VARCHAR): Trading pair (e.g., 'ETH/USDT')
  • timeframe (VARCHAR): Timeframe (1s, 1m, 5m, 15m, 1h, 1d)
  • open_price, high_price, low_price, close_price (DECIMAL): OHLC prices
  • volume (DECIMAL): Trading volume
  • trade_count (INTEGER): Number of trades
  • Technical indicators: rsi_14, macd, macd_signal, bb_upper, bb_middle, bb_lower, etc.

Primary Key: (timestamp, symbol, timeframe)
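
Because the composite primary key uniquely identifies a candle, writers can upsert idempotently. A sketch using only the columns listed above (values are illustrative):

INSERT INTO ohlcv_data
    (timestamp, symbol, timeframe, open_price, high_price, low_price, close_price, volume)
VALUES
    ('2025-10-20 09:00:00+03', 'ETH/USDT', '1m', 4000.0, 4010.5, 3995.2, 4005.1, 123.45)
ON CONFLICT (timestamp, symbol, timeframe) DO UPDATE
SET high_price  = GREATEST(ohlcv_data.high_price, EXCLUDED.high_price),
    low_price   = LEAST(ohlcv_data.low_price, EXCLUDED.low_price),
    close_price = EXCLUDED.close_price,
    volume      = EXCLUDED.volume;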

2. order_book_snapshots

Stores raw order book snapshots.

Columns:

  • timestamp (TIMESTAMPTZ): Snapshot timestamp
  • symbol (VARCHAR): Trading pair
  • exchange (VARCHAR): Exchange name
  • bids (JSONB): Bid levels (top 50)
  • asks (JSONB): Ask levels (top 50)
  • mid_price, spread, bid_volume, ask_volume (DECIMAL): Calculated metrics

Primary Key: (timestamp, symbol, exchange)
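
Assuming bids and asks hold arrays of [price, size] pairs ordered best-first (an assumption about the payload layout, not stated by the schema), the best bid can be read straight from the JSONB:

SELECT timestamp,
       (bids -> 0 ->> 0)::numeric AS best_bid_price,  -- assumes [price, size] pairs, best level first
       (bids -> 0 ->> 1)::numeric AS best_bid_size
FROM order_book_snapshots
WHERE symbol = 'ETH/USDT'
ORDER BY timestamp DESC
LIMIT 1;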

3. order_book_1s_agg

Stores 1-second aggregated order book data with $1 price buckets.

Columns:

  • timestamp (TIMESTAMPTZ): Aggregation timestamp
  • symbol (VARCHAR): Trading pair
  • price_bucket (DECIMAL): Price bucket ($1 increments)
  • bid_volume, ask_volume (DECIMAL): Aggregated volumes
  • bid_count, ask_count (INTEGER): Number of orders
  • imbalance (DECIMAL): Order book imbalance

Primary Key: (timestamp, symbol, price_bucket)

4. order_book_imbalances

Stores multi-timeframe order book imbalance metrics.

Columns:

  • timestamp (TIMESTAMPTZ): Calculation timestamp
  • symbol (VARCHAR): Trading pair
  • imbalance_1s, imbalance_5s, imbalance_15s, imbalance_60s (DECIMAL): Imbalances
  • volume_imbalance_1s, volume_imbalance_5s, etc. (DECIMAL): Volume-weighted imbalances
  • price_range (DECIMAL): Price range used for calculation

Primary Key: (timestamp, symbol)

5. trade_events

Stores individual trade events.

Columns:

  • timestamp (TIMESTAMPTZ): Trade timestamp
  • symbol (VARCHAR): Trading pair
  • exchange (VARCHAR): Exchange name
  • price (DECIMAL): Trade price
  • size (DECIMAL): Trade size
  • side (VARCHAR): Trade side ('buy' or 'sell')
  • trade_id (VARCHAR): Unique trade identifier

Primary Key: (timestamp, symbol, exchange, trade_id)
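
The side column makes order-flow aggregation straightforward, e.g. per-minute buy/sell volume over the last hour:

SELECT time_bucket('1 minute', timestamp) AS bucket,
       side,
       sum(size) AS volume,
       count(*)  AS trades
FROM trade_events
WHERE symbol = 'ETH/USDT'
  AND timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY bucket, side
ORDER BY bucket, side;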

Continuous Aggregates

Continuous aggregates automatically generate higher timeframe data from lower timeframes:

  1. ohlcv_1m_continuous: 1-minute candles from 1-second data
  2. ohlcv_5m_continuous: 5-minute candles from 1-minute data
  3. ohlcv_15m_continuous: 15-minute candles from 5-minute data
  4. ohlcv_1h_continuous: 1-hour candles from 15-minute data
  5. ohlcv_1d_continuous: 1-day candles from 1-hour data
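
For reference, the 1-minute aggregate is likely defined along these lines (a sketch using TimescaleDB's first/last aggregates; the script's actual definition may differ):

CREATE MATERIALIZED VIEW ohlcv_1m_continuous
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', timestamp) AS bucket,
       symbol,
       first(open_price, timestamp) AS open_price,
       max(high_price)              AS high_price,
       min(low_price)               AS low_price,
       last(close_price, timestamp) AS close_price,
       sum(volume)                  AS volume
FROM ohlcv_data
WHERE timeframe = '1s'
GROUP BY bucket, symbol
WITH NO DATA;

A refresh policy (add_continuous_aggregate_policy) then keeps the view up to date as new rows arrive.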

Compression Policies

Data is automatically compressed to save storage:

  • ohlcv_data: Compress after 7 days
  • order_book_snapshots: Compress after 1 day
  • order_book_1s_agg: Compress after 2 days
  • order_book_imbalances: Compress after 2 days
  • trade_events: Compress after 7 days

Expected storage reduction from compression: more than 80%.
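
The schedules above map to TimescaleDB compression policies. A sketch for ohlcv_data (the segmentby columns are an assumption; the setup script may choose differently):

ALTER TABLE ohlcv_data SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'symbol, timeframe'
);
SELECT add_compression_policy('ohlcv_data', INTERVAL '7 days');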

Retention Policies

Old data is automatically deleted:

  • ohlcv_data: Retain for 2 years
  • order_book_snapshots: Retain for 30 days
  • order_book_1s_agg: Retain for 60 days
  • order_book_imbalances: Retain for 60 days
  • trade_events: Retain for 90 days
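
Each of these is a single add_retention_policy call, for example:

SELECT add_retention_policy('ohlcv_data', INTERVAL '2 years');
SELECT add_retention_policy('order_book_snapshots', INTERVAL '30 days');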

Performance Optimization

Indexes

All tables have optimized indexes for common query patterns:

  • Symbol + timestamp queries
  • Timeframe-specific queries
  • Exchange-specific queries
  • Multi-column composite indexes
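
As an illustration, a composite index covering the symbol + timeframe + time pattern would look like this (the index name is hypothetical; the script's actual names may differ):

CREATE INDEX IF NOT EXISTS idx_ohlcv_symbol_tf_time
    ON ohlcv_data (symbol, timeframe, timestamp DESC);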

Query Performance Targets

  • Cache reads: <10ms
  • Single timestamp queries: <100ms
  • Time range queries (1 hour): <500ms
  • Ingestion throughput: >1000 ops/sec

Best Practices

  1. Use time_bucket for aggregations:

    SELECT time_bucket('1 minute', timestamp) AS bucket,
           symbol,
           avg(close_price) AS avg_price
    FROM ohlcv_data
    WHERE symbol = 'ETH/USDT'
      AND timeframe = '1s'
      AND timestamp >= NOW() - INTERVAL '1 hour'
    GROUP BY bucket, symbol;
    
  2. Query specific timeframes:

    SELECT * FROM ohlcv_data
    WHERE symbol = 'ETH/USDT'
      AND timeframe = '1m'
      AND timestamp >= NOW() - INTERVAL '1 day'
    ORDER BY timestamp DESC;
    
  3. Use continuous aggregates for historical data:

    SELECT * FROM ohlcv_1h_continuous
    WHERE symbol = 'ETH/USDT'
      AND timestamp >= NOW() - INTERVAL '7 days'
    ORDER BY timestamp DESC;
    

Monitoring

Check Database Size

TimescaleDB 2.x exposes sizes through the hypertable_size function rather than byte columns on the information views:

SELECT
    hypertable_name,
    pg_size_pretty(hypertable_size(format('%I.%I', hypertable_schema, hypertable_name)::regclass)) AS total_size
FROM timescaledb_information.hypertables
WHERE hypertable_schema = 'public';
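
Compression savings (and how many chunks are compressed) are reported per hypertable by hypertable_compression_stats, e.g.:

SELECT pg_size_pretty(before_compression_total_bytes) AS before,
       pg_size_pretty(after_compression_total_bytes)  AS after,
       total_chunks,
       number_compressed_chunks
FROM hypertable_compression_stats('ohlcv_data');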

Check Chunk Information

SELECT
    hypertable_name,
    num_chunks,
    compression_enabled
FROM timescaledb_information.hypertables
WHERE hypertable_schema = 'public';

(Compressed-chunk counts come from hypertable_compression_stats, shown above.)

Check Continuous Aggregate Status

SELECT
    view_name,
    materialization_hypertable_name,
    pg_size_pretty(hypertable_size(format('%I.%I', materialization_hypertable_schema,
                                          materialization_hypertable_name)::regclass)) AS size
FROM timescaledb_information.continuous_aggregates
WHERE view_schema = 'public';

Troubleshooting

TimescaleDB Extension Not Found

If you see "TimescaleDB extension not found":

  1. Ensure TimescaleDB is installed
  2. Connect to database and run: CREATE EXTENSION timescaledb;
  3. Re-run the setup script

Connection Refused

If you see "connection refused":

  1. Check PostgreSQL is running: pg_isready
  2. Verify connection details in config.yaml
  3. Check firewall settings

Permission Denied

If you see "permission denied":

  1. Ensure database user has CREATE privileges
  2. Grant privileges: GRANT ALL PRIVILEGES ON DATABASE trading_data TO postgres;

Slow Queries

If queries are slow:

  1. Check if indexes exist: \di in psql
  2. Analyze query plan: EXPLAIN ANALYZE <your query>
  3. Ensure compression is enabled
  4. Consider adding more specific indexes
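
For example, to confirm that a typical recent-data query uses the composite index and prunes old chunks:

EXPLAIN ANALYZE
SELECT * FROM ohlcv_data
WHERE symbol = 'ETH/USDT'
  AND timeframe = '1m'
  AND timestamp >= NOW() - INTERVAL '1 day';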

Next Steps

After setup is complete:

  1. Implement data models (Task 2)
  2. Implement cache layer (Task 3)
  3. Implement database connection layer (Task 4)
  4. Start data migration from Parquet files (Task 7)

Support

For issues or questions:

  • Check TimescaleDB docs: https://docs.timescale.com/
  • Review PostgreSQL logs: tail -f /var/log/postgresql/postgresql-*.log
  • Enable debug logging in setup script