# Unified Data Storage Setup Guide

## Overview

The unified data storage system consolidates all market data storage into a single TimescaleDB backend, replacing fragmented Parquet files, pickle files, and in-memory caches.

## Prerequisites

### 1. PostgreSQL with TimescaleDB

You need PostgreSQL 12+ with the TimescaleDB extension installed.

#### Installation Options

**Option A: Docker (Recommended)**

```bash
docker run -d --name timescaledb \
  -p 5432:5432 \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=trading_data \
  timescale/timescaledb:latest-pg14
```

**Option B: Local Installation**

- Follow the TimescaleDB installation guide: https://docs.timescale.com/install/latest/
- Create the database: `createdb trading_data`

### 2. Python Dependencies

Ensure you have the required Python packages:

```bash
pip install asyncpg
```

## Database Configuration

Update your `config.yaml` with the database connection details:

```yaml
database:
  host: localhost
  port: 5432
  name: trading_data
  user: postgres
  password: postgres
  pool_size: 20
```
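
At runtime this section is turned into an `asyncpg` connection pool. A minimal sketch of building the DSN from the config above (the `make_dsn` helper is illustrative, not part of the setup script):

```python
def make_dsn(cfg: dict) -> str:
    """Build a postgresql:// DSN from the `database` section of config.yaml."""
    return (
        f"postgresql://{cfg['user']}:{cfg['password']}"
        f"@{cfg['host']}:{cfg['port']}/{cfg['name']}"
    )

cfg = {"host": "localhost", "port": 5432, "name": "trading_data",
       "user": "postgres", "password": "postgres", "pool_size": 20}
dsn = make_dsn(cfg)
print(dsn)  # postgresql://postgres:postgres@localhost:5432/trading_data
```

The resulting DSN can then be handed to `asyncpg.create_pool(dsn=dsn, max_size=cfg["pool_size"])`, which is the natural home for the `pool_size` setting.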

## Setup Process

### Step 1: Run Setup Script

```bash
python scripts/setup_unified_storage.py
```

This script will:

1. Connect to the database
2. Verify the TimescaleDB extension
3. Create all required tables
4. Convert the tables to hypertables
5. Create indexes for performance
6. Set up continuous aggregates
7. Configure compression policies
8. Configure retention policies
9. Verify the setup
10. Run basic operation tests

### Step 2: Verify Setup

The setup script will display schema information:

```
=== Schema Information ===
Migrations applied: 8
Tables created: 5
Hypertables: 5
Continuous aggregates: 5

=== Table Sizes ===
ohlcv_data: 8192 bytes
order_book_snapshots: 8192 bytes
order_book_1s_agg: 8192 bytes
order_book_imbalances: 8192 bytes
trade_events: 8192 bytes

=== Hypertables ===
ohlcv_data: 0 chunks, compression=enabled
order_book_snapshots: 0 chunks, compression=enabled
order_book_1s_agg: 0 chunks, compression=enabled
order_book_imbalances: 0 chunks, compression=enabled
trade_events: 0 chunks, compression=enabled

=== Continuous Aggregates ===
ohlcv_1m_continuous: 8192 bytes
ohlcv_5m_continuous: 8192 bytes
ohlcv_15m_continuous: 8192 bytes
ohlcv_1h_continuous: 8192 bytes
ohlcv_1d_continuous: 8192 bytes
```

## Database Schema

### Tables

#### 1. ohlcv_data

Stores candlestick data for all timeframes with pre-calculated technical indicators.

**Columns:**

- `timestamp` (TIMESTAMPTZ): Candle timestamp
- `symbol` (VARCHAR): Trading pair (e.g., 'ETH/USDT')
- `timeframe` (VARCHAR): Timeframe (1s, 1m, 5m, 15m, 1h, 1d)
- `open_price`, `high_price`, `low_price`, `close_price` (DECIMAL): OHLC prices
- `volume` (DECIMAL): Trading volume
- `trade_count` (INTEGER): Number of trades
- Technical indicators: `rsi_14`, `macd`, `macd_signal`, `bb_upper`, `bb_middle`, `bb_lower`, etc.

**Primary Key:** `(timestamp, symbol, timeframe)`
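
The indicator columns are computed before insertion. As an illustration of what `rsi_14` holds, here is a minimal RSI computation using Wilder smoothing; the ingestion code's exact indicator implementation may differ:

```python
def rsi(closes: list[float], period: int = 14) -> float:
    """Relative Strength Index via Wilder smoothing over close prices."""
    if len(closes) < period + 1:
        raise ValueError("need at least period+1 closes")
    deltas = [b - a for a, b in zip(closes, closes[1:])]
    gains = [max(d, 0.0) for d in deltas]
    losses = [max(-d, 0.0) for d in deltas]
    # Seed with simple averages, then apply Wilder's recursive smoothing
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period
    for g, l in zip(gains[period:], losses[period:]):
        avg_gain = (avg_gain * (period - 1) + g) / period
        avg_loss = (avg_loss * (period - 1) + l) / period
    if avg_loss == 0:
        return 100.0  # no losses in the window -> maximum RSI
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)

print(rsi([float(i) for i in range(1, 17)]))  # strictly rising closes -> 100.0
```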

#### 2. order_book_snapshots

Stores raw order book snapshots.

**Columns:**

- `timestamp` (TIMESTAMPTZ): Snapshot timestamp
- `symbol` (VARCHAR): Trading pair
- `exchange` (VARCHAR): Exchange name
- `bids` (JSONB): Bid levels (top 50)
- `asks` (JSONB): Ask levels (top 50)
- `mid_price`, `spread`, `bid_volume`, `ask_volume` (DECIMAL): Calculated metrics

**Primary Key:** `(timestamp, symbol, exchange)`
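
The calculated metric columns follow the standard definitions. A sketch of deriving them from the stored `bids`/`asks` levels, assuming the JSONB holds lists of `[price, size]` pairs (that layout is an assumption, not confirmed by the schema):

```python
def book_metrics(bids: list[list[float]], asks: list[list[float]]) -> dict:
    """Derive mid_price, spread, bid_volume, ask_volume from [price, size] levels.

    Assumes bids are sorted best (highest) first and asks best (lowest) first.
    """
    best_bid, best_ask = bids[0][0], asks[0][0]
    return {
        "mid_price": (best_bid + best_ask) / 2,
        "spread": best_ask - best_bid,
        "bid_volume": sum(size for _, size in bids),
        "ask_volume": sum(size for _, size in asks),
    }

m = book_metrics(bids=[[3000.0, 2.0], [2999.5, 1.0]],
                 asks=[[3000.5, 1.5], [3001.0, 3.0]])
print(m)  # mid_price 3000.25, spread 0.5, bid_volume 3.0, ask_volume 4.5
```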

#### 3. order_book_1s_agg

Stores 1-second aggregated order book data with $1 price buckets.

**Columns:**

- `timestamp` (TIMESTAMPTZ): Aggregation timestamp
- `symbol` (VARCHAR): Trading pair
- `price_bucket` (DECIMAL): Price bucket ($1 increments)
- `bid_volume`, `ask_volume` (DECIMAL): Aggregated volumes
- `bid_count`, `ask_count` (INTEGER): Number of orders
- `imbalance` (DECIMAL): Order book imbalance

**Primary Key:** `(timestamp, symbol, price_bucket)`
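
Bucketing a price into $1 increments is a floor to the bucket width. A sketch (the aggregation code's exact rounding rule is an assumption):

```python
from decimal import Decimal

def price_bucket(price: Decimal, width: Decimal = Decimal(1)) -> Decimal:
    """Floor a price into a fixed-width bucket ($1 increments by default)."""
    return (price // width) * width

print(price_bucket(Decimal("3421.57")))   # 3421: all levels in [3421, 3422) share it
print(price_bucket(Decimal("3421.57"), Decimal("0.5")))  # 3421.5 for $0.50 buckets
```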

#### 4. order_book_imbalances

Stores multi-timeframe order book imbalance metrics.

**Columns:**

- `timestamp` (TIMESTAMPTZ): Calculation timestamp
- `symbol` (VARCHAR): Trading pair
- `imbalance_1s`, `imbalance_5s`, `imbalance_15s`, `imbalance_60s` (DECIMAL): Imbalances
- `volume_imbalance_1s`, `volume_imbalance_5s`, etc. (DECIMAL): Volume-weighted imbalances
- `price_range` (DECIMAL): Price range used for calculation

**Primary Key:** `(timestamp, symbol)`
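
Order book imbalance is conventionally the normalized bid/ask volume difference over each window; whether the aggregation code uses exactly this formula is an assumption:

```python
def imbalance(bid_volume: float, ask_volume: float) -> float:
    """Normalized imbalance in [-1, 1]: +1 all bids, -1 all asks, 0 balanced."""
    total = bid_volume + ask_volume
    return 0.0 if total == 0 else (bid_volume - ask_volume) / total

print(imbalance(6.0, 2.0))  # 0.5 -> bid-heavy book
```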

#### 5. trade_events

Stores individual trade events.

**Columns:**

- `timestamp` (TIMESTAMPTZ): Trade timestamp
- `symbol` (VARCHAR): Trading pair
- `exchange` (VARCHAR): Exchange name
- `price` (DECIMAL): Trade price
- `size` (DECIMAL): Trade size
- `side` (VARCHAR): Trade side ('buy' or 'sell')
- `trade_id` (VARCHAR): Unique trade identifier

**Primary Key:** `(timestamp, symbol, exchange, trade_id)`

### Continuous Aggregates

Continuous aggregates automatically generate higher-timeframe data from lower timeframes:

1. **ohlcv_1m_continuous**: 1-minute candles from 1-second data
2. **ohlcv_5m_continuous**: 5-minute candles from 1-minute data
3. **ohlcv_15m_continuous**: 15-minute candles from 5-minute data
4. **ohlcv_1h_continuous**: 1-hour candles from 15-minute data
5. **ohlcv_1d_continuous**: 1-day candles from 1-hour data
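
Each step applies the standard OHLCV rollup: first open, max high, min low, last close, summed volume. A Python sketch of the same rollup the aggregates perform in the database:

```python
def rollup(candles: list[dict]) -> dict:
    """Combine consecutive lower-timeframe candles into one higher-timeframe candle."""
    return {
        "open": candles[0]["open"],            # first candle's open
        "high": max(c["high"] for c in candles),
        "low": min(c["low"] for c in candles),
        "close": candles[-1]["close"],         # last candle's close
        "volume": sum(c["volume"] for c in candles),
    }

# Five 1-minute candles -> one 5-minute candle
one_min = [
    {"open": 100, "high": 102, "low": 99,  "close": 101, "volume": 10},
    {"open": 101, "high": 103, "low": 100, "close": 102, "volume": 12},
    {"open": 102, "high": 104, "low": 101, "close": 103, "volume": 8},
    {"open": 103, "high": 105, "low": 102, "close": 104, "volume": 15},
    {"open": 104, "high": 106, "low": 103, "close": 105, "volume": 9},
]
print(rollup(one_min))
# {'open': 100, 'high': 106, 'low': 99, 'close': 105, 'volume': 54}
```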

### Compression Policies

Data is automatically compressed to save storage:

- **ohlcv_data**: Compress after 7 days
- **order_book_snapshots**: Compress after 1 day
- **order_book_1s_agg**: Compress after 2 days
- **order_book_imbalances**: Compress after 2 days
- **trade_events**: Compress after 7 days

Expected compression ratio: **>80%**
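
That figure is the space saved relative to the uncompressed size, the same quantity the Monitoring section's compression-stats query reports. A quick check of the arithmetic:

```python
def compression_ratio_pct(total_bytes: int, compressed_bytes: int) -> float:
    """Percentage of space saved by compression."""
    return round((1 - compressed_bytes / total_bytes) * 100, 2)

# 10 MB of raw chunks compressed down to 1.5 MB clears the >80% target
print(compression_ratio_pct(10_000_000, 1_500_000))  # 85.0
```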

### Retention Policies

Old data is automatically deleted:

- **ohlcv_data**: Retain for 2 years
- **order_book_snapshots**: Retain for 30 days
- **order_book_1s_agg**: Retain for 60 days
- **order_book_imbalances**: Retain for 60 days
- **trade_events**: Retain for 90 days

## Performance Optimization

### Indexes

All tables have optimized indexes for common query patterns:

- Symbol + timestamp queries
- Timeframe-specific queries
- Exchange-specific queries
- Multi-column composite indexes

### Query Performance Targets

- **Cache reads**: <10ms
- **Single timestamp queries**: <100ms
- **Time range queries (1 hour)**: <500ms
- **Ingestion throughput**: >1000 ops/sec

### Best Practices

1. **Use time_bucket for aggregations**:

   ```sql
   SELECT time_bucket('1 minute', timestamp) AS bucket,
          symbol,
          avg(close_price) AS avg_price
   FROM ohlcv_data
   WHERE symbol = 'ETH/USDT'
     AND timestamp >= NOW() - INTERVAL '1 hour'
   GROUP BY bucket, symbol;
   ```

2. **Query specific timeframes**:

   ```sql
   SELECT * FROM ohlcv_data
   WHERE symbol = 'ETH/USDT'
     AND timeframe = '1m'
     AND timestamp >= NOW() - INTERVAL '1 day'
   ORDER BY timestamp DESC;
   ```

3. **Use continuous aggregates for historical data**:

   ```sql
   SELECT * FROM ohlcv_1h_continuous
   WHERE symbol = 'ETH/USDT'
     AND timestamp >= NOW() - INTERVAL '7 days'
   ORDER BY timestamp DESC;
   ```
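
`time_bucket` floors each timestamp to the start of its bucket. When the cache layer needs to align with the database's buckets, the same flooring can be mirrored in stdlib Python (this epoch-based sketch matches `time_bucket` for second/minute/hour widths; it is an approximation, not the TimescaleDB implementation):

```python
from datetime import datetime, timedelta, timezone

def time_bucket(width: timedelta, ts: datetime) -> datetime:
    """Floor `ts` to the start of its bucket, like TimescaleDB's time_bucket."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    seconds = int((ts - epoch).total_seconds())
    bucket_start = seconds - seconds % int(width.total_seconds())
    return epoch + timedelta(seconds=bucket_start)

ts = datetime(2024, 5, 1, 12, 34, 56, tzinfo=timezone.utc)
print(time_bucket(timedelta(minutes=1), ts))  # 2024-05-01 12:34:00+00:00
print(time_bucket(timedelta(minutes=5), ts))  # 2024-05-01 12:30:00+00:00
```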

## Monitoring

### Check Database Size

On TimescaleDB 2.x, sizes and compression statistics are reported by functions rather than columns on the information views:

```sql
-- Total size per hypertable
SELECT hypertable_name,
       pg_size_pretty(hypertable_size(format('%I.%I', hypertable_schema, hypertable_name)::regclass)) AS total_size
FROM timescaledb_information.hypertables
WHERE hypertable_schema = 'public';

-- Compression statistics for one hypertable
SELECT pg_size_pretty(before_compression_total_bytes) AS uncompressed_size,
       pg_size_pretty(after_compression_total_bytes) AS compressed_size,
       ROUND((1 - after_compression_total_bytes::numeric
                  / before_compression_total_bytes::numeric) * 100, 2) AS space_saved_pct
FROM hypertable_compression_stats('ohlcv_data');
```

### Check Chunk Information

```sql
SELECT hypertable_name,
       num_chunks,
       compression_enabled
FROM timescaledb_information.hypertables
WHERE hypertable_schema = 'public';

-- Compressed chunk counts for one hypertable
SELECT total_chunks, number_compressed_chunks
FROM hypertable_compression_stats('ohlcv_data');
```

### Check Continuous Aggregate Status

```sql
SELECT view_name,
       materialization_hypertable_name,
       pg_size_pretty(hypertable_size(format('%I.%I',
           materialization_hypertable_schema,
           materialization_hypertable_name)::regclass)) AS size
FROM timescaledb_information.continuous_aggregates
WHERE view_schema = 'public';
```

## Troubleshooting

### TimescaleDB Extension Not Found

If you see "TimescaleDB extension not found":

1. Ensure TimescaleDB is installed
2. Connect to the database and run: `CREATE EXTENSION timescaledb;`
3. Re-run the setup script

### Connection Refused

If you see "connection refused":

1. Check that PostgreSQL is running: `pg_isready`
2. Verify the connection details in `config.yaml`
3. Check firewall settings

### Permission Denied

If you see "permission denied":

1. Ensure the database user has CREATE privileges
2. Grant privileges: `GRANT ALL PRIVILEGES ON DATABASE trading_data TO postgres;`

### Slow Queries

If queries are slow:

1. Check that indexes exist: `\di` in psql
2. Analyze the query plan: `EXPLAIN ANALYZE <your query>`
3. Ensure compression is enabled
4. Consider adding more specific indexes

## Next Steps

After setup is complete:

1. **Implement data models** (Task 2)
2. **Implement cache layer** (Task 3)
3. **Implement database connection layer** (Task 4)
4. **Start data migration** from Parquet files (Task 7)

## Support

For issues or questions:

- Check the TimescaleDB docs: https://docs.timescale.com/
- Review the PostgreSQL logs: `tail -f /var/log/postgresql/postgresql-*.log`
- Enable debug logging in the setup script