# Unified Data Storage Setup Guide

## Overview

The unified data storage system consolidates all market data storage into a single TimescaleDB backend, replacing fragmented Parquet files, pickle files, and in-memory caches.

## Prerequisites

### 1. PostgreSQL with TimescaleDB

You need PostgreSQL 12+ with the TimescaleDB extension installed.

#### Installation Options

**Option A: Docker (Recommended)**
```bash
docker run -d --name timescaledb \
  -p 5432:5432 \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=trading_data \
  timescale/timescaledb:latest-pg14
```

**Option B: Local Installation**
- Follow the TimescaleDB installation guide: https://docs.timescale.com/install/latest/
- Create the database: `createdb trading_data`

### 2. Python Dependencies

Ensure you have the required Python packages:

```bash
pip install asyncpg
```

## Database Configuration

Update your `config.yaml` with database connection details:

```yaml
database:
  host: localhost
  port: 5432
  name: trading_data
  user: postgres
  password: postgres
  pool_size: 20
```
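
Before running setup, you can sanity-check that the configured server is reachable and that TimescaleDB is available there. A minimal check from any SQL client (for example `psql -h localhost -p 5432 -U postgres -d trading_data`):

```sql
-- Returns one row if the TimescaleDB extension is available on this server;
-- installed_version is non-NULL once CREATE EXTENSION has been run.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'timescaledb';
```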

## Setup Process

### Step 1: Run Setup Script

```bash
python scripts/setup_unified_storage.py
```

This script will:
1. Connect to the database
2. Verify TimescaleDB extension
3. Create all required tables
4. Convert tables to hypertables
5. Create indexes for performance
6. Set up continuous aggregates
7. Configure compression policies
8. Configure retention policies
9. Verify the setup
10. Run basic operation tests
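
To make the list concrete, here is roughly what steps 4, 7, and 8 amount to in TimescaleDB terms for one table (`ohlcv_data`), matching the compression and retention policies listed later in this guide. This is an illustrative sketch; the script issues the real statements, and segmenting compression by `symbol` is an assumption:

```sql
-- Step 4: convert the plain table into a hypertable partitioned on timestamp.
SELECT create_hypertable('ohlcv_data', 'timestamp', if_not_exists => TRUE);

-- Step 7: enable compression, then compress chunks older than 7 days
-- (compress_segmentby = 'symbol' is an assumption, not confirmed by the script).
ALTER TABLE ohlcv_data SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'symbol'
);
SELECT add_compression_policy('ohlcv_data', INTERVAL '7 days');

-- Step 8: drop chunks older than the 2-year retention window.
SELECT add_retention_policy('ohlcv_data', INTERVAL '2 years');
```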

### Step 2: Verify Setup

The setup script will display schema information:

```
=== Schema Information ===
Migrations applied: 8
Tables created: 5
Hypertables: 5
Continuous aggregates: 5

=== Table Sizes ===
ohlcv_data: 8192 bytes
order_book_snapshots: 8192 bytes
order_book_1s_agg: 8192 bytes
order_book_imbalances: 8192 bytes
trade_events: 8192 bytes

=== Hypertables ===
ohlcv_data: 0 chunks, compression=enabled
order_book_snapshots: 0 chunks, compression=enabled
order_book_1s_agg: 0 chunks, compression=enabled
order_book_imbalances: 0 chunks, compression=enabled
trade_events: 0 chunks, compression=enabled

=== Continuous Aggregates ===
ohlcv_1m_continuous: 8192 bytes
ohlcv_5m_continuous: 8192 bytes
ohlcv_15m_continuous: 8192 bytes
ohlcv_1h_continuous: 8192 bytes
ohlcv_1d_continuous: 8192 bytes
```

## Database Schema

### Tables

#### 1. ohlcv_data
Stores candlestick data for all timeframes with pre-calculated technical indicators.

**Columns:**
- `timestamp` (TIMESTAMPTZ): Candle timestamp
- `symbol` (VARCHAR): Trading pair (e.g., 'ETH/USDT')
- `timeframe` (VARCHAR): Timeframe (1s, 1m, 5m, 15m, 1h, 1d)
- `open_price`, `high_price`, `low_price`, `close_price` (DECIMAL): OHLC prices
- `volume` (DECIMAL): Trading volume
- `trade_count` (INTEGER): Number of trades
- Technical indicators: `rsi_14`, `macd`, `macd_signal`, `bb_upper`, `bb_middle`, `bb_lower`, etc.

**Primary Key:** `(timestamp, symbol, timeframe)`
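
For orientation, a sketch of the DDL this description implies, assuming plain `DECIMAL` columns and `VARCHAR` widths, and showing only three of the indicator columns; the authoritative definition is whatever the setup script creates:

```sql
CREATE TABLE IF NOT EXISTS ohlcv_data (
    timestamp    TIMESTAMPTZ NOT NULL,
    symbol       VARCHAR(20) NOT NULL,  -- e.g. 'ETH/USDT'
    timeframe    VARCHAR(3)  NOT NULL,  -- '1s', '1m', '5m', '15m', '1h', '1d'
    open_price   DECIMAL     NOT NULL,
    high_price   DECIMAL     NOT NULL,
    low_price    DECIMAL     NOT NULL,
    close_price  DECIMAL     NOT NULL,
    volume       DECIMAL     NOT NULL,
    trade_count  INTEGER,
    rsi_14       DECIMAL,               -- indicator columns continue...
    macd         DECIMAL,
    macd_signal  DECIMAL,
    PRIMARY KEY (timestamp, symbol, timeframe)
);
```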

#### 2. order_book_snapshots
Stores raw order book snapshots.

**Columns:**
- `timestamp` (TIMESTAMPTZ): Snapshot timestamp
- `symbol` (VARCHAR): Trading pair
- `exchange` (VARCHAR): Exchange name
- `bids` (JSONB): Bid levels (top 50)
- `asks` (JSONB): Ask levels (top 50)
- `mid_price`, `spread`, `bid_volume`, `ask_volume` (DECIMAL): Calculated metrics

**Primary Key:** `(timestamp, symbol, exchange)`
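
As an illustration of the JSONB shape, a hypothetical insert assuming each level is stored as a `[price, size]` pair (the actual layout used by the ingest code may differ):

```sql
INSERT INTO order_book_snapshots
    (timestamp, symbol, exchange, bids, asks, mid_price, spread)
VALUES (
    NOW(), 'ETH/USDT', 'binance',
    '[[3000.5, 12.4], [3000.0, 8.1]]'::jsonb,  -- best bids first
    '[[3001.0, 9.7], [3001.5, 15.2]]'::jsonb,  -- best asks first
    3000.75, 0.5
);
```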

#### 3. order_book_1s_agg
Stores 1-second aggregated order book data with $1 price buckets.

**Columns:**
- `timestamp` (TIMESTAMPTZ): Aggregation timestamp
- `symbol` (VARCHAR): Trading pair
- `price_bucket` (DECIMAL): Price bucket ($1 increments)
- `bid_volume`, `ask_volume` (DECIMAL): Aggregated volumes
- `bid_count`, `ask_count` (INTEGER): Number of orders
- `imbalance` (DECIMAL): Order book imbalance

**Primary Key:** `(timestamp, symbol, price_bucket)`
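
A natural bucketing rule, assumed here rather than confirmed by the source, floors each level's price to a whole dollar:

```sql
-- Hypothetical bucketing rule: a price of 3001.37 lands in bucket 3001.
SELECT floor(3001.37) AS price_bucket;  -- => 3001
```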

#### 4. order_book_imbalances
Stores multi-timeframe order book imbalance metrics.

**Columns:**
- `timestamp` (TIMESTAMPTZ): Calculation timestamp
- `symbol` (VARCHAR): Trading pair
- `imbalance_1s`, `imbalance_5s`, `imbalance_15s`, `imbalance_60s` (DECIMAL): Imbalances
- `volume_imbalance_1s`, `volume_imbalance_5s`, etc. (DECIMAL): Volume-weighted imbalances
- `price_range` (DECIMAL): Price range used for calculation

**Primary Key:** `(timestamp, symbol)`
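
The guide does not pin down the imbalance formula; a common convention, used here as an assumption, normalizes the bid/ask volume difference into [-1, 1]:

```sql
-- Assumed convention: +1 = all bids, -1 = all asks, 0 = balanced.
-- NULLIF guards against division by zero on an empty book.
SELECT timestamp, symbol, price_bucket,
       (bid_volume - ask_volume) / NULLIF(bid_volume + ask_volume, 0) AS imbalance
FROM order_book_1s_agg
LIMIT 5;
```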

#### 5. trade_events
Stores individual trade events.

**Columns:**
- `timestamp` (TIMESTAMPTZ): Trade timestamp
- `symbol` (VARCHAR): Trading pair
- `exchange` (VARCHAR): Exchange name
- `price` (DECIMAL): Trade price
- `size` (DECIMAL): Trade size
- `side` (VARCHAR): Trade side ('buy' or 'sell')
- `trade_id` (VARCHAR): Unique trade identifier

**Primary Key:** `(timestamp, symbol, exchange, trade_id)`

### Continuous Aggregates

Continuous aggregates automatically generate higher timeframe data from lower timeframes:

1. **ohlcv_1m_continuous**: 1-minute candles from 1-second data
2. **ohlcv_5m_continuous**: 5-minute candles from 1-minute data
3. **ohlcv_15m_continuous**: 15-minute candles from 5-minute data
4. **ohlcv_1h_continuous**: 1-hour candles from 15-minute data
5. **ohlcv_1d_continuous**: 1-day candles from 1-hour data
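
For reference, a sketch of how the first of these is typically declared in TimescaleDB, assuming version 2.7+ (older releases restrict which aggregates a continuous aggregate may use); the actual definitions are created by the setup script:

```sql
CREATE MATERIALIZED VIEW ohlcv_1m_continuous
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', timestamp) AS bucket,
       symbol,
       first(open_price, timestamp)       AS open_price,
       max(high_price)                    AS high_price,
       min(low_price)                     AS low_price,
       last(close_price, timestamp)       AS close_price,
       sum(volume)                        AS volume
FROM ohlcv_data
WHERE timeframe = '1s'
GROUP BY bucket, symbol;
```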

### Compression Policies

Data is automatically compressed to save storage:

- **ohlcv_data**: Compress after 7 days
- **order_book_snapshots**: Compress after 1 day
- **order_book_1s_agg**: Compress after 2 days
- **order_book_imbalances**: Compress after 2 days
- **trade_events**: Compress after 7 days

Expected compression ratio: **>80%**

### Retention Policies

Old data is automatically deleted:

- **ohlcv_data**: Retain for 2 years
- **order_book_snapshots**: Retain for 30 days
- **order_book_1s_agg**: Retain for 60 days
- **order_book_imbalances**: Retain for 60 days
- **trade_events**: Retain for 90 days
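
To confirm the compression and retention jobs were registered, list TimescaleDB's background jobs (the job naming below follows TimescaleDB 2.x conventions):

```sql
SELECT job_id, application_name, schedule_interval, hypertable_name
FROM timescaledb_information.jobs
WHERE application_name LIKE 'Compression Policy%'
   OR application_name LIKE 'Retention Policy%'
ORDER BY hypertable_name;
```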

## Performance Optimization

### Indexes

All tables have optimized indexes for common query patterns:

- Symbol + timestamp queries
- Timeframe-specific queries
- Exchange-specific queries
- Multi-column composite indexes
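
As a sketch of the composite-index pattern (the actual index names and column order are defined by the setup script):

```sql
-- Composite index serving "symbol + timeframe + recent time range" lookups.
CREATE INDEX IF NOT EXISTS idx_ohlcv_symbol_timeframe_ts
    ON ohlcv_data (symbol, timeframe, timestamp DESC);
```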

### Query Performance Targets

- **Cache reads**: <10ms
- **Single timestamp queries**: <100ms
- **Time range queries (1 hour)**: <500ms
- **Ingestion throughput**: >1000 ops/sec
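
You can measure a query against these targets on your own hardware with `EXPLAIN ANALYZE`, which reports actual execution time; for example, a single-timestamp lookup (the literal timestamp is a placeholder):

```sql
EXPLAIN ANALYZE
SELECT * FROM ohlcv_data
WHERE symbol = 'ETH/USDT'
  AND timeframe = '1m'
  AND timestamp = '2024-01-01 00:00:00+00';
```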

### Best Practices

1. **Use time_bucket for aggregations**:
```sql
SELECT time_bucket('1 minute', timestamp) AS bucket,
       symbol,
       avg(close_price) AS avg_price
FROM ohlcv_data
WHERE symbol = 'ETH/USDT'
  AND timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY bucket, symbol;
```

2. **Query specific timeframes**:
```sql
SELECT * FROM ohlcv_data
WHERE symbol = 'ETH/USDT'
  AND timeframe = '1m'
  AND timestamp >= NOW() - INTERVAL '1 day'
ORDER BY timestamp DESC;
```

3. **Use continuous aggregates for historical data**:
```sql
SELECT * FROM ohlcv_1h_continuous
WHERE symbol = 'ETH/USDT'
  AND timestamp >= NOW() - INTERVAL '7 days'
ORDER BY timestamp DESC;
```

## Monitoring

### Check Database Size

In TimescaleDB 2.x the `timescaledb_information.hypertables` view exposes no size columns; use the `hypertable_size()` function instead:

```sql
SELECT
    hypertable_name,
    pg_size_pretty(hypertable_size(
        format('%I.%I', hypertable_schema, hypertable_name)::regclass)) AS total_size
FROM timescaledb_information.hypertables
WHERE hypertable_schema = 'public';
```
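
To see how much compression is saving on a given table, `hypertable_compression_stats()` reports before/after sizes; the saved-space percentage below corresponds to the >80% target mentioned earlier:

```sql
SELECT
    pg_size_pretty(before_compression_total_bytes) AS uncompressed,
    pg_size_pretty(after_compression_total_bytes)  AS compressed,
    ROUND((1 - after_compression_total_bytes::numeric
             / NULLIF(before_compression_total_bytes, 0)::numeric) * 100, 2)
        AS space_saved_pct
FROM hypertable_compression_stats('ohlcv_data');
```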

### Check Chunk Information

In TimescaleDB 2.x, per-chunk compression state lives in `timescaledb_information.chunks` (the `hypertables` view only carries the `compression_enabled` flag):

```sql
SELECT
    hypertable_name,
    count(*)                              AS num_chunks,
    count(*) FILTER (WHERE is_compressed) AS num_compressed_chunks
FROM timescaledb_information.chunks
WHERE hypertable_schema = 'public'
GROUP BY hypertable_name
ORDER BY hypertable_name;
```

### Check Continuous Aggregate Status

```sql
SELECT
    view_name,
    materialization_hypertable_name,
    pg_size_pretty(hypertable_size(
        format('%I.%I', materialization_hypertable_schema,
               materialization_hypertable_name)::regclass)) AS size
FROM timescaledb_information.continuous_aggregates
WHERE view_schema = 'public';
```

## Troubleshooting

### TimescaleDB Extension Not Found

If you see "TimescaleDB extension not found":

1. Ensure TimescaleDB is installed
2. Connect to the database and run: `CREATE EXTENSION timescaledb;`
3. Restart the setup script

### Connection Refused

If you see "connection refused":

1. Check that PostgreSQL is running: `pg_isready`
2. Verify the connection details in `config.yaml`
3. Check firewall settings

### Permission Denied

If you see "permission denied":

1. Ensure the database user has CREATE privileges
2. Grant privileges: `GRANT ALL PRIVILEGES ON DATABASE trading_data TO postgres;`

### Slow Queries

If queries are slow:

1. Check that the expected indexes exist: `\di` in psql
2. Analyze the query plan: `EXPLAIN ANALYZE <your query>`
3. Ensure compression is enabled
4. Consider adding more specific indexes

## Next Steps

After setup is complete:

1. **Implement data models** (Task 2)
2. **Implement cache layer** (Task 3)
3. **Implement database connection layer** (Task 4)
4. **Start data migration** from Parquet files (Task 7)

## Support

For issues or questions:
- Check the TimescaleDB docs: https://docs.timescale.com/
- Review the PostgreSQL logs: `tail -f /var/log/postgresql/postgresql-*.log`
- Enable debug logging in the setup script