uni data storage

Dobromir Popov
2025-10-20 09:48:59 +03:00
parent 002d0f7858
commit f464a412dc
12 changed files with 2905 additions and 181 deletions

@@ -0,0 +1,337 @@
# Unified Data Storage Setup Guide
## Overview
The unified data storage system consolidates all market data storage into a single TimescaleDB backend, replacing fragmented Parquet files, pickle files, and in-memory caches.
## Prerequisites
### 1. PostgreSQL with TimescaleDB
You need PostgreSQL 12+ with TimescaleDB extension installed.
#### Installation Options
**Option A: Docker (Recommended)**
```bash
docker run -d --name timescaledb \
  -p 5432:5432 \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=trading_data \
  timescale/timescaledb:latest-pg14
```
**Option B: Local Installation**
- Follow TimescaleDB installation guide: https://docs.timescale.com/install/latest/
- Create database: `createdb trading_data`
### 2. Python Dependencies
Ensure you have the required Python packages:
```bash
pip install asyncpg
```
## Database Configuration
Update your `config.yaml` with database connection details:
```yaml
database:
  host: localhost
  port: 5432
  name: trading_data
  user: postgres
  password: postgres
  pool_size: 20
```
## Setup Process
### Step 1: Run Setup Script
```bash
python scripts/setup_unified_storage.py
```
This script will:
1. Connect to the database
2. Verify TimescaleDB extension
3. Create all required tables
4. Convert tables to hypertables
5. Create indexes for performance
6. Set up continuous aggregates
7. Configure compression policies
8. Configure retention policies
9. Verify the setup
10. Run basic operation tests
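Steps 2 and 9 can also be performed by hand in `psql`. A minimal sketch, assuming the table names listed in the schema section below:
```sql
-- Confirm the extension is installed and active.
SELECT extversion FROM pg_extension WHERE extname = 'timescaledb';

-- List the hypertables the script created.
SELECT hypertable_name, num_chunks, compression_enabled
FROM timescaledb_information.hypertables
WHERE hypertable_schema = 'public';
```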
### Step 2: Verify Setup
The setup script will display schema information:
```
=== Schema Information ===
Migrations applied: 8
Tables created: 5
Hypertables: 5
Continuous aggregates: 5
=== Table Sizes ===
ohlcv_data: 8192 bytes
order_book_snapshots: 8192 bytes
order_book_1s_agg: 8192 bytes
order_book_imbalances: 8192 bytes
trade_events: 8192 bytes
=== Hypertables ===
ohlcv_data: 0 chunks, compression=enabled
order_book_snapshots: 0 chunks, compression=enabled
order_book_1s_agg: 0 chunks, compression=enabled
order_book_imbalances: 0 chunks, compression=enabled
trade_events: 0 chunks, compression=enabled
=== Continuous Aggregates ===
ohlcv_1m_continuous: 8192 bytes
ohlcv_5m_continuous: 8192 bytes
ohlcv_15m_continuous: 8192 bytes
ohlcv_1h_continuous: 8192 bytes
ohlcv_1d_continuous: 8192 bytes
```
## Database Schema
### Tables
#### 1. ohlcv_data
Stores candlestick data for all timeframes with pre-calculated technical indicators.
**Columns:**
- `timestamp` (TIMESTAMPTZ): Candle timestamp
- `symbol` (VARCHAR): Trading pair (e.g., 'ETH/USDT')
- `timeframe` (VARCHAR): Timeframe (1s, 1m, 5m, 15m, 1h, 1d)
- `open_price`, `high_price`, `low_price`, `close_price` (DECIMAL): OHLC prices
- `volume` (DECIMAL): Trading volume
- `trade_count` (INTEGER): Number of trades
- Technical indicators: `rsi_14`, `macd`, `macd_signal`, `bb_upper`, `bb_middle`, `bb_lower`, etc.
**Primary Key:** `(timestamp, symbol, timeframe)`
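As a rough illustration (the setup script's migrations are authoritative; the column types and the indicator list shown here are assumptions), the DDL amounts to something like:
```sql
CREATE TABLE IF NOT EXISTS ohlcv_data (
    timestamp    TIMESTAMPTZ    NOT NULL,
    symbol       VARCHAR(20)    NOT NULL,
    timeframe    VARCHAR(5)     NOT NULL,
    open_price   DECIMAL(20,8)  NOT NULL,
    high_price   DECIMAL(20,8)  NOT NULL,
    low_price    DECIMAL(20,8)  NOT NULL,
    close_price  DECIMAL(20,8)  NOT NULL,
    volume       DECIMAL(30,8)  NOT NULL,
    trade_count  INTEGER,
    rsi_14       DECIMAL(10,4),
    macd         DECIMAL(20,8),
    macd_signal  DECIMAL(20,8),
    PRIMARY KEY (timestamp, symbol, timeframe)
);

-- Convert to a hypertable partitioned on time.
SELECT create_hypertable('ohlcv_data', 'timestamp', if_not_exists => TRUE);
```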
#### 2. order_book_snapshots
Stores raw order book snapshots.
**Columns:**
- `timestamp` (TIMESTAMPTZ): Snapshot timestamp
- `symbol` (VARCHAR): Trading pair
- `exchange` (VARCHAR): Exchange name
- `bids` (JSONB): Bid levels (top 50)
- `asks` (JSONB): Ask levels (top 50)
- `mid_price`, `spread`, `bid_volume`, `ask_volume` (DECIMAL): Calculated metrics
**Primary Key:** `(timestamp, symbol, exchange)`
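The guide does not pin down the JSONB encoding; one plausible layout, used here for illustration only, stores each level as a `[price, size]` pair:
```sql
-- Hypothetical encoding: bids/asks as arrays of [price, size], best level first.
INSERT INTO order_book_snapshots
    (timestamp, symbol, exchange, bids, asks, mid_price, spread)
VALUES (
    NOW(), 'ETH/USDT', 'binance',
    '[[3650.10, 2.5], [3650.05, 1.2]]'::jsonb,
    '[[3650.20, 1.8], [3650.25, 3.0]]'::jsonb,
    3650.15, 0.10
);
```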
#### 3. order_book_1s_agg
Stores 1-second aggregated order book data with $1 price buckets.
**Columns:**
- `timestamp` (TIMESTAMPTZ): Aggregation timestamp
- `symbol` (VARCHAR): Trading pair
- `price_bucket` (DECIMAL): Price bucket ($1 increments)
- `bid_volume`, `ask_volume` (DECIMAL): Aggregated volumes
- `bid_count`, `ask_count` (INTEGER): Number of orders
- `imbalance` (DECIMAL): Order book imbalance
**Primary Key:** `(timestamp, symbol, price_bucket)`
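Bucketing is plain `floor()` arithmetic on price. A sketch that derives the bid side of one aggregated row from the latest raw snapshot, assuming the `[price, size]` JSONB layout shown above:
```sql
SELECT floor((level->>0)::numeric) AS price_bucket,
       sum((level->>1)::numeric)   AS bid_volume,
       count(*)                    AS bid_count
FROM order_book_snapshots s
CROSS JOIN LATERAL jsonb_array_elements(s.bids) AS level
WHERE s.symbol = 'ETH/USDT'
  AND s.timestamp = (SELECT max(timestamp)
                     FROM order_book_snapshots
                     WHERE symbol = 'ETH/USDT')
GROUP BY price_bucket
ORDER BY price_bucket DESC;
```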
#### 4. order_book_imbalances
Stores multi-timeframe order book imbalance metrics.
**Columns:**
- `timestamp` (TIMESTAMPTZ): Calculation timestamp
- `symbol` (VARCHAR): Trading pair
- `imbalance_1s`, `imbalance_5s`, `imbalance_15s`, `imbalance_60s` (DECIMAL): Imbalances
- `volume_imbalance_1s`, `volume_imbalance_5s`, etc. (DECIMAL): Volume-weighted imbalances
- `price_range` (DECIMAL): Price range used for calculation
**Primary Key:** `(timestamp, symbol)`
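The imbalance formula is not spelled out above; a common definition, assumed here, is `(bid_volume - ask_volume) / (bid_volume + ask_volume)`, which ranges from -1 (all asks) to +1 (all bids):
```sql
-- Assumed definition; the setup script's aggregation jobs are authoritative.
SELECT timestamp, symbol,
       (bid_volume - ask_volume)
       / NULLIF(bid_volume + ask_volume, 0) AS imbalance_1s
FROM order_book_snapshots
WHERE symbol = 'ETH/USDT'
ORDER BY timestamp DESC
LIMIT 10;
```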
#### 5. trade_events
Stores individual trade events.
**Columns:**
- `timestamp` (TIMESTAMPTZ): Trade timestamp
- `symbol` (VARCHAR): Trading pair
- `exchange` (VARCHAR): Exchange name
- `price` (DECIMAL): Trade price
- `size` (DECIMAL): Trade size
- `side` (VARCHAR): Trade side ('buy' or 'sell')
- `trade_id` (VARCHAR): Unique trade identifier
**Primary Key:** `(timestamp, symbol, exchange, trade_id)`
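Including `trade_id` in the key makes ingestion idempotent: replaying a feed cannot create duplicate rows. For example:
```sql
INSERT INTO trade_events (timestamp, symbol, exchange, price, size, side, trade_id)
VALUES (NOW(), 'ETH/USDT', 'binance', 3650.15, 0.25, 'buy', '123456789')
ON CONFLICT (timestamp, symbol, exchange, trade_id) DO NOTHING;
```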
### Continuous Aggregates
Continuous aggregates automatically generate higher timeframe data from lower timeframes:
1. **ohlcv_1m_continuous**: 1-minute candles from 1-second data
2. **ohlcv_5m_continuous**: 5-minute candles from 1-minute data
3. **ohlcv_15m_continuous**: 15-minute candles from 5-minute data
4. **ohlcv_1h_continuous**: 1-hour candles from 15-minute data
5. **ohlcv_1d_continuous**: 1-day candles from 1-hour data
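For reference, a continuous aggregate of this kind looks roughly as follows. This is a sketch assuming the `ohlcv_data` columns above; the refresh offsets are illustrative:
```sql
CREATE MATERIALIZED VIEW ohlcv_1m_continuous
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', timestamp)  AS bucket,
       symbol,
       first(open_price, timestamp)        AS open_price,
       max(high_price)                     AS high_price,
       min(low_price)                      AS low_price,
       last(close_price, timestamp)        AS close_price,
       sum(volume)                         AS volume
FROM ohlcv_data
WHERE timeframe = '1s'
GROUP BY bucket, symbol;

-- Refresh policy keeps the view current in the background.
SELECT add_continuous_aggregate_policy('ohlcv_1m_continuous',
    start_offset      => INTERVAL '1 hour',
    end_offset        => INTERVAL '1 minute',
    schedule_interval => INTERVAL '1 minute');
```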
### Compression Policies
Data is automatically compressed to save storage:
- **ohlcv_data**: Compress after 7 days
- **order_book_snapshots**: Compress after 1 day
- **order_book_1s_agg**: Compress after 2 days
- **order_book_imbalances**: Compress after 2 days
- **trade_events**: Compress after 7 days
Expected storage savings from compression: **>80%**
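In TimescaleDB, each policy is two statements per table. A sketch for `ohlcv_data` (the `segmentby`/`orderby` choices are assumptions):
```sql
ALTER TABLE ohlcv_data SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'symbol, timeframe',
    timescaledb.compress_orderby   = 'timestamp DESC'
);
SELECT add_compression_policy('ohlcv_data', INTERVAL '7 days');
```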
### Retention Policies
Old data is automatically deleted:
- **ohlcv_data**: Retain for 2 years
- **order_book_snapshots**: Retain for 30 days
- **order_book_1s_agg**: Retain for 60 days
- **order_book_imbalances**: Retain for 60 days
- **trade_events**: Retain for 90 days
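Each retention rule is a single policy call, e.g.:
```sql
SELECT add_retention_policy('ohlcv_data', INTERVAL '2 years');
SELECT add_retention_policy('order_book_snapshots', INTERVAL '30 days');
```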
## Performance Optimization
### Indexes
All tables have optimized indexes for common query patterns:
- Symbol + timestamp queries
- Timeframe-specific queries
- Exchange-specific queries
- Multi-column composite indexes
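The exact index names live in the migrations; one of the composite indexes presumably looks like:
```sql
CREATE INDEX IF NOT EXISTS idx_ohlcv_symbol_tf_time
    ON ohlcv_data (symbol, timeframe, timestamp DESC);
```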
### Query Performance Targets
- **Cache reads**: <10ms
- **Single timestamp queries**: <100ms
- **Time range queries (1 hour)**: <500ms
- **Ingestion throughput**: >1000 ops/sec
### Best Practices
1. **Use time_bucket for aggregations**:
```sql
SELECT time_bucket('1 minute', timestamp) AS bucket,
       symbol,
       avg(close_price) AS avg_price
FROM ohlcv_data
WHERE symbol = 'ETH/USDT'
  AND timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY bucket, symbol;
```
2. **Query specific timeframes**:
```sql
SELECT * FROM ohlcv_data
WHERE symbol = 'ETH/USDT'
  AND timeframe = '1m'
  AND timestamp >= NOW() - INTERVAL '1 day'
ORDER BY timestamp DESC;
```
3. **Use continuous aggregates for historical data**:
```sql
SELECT * FROM ohlcv_1h_continuous
WHERE symbol = 'ETH/USDT'
  AND timestamp >= NOW() - INTERVAL '7 days'
ORDER BY timestamp DESC;
```
## Monitoring
### Check Database Size
On TimescaleDB 2.x (the Docker image above), hypertable sizes come from the stats functions rather than from columns on the information views:
```sql
SELECT
    hypertable_name,
    pg_size_pretty(hypertable_size(format('%I.%I', hypertable_schema, hypertable_name)::regclass)) AS total_size
FROM timescaledb_information.hypertables
WHERE hypertable_schema = 'public';

-- Compression savings for one hypertable (covers compressed chunks only):
SELECT pg_size_pretty(before_compression_total_bytes) AS before,
       pg_size_pretty(after_compression_total_bytes)  AS after
FROM hypertable_compression_stats('ohlcv_data');
```
### Check Chunk Information
```sql
SELECT
    hypertable_name,
    num_chunks,
    compression_enabled
FROM timescaledb_information.hypertables
WHERE hypertable_schema = 'public';

-- Per-chunk compression status:
SELECT chunk_name, compression_status
FROM chunk_compression_stats('ohlcv_data');
```
### Check Continuous Aggregate Status
```sql
SELECT
    view_name,
    materialization_hypertable_name,
    pg_size_pretty(hypertable_size(format('%I.%I',
        materialization_hypertable_schema,
        materialization_hypertable_name)::regclass)) AS size
FROM timescaledb_information.continuous_aggregates
WHERE view_schema = 'public';
```
## Troubleshooting
### TimescaleDB Extension Not Found
If you see "TimescaleDB extension not found":
1. Ensure TimescaleDB is installed
2. Connect to the database and run: `CREATE EXTENSION timescaledb;`
3. Re-run the setup script
### Connection Refused
If you see "connection refused":
1. Check PostgreSQL is running: `pg_isready`
2. Verify connection details in `config.yaml`
3. Check firewall settings
### Permission Denied
If you see "permission denied":
1. Ensure database user has CREATE privileges
2. Grant privileges: `GRANT ALL PRIVILEGES ON DATABASE trading_data TO postgres;`
### Slow Queries
If queries are slow:
1. Check if indexes exist: `\di` in psql
2. Analyze query plan: `EXPLAIN ANALYZE <your query>`
3. Ensure compression is enabled
4. Consider adding more specific indexes
## Next Steps
After setup is complete:
1. **Implement data models** (Task 2)
2. **Implement cache layer** (Task 3)
3. **Implement database connection layer** (Task 4)
4. **Start data migration** from Parquet files (Task 7)
## Support
For issues or questions:
- Check TimescaleDB docs: https://docs.timescale.com/
- Review PostgreSQL logs: `tail -f /var/log/postgresql/postgresql-*.log`
- Enable debug logging in setup script