Design Document
Overview
The Multi-Exchange Data Aggregation System is a comprehensive data collection and processing subsystem designed to serve as the foundational data layer for the trading orchestrator. The system will collect real-time order book and OHLCV data from the top 10 cryptocurrency exchanges, aggregate it into standardized formats, store it in a TimescaleDB time-series database, and provide both live data feeds and historical replay capabilities.
The system follows a microservices architecture with containerized components, ensuring scalability, maintainability, and seamless integration with the existing trading infrastructure.
The implementation lives in the .\COBY subfolder for easy integration with the existing system.
Architecture
High-Level Architecture
graph TB
    subgraph "Exchange Connectors"
        E1[Binance WebSocket]
        E2[Coinbase WebSocket]
        E3[Kraken WebSocket]
        E4[Bybit WebSocket]
        E5[OKX WebSocket]
        E6[Huobi WebSocket]
        E7[KuCoin WebSocket]
        E8[Gate.io WebSocket]
        E9[Bitfinex WebSocket]
        E10[MEXC WebSocket]
    end
    subgraph "Data Processing Layer"
        DP[Data Processor]
        AGG[Aggregation Engine]
        NORM[Data Normalizer]
    end
    subgraph "Storage Layer"
        TSDB[(TimescaleDB)]
        CACHE[Redis Cache]
    end
    subgraph "API Layer"
        LIVE[Live Data API]
        REPLAY[Replay API]
        WEB[Web Dashboard]
    end
    subgraph "Integration Layer"
        ORCH[Orchestrator Interface]
        ADAPTER[Data Adapter]
    end
    E1 --> DP
    E2 --> DP
    E3 --> DP
    E4 --> DP
    E5 --> DP
    E6 --> DP
    E7 --> DP
    E8 --> DP
    E9 --> DP
    E10 --> DP
    DP --> NORM
    NORM --> AGG
    AGG --> TSDB
    AGG --> CACHE
    CACHE --> LIVE
    TSDB --> REPLAY
    LIVE --> WEB
    REPLAY --> WEB
    LIVE --> ADAPTER
    REPLAY --> ADAPTER
    ADAPTER --> ORCH
Component Architecture
The system is organized into several key components:
- Exchange Connectors: WebSocket clients for each exchange
- Data Processing Engine: Normalizes and validates incoming data
- Aggregation Engine: Creates price buckets and heatmaps
- Storage Layer: TimescaleDB for persistence, Redis for caching
- API Layer: REST and WebSocket APIs for data access
- Web Dashboard: Real-time visualization interface
- Integration Layer: Orchestrator-compatible interface
Components and Interfaces
Exchange Connector Interface
from __future__ import annotations  # defer evaluation of type names defined elsewhere
from typing import Callable

class ExchangeConnector:
    """Base interface for exchange WebSocket connectors"""

    async def connect(self) -> bool: ...
    async def disconnect(self) -> None: ...
    async def subscribe_orderbook(self, symbol: str) -> None: ...
    async def subscribe_trades(self, symbol: str) -> None: ...
    def get_connection_status(self) -> ConnectionStatus: ...
    def add_data_callback(self, callback: Callable) -> None: ...
Data Processing Interface
from __future__ import annotations  # defer evaluation of type names defined elsewhere
from typing import Dict, Union

class DataProcessor:
    """Processes and normalizes raw exchange data"""

    def normalize_orderbook(self, raw_data: Dict, exchange: str) -> OrderBookSnapshot: ...
    def normalize_trade(self, raw_data: Dict, exchange: str) -> TradeEvent: ...
    def validate_data(self, data: Union[OrderBookSnapshot, TradeEvent]) -> bool: ...
    def calculate_metrics(self, orderbook: OrderBookSnapshot) -> OrderBookMetrics: ...
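As a rough illustration of what calculate_metrics might compute, the sketch below derives the four summary values that the order_book_snapshots table stores (mid_price, spread, bid_volume, ask_volume). The simplified PriceLevel, the free-function form, and the dictionary return type are assumptions made for this example; the design does not specify the fields of OrderBookMetrics.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PriceLevel:
    """Simplified stand-in for the PriceLevel data model (assumption)."""
    price: float
    size: float


def calculate_metrics(bids: List[PriceLevel], asks: List[PriceLevel]) -> Dict[str, float]:
    """Return mid price, spread, and total volume per side.

    Best bid/ask are found with max/min, so the input lists
    do not need to be sorted.
    """
    if not bids or not asks:
        raise ValueError("both bids and asks must be non-empty")
    best_bid = max(level.price for level in bids)
    best_ask = min(level.price for level in asks)
    return {
        "mid_price": (best_bid + best_ask) / 2,
        "spread": best_ask - best_bid,
        "bid_volume": sum(level.size for level in bids),
        "ask_volume": sum(level.size for level in asks),
    }
```

A one-sided book raises an error here; a real implementation might instead store NULL for mid_price and spread, which the schema allows.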
Aggregation Engine Interface
from __future__ import annotations  # defer evaluation of type names defined elsewhere

class AggregationEngine:
    """Aggregates data into price buckets and heatmaps"""

    def create_price_buckets(self, orderbook: OrderBookSnapshot, bucket_size: float) -> PriceBuckets: ...
    def update_heatmap(self, symbol: str, buckets: PriceBuckets) -> HeatmapData: ...
    def calculate_imbalances(self, orderbook: OrderBookSnapshot) -> ImbalanceMetrics: ...
    def aggregate_across_exchanges(self, symbol: str) -> ConsolidatedOrderBook: ...
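One plausible way to implement the bucketing behind create_price_buckets is to floor each price to a multiple of the bucket size and sum the sizes that land in the same bucket. The helper name bucket_levels and the (price, size) tuple input are assumptions for this sketch, not part of the interface above.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def bucket_levels(levels: Iterable[Tuple[float, float]], bucket_size: float) -> Dict[float, float]:
    """Sum the sizes of (price, size) levels per bucket.

    Each bucket is keyed by its lower bound, i.e. the price
    floored to the nearest multiple of bucket_size.
    """
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    buckets: Dict[float, float] = defaultdict(float)
    for price, size in levels:
        lower_bound = math.floor(price / bucket_size) * bucket_size
        buckets[lower_bound] += size
    return dict(buckets)
```

With a bucket size of 10.0 (the configured BTC default), levels at 50003.0 and 50009.5 share the 50000.0 bucket while 50010.0 starts the next one. Float keys work for whole-dollar bucket sizes; for fractional sizes such as 0.1, Decimal keys would avoid rounding artifacts.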
Storage Interface
from __future__ import annotations  # defer evaluation of type names defined elsewhere
from datetime import datetime
from typing import Dict, List

class StorageManager:
    """Manages data persistence and retrieval"""

    async def store_orderbook(self, data: OrderBookSnapshot) -> bool: ...
    async def store_trade(self, data: TradeEvent) -> bool: ...
    async def get_historical_data(self, symbol: str, start: datetime, end: datetime) -> List[Dict]: ...
    async def get_latest_data(self, symbol: str) -> Dict: ...
    def setup_database_schema(self) -> None: ...
Replay Interface
from __future__ import annotations  # defer evaluation of type names defined elsewhere
from datetime import datetime

class ReplayManager:
    """Provides historical data replay functionality"""

    def create_replay_session(self, start_time: datetime, end_time: datetime, speed: float) -> str: ...
    async def start_replay(self, session_id: str) -> None: ...
    async def pause_replay(self, session_id: str) -> None: ...
    async def stop_replay(self, session_id: str) -> None: ...
    def get_replay_status(self, session_id: str) -> ReplayStatus: ...
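The speed parameter of create_replay_session suggests that stored events are re-emitted with their original spacing divided by a multiplier. The following sketch shows one way to do that; replay_delays, replay, and the (timestamp, payload) event shape are hypothetical names chosen for illustration.

```python
import asyncio
from datetime import datetime, timedelta
from typing import AsyncIterator, List, Tuple

Event = Tuple[datetime, dict]  # (original timestamp, payload); hypothetical shape


def replay_delays(timestamps: List[datetime], speed: float) -> List[float]:
    """Seconds to wait before emitting each event at the given speed multiplier."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not timestamps:
        return []
    delays = [0.0]  # the first event is emitted immediately
    for previous, current in zip(timestamps, timestamps[1:]):
        gap = (current - previous).total_seconds()
        delays.append(max(gap, 0.0) / speed)  # out-of-order events are not delayed
    return delays


async def replay(events: List[Event], speed: float) -> AsyncIterator[Event]:
    """Yield events in order, sleeping the scaled gap between them."""
    delays = replay_delays([ts for ts, _ in events], speed)
    for delay, event in zip(delays, events):
        if delay:
            await asyncio.sleep(delay)
        yield event
```

At speed 2.0, events recorded 2 and 4 seconds apart are emitted 1 and 2 seconds apart. Pausing, as exposed by pause_replay, could be layered on top with an asyncio.Event checked before each yield.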
Data Models
Core Data Structures
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional

# PriceLevel and HeatmapPoint are defined before the classes that reference them.

@dataclass
class PriceLevel:
    """Individual price level in order book"""
    price: float
    size: float
    count: Optional[int] = None

@dataclass
class OrderBookSnapshot:
    """Standardized order book snapshot"""
    symbol: str
    exchange: str
    timestamp: datetime
    bids: List[PriceLevel]
    asks: List[PriceLevel]
    sequence_id: Optional[int] = None

@dataclass
class TradeEvent:
    """Standardized trade event"""
    symbol: str
    exchange: str
    timestamp: datetime
    price: float
    size: float
    side: str  # 'buy' or 'sell'
    trade_id: str

@dataclass
class PriceBuckets:
    """Aggregated price buckets for heatmap"""
    symbol: str
    timestamp: datetime
    bucket_size: float
    bid_buckets: Dict[float, float]  # price -> volume
    ask_buckets: Dict[float, float]  # price -> volume

@dataclass
class HeatmapPoint:
    """Individual heatmap data point"""
    price: float
    volume: float
    intensity: float  # 0.0 to 1.0
    side: str  # 'bid' or 'ask'

@dataclass
class HeatmapData:
    """Heatmap visualization data"""
    symbol: str
    timestamp: datetime
    bucket_size: float
    data: List[HeatmapPoint]
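The intensity field is documented only as a value between 0.0 and 1.0. One simple reading, assumed here, is linear scaling against the largest bucket volume across both sides. The function name to_heatmap_points and the tuple output are illustrative stand-ins for HeatmapPoint.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float, float, str]  # (price, volume, intensity, side)


def to_heatmap_points(bid_buckets: Dict[float, float], ask_buckets: Dict[float, float]) -> List[Point]:
    """Convert price -> volume buckets into heatmap points.

    Intensity is volume divided by the largest volume seen on
    either side, so the deepest bucket always has intensity 1.0.
    """
    all_volumes = list(bid_buckets.values()) + list(ask_buckets.values())
    max_volume = max(all_volumes, default=0.0)
    points: List[Point] = []
    for side, buckets in (("bid", bid_buckets), ("ask", ask_buckets)):
        for price in sorted(buckets):
            volume = buckets[price]
            intensity = volume / max_volume if max_volume > 0 else 0.0
            points.append((price, volume, intensity, side))
    return points
```

Linear scaling lets a single very large order wash out the rest of the map; a logarithmic or percentile-based scale is a common alternative if that becomes a problem.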
Database Schema
TimescaleDB Tables
-- Order book snapshots table
CREATE TABLE order_book_snapshots (
    id BIGSERIAL,
    symbol VARCHAR(20) NOT NULL,
    exchange VARCHAR(20) NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL,
    bids JSONB NOT NULL,
    asks JSONB NOT NULL,
    sequence_id BIGINT,
    mid_price DECIMAL(20,8),
    spread DECIMAL(20,8),
    bid_volume DECIMAL(30,8),
    ask_volume DECIMAL(30,8),
    PRIMARY KEY (timestamp, symbol, exchange)
);

-- Convert to hypertable
SELECT create_hypertable('order_book_snapshots', 'timestamp');
-- Trade events table
CREATE TABLE trade_events (
    id BIGSERIAL,
    symbol VARCHAR(20) NOT NULL,
    exchange VARCHAR(20) NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL,
    price DECIMAL(20,8) NOT NULL,
    size DECIMAL(30,8) NOT NULL,
    side VARCHAR(4) NOT NULL,
    trade_id VARCHAR(100) NOT NULL,
    PRIMARY KEY (timestamp, symbol, exchange, trade_id)
);

-- Convert to hypertable
SELECT create_hypertable('trade_events', 'timestamp');
-- Aggregated heatmap data table
CREATE TABLE heatmap_data (
    symbol VARCHAR(20) NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL,
    bucket_size DECIMAL(10,2) NOT NULL,
    price_bucket DECIMAL(20,8) NOT NULL,
    volume DECIMAL(30,8) NOT NULL,
    side VARCHAR(3) NOT NULL,
    exchange_count INTEGER NOT NULL,
    PRIMARY KEY (timestamp, symbol, bucket_size, price_bucket, side)
);

-- Convert to hypertable
SELECT create_hypertable('heatmap_data', 'timestamp');
-- OHLCV data table
CREATE TABLE ohlcv_data (
    symbol VARCHAR(20) NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL,
    timeframe VARCHAR(10) NOT NULL,
    open_price DECIMAL(20,8) NOT NULL,
    high_price DECIMAL(20,8) NOT NULL,
    low_price DECIMAL(20,8) NOT NULL,
    close_price DECIMAL(20,8) NOT NULL,
    volume DECIMAL(30,8) NOT NULL,
    trade_count INTEGER,
    PRIMARY KEY (timestamp, symbol, timeframe)
);

-- Convert to hypertable
SELECT create_hypertable('ohlcv_data', 'timestamp');
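The design does not say how rows in ohlcv_data are produced. If they are derived from trade events rather than fetched from the exchanges, the aggregation could look roughly like the sketch below. The function build_candles and the (timestamp, price, size) trade tuple are assumptions; the output keys mirror the ohlcv_data column names. Timestamps are assumed to be timezone-aware.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List, Tuple

Trade = Tuple[datetime, float, float]  # (timestamp, price, size); hypothetical shape


def build_candles(trades: Iterable[Trade], timeframe: timedelta) -> List[Dict]:
    """Group trades into fixed-width candles aligned to the Unix epoch."""
    width = timeframe.total_seconds()
    candles: Dict[float, Dict] = {}
    # Sort by time so that open and close reflect the first and last trade.
    for ts, price, size in sorted(trades, key=lambda t: t[0]):
        start = (ts.timestamp() // width) * width
        candle = candles.get(start)
        if candle is None:
            candles[start] = {
                "timestamp": datetime.fromtimestamp(start, tz=timezone.utc),
                "open_price": price,
                "high_price": price,
                "low_price": price,
                "close_price": price,
                "volume": size,
                "trade_count": 1,
            }
        else:
            candle["high_price"] = max(candle["high_price"], price)
            candle["low_price"] = min(candle["low_price"], price)
            candle["close_price"] = price
            candle["volume"] += size
            candle["trade_count"] += 1
    return [candles[start] for start in sorted(candles)]
```

Intervals without trades produce no candle in this sketch. TimescaleDB continuous aggregates over trade_events would be a database-side alternative to computing candles in application code.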
Error Handling
Connection Management
The system implements robust error handling for exchange connections:
- Exponential Backoff: Failed connections retry with increasing delays
- Circuit Breaker: Temporarily disable problematic exchanges
- Graceful Degradation: Continue operation with available exchanges
- Health Monitoring: Continuous monitoring of connection status
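The backoff and circuit-breaker behavior listed above can be sketched as follows. The base delay, cap, failure threshold, and reset timeout are illustrative defaults, not values taken from this design, and the class and function names are hypothetical.

```python
import random
import time
from typing import Callable, Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, jitter: float = 0.0) -> float:
    """Delay before retry number `attempt` (0-based).

    The delay doubles with each attempt up to `cap`; `jitter` adds up to
    that fraction of the delay at random to avoid synchronized reconnects.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, jitter * delay)


class CircuitBreaker:
    """Blocks requests after repeated failures, then allows a retry after a timeout."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open (or re-open) the circuit

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Once the timeout has elapsed, let a trial request through.
        return self.clock() - self.opened_at >= self.reset_timeout
```

A connector loop would check allow_request() before each connect attempt, sleep for backoff_delay(attempt) after a failure, and call record_success() once the socket is healthy. Graceful degradation then follows naturally: exchanges whose breaker is open are skipped while the others keep streaming.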
Data Validation
All incoming data undergoes validation:
- Schema Validation: Ensure data structure compliance
- Range Validation: Check price and volume ranges
- Timestamp Validation: Verify temporal consistency
- Duplicate Detection: Prevent duplicate data storage
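A minimal sketch of the four checks applied to a single trade is shown below. The function validate_trade, the five-second clock-skew tolerance, and the in-memory set used for duplicate detection are assumptions for illustration; a production system would likely bound the set's size or rely on the trade_events primary key instead.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Set, Tuple

VALID_SIDES = {"buy", "sell"}


def validate_trade(
    exchange: str,
    trade_id: str,
    price: float,
    size: float,
    side: str,
    timestamp: datetime,
    seen: Set[Tuple[str, str]],
    now: Optional[datetime] = None,
    max_future_skew: timedelta = timedelta(seconds=5),
) -> List[str]:
    """Return a list of validation errors; an empty list means the trade is valid.

    Valid trades are recorded in `seen` so that a repeat is flagged as a duplicate.
    """
    errors: List[str] = []
    if side not in VALID_SIDES:
        errors.append("schema: side must be 'buy' or 'sell'")
    if price <= 0 or size <= 0:
        errors.append("range: price and size must be positive")
    now = now or datetime.now(timezone.utc)
    if timestamp > now + max_future_skew:
        errors.append("timestamp: event is in the future")
    key = (exchange, trade_id)
    if key in seen:
        errors.append("duplicate: trade already seen")
    if not errors:
        seen.add(key)
    return errors
```

Returning a list of reasons rather than a bare boolean makes rejected messages easier to log and count per exchange.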
Database Resilience
Database operations include comprehensive error handling:
- Connection Pooling: Maintain multiple database connections
- Transaction Management: Ensure data consistency
- Retry Logic: Automatic retry for transient failures
- Backup Strategies: Regular data backups and recovery procedures
Testing Strategy
Unit Testing
Each component will have comprehensive unit tests:
- Exchange Connectors: Mock WebSocket responses
- Data Processing: Test normalization and validation
- Aggregation Engine: Verify bucket calculations
- Storage Layer: Test database operations
- API Layer: Test endpoint responses
Integration Testing
End-to-end testing scenarios:
- Multi-Exchange Data Flow: Test complete data pipeline
- Database Integration: Verify TimescaleDB operations
- API Integration: Test orchestrator interface compatibility
- Performance Testing: Load testing with high-frequency data
Performance Testing
Performance benchmarks and testing:
- Throughput Testing: Measure data processing capacity
- Latency Testing: Measure end-to-end data latency
- Memory Usage: Monitor memory consumption patterns
- Database Performance: Query performance optimization
Monitoring and Observability
Comprehensive monitoring system:
- Metrics Collection: Prometheus-compatible metrics
- Logging: Structured logging with correlation IDs
- Alerting: Real-time alerts for system issues
- Dashboards: Grafana dashboards for system monitoring
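Structured logging with correlation IDs can be done with the standard library alone. The sketch below emits one JSON object per log line and attaches an ID held in a context variable; the field names and the build_logger helper are assumptions, not a prescribed log format.

```python
import contextvars
import json
import logging
import sys
from typing import TextIO

# Holds the ID of the message or request currently being processed.
correlation_id: contextvars.ContextVar[str] = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(name: str, stream: TextIO = sys.stderr) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger
```

Because context variables are copied into each asyncio task, setting correlation_id at the start of a handler keeps the ID attached to every log line that task emits.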
Deployment Architecture
Docker Containerization
The system will be deployed using Docker containers:
# docker-compose.yml
version: '3.8'
services:
  timescaledb:
    image: timescale/timescaledb:latest-pg14
    environment:
      POSTGRES_DB: market_data
      POSTGRES_USER: market_user
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - timescale_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
  data-aggregator:
    build: ./data-aggregator
    environment:
      - DB_HOST=timescaledb
      - REDIS_HOST=redis
      - LOG_LEVEL=INFO
    depends_on:
      - timescaledb
      - redis
  web-dashboard:
    build: ./web-dashboard
    ports:
      - "8080:8080"
    environment:
      - API_HOST=data-aggregator
    depends_on:
      - data-aggregator
volumes:
  timescale_data:
  redis_data:
Configuration Management
Environment-based configuration:
# config.py
import os
from dataclasses import dataclass, field
from typing import List

@dataclass
class Config:
    # Database settings (environment variables are read once, when this module is imported)
    db_host: str = os.getenv('DB_HOST', 'localhost')
    db_port: int = int(os.getenv('DB_PORT', '5432'))
    db_name: str = os.getenv('DB_NAME', 'market_data')
    db_user: str = os.getenv('DB_USER', 'market_user')
    db_password: str = os.getenv('DB_PASSWORD', '')
    # Redis settings
    redis_host: str = os.getenv('REDIS_HOST', 'localhost')
    redis_port: int = int(os.getenv('REDIS_PORT', '6379'))
    # Exchange settings
    exchanges: List[str] = field(default_factory=lambda: [
        'binance', 'coinbase', 'kraken', 'bybit', 'okx',
        'huobi', 'kucoin', 'gateio', 'bitfinex', 'mexc'
    ])
    # Aggregation settings
    btc_bucket_size: float = 10.0  # $10 USD buckets for BTC
    eth_bucket_size: float = 1.0  # $1 USD buckets for ETH
    # Performance settings
    max_connections_per_exchange: int = 5
    data_buffer_size: int = 10000
    batch_write_size: int = 1000
    # API settings
    api_host: str = os.getenv('API_HOST', '0.0.0.0')
    api_port: int = int(os.getenv('API_PORT', '8080'))
    websocket_port: int = int(os.getenv('WS_PORT', '8081'))
This design gives the trading orchestrator a single, scalable data layer for multi-exchange aggregation, and it leaves room for future enhancements and additional exchange integrations.