18 KiB
18 KiB
RL Training Pipeline Audit and Improvements
Current State Analysis
1. Existing RL Training Components
Current Architecture:
- EnhancedDQNAgent: Main RL agent with dueling DQN architecture
- EnhancedRLTrainer: Training coordinator with prioritized experience replay
- PrioritizedReplayBuffer: Experience replay with priority sampling
- RLTrainer: Basic training pipeline for scalping scenarios
Current Data Input Structure:
# Current MarketState in enhanced_orchestrator.py
@dataclass
class MarketState:
symbol: str
timestamp: datetime
prices: Dict[str, float] # {timeframe: current_price}
features: Dict[str, np.ndarray] # {timeframe: feature_matrix}
volatility: float
volume: float
trend_strength: float
market_regime: str # 'trending', 'ranging', 'volatile'
universal_data: UniversalDataStream
Current State Conversion:
- Limited to basic market metrics (volatility, volume, trend)
- Missing tick-level features
- No multi-symbol correlation data
- No CNN hidden layer integration
- Incomplete implementation of required data format
Critical Issues Identified
1. Insufficient Data Input (CRITICAL)
Current Problem: RL model only receives basic market metrics, missing required data:
- ❌ 300s of raw tick data for momentum detection
- ❌ Multi-timeframe OHLCV (1s, 1m, 1h, 1d) for both ETH and BTC
- ❌ CNN hidden layer features
- ❌ CNN predictions from all timeframes
- ❌ Pivot point predictions
Required Input per Specification:
ETH:
- 300s max of raw ticks data (detecting single big moves and momentum)
- 300s of 1s OHLCV data (5 min)
- 300 OHLCV + indicators bars of each 1m 1h 1d and 1s BTC
RL model should have access to:
- Last hidden layers of the CNN model where patterns are learned
- CNN output (predictions) for each timeframe (1s 1m 1h 1d)
- Next expected pivot point predictions
2. Inadequate State Representation
Current Issues:
- State size fixed at 100 features (too small)
- No standardization/normalization
- Missing temporal sequence information
- No multi-symbol context
3. Training Pipeline Limitations
- No real-time tick processing integration
- Missing CNN feature integration
- Limited reward engineering
- No market regime-specific training
4. Missing Pivot Point Integration
- No pivot point calculation system
- No recursive trend analysis
- Missing Williams market structure implementation
Comprehensive Improvement Plan
Phase 1: Enhanced State Representation
1.1 Create Comprehensive State Builder
class EnhancedRLStateBuilder:
"""Build comprehensive RL state from all available data sources"""
def __init__(self, config):
self.tick_window = 300 # 300s of ticks
self.ohlcv_window = 300 # 300 1s bars
self.state_components = {
'eth_ticks': 300 * 10, # ~10 features per tick
'eth_1s_ohlcv': 300 * 8, # OHLCV + indicators
'eth_1m_ohlcv': 300 * 8, # 300 1m bars
'eth_1h_ohlcv': 300 * 8, # 300 1h bars
'eth_1d_ohlcv': 300 * 8, # 300 1d bars
'btc_reference': 300 * 8, # BTC reference data
'cnn_features': 512, # CNN hidden layer features
'cnn_predictions': 16, # CNN predictions (4 timeframes * 4 outputs)
'pivot_points': 50, # Recursive pivot points
'market_regime': 10 # Market regime features
}
self.total_state_size = sum(self.state_components.values()) # ~8000+ features
1.2 Multi-Symbol Data Integration
def build_rl_state(self, universal_stream: UniversalDataStream,
cnn_hidden_features: Dict = None,
cnn_predictions: Dict = None) -> np.ndarray:
"""Build comprehensive RL state vector"""
state_vector = []
# 1. ETH Tick Data (300s window)
eth_tick_features = self._process_tick_data(
universal_stream.eth_ticks, window_size=300
)
state_vector.extend(eth_tick_features)
# 2. ETH Multi-timeframe OHLCV
for timeframe in ['1s', '1m', '1h', '1d']:
ohlcv_features = self._process_ohlcv_data(
getattr(universal_stream, f'eth_{timeframe}'),
timeframe=timeframe,
window_size=300
)
state_vector.extend(ohlcv_features)
# 3. BTC Reference Data
btc_features = self._process_btc_reference(universal_stream.btc_ticks)
state_vector.extend(btc_features)
# 4. CNN Hidden Layer Features
if cnn_hidden_features:
cnn_hidden = self._process_cnn_hidden_features(cnn_hidden_features)
state_vector.extend(cnn_hidden)
else:
state_vector.extend([0.0] * self.state_components['cnn_features'])
# 5. CNN Predictions
if cnn_predictions:
cnn_pred = self._process_cnn_predictions(cnn_predictions)
state_vector.extend(cnn_pred)
else:
state_vector.extend([0.0] * self.state_components['cnn_predictions'])
# 6. Pivot Points
pivot_features = self._calculate_recursive_pivot_points(universal_stream)
state_vector.extend(pivot_features)
# 7. Market Regime Features
regime_features = self._extract_market_regime_features(universal_stream)
state_vector.extend(regime_features)
return np.array(state_vector, dtype=np.float32)
Phase 2: Pivot Point System Implementation
2.1 Williams Market Structure Pivot Points
class WilliamsMarketStructure:
"""Implementation of Larry Williams market structure analysis"""
def calculate_recursive_pivot_points(self, ohlcv_data: np.ndarray) -> Dict:
"""Calculate 5 levels of recursive pivot points"""
levels = {}
current_data = ohlcv_data
for level in range(5):
# Find swing highs and lows
swing_points = self._find_swing_points(current_data)
# Determine trend direction
trend_direction = self._determine_trend_direction(swing_points)
levels[f'level_{level}'] = {
'swing_points': swing_points,
'trend_direction': trend_direction,
'trend_strength': self._calculate_trend_strength(swing_points)
}
# Use swing points as input for next level
if len(swing_points) >= 5:
current_data = self._convert_swings_to_ohlcv(swing_points)
else:
break
return levels
def _find_swing_points(self, ohlcv_data: np.ndarray) -> List[Dict]:
"""Find swing highs and lows (higher lows/lower highs on both sides)"""
swing_points = []
for i in range(2, len(ohlcv_data) - 2):
current_high = ohlcv_data[i, 2] # High price
current_low = ohlcv_data[i, 3] # Low price
# Check for swing high (lower highs on both sides)
if (current_high > ohlcv_data[i-1, 2] and
current_high > ohlcv_data[i-2, 2] and
current_high > ohlcv_data[i+1, 2] and
current_high > ohlcv_data[i+2, 2]):
swing_points.append({
'type': 'swing_high',
'timestamp': ohlcv_data[i, 0],
'price': current_high,
'index': i
})
# Check for swing low (higher lows on both sides)
if (current_low < ohlcv_data[i-1, 3] and
current_low < ohlcv_data[i-2, 3] and
current_low < ohlcv_data[i+1, 3] and
current_low < ohlcv_data[i+2, 3]):
swing_points.append({
'type': 'swing_low',
'timestamp': ohlcv_data[i, 0],
'price': current_low,
'index': i
})
return swing_points
Phase 3: CNN Integration Layer
3.1 CNN-RL Bridge
class CNNRLBridge:
"""Bridge between CNN and RL models for feature sharing"""
def __init__(self, cnn_models: Dict, rl_agents: Dict):
self.cnn_models = cnn_models
self.rl_agents = rl_agents
self.feature_cache = {}
async def extract_cnn_features_for_rl(self, universal_stream: UniversalDataStream) -> Dict:
"""Extract CNN hidden layer features and predictions for RL"""
cnn_features = {
'hidden_features': {},
'predictions': {},
'confidences': {}
}
for timeframe in ['1s', '1m', '1h', '1d']:
if timeframe in self.cnn_models:
model = self.cnn_models[timeframe]
# Get input data for this timeframe
timeframe_data = getattr(universal_stream, f'eth_{timeframe}')
if len(timeframe_data) > 0:
# Extract hidden layer features
hidden_features = await self._extract_hidden_features(
model, timeframe_data
)
cnn_features['hidden_features'][timeframe] = hidden_features
# Get predictions
predictions, confidence = await model.predict(timeframe_data)
cnn_features['predictions'][timeframe] = predictions
cnn_features['confidences'][timeframe] = confidence
return cnn_features
async def _extract_hidden_features(self, model, data: np.ndarray) -> np.ndarray:
"""Extract hidden layer features from CNN model"""
try:
# Hook into the model's hidden layers
activation = {}
def get_activation(name):
def hook(model, input, output):
activation[name] = output.detach()
return hook
# Register hook on the last hidden layer before output
handle = model.fc_hidden.register_forward_hook(get_activation('hidden'))
# Forward pass
with torch.no_grad():
_ = model(torch.FloatTensor(data).unsqueeze(0))
# Remove hook
handle.remove()
# Return flattened hidden features
if 'hidden' in activation:
return activation['hidden'].cpu().numpy().flatten()
else:
return np.zeros(512) # Default size
except Exception as e:
logger.error(f"Error extracting CNN hidden features: {e}")
return np.zeros(512)
Phase 4: Enhanced Training Pipeline
4.1 Multi-Modal Training Loop
class EnhancedRLTrainingPipeline:
"""Comprehensive RL training with all required data inputs"""
def __init__(self, config):
self.config = config
self.state_builder = EnhancedRLStateBuilder(config)
self.pivot_calculator = WilliamsMarketStructure()
self.cnn_rl_bridge = CNNRLBridge(config.cnn_models, config.rl_agents)
# Enhanced DQN with larger state space
self.agent = EnhancedDQNAgent({
'state_size': self.state_builder.total_state_size, # ~8000+ features
'action_space': 3,
'hidden_size': 1024, # Larger hidden layers
'learning_rate': 0.0001,
'gamma': 0.99,
'buffer_size': 50000, # Larger replay buffer
'batch_size': 128
})
async def training_step(self, universal_stream: UniversalDataStream):
"""Single training step with comprehensive data"""
# 1. Extract CNN features and predictions
cnn_data = await self.cnn_rl_bridge.extract_cnn_features_for_rl(universal_stream)
# 2. Build comprehensive RL state
current_state = self.state_builder.build_rl_state(
universal_stream=universal_stream,
cnn_hidden_features=cnn_data['hidden_features'],
cnn_predictions=cnn_data['predictions']
)
# 3. Agent action selection
action = self.agent.act(current_state)
# 4. Execute action and get reward
reward, next_universal_stream = await self._execute_action_and_get_reward(
action, universal_stream
)
# 5. Build next state
next_cnn_data = await self.cnn_rl_bridge.extract_cnn_features_for_rl(
next_universal_stream
)
next_state = self.state_builder.build_rl_state(
universal_stream=next_universal_stream,
cnn_hidden_features=next_cnn_data['hidden_features'],
cnn_predictions=next_cnn_data['predictions']
)
# 6. Store experience
self.agent.remember(
state=current_state,
action=action,
reward=reward,
next_state=next_state,
done=False
)
# 7. Train if enough experiences
if len(self.agent.replay_buffer) > self.agent.batch_size:
loss = self.agent.replay()
return {'loss': loss, 'reward': reward, 'action': action}
return {'reward': reward, 'action': action}
4.2 Enhanced Reward Engineering
class EnhancedRewardCalculator:
"""Sophisticated reward calculation considering multiple factors"""
def calculate_reward(self, action: int, market_data_before: Dict,
market_data_after: Dict, trade_outcome: float = None) -> float:
"""Calculate multi-factor reward"""
base_reward = 0.0
# 1. Price Movement Reward
if trade_outcome is not None:
# Direct trading outcome
base_reward += trade_outcome * 10 # Scale P&L
else:
# Prediction accuracy reward
price_change = self._calculate_price_change(market_data_before, market_data_after)
action_correctness = self._evaluate_action_correctness(action, price_change)
base_reward += action_correctness * 5
# 2. Market Regime Bonus
regime_bonus = self._calculate_regime_bonus(action, market_data_after)
base_reward += regime_bonus
# 3. Volatility Penalty/Bonus
volatility_factor = self._calculate_volatility_factor(market_data_after)
base_reward *= volatility_factor
# 4. CNN Confidence Alignment
cnn_alignment = self._calculate_cnn_alignment_bonus(action, market_data_after)
base_reward += cnn_alignment
# 5. Pivot Point Accuracy
pivot_accuracy = self._calculate_pivot_accuracy_bonus(action, market_data_after)
base_reward += pivot_accuracy
return base_reward
Phase 5: Implementation Timeline
Week 1: State Representation Enhancement
- Implement EnhancedRLStateBuilder
- Add tick data processing
- Implement multi-timeframe OHLCV integration
- Add BTC reference data processing
Week 2: Pivot Point System
- Implement WilliamsMarketStructure class
- Add recursive pivot point calculation
- Integrate with state builder
- Test pivot point accuracy
Week 3: CNN-RL Integration
- Implement CNNRLBridge
- Add hidden feature extraction
- Integrate CNN predictions into RL state
- Test feature consistency
Week 4: Enhanced Training Pipeline
- Implement EnhancedRLTrainingPipeline
- Add enhanced reward calculator
- Integrate all components
- Performance testing and optimization
Week 5: Testing and Validation
- Comprehensive integration testing
- Performance validation
- Memory usage optimization
- Documentation and monitoring
Expected Improvements
1. State Representation Quality
- Current: ~100 basic features
- Enhanced: ~8000+ comprehensive features
- Improvement: 80x more information density
2. Decision Making Accuracy
- Current: Limited to basic market metrics
- Enhanced: Multi-modal with CNN features + pivot points
- Expected: 40-60% improvement in prediction accuracy
3. Market Adaptability
- Current: Basic market regime detection
- Enhanced: Multi-timeframe analysis with recursive trends
- Expected: Better performance across different market conditions
4. Learning Efficiency
- Current: Simple experience replay
- Enhanced: Prioritized replay with sophisticated rewards
- Expected: 2-3x faster convergence
Risk Mitigation
1. Memory Usage
- Risk: Large state vectors (~8000 features) may cause memory issues
- Mitigation: Implement state compression and efficient batching
2. Training Stability
- Risk: Complex state space may cause training instability
- Mitigation: Gradual state expansion, careful hyperparameter tuning
3. Integration Complexity
- Risk: CNN-RL integration may introduce bugs
- Mitigation: Extensive testing, fallback mechanisms
4. Performance Impact
- Risk: Real-time performance degradation
- Mitigation: Asynchronous processing, optimized data structures
Success Metrics
- State Quality: Feature coverage > 95% of required specification
- Training Performance: Convergence time < 50% of current
- Decision Accuracy: Prediction accuracy > 65% (vs current ~45%)
- Market Adaptability: Consistent performance across 3+ market regimes
- Integration Stability: Uptime > 99.5% with CNN integration
This comprehensive upgrade will transform the RL training pipeline from a basic implementation to a sophisticated multi-modal system that fully meets the specification requirements.