Dobromir Popov
2025-07-22 16:13:42 +03:00
parent c63dc11c14
commit cc0c783411
2 changed files with 101 additions and 124 deletions

View File

@@ -1,124 +0,0 @@
# Requirements Document
## Introduction
The Checkpoint Persistence Fix addresses a critical system flaw where model training progress is not being saved during training, causing all learning progress to be lost when the system restarts. Despite having a well-implemented CheckpointManager and proper checkpoint loading at startup, the system lacks checkpoint saving during training operations. This creates a fundamental issue where models train continuously but never persist their improved weights, making continuous improvement impossible and wasting computational resources.
## Requirements
### Requirement 1: Real-time Checkpoint Saving During Training
**User Story:** As a system operator, I want model improvements to be automatically saved during training, so that training progress is never lost when the system restarts.
#### Acceptance Criteria
1. WHEN the DQN model is trained in _train_models_on_decision THEN the system SHALL save a checkpoint if the loss improves.
2. WHEN the CNN model is trained THEN the system SHALL save a checkpoint if the loss improves.
3. WHEN the COB RL model is trained THEN the system SHALL save a checkpoint if the loss improves.
4. WHEN the Extrema trainer is trained THEN the system SHALL save a checkpoint if the loss improves.
5. WHEN any model training completes THEN the system SHALL compare current performance to best performance and save if improved.
6. WHEN checkpoint saving occurs THEN the system SHALL update the model_states dictionary with new performance metrics.
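The criteria above map naturally onto a small post-training hook. The sketch below reuses the `save_checkpoint(model=..., model_name=..., performance=..., metadata=...)` call and the `model_states` dictionary that appear in the orchestrator change later in this commit; the helper itself (`maybe_save_checkpoint`) is illustrative, not existing code.

```python
from datetime import datetime


def maybe_save_checkpoint(orchestrator, model_name: str, model_obj, current_loss: float):
    """Illustrative hook: persist a checkpoint whenever training lowers the loss."""
    state = orchestrator.model_states.setdefault(model_name, {})
    best_loss = state.get('best_loss', float('inf'))

    if current_loss < best_loss:
        path = orchestrator.checkpoint_manager.save_checkpoint(
            model=model_obj,
            model_name=model_name,
            performance=current_loss,
            metadata={'loss': current_loss, 'timestamp': datetime.now().isoformat()},
        )
        # Keep model_states in sync so the dashboard and later runs see the new best
        state.update(best_loss=current_loss, best_checkpoint=path,
                     last_training=datetime.now())
```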
### Requirement 2: Performance-Based Checkpoint Management
**User Story:** As a developer, I want checkpoints to be saved only when model performance improves, so that storage is used efficiently and only the best models are preserved.
#### Acceptance Criteria
1. WHEN evaluating whether to save a checkpoint THEN the system SHALL compare current loss to the best recorded loss.
2. WHEN loss decreases by a configurable threshold THEN the system SHALL trigger checkpoint saving.
3. WHEN multiple models are trained simultaneously THEN each model SHALL have independent performance tracking.
4. WHEN checkpoint rotation occurs THEN the system SHALL keep only the best performing checkpoints.
5. WHEN performance metrics are updated THEN the system SHALL log the improvement for monitoring.
6. WHEN no improvement is detected THEN the system SHALL skip checkpoint saving to avoid unnecessary I/O.
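A minimal sketch of per-model, threshold-based tracking; the `PerformanceTracker` class and the `min_improvement` default are hypothetical, not part of the existing code.

```python
class PerformanceTracker:
    """Hypothetical per-model tracker: save only when loss improves by a configurable margin."""

    def __init__(self, min_improvement=0.001):
        self.min_improvement = min_improvement
        self.best_loss = {}  # model_name -> best loss seen so far (independent per model)

    def should_save(self, model_name, current_loss):
        best = self.best_loss.get(model_name, float('inf'))
        if best - current_loss >= self.min_improvement:
            self.best_loss[model_name] = current_loss
            return True
        return False  # no meaningful improvement: skip the save to avoid unnecessary I/O
```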
### Requirement 3: Periodic Checkpoint Saving
**User Story:** As a system administrator, I want checkpoints to be saved periodically regardless of performance, so that progress is preserved even during long training sessions without significant improvement.
#### Acceptance Criteria
1. WHEN a configurable number of training iterations have passed THEN the system SHALL save a checkpoint regardless of performance.
2. WHEN periodic saving occurs THEN the system SHALL use a separate checkpoint category to distinguish from performance-based saves.
3. WHEN the system runs for extended periods THEN periodic checkpoints SHALL ensure that no more than a configurable number of minutes of training progress can be lost.
4. WHEN periodic checkpoints accumulate THEN the system SHALL maintain a rolling window of recent saves.
5. WHEN storage space is limited THEN periodic checkpoints SHALL be cleaned up while preserving performance-based checkpoints.
6. WHEN the system restarts THEN it SHALL load the most recent checkpoint (either performance-based or periodic).
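A sketch of the periodic-save rule with a rolling window; the `PeriodicSaver` helper, its defaults, and the file-cleanup behavior are assumptions layered on top of the `save_checkpoint` call from this commit.

```python
import os
from collections import deque


class PeriodicSaver:
    """Hypothetical helper: save every N iterations and keep only the last K periodic saves."""

    def __init__(self, checkpoint_manager, every_n_iterations=100, keep_last=5):
        self.checkpoint_manager = checkpoint_manager
        self.every_n = every_n_iterations
        self.recent = deque(maxlen=keep_last)  # rolling window of periodic checkpoint paths

    def on_iteration(self, iteration, model, model_name, loss):
        if iteration % self.every_n != 0:
            return
        path = self.checkpoint_manager.save_checkpoint(
            model=model, model_name=model_name, performance=loss,
            metadata={'category': 'periodic', 'iteration': iteration},
        )
        if not path:
            return
        if len(self.recent) == self.recent.maxlen:
            oldest = self.recent[0]  # this entry is evicted by the append below
            if os.path.exists(oldest):
                os.remove(oldest)  # prune periodic saves only, never performance-based ones
        self.recent.append(path)
```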
### Requirement 4: Enhanced Training System Integration
**User Story:** As a developer, I want the EnhancedRealtimeTrainingSystem to properly save checkpoints, so that continuous learning progress is preserved across system restarts.
#### Acceptance Criteria
1. WHEN the EnhancedRealtimeTrainingSystem trains models THEN it SHALL integrate with the CheckpointManager.
2. WHEN training episodes complete THEN the system SHALL evaluate and save improved models.
3. WHEN the training system initializes THEN it SHALL load the best available checkpoints.
4. WHEN training data is collected THEN the system SHALL track performance metrics for checkpoint decisions.
5. WHEN the training system shuts down THEN it SHALL save final checkpoints before termination.
6. WHEN training resumes THEN the system SHALL continue from the last saved checkpoint state.
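The intended lifecycle could be wired up roughly as below. Only the `save_checkpoint` signature is taken from this commit; the wrapper class and its method names are illustrative assumptions about how `EnhancedRealtimeTrainingSystem` would integrate with the `CheckpointManager`.

```python
class CheckpointAwareTrainer:
    """Illustrative lifecycle wrapper; everything except save_checkpoint() is an assumption."""

    def __init__(self, models, checkpoint_manager):
        self.models = models  # e.g. {'dqn': rl_agent, 'cnn': cnn_model}
        self.checkpoint_manager = checkpoint_manager
        self.last_loss = {}

    def on_episode_end(self, model_name, loss):
        # Evaluate after every episode and persist the (possibly improved) model
        self.last_loss[model_name] = loss
        self.checkpoint_manager.save_checkpoint(
            model=self.models[model_name], model_name=model_name,
            performance=loss, metadata={'event': 'episode_end'})

    def shutdown(self):
        # Save final checkpoints before termination so no in-flight progress is lost
        for name, model in self.models.items():
            self.checkpoint_manager.save_checkpoint(
                model=model, model_name=name,
                performance=self.last_loss.get(name, float('inf')),
                metadata={'event': 'shutdown'})
```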
### Requirement 5: Complete Training Data Storage
**User Story:** As a developer, I want complete training episodes to be stored with full input dataframes, so that training can be replayed and analyzed with all original context.
#### Acceptance Criteria
1. WHEN training episodes are saved THEN the system SHALL store the complete input dataframe with all model inputs (price data, indicators, market structure, etc.).
2. WHEN model actions are recorded THEN the system SHALL store the full context that led to the decision, not just the action result.
3. WHEN training cases are saved THEN they SHALL include timestamps, market conditions, and all feature vectors used by the models.
4. WHEN storing training data THEN the system SHALL preserve the exact state that can be used to reproduce the model's decision.
5. WHEN training episodes are replayed THEN the system SHALL be able to reconstruct the exact same inputs that were originally used.
6. WHEN analyzing training performance THEN complete dataframes SHALL be available for debugging and improvement.
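One possible shape for a stored training case, assuming pandas DataFrames as the model-input container; the `TrainingEpisode` structure and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict

import pandas as pd


@dataclass
class TrainingEpisode:
    """Hypothetical record: everything needed to replay a model decision exactly."""
    timestamp: datetime
    symbol: str
    action: str                      # e.g. 'BUY', 'SELL', 'HOLD'
    confidence: float
    input_frame: pd.DataFrame        # full price/indicator/market-structure inputs
    feature_vectors: Dict[str, Any]  # per-model features actually fed to each model
    market_conditions: Dict[str, Any] = field(default_factory=dict)

    def to_parquet(self, path: str) -> None:
        # Persist the raw inputs so the episode can be reconstructed exactly later
        self.input_frame.to_parquet(path)
```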
### Requirement 6: Comprehensive Performance Tracking
**User Story:** As a system operator, I want detailed performance metrics to be tracked and persisted, so that I can monitor training progress and model improvement over time.
#### Acceptance Criteria
1. WHEN models are trained THEN the system SHALL track loss values, accuracy metrics, and training timestamps.
2. WHEN performance improves THEN the system SHALL log the improvement amount and save metadata.
3. WHEN checkpoints are saved THEN the system SHALL store performance metrics alongside model weights.
4. WHEN the system starts THEN it SHALL display the performance history of loaded checkpoints.
5. WHEN multiple training sessions occur THEN the system SHALL maintain a continuous performance history.
6. WHEN performance degrades THEN the system SHALL provide alerts and revert to better checkpoints if configured.
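A compact sketch of an append-only metrics history that could back these criteria; the file path and field names are illustrative.

```python
import json
from datetime import datetime


class MetricsHistory:
    """Hypothetical append-only log of training metrics, persisted alongside checkpoints."""

    def __init__(self, path='checkpoints/metrics_history.jsonl'):
        self.path = path

    def record(self, model_name, loss, accuracy=None):
        entry = {
            'model': model_name,
            'loss': loss,
            'accuracy': accuracy,
            'timestamp': datetime.now().isoformat(),
        }
        with open(self.path, 'a') as f:  # appending keeps one continuous history across sessions
            f.write(json.dumps(entry) + '\n')
```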
### Requirement 7: Robust Error Handling and Recovery
**User Story:** As a system administrator, I want checkpoint operations to be resilient to failures, so that training can continue even if individual checkpoint saves fail.
#### Acceptance Criteria
1. WHEN checkpoint saving fails THEN the system SHALL log the error and continue training without crashing.
2. WHEN disk space is insufficient THEN the system SHALL clean up old checkpoints and retry saving.
3. WHEN checkpoint files are corrupted THEN the system SHALL fall back to previous valid checkpoints.
4. WHEN concurrent access conflicts occur THEN the system SHALL use proper locking mechanisms.
5. WHEN the system recovers from failures THEN it SHALL validate checkpoint integrity before loading.
6. WHEN critical checkpoint operations fail repeatedly THEN the system SHALL alert administrators.
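These criteria translate roughly into a defensive save wrapper like the sketch below; the `cleanup_old_checkpoints` hook is probed with `getattr` because it is an assumed, not confirmed, method on the CheckpointManager.

```python
import logging

logger = logging.getLogger(__name__)


def safe_save(checkpoint_manager, model, model_name, performance, metadata, retries=1):
    """Illustrative wrapper: a failed checkpoint save must never crash the training loop."""
    for attempt in range(retries + 1):
        try:
            return checkpoint_manager.save_checkpoint(
                model=model, model_name=model_name,
                performance=performance, metadata=metadata)
        except OSError as e:  # e.g. disk full
            logger.error("Checkpoint save failed for %s (attempt %d): %s",
                         model_name, attempt + 1, e)
            # Assumed cleanup hook: prune old checkpoints to free space, then retry
            cleanup = getattr(checkpoint_manager, 'cleanup_old_checkpoints', None)
            if callable(cleanup):
                cleanup()
        except Exception as e:
            logger.error("Unexpected checkpoint error for %s: %s", model_name, e)
            break
    return None  # caller continues training without a new checkpoint
```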
### Requirement 8: Configuration and Monitoring
**User Story:** As a developer, I want configurable checkpoint settings and monitoring capabilities, so that I can optimize checkpoint behavior for different training scenarios.
#### Acceptance Criteria
1. WHEN configuring the system THEN checkpoint saving frequency SHALL be adjustable.
2. WHEN setting performance thresholds THEN the minimum improvement required for saving SHALL be configurable.
3. WHEN monitoring training THEN checkpoint save events SHALL be visible in logs and dashboards.
4. WHEN analyzing performance THEN checkpoint metadata SHALL be accessible for review.
5. WHEN tuning the system THEN checkpoint storage limits SHALL be configurable.
6. WHEN debugging issues THEN detailed checkpoint operation logs SHALL be available.
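The tunables above could be grouped into a single configuration object; every default shown here is illustrative.

```python
from dataclasses import dataclass


@dataclass
class CheckpointConfig:
    """Hypothetical settings block for checkpoint behavior."""
    min_loss_improvement: float = 0.001   # minimum improvement required to save
    periodic_save_every: int = 100        # iterations between unconditional saves
    max_checkpoints_per_model: int = 5    # storage limit before rotation kicks in
    log_save_events: bool = True          # surface save events in logs and dashboards
```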
### Requirement 9: Backward Compatibility and Migration
**User Story:** As a user, I want existing checkpoints to remain compatible, so that current model progress is preserved when the checkpoint system is enhanced.
#### Acceptance Criteria
1. WHEN the enhanced checkpoint system starts THEN it SHALL load existing checkpoints without issues.
2. WHEN checkpoint formats are updated THEN migration utilities SHALL convert old formats.
3. WHEN new metadata is added THEN existing checkpoints SHALL work with default values.
4. WHEN the system upgrades THEN checkpoint directories SHALL be preserved and enhanced.
5. WHEN rollback is needed THEN the system SHALL support reverting to previous checkpoint versions.
6. WHEN compatibility issues arise THEN clear error messages SHALL guide resolution.
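Missing metadata in pre-existing checkpoints can be handled by merging in defaults, as in this sketch; the specific keys and the `schema_version` field are assumptions.

```python
def upgrade_checkpoint_metadata(metadata):
    """Illustrative migration: give older checkpoints sane defaults for newer fields."""
    defaults = {
        'training_iterations': 0,
        'performance_score': None,
        'model_type': 'unknown',
        'schema_version': 1,
    }
    return {**defaults, **(metadata or {})}  # existing values win over defaults
```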

View File

@@ -220,6 +220,11 @@ class TradingOrchestrator:
        self.data_provider.start_centralized_data_collection()
        logger.info("Centralized data collection started - all models and dashboard will receive data")

        # CRITICAL: Initialize checkpoint manager for saving training progress
        self.checkpoint_manager = None
        self.training_iterations = 0  # Track training iterations for periodic saves
        self._initialize_checkpoint_manager()

        # Initialize models, COB integration, and training system
        self._initialize_ml_models()
        self._initialize_cob_integration()
@@ -2145,6 +2150,9 @@
            if not market_data:
                return

            # Track if any model was trained for checkpoint saving
            models_trained = []

            # Train DQN agent if available
            if self.rl_agent and hasattr(self.rl_agent, 'add_experience'):
                try:
@@ -2167,6 +2175,7 @@
                        done=False
                    )
                    models_trained.append('dqn')

                    logger.debug(f"🧠 Added DQN experience: {action} {symbol} (reward: {immediate_reward:.3f})")

                except Exception as e:
@@ -2185,6 +2194,7 @@
                    # Add training sample
                    self.cnn_model.add_training_sample(cnn_features, target, weight=confidence)
                    models_trained.append('cnn')

                    logger.debug(f"🔍 Added CNN training sample: {action} {symbol}")

                except Exception as e:
@@ -2206,14 +2216,105 @@
                        symbol=symbol
                    )
                    models_trained.append('cob_rl')

                    logger.debug(f"📊 Added COB RL experience: {action} {symbol}")

                except Exception as e:
                    logger.debug(f"Error training COB RL on decision: {e}")

            # CRITICAL FIX: Save checkpoints after training
            if models_trained:
                self._save_training_checkpoints(models_trained, confidence)

        except Exception as e:
            logger.error(f"Error training models on decision: {e}")
    def _save_training_checkpoints(self, models_trained: List[str], performance_score: float):
        """Save checkpoints for trained models if performance improved

        This is CRITICAL for preserving training progress across restarts.
        """
        try:
            if not self.checkpoint_manager:
                return

            # Increment training counter
            self.training_iterations += 1

            # Save checkpoints for each trained model
            for model_name in models_trained:
                try:
                    model_obj = None
                    current_loss = None

                    # Get model object and calculate current performance
                    if model_name == 'dqn' and self.rl_agent:
                        model_obj = self.rl_agent
                        # Use negative performance score as loss (higher confidence = lower loss)
                        current_loss = 1.0 - performance_score
                    elif model_name == 'cnn' and self.cnn_model:
                        model_obj = self.cnn_model
                        current_loss = 1.0 - performance_score
                    elif model_name == 'cob_rl' and self.cob_rl_agent:
                        model_obj = self.cob_rl_agent
                        current_loss = 1.0 - performance_score

                    if model_obj and current_loss is not None:
                        # Check if this is the best performance so far
                        model_state = self.model_states.get(model_name, {})
                        best_loss = model_state.get('best_loss', float('inf'))

                        # Update current loss
                        model_state['current_loss'] = current_loss
                        model_state['last_training'] = datetime.now()

                        # Save checkpoint if performance improved or periodic save
                        should_save = (
                            current_loss < best_loss or  # Performance improved
                            self.training_iterations % 100 == 0  # Periodic save every 100 iterations
                        )

                        if should_save:
                            # Prepare metadata
                            metadata = {
                                'loss': current_loss,
                                'performance_score': performance_score,
                                'training_iterations': self.training_iterations,
                                'timestamp': datetime.now().isoformat(),
                                'model_type': model_name
                            }

                            # Save checkpoint
                            checkpoint_path = self.checkpoint_manager.save_checkpoint(
                                model=model_obj,
                                model_name=model_name,
                                performance=current_loss,
                                metadata=metadata
                            )

                            if checkpoint_path:
                                # Update best performance
                                if current_loss < best_loss:
                                    model_state['best_loss'] = current_loss
                                    model_state['best_checkpoint'] = checkpoint_path
                                    logger.info(f"💾 Saved BEST checkpoint for {model_name}: {checkpoint_path} (loss: {current_loss:.4f})")
                                else:
                                    logger.debug(f"💾 Saved periodic checkpoint for {model_name}: {checkpoint_path}")

                                model_state['last_checkpoint'] = checkpoint_path
                                model_state['checkpoints_saved'] = model_state.get('checkpoints_saved', 0) + 1

                        # Update model state
                        self.model_states[model_name] = model_state

                except Exception as e:
                    logger.error(f"Error saving checkpoint for {model_name}: {e}")

        except Exception as e:
            logger.error(f"Error saving training checkpoints: {e}")
    def _get_current_market_data(self, symbol: str) -> Optional[Dict]:
        """Get current market data for training context"""
        try: