diff --git a/.kiro/specs/checkpoint-persistence-fix/requirements.md b/.kiro/specs/checkpoint-persistence-fix/requirements.md
deleted file mode 100644
index ff9d7ca..0000000
--- a/.kiro/specs/checkpoint-persistence-fix/requirements.md
+++ /dev/null
@@ -1,124 +0,0 @@
-# Requirements Document
-
-## Introduction
-
-The Checkpoint Persistence Fix addresses a critical system flaw where model training progress is not being saved during training, causing all learning progress to be lost when the system restarts. Despite having a well-implemented CheckpointManager and proper checkpoint loading at startup, the system lacks checkpoint saving during training operations. This creates a fundamental issue where models train continuously but never persist their improved weights, making continuous improvement impossible and wasting computational resources.
-
-## Requirements
-
-### Requirement 1: Real-time Checkpoint Saving During Training
-
-**User Story:** As a system operator, I want model improvements to be automatically saved during training, so that training progress is never lost when the system restarts.
-
-#### Acceptance Criteria
-
-1. WHEN the DQN model is trained in _train_models_on_decision THEN the system SHALL save a checkpoint if the loss improves.
-2. WHEN the CNN model is trained THEN the system SHALL save a checkpoint if the loss improves.
-3. WHEN the COB RL model is trained THEN the system SHALL save a checkpoint if the loss improves.
-4. WHEN the Extrema trainer is trained THEN the system SHALL save a checkpoint if the loss improves.
-5. WHEN any model training completes THEN the system SHALL compare current performance to best performance and save if improved.
-6. WHEN checkpoint saving occurs THEN the system SHALL update the model_states dictionary with new performance metrics.
-
-### Requirement 2: Performance-Based Checkpoint Management
-
-**User Story:** As a developer, I want checkpoints to be saved only when model performance improves, so that storage is used efficiently and only the best models are preserved.
-
-#### Acceptance Criteria
-
-1. WHEN evaluating whether to save a checkpoint THEN the system SHALL compare current loss to the best recorded loss.
-2. WHEN loss decreases by a configurable threshold THEN the system SHALL trigger checkpoint saving.
-3. WHEN multiple models are trained simultaneously THEN each model SHALL have independent performance tracking.
-4. WHEN checkpoint rotation occurs THEN the system SHALL keep only the best performing checkpoints.
-5. WHEN performance metrics are updated THEN the system SHALL log the improvement for monitoring.
-6. WHEN no improvement is detected THEN the system SHALL skip checkpoint saving to avoid unnecessary I/O.
-
-### Requirement 3: Periodic Checkpoint Saving
-
-**User Story:** As a system administrator, I want checkpoints to be saved periodically regardless of performance, so that progress is preserved even during long training sessions without significant improvement.
-
-#### Acceptance Criteria
-
-1. WHEN a configurable number of training iterations have passed THEN the system SHALL save a checkpoint regardless of performance.
-2. WHEN periodic saving occurs THEN the system SHALL use a separate checkpoint category to distinguish from performance-based saves.
-3. WHEN the system runs for extended periods THEN periodic checkpoints SHALL ensure no more than X minutes of training progress can be lost.
-4. WHEN periodic checkpoints accumulate THEN the system SHALL maintain a rolling window of recent saves.
-5. WHEN storage space is limited THEN periodic checkpoints SHALL be cleaned up while preserving performance-based checkpoints.
-6. WHEN the system restarts THEN it SHALL load the most recent checkpoint (either performance-based or periodic).
-
-### Requirement 4: Enhanced Training System Integration
-
-**User Story:** As a developer, I want the EnhancedRealtimeTrainingSystem to properly save checkpoints, so that continuous learning progress is preserved across system restarts.
-
-#### Acceptance Criteria
-
-1. WHEN the EnhancedRealtimeTrainingSystem trains models THEN it SHALL integrate with the CheckpointManager.
-2. WHEN training episodes complete THEN the system SHALL evaluate and save improved models.
-3. WHEN the training system initializes THEN it SHALL load the best available checkpoints.
-4. WHEN training data is collected THEN the system SHALL track performance metrics for checkpoint decisions.
-5. WHEN the training system shuts down THEN it SHALL save final checkpoints before termination.
-6. WHEN training resumes THEN the system SHALL continue from the last saved checkpoint state.
-
-### Requirement 5: Complete Training Data Storage
-
-**User Story:** As a developer, I want complete training episodes to be stored with full input dataframes, so that training can be replayed and analyzed with all original context.
-
-#### Acceptance Criteria
-
-1. WHEN training episodes are saved THEN the system SHALL store the complete input dataframe with all model inputs (price data, indicators, market structure, etc.).
-2. WHEN model actions are recorded THEN the system SHALL store the full context that led to the decision, not just the action result.
-3. WHEN training cases are saved THEN they SHALL include timestamps, market conditions, and all feature vectors used by the models.
-4. WHEN storing training data THEN the system SHALL preserve the exact state that can be used to reproduce the model's decision.
-5. WHEN training episodes are replayed THEN the system SHALL be able to reconstruct the exact same inputs that were originally used.
-6. WHEN analyzing training performance THEN complete dataframes SHALL be available for debugging and improvement.
-
-### Requirement 6: Comprehensive Performance Tracking
-
-**User Story:** As a system operator, I want detailed performance metrics to be tracked and persisted, so that I can monitor training progress and model improvement over time.
-
-#### Acceptance Criteria
-
-1. WHEN models are trained THEN the system SHALL track loss values, accuracy metrics, and training timestamps.
-2. WHEN performance improves THEN the system SHALL log the improvement amount and save metadata.
-3. WHEN checkpoints are saved THEN the system SHALL store performance metrics alongside model weights.
-4. WHEN the system starts THEN it SHALL display the performance history of loaded checkpoints.
-5. WHEN multiple training sessions occur THEN the system SHALL maintain a continuous performance history.
-6. WHEN performance degrades THEN the system SHALL provide alerts and revert to better checkpoints if configured.
-
-### Requirement 7: Robust Error Handling and Recovery
-
-**User Story:** As a system administrator, I want checkpoint operations to be resilient to failures, so that training can continue even if individual checkpoint saves fail.
-
-#### Acceptance Criteria
-
-1. WHEN checkpoint saving fails THEN the system SHALL log the error and continue training without crashing.
-2. WHEN disk space is insufficient THEN the system SHALL clean up old checkpoints and retry saving.
-3. WHEN checkpoint files are corrupted THEN the system SHALL fall back to previous valid checkpoints.
-4. WHEN concurrent access conflicts occur THEN the system SHALL use proper locking mechanisms.
-5. WHEN the system recovers from failures THEN it SHALL validate checkpoint integrity before loading.
-6. WHEN critical checkpoint operations fail repeatedly THEN the system SHALL alert administrators.
-
-### Requirement 8: Configuration and Monitoring
-
-**User Story:** As a developer, I want configurable checkpoint settings and monitoring capabilities, so that I can optimize checkpoint behavior for different training scenarios.
-
-#### Acceptance Criteria
-
-1. WHEN configuring the system THEN checkpoint saving frequency SHALL be adjustable.
-2. WHEN setting performance thresholds THEN the minimum improvement required for saving SHALL be configurable.
-3. WHEN monitoring training THEN checkpoint save events SHALL be visible in logs and dashboards.
-4. WHEN analyzing performance THEN checkpoint metadata SHALL be accessible for review.
-5. WHEN tuning the system THEN checkpoint storage limits SHALL be configurable.
-6. WHEN debugging issues THEN detailed checkpoint operation logs SHALL be available.
-
-### Requirement 9: Backward Compatibility and Migration
-
-**User Story:** As a user, I want existing checkpoints to remain compatible, so that current model progress is preserved when the checkpoint system is enhanced.
-
-#### Acceptance Criteria
-
-1. WHEN the enhanced checkpoint system starts THEN it SHALL load existing checkpoints without issues.
-2. WHEN checkpoint formats are updated THEN migration utilities SHALL convert old formats.
-3. WHEN new metadata is added THEN existing checkpoints SHALL work with default values.
-4. WHEN the system upgrades THEN checkpoint directories SHALL be preserved and enhanced.
-5. WHEN rollback is needed THEN the system SHALL support reverting to previous checkpoint versions.
-6. WHEN compatibility issues arise THEN clear error messages SHALL guide resolution.
\ No newline at end of file
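Requirements 2 and 3 above combine improvement-gated saves with a periodic fallback. The sketch below shows one way that decision could be expressed in Python; the names `CheckpointPolicy`, `min_improvement`, and `periodic_interval` are illustrative assumptions rather than anything defined in the deleted spec.

```python
# Illustrative sketch only: combines the improvement-gated saving of
# Requirement 2 with the periodic fallback saving of Requirement 3.
# Names and default thresholds are assumptions, not part of the original spec.
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class CheckpointPolicy:
    min_improvement: float = 0.01      # Req 2.2: configurable improvement threshold
    periodic_interval: int = 100       # Req 3.1: save every N iterations regardless
    best_loss: Dict[str, float] = field(default_factory=dict)  # Req 2.3: per-model tracking
    iterations: Dict[str, int] = field(default_factory=dict)

    def should_save(self, model_name: str, current_loss: float) -> Tuple[bool, str]:
        """Return (save?, reason) for one model's latest training result."""
        self.iterations[model_name] = self.iterations.get(model_name, 0) + 1
        best = self.best_loss.get(model_name, float("inf"))

        if current_loss < best - self.min_improvement:
            self.best_loss[model_name] = current_loss
            return True, "performance"      # Req 2: improved beyond the threshold
        if self.iterations[model_name] % self.periodic_interval == 0:
            return True, "periodic"         # Req 3: periodic fallback save
        return False, "skipped"             # Req 2.6: avoid unnecessary I/O


# Example: an improving DQN loss triggers a "performance" save
policy = CheckpointPolicy()
print(policy.should_save("dqn", 0.85))  # (True, "performance") on the first result
print(policy.should_save("dqn", 0.84))  # (False, "skipped"): below the 0.01 threshold
```

Keeping the per-model counters and best losses inside one policy object also satisfies the independent-tracking criterion of Requirement 2.3.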
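Requirement 5 asks for the complete input dataframe behind every training decision to be persisted so it can be replayed exactly. A minimal sketch of such an episode record follows; the field names, the JSON-plus-parquet layout, and the directory scheme are assumptions, not the project's actual storage format.

```python
# Illustrative sketch for Requirement 5: persist the full input context of a
# training episode so it can be replayed later. Field names and file layout
# are assumptions; writing parquet requires pyarrow or fastparquet.
import json
from dataclasses import asdict, dataclass
from pathlib import Path

import pandas as pd


@dataclass
class TrainingEpisode:
    timestamp: str
    symbol: str
    action: str
    confidence: float
    feature_columns: list          # which dataframe columns fed the model


def save_episode(episode: TrainingEpisode, inputs: pd.DataFrame, root: Path) -> Path:
    """Store episode metadata as JSON and the full input dataframe as parquet."""
    episode_dir = root / f"{episode.symbol.replace('/', '_')}_{episode.timestamp.replace(':', '-')}"
    episode_dir.mkdir(parents=True, exist_ok=True)
    (episode_dir / "episode.json").write_text(json.dumps(asdict(episode), indent=2))
    inputs.to_parquet(episode_dir / "inputs.parquet")   # exact model inputs, replayable
    return episode_dir


def load_episode(episode_dir: Path) -> tuple:
    """Reload the exact inputs that produced the original decision (Req 5.5)."""
    meta = json.loads((episode_dir / "episode.json").read_text())
    inputs = pd.read_parquet(episode_dir / "inputs.parquet")
    return meta, inputs
```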
diff --git a/core/orchestrator.py b/core/orchestrator.py
index 85b0a55..91399a6 100644
--- a/core/orchestrator.py
+++ b/core/orchestrator.py
@@ -220,6 +220,11 @@ class TradingOrchestrator:
         self.data_provider.start_centralized_data_collection()
         logger.info("Centralized data collection started - all models and dashboard will receive data")
 
+        # CRITICAL: Initialize checkpoint manager for saving training progress
+        self.checkpoint_manager = None
+        self.training_iterations = 0  # Track training iterations for periodic saves
+        self._initialize_checkpoint_manager()
+
         # Initialize models, COB integration, and training system
         self._initialize_ml_models()
         self._initialize_cob_integration()
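The hunk above wires `self._initialize_checkpoint_manager()` into `__init__`, but the method body is outside this excerpt. A hedged sketch of how it might look as a `TradingOrchestrator` method is shown below; the `utils.checkpoint_manager` import path, the constructor arguments, and the defaults are assumptions about the surrounding codebase, and it reuses the module-level `logger` the diff already relies on.

```python
# Hedged sketch: _initialize_checkpoint_manager() is called above but its body
# is not part of this diff. The import path, constructor arguments, and default
# values below are assumptions, written as a TradingOrchestrator method.
def _initialize_checkpoint_manager(self):
    """Create the CheckpointManager so training progress can be persisted."""
    try:
        from utils.checkpoint_manager import CheckpointManager  # assumed module path

        self.checkpoint_manager = CheckpointManager(
            checkpoint_dir="models/checkpoints",  # assumed default directory
            max_checkpoints=10,                   # assumed rotation limit
        )
        logger.info("Checkpoint manager initialized for training persistence")
    except Exception as e:
        # Match the defensive style of the surrounding code: checkpoint setup
        # failures should never prevent the orchestrator from starting.
        logger.error(f"Error initializing checkpoint manager: {e}")
        self.checkpoint_manager = None
```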
+ """ + try: + if not self.checkpoint_manager: + return + + # Increment training counter + self.training_iterations += 1 + + # Save checkpoints for each trained model + for model_name in models_trained: + try: + model_obj = None + current_loss = None + + # Get model object and calculate current performance + if model_name == 'dqn' and self.rl_agent: + model_obj = self.rl_agent + # Use negative performance score as loss (higher confidence = lower loss) + current_loss = 1.0 - performance_score + + elif model_name == 'cnn' and self.cnn_model: + model_obj = self.cnn_model + current_loss = 1.0 - performance_score + + elif model_name == 'cob_rl' and self.cob_rl_agent: + model_obj = self.cob_rl_agent + current_loss = 1.0 - performance_score + + if model_obj and current_loss is not None: + # Check if this is the best performance so far + model_state = self.model_states.get(model_name, {}) + best_loss = model_state.get('best_loss', float('inf')) + + # Update current loss + model_state['current_loss'] = current_loss + model_state['last_training'] = datetime.now() + + # Save checkpoint if performance improved or periodic save + should_save = ( + current_loss < best_loss or # Performance improved + self.training_iterations % 100 == 0 # Periodic save every 100 iterations + ) + + if should_save: + # Prepare metadata + metadata = { + 'loss': current_loss, + 'performance_score': performance_score, + 'training_iterations': self.training_iterations, + 'timestamp': datetime.now().isoformat(), + 'model_type': model_name + } + + # Save checkpoint + checkpoint_path = self.checkpoint_manager.save_checkpoint( + model=model_obj, + model_name=model_name, + performance=current_loss, + metadata=metadata + ) + + if checkpoint_path: + # Update best performance + if current_loss < best_loss: + model_state['best_loss'] = current_loss + model_state['best_checkpoint'] = checkpoint_path + logger.info(f"💾 Saved BEST checkpoint for {model_name}: {checkpoint_path} (loss: {current_loss:.4f})") + else: + logger.debug(f"💾 Saved periodic checkpoint for {model_name}: {checkpoint_path}") + + model_state['last_checkpoint'] = checkpoint_path + model_state['checkpoints_saved'] = model_state.get('checkpoints_saved', 0) + 1 + + # Update model state + self.model_states[model_name] = model_state + + except Exception as e: + logger.error(f"Error saving checkpoint for {model_name}: {e}") + + except Exception as e: + logger.error(f"Error saving training checkpoints: {e}") + def _get_current_market_data(self, symbol: str) -> Optional[Dict]: """Get current market data for training context""" try: