# Requirements Document

## Introduction
The Checkpoint Persistence Fix addresses a critical system flaw: model training progress is never saved during training, so all learning is lost when the system restarts. Although the system has a well-implemented CheckpointManager and correctly loads checkpoints at startup, it never saves checkpoints during training operations. As a result, models train continuously but never persist their improved weights, making continuous improvement impossible and wasting computational resources.
## Requirements
### Requirement 1: Real-time Checkpoint Saving During Training

**User Story:** As a system operator, I want model improvements to be automatically saved during training, so that training progress is never lost when the system restarts.

#### Acceptance Criteria
- WHEN the DQN model is trained in _train_models_on_decision THEN the system SHALL save a checkpoint if the loss improves.
- WHEN the CNN model is trained THEN the system SHALL save a checkpoint if the loss improves.
- WHEN the COB RL model is trained THEN the system SHALL save a checkpoint if the loss improves.
- WHEN the Extrema trainer is trained THEN the system SHALL save a checkpoint if the loss improves.
- WHEN any model training completes THEN the system SHALL compare current performance to best performance and save if improved.
- WHEN checkpoint saving occurs THEN the system SHALL update the model_states dictionary with new performance metrics.
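The following non-normative sketch illustrates the hook these criteria call for. It assumes the existing CheckpointManager exposes a save method along the lines of `save_checkpoint(model_name, model, metadata)` and that `model_states` maps model names to tracked metrics; both the method signature and the `ModelState` fields are illustrative, not the project's confirmed API.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ModelState:
    """Per-model performance entry kept in the model_states dictionary (illustrative fields)."""
    best_loss: float = float("inf")
    last_loss: float = float("inf")
    checkpoint_count: int = 0


def save_if_improved(checkpoint_manager: Any,
                     model_states: Dict[str, ModelState],
                     model_name: str,           # e.g. "dqn", "cnn", "cob_rl", "extrema"
                     model: Any,
                     current_loss: float) -> bool:
    """Save a checkpoint when the model's loss improves, and update model_states."""
    state = model_states.setdefault(model_name, ModelState())
    state.last_loss = current_loss
    if current_loss < state.best_loss:
        # Assumed call signature; adapt to the real CheckpointManager API.
        checkpoint_manager.save_checkpoint(
            model_name=model_name,
            model=model,
            metadata={"loss": current_loss, "reason": "loss_improved"},
        )
        state.best_loss = current_loss
        state.checkpoint_count += 1
        return True
    return False
```

The same helper could be called from each of the four training paths (DQN, CNN, COB RL, Extrema trainer) once their respective loss values are available.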
### Requirement 2: Performance-Based Checkpoint Management

**User Story:** As a developer, I want checkpoints to be saved only when model performance improves, so that storage is used efficiently and only the best models are preserved.

#### Acceptance Criteria
- WHEN evaluating whether to save a checkpoint THEN the system SHALL compare current loss to the best recorded loss.
- WHEN loss decreases by a configurable threshold THEN the system SHALL trigger checkpoint saving.
- WHEN multiple models are trained simultaneously THEN each model SHALL have independent performance tracking.
- WHEN checkpoint rotation occurs THEN the system SHALL keep only the best performing checkpoints.
- WHEN performance metrics are updated THEN the system SHALL log the improvement for monitoring.
- WHEN no improvement is detected THEN the system SHALL skip checkpoint saving to avoid unnecessary I/O.
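One possible shape for the threshold check and independent per-model tracking described above, with placeholder defaults:

```python
from typing import Dict


class ImprovementTracker:
    """Tracks the best loss per model and decides whether a checkpoint save is warranted."""

    def __init__(self, min_improvement: float = 0.01):
        # Minimum relative loss reduction required before saving (configurable).
        self.min_improvement = min_improvement
        self.best_loss: Dict[str, float] = {}

    def should_save(self, model_name: str, current_loss: float) -> bool:
        best = self.best_loss.get(model_name)
        if best is None:
            # First observation for this model: record it and save.
            self.best_loss[model_name] = current_loss
            return True
        improved = (best - current_loss) / max(abs(best), 1e-12) >= self.min_improvement
        if improved:
            self.best_loss[model_name] = current_loss
        return improved
```

A relative threshold is used here so one setting works across models whose losses sit on different scales; an absolute threshold is equally valid if that better fits the existing metrics.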
### Requirement 3: Periodic Checkpoint Saving

**User Story:** As a system administrator, I want checkpoints to be saved periodically regardless of performance, so that progress is preserved even during long training sessions without significant improvement.

#### Acceptance Criteria
- WHEN a configurable number of training iterations have passed THEN the system SHALL save a checkpoint regardless of performance.
- WHEN periodic saving occurs THEN the system SHALL use a separate checkpoint category to distinguish from performance-based saves.
- WHEN the system runs for extended periods THEN periodic checkpoints SHALL ensure no more than X minutes of training progress can be lost.
- WHEN periodic checkpoints accumulate THEN the system SHALL maintain a rolling window of recent saves.
- WHEN storage space is limited THEN periodic checkpoints SHALL be cleaned up while preserving performance-based checkpoints.
- WHEN the system restarts THEN it SHALL load the most recent checkpoint (either performance-based or periodic).
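A minimal sketch of a periodic saver, assuming periodic checkpoints are distinguished from performance-based ones by a `periodic_` file prefix and written through a caller-supplied `save_fn`; both choices are illustrative:

```python
import time
from collections import deque
from pathlib import Path
from typing import Callable


class PeriodicCheckpointer:
    """Saves a periodic checkpoint every N iterations or M seconds, keeping the last K files."""

    def __init__(self, save_fn: Callable[[Path], None], every_n_iters: int = 500,
                 every_seconds: float = 600.0, keep_last: int = 5):
        self.save_fn = save_fn            # callable that writes a checkpoint to the given path
        self.every_n_iters = every_n_iters
        self.every_seconds = every_seconds
        self.keep_last = keep_last
        self._iters_since_save = 0
        self._last_save_time = time.monotonic()
        self._saved = deque()             # periodic checkpoint paths, oldest first

    def step(self, checkpoint_dir: Path) -> None:
        """Call once per training iteration; saves when the iteration or time budget is exceeded."""
        self._iters_since_save += 1
        due = (self._iters_since_save >= self.every_n_iters
               or time.monotonic() - self._last_save_time >= self.every_seconds)
        if not due:
            return
        path = checkpoint_dir / f"periodic_{int(time.time())}.pt"
        self.save_fn(path)
        self._saved.append(path)
        self._iters_since_save = 0
        self._last_save_time = time.monotonic()
        # Rolling window: drop the oldest periodic checkpoints only; performance-based
        # checkpoints live under a different prefix and are never touched here.
        while len(self._saved) > self.keep_last:
            self._saved.popleft().unlink(missing_ok=True)
```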
### Requirement 4: Enhanced Training System Integration

**User Story:** As a developer, I want the EnhancedRealtimeTrainingSystem to properly save checkpoints, so that continuous learning progress is preserved across system restarts.

#### Acceptance Criteria
- WHEN the EnhancedRealtimeTrainingSystem trains models THEN it SHALL integrate with the CheckpointManager.
- WHEN training episodes complete THEN the system SHALL evaluate and save improved models.
- WHEN the training system initializes THEN it SHALL load the best available checkpoints.
- WHEN training data is collected THEN the system SHALL track performance metrics for checkpoint decisions.
- WHEN the training system shuts down THEN it SHALL save final checkpoints before termination.
- WHEN training resumes THEN the system SHALL continue from the last saved checkpoint state.
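The lifecycle sketch below shows one way the EnhancedRealtimeTrainingSystem could wire in the CheckpointManager: load the best checkpoints at initialization, evaluate after each episode, and save on shutdown. The `load_best_checkpoint`, `is_improvement`, and `save_checkpoint` methods are assumed names, not the manager's documented interface.

```python
import atexit
from typing import Any, Dict


class CheckpointedTrainingSystem:
    """Wires checkpoint load/save into the training lifecycle (illustrative integration)."""

    def __init__(self, checkpoint_manager: Any, models: Dict[str, Any]):
        self.checkpoint_manager = checkpoint_manager
        self.models = models
        # Load the best available checkpoint for each model at startup.
        for name, model in self.models.items():
            self.checkpoint_manager.load_best_checkpoint(name, model)
        # Guarantee a final save before the process terminates.
        atexit.register(self.save_all, reason="shutdown")

    def on_episode_complete(self, name: str, loss: float) -> None:
        # Evaluate and persist improved models after each training episode.
        if self.checkpoint_manager.is_improvement(name, loss):
            self.checkpoint_manager.save_checkpoint(name, self.models[name],
                                                    metadata={"loss": loss})

    def save_all(self, reason: str) -> None:
        for name, model in self.models.items():
            self.checkpoint_manager.save_checkpoint(name, model,
                                                    metadata={"reason": reason})
```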
### Requirement 5: Complete Training Data Storage

**User Story:** As a developer, I want complete training episodes to be stored with full input dataframes, so that training can be replayed and analyzed with all original context.

#### Acceptance Criteria
- WHEN training episodes are saved THEN the system SHALL store the complete input dataframe with all model inputs (price data, indicators, market structure, etc.).
- WHEN model actions are recorded THEN the system SHALL store the full context that led to the decision, not just the action result.
- WHEN training cases are saved THEN they SHALL include timestamps, market conditions, and all feature vectors used by the models.
- WHEN storing training data THEN the system SHALL preserve the exact state that can be used to reproduce the model's decision.
- WHEN training episodes are replayed THEN the system SHALL be able to reconstruct the exact same inputs that were originally used.
- WHEN analyzing training performance THEN complete dataframes SHALL be available for debugging and improvement.
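A sketch of one possible on-disk layout for a complete, replayable training episode; the directory structure, metadata fields, and use of pickle for the input DataFrame are illustrative choices rather than a required schema:

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Tuple

import pandas as pd


@dataclass
class EpisodeRecord:
    """Decision context saved with every training episode (illustrative fields)."""
    model_name: str
    action: str
    confidence: float
    timestamp: float
    market_conditions: dict


def save_episode(base_dir: Path, record: EpisodeRecord, inputs: pd.DataFrame) -> Path:
    """Persist the decision metadata plus the exact input DataFrame used by the model."""
    episode_dir = base_dir / f"{record.model_name}_{int(record.timestamp * 1000)}"
    episode_dir.mkdir(parents=True, exist_ok=True)
    (episode_dir / "metadata.json").write_text(json.dumps(asdict(record), indent=2))
    inputs.to_pickle(episode_dir / "inputs.pkl")   # full dataframe, replayable as-is
    return episode_dir


def load_episode(episode_dir: Path) -> Tuple[EpisodeRecord, pd.DataFrame]:
    """Reconstruct the exact inputs and context of a stored episode for replay or analysis."""
    meta = json.loads((episode_dir / "metadata.json").read_text())
    return EpisodeRecord(**meta), pd.read_pickle(episode_dir / "inputs.pkl")
```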
### Requirement 6: Comprehensive Performance Tracking

**User Story:** As a system operator, I want detailed performance metrics to be tracked and persisted, so that I can monitor training progress and model improvement over time.

#### Acceptance Criteria
- WHEN models are trained THEN the system SHALL track loss values, accuracy metrics, and training timestamps.
- WHEN performance improves THEN the system SHALL log the improvement amount and save metadata.
- WHEN checkpoints are saved THEN the system SHALL store performance metrics alongside model weights.
- WHEN the system starts THEN it SHALL display the performance history of loaded checkpoints.
- WHEN multiple training sessions occur THEN the system SHALL maintain a continuous performance history.
- WHEN performance degrades THEN the system SHALL provide alerts and revert to better checkpoints if configured.
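A sketch of a persistent per-model performance history kept alongside the checkpoints; the JSON layout and field names are assumptions:

```python
import json
import logging
import time
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def record_metrics(history_file: Path, model_name: str,
                   loss: float, accuracy: Optional[float] = None) -> None:
    """Append a training result to the persistent history and log any improvement."""
    history = json.loads(history_file.read_text()) if history_file.exists() else {}
    entries = history.setdefault(model_name, [])
    prev_best = min((e["loss"] for e in entries), default=None)
    entries.append({"loss": loss, "accuracy": accuracy, "timestamp": time.time()})
    history_file.write_text(json.dumps(history, indent=2))
    if prev_best is not None and loss < prev_best:
        logger.info("%s improved: best loss %.6f -> %.6f", model_name, prev_best, loss)
```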
### Requirement 7: Robust Error Handling and Recovery

**User Story:** As a system administrator, I want checkpoint operations to be resilient to failures, so that training can continue even if individual checkpoint saves fail.

#### Acceptance Criteria
- WHEN checkpoint saving fails THEN the system SHALL log the error and continue training without crashing.
- WHEN disk space is insufficient THEN the system SHALL clean up old checkpoints and retry saving.
- WHEN checkpoint files are corrupted THEN the system SHALL fall back to previous valid checkpoints.
- WHEN concurrent access conflicts occur THEN the system SHALL use proper locking mechanisms.
- WHEN the system recovers from failures THEN it SHALL validate checkpoint integrity before loading.
- WHEN critical checkpoint operations fail repeatedly THEN the system SHALL alert administrators.
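A sketch of the failure-handling pattern these criteria describe: writes go to a temporary file and are renamed atomically, disk-full errors trigger a cleanup-and-retry, and any remaining failure is logged rather than allowed to crash training. Helper names and the retry policy are illustrative.

```python
import errno
import logging
import os
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)


def safe_save(write_fn: Callable[[Path], None], target: Path,
              cleanup_fn: Callable[[], None], retries: int = 1) -> bool:
    """Write a checkpoint atomically; clean up and retry on disk-full, never raise."""
    for attempt in range(retries + 1):
        tmp = target.with_suffix(".tmp")
        try:
            write_fn(tmp)                  # e.g. lambda p: torch.save(state_dict, p)
            os.replace(tmp, target)        # atomic rename on the same filesystem
            return True
        except OSError as exc:
            tmp.unlink(missing_ok=True)    # never leave a half-written file behind
            if exc.errno == errno.ENOSPC and attempt < retries:
                logger.warning("Disk full while saving %s; cleaning up and retrying", target)
                cleanup_fn()               # e.g. delete old periodic checkpoints
                continue
            logger.error("Checkpoint save failed for %s: %s", target, exc)
            return False                   # training continues without crashing
    return False
```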
### Requirement 8: Configuration and Monitoring

**User Story:** As a developer, I want configurable checkpoint settings and monitoring capabilities, so that I can optimize checkpoint behavior for different training scenarios.

#### Acceptance Criteria
- WHEN configuring the system THEN checkpoint saving frequency SHALL be adjustable.
- WHEN setting performance thresholds THEN the minimum improvement required for saving SHALL be configurable.
- WHEN monitoring training THEN checkpoint save events SHALL be visible in logs and dashboards.
- WHEN analyzing performance THEN checkpoint metadata SHALL be accessible for review.
- WHEN tuning the system THEN checkpoint storage limits SHALL be configurable.
- WHEN debugging issues THEN detailed checkpoint operation logs SHALL be available.
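A sketch of the settings implied above, gathered into one configuration object; every name and default is a placeholder:

```python
from dataclasses import dataclass


@dataclass
class CheckpointConfig:
    """Tunable checkpoint behaviour; all defaults are placeholders."""
    min_loss_improvement: float = 0.01     # relative improvement required to save (Requirement 2)
    periodic_save_iters: int = 500         # iterations between periodic saves (Requirement 3)
    periodic_save_seconds: float = 600.0   # or at most this many seconds apart
    max_checkpoints_per_model: int = 5     # storage limit before rotation
    keep_periodic: int = 3                 # rolling window of periodic checkpoints
    log_checkpoint_ops: bool = True        # emit detailed checkpoint operation logs
```

The improvement threshold of Requirement 2 and the periodic cadence of Requirement 3 would both be read from this single object.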
### Requirement 9: Backward Compatibility and Migration

**User Story:** As a user, I want existing checkpoints to remain compatible, so that current model progress is preserved when the checkpoint system is enhanced.

#### Acceptance Criteria
- WHEN the enhanced checkpoint system starts THEN it SHALL load existing checkpoints without issues.
- WHEN checkpoint formats are updated THEN migration utilities SHALL convert old formats.
- WHEN new metadata is added THEN existing checkpoints SHALL work with default values.
- WHEN the system upgrades THEN checkpoint directories SHALL be preserved and enhanced.
- WHEN rollback is needed THEN the system SHALL support reverting to previous checkpoint versions.
- WHEN compatibility issues arise THEN clear error messages SHALL guide resolution.
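A sketch of defaulting-and-migration for legacy checkpoint metadata, so existing checkpoints load with sensible defaults for any newly added fields; version numbers and field names are illustrative:

```python
from typing import Any, Dict

CURRENT_FORMAT_VERSION = 2

METADATA_DEFAULTS: Dict[str, Any] = {
    "format_version": 1,       # checkpoints written before versioning was introduced
    "loss": float("inf"),
    "reason": "legacy",
}


def migrate_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Upgrade checkpoint metadata to the current format while preserving existing values."""
    merged = {**METADATA_DEFAULTS, **metadata}   # missing fields fall back to defaults
    if merged["format_version"] < CURRENT_FORMAT_VERSION:
        # Real migration steps (field renames, unit changes, ...) would be applied
        # here, one version increment at a time.
        merged["format_version"] = CURRENT_FORMAT_VERSION
    return merged
```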