# Requirements Document

## Introduction
The Checkpoint Persistence Fix addresses a critical system flaw: model training progress is never saved during training, so all learning is lost when the system restarts. Although the system has a well-implemented CheckpointManager and correctly loads checkpoints at startup, it never saves checkpoints during training operations. As a result, models train continuously but never persist their improved weights, making continuous improvement impossible and wasting computational resources.
## Requirements
### Requirement 1: Real-time Checkpoint Saving During Training

**User Story:** As a system operator, I want model improvements to be automatically saved during training, so that training progress is never lost when the system restarts.

#### Acceptance Criteria
- WHEN the DQN model is trained in _train_models_on_decision THEN the system SHALL save a checkpoint if the loss improves.
- WHEN the CNN model is trained THEN the system SHALL save a checkpoint if the loss improves.
- WHEN the COB RL model is trained THEN the system SHALL save a checkpoint if the loss improves.
- WHEN the Extrema trainer is trained THEN the system SHALL save a checkpoint if the loss improves.
- WHEN any model training completes THEN the system SHALL compare current performance to best performance and save if improved.
- WHEN checkpoint saving occurs THEN the system SHALL update the model_states dictionary with new performance metrics.
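The following non-normative sketch illustrates the hook these criteria call for. It assumes the existing CheckpointManager exposes a save method along the lines of `save_checkpoint(model_name, model, metadata)` and that `model_states` maps model names to tracked metrics; both the method signature and the `ModelState` fields are illustrative, not the project's confirmed API.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ModelState:
    """Per-model performance entry kept in the model_states dictionary (illustrative fields)."""
    best_loss: float = float("inf")
    last_loss: float = float("inf")
    checkpoint_count: int = 0


def save_if_improved(checkpoint_manager: Any,
                     model_states: Dict[str, ModelState],
                     model_name: str,           # e.g. "dqn", "cnn", "cob_rl", "extrema"
                     model: Any,
                     current_loss: float) -> bool:
    """Save a checkpoint when the model's loss improves, and update model_states."""
    state = model_states.setdefault(model_name, ModelState())
    state.last_loss = current_loss
    if current_loss < state.best_loss:
        # Assumed call signature; adapt to the real CheckpointManager API.
        checkpoint_manager.save_checkpoint(
            model_name=model_name,
            model=model,
            metadata={"loss": current_loss, "reason": "loss_improved"},
        )
        state.best_loss = current_loss
        state.checkpoint_count += 1
        return True
    return False
```

The same helper could be called from each of the four training paths (DQN, CNN, COB RL, Extrema trainer) once their respective loss values are available.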
### Requirement 2: Performance-Based Checkpoint Management

**User Story:** As a developer, I want checkpoints to be saved only when model performance improves, so that storage is used efficiently and only the best models are preserved.

#### Acceptance Criteria
- WHEN evaluating whether to save a checkpoint THEN the system SHALL compare current loss to the best recorded loss.
- WHEN loss decreases by a configurable threshold THEN the system SHALL trigger checkpoint saving.
- WHEN multiple models are trained simultaneously THEN each model SHALL have independent performance tracking.
- WHEN checkpoint rotation occurs THEN the system SHALL keep only the best performing checkpoints.
- WHEN performance metrics are updated THEN the system SHALL log the improvement for monitoring.
- WHEN no improvement is detected THEN the system SHALL skip checkpoint saving to avoid unnecessary I/O.
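One possible shape for the threshold check and independent per-model tracking described above, with placeholder defaults:

```python
from typing import Dict


class ImprovementTracker:
    """Tracks the best loss per model and decides whether a checkpoint save is warranted."""

    def __init__(self, min_improvement: float = 0.01):
        # Minimum relative loss reduction required before saving (configurable).
        self.min_improvement = min_improvement
        self.best_loss: Dict[str, float] = {}

    def should_save(self, model_name: str, current_loss: float) -> bool:
        best = self.best_loss.get(model_name)
        if best is None:
            # First observation for this model: record it and save.
            self.best_loss[model_name] = current_loss
            return True
        improved = (best - current_loss) / max(abs(best), 1e-12) >= self.min_improvement
        if improved:
            self.best_loss[model_name] = current_loss
        return improved
```

A relative threshold is used here so one setting works across models whose losses sit on different scales; an absolute threshold is equally valid if that better fits the existing metrics.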
### Requirement 3: Periodic Checkpoint Saving

**User Story:** As a system administrator, I want checkpoints to be saved periodically regardless of performance, so that progress is preserved even during long training sessions without significant improvement.

#### Acceptance Criteria
- WHEN a configurable number of training iterations have passed THEN the system SHALL save a checkpoint regardless of performance.
- WHEN periodic saving occurs THEN the system SHALL use a separate checkpoint category to distinguish from performance-based saves.
- WHEN the system runs for extended periods THEN periodic checkpoints SHALL ensure no more than X minutes of training progress can be lost.
- WHEN periodic checkpoints accumulate THEN the system SHALL maintain a rolling window of recent saves.
- WHEN storage space is limited THEN periodic checkpoints SHALL be cleaned up while preserving performance-based checkpoints.
- WHEN the system restarts THEN it SHALL load the most recent checkpoint (either performance-based or periodic).
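A minimal sketch of a periodic saver, assuming periodic checkpoints are distinguished from performance-based ones by a `periodic_` file prefix and written through a caller-supplied `save_fn`; both choices are illustrative:

```python
import time
from collections import deque
from pathlib import Path
from typing import Callable


class PeriodicCheckpointer:
    """Saves a periodic checkpoint every N iterations or M seconds, keeping the last K files."""

    def __init__(self, save_fn: Callable[[Path], None], every_n_iters: int = 500,
                 every_seconds: float = 600.0, keep_last: int = 5):
        self.save_fn = save_fn            # callable that writes a checkpoint to the given path
        self.every_n_iters = every_n_iters
        self.every_seconds = every_seconds
        self.keep_last = keep_last
        self._iters_since_save = 0
        self._last_save_time = time.monotonic()
        self._saved = deque()             # periodic checkpoint paths, oldest first

    def step(self, checkpoint_dir: Path) -> None:
        """Call once per training iteration; saves when the iteration or time budget is exceeded."""
        self._iters_since_save += 1
        due = (self._iters_since_save >= self.every_n_iters
               or time.monotonic() - self._last_save_time >= self.every_seconds)
        if not due:
            return
        path = checkpoint_dir / f"periodic_{int(time.time())}.pt"
        self.save_fn(path)
        self._saved.append(path)
        self._iters_since_save = 0
        self._last_save_time = time.monotonic()
        # Rolling window: drop the oldest periodic checkpoints only; performance-based
        # checkpoints live under a different prefix and are never touched here.
        while len(self._saved) > self.keep_last:
            self._saved.popleft().unlink(missing_ok=True)
```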
### Requirement 4: Enhanced Training System Integration

**User Story:** As a developer, I want the EnhancedRealtimeTrainingSystem to properly save checkpoints, so that continuous learning progress is preserved across system restarts.

#### Acceptance Criteria
- WHEN the EnhancedRealtimeTrainingSystem trains models THEN it SHALL integrate with the CheckpointManager.
- WHEN training episodes complete THEN the system SHALL evaluate and save improved models.
- WHEN the training system initializes THEN it SHALL load the best available checkpoints.
- WHEN training data is collected THEN the system SHALL track performance metrics for checkpoint decisions.
- WHEN the training system shuts down THEN it SHALL save final checkpoints before termination.
- WHEN training resumes THEN the system SHALL continue from the last saved checkpoint state.
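The lifecycle sketch below shows one way the EnhancedRealtimeTrainingSystem could wire in the CheckpointManager: load the best checkpoints at initialization, evaluate after each episode, and save on shutdown. The `load_best_checkpoint`, `is_improvement`, and `save_checkpoint` methods are assumed names, not the manager's documented interface.

```python
import atexit
from typing import Any, Dict


class CheckpointedTrainingSystem:
    """Wires checkpoint load/save into the training lifecycle (illustrative integration)."""

    def __init__(self, checkpoint_manager: Any, models: Dict[str, Any]):
        self.checkpoint_manager = checkpoint_manager
        self.models = models
        # Load the best available checkpoint for each model at startup.
        for name, model in self.models.items():
            self.checkpoint_manager.load_best_checkpoint(name, model)
        # Guarantee a final save before the process terminates.
        atexit.register(self.save_all, reason="shutdown")

    def on_episode_complete(self, name: str, loss: float) -> None:
        # Evaluate and persist improved models after each training episode.
        if self.checkpoint_manager.is_improvement(name, loss):
            self.checkpoint_manager.save_checkpoint(name, self.models[name],
                                                    metadata={"loss": loss})

    def save_all(self, reason: str) -> None:
        for name, model in self.models.items():
            self.checkpoint_manager.save_checkpoint(name, model,
                                                    metadata={"reason": reason})
```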
### Requirement 5: Complete Training Data Storage

**User Story:** As a developer, I want complete training episodes to be stored with full input dataframes, so that training can be replayed and analyzed with all original context.

#### Acceptance Criteria
- WHEN training episodes are saved THEN the system SHALL store the complete input dataframe with all model inputs (price data, indicators, market structure, etc.).
- WHEN model actions are recorded THEN the system SHALL store the full context that led to the decision, not just the action result.
- WHEN training cases are saved THEN they SHALL include timestamps, market conditions, and all feature vectors used by the models.
- WHEN storing training data THEN the system SHALL preserve the exact state that can be used to reproduce the model's decision.
- WHEN training episodes are replayed THEN the system SHALL be able to reconstruct the exact same inputs that were originally used.
- WHEN analyzing training performance THEN complete dataframes SHALL be available for debugging and improvement.
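A sketch of one possible on-disk layout for a complete, replayable training episode; the directory structure, metadata fields, and use of pickle for the input DataFrame are illustrative choices rather than a required schema:

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Tuple

import pandas as pd


@dataclass
class EpisodeRecord:
    """Decision context saved with every training episode (illustrative fields)."""
    model_name: str
    action: str
    confidence: float
    timestamp: float
    market_conditions: dict


def save_episode(base_dir: Path, record: EpisodeRecord, inputs: pd.DataFrame) -> Path:
    """Persist the decision metadata plus the exact input DataFrame used by the model."""
    episode_dir = base_dir / f"{record.model_name}_{int(record.timestamp * 1000)}"
    episode_dir.mkdir(parents=True, exist_ok=True)
    (episode_dir / "metadata.json").write_text(json.dumps(asdict(record), indent=2))
    inputs.to_pickle(episode_dir / "inputs.pkl")   # full dataframe, replayable as-is
    return episode_dir


def load_episode(episode_dir: Path) -> Tuple[EpisodeRecord, pd.DataFrame]:
    """Reconstruct the exact inputs and context of a stored episode for replay or analysis."""
    meta = json.loads((episode_dir / "metadata.json").read_text())
    return EpisodeRecord(**meta), pd.read_pickle(episode_dir / "inputs.pkl")
```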
### Requirement 6: Comprehensive Performance Tracking

**User Story:** As a system operator, I want detailed performance metrics to be tracked and persisted, so that I can monitor training progress and model improvement over time.

#### Acceptance Criteria
- WHEN models are trained THEN the system SHALL track loss values, accuracy metrics, and training timestamps.
- WHEN performance improves THEN the system SHALL log the improvement amount and save metadata.
- WHEN checkpoints are saved THEN the system SHALL store performance metrics alongside model weights.
- WHEN the system starts THEN it SHALL display the performance history of loaded checkpoints.
- WHEN multiple training sessions occur THEN the system SHALL maintain a continuous performance history.
- WHEN performance degrades THEN the system SHALL provide alerts and revert to better checkpoints if configured.
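A sketch of a persistent per-model performance history kept alongside the checkpoints; the JSON layout and field names are assumptions:

```python
import json
import logging
import time
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def record_metrics(history_file: Path, model_name: str,
                   loss: float, accuracy: Optional[float] = None) -> None:
    """Append a training result to the persistent history and log any improvement."""
    history = json.loads(history_file.read_text()) if history_file.exists() else {}
    entries = history.setdefault(model_name, [])
    prev_best = min((e["loss"] for e in entries), default=None)
    entries.append({"loss": loss, "accuracy": accuracy, "timestamp": time.time()})
    history_file.write_text(json.dumps(history, indent=2))
    if prev_best is not None and loss < prev_best:
        logger.info("%s improved: best loss %.6f -> %.6f", model_name, prev_best, loss)
```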
### Requirement 7: Robust Error Handling and Recovery

**User Story:** As a system administrator, I want checkpoint operations to be resilient to failures, so that training can continue even if individual checkpoint saves fail.

#### Acceptance Criteria
- WHEN checkpoint saving fails THEN the system SHALL log the error and continue training without crashing.
- WHEN disk space is insufficient THEN the system SHALL clean up old checkpoints and retry saving.
- WHEN checkpoint files are corrupted THEN the system SHALL fall back to previous valid checkpoints.
- WHEN concurrent access conflicts occur THEN the system SHALL use proper locking mechanisms.
- WHEN the system recovers from failures THEN it SHALL validate checkpoint integrity before loading.
- WHEN critical checkpoint operations fail repeatedly THEN the system SHALL alert administrators.
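A sketch of the failure-handling pattern these criteria describe: writes go to a temporary file and are renamed atomically, disk-full errors trigger a cleanup-and-retry, and any remaining failure is logged rather than allowed to crash training. Helper names and the retry policy are illustrative.

```python
import errno
import logging
import os
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)


def safe_save(write_fn: Callable[[Path], None], target: Path,
              cleanup_fn: Callable[[], None], retries: int = 1) -> bool:
    """Write a checkpoint atomically; clean up and retry on disk-full, never raise."""
    for attempt in range(retries + 1):
        tmp = target.with_suffix(".tmp")
        try:
            write_fn(tmp)                  # e.g. lambda p: torch.save(state_dict, p)
            os.replace(tmp, target)        # atomic rename on the same filesystem
            return True
        except OSError as exc:
            tmp.unlink(missing_ok=True)    # never leave a half-written file behind
            if exc.errno == errno.ENOSPC and attempt < retries:
                logger.warning("Disk full while saving %s; cleaning up and retrying", target)
                cleanup_fn()               # e.g. delete old periodic checkpoints
                continue
            logger.error("Checkpoint save failed for %s: %s", target, exc)
            return False                   # training continues without crashing
    return False
```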
### Requirement 8: Configuration and Monitoring

**User Story:** As a developer, I want configurable checkpoint settings and monitoring capabilities, so that I can optimize checkpoint behavior for different training scenarios.

#### Acceptance Criteria
- WHEN configuring the system THEN checkpoint saving frequency SHALL be adjustable.
- WHEN setting performance thresholds THEN the minimum improvement required for saving SHALL be configurable.
- WHEN monitoring training THEN checkpoint save events SHALL be visible in logs and dashboards.
- WHEN analyzing performance THEN checkpoint metadata SHALL be accessible for review.
- WHEN tuning the system THEN checkpoint storage limits SHALL be configurable.
- WHEN debugging issues THEN detailed checkpoint operation logs SHALL be available.
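A sketch of the settings implied above, gathered into one configuration object; every name and default is a placeholder:

```python
from dataclasses import dataclass


@dataclass
class CheckpointConfig:
    """Tunable checkpoint behaviour; all defaults are placeholders."""
    min_loss_improvement: float = 0.01     # relative improvement required to save (Requirement 2)
    periodic_save_iters: int = 500         # iterations between periodic saves (Requirement 3)
    periodic_save_seconds: float = 600.0   # or at most this many seconds apart
    max_checkpoints_per_model: int = 5     # storage limit before rotation
    keep_periodic: int = 3                 # rolling window of periodic checkpoints
    log_checkpoint_ops: bool = True        # emit detailed checkpoint operation logs
```

The improvement threshold of Requirement 2 and the periodic cadence of Requirement 3 would both be read from this single object.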
### Requirement 9: Backward Compatibility and Migration

**User Story:** As a user, I want existing checkpoints to remain compatible, so that current model progress is preserved when the checkpoint system is enhanced.

#### Acceptance Criteria
- WHEN the enhanced checkpoint system starts THEN it SHALL load existing checkpoints without issues.
- WHEN checkpoint formats are updated THEN migration utilities SHALL convert old formats.
- WHEN new metadata is added THEN existing checkpoints SHALL work with default values.
- WHEN the system upgrades THEN checkpoint directories SHALL be preserved and enhanced.
- WHEN rollback is needed THEN the system SHALL support reverting to previous checkpoint versions.
- WHEN compatibility issues arise THEN clear error messages SHALL guide resolution.
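A sketch of defaulting-and-migration for legacy checkpoint metadata, so existing checkpoints load with sensible defaults for any newly added fields; version numbers and field names are illustrative:

```python
from typing import Any, Dict

CURRENT_FORMAT_VERSION = 2

METADATA_DEFAULTS: Dict[str, Any] = {
    "format_version": 1,       # checkpoints written before versioning was introduced
    "loss": float("inf"),
    "reason": "legacy",
}


def migrate_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Upgrade checkpoint metadata to the current format while preserving existing values."""
    merged = {**METADATA_DEFAULTS, **metadata}   # missing fields fall back to defaults
    if merged["format_version"] < CURRENT_FORMAT_VERSION:
        # Real migration steps (field renames, unit changes, ...) would be applied
        # here, one version increment at a time.
        merged["format_version"] = CURRENT_FORMAT_VERSION
    return merged
```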