gogo2/MODEL_SAVING_RECOMMENDATIONS.md
Dobromir Popov 3871afd4b8 init
2025-03-18 09:23:09 +02:00

Model Saving Recommendations

During training, several PyTorch model serialization errors were identified and fixed. Here's a summary of our findings and recommendations to ensure robust model saving:

Issues Found

  1. PyTorch Serialization Errors: Errors such as "PytorchStreamWriter failed writing file data..." and "unexpected pos..." indicate that PyTorch's serialization layer failed while writing the checkpoint file.

  2. Disk Space Issues: Our tests also produced "No space left on device" errors. A save interrupted this way can leave a truncated, corrupt model file.

  3. Compatibility Issues: Some serialization methods might not be compatible with specific PyTorch versions or environments.

Implemented Solutions

  1. Robust Save Function: We added a robust_save function that tries multiple saving approaches in sequence:

    • First attempt: Standard save to a backup file, then copy to the target path
    • Second attempt: Save with pickle protocol 2 (readable by older Python versions)
    • Third attempt: Save without the optimizer state (reduces file size)
    • Fourth attempt: Use TorchScript's torch.jit.save() (a different serialization mechanism)
  2. Memory Management: Implemented memory cleanup before saving:

    • Clearing GPU cache with torch.cuda.empty_cache()
    • Running garbage collection with gc.collect()
  3. Error Handling: Added comprehensive error handling around all saving operations.

  4. Circuit Breaker Pattern: Added circuit breakers to prevent consecutive failures during training.
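
The fallback sequence, memory cleanup, and circuit breaker listed above could be sketched roughly as follows. This is an illustrative sketch, not the project's actual robust_save: the names run_save_attempts, torch_save_attempts, free_memory, and SaveCircuitBreaker, the checkpoint dictionary keys, and the three-failure threshold are all assumptions. Unlike the description above, every attempt here (not only the first) is written to a backup file before being copied to the target path.

```python
import gc
import os
import shutil
from typing import Callable, Iterable, List, Tuple

# A named save attempt: (label, function that writes a checkpoint to the given path)
SaveAttempt = Tuple[str, Callable[[str], None]]


def free_memory() -> None:
    """Run garbage collection and, if PyTorch with CUDA is present, clear the GPU cache."""
    gc.collect()
    try:
        import torch  # lazy import so this helper also works without PyTorch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass


def run_save_attempts(path: str, attempts: Iterable[SaveAttempt]) -> str:
    """Try each attempt in order and return the label of the first one that succeeds."""
    backup_path = path + ".backup"
    errors: List[str] = []
    for name, save_fn in attempts:
        try:
            save_fn(backup_path)                # write to the backup file first
            shutil.copyfile(backup_path, path)  # then copy to the target path
            return name
        except Exception as exc:  # broad on purpose: any failure moves on to the next attempt
            errors.append(f"{name}: {exc}")
        finally:
            if os.path.exists(backup_path):
                os.remove(backup_path)
    raise RuntimeError("all save attempts failed: " + "; ".join(errors))


def torch_save_attempts(model, optimizer) -> List[SaveAttempt]:
    """Build the four attempts described above (hypothetical helper; requires PyTorch)."""
    import torch

    full = {"model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict()}
    model_only = {"model_state_dict": model.state_dict()}
    return [
        ("standard", lambda p: torch.save(full, p)),
        ("pickle_protocol_2", lambda p: torch.save(full, p, pickle_protocol=2)),
        ("no_optimizer", lambda p: torch.save(model_only, p)),
        # Scripting can fail for models TorchScript does not support
        ("torchscript", lambda p: torch.jit.save(torch.jit.script(model), p)),
    ]


class SaveCircuitBreaker:
    """Opens after `max_failures` consecutive failed saves (assumed policy)."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.max_failures

    def record(self, success: bool) -> None:
        # A success resets the counter; a failure increments it
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1
```

In a training loop, one way to combine these would be to call free_memory() before each save, skip the save while the breaker's is_open is true, and otherwise pass torch_save_attempts(model, optimizer) to run_save_attempts and record the outcome on the breaker. If the backup and target are on the same file system, os.replace could be used instead of a copy to make the final step atomic.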

Recommendations

  1. Disk Space: Keep at least 1-2 GB free on the drive that holds the checkpoints. Large models can need several GB.

  2. Checkpoint Cleanup: Periodically remove old checkpoints to free up space, for example with PowerShell:

    # Example script to keep only the most recent 5 checkpoints
    Get-ChildItem -Path .\models\trading_agent_checkpoint_*.pt | 
      Sort-Object LastWriteTime -Descending | 
      Select-Object -Skip 5 | 
      Remove-Item
    
  3. File System Check: If errors persist, check the file system for errors or corruption.

  4. Use Smaller Models: Consider reducing model size if saving large models is problematic.

  5. Alternative Serialization: For very large models, consider saving key parameters (for example, individual entries of the state_dict) separately rather than the entire model.

  6. Training Stability: Use our improved training functions with memory management and error handling.
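
The disk space and cleanup recommendations above can also be applied from Python, which may be convenient on systems without PowerShell. The following is a minimal sketch under stated assumptions: the function names has_free_space and prune_checkpoints are hypothetical, the 2 GiB default is taken from the upper end of the guideline above, and the file pattern is the one used in the PowerShell example.

```python
import shutil
from pathlib import Path
from typing import List


def has_free_space(directory: str, required_bytes: int = 2 * 1024**3) -> bool:
    """Return True if the file system holding `directory` has at least `required_bytes` free."""
    return shutil.disk_usage(directory).free >= required_bytes


def prune_checkpoints(directory: str,
                      pattern: str = "trading_agent_checkpoint_*.pt",
                      keep: int = 5) -> List[Path]:
    """Delete all but the `keep` most recently modified files matching `pattern`.

    Returns the list of paths that were deleted.
    """
    checkpoints = sorted(Path(directory).glob(pattern),
                         key=lambda p: p.stat().st_mtime,
                         reverse=True)  # newest first
    stale = checkpoints[keep:]
    for old in stale:
        old.unlink()
    return stale
```

A training script could call has_free_space("models") before each save and run prune_checkpoints("models") when it returns False, then check again before writing.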

How to Test Model Saving

We've provided a test script, test_model_save_load.py, that verifies whether model saving works correctly. Run it with:

python test_model_save_load.py

Or test all robust save methods with:

python test_model_save_load.py --test_robust

Future Development

  1. Checksumming: Add checksums to saved models to verify integrity.

  2. Compression: Implement model compression to reduce file size.

  3. Distributed Saving: For very large models, explore distributed saving mechanisms.

  4. Format Conversion: Add ability to save models in ONNX or other portable formats.
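
As a starting point for the checksumming item, one possible approach is to store a SHA-256 digest in a sidecar file next to each checkpoint and compare it before loading. This is only a sketch: the function names and the ".sha256" suffix are assumptions, not part of the current code base.

```python
import hashlib
from pathlib import Path


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(path: str) -> Path:
    """Write the digest of `path` to a sidecar file named `<path>.sha256`."""
    checksum_path = Path(str(path) + ".sha256")
    checksum_path.write_text(file_sha256(path))
    return checksum_path


def verify_checksum(path: str) -> bool:
    """Return True only if the sidecar file exists and matches the current file contents."""
    checksum_path = Path(str(path) + ".sha256")
    if not checksum_path.exists():
        return False
    return checksum_path.read_text().strip() == file_sha256(path)
```

Calling write_checksum right after a successful save and verify_checksum before loading would detect a checkpoint that was truncated or modified after it was written.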