init

2025-03-18 09:23:09 +02:00
commit 3871afd4b8
100 changed files with 55180 additions and 0 deletions
--- a/MODEL_SAVING_FIX.md
+++ b/MODEL_SAVING_FIX.md
@ -0,0 +1,74 @@
+# Model Saving Fix
+
+## Issue
+
+During training sessions, PyTorch model saving operations sometimes fail with errors like:
+
+```
+RuntimeError: [enforce fail at inline_container.cc:626] . unexpected pos 18278784 vs 18278680
+```
+
+or 
+
+```
+RuntimeError: [enforce fail at inline_container.cc:820] . PytorchStreamWriter failed writing file data/75: file write failed
+```
+
+These errors occur in the PyTorch serialization mechanism when saving models using `torch.save()`.
+
+## Solution
+
+We've implemented a robust model saving approach that uses multiple fallback methods if the primary save operation fails:
+
+1. **Attempt 1**: Save to a backup file first, then copy to the target path.
+2. **Attempt 2**: Use an older pickle protocol (pickle protocol 2) which can be more compatible.
+3. **Attempt 3**: Save without the optimizer state, which can reduce file size and avoid serialization issues.
+4. **Attempt 4**: Use TorchScript's `torch.jit.save()` instead of `torch.save()`, which uses a different serialization mechanism.
+
+## Implementation
+
+The solution is implemented in two parts:
+
+1. A `robust_save` function that tries multiple saving approaches with fallbacks.
+2. A monkey patch that replaces the Agent's `save` method with our robust version.
+
+### Example Usage
+
+```python
+# Import the robust_save function
+from live_training import robust_save
+
+# Save a model with fallbacks
+success = robust_save(agent, "models/my_model.pt")
+if success:
+    print("Model saved successfully!")
+else:
+    print("All save attempts failed")
+```
+
+## Testing
+
+We've created a test script `test_save.py` that demonstrates the robust saving approach and verifies that it works correctly.
+
+To run the test:
+
+```bash
+python test_save.py
+```
+
+This script creates a simple model, attempts to save it using both the standard and robust methods, and reports on the results.
+
+## Future Improvements
+
+Possible future improvements to the model saving mechanism:
+
+1. Additional fallback methods like serializing individual neural network layers.
+2. Automatic retry mechanism with exponential backoff.
+3. Asynchronous saving to avoid blocking the training loop.
+4. Checksumming saved models to verify integrity.
+
+## Related Issues
+
+For more information on similar issues with PyTorch model saving, see:
+- https://github.com/pytorch/pytorch/issues/27736
+- https://github.com/pytorch/pytorch/issues/24045