sqlite for checkpoints, cleanup
docs/logging_system_upgrade.md (new file, 280 lines)
# Trading System Logging Upgrade

## Overview

This upgrade implements a comprehensive logging and metadata management system that addresses four key issues:

1. **Eliminates scattered "No checkpoints found" logs** during runtime
2. **Provides fast checkpoint metadata access** without loading full models
3. **Centralizes inference logging** with database and text file storage
4. **Adds structured tracking** of model performance and checkpoints

## Key Components
### 1. Database Manager (`utils/database_manager.py`)

**Purpose**: SQLite-based storage for structured data

**Features**:
- Inference record logging with deduplication
- Checkpoint metadata storage (separate from model weights)
- Model performance tracking
- Fast queries without loading model files

**Tables**:
- `inference_records`: All model predictions with metadata
- `checkpoint_metadata`: Checkpoint info without model weights
- `model_performance`: Daily aggregated statistics
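The three tables could be created along the following lines. This is an illustrative sketch only: the column sets are assumptions, not the actual schema in `utils/database_manager.py`.

```python
import sqlite3

# Hypothetical schema; column names are illustrative and may differ
# from the real utils/database_manager.py implementation.
SCHEMA = """
CREATE TABLE IF NOT EXISTS inference_records (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT NOT NULL,
    model_name TEXT NOT NULL,
    symbol TEXT,
    action TEXT,
    confidence REAL,
    feature_hash TEXT,          -- supports deduplication
    processing_time_ms REAL,
    checkpoint_id TEXT
);
CREATE TABLE IF NOT EXISTS checkpoint_metadata (
    checkpoint_id TEXT PRIMARY KEY,
    model_name TEXT NOT NULL,
    file_path TEXT,
    performance_metrics TEXT,   -- JSON blob, no model weights
    is_active INTEGER DEFAULT 1
);
CREATE TABLE IF NOT EXISTS model_performance (
    model_name TEXT NOT NULL,
    date TEXT NOT NULL,
    inference_count INTEGER,
    avg_confidence REAL,
    PRIMARY KEY (model_name, date)
);
"""

def init_db(path="data/trading_system.db"):
    """Open (or create) the database and ensure all tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```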
### 2. Inference Logger (`utils/inference_logger.py`)

**Purpose**: Centralized logging for all model inferences

**Features**:
- Single function call replaces scattered `logger.info()` calls
- Automatic feature hashing for deduplication
- Memory usage tracking
- Processing time measurement
- Dual storage (database + text files)
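Feature hashing for deduplication could work along these lines; this is a sketch, not the actual `inference_logger.py` implementation, and the helper name `hash_features` is hypothetical.

```python
import hashlib
import json

def hash_features(features) -> str:
    """Return a short, stable digest of an input-feature vector.

    Two inferences over identical features produce the same hash,
    so duplicate records can be skipped or collapsed in the database.
    """
    # Round to tolerate tiny float noise, then serialize deterministically.
    canonical = json.dumps([round(float(x), 6) for x in features])
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```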
**Usage**:
```python
from utils.inference_logger import log_model_inference

log_model_inference(
    model_name="dqn_agent",
    symbol="ETH/USDT",
    action="BUY",
    confidence=0.85,
    probabilities={"BUY": 0.85, "SELL": 0.10, "HOLD": 0.05},
    input_features=features_array,
    processing_time_ms=12.5,
    checkpoint_id="dqn_agent_20250725_143500"
)
```
### 3. Text Logger (`utils/text_logger.py`)

**Purpose**: Human-readable log files for tracking

**Features**:
- Separate files for different event types
- Clean, tabular format
- Automatic cleanup of old entries
- Easy to read and grep

**Files**:
- `logs/inference_records.txt`: All model predictions
- `logs/checkpoint_events.txt`: Save/load events
- `logs/system_events.txt`: General system events
### 4. Enhanced Checkpoint Manager (`utils/checkpoint_manager.py`)

**Purpose**: Improved checkpoint handling with metadata separation

**Features**:
- Database-backed metadata storage
- Fast metadata queries without loading models
- Eliminates "No checkpoints found" spam
- Backward compatibility with existing code
## Benefits

### 1. Performance Improvements

**Before**: Loading the full checkpoint just to get its metadata
```python
# Old way - loads the entire model!
checkpoint_path, metadata = load_best_checkpoint("dqn_agent")
loss = metadata.loss  # Expensive operation
```

**After**: Fast metadata access from the database
```python
# New way - database query only
metadata = db_manager.get_best_checkpoint_metadata("dqn_agent")
loss = metadata.performance_metrics['loss']  # Fast!
```
### 2. Cleaner Runtime Logs

**Before**: Scattered logs everywhere
```
2025-07-25 14:34:39,749 - utils.checkpoint_manager - INFO - No checkpoints found for dqn_agent
2025-07-25 14:34:39,754 - utils.checkpoint_manager - INFO - No checkpoints found for enhanced_cnn
2025-07-25 14:34:39,756 - utils.checkpoint_manager - INFO - No checkpoints found for extrema_trainer
```

**After**: Clean, structured logging
```
2025-07-25 14:34:39 | dqn_agent | ETH/USDT | BUY | conf=0.850 | time= 12.5ms [checkpoint: dqn_agent_20250725_143500]
2025-07-25 14:34:40 | enhanced_cnn | ETH/USDT | HOLD | conf=0.720 | time=  8.2ms [checkpoint: enhanced_cnn_20250725_143501]
```
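A line in the format above can be produced with a single f-string; the function below is an illustrative sketch, not the project's actual formatter.

```python
from datetime import datetime

def format_inference_line(model_name, symbol, action, confidence,
                          time_ms, checkpoint_id, ts=None):
    """Render one inference as a single pipe-delimited log line."""
    ts = ts or datetime.now()
    # {time_ms:5.1f} right-aligns the duration so columns line up when grepping.
    return (f"{ts:%Y-%m-%d %H:%M:%S} | {model_name} | {symbol} | {action} | "
            f"conf={confidence:.3f} | time={time_ms:5.1f}ms "
            f"[checkpoint: {checkpoint_id}]")
```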
### 3. Structured Data Storage

**Example Queries**:
```sql
-- Fast metadata queries
SELECT * FROM checkpoint_metadata WHERE model_name = 'dqn_agent' AND is_active = TRUE;

-- Performance analysis
SELECT model_name, AVG(confidence), COUNT(*)
FROM inference_records
WHERE timestamp > datetime('now', '-24 hours')
GROUP BY model_name;
```
### 4. Easy Integration

**In Model Code**:
```python
# Replace scattered logging
# OLD: logger.info(f"DQN prediction: {action} confidence={conf}")

# NEW: Centralized logging
self.orchestrator.log_model_inference(
    model_name="dqn_agent",
    symbol=symbol,
    action=action,
    confidence=confidence,
    probabilities=probs,
    input_features=features,
    processing_time_ms=processing_time
)
```
## Implementation Guide

### 1. Update Model Classes

Add inference logging to prediction methods:

```python
import time

class DQNAgent:
    def predict(self, state):
        start_time = time.time()

        # Your prediction logic here
        action = self._predict_action(state)
        confidence = self._calculate_confidence()

        processing_time = (time.time() - start_time) * 1000

        # Log the inference
        self.orchestrator.log_model_inference(
            model_name="dqn_agent",
            symbol=self.symbol,
            action=action,
            confidence=confidence,
            probabilities=self.action_probabilities,
            input_features=state,
            processing_time_ms=processing_time,
            checkpoint_id=self.current_checkpoint_id
        )

        return action
```
### 2. Update Checkpoint Saving

Use the enhanced checkpoint manager:

```python
from utils.checkpoint_manager import save_checkpoint

# Save with metadata
checkpoint_metadata = save_checkpoint(
    model=self.model,
    model_name="dqn_agent",
    model_type="rl",
    performance_metrics={"loss": 0.0234, "accuracy": 0.87},
    training_metadata={"epochs": 100, "lr": 0.001}
)
```
### 3. Fast Metadata Access

Get checkpoint info without loading models:

```python
# Fast metadata access
metadata = orchestrator.get_checkpoint_metadata_fast("dqn_agent")
if metadata:
    current_loss = metadata.performance_metrics['loss']
    checkpoint_id = metadata.checkpoint_id
```
## Migration Steps

1. **Install new dependencies** (if any)
2. **Update model classes** to use centralized logging
3. **Replace checkpoint loading** with database queries where possible
4. **Remove scattered `logger.info()` calls** for inferences
5. **Test with the demo script**: `python demo_logging_system.py`
## File Structure

```
utils/
├── database_manager.py     # SQLite database management
├── inference_logger.py     # Centralized inference logging
├── text_logger.py          # Human-readable text logs
└── checkpoint_manager.py   # Enhanced checkpoint handling

logs/                       # Text log files
├── inference_records.txt
├── checkpoint_events.txt
└── system_events.txt

data/
└── trading_system.db       # SQLite database

demo_logging_system.py      # Demonstration script
```
## Monitoring and Maintenance

### Daily Tasks
- Check `logs/inference_records.txt` for recent activity
- Monitor database size: `ls -lh data/trading_system.db`

### Weekly Tasks
- Run cleanup: `inference_logger.cleanup_old_logs(days_to_keep=30)`
- Check model performance trends in the database

### Monthly Tasks
- Archive old log files
- Analyze model performance statistics
- Review checkpoint storage usage
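The weekly text-log cleanup could be implemented as a simple trim of each file to its most recent lines. A minimal sketch, assuming line-oriented log files; the real `cleanup_old_logs` may filter by timestamp instead, and the helper name `trim_log` is hypothetical:

```python
from pathlib import Path

def trim_log(path, max_lines=10000):
    """Keep only the last max_lines lines of a text log file.

    Returns the number of lines remaining after the trim.
    """
    p = Path(path)
    if not p.exists():
        return 0
    lines = p.read_text().splitlines(keepends=True)
    if len(lines) <= max_lines:
        return len(lines)
    p.write_text("".join(lines[-max_lines:]))
    return max_lines
```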
## Troubleshooting

### Common Issues

1. **Database locked**: Multiple processes accessing SQLite
   - Solution: Use a connection timeout and proper context managers

2. **Log files growing too large**:
   - Solution: Run `text_logger.cleanup_old_logs(max_lines=10000)`

3. **Missing checkpoint metadata**:
   - Solution: The system falls back to the file-based approach automatically
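For the "database locked" case, the standard-library `sqlite3` module already supports both remedies; a minimal sketch (the helper name `run_query` is illustrative):

```python
import sqlite3
from contextlib import closing

def run_query(db_path, sql, params=()):
    """Execute a query with a busy timeout and automatic cleanup.

    timeout=30 makes SQLite wait up to 30 seconds for another writer's
    lock to clear instead of raising "database is locked" immediately;
    the context managers guarantee the connection is always released.
    """
    with closing(sqlite3.connect(db_path, timeout=30)) as conn:
        with conn:  # commits on success, rolls back on error
            return conn.execute(sql, params).fetchall()
```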
### Debug Commands

```python
# Check database status
db_manager = get_database_manager()
checkpoints = db_manager.list_checkpoints("dqn_agent")

# Check recent inferences
inference_logger = get_inference_logger()
stats = inference_logger.get_model_stats("dqn_agent", hours=24)

# View text logs
text_logger = get_text_logger()
recent = text_logger.get_recent_inferences(lines=50)
```

This upgrade provides a solid foundation for tracking model performance, eliminating log spam, and enabling fast metadata access without the overhead of loading full model checkpoints.