sqlite for checkpoints, cleanup

This commit is contained in:
Dobromir Popov
2025-07-25 22:34:13 +03:00
parent 130a52fb9b
commit dd9f4b63ba
42 changed files with 2017 additions and 1485 deletions

# Trading System Logging Upgrade
## Overview
This upgrade implements a comprehensive logging and metadata management system that addresses the key issues:
1. **Eliminates scattered "No checkpoints found" logs** during runtime
2. **Fast checkpoint metadata access** without loading full models
3. **Centralized inference logging** with database and text file storage
4. **Structured tracking** of model performance and checkpoints
## Key Components
### 1. Database Manager (`utils/database_manager.py`)
**Purpose**: SQLite-based storage for structured data
**Features**:
- Inference records logging with deduplication
- Checkpoint metadata storage (separate from model weights)
- Model performance tracking
- Fast queries without loading model files
**Tables**:
- `inference_records`: All model predictions with metadata
- `checkpoint_metadata`: Checkpoint info without model weights
- `model_performance`: Daily aggregated statistics
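A minimal sketch of what these three tables might look like. The column names here are assumptions for illustration, not the actual schema in `utils/database_manager.py`:

```python
import sqlite3

# Hypothetical schema sketch; column names are assumptions,
# not the actual definitions in utils/database_manager.py.
SCHEMA = """
CREATE TABLE IF NOT EXISTS inference_records (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    model_name TEXT NOT NULL,
    symbol TEXT,
    action TEXT,
    confidence REAL,
    feature_hash TEXT,          -- used for deduplication
    checkpoint_id TEXT,
    timestamp TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS checkpoint_metadata (
    checkpoint_id TEXT PRIMARY KEY,
    model_name TEXT NOT NULL,
    performance_metrics TEXT,   -- JSON blob, no model weights
    is_active INTEGER DEFAULT 1
);
CREATE TABLE IF NOT EXISTS model_performance (
    model_name TEXT,
    date TEXT,
    avg_confidence REAL,
    inference_count INTEGER,
    PRIMARY KEY (model_name, date)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```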
### 2. Inference Logger (`utils/inference_logger.py`)
**Purpose**: Centralized logging for all model inferences
**Features**:
- Single function call replaces scattered `logger.info()` calls
- Automatic feature hashing for deduplication
- Memory usage tracking
- Processing time measurement
- Dual storage (database + text files)
**Usage**:
```python
from utils.inference_logger import log_model_inference

log_model_inference(
    model_name="dqn_agent",
    symbol="ETH/USDT",
    action="BUY",
    confidence=0.85,
    probabilities={"BUY": 0.85, "SELL": 0.10, "HOLD": 0.05},
    input_features=features_array,
    processing_time_ms=12.5,
    checkpoint_id="dqn_agent_20250725_143500"
)
```
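The automatic feature hashing for deduplication could work along these lines: a stable digest of the raw feature bytes, so identical inputs are recognized as repeats. This is a hypothetical sketch (`hash_features` is not the actual helper in `utils/inference_logger.py`), assuming features arrive as NumPy arrays:

```python
import hashlib

import numpy as np

def hash_features(features: np.ndarray) -> str:
    """Hypothetical dedup helper: stable digest of the raw feature bytes."""
    return hashlib.sha256(np.ascontiguousarray(features).tobytes()).hexdigest()[:16]

# Identical inputs hash identically, so repeated inferences on the
# same features can be collapsed into one record.
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 3.0])
assert hash_features(a) == hash_features(b)
```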
### 3. Text Logger (`utils/text_logger.py`)
**Purpose**: Human-readable log files for tracking
**Features**:
- Separate files for different event types
- Clean, tabular format
- Automatic cleanup of old entries
- Easy to read and grep
**Files**:
- `logs/inference_records.txt`: All model predictions
- `logs/checkpoint_events.txt`: Save/load events
- `logs/system_events.txt`: General system events
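The "clean, tabular format" could be produced by a formatter like this. It is a sketch matching the sample lines shown later in this document; `format_inference_line` is a hypothetical name, not the actual function in `utils/text_logger.py`:

```python
from datetime import datetime

def format_inference_line(model_name, symbol, action, confidence,
                          time_ms, checkpoint_id):
    """Hypothetical sketch of the tabular inference-log line format."""
    ts = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    return (f"{ts} | {model_name:<12} | {symbol} | {action:<4} | "
            f"conf={confidence:.3f} | time={time_ms:5.1f}ms "
            f"[checkpoint: {checkpoint_id}]")

line = format_inference_line("dqn_agent", "ETH/USDT", "BUY", 0.85, 12.5,
                             "dqn_agent_20250725_143500")
```

Fixed-width fields keep the file grep-friendly and readable in any pager.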
### 4. Enhanced Checkpoint Manager (`utils/checkpoint_manager.py`)
**Purpose**: Improved checkpoint handling with metadata separation
**Features**:
- Database-backed metadata storage
- Fast metadata queries without loading models
- Eliminates "No checkpoints found" spam
- Backward compatibility with existing code
## Benefits
### 1. Performance Improvements
**Before**: Loading full checkpoint just to get metadata
```python
# Old way - loads the entire model from disk
checkpoint_path, metadata = load_best_checkpoint("dqn_agent")
loss = metadata.loss  # a full model load just to read one number
```
**After**: Fast metadata access from database
```python
# New way - database query only
metadata = db_manager.get_best_checkpoint_metadata("dqn_agent")
loss = metadata.performance_metrics['loss'] # Fast!
```
### 2. Cleaner Runtime Logs
**Before**: Scattered logs everywhere
```
2025-07-25 14:34:39,749 - utils.checkpoint_manager - INFO - No checkpoints found for dqn_agent
2025-07-25 14:34:39,754 - utils.checkpoint_manager - INFO - No checkpoints found for enhanced_cnn
2025-07-25 14:34:39,756 - utils.checkpoint_manager - INFO - No checkpoints found for extrema_trainer
```
**After**: Clean, structured logging
```
2025-07-25 14:34:39 | dqn_agent | ETH/USDT | BUY | conf=0.850 | time= 12.5ms [checkpoint: dqn_agent_20250725_143500]
2025-07-25 14:34:40 | enhanced_cnn | ETH/USDT | HOLD | conf=0.720 | time= 8.2ms [checkpoint: enhanced_cnn_20250725_143501]
```
### 3. Structured Data Storage
**Database Schema**:
```sql
-- Fast metadata queries
SELECT * FROM checkpoint_metadata WHERE model_name = 'dqn_agent' AND is_active = TRUE;
-- Performance analysis
SELECT model_name, AVG(confidence), COUNT(*)
FROM inference_records
WHERE timestamp > datetime('now', '-24 hours')
GROUP BY model_name;
```
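The performance-analysis query above can be run with nothing but the stdlib `sqlite3` module. Below is an in-memory demo; the table and column names are assumptions matching the queries above, not the actual schema:

```python
import sqlite3

# Hypothetical in-memory demo of the aggregation query; table and
# column names are assumptions, not the actual schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inference_records "
    "(model_name TEXT, confidence REAL, timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO inference_records VALUES (?, ?, ?)",
    [("dqn_agent", 0.85, "2025-07-25 14:34:39"),
     ("dqn_agent", 0.75, "2025-07-25 14:35:10"),
     ("enhanced_cnn", 0.72, "2025-07-25 14:34:40")],
)
stats = conn.execute(
    "SELECT model_name, AVG(confidence), COUNT(*) "
    "FROM inference_records GROUP BY model_name ORDER BY model_name"
).fetchall()
```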
### 4. Easy Integration
**In Model Code**:
```python
# Replace scattered logging
# OLD: logger.info(f"DQN prediction: {action} confidence={conf}")
# NEW: Centralized logging
self.orchestrator.log_model_inference(
    model_name="dqn_agent",
    symbol=symbol,
    action=action,
    confidence=confidence,
    probabilities=probs,
    input_features=features,
    processing_time_ms=processing_time
)
```
## Implementation Guide
### 1. Update Model Classes
Add inference logging to prediction methods:
```python
import time

class DQNAgent:
    def predict(self, state):
        start_time = time.time()

        # Your prediction logic here
        action = self._predict_action(state)
        confidence = self._calculate_confidence()

        processing_time = (time.time() - start_time) * 1000

        # Log the inference
        self.orchestrator.log_model_inference(
            model_name="dqn_agent",
            symbol=self.symbol,
            action=action,
            confidence=confidence,
            probabilities=self.action_probabilities,
            input_features=state,
            processing_time_ms=processing_time,
            checkpoint_id=self.current_checkpoint_id
        )
        return action
```
### 2. Update Checkpoint Saving
Use the enhanced checkpoint manager:
```python
from utils.checkpoint_manager import save_checkpoint
# Save with metadata
checkpoint_metadata = save_checkpoint(
    model=self.model,
    model_name="dqn_agent",
    model_type="rl",
    performance_metrics={"loss": 0.0234, "accuracy": 0.87},
    training_metadata={"epochs": 100, "lr": 0.001}
)
```
### 3. Fast Metadata Access
Get checkpoint info without loading models:
```python
# Fast metadata access
metadata = orchestrator.get_checkpoint_metadata_fast("dqn_agent")
if metadata:
    current_loss = metadata.performance_metrics['loss']
    checkpoint_id = metadata.checkpoint_id
```
## Migration Steps
1. **Install new dependencies** (if any)
2. **Update model classes** to use centralized logging
3. **Replace checkpoint loading** with database queries where possible
4. **Remove scattered logger.info()** calls for inferences
5. **Test with demo script**: `python demo_logging_system.py`
## File Structure
```
utils/
├── database_manager.py # SQLite database management
├── inference_logger.py # Centralized inference logging
├── text_logger.py # Human-readable text logs
└── checkpoint_manager.py # Enhanced checkpoint handling
logs/ # Text log files
├── inference_records.txt
├── checkpoint_events.txt
└── system_events.txt
data/
└── trading_system.db # SQLite database
demo_logging_system.py # Demonstration script
```
## Monitoring and Maintenance
### Daily Tasks
- Check `logs/inference_records.txt` for recent activity
- Monitor database size: `ls -lh data/trading_system.db`
### Weekly Tasks
- Run cleanup: `inference_logger.cleanup_old_logs(days_to_keep=30)`
- Check model performance trends in database
### Monthly Tasks
- Archive old log files
- Analyze model performance statistics
- Review checkpoint storage usage
## Troubleshooting
### Common Issues
1. **Database locked**: Multiple processes accessing SQLite
- Solution: Use connection timeout and proper context managers
2. **Log files growing too large**:
- Solution: Run `text_logger.cleanup_old_logs(max_lines=10000)`
3. **Missing checkpoint metadata**:
- Solution: System falls back to file-based approach automatically
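The "database locked" fix in item 1 can be sketched with the stdlib alone: a connection `timeout` makes writers wait for the lock instead of failing immediately, and context managers guarantee commit/rollback and close. `log_event` is a hypothetical helper, not the actual code in `utils/database_manager.py`:

```python
import os
import sqlite3
import tempfile
from contextlib import closing

def log_event(db_path: str, event: str) -> None:
    # timeout=30.0: concurrent writers wait up to 30s for the lock
    # instead of raising "database is locked" immediately
    with closing(sqlite3.connect(db_path, timeout=30.0)) as conn:
        with conn:  # transaction: commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS system_events (event TEXT)"
            )
            conn.execute("INSERT INTO system_events VALUES (?)", (event,))

db_path = os.path.join(tempfile.mkdtemp(), "trading_system.db")
log_event(db_path, "startup")
log_event(db_path, "checkpoint_saved")
with closing(sqlite3.connect(db_path)) as conn:
    count = conn.execute("SELECT COUNT(*) FROM system_events").fetchone()[0]
```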
### Debug Commands
```python
from utils.database_manager import get_database_manager
from utils.inference_logger import get_inference_logger
from utils.text_logger import get_text_logger

# Check database status
db_manager = get_database_manager()
checkpoints = db_manager.list_checkpoints("dqn_agent")

# Check recent inferences
inference_logger = get_inference_logger()
stats = inference_logger.get_model_stats("dqn_agent", hours=24)

# View text logs
text_logger = get_text_logger()
recent = text_logger.get_recent_inferences(lines=50)
```
This upgrade provides a solid foundation for tracking model performance, eliminating log spam, and enabling fast metadata access without the overhead of loading full model checkpoints.