sqlite for checkpoints, cleanup
docs/logging_system_upgrade.md (new file, 280 lines)
# Trading System Logging Upgrade

## Overview

This upgrade implements a comprehensive logging and metadata management system that addresses four key issues:

1. **Eliminates scattered "No checkpoints found" logs** during runtime
2. **Provides fast checkpoint metadata access** without loading full models
3. **Centralizes inference logging** with database and text file storage
4. **Adds structured tracking** of model performance and checkpoints

## Key Components
### 1. Database Manager (`utils/database_manager.py`)

**Purpose**: SQLite-based storage for structured data

**Features**:
- Inference record logging with deduplication
- Checkpoint metadata storage (separate from model weights)
- Model performance tracking
- Fast queries without loading model files

**Tables**:
- `inference_records`: All model predictions with metadata
- `checkpoint_metadata`: Checkpoint info without model weights
- `model_performance`: Daily aggregated statistics
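The three tables could be created along the following lines. This is an illustrative sketch only: the column sets are assumptions, not the actual schema in `utils/database_manager.py`.

```python
import sqlite3

# Hypothetical schema; column names are illustrative and may differ
# from the real utils/database_manager.py implementation.
SCHEMA = """
CREATE TABLE IF NOT EXISTS inference_records (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT NOT NULL,
    model_name TEXT NOT NULL,
    symbol TEXT,
    action TEXT,
    confidence REAL,
    feature_hash TEXT,          -- supports deduplication
    processing_time_ms REAL,
    checkpoint_id TEXT
);
CREATE TABLE IF NOT EXISTS checkpoint_metadata (
    checkpoint_id TEXT PRIMARY KEY,
    model_name TEXT NOT NULL,
    file_path TEXT,
    performance_metrics TEXT,   -- JSON blob, no model weights
    is_active INTEGER DEFAULT 1
);
CREATE TABLE IF NOT EXISTS model_performance (
    model_name TEXT NOT NULL,
    date TEXT NOT NULL,
    inference_count INTEGER,
    avg_confidence REAL,
    PRIMARY KEY (model_name, date)
);
"""

def init_db(path="data/trading_system.db"):
    """Open (or create) the database and ensure all tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```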
### 2. Inference Logger (`utils/inference_logger.py`)

**Purpose**: Centralized logging for all model inferences

**Features**:
- Single function call replaces scattered `logger.info()` calls
- Automatic feature hashing for deduplication
- Memory usage tracking
- Processing time measurement
- Dual storage (database + text files)
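Feature hashing for deduplication could work along these lines; this is a sketch, not the actual `inference_logger.py` implementation, and the helper name `hash_features` is hypothetical.

```python
import hashlib
import json

def hash_features(features) -> str:
    """Return a short, stable digest of an input-feature vector.

    Two inferences over identical features produce the same hash,
    so duplicate records can be skipped or collapsed in the database.
    """
    # Round to tolerate tiny float noise, then serialize deterministically.
    canonical = json.dumps([round(float(x), 6) for x in features])
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```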
**Usage**:
```python
from utils.inference_logger import log_model_inference

log_model_inference(
    model_name="dqn_agent",
    symbol="ETH/USDT",
    action="BUY",
    confidence=0.85,
    probabilities={"BUY": 0.85, "SELL": 0.10, "HOLD": 0.05},
    input_features=features_array,
    processing_time_ms=12.5,
    checkpoint_id="dqn_agent_20250725_143500"
)
```
### 3. Text Logger (`utils/text_logger.py`)

**Purpose**: Human-readable log files for tracking

**Features**:
- Separate files for different event types
- Clean, tabular format
- Automatic cleanup of old entries
- Easy to read and grep

**Files**:
- `logs/inference_records.txt`: All model predictions
- `logs/checkpoint_events.txt`: Save/load events
- `logs/system_events.txt`: General system events
### 4. Enhanced Checkpoint Manager (`utils/checkpoint_manager.py`)

**Purpose**: Improved checkpoint handling with metadata separation

**Features**:
- Database-backed metadata storage
- Fast metadata queries without loading models
- Eliminates "No checkpoints found" spam
- Backward compatibility with existing code
## Benefits

### 1. Performance Improvements

**Before**: Loading the full checkpoint just to get its metadata
```python
# Old way - loads the entire model!
checkpoint_path, metadata = load_best_checkpoint("dqn_agent")
loss = metadata.loss  # Expensive operation
```

**After**: Fast metadata access from the database
```python
# New way - database query only
metadata = db_manager.get_best_checkpoint_metadata("dqn_agent")
loss = metadata.performance_metrics['loss']  # Fast!
```
### 2. Cleaner Runtime Logs

**Before**: Scattered logs everywhere
```
2025-07-25 14:34:39,749 - utils.checkpoint_manager - INFO - No checkpoints found for dqn_agent
2025-07-25 14:34:39,754 - utils.checkpoint_manager - INFO - No checkpoints found for enhanced_cnn
2025-07-25 14:34:39,756 - utils.checkpoint_manager - INFO - No checkpoints found for extrema_trainer
```

**After**: Clean, structured logging
```
2025-07-25 14:34:39 | dqn_agent | ETH/USDT | BUY | conf=0.850 | time= 12.5ms [checkpoint: dqn_agent_20250725_143500]
2025-07-25 14:34:40 | enhanced_cnn | ETH/USDT | HOLD | conf=0.720 | time=  8.2ms [checkpoint: enhanced_cnn_20250725_143501]
```
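A line in the format above can be produced with a single f-string; the function below is an illustrative sketch, not the project's actual formatter.

```python
from datetime import datetime

def format_inference_line(model_name, symbol, action, confidence,
                          time_ms, checkpoint_id, ts=None):
    """Render one inference as a single pipe-delimited log line."""
    ts = ts or datetime.now()
    # {time_ms:5.1f} right-aligns the duration so columns line up when grepping.
    return (f"{ts:%Y-%m-%d %H:%M:%S} | {model_name} | {symbol} | {action} | "
            f"conf={confidence:.3f} | time={time_ms:5.1f}ms "
            f"[checkpoint: {checkpoint_id}]")
```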
### 3. Structured Data Storage

**Example Queries**:
```sql
-- Fast metadata queries
SELECT * FROM checkpoint_metadata WHERE model_name = 'dqn_agent' AND is_active = TRUE;

-- Performance analysis
SELECT model_name, AVG(confidence), COUNT(*)
FROM inference_records
WHERE timestamp > datetime('now', '-24 hours')
GROUP BY model_name;
```
### 4. Easy Integration

**In Model Code**:
```python
# Replace scattered logging
# OLD: logger.info(f"DQN prediction: {action} confidence={conf}")

# NEW: Centralized logging
self.orchestrator.log_model_inference(
    model_name="dqn_agent",
    symbol=symbol,
    action=action,
    confidence=confidence,
    probabilities=probs,
    input_features=features,
    processing_time_ms=processing_time
)
```
## Implementation Guide

### 1. Update Model Classes

Add inference logging to prediction methods:

```python
import time

class DQNAgent:
    def predict(self, state):
        start_time = time.time()

        # Your prediction logic here
        action = self._predict_action(state)
        confidence = self._calculate_confidence()

        processing_time = (time.time() - start_time) * 1000

        # Log the inference
        self.orchestrator.log_model_inference(
            model_name="dqn_agent",
            symbol=self.symbol,
            action=action,
            confidence=confidence,
            probabilities=self.action_probabilities,
            input_features=state,
            processing_time_ms=processing_time,
            checkpoint_id=self.current_checkpoint_id
        )

        return action
```
### 2. Update Checkpoint Saving

Use the enhanced checkpoint manager:

```python
from utils.checkpoint_manager import save_checkpoint

# Save with metadata
checkpoint_metadata = save_checkpoint(
    model=self.model,
    model_name="dqn_agent",
    model_type="rl",
    performance_metrics={"loss": 0.0234, "accuracy": 0.87},
    training_metadata={"epochs": 100, "lr": 0.001}
)
```
### 3. Fast Metadata Access

Get checkpoint info without loading models:

```python
# Fast metadata access
metadata = orchestrator.get_checkpoint_metadata_fast("dqn_agent")
if metadata:
    current_loss = metadata.performance_metrics['loss']
    checkpoint_id = metadata.checkpoint_id
```
## Migration Steps

1. **Install new dependencies** (if any)
2. **Update model classes** to use centralized logging
3. **Replace checkpoint loading** with database queries where possible
4. **Remove scattered `logger.info()` calls** for inferences
5. **Test with the demo script**: `python demo_logging_system.py`
## File Structure

```
utils/
├── database_manager.py     # SQLite database management
├── inference_logger.py     # Centralized inference logging
├── text_logger.py          # Human-readable text logs
└── checkpoint_manager.py   # Enhanced checkpoint handling

logs/                       # Text log files
├── inference_records.txt
├── checkpoint_events.txt
└── system_events.txt

data/
└── trading_system.db       # SQLite database

demo_logging_system.py      # Demonstration script
```
## Monitoring and Maintenance

### Daily Tasks
- Check `logs/inference_records.txt` for recent activity
- Monitor database size: `ls -lh data/trading_system.db`

### Weekly Tasks
- Run cleanup: `inference_logger.cleanup_old_logs(days_to_keep=30)`
- Check model performance trends in the database

### Monthly Tasks
- Archive old log files
- Analyze model performance statistics
- Review checkpoint storage usage
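The weekly text-log cleanup could be implemented as a simple trim of each file to its most recent lines. A minimal sketch, assuming line-oriented log files; the real `cleanup_old_logs` may filter by timestamp instead, and the helper name `trim_log` is hypothetical:

```python
from pathlib import Path

def trim_log(path, max_lines=10000):
    """Keep only the last max_lines lines of a text log file.

    Returns the number of lines remaining after the trim.
    """
    p = Path(path)
    if not p.exists():
        return 0
    lines = p.read_text().splitlines(keepends=True)
    if len(lines) <= max_lines:
        return len(lines)
    p.write_text("".join(lines[-max_lines:]))
    return max_lines
```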
## Troubleshooting

### Common Issues

1. **Database locked**: Multiple processes accessing SQLite
   - Solution: Use a connection timeout and proper context managers

2. **Log files growing too large**:
   - Solution: Run `text_logger.cleanup_old_logs(max_lines=10000)`

3. **Missing checkpoint metadata**:
   - Solution: The system falls back to the file-based approach automatically
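For the "database locked" case, the standard-library `sqlite3` module already supports both remedies; a minimal sketch (the helper name `run_query` is illustrative):

```python
import sqlite3
from contextlib import closing

def run_query(db_path, sql, params=()):
    """Execute a query with a busy timeout and automatic cleanup.

    timeout=30 makes SQLite wait up to 30 seconds for another writer's
    lock to clear instead of raising "database is locked" immediately;
    the context managers guarantee the connection is always released.
    """
    with closing(sqlite3.connect(db_path, timeout=30)) as conn:
        with conn:  # commits on success, rolls back on error
            return conn.execute(sql, params).fetchall()
```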
### Debug Commands

```python
# Check database status
db_manager = get_database_manager()
checkpoints = db_manager.list_checkpoints("dqn_agent")

# Check recent inferences
inference_logger = get_inference_logger()
stats = inference_logger.get_model_stats("dqn_agent", hours=24)

# View text logs
text_logger = get_text_logger()
recent = text_logger.get_recent_inferences(lines=50)
```

This upgrade provides a solid foundation for tracking model performance, eliminating log spam, and enabling fast metadata access without the overhead of loading full model checkpoints.