gogo2/docs/logging_system_upgrade.md (2025-07-25)

Trading System Logging Upgrade

Overview

This upgrade implements a comprehensive logging and metadata management system that addresses the key issues:

  1. Eliminates scattered "No checkpoints found" logs during runtime
  2. Fast checkpoint metadata access without loading full models
  3. Centralized inference logging with database and text file storage
  4. Structured tracking of model performance and checkpoints

Key Components

1. Database Manager (utils/database_manager.py)

Purpose: SQLite-based storage for structured data

Features:

  • Inference records logging with deduplication
  • Checkpoint metadata storage (separate from model weights)
  • Model performance tracking
  • Fast queries without loading model files

Tables:

  • inference_records: All model predictions with metadata
  • checkpoint_metadata: Checkpoint info without model weights
  • model_performance: Daily aggregated statistics
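
To make the three tables concrete, here is a minimal sketch of what the schema might look like. The column names are assumptions for illustration; the real schema lives in utils/database_manager.py and may differ:

```python
import sqlite3

# Illustrative schema only; actual columns in utils/database_manager.py may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inference_records (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    model_name TEXT NOT NULL,
    symbol TEXT NOT NULL,
    action TEXT NOT NULL,
    confidence REAL,
    processing_time_ms REAL,
    feature_hash TEXT,                 -- deduplication key
    checkpoint_id TEXT,
    timestamp TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE checkpoint_metadata (
    checkpoint_id TEXT PRIMARY KEY,
    model_name TEXT NOT NULL,
    performance_metrics TEXT,          -- JSON blob, e.g. {"loss": 0.0234}
    is_active INTEGER DEFAULT 1,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE model_performance (
    model_name TEXT,
    day TEXT,
    avg_confidence REAL,
    inference_count INTEGER,
    PRIMARY KEY (model_name, day)
);
""")
tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%'"))
print(tables)
```

Keeping metadata in its own table is what allows the "fast queries without loading model files" benefit described above.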

2. Inference Logger (utils/inference_logger.py)

Purpose: Centralized logging for all model inferences

Features:

  • Single function call replaces scattered logger.info() calls
  • Automatic feature hashing for deduplication
  • Memory usage tracking
  • Processing time measurement
  • Dual storage (database + text files)

Usage:

from utils.inference_logger import log_model_inference

log_model_inference(
    model_name="dqn_agent",
    symbol="ETH/USDT", 
    action="BUY",
    confidence=0.85,
    probabilities={"BUY": 0.85, "SELL": 0.10, "HOLD": 0.05},
    input_features=features_array,
    processing_time_ms=12.5,
    checkpoint_id="dqn_agent_20250725_143500"
)
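
Hashing the input features gives the logger a cheap deduplication key: two calls with identical features produce the same hash and can be collapsed into one record. A sketch of the idea (the real implementation in utils/inference_logger.py may hash differently):

```python
import hashlib
import struct

def hash_features(features):
    """Digest a feature vector into a short hex key for deduplication."""
    packed = struct.pack(f"{len(features)}d", *features)  # stable binary encoding
    return hashlib.sha256(packed).hexdigest()[:16]

a = hash_features([0.1, 0.2, 0.3])
b = hash_features([0.1, 0.2, 0.3])
c = hash_features([0.1, 0.2, 0.4])
print(a == b, a == c)  # identical inputs match, different inputs do not
```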

3. Text Logger (utils/text_logger.py)

Purpose: Human-readable log files for tracking

Features:

  • Separate files for different event types
  • Clean, tabular format
  • Automatic cleanup of old entries
  • Easy to read and grep

Files:

  • logs/inference_records.txt: All model predictions
  • logs/checkpoint_events.txt: Save/load events
  • logs/system_events.txt: General system events
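
The "clean, tabular format" can be produced with fixed-width fields, which keeps the files easy to scan and grep. A sketch of a formatter matching the inference-record layout shown later in this document (field widths are assumptions):

```python
from datetime import datetime

def format_inference_line(model, symbol, action, conf, ms, checkpoint=None):
    """Render one fixed-width, grep-friendly log line (layout is illustrative)."""
    ts = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    line = (f"{ts} | {model:<15} | {symbol:<10} | {action:<4} "
            f"| conf={conf:.3f} | time={ms:6.1f}ms")
    if checkpoint:
        line += f" [checkpoint: {checkpoint}]"
    return line

line = format_inference_line("dqn_agent", "ETH/USDT", "BUY", 0.85, 12.5,
                             "dqn_agent_20250725_143500")
print(line)
```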

4. Enhanced Checkpoint Manager (utils/checkpoint_manager.py)

Purpose: Improved checkpoint handling with metadata separation

Features:

  • Database-backed metadata storage
  • Fast metadata queries without loading models
  • Eliminates "No checkpoints found" spam
  • Backward compatibility with existing code

Benefits

1. Performance Improvements

Before: Loading full checkpoint just to get metadata

# Old way - loads the entire model from disk just to read its metadata
checkpoint_path, metadata = load_best_checkpoint("dqn_agent")
loss = metadata.loss

After: Fast metadata access from database

# New way - database query only
metadata = db_manager.get_best_checkpoint_metadata("dqn_agent")
loss = metadata.performance_metrics['loss']  # Fast!

2. Cleaner Runtime Logs

Before: Scattered logs everywhere

2025-07-25 14:34:39,749 - utils.checkpoint_manager - INFO - No checkpoints found for dqn_agent
2025-07-25 14:34:39,754 - utils.checkpoint_manager - INFO - No checkpoints found for enhanced_cnn
2025-07-25 14:34:39,756 - utils.checkpoint_manager - INFO - No checkpoints found for extrema_trainer

After: Clean, structured logging

2025-07-25 14:34:39 | dqn_agent       | ETH/USDT   | BUY  | conf=0.850 | time=  12.5ms [checkpoint: dqn_agent_20250725_143500]
2025-07-25 14:34:40 | enhanced_cnn    | ETH/USDT   | HOLD | conf=0.720 | time=   8.2ms [checkpoint: enhanced_cnn_20250725_143501]

3. Structured Data Storage

Database Schema:

-- Fast metadata queries
SELECT * FROM checkpoint_metadata WHERE model_name = 'dqn_agent' AND is_active = TRUE;

-- Performance analysis
SELECT model_name, AVG(confidence), COUNT(*) 
FROM inference_records 
WHERE timestamp > datetime('now', '-24 hours')
GROUP BY model_name;
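
The performance-analysis query above can be exercised end to end with Python's built-in sqlite3 module; this self-contained sketch uses a simplified single-table schema just to demonstrate the aggregation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE inference_records (
    model_name TEXT, confidence REAL,
    timestamp TEXT DEFAULT CURRENT_TIMESTAMP)""")
rows = [("dqn_agent", 0.85), ("dqn_agent", 0.75), ("enhanced_cnn", 0.72)]
conn.executemany(
    "INSERT INTO inference_records (model_name, confidence) VALUES (?, ?)", rows)

# Same shape as the performance-analysis query above.
stats = conn.execute("""
    SELECT model_name, AVG(confidence), COUNT(*)
    FROM inference_records
    WHERE timestamp > datetime('now', '-24 hours')
    GROUP BY model_name
    ORDER BY model_name
""").fetchall()
print(stats)
```

Note that CURRENT_TIMESTAMP and datetime('now') both use UTC, so the 24-hour window is consistent.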

4. Easy Integration

In Model Code:

# Replace scattered logging
# OLD: logger.info(f"DQN prediction: {action} confidence={conf}")

# NEW: Centralized logging
self.orchestrator.log_model_inference(
    model_name="dqn_agent",
    symbol=symbol,
    action=action,
    confidence=confidence,
    probabilities=probs,
    input_features=features,
    processing_time_ms=processing_time
)

Implementation Guide

1. Update Model Classes

Add inference logging to prediction methods:

import time

class DQNAgent:
    def predict(self, state):
        start_time = time.time()
        
        # Your prediction logic here
        action = self._predict_action(state)
        confidence = self._calculate_confidence()
        
        processing_time = (time.time() - start_time) * 1000
        
        # Log the inference
        self.orchestrator.log_model_inference(
            model_name="dqn_agent",
            symbol=self.symbol,
            action=action,
            confidence=confidence,
            probabilities=self.action_probabilities,
            input_features=state,
            processing_time_ms=processing_time,
            checkpoint_id=self.current_checkpoint_id
        )
        
        return action

2. Update Checkpoint Saving

Use the enhanced checkpoint manager:

from utils.checkpoint_manager import save_checkpoint

# Save with metadata
checkpoint_metadata = save_checkpoint(
    model=self.model,
    model_name="dqn_agent",
    model_type="rl",
    performance_metrics={"loss": 0.0234, "accuracy": 0.87},
    training_metadata={"epochs": 100, "lr": 0.001}
)
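
The core of the metadata-separation idea is that the weights file is written once (e.g. with torch.save) while the metrics go into a small database row, so later reads never touch the model file. A minimal sketch, with a hypothetical helper name and simplified schema:

```python
import json
import sqlite3
from datetime import datetime

def save_checkpoint_metadata(conn, model_name, performance_metrics):
    """Record checkpoint metadata in SQLite; the weights themselves would be
    saved separately, so metadata queries never load the model file."""
    checkpoint_id = f"{model_name}_{datetime.now():%Y%m%d_%H%M%S}"
    conn.execute(
        "INSERT INTO checkpoint_metadata (checkpoint_id, model_name, performance_metrics) "
        "VALUES (?, ?, ?)",
        (checkpoint_id, model_name, json.dumps(performance_metrics)))
    return checkpoint_id

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE checkpoint_metadata (
    checkpoint_id TEXT PRIMARY KEY, model_name TEXT, performance_metrics TEXT)""")
cid = save_checkpoint_metadata(conn, "dqn_agent", {"loss": 0.0234, "accuracy": 0.87})

# Fast read-back: a single-row query, no model loading.
row = conn.execute(
    "SELECT performance_metrics FROM checkpoint_metadata WHERE checkpoint_id = ?",
    (cid,)).fetchone()
print(json.loads(row[0])["loss"])
```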

3. Fast Metadata Access

Get checkpoint info without loading models:

# Fast metadata access
metadata = orchestrator.get_checkpoint_metadata_fast("dqn_agent")
if metadata:
    current_loss = metadata.performance_metrics['loss']
    checkpoint_id = metadata.checkpoint_id

Migration Steps

  1. Install new dependencies (if any)
  2. Update model classes to use centralized logging
  3. Replace checkpoint loading with database queries where possible
  4. Remove scattered logger.info() calls for inferences
  5. Test with demo script: python demo_logging_system.py

File Structure

utils/
├── database_manager.py       # SQLite database management
├── inference_logger.py       # Centralized inference logging
├── text_logger.py            # Human-readable text logs
└── checkpoint_manager.py     # Enhanced checkpoint handling

logs/                         # Text log files
├── inference_records.txt
├── checkpoint_events.txt
└── system_events.txt

data/
└── trading_system.db         # SQLite database

demo_logging_system.py        # Demonstration script

Monitoring and Maintenance

Daily Tasks

  • Check logs/inference_records.txt for recent activity
  • Monitor database size: ls -lh data/trading_system.db

Weekly Tasks

  • Run cleanup: inference_logger.cleanup_old_logs(days_to_keep=30)
  • Check model performance trends in database
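
The weekly text-log cleanup amounts to keeping only the newest lines of each file. A self-contained sketch of that operation (the actual cleanup_old_logs implementation may differ):

```python
import os
import tempfile

def trim_log(path, max_lines):
    """Keep only the newest max_lines lines of a text log file."""
    with open(path) as f:
        lines = f.readlines()
    if len(lines) > max_lines:
        with open(path, "w") as f:
            f.writelines(lines[-max_lines:])

# Demo on a throwaway file with 100 lines.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w") as f:
    f.write("\n".join(f"line {i}" for i in range(100)) + "\n")
trim_log(path, max_lines=10)
with open(path) as f:
    kept = f.read().splitlines()
print(len(kept), kept[0])
os.remove(path)
```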

Monthly Tasks

  • Archive old log files
  • Analyze model performance statistics
  • Review checkpoint storage usage

Troubleshooting

Common Issues

  1. Database locked: multiple processes accessing SQLite
    • Solution: use a connection timeout and proper context managers
  2. Log files growing too large
    • Solution: run text_logger.cleanup_old_logs(max_lines=10000)
  3. Missing checkpoint metadata
    • Solution: the system falls back to the file-based approach automatically
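
For the "database locked" case, the built-in sqlite3 module already provides both remedies: the timeout argument to connect makes a writer wait for the lock instead of failing immediately, and using the connection as a context manager scopes each write to a transaction. A sketch:

```python
import os
import sqlite3
import tempfile

def insert_with_timeout(db_path, model_name, confidence):
    """Open with a lock timeout and use the connection as a transaction manager."""
    conn = sqlite3.connect(db_path, timeout=30.0)  # wait up to 30s if locked
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO inference_records (model_name, confidence) VALUES (?, ?)",
                (model_name, confidence))
    finally:
        conn.close()

# Demo against a throwaway database file.
db_path = os.path.join(tempfile.mkdtemp(), "trading_system.db")
with sqlite3.connect(db_path) as conn:
    conn.execute("CREATE TABLE inference_records (model_name TEXT, confidence REAL)")
insert_with_timeout(db_path, "dqn_agent", 0.85)
with sqlite3.connect(db_path) as conn:
    count = conn.execute("SELECT COUNT(*) FROM inference_records").fetchone()[0]
print(count)
```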

Debug Commands

# Check database status
# (getter import paths assumed from the module layout described above)
from utils.database_manager import get_database_manager
db_manager = get_database_manager()
checkpoints = db_manager.list_checkpoints("dqn_agent")

# Check recent inferences
from utils.inference_logger import get_inference_logger
inference_logger = get_inference_logger()
stats = inference_logger.get_model_stats("dqn_agent", hours=24)

# View text logs
from utils.text_logger import get_text_logger
text_logger = get_text_logger()
recent = text_logger.get_recent_inferences(lines=50)

This upgrade provides a solid foundation for tracking model performance, eliminating log spam, and enabling fast metadata access without the overhead of loading full model checkpoints.