
TensorBoard Monitoring Guide

Overview

The trading system now uses TensorBoard for real-time training monitoring instead of static charts. This provides dynamic, interactive visualizations that update during training.

🚨 CRITICAL: Real Market Data Only

All TensorBoard metrics are derived from training on REAL market data. No synthetic or generated data is used.

Quick Start

1. Start Training with TensorBoard

# CNN Training with TensorBoard
python main_clean.py --mode cnn --symbol ETH/USDT

# RL Training with TensorBoard  
python train_rl_with_realtime.py --episodes 10

# Quick CNN Test
python test_cnn_only.py

2. Launch TensorBoard

# Option 1: Direct command
tensorboard --logdir=runs

# Option 2: Convenience script
python run_tensorboard.py
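
If the convenience script is unavailable in your environment, TensorBoard can also be launched programmatically through the tensorboard.program API. The snippet below is a minimal sketch of what such a launcher might look like, not the repository's actual run_tensorboard.py:

# Hypothetical convenience launcher (sketch only)
import time
from tensorboard import program

tb = program.TensorBoard()
tb.configure(argv=[None, '--logdir', 'runs', '--port', '6006'])
url = tb.launch()  # starts TensorBoard in a background thread
print(f"TensorBoard running at {url}")

# Keep the process alive so the server keeps serving
while True:
    time.sleep(60)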

3. Access TensorBoard

Open your browser to: http://localhost:6006

Available Metrics

CNN Training Metrics

Training Progress

  • Training/EpochLoss - Training loss per epoch
  • Training/EpochAccuracy - Training accuracy per epoch
  • Training/BatchLoss - Batch-level loss
  • Training/BatchAccuracy - Batch-level accuracy
  • Training/BatchConfidence - Model confidence scores
  • Training/LearningRate - Learning rate schedule
  • Training/EpochTime - Time per epoch
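
These tags follow the standard SummaryWriter pattern. A minimal sketch of how epoch-level metrics like these are typically logged (the helper name and call site are illustrative, not the project's actual trainer code):

from torch.utils.tensorboard import SummaryWriter

def log_epoch_metrics(writer: SummaryWriter, epoch: int,
                      epoch_loss: float, epoch_acc: float,
                      lr: float, epoch_time: float) -> None:
    """Log one epoch of CNN training metrics under the tags listed above."""
    writer.add_scalar('Training/EpochLoss', epoch_loss, epoch)
    writer.add_scalar('Training/EpochAccuracy', epoch_acc, epoch)
    writer.add_scalar('Training/LearningRate', lr, epoch)
    writer.add_scalar('Training/EpochTime', epoch_time, epoch)

# Typical call at the end of each epoch (values come from the real training loop):
# log_epoch_metrics(writer, epoch, loss, acc, optimizer.param_groups[0]['lr'], elapsed)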

Validation Metrics

  • Validation/Loss - Validation loss
  • Validation/Accuracy - Validation accuracy
  • Validation/AvgConfidence - Average confidence on validation set
  • Validation/Class_0_Accuracy - BUY class accuracy
  • Validation/Class_1_Accuracy - SELL class accuracy
  • Validation/Class_2_Accuracy - HOLD class accuracy
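
Per-class accuracy for the BUY/SELL/HOLD classes can be derived from prediction and label tensors. A sketch of the kind of computation behind the Validation/Class_X_Accuracy tags (assumed, not taken from the trainer source):

import torch

def per_class_accuracy(predictions: torch.Tensor, labels: torch.Tensor, num_classes: int = 3):
    """Return accuracy for each class (0=BUY, 1=SELL, 2=HOLD)."""
    accuracies = []
    for cls in range(num_classes):
        mask = labels == cls
        if mask.sum() == 0:
            accuracies.append(float('nan'))  # no samples of this class in the batch
        else:
            accuracies.append((predictions[mask] == cls).float().mean().item())
    return accuracies

# Logged once per validation pass (preds, targets, writer, epoch come from the loop):
# for cls, acc in enumerate(per_class_accuracy(preds, targets)):
#     writer.add_scalar(f'Validation/Class_{cls}_Accuracy', acc, epoch)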

Best Model Tracking

  • Best/ValidationLoss - Best validation loss achieved
  • Best/ValidationAccuracy - Best validation accuracy achieved

Data Statistics

  • Data/TotalSamples - Number of training samples from real data
  • Data/Features - Number of features (detected from real data)
  • Data/Timeframes - Number of timeframes used
  • Data/WindowSize - Window size for temporal patterns
  • Data/Class_X_Count - Sample count per class
  • Data/Feature_X_Mean/Std - Feature statistics

Model Architecture

  • Model/TotalParameters - Total model parameters
  • Model/TrainableParameters - Trainable parameters
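
Parameter counts are straightforward to compute and log; a minimal sketch, assuming a standard torch.nn.Module and an existing SummaryWriter:

# Count and log model parameters (model and writer come from the training setup)
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

writer.add_scalar('Model/TotalParameters', total_params, 0)
writer.add_scalar('Model/TrainableParameters', trainable_params, 0)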

Training Configuration

  • Config/LearningRate - Learning rate used
  • Config/BatchSize - Batch size
  • Config/MaxEpochs - Maximum epochs

RL Training Metrics

Episode Performance

  • Episode/TotalReward - Total reward per episode
  • Episode/FinalBalance - Final balance after episode
  • Episode/TotalReturn - Return percentage
  • Episode/Steps - Steps taken in episode

Trading Performance

  • Trading/TotalTrades - Number of trades executed
  • Trading/WinRate - Percentage of profitable trades
  • Trading/ProfitFactor - Gross profit / gross loss ratio
  • Trading/MaxDrawdown - Maximum drawdown percentage
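
Win rate, profit factor, and max drawdown can all be computed from per-trade profits and the equity curve. The bookkeeping below is an illustrative sketch, not the project's actual environment code:

def trading_metrics(trade_pnls: list, equity_curve: list):
    """Compute win rate, profit factor and max drawdown from real trade results."""
    wins = [p for p in trade_pnls if p > 0]
    losses = [p for p in trade_pnls if p < 0]
    win_rate = len(wins) / len(trade_pnls) if trade_pnls else 0.0

    gross_profit = sum(wins)
    gross_loss = abs(sum(losses))
    profit_factor = gross_profit / gross_loss if gross_loss > 0 else float('inf')

    # Max drawdown: largest peak-to-trough decline of the equity curve
    peak, max_dd = float('-inf'), 0.0
    for equity in equity_curve:
        peak = max(peak, equity)
        max_dd = max(max_dd, (peak - equity) / peak)
    return win_rate, profit_factor, max_dd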

Agent Learning

  • Agent/Epsilon - Exploration rate (epsilon)
  • Agent/LearningRate - Agent learning rate
  • Agent/MemorySize - Experience replay buffer size
  • Agent/Loss - Training loss from experience replay

Moving Averages

  • Moving_Average/Reward_50ep - 50-episode average reward
  • Moving_Average/Return_50ep - 50-episode average return
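
The 50-episode moving averages are easy to maintain with a bounded deque; a sketch of the assumed bookkeeping:

from collections import deque

recent_rewards = deque(maxlen=50)  # keeps only the last 50 episodes

# Inside the episode loop (episode_reward, writer, episode come from the trainer):
# recent_rewards.append(episode_reward)
# writer.add_scalar('Moving_Average/Reward_50ep',
#                   sum(recent_rewards) / len(recent_rewards), episode)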

Best Performance

  • Best/Return - Best return percentage achieved

Directory Structure

runs/
├── cnn_training_1748043814/     # CNN training session
│   ├── events.out.tfevents.*    # TensorBoard event files
│   └── ...
├── rl_training_1748043920/      # RL training session  
│   ├── events.out.tfevents.*
│   └── ...
└── ...                          # Other training sessions
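
The numeric suffix on each run directory appears to be the Unix timestamp of when the session started, so every training run gets its own folder. A minimal sketch of how such a directory is typically created:

import time
from torch.utils.tensorboard import SummaryWriter

run_dir = f'runs/cnn_training_{int(time.time())}'  # e.g. runs/cnn_training_1748043814
writer = SummaryWriter(log_dir=run_dir)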

TensorBoard Features

Scalars Tab

  • Real-time line charts of all metrics
  • Smoothing controls for noisy metrics
  • Multiple run comparisons
  • Download data as CSV

Images Tab

  • Model architecture visualizations
  • Training progression images

Graphs Tab

  • Computational graph of models
  • Network architecture visualization

Histograms Tab

  • Weight and gradient distributions
  • Activation patterns over time

Projector Tab

  • High-dimensional data visualization
  • Feature embeddings

Usage Examples

1. Monitor CNN Training

# Start CNN training (generates TensorBoard logs)
python main_clean.py --mode cnn --symbol ETH/USDT

# In another terminal, start TensorBoard
tensorboard --logdir=runs

# Open browser to http://localhost:6006
# Navigate to Scalars tab to see:
# - Training/EpochLoss declining over time
# - Validation/Accuracy improving
# - Training/LearningRate schedule

2. Compare Multiple Training Runs

# Run multiple training sessions
python test_cnn_only.py  # Creates cnn_training_X
python test_cnn_only.py  # Creates cnn_training_Y

# TensorBoard automatically shows both runs
# Compare performance across runs in the same charts

3. Monitor RL Agent Training

# Start RL training with TensorBoard logging
python main_clean.py --mode rl --symbol ETH/USDT

# View in TensorBoard:
# - Episode/TotalReward trending up
# - Trading/WinRate improving
# - Agent/Epsilon decreasing (less exploration)

Real-Time Monitoring

Key Indicators to Watch

CNN Training Health

  • Training/EpochLoss should decrease over time
  • Validation/Accuracy should increase
  • ⚠️ Watch for overfitting (val loss increases while train loss decreases)
  • Training/LearningRate should follow schedule

RL Training Health

  • Episode/TotalReward trending upward
  • Trading/WinRate above 50%
  • Moving_Average/Return_50ep positive and stable
  • ⚠️ Agent/Epsilon should decay over time

Warning Signs

  • Loss not decreasing: Check learning rate, data quality
  • Accuracy plateauing: May need more data or different architecture
  • RL rewards oscillating: Unstable learning, adjust hyperparameters
  • Win rate dropping: Strategy not working, need different approach

Configuration

Custom TensorBoard Setup

from torch.utils.tensorboard import SummaryWriter

# Custom log directory
writer = SummaryWriter(log_dir='runs/my_experiment')

# Log custom metrics (value, weights and step come from your training loop)
writer.add_scalar('Custom/Metric', value, step)
writer.add_histogram('Custom/Weights', weights, step)

# Flush and close the writer when training finishes
writer.close()

Advanced Features

# Start TensorBoard with custom port
tensorboard --logdir=runs --port=6007

# Enable debugging
tensorboard --logdir=runs --debugger_port=6064

# Disable fast data loading (slower, but more compatible if runs fail to appear)
tensorboard --logdir=runs --load_fast=false

Integration with Training

CNN Trainer Integration

  • Automatically logs all training metrics
  • Model architecture visualization
  • Real data statistics tracking
  • Best model checkpointing based on TensorBoard metrics
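
Checkpointing on the same metric that TensorBoard tracks is a common pattern; a hedged sketch of what best-model checkpointing based on validation loss typically looks like (paths and names are illustrative, not the project's actual files):

import torch

best_val_loss = float('inf')

# Inside the validation loop (val_loss, model, writer, epoch come from the trainer):
# writer.add_scalar('Validation/Loss', val_loss, epoch)
# if val_loss < best_val_loss:
#     best_val_loss = val_loss
#     writer.add_scalar('Best/ValidationLoss', best_val_loss, epoch)
#     torch.save(model.state_dict(), 'models/cnn_best.pt')  # illustrative path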

RL Trainer Integration

  • Episode-by-episode performance tracking
  • Trading strategy effectiveness monitoring
  • Agent learning progress visualization
  • Hyperparameter optimization guidance

Benefits Over Static Charts

Real-Time Updates

  • See training progress as it happens
  • No need to wait for training completion
  • Immediate feedback on hyperparameter changes

Interactive Exploration

  • Zoom, pan, and explore metrics
  • Smooth noisy data with built-in controls
  • Compare multiple training runs side-by-side

Rich Visualizations

  • Scalars, histograms, images, and graphs
  • Model architecture visualization
  • High-dimensional data projections

Data Export

  • Download metrics as CSV
  • Programmatic access to training data
  • Integration with external analysis tools

Troubleshooting

TensorBoard Not Starting

# Check if TensorBoard is installed
pip install tensorboard

# Verify runs directory exists
dir runs  # Windows
ls runs   # Linux/Mac

# Kill existing TensorBoard processes
taskkill /F /IM tensorboard.exe  # Windows
pkill -f tensorboard             # Linux/Mac

No Data Showing

  • Ensure training is generating logs in the runs/ directory
  • Check browser console for errors
  • Try refreshing the page
  • Verify correct port (default 6006)

Performance Issues

  • Use --load_fast=true for faster loading
  • Clear old log directories
  • Reduce logging frequency in training code
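
Reducing logging frequency usually means emitting batch-level scalars only every N steps rather than every step; a small sketch of the assumed pattern:

LOG_EVERY_N_STEPS = 50  # higher values mean smaller event files and faster loading

# Inside the batch loop (global_step, batch_loss, writer come from the trainer):
# if global_step % LOG_EVERY_N_STEPS == 0:
#     writer.add_scalar('Training/BatchLoss', batch_loss, global_step)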

Best Practices

🎯 Regular Monitoring

  • Check TensorBoard every 10-20 epochs during CNN training
  • Monitor RL agents every 50-100 episodes
  • Look for concerning trends early

📊 Metric Organization

  • Use clear naming conventions (Training/, Validation/, etc.)
  • Group related metrics together
  • Log at appropriate frequencies (not every step)

💾 Data Management

  • Archive old training runs periodically
  • Keep successful run logs for reference
  • Document experiment parameters in run names

🔍 Hyperparameter Tuning

  • Compare multiple runs with different hyperparameters
  • Use TensorBoard data to guide optimization
  • Track which settings produce best results
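
TensorBoard's HParams support (writer.add_hparams) is a convenient way to record which settings produced the best results; a minimal sketch, assuming learning rate and batch size are the parameters being compared and the final metric values are taken from a finished run:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='runs/hparam_example')
writer.add_hparams(
    {'lr': 1e-3, 'batch_size': 32},       # hyperparameters for this run
    {'hparam/best_val_accuracy': 0.61},   # final metrics for this run (illustrative)
)
writer.close()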

Summary

TensorBoard integration provides real-time, interactive monitoring of training progress using only real market data. This replaces static plots with dynamic visualizations that help optimize model performance and catch issues early.

Key Commands:

# Train with TensorBoard logging
python main_clean.py --mode cnn --symbol ETH/USDT

# Start TensorBoard
python run_tensorboard.py

# Access dashboard
http://localhost:6006

All metrics are derived from real cryptocurrency market data to ensure authentic trading model development.