# TensorBoard Monitoring Guide

## Overview

The trading system now uses **TensorBoard** for real-time training monitoring instead of static charts. This provides dynamic, interactive visualizations that update during training.

## 🚨 CRITICAL: Real Market Data Only

All TensorBoard metrics are derived from **REAL market data training**. No synthetic or generated data is used.

## Quick Start

### 1. Start Training with TensorBoard
```bash
# CNN Training with TensorBoard
python main_clean.py --mode cnn --symbol ETH/USDT

# RL Training with TensorBoard
python train_rl_with_realtime.py --episodes 10

# Quick CNN Test
python test_cnn_only.py
```

### 2. Launch TensorBoard
```bash
# Option 1: Direct command
tensorboard --logdir=runs

# Option 2: Convenience script
python run_tensorboard.py
```

### 3. Access TensorBoard
Open your browser to: **http://localhost:6006**

## Available Metrics

### CNN Training Metrics

#### **Training Progress**
- `Training/EpochLoss` - Training loss per epoch
- `Training/EpochAccuracy` - Training accuracy per epoch
- `Training/BatchLoss` - Batch-level loss
- `Training/BatchAccuracy` - Batch-level accuracy
- `Training/BatchConfidence` - Model confidence scores
- `Training/LearningRate` - Learning rate schedule
- `Training/EpochTime` - Time per epoch

#### **Validation Metrics**
- `Validation/Loss` - Validation loss
- `Validation/Accuracy` - Validation accuracy
- `Validation/AvgConfidence` - Average confidence on validation set
- `Validation/Class_0_Accuracy` - BUY class accuracy
- `Validation/Class_1_Accuracy` - SELL class accuracy
- `Validation/Class_2_Accuracy` - HOLD class accuracy
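
As an illustration of how the per-class tags above could be computed, here is a minimal sketch. The class mapping (0 = BUY, 1 = SELL, 2 = HOLD) comes from the list above; the function name `per_class_accuracy` and the sample labels are hypothetical and not taken from the trainer code.

```python
from collections import Counter

CLASS_NAMES = {0: "BUY", 1: "SELL", 2: "HOLD"}  # mapping taken from the tags above


def per_class_accuracy(labels, predictions, num_classes=3):
    """Return {class_index: accuracy}; classes with no samples are omitted."""
    totals = Counter(labels)
    correct = Counter(y for y, p in zip(labels, predictions) if y == p)
    return {c: correct[c] / totals[c] for c in range(num_classes) if totals[c] > 0}


# Hypothetical validation labels and model predictions
labels = [0, 0, 1, 1, 2, 2, 2, 2]
predictions = [0, 1, 1, 1, 2, 2, 0, 2]

for class_index, accuracy in per_class_accuracy(labels, predictions).items():
    tag = f"Validation/Class_{class_index}_Accuracy"
    print(f"{tag} ({CLASS_NAMES[class_index]}): {accuracy:.2f}")
```

In a real training loop, each value would be passed to `writer.add_scalar(tag, accuracy, epoch)` instead of being printed.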

#### **Best Model Tracking**
- `Best/ValidationLoss` - Best validation loss achieved
- `Best/ValidationAccuracy` - Best validation accuracy achieved
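
The two `Best/*` tags are running extremes of the validation metrics. A minimal sketch of that bookkeeping follows; the `BestTracker` class and the sample history are hypothetical, and the actual trainer may track these values differently.

```python
class BestTracker:
    """Keeps the lowest validation loss and the highest validation accuracy seen so far."""

    def __init__(self):
        self.best_loss = float("inf")
        self.best_accuracy = float("-inf")

    def update(self, val_loss, val_accuracy):
        """Record new metrics; return True if either one improved."""
        improved = False
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            improved = True
        if val_accuracy > self.best_accuracy:
            self.best_accuracy = val_accuracy
            improved = True
        return improved


tracker = BestTracker()
history = [(0.90, 0.52), (0.75, 0.58), (0.80, 0.57)]  # hypothetical (loss, accuracy) per epoch
for epoch, (val_loss, val_accuracy) in enumerate(history):
    if tracker.update(val_loss, val_accuracy):
        # With a real writer:
        # writer.add_scalar("Best/ValidationLoss", tracker.best_loss, epoch)
        # writer.add_scalar("Best/ValidationAccuracy", tracker.best_accuracy, epoch)
        print(f"epoch {epoch}: best loss {tracker.best_loss}, best accuracy {tracker.best_accuracy}")
```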

#### **Data Statistics**
- `Data/TotalSamples` - Number of training samples from real data
- `Data/Features` - Number of features (detected from real data)
- `Data/Timeframes` - Number of timeframes used
- `Data/WindowSize` - Window size for temporal patterns
- `Data/Class_X_Count` - Sample count per class (`X` is the class index)
- `Data/Feature_X_Mean/Std` - Mean and standard deviation of feature `X`

#### **Model Architecture**
- `Model/TotalParameters` - Total model parameters
- `Model/TrainableParameters` - Trainable parameters

#### **Training Configuration**
- `Config/LearningRate` - Learning rate used
- `Config/BatchSize` - Batch size
- `Config/MaxEpochs` - Maximum epochs

### RL Training Metrics

#### **Episode Performance**
- `Episode/TotalReward` - Total reward per episode
- `Episode/FinalBalance` - Final balance after episode
- `Episode/TotalReturn` - Return percentage
- `Episode/Steps` - Steps taken in episode

#### **Trading Performance**
- `Trading/TotalTrades` - Number of trades executed
- `Trading/WinRate` - Percentage of profitable trades
- `Trading/ProfitFactor` - Gross profit / gross loss ratio
- `Trading/MaxDrawdown` - Maximum drawdown percentage
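
For reference, the trading metrics above can be computed from a list of per-trade profits and a balance history. This is a sketch under assumptions: the function names, the convention that a zero-loss record yields an infinite profit factor, and the sample numbers are all hypothetical rather than taken from the RL trainer.

```python
def win_rate(trade_pnls):
    """Percentage of trades that closed with a positive profit."""
    if not trade_pnls:
        return 0.0
    wins = sum(1 for pnl in trade_pnls if pnl > 0)
    return 100.0 * wins / len(trade_pnls)


def profit_factor(trade_pnls):
    """Gross profit divided by gross loss (infinite when there are no losing trades)."""
    gross_profit = sum(pnl for pnl in trade_pnls if pnl > 0)
    gross_loss = -sum(pnl for pnl in trade_pnls if pnl < 0)
    if gross_loss == 0:
        return float("inf") if gross_profit > 0 else 0.0
    return gross_profit / gross_loss


def max_drawdown(balances):
    """Largest peak-to-trough decline, as a percentage of the peak (balances must be positive)."""
    if not balances:
        return 0.0
    peak = balances[0]
    worst = 0.0
    for balance in balances:
        peak = max(peak, balance)
        worst = max(worst, (peak - balance) / peak * 100.0)
    return worst


# Hypothetical episode: four trades and the resulting balance history
pnls = [50.0, -20.0, 30.0, -10.0]
balances = [1000.0, 1050.0, 1030.0, 1060.0, 1050.0]
print(win_rate(pnls), profit_factor(pnls), max_drawdown(balances))
```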

#### **Agent Learning**
- `Agent/Epsilon` - Exploration rate (epsilon)
- `Agent/LearningRate` - Agent learning rate
- `Agent/MemorySize` - Experience replay buffer size
- `Agent/Loss` - Training loss from experience replay
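
This guide does not state which exploration schedule the agent uses. As one common possibility, the sketch below shows exponential epsilon decay with a floor; the function name and all default values are assumptions for illustration only.

```python
def decayed_epsilon(episode, start=1.0, end=0.01, decay=0.995):
    """Exponentially decay epsilon from `start`, never going below `end`."""
    return max(end, start * decay ** episode)


# Print the value that would be logged under Agent/Epsilon at a few episodes
for episode in (0, 100, 500, 1000):
    print(episode, round(decayed_epsilon(episode), 4))
```

With a schedule like this, the `Agent/Epsilon` chart falls smoothly and then flattens at the floor value.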

#### **Moving Averages**
- `Moving_Average/Reward_50ep` - 50-episode average reward
- `Moving_Average/Return_50ep` - 50-episode average return
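
A windowed average like the 50-episode tags above can be kept with a fixed-size deque. This is a sketch, not the trainer's code: the `MovingAverage` class is hypothetical, and it assumes the average is taken over however many episodes exist until the window fills.

```python
from collections import deque


class MovingAverage:
    """Average of the most recent `window` values."""

    def __init__(self, window=50):
        self.values = deque(maxlen=window)  # old values drop off automatically

    def update(self, value):
        """Add a value and return the current windowed average."""
        self.values.append(value)
        return sum(self.values) / len(self.values)


# Small window so the effect is easy to see
average = MovingAverage(window=3)
results = [average.update(reward) for reward in [1, 2, 3, 4]]
print(results)
```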

#### **Best Performance**
- `Best/Return` - Best return percentage achieved

## Directory Structure

```
runs/
├── cnn_training_1748043814/     # CNN training session
│   ├── events.out.tfevents.*    # TensorBoard event files
│   └── ...
├── rl_training_1748043920/      # RL training session
│   ├── events.out.tfevents.*
│   └── ...
└── ...                          # Other training sessions
```
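
The numeric suffixes in the directory names above look like Unix timestamps. Assuming that is the convention, a run directory name could be built and parsed as sketched below; `new_run_dir` and `run_timestamp` are hypothetical helpers, not functions from this repository.

```python
import time
from pathlib import Path


def new_run_dir(prefix, root="runs", now=None):
    """Build a path such as runs/cnn_training_1748043814 (assumed naming scheme)."""
    timestamp = int(time.time() if now is None else now)
    return Path(root) / f"{prefix}_{timestamp}"


def run_timestamp(run_dir):
    """Return the trailing Unix timestamp of a run directory name, or None if absent."""
    suffix = Path(run_dir).name.rsplit("_", 1)[-1]
    return int(suffix) if suffix.isdigit() else None


print(new_run_dir("cnn_training"))
print(run_timestamp("runs/rl_training_1748043920"))
```

Parsing the timestamp makes it easy to sort runs by age, for example when deciding which ones to archive.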

## TensorBoard Features

### **Scalars Tab**
- Real-time line charts of all metrics
- Smoothing controls for noisy metrics
- Multiple run comparisons
- Download data as CSV or JSON (enable "Show data download links" in the Scalars sidebar)

### **Images Tab**
- Model architecture visualizations
- Training progression images

### **Graphs Tab**
- Computational graph of models
- Network architecture visualization

### **Histograms Tab**
- Weight and gradient distributions
- Activation patterns over time

### **Projector Tab**
- High-dimensional data visualization
- Feature embeddings

## Usage Examples

### 1. Monitor CNN Training
```bash
# Start CNN training (generates TensorBoard logs)
python main_clean.py --mode cnn --symbol ETH/USDT

# In another terminal, start TensorBoard
tensorboard --logdir=runs

# Open browser to http://localhost:6006
# Navigate to Scalars tab to see:
# - Training/EpochLoss declining over time
# - Validation/Accuracy improving
# - Training/LearningRate schedule
```

### 2. Compare Multiple Training Runs
```bash
# Run multiple training sessions
python test_cnn_only.py  # Creates cnn_training_X
python test_cnn_only.py  # Creates cnn_training_Y

# TensorBoard automatically shows both runs
# Compare performance across runs in the same charts
```

### 3. Monitor RL Agent Training
```bash
# Start RL training with TensorBoard logging
python main_clean.py --mode rl --symbol ETH/USDT

# View in TensorBoard:
# - Episode/TotalReward trending up
# - Trading/WinRate improving
# - Agent/Epsilon decreasing (less exploration)
```

## Real-Time Monitoring

### Key Indicators to Watch

#### **CNN Training Health**
- ✅ `Training/EpochLoss` should decrease over time
- ✅ `Validation/Accuracy` should increase
- ⚠️ Watch for overfitting (val loss increases while train loss decreases)
- ✅ `Training/LearningRate` should follow schedule
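
The overfitting warning above can also be checked programmatically. The sketch below flags the case where validation loss rose on each of the last few epochs while training loss fell; the function name and the `patience` default are assumptions, and a stricter or looser rule may suit your runs better.

```python
def is_overfitting(train_losses, val_losses, patience=3):
    """True if, over the last `patience` epochs, val loss rose and train loss fell every epoch."""
    if len(train_losses) <= patience or len(val_losses) <= patience:
        return False  # not enough history to judge
    recent_train = train_losses[-(patience + 1):]
    recent_val = val_losses[-(patience + 1):]
    train_falling = all(b < a for a, b in zip(recent_train, recent_train[1:]))
    val_rising = all(b > a for a, b in zip(recent_val, recent_val[1:]))
    return train_falling and val_rising


# Hypothetical loss histories, one value per epoch
train = [1.0, 0.8, 0.6, 0.5, 0.4]
val = [1.0, 0.9, 0.95, 1.0, 1.1]
print(is_overfitting(train, val))
```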

#### **RL Training Health**
- ✅ `Episode/TotalReward` trending upward
- ✅ `Trading/WinRate` above 50%
- ✅ `Moving_Average/Return_50ep` positive and stable
- ⚠️ `Agent/Epsilon` should decay over time

### Warning Signs
- **Loss not decreasing**: Check learning rate, data quality
- **Accuracy plateauing**: May need more data or different architecture
- **RL rewards oscillating**: Unstable learning, adjust hyperparameters
- **Win rate dropping**: Strategy not working, need different approach

## Configuration

### Custom TensorBoard Setup
```python
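import torch
from torch.utils.tensorboard import SummaryWriter

# Custom log directory
writer = SummaryWriter(log_dir='runs/my_experiment')

# Placeholder values; replace with your own metric, step, and tensor
value, step = 0.5, 0
weights = torch.randn(100)

# Log custom metrics
writer.add_scalar('Custom/Metric', value, step)
writer.add_histogram('Custom/Weights', weights, step)

# Flush pending events and release the event file
writer.close()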
```

### Advanced Features
```bash
# Start TensorBoard with custom port
tensorboard --logdir=runs --port=6007

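# Enable the debugger plugin (this flag may not exist in current releases; check `tensorboard --help`)
tensorboard --logdir=runs --debugger_port=6064

# Disable the fast data loader, for example if some runs fail to load with it
tensorboard --logdir=runs --load_fast=false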
```

## Integration with Training

### CNN Trainer Integration
- Automatically logs all training metrics
- Model architecture visualization
- Real data statistics tracking
- Best model checkpointing based on the validation metrics logged to TensorBoard

### RL Trainer Integration
- Episode-by-episode performance tracking
- Trading strategy effectiveness monitoring
- Agent learning progress visualization
- Hyperparameter optimization guidance

## Benefits Over Static Charts

### ✅ **Real-Time Updates**
- See training progress as it happens
- No need to wait for training completion
- Immediate feedback on hyperparameter changes

### ✅ **Interactive Exploration**
- Zoom, pan, and explore metrics
- Smooth noisy data with built-in controls
- Compare multiple training runs side-by-side

### ✅ **Rich Visualizations**
- Scalars, histograms, images, and graphs
- Model architecture visualization
- High-dimensional data projections

### ✅ **Data Export**
- Download metrics as CSV
- Programmatic access to training data
- Integration with external analysis tools
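
As a sketch of working with exported data, the snippet below parses a scalar CSV using only the standard library. It assumes the download has the columns `Wall time`, `Step`, and `Value`; verify the header of your own export. The function name and sample rows are hypothetical.

```python
import csv
import io


def read_scalar_csv(text):
    """Parse exported scalar CSV text into a list of (step, value) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(int(row["Step"]), float(row["Value"])) for row in reader]


# Hypothetical export of Validation/Loss (assumed column layout)
sample = "Wall time,Step,Value\n1748043814.0,0,0.91\n1748043874.0,1,0.78\n"
points = read_scalar_csv(sample)
best_step, best_value = min(points, key=lambda point: point[1])
print(f"lowest value {best_value} at step {best_step}")
```

To read a real file, pass `open(path, newline="").read()` to `read_scalar_csv`.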

## Troubleshooting

### TensorBoard Not Starting
```bash
# Install TensorBoard (reports "already satisfied" if it is present)
pip install tensorboard

# Verify runs directory exists
dir runs  # Windows
ls runs   # Linux/Mac

# Kill existing TensorBoard processes
taskkill /F /IM tensorboard.exe  # Windows
pkill -f tensorboard             # Linux/Mac
```

### No Data Showing
- Ensure training is writing event files to the `runs/` directory and that `--logdir` points to it
- Check the browser console for errors
- Try refreshing the page
- Verify the port (default 6006)

### Performance Issues
- Use `--load_fast=true` for faster loading
- Clear old log directories
- Reduce logging frequency in training code
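
One way to reduce logging frequency is to forward only every N-th step to the writer. The sketch below is an illustration under assumptions: `ThrottledLogger` and `RecordingWriter` are hypothetical names, and `RecordingWriter` is a stand-in that mimics the `add_scalar(tag, value, step)` call so the example runs without PyTorch.

```python
class ThrottledLogger:
    """Forwards a scalar to the writer only when the step is a multiple of `every`."""

    def __init__(self, writer, every=100):
        self.writer = writer
        self.every = every

    def log(self, tag, value, step):
        if step % self.every == 0:
            self.writer.add_scalar(tag, value, step)


class RecordingWriter:
    """Stand-in for SummaryWriter that records calls instead of writing event files."""

    def __init__(self):
        self.calls = []

    def add_scalar(self, tag, value, step):
        self.calls.append((tag, value, step))


writer = RecordingWriter()
logger = ThrottledLogger(writer, every=100)
for step in range(251):
    logger.log("Training/BatchLoss", 1.0 / (step + 1), step)

print([step for _, _, step in writer.calls])
```

In training code, pass a real `SummaryWriter` in place of `RecordingWriter`.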

## Best Practices

### 🎯 **Regular Monitoring**
- Check TensorBoard every 10-20 epochs during CNN training
- Monitor RL agents every 50-100 episodes
- Look for concerning trends early

### 📊 **Metric Organization**
- Use clear naming conventions (Training/, Validation/, etc.)
- Group related metrics together
- Log at appropriate frequencies (not every step)

### 💾 **Data Management**
- Archive old training runs periodically
- Keep successful run logs for reference
- Document experiment parameters in run names
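
To document parameters in run names, the hyperparameters can be encoded directly in the directory name. This is one possible convention, not the one used by the training scripts; the `run_name` helper and its `key=value` format are assumptions.

```python
def run_name(prefix, **params):
    """Build a name like cnn_batch_size=32_lr=0.001, with parameters sorted by key."""
    parts = [f"{key}={params[key]}" for key in sorted(params)]
    return "_".join([prefix, *parts])


name = run_name("cnn", lr=0.001, batch_size=32)
print(name)
# With a real writer: SummaryWriter(log_dir=f"runs/{name}")
```

Sorting the keys keeps names stable across runs, so the same settings always map to the same name.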

### 🔍 **Hyperparameter Tuning**
- Compare multiple runs with different hyperparameters
- Use TensorBoard data to guide optimization
- Track which settings produce best results

---

## Summary

TensorBoard integration provides **real-time, interactive monitoring** of training progress using **only real market data**. This replaces static plots with dynamic visualizations that help optimize model performance and catch issues early.

**Key Commands:**
```bash
# Train with TensorBoard logging
python main_clean.py --mode cnn --symbol ETH/USDT

# Start TensorBoard
python run_tensorboard.py

# Then open the dashboard in your browser:
#   http://localhost:6006
```

All metrics are derived from **real cryptocurrency market data** to ensure authentic trading model development.