gogo2/TENSORBOARD_MONITORING.md
Dobromir Popov 0fe8286787 misc
2025-05-24 09:58:36 +03:00

# TensorBoard Monitoring Guide
## Overview
The trading system now uses **TensorBoard** for real-time training monitoring instead of static charts. This provides dynamic, interactive visualizations that update during training.
## 🚨 CRITICAL: Real Market Data Only
All TensorBoard metrics come from training on **REAL market data**. No synthetic or generated data is used.
## Quick Start
### 1. Start Training with TensorBoard
```bash
# CNN Training with TensorBoard
python main_clean.py --mode cnn --symbol ETH/USDT
# RL Training with TensorBoard
python train_rl_with_realtime.py --episodes 10
# Quick CNN Test
python test_cnn_only.py
```
### 2. Launch TensorBoard
```bash
# Option 1: Direct command
tensorboard --logdir=runs
# Option 2: Convenience script
python run_tensorboard.py
```
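The contents of `run_tensorboard.py` are not shown in this guide. As a rough sketch of what such a convenience launcher could look like, the snippet below builds the same command as Option 1 and starts it as a child process. The function names `build_tensorboard_command` and `launch_tensorboard` are hypothetical, and the real script may work differently.

```python
import subprocess
import sys


def build_tensorboard_command(logdir="runs", port=6006):
    """Return the argument list that starts TensorBoard with the current interpreter."""
    # Running the module through sys.executable avoids PATH issues in virtualenvs.
    return [
        sys.executable, "-m", "tensorboard.main",
        "--logdir", logdir,
        "--port", str(port),
    ]


def launch_tensorboard(logdir="runs", port=6006):
    """Start TensorBoard as a child process and return the Popen handle."""
    return subprocess.Popen(build_tensorboard_command(logdir, port))
```

Calling `launch_tensorboard()` returns a process handle; call `.terminate()` on it to stop the server.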
### 3. Access TensorBoard
Open your browser to: **http://localhost:6006**
## Available Metrics
### CNN Training Metrics
#### **Training Progress**
- `Training/EpochLoss` - Training loss per epoch
- `Training/EpochAccuracy` - Training accuracy per epoch
- `Training/BatchLoss` - Batch-level loss
- `Training/BatchAccuracy` - Batch-level accuracy
- `Training/BatchConfidence` - Model confidence scores
- `Training/LearningRate` - Learning rate schedule
- `Training/EpochTime` - Time per epoch
#### **Validation Metrics**
- `Validation/Loss` - Validation loss
- `Validation/Accuracy` - Validation accuracy
- `Validation/AvgConfidence` - Average confidence on validation set
- `Validation/Class_0_Accuracy` - BUY class accuracy
- `Validation/Class_1_Accuracy` - SELL class accuracy
- `Validation/Class_2_Accuracy` - HOLD class accuracy
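
The per-class accuracy tags above can be produced from label and prediction lists. The following is a minimal sketch, not the trainer's actual code: the helper names are hypothetical, and `writer` is assumed to be any object exposing `add_scalar(tag, value, step)`, such as a `SummaryWriter`.

```python
from collections import Counter

# Class indices as listed above: 0 = BUY, 1 = SELL, 2 = HOLD
CLASS_NAMES = {0: "BUY", 1: "SELL", 2: "HOLD"}


def per_class_accuracy(labels, predictions, num_classes=3):
    """Return {class_index: accuracy}; classes with no samples are omitted."""
    total = Counter()
    correct = Counter()
    for label, pred in zip(labels, predictions):
        total[label] += 1
        if label == pred:
            correct[label] += 1
    return {c: correct[c] / total[c] for c in range(num_classes) if total[c] > 0}


def log_class_accuracy(writer, labels, predictions, epoch):
    """Log one scalar per class using the Validation/Class_X_Accuracy tags."""
    for c, acc in per_class_accuracy(labels, predictions).items():
        writer.add_scalar(f"Validation/Class_{c}_Accuracy", acc, epoch)
```

Watching these three curves separately helps reveal a model that scores well overall by predicting only the majority class.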
#### **Best Model Tracking**
- `Best/ValidationLoss` - Best validation loss achieved
- `Best/ValidationAccuracy` - Best validation accuracy achieved
#### **Data Statistics**
- `Data/TotalSamples` - Number of training samples from real data
- `Data/Features` - Number of features (detected from real data)
- `Data/Timeframes` - Number of timeframes used
- `Data/WindowSize` - Window size for temporal patterns
- `Data/Class_X_Count` - Sample count per class
- `Data/Feature_X_Mean/Std` - Feature statistics
#### **Model Architecture**
- `Model/TotalParameters` - Total model parameters
- `Model/TrainableParameters` - Trainable parameters
#### **Training Configuration**
- `Config/LearningRate` - Learning rate used
- `Config/BatchSize` - Batch size
- `Config/MaxEpochs` - Maximum epochs
### RL Training Metrics
#### **Episode Performance**
- `Episode/TotalReward` - Total reward per episode
- `Episode/FinalBalance` - Final balance after episode
- `Episode/TotalReturn` - Return percentage
- `Episode/Steps` - Steps taken in episode
#### **Trading Performance**
- `Trading/TotalTrades` - Number of trades executed
- `Trading/WinRate` - Percentage of profitable trades
- `Trading/ProfitFactor` - Gross profit / gross loss ratio
- `Trading/MaxDrawdown` - Maximum drawdown percentage
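
To make the definitions above concrete, here is a small sketch of how these three trading metrics could be computed from a list of per-trade profits and a balance curve. The function names and input shapes are assumptions for illustration; the RL trainer may compute them differently.

```python
def win_rate(trade_pnls):
    """Percentage of trades with a positive profit."""
    if not trade_pnls:
        return 0.0
    wins = sum(1 for pnl in trade_pnls if pnl > 0)
    return 100.0 * wins / len(trade_pnls)


def profit_factor(trade_pnls):
    """Gross profit divided by gross loss; inf when there are profits but no losses."""
    gross_profit = sum(p for p in trade_pnls if p > 0)
    gross_loss = -sum(p for p in trade_pnls if p < 0)
    if gross_loss == 0:
        return float("inf") if gross_profit > 0 else 0.0
    return gross_profit / gross_loss


def max_drawdown(balances):
    """Largest peak-to-trough decline of the balance curve, in percent."""
    peak = float("-inf")
    worst = 0.0
    for balance in balances:
        peak = max(peak, balance)
        if peak > 0:
            worst = max(worst, 100.0 * (peak - balance) / peak)
    return worst
```

A profit factor above 1 means gross profits exceed gross losses, which can hold even when the win rate is below 50%.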
#### **Agent Learning**
- `Agent/Epsilon` - Exploration rate (epsilon)
- `Agent/LearningRate` - Agent learning rate
- `Agent/MemorySize` - Experience replay buffer size
- `Agent/Loss` - Training loss from experience replay
#### **Moving Averages**
- `Moving_Average/Reward_50ep` - 50-episode average reward
- `Moving_Average/Return_50ep` - 50-episode average return
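
A 50-episode moving average can be kept with a fixed-length deque. This is a sketch under the assumption that the average simply covers whatever episodes exist until 50 have been seen; the class name `MovingAverage` is hypothetical.

```python
from collections import deque


class MovingAverage:
    """Mean of the most recent `window` values."""

    def __init__(self, window=50):
        # A bounded deque drops the oldest value automatically once it is full.
        self.values = deque(maxlen=window)

    def update(self, value):
        """Add a value and return the current average."""
        self.values.append(value)
        return sum(self.values) / len(self.values)
```

In a training loop this could be used as `writer.add_scalar("Moving_Average/Reward_50ep", avg.update(reward), episode)`.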
#### **Best Performance**
- `Best/Return` - Best return percentage achieved
## Directory Structure
```
runs/
├── cnn_training_1748043814/ # CNN training session
│ ├── events.out.tfevents.* # TensorBoard event files
│ └── ...
├── rl_training_1748043920/ # RL training session
│ ├── events.out.tfevents.*
│ └── ...
└── ... # Other training sessions
```
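
The numeric suffixes in the directory names above look like Unix timestamps. Assuming that is the convention, the sketch below shows one way to create such a run directory and to list existing runs from oldest to newest. Both helper names are hypothetical.

```python
import time
from pathlib import Path


def make_run_dir(prefix, base="runs", now=None):
    """Create and return <base>/<prefix>_<unix timestamp>."""
    timestamp = int(time.time() if now is None else now)
    run_dir = Path(base) / f"{prefix}_{timestamp}"
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


def list_runs(base="runs"):
    """Return run directories sorted oldest first by their numeric suffix."""
    def suffix(path):
        tail = path.name.rsplit("_", 1)[-1]
        return int(tail) if tail.isdigit() else 0

    base_path = Path(base)
    if not base_path.is_dir():
        return []
    return sorted((p for p in base_path.iterdir() if p.is_dir()), key=suffix)
```

The returned path can be passed as `log_dir` when constructing a `SummaryWriter`.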
## TensorBoard Features
### **Scalars Tab**
- Real-time line charts of all metrics
- Smoothing controls for noisy metrics
- Multiple run comparisons
- Download data as CSV
### **Images Tab**
- Model architecture visualizations
- Training progression images
### **Graphs Tab**
- Computational graph of models
- Network architecture visualization
### **Histograms Tab**
- Weight and gradient distributions
- Activation patterns over time
### **Projector Tab**
- High-dimensional data visualization
- Feature embeddings
## Usage Examples
### 1. Monitor CNN Training
```bash
# Start CNN training (generates TensorBoard logs)
python main_clean.py --mode cnn --symbol ETH/USDT
# In another terminal, start TensorBoard
tensorboard --logdir=runs
# Open browser to http://localhost:6006
# Navigate to Scalars tab to see:
# - Training/EpochLoss declining over time
# - Validation/Accuracy improving
# - Training/LearningRate schedule
```
### 2. Compare Multiple Training Runs
```bash
# Run multiple training sessions
python test_cnn_only.py # Creates cnn_training_X
python test_cnn_only.py # Creates cnn_training_Y
# TensorBoard automatically shows both runs
# Compare performance across runs in the same charts
```
### 3. Monitor RL Agent Training
```bash
# Start RL training with TensorBoard logging
python main_clean.py --mode rl --symbol ETH/USDT
# View in TensorBoard:
# - Episode/TotalReward trending up
# - Trading/WinRate improving
# - Agent/Epsilon decreasing (less exploration)
```
## Real-Time Monitoring
### Key Indicators to Watch
#### **CNN Training Health**
- `Training/EpochLoss` should decrease over time
- `Validation/Accuracy` should increase
- ⚠️ Watch for overfitting (val loss increases while train loss decreases)
- `Training/LearningRate` should follow schedule
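
The overfitting pattern described above (validation loss rising while training loss falls) can also be checked programmatically. The following is a simple heuristic sketch; the function name and the `patience` parameter are illustrative assumptions, not part of the trainer.

```python
def is_overfitting(train_losses, val_losses, patience=3):
    """True if validation loss rose while training loss fell
    for each of the last `patience` epochs."""
    needed = patience + 1  # `patience` changes require one more data point
    if len(train_losses) < needed or len(val_losses) < needed:
        return False
    train_tail = train_losses[-needed:]
    val_tail = val_losses[-needed:]
    return all(
        train_tail[i] < train_tail[i - 1] and val_tail[i] > val_tail[i - 1]
        for i in range(1, needed)
    )
```

A larger `patience` makes the check less sensitive to single noisy epochs.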
#### **RL Training Health**
- `Episode/TotalReward` trending upward
- `Trading/WinRate` above 50%
- `Moving_Average/Return_50ep` positive and stable
- ⚠️ `Agent/Epsilon` should decay over time
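
For reference, a decaying `Agent/Epsilon` curve often follows an exponential schedule with a floor. The sketch below shows that common pattern; the start, end, and decay values are placeholders, and the agent in this project may use a different schedule.

```python
def decayed_epsilon(episode, start=1.0, end=0.01, decay=0.995):
    """Exponential decay from `start` toward the floor `end`."""
    return max(end, start * decay ** episode)
```

If the logged curve stays flat at its starting value, the decay step is probably not being applied.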
### Warning Signs
- **Loss not decreasing**: Check learning rate, data quality
- **Accuracy plateauing**: May need more data or different architecture
- **RL rewards oscillating**: Unstable learning, adjust hyperparameters
- **Win rate dropping**: Strategy not working, need different approach
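
The "loss not decreasing" warning can be turned into a simple automated check. This is one possible heuristic, assuming a list of per-epoch losses; `has_stalled`, `window`, and `min_delta` are illustrative names and defaults.

```python
def has_stalled(losses, window=10, min_delta=1e-3):
    """True if the best loss in the last `window` epochs is not at least
    `min_delta` lower than the best loss seen before that window."""
    if len(losses) <= window:
        return False  # not enough history to compare
    best_before = min(losses[:-window])
    best_recent = min(losses[-window:])
    return best_recent > best_before - min_delta
```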
## Configuration
### Custom TensorBoard Setup
```python
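import torch
from torch.utils.tensorboard import SummaryWriter
# Custom log directory
writer = SummaryWriter(log_dir='runs/my_experiment')
# Log custom metrics (placeholder values for illustration)
step, value = 0, 0.5
weights = torch.randn(100)
writer.add_scalar('Custom/Metric', value, step)
writer.add_histogram('Custom/Weights', weights, step)
# Flush pending events and close the event file
writer.close()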
```
### Advanced Features
```bash
# Start TensorBoard with custom port
tensorboard --logdir=runs --port=6007
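# Enable the legacy TensorFlow debugger plugin (older TensorBoard releases only; newer versions may reject this flag)
tensorboard --logdir=runs --debugger_port=6064
# Disable the experimental fast data loader (try this if data fails to load)
tensorboard --logdir=runs --load_fast=false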
```
## Integration with Training
### CNN Trainer Integration
- Automatically logs all training metrics
- Model architecture visualization
- Real data statistics tracking
- Best model checkpointing based on the validation metrics logged to TensorBoard
### RL Trainer Integration
- Episode-by-episode performance tracking
- Trading strategy effectiveness monitoring
- Agent learning progress visualization
- Hyperparameter optimization guidance
## Benefits Over Static Charts
### ✅ **Real-Time Updates**
- See training progress as it happens
- No need to wait for training completion
- Immediate feedback on hyperparameter changes
### ✅ **Interactive Exploration**
- Zoom, pan, and explore metrics
- Smooth noisy data with built-in controls
- Compare multiple training runs side-by-side
### ✅ **Rich Visualizations**
- Scalars, histograms, images, and graphs
- Model architecture visualization
- High-dimensional data projections
### ✅ **Data Export**
- Download metrics as CSV
- Programmatic access to training data
- Integration with external analysis tools
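
As an example of working with exported data, the sketch below parses a CSV downloaded from the Scalars tab. It assumes the export has the columns `Wall time`, `Step`, and `Value`; check the header of your own file, since the helper names here are hypothetical.

```python
import csv
import io


def parse_scalar_csv(text):
    """Parse exported scalar CSV text into a list of (step, value) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(int(row["Step"]), float(row["Value"])) for row in reader]


def best_step(points, lower_is_better=True):
    """Return the (step, value) pair with the best value."""
    pick = min if lower_is_better else max
    return pick(points, key=lambda point: point[1])
```

For a file on disk, pass `open(path, newline="").read()` to `parse_scalar_csv`.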
## Troubleshooting
### TensorBoard Not Starting
```bash
# Check whether TensorBoard is installed
pip show tensorboard
# Install it if it is missing
pip install tensorboard
# Verify runs directory exists
dir runs # Windows
ls runs # Linux/Mac
# Kill existing TensorBoard processes
taskkill /F /IM tensorboard.exe # Windows
pkill -f tensorboard # Linux/Mac
```
### No Data Showing
- Ensure training is generating logs in `runs/` directory
- Check browser console for errors
- Try refreshing the page
- Verify correct port (default 6006)
### Performance Issues
- Use `--load_fast=true` for faster loading
- Clear old log directories
- Reduce logging frequency in training code
## Best Practices
### 🎯 **Regular Monitoring**
- Check TensorBoard every 10-20 epochs during CNN training
- Monitor RL agents every 50-100 episodes
- Look for concerning trends early
### 📊 **Metric Organization**
- Use clear naming conventions (Training/, Validation/, etc.)
- Group related metrics together
- Log at appropriate frequencies (not every step)
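
One way to avoid logging at every step is a thin wrapper that forwards only every N-th call. This is a minimal sketch; `ThrottledLogger` is a hypothetical name, and `writer` is assumed to be any object with `add_scalar(tag, value, step)`, such as a `SummaryWriter`.

```python
class ThrottledLogger:
    """Forward add_scalar calls to `writer` only when step is a multiple of `every`."""

    def __init__(self, writer, every=100):
        if every < 1:
            raise ValueError("every must be >= 1")
        self.writer = writer
        self.every = every

    def add_scalar(self, tag, value, step):
        """Log the value if this step is due; return True when it was logged."""
        if step % self.every == 0:
            self.writer.add_scalar(tag, value, step)
            return True
        return False
```

Batch-level tags such as `Training/BatchLoss` benefit most from throttling, while per-epoch tags can usually be logged every time.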
### 💾 **Data Management**
- Archive old training runs periodically
- Keep successful run logs for reference
- Document experiment parameters in run names
### 🔍 **Hyperparameter Tuning**
- Compare multiple runs with different hyperparameters
- Use TensorBoard data to guide optimization
- Track which settings produce best results
---
## Summary
TensorBoard integration provides **real-time, interactive monitoring** of training progress using **only real market data**. This replaces static plots with dynamic visualizations that help optimize model performance and catch issues early.
**Key Commands:**
```bash
# Train with TensorBoard logging
python main_clean.py --mode cnn --symbol ETH/USDT
# Start TensorBoard
python run_tensorboard.py
# Access dashboard in your browser at http://localhost:6006
```
All metrics are derived from **real cryptocurrency market data** to ensure authentic trading model development.