# TensorBoard Monitoring Guide

## Overview

The trading system now uses **TensorBoard** for real-time training monitoring instead of static charts. This provides dynamic, interactive visualizations that update during training.

## 🚨 CRITICAL: Real Market Data Only

All TensorBoard metrics are derived from **REAL market data training**. No synthetic or generated data is used.

## Quick Start

### 1. Start Training with TensorBoard
```bash
# CNN Training with TensorBoard
python main_clean.py --mode cnn --symbol ETH/USDT

# RL Training with TensorBoard
python train_rl_with_realtime.py --episodes 10

# Quick CNN Test
python test_cnn_only.py
```

### 2. Launch TensorBoard
```bash
# Option 1: Direct command
tensorboard --logdir=runs

# Option 2: Convenience script
python run_tensorboard.py
```

### 3. Access TensorBoard
Open your browser to: **http://localhost:6006**

## Available Metrics

### CNN Training Metrics

#### **Training Progress**
- `Training/EpochLoss` - Training loss per epoch
- `Training/EpochAccuracy` - Training accuracy per epoch
- `Training/BatchLoss` - Batch-level loss
- `Training/BatchAccuracy` - Batch-level accuracy
- `Training/BatchConfidence` - Model confidence scores
- `Training/LearningRate` - Learning rate schedule
- `Training/EpochTime` - Time per epoch

#### **Validation Metrics**
- `Validation/Loss` - Validation loss
- `Validation/Accuracy` - Validation accuracy
- `Validation/AvgConfidence` - Average confidence on validation set
- `Validation/Class_0_Accuracy` - BUY class accuracy
- `Validation/Class_1_Accuracy` - SELL class accuracy
- `Validation/Class_2_Accuracy` - HOLD class accuracy

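
The per-class validation metrics can be understood as "correct predictions for a class divided by the number of samples of that class". The sketch below illustrates that calculation in plain Python. The function name `per_class_accuracy` and the sample labels are hypothetical; only the class mapping (0 = BUY, 1 = SELL, 2 = HOLD) comes from the list above, and the trainer's actual implementation may differ.

```python
from collections import Counter

# Class index mapping taken from the metric list above.
CLASS_NAMES = {0: "BUY", 1: "SELL", 2: "HOLD"}


def per_class_accuracy(y_true, y_pred, num_classes=3):
    """Return {class_index: accuracy} for every class present in y_true."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {
        c: correct[c] / totals[c]
        for c in range(num_classes)
        if totals[c] > 0
    }


# Made-up labels and predictions, for illustration only.
labels = [0, 0, 1, 2, 2, 2]
preds = [0, 1, 1, 2, 2, 0]
for cls, acc in per_class_accuracy(labels, preds).items():
    # With a SummaryWriter this would be logged roughly as:
    # writer.add_scalar(f"Validation/Class_{cls}_Accuracy", acc, epoch)
    print(f"Validation/Class_{cls}_Accuracy ({CLASS_NAMES[cls]}): {acc:.2f}")
```

Classes with no validation samples are skipped rather than reported as zero, so an empty class does not look like a failing one.
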
#### **Best Model Tracking**
- `Best/ValidationLoss` - Best validation loss achieved
- `Best/ValidationAccuracy` - Best validation accuracy achieved

#### **Data Statistics**
- `Data/TotalSamples` - Number of training samples from real data
- `Data/Features` - Number of features (detected from real data)
- `Data/Timeframes` - Number of timeframes used
- `Data/WindowSize` - Window size for temporal patterns
- `Data/Class_X_Count` - Sample count per class
- `Data/Feature_X_Mean/Std` - Feature statistics

#### **Model Architecture**
- `Model/TotalParameters` - Total model parameters
- `Model/TrainableParameters` - Trainable parameters

#### **Training Configuration**
- `Config/LearningRate` - Learning rate used
- `Config/BatchSize` - Batch size
- `Config/MaxEpochs` - Maximum epochs

### RL Training Metrics

#### **Episode Performance**
- `Episode/TotalReward` - Total reward per episode
- `Episode/FinalBalance` - Final balance after episode
- `Episode/TotalReturn` - Return percentage
- `Episode/Steps` - Steps taken in episode

#### **Trading Performance**
- `Trading/TotalTrades` - Number of trades executed
- `Trading/WinRate` - Percentage of profitable trades
- `Trading/ProfitFactor` - Gross profit / gross loss ratio
- `Trading/MaxDrawdown` - Maximum drawdown percentage

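
To make the trading metrics concrete, here is a minimal sketch of how they could be computed from a list of per-trade profit/loss values. The function name `trading_metrics`, the sample trades, and the starting balance are hypothetical. It also assumes that break-even trades count as neither wins nor losses, that drawdown is measured after each closed trade, and that the starting balance is positive; the trainer may define these differently.

```python
def trading_metrics(trade_pnls, starting_balance):
    """Summarize a list of per-trade profit/loss values."""
    wins = [p for p in trade_pnls if p > 0]
    losses = [p for p in trade_pnls if p < 0]
    gross_profit = sum(wins)
    gross_loss = -sum(losses)

    # Track the balance after each trade to find the largest
    # peak-to-trough drop, expressed as a fraction of the peak.
    balance = peak = starting_balance
    max_drawdown = 0.0
    for pnl in trade_pnls:
        balance += pnl
        peak = max(peak, balance)
        max_drawdown = max(max_drawdown, (peak - balance) / peak)

    return {
        "total_trades": len(trade_pnls),
        "win_rate": 100.0 * len(wins) / len(trade_pnls) if trade_pnls else 0.0,
        # Undefined without losing trades; reported here as infinity.
        "profit_factor": gross_profit / gross_loss if gross_loss else float("inf"),
        "max_drawdown": 100.0 * max_drawdown,
    }


# Made-up trades, for illustration only.
metrics = trading_metrics([50.0, -20.0, 30.0, -40.0], starting_balance=1000.0)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

A profit factor above 1 means gross profit exceeds gross loss, so it is worth reading alongside the win rate: a strategy can win more than half its trades and still lose money if the losing trades are larger.
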
#### **Agent Learning**
- `Agent/Epsilon` - Exploration rate (epsilon)
- `Agent/LearningRate` - Agent learning rate
- `Agent/MemorySize` - Experience replay buffer size
- `Agent/Loss` - Training loss from experience replay

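
As an illustration of what the `Agent/Epsilon` curve typically looks like, the sketch below uses multiplicative decay with a lower bound, a common choice for epsilon-greedy agents. This guide does not state which schedule the agent actually uses, so the function `decay_epsilon` and the values `0.995` and `0.01` are assumptions.

```python
def decay_epsilon(epsilon, decay=0.995, epsilon_min=0.01):
    """Multiply epsilon by `decay`, never going below `epsilon_min`."""
    return max(epsilon_min, epsilon * decay)


epsilon = 1.0
history = []
for episode in range(1000):
    # With a SummaryWriter this would be logged roughly as:
    # writer.add_scalar("Agent/Epsilon", epsilon, episode)
    history.append(epsilon)
    epsilon = decay_epsilon(epsilon)

print(f"start={history[0]:.3f} end={history[-1]:.3f}")
```

Under this schedule the curve falls smoothly and then flattens at the floor; a curve that never moves, or that jumps back up, would be worth investigating.
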
#### **Moving Averages**
- `Moving_Average/Reward_50ep` - 50-episode average reward
- `Moving_Average/Return_50ep` - 50-episode average return

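
A 50-episode moving average can be kept with a fixed-length buffer, as in the sketch below. The class name `MovingAverage` is hypothetical, and averaging over a partially filled window during the first 49 episodes is an assumption about how the trainer behaves.

```python
from collections import deque


class MovingAverage:
    """Mean of the most recent `window` values."""

    def __init__(self, window=50):
        # deque with maxlen drops the oldest value automatically.
        self.values = deque(maxlen=window)

    def update(self, value):
        """Add a value and return the current average."""
        self.values.append(value)
        return sum(self.values) / len(self.values)


reward_avg = MovingAverage(window=50)
for episode, reward in enumerate([10.0, 20.0, 30.0]):
    avg = reward_avg.update(reward)
    # With a SummaryWriter this would be logged roughly as:
    # writer.add_scalar("Moving_Average/Reward_50ep", avg, episode)
    print(episode, avg)
```
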
#### **Best Performance**
- `Best/Return` - Best return percentage achieved

## Directory Structure

```
runs/
├── cnn_training_1748043814/     # CNN training session
│   ├── events.out.tfevents.*    # TensorBoard event files
│   └── ...
├── rl_training_1748043920/      # RL training session
│   ├── events.out.tfevents.*
│   └── ...
└── ...                          # Other training sessions
```

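
The numeric suffixes in the directory names above look like Unix timestamps. Assuming that is the convention, a run directory name could be built as in this sketch; the helper `make_run_dir` is hypothetical and not part of the project.

```python
import time
from pathlib import Path


def make_run_dir(prefix, root="runs", now=None):
    """Return a path like runs/<prefix>_<unix timestamp>."""
    timestamp = int(time.time() if now is None else now)
    return Path(root) / f"{prefix}_{timestamp}"


# Passing `now` reproduces the first directory name shown above.
print(make_run_dir("cnn_training", now=1748043814).as_posix())
print(make_run_dir("rl_training").as_posix())
```

Because each session gets its own subdirectory under `runs/`, TensorBoard lists every session as a separate run when started with `--logdir=runs`.
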
## TensorBoard Features

### **Scalars Tab**
- Real-time line charts of all metrics
- Smoothing controls for noisy metrics
- Multiple run comparisons
- Download data as CSV

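
To give a feel for what smoothing does to a noisy metric, the sketch below applies a debiased exponential moving average, a common way to implement this kind of control. It is an illustration only, not TensorBoard's own code, and the function name `smooth` and the default weight are assumptions.

```python
def smooth(values, weight=0.6):
    """Debiased exponential moving average; `weight` must be in [0, 1)."""
    smoothed = []
    last = 0.0
    for i, value in enumerate(values, start=1):
        last = last * weight + (1.0 - weight) * value
        # Dividing by (1 - weight**i) corrects the bias toward the 0.0 start.
        smoothed.append(last / (1.0 - weight ** i))
    return smoothed


# Made-up noisy loss values, for illustration only.
noisy_loss = [1.0, 0.6, 0.9, 0.5, 0.7, 0.4]
print([round(v, 3) for v in smooth(noisy_loss)])
```

A higher weight gives a flatter curve that reacts more slowly, so heavy smoothing can hide a sudden divergence for several steps.
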
### **Images Tab**
- Model architecture visualizations
- Training progression images

### **Graphs Tab**
- Computational graph of models
- Network architecture visualization

### **Histograms Tab**
- Weight and gradient distributions
- Activation patterns over time

### **Projector Tab**
- High-dimensional data visualization
- Feature embeddings

## Usage Examples

### 1. Monitor CNN Training
```bash
# Start CNN training (generates TensorBoard logs)
python main_clean.py --mode cnn --symbol ETH/USDT

# In another terminal, start TensorBoard
tensorboard --logdir=runs

# Open browser to http://localhost:6006
# Navigate to Scalars tab to see:
# - Training/EpochLoss declining over time
# - Validation/Accuracy improving
# - Training/LearningRate schedule
```

### 2. Compare Multiple Training Runs
```bash
# Run multiple training sessions
python test_cnn_only.py  # Creates cnn_training_X
python test_cnn_only.py  # Creates cnn_training_Y

# TensorBoard automatically shows both runs
# Compare performance across runs in the same charts
```

### 3. Monitor RL Agent Training
```bash
# Start RL training with TensorBoard logging
python main_clean.py --mode rl --symbol ETH/USDT

# View in TensorBoard:
# - Episode/TotalReward trending up
# - Trading/WinRate improving
# - Agent/Epsilon decreasing (less exploration)
```

## Real-Time Monitoring

### Key Indicators to Watch

#### **CNN Training Health**
- ✅ `Training/EpochLoss` should decrease over time
- ✅ `Validation/Accuracy` should increase
- ⚠️ Watch for overfitting (`Validation/Loss` rises while `Training/EpochLoss` keeps falling)
- ✅ `Training/LearningRate` should follow the configured schedule

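
The overfitting pattern described above can also be checked programmatically. The sketch below is a simple heuristic that compares the latest losses with those from a few epochs earlier; the function name `is_overfitting` and the `patience` parameter are hypothetical, and the loss values are made up.

```python
def is_overfitting(train_losses, val_losses, patience=3):
    """Compare the latest losses with those from `patience` epochs earlier.

    Returns True when training loss is lower and validation loss is higher
    than `patience` epochs ago, and False when there is too little history.
    """
    if len(train_losses) <= patience or len(val_losses) <= patience:
        return False
    train_falling = train_losses[-1] < train_losses[-1 - patience]
    val_rising = val_losses[-1] > val_losses[-1 - patience]
    return train_falling and val_rising


# Made-up loss curves, for illustration only.
train = [1.0, 0.8, 0.6, 0.5, 0.4]
val = [1.0, 0.9, 0.95, 1.0, 1.1]
print(is_overfitting(train, val))
```

A check like this is only a coarse signal; the Scalars tab remains the better place to judge whether the divergence is sustained or just noise.
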
#### **RL Training Health**
- ✅ `Episode/TotalReward` trending upward
- ✅ `Trading/WinRate` above 50%
- ✅ `Moving_Average/Return_50ep` positive and stable
- ⚠️ `Agent/Epsilon` should decay over time

### Warning Signs
- **Loss not decreasing**: Check learning rate, data quality
- **Accuracy plateauing**: May need more data or different architecture
- **RL rewards oscillating**: Unstable learning, adjust hyperparameters
- **Win rate dropping**: Strategy not working, need different approach

## Configuration

### Custom TensorBoard Setup
```python
import torch
from torch.utils.tensorboard import SummaryWriter

# Custom log directory
writer = SummaryWriter(log_dir='runs/my_experiment')

# Placeholder values for illustration; substitute your own metric, step, and tensor
value, step = 0.5, 0
weights = torch.randn(100)

# Log custom metrics
writer.add_scalar('Custom/Metric', value, step)
writer.add_histogram('Custom/Weights', weights, step)

# Flush pending events to disk and release the event file
writer.close()
```

### Advanced Features
```bash
# Start TensorBoard on a custom port
tensorboard --logdir=runs --port=6007

# Enable the debugger plugin (flag may not be available in recent TensorBoard releases)
tensorboard --logdir=runs --debugger_port=6064

# Use the standard data loader instead of the fast loader (e.g. if runs fail to load)
tensorboard --logdir=runs --load_fast=false
```

## Integration with Training

### CNN Trainer Integration
- Automatically logs all training metrics
- Model architecture visualization
- Real data statistics tracking
- Best-model checkpointing based on the validation metrics logged to TensorBoard

### RL Trainer Integration
- Episode-by-episode performance tracking
- Trading strategy effectiveness monitoring
- Agent learning progress visualization
- Hyperparameter optimization guidance

## Benefits Over Static Charts

### ✅ **Real-Time Updates**
- See training progress as it happens
- No need to wait for training completion
- Immediate feedback on hyperparameter changes

### ✅ **Interactive Exploration**
- Zoom, pan, and explore metrics
- Smooth noisy data with built-in controls
- Compare multiple training runs side-by-side

### ✅ **Rich Visualizations**
- Scalars, histograms, images, and graphs
- Model architecture visualization
- High-dimensional data projections

### ✅ **Data Export**
- Download metrics as CSV
- Programmatic access to training data
- Integration with external analysis tools

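
As an example of working with exported data, the sketch below parses a scalar CSV and finds the step with the lowest value. It assumes the export uses the column headers `Wall time`, `Step`, and `Value`; the function name `load_scalar_csv` is hypothetical and the sample rows are made up.

```python
import csv
import io


def load_scalar_csv(text):
    """Parse exported scalar CSV text into a list of (step, value) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(int(row["Step"]), float(row["Value"])) for row in reader]


# Made-up rows standing in for a downloaded Validation/Loss export.
sample = (
    "Wall time,Step,Value\n"
    "1748043814.0,0,0.92\n"
    "1748043820.5,1,0.71\n"
    "1748043827.1,2,0.78\n"
)
points = load_scalar_csv(sample)
best_step, best_value = min(points, key=lambda p: p[1])
print(f"lowest value {best_value} at step {best_step}")
```

For a real export, read the file contents with `Path("export.csv").read_text()` (the file name here is a placeholder) and pass the result to the same function.
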
## Troubleshooting

### TensorBoard Not Starting
```bash
# Install TensorBoard (no-op if it is already installed)
pip install tensorboard

# Verify runs directory exists
dir runs  # Windows
ls runs   # Linux/Mac

# Kill existing TensorBoard processes
taskkill /F /IM tensorboard.exe  # Windows
pkill -f tensorboard             # Linux/Mac
```

### No Data Showing
- Ensure training is generating logs in `runs/` directory
- Check browser console for errors
- Try refreshing the page
- Verify correct port (default 6006)

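
To confirm that training is actually writing logs, the `runs/` directory can be scanned for event files. The sketch below looks for files matching the `events.out.tfevents.*` pattern shown under Directory Structure; the helper name `find_event_files` is hypothetical.

```python
from pathlib import Path


def find_event_files(logdir="runs"):
    """Return {run directory: [event file names]} found under `logdir`."""
    root = Path(logdir)
    if not root.is_dir():
        return {}
    found = {}
    for path in sorted(root.rglob("events.out.tfevents.*")):
        run = path.parent.relative_to(root).as_posix()
        found.setdefault(run, []).append(path.name)
    return found


runs = find_event_files("runs")
if not runs:
    print("No event files found under runs/")
for run, files in runs.items():
    print(f"{run}: {len(files)} event file(s)")
```

An empty result means TensorBoard has nothing to display, which points to the training script or its log directory rather than to the browser.
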
### Performance Issues
- Use `--load_fast=true` for faster loading
- Clear old log directories
- Reduce logging frequency in training code

## Best Practices

### 🎯 **Regular Monitoring**
- Check TensorBoard every 10-20 epochs during CNN training
- Monitor RL agents every 50-100 episodes
- Look for concerning trends early

### 📊 **Metric Organization**
- Use clear naming conventions (Training/, Validation/, etc.)
- Group related metrics together
- Log at appropriate frequencies (not every step)

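
One way to avoid logging every step is to wrap the writer and forward only every N-th step. The sketch below shows the idea with a stand-in writer so that it runs without PyTorch; `ThrottledLogger` and `ListWriter` are hypothetical names, and a real `SummaryWriter` would take the place of `ListWriter`.

```python
class ThrottledLogger:
    """Forward add_scalar calls only when step is a multiple of `every`."""

    def __init__(self, writer, every=100):
        self.writer = writer
        self.every = every

    def log(self, tag, value, step):
        """Log the value if this step is due; return whether it was logged."""
        if step % self.every == 0:
            self.writer.add_scalar(tag, value, step)
            return True
        return False


class ListWriter:
    """Stand-in for SummaryWriter that records calls in a list."""

    def __init__(self):
        self.calls = []

    def add_scalar(self, tag, value, step):
        self.calls.append((tag, value, step))


writer = ListWriter()
logger = ThrottledLogger(writer, every=100)
for step in range(250):
    logger.log("Training/BatchLoss", 1.0 / (step + 1), step)
print(len(writer.calls), "of 250 steps logged")
```

Logging less often keeps event files small, which also helps with the loading problems listed under Performance Issues.
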
### 💾 **Data Management**
- Archive old training runs periodically
- Keep successful run logs for reference
- Document experiment parameters in run names

### 🔍 **Hyperparameter Tuning**
- Compare multiple runs with different hyperparameters
- Use TensorBoard data to guide optimization
- Track which settings produce best results

---

## Summary

TensorBoard integration provides **real-time, interactive monitoring** of training progress using **only real market data**. This replaces static plots with dynamic visualizations that help optimize model performance and catch issues early.

**Key Commands:**
```bash
# Train with TensorBoard logging
python main_clean.py --mode cnn --symbol ETH/USDT

# Start TensorBoard
python run_tensorboard.py

# Access the dashboard in a browser at http://localhost:6006
```

All metrics are derived from **real cryptocurrency market data** to ensure authentic trading model development.