9.6 KiB
Design Document
Overview
The UI Stability Fix implements a comprehensive solution to resolve critical stability issues between the dashboard UI and training processes. The design focuses on complete process isolation, proper async/await handling, resource conflict resolution, and robust error handling. The solution ensures that the dashboard can operate independently without affecting training system stability.
Architecture
High-Level Architecture
graph TB
subgraph "Training Process"
TP[Training Process]
TM[Training Models]
TD[Training Data]
TL[Training Logs]
end
subgraph "Dashboard Process"
DP[Dashboard Process]
DU[Dashboard UI]
DC[Dashboard Cache]
DL[Dashboard Logs]
end
subgraph "Shared Resources"
SF[Shared Files]
SC[Shared Config]
SM[Shared Models]
SD[Shared Data]
end
TP --> SF
DP --> SF
TP --> SC
DP --> SC
TP --> SM
DP --> SM
TP --> SD
DP --> SD
TP -.->|No Direct Connection| DP
Process Isolation Design
The system will implement complete process isolation using:
- Separate Python Processes: Dashboard and training run as independent processes
- Inter-Process Communication: File-based communication for status and data sharing
- Resource Partitioning: Separate resource allocation for each process
- Independent Lifecycle Management: Each process can start, stop, and restart independently
Async/Await Error Resolution
The design addresses async issues through:
- Proper Event Loop Management: Single event loop per process with proper lifecycle
- Async Context Isolation: Separate async contexts for different components
- Coroutine Handling: Proper awaiting of all async operations
- Exception Propagation: Proper async exception handling and propagation
Components and Interfaces
1. Process Manager
Purpose: Manages the lifecycle of both dashboard and training processes
Interface:
class ProcessManager:
def start_training_process(self) -> bool
def start_dashboard_process(self, port: int = 8050) -> bool
def stop_training_process(self) -> bool
def stop_dashboard_process(self) -> bool
def get_process_status(self) -> Dict[str, str]
def restart_process(self, process_name: str) -> bool
Implementation Details:
- Uses subprocess.Popen for process creation
- Monitors process health with periodic checks
- Handles process output logging and error capture
- Implements graceful shutdown with timeout handling
2. Isolated Dashboard
Purpose: Provides a completely isolated dashboard that doesn't interfere with training
Interface:
class IsolatedDashboard:
def __init__(self, config: Dict[str, Any])
def start_server(self, host: str, port: int) -> None
def stop_server(self) -> None
def update_data_from_files(self) -> None
def get_training_status(self) -> Dict[str, Any]
Implementation Details:
- Runs in separate process with own event loop
- Reads data from shared files instead of direct memory access
- Uses file-based communication for training status
- Implements proper async/await patterns for all operations
3. Isolated Training Process
Purpose: Runs training completely isolated from UI components
Interface:
class IsolatedTrainingProcess:
def __init__(self, config: Dict[str, Any])
def start_training(self) -> None
def stop_training(self) -> None
def get_training_metrics(self) -> Dict[str, Any]
def save_status_to_file(self) -> None
Implementation Details:
- No UI dependencies or imports
- Writes status and metrics to shared files
- Implements proper resource cleanup
- Uses separate logging configuration
4. Shared Data Manager
Purpose: Manages data sharing between processes through files
Interface:
class SharedDataManager:
def write_training_status(self, status: Dict[str, Any]) -> None
def read_training_status(self) -> Dict[str, Any]
def write_market_data(self, data: Dict[str, Any]) -> None
def read_market_data(self) -> Dict[str, Any]
def write_model_metrics(self, metrics: Dict[str, Any]) -> None
def read_model_metrics(self) -> Dict[str, Any]
Implementation Details:
- Uses JSON files for structured data
- Implements file locking to prevent corruption
- Provides atomic write operations
- Includes data validation and error handling
5. Resource Manager
Purpose: Manages resource allocation and prevents conflicts
Interface:
class ResourceManager:
def allocate_gpu_resources(self, process_name: str) -> bool
def release_gpu_resources(self, process_name: str) -> None
def check_memory_usage(self) -> Dict[str, float]
def enforce_resource_limits(self) -> None
Implementation Details:
- Monitors GPU memory usage per process
- Implements resource quotas and limits
- Provides resource conflict detection
- Includes automatic resource cleanup
6. Async Handler
Purpose: Properly handles all async operations in the dashboard
Interface:
class AsyncHandler:
def __init__(self, loop: asyncio.AbstractEventLoop)
async def handle_orchestrator_connection(self) -> None
async def handle_cob_integration(self) -> None
async def handle_trading_decisions(self, decision: Dict) -> None
def run_async_safely(self, coro: Coroutine) -> Any
Implementation Details:
- Manages single event loop per process
- Provides proper exception handling for async operations
- Implements timeout handling for long-running operations
- Includes async context management
Data Models
Process Status Model
@dataclass
class ProcessStatus:
name: str
pid: int
status: str # 'running', 'stopped', 'error'
start_time: datetime
last_heartbeat: datetime
memory_usage: float
cpu_usage: float
error_message: Optional[str] = None
Training Status Model
@dataclass
class TrainingStatus:
is_running: bool
current_epoch: int
total_epochs: int
loss: float
accuracy: float
last_update: datetime
model_path: str
error_message: Optional[str] = None
Dashboard State Model
@dataclass
class DashboardState:
is_connected: bool
last_data_update: datetime
active_connections: int
error_count: int
performance_metrics: Dict[str, float]
Error Handling
Exception Hierarchy
class UIStabilityError(Exception):
"""Base exception for UI stability issues"""
pass
class ProcessCommunicationError(UIStabilityError):
"""Error in inter-process communication"""
pass
class AsyncOperationError(UIStabilityError):
"""Error in async operation handling"""
pass
class ResourceConflictError(UIStabilityError):
"""Error due to resource conflicts"""
pass
Error Recovery Strategies
- Automatic Retry: For transient network and file I/O errors
- Graceful Degradation: Fallback to basic functionality when components fail
- Process Restart: Automatic restart of failed processes
- Circuit Breaker: Temporary disable of failing components
- Rollback: Revert to last known good state
Error Monitoring
- Centralized error logging with structured format
- Real-time error rate monitoring
- Automatic alerting for critical errors
- Error trend analysis and reporting
Testing Strategy
Unit Tests
- Test each component in isolation
- Mock external dependencies
- Verify error handling paths
- Test async operation handling
Integration Tests
- Test inter-process communication
- Verify resource sharing mechanisms
- Test process lifecycle management
- Validate error recovery scenarios
System Tests
- End-to-end stability testing
- Load testing with concurrent processes
- Failure injection testing
- Performance regression testing
Monitoring Tests
- Health check endpoint testing
- Metrics collection validation
- Alert system testing
- Dashboard functionality testing
Performance Considerations
Resource Optimization
- Minimize memory footprint of each process
- Optimize file I/O operations for data sharing
- Implement efficient data serialization
- Use connection pooling for external services
Scalability
- Support multiple dashboard instances
- Handle increased data volume gracefully
- Implement efficient caching strategies
- Optimize for high-frequency updates
Monitoring
- Real-time performance metrics collection
- Resource usage tracking per process
- Response time monitoring
- Throughput measurement
Security Considerations
Process Isolation
- Separate user contexts for processes
- Limited file system access permissions
- Network access restrictions
- Resource usage limits
Data Protection
- Secure file sharing mechanisms
- Data validation and sanitization
- Access control for shared resources
- Audit logging for sensitive operations
Communication Security
- Encrypted inter-process communication
- Authentication for API endpoints
- Input validation for all interfaces
- Rate limiting for external requests
Deployment Strategy
Development Environment
- Local process management scripts
- Development-specific configuration
- Enhanced logging and debugging
- Hot-reload capabilities
Production Environment
- Systemd service management
- Production configuration templates
- Log rotation and archiving
- Monitoring and alerting setup
Migration Plan
- Deploy new process management components
- Update configuration files
- Test process isolation functionality
- Gradually migrate existing deployments
- Monitor stability improvements
- Remove legacy components