# Design Document

## Overview

The UI Stability Fix resolves critical stability issues between the dashboard UI and the training processes. The design focuses on complete process isolation, correct async/await handling, resource conflict resolution, and robust error handling, so that the dashboard can run independently without affecting the stability of the training system.

## Architecture

### High-Level Architecture

```mermaid
graph TB
    subgraph "Training Process"
        TP[Training Process]
        TM[Training Models]
        TD[Training Data]
        TL[Training Logs]
    end

    subgraph "Dashboard Process"
        DP[Dashboard Process]
        DU[Dashboard UI]
        DC[Dashboard Cache]
        DL[Dashboard Logs]
    end

    subgraph "Shared Resources"
        SF[Shared Files]
        SC[Shared Config]
        SM[Shared Models]
        SD[Shared Data]
    end

    TP --> SF
    DP --> SF
    TP --> SC
    DP --> SC
    TP --> SM
    DP --> SM
    TP --> SD
    DP --> SD

    TP -.->|No Direct Connection| DP
```

### Process Isolation Design

The system will implement complete process isolation using:

1. **Separate Python Processes**: Dashboard and training run as independent processes
2. **Inter-Process Communication**: File-based communication for status and data sharing
3. **Resource Partitioning**: Separate resource allocation for each process
4. **Independent Lifecycle Management**: Each process can start, stop, and restart independently

### Async/Await Error Resolution

The design addresses async issues through:

1. **Proper Event Loop Management**: Single event loop per process with proper lifecycle
2. **Async Context Isolation**: Separate async contexts for different components
3. **Coroutine Handling**: Proper awaiting of all async operations
4. **Exception Propagation**: Proper async exception handling and propagation

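
The four points above can be read in several ways. The following is a minimal sketch of one reading, not the project's implementation: a single `asyncio.run` entry point owns the event loop, every component coroutine is awaited under a timeout, and a failure in one component is reported without cancelling the others. The function name `run_components`, the component names, and the string status values are assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Dict


async def run_components(
    components: Dict[str, Callable[[], Awaitable[None]]],
    timeout: float,
) -> Dict[str, str]:
    """Await every component once and report a status string per component."""

    async def guarded(factory: Callable[[], Awaitable[None]]) -> str:
        try:
            # Each coroutine is awaited explicitly and bounded by a timeout.
            await asyncio.wait_for(factory(), timeout=timeout)
            return "ok"
        except asyncio.TimeoutError:
            return "timeout"
        except Exception as exc:  # report the failure instead of losing it
            return f"error: {exc}"

    names = list(components)
    # gather runs the guarded coroutines concurrently on the one event loop.
    results = await asyncio.gather(*(guarded(components[n]) for n in names))
    return dict(zip(names, results))
```

A caller would start it with `asyncio.run(run_components({...}, timeout=5.0))`, which creates the loop, runs the coroutine to completion, and closes the loop afterwards.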
## Components and Interfaces

### 1. Process Manager

**Purpose**: Manages the lifecycle of both dashboard and training processes

**Interface**:
```python
from typing import Dict


class ProcessManager:
    # Interface stubs only; method bodies are intentionally elided.
    def start_training_process(self) -> bool: ...
    def start_dashboard_process(self, port: int = 8050) -> bool: ...
    def stop_training_process(self) -> bool: ...
    def stop_dashboard_process(self) -> bool: ...
    def get_process_status(self) -> Dict[str, str]: ...
    def restart_process(self, process_name: str) -> bool: ...
```

**Implementation Details**:
- Uses `subprocess.Popen` for process creation
- Monitors process health with periodic checks
- Handles process output logging and error capture
- Implements graceful shutdown with timeout handling

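
As a rough illustration of the `subprocess.Popen` and graceful-shutdown bullets, the sketch below starts named child processes and stops them by asking politely first and force-killing only after a timeout. The class name `ManagedProcesses`, the `commands` mapping, and the 5-second default are assumptions for this example; the real `ProcessManager` may be structured differently.

```python
import subprocess
from typing import Dict, List


class ManagedProcesses:
    """Hypothetical, simplified stand-in for ProcessManager."""

    def __init__(self, commands: Dict[str, List[str]], stop_timeout: float = 5.0):
        self._commands = commands          # process name -> argv (assumed layout)
        self._stop_timeout = stop_timeout  # seconds to wait before force-killing
        self._procs: Dict[str, subprocess.Popen] = {}

    def _is_running(self, name: str) -> bool:
        proc = self._procs.get(name)
        # poll() returns None while the child is still alive.
        return proc is not None and proc.poll() is None

    def start(self, name: str) -> bool:
        if self._is_running(name):
            return False  # already running; do not spawn a duplicate
        self._procs[name] = subprocess.Popen(self._commands[name])
        return True

    def stop(self, name: str) -> bool:
        if not self._is_running(name):
            return False
        proc = self._procs[name]
        proc.terminate()  # polite request first
        try:
            proc.wait(timeout=self._stop_timeout)
        except subprocess.TimeoutExpired:
            proc.kill()   # escalate once the grace period is over
            proc.wait()
        return True

    def status(self) -> Dict[str, str]:
        return {
            name: "running" if self._is_running(name) else "stopped"
            for name in self._commands
        }
```

Because each child is tracked by name, stopping or restarting the dashboard never touches the training process, which is the independent lifecycle the design asks for.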
### 2. Isolated Dashboard

**Purpose**: Provides a completely isolated dashboard that doesn't interfere with training

**Interface**:
```python
from typing import Any, Dict


class IsolatedDashboard:
    # Interface stubs only; method bodies are intentionally elided.
    def __init__(self, config: Dict[str, Any]): ...
    def start_server(self, host: str, port: int) -> None: ...
    def stop_server(self) -> None: ...
    def update_data_from_files(self) -> None: ...
    def get_training_status(self) -> Dict[str, Any]: ...
```

**Implementation Details**:
- Runs in a separate process with its own event loop
- Reads data from shared files instead of direct memory access
- Uses file-based communication for training status
- Implements proper async/await patterns for all operations

### 3. Isolated Training Process

**Purpose**: Runs training completely isolated from UI components

**Interface**:
```python
from typing import Any, Dict


class IsolatedTrainingProcess:
    # Interface stubs only; method bodies are intentionally elided.
    def __init__(self, config: Dict[str, Any]): ...
    def start_training(self) -> None: ...
    def stop_training(self) -> None: ...
    def get_training_metrics(self) -> Dict[str, Any]: ...
    def save_status_to_file(self) -> None: ...
```

**Implementation Details**:
- No UI dependencies or imports
- Writes status and metrics to shared files
- Implements proper resource cleanup
- Uses separate logging configuration

### 4. Shared Data Manager

**Purpose**: Manages data sharing between processes through files

**Interface**:
```python
from typing import Any, Dict


class SharedDataManager:
    # Interface stubs only; method bodies are intentionally elided.
    def write_training_status(self, status: Dict[str, Any]) -> None: ...
    def read_training_status(self) -> Dict[str, Any]: ...
    def write_market_data(self, data: Dict[str, Any]) -> None: ...
    def read_market_data(self) -> Dict[str, Any]: ...
    def write_model_metrics(self, metrics: Dict[str, Any]) -> None: ...
    def read_model_metrics(self) -> Dict[str, Any]: ...
```

**Implementation Details**:
- Uses JSON files for structured data
- Implements file locking to prevent corruption
- Provides atomic write operations
- Includes data validation and error handling

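
One common way to get the atomic-write behaviour listed above is to write to a temporary file in the same directory and then rename it over the target. The sketch below shows that pattern together with a tolerant reader. The helper names `write_json_atomic` and `read_json_or_default` are assumptions, and the file locking mentioned above is not shown here.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def write_json_atomic(path: Path, payload: Dict[str, Any]) -> None:
    """Write payload so a reader sees either the old file or the new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives next to the target so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # swap the new file in as a single step
    except BaseException:
        # Do not leave a half-written temp file behind on failure.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def read_json_or_default(path: Path, default: Dict[str, Any]) -> Dict[str, Any]:
    """Return the stored dict, or a copy of default if it is missing or invalid."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            data = json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return dict(default)
    return data if isinstance(data, dict) else dict(default)
```

With this shape the dashboard can poll a status file while training rewrites it, and a missing or malformed file degrades to the default instead of raising in the UI.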
### 5. Resource Manager

**Purpose**: Manages resource allocation and prevents conflicts

**Interface**:
```python
from typing import Dict


class ResourceManager:
    # Interface stubs only; method bodies are intentionally elided.
    def allocate_gpu_resources(self, process_name: str) -> bool: ...
    def release_gpu_resources(self, process_name: str) -> None: ...
    def check_memory_usage(self) -> Dict[str, float]: ...
    def enforce_resource_limits(self) -> None: ...
```

**Implementation Details**:
- Monitors GPU memory usage per process
- Implements resource quotas and limits
- Provides resource conflict detection
- Includes automatic resource cleanup

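
The quota and conflict-detection bullets could be modelled, in their simplest form, as a ledger that refuses any reservation exceeding the remaining budget. The sketch below is only a bookkeeping illustration under that assumption: `GpuQuotaLedger`, the megabyte unit, and the explicit `request_mb` argument are hypothetical, it does not query any real GPU, and sharing the ledger across processes would need persistence that is not shown.

```python
from typing import Dict


class GpuQuotaLedger:
    """Hypothetical in-memory ledger of GPU memory reservations, in MB."""

    def __init__(self, total_mb: int):
        self._total_mb = total_mb
        self._reserved: Dict[str, int] = {}  # process name -> reserved MB

    def free_mb(self) -> int:
        return self._total_mb - sum(self._reserved.values())

    def allocate(self, process_name: str, request_mb: int) -> bool:
        # A process holds at most one reservation; a second request is a conflict.
        if process_name in self._reserved:
            return False
        # Reject requests that are non-positive or exceed the remaining budget.
        if request_mb <= 0 or request_mb > self.free_mb():
            return False
        self._reserved[process_name] = request_mb
        return True

    def release(self, process_name: str) -> None:
        # Releasing an unknown process is a no-op, which keeps cleanup idempotent.
        self._reserved.pop(process_name, None)
```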
### 6. Async Handler

**Purpose**: Properly handles all async operations in the dashboard

**Interface**:
```python
import asyncio
from typing import Any, Coroutine, Dict


class AsyncHandler:
    # Interface stubs only; method bodies are intentionally elided.
    def __init__(self, loop: asyncio.AbstractEventLoop): ...
    async def handle_orchestrator_connection(self) -> None: ...
    async def handle_cob_integration(self) -> None: ...
    async def handle_trading_decisions(self, decision: Dict) -> None: ...
    def run_async_safely(self, coro: Coroutine) -> Any: ...
```

**Implementation Details**:
- Manages single event loop per process
- Provides proper exception handling for async operations
- Implements timeout handling for long-running operations
- Includes async context management

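
`run_async_safely` is the bridge between synchronous dashboard callbacks and coroutines. One possible shape for it, sketched here under assumptions, is a single long-lived event loop running in a background thread, with work submitted through `asyncio.run_coroutine_threadsafe` and bounded by a timeout. The class name `LoopRunner`, the `timeout` parameter, and the local `AsyncOperationError` stand-in are illustrative choices, not the documented interface.

```python
import asyncio
import concurrent.futures
import threading
from typing import Any, Coroutine


class AsyncOperationError(Exception):
    """Local stand-in for the AsyncOperationError in the exception hierarchy."""


class LoopRunner:
    """Owns one event loop in a daemon thread and runs coroutines on it."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run_async_safely(
        self, coro: Coroutine[Any, Any, Any], timeout: float = 5.0
    ) -> Any:
        # Hand the coroutine to the loop thread and block this thread on the result.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError as exc:
            future.cancel()  # ask the loop to cancel the still-running task
            raise AsyncOperationError(f"timed out after {timeout}s") from exc
        except Exception as exc:
            # Wrap the original error but keep it reachable via __cause__.
            raise AsyncOperationError(str(exc)) from exc

    def close(self) -> None:
        # Stop the loop from its own thread, then release its resources.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()
```

Keeping the loop in one dedicated thread avoids the "event loop is already running" class of errors that arise when synchronous callbacks try to start their own loops.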
## Data Models

### Process Status Model
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ProcessStatus:
    name: str
    pid: int
    status: str  # 'running', 'stopped', 'error'
    start_time: datetime
    last_heartbeat: datetime
    memory_usage: float
    cpu_usage: float
    error_message: Optional[str] = None
```

### Training Status Model
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class TrainingStatus:
    is_running: bool
    current_epoch: int
    total_epochs: int
    loss: float
    accuracy: float
    last_update: datetime
    model_path: str
    error_message: Optional[str] = None
```

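
Because status travels between processes as JSON files, the `datetime` field needs an explicit text encoding. The sketch below shows one way to round-trip `TrainingStatus` through JSON using ISO 8601 strings. The helper names `status_to_json` and `status_from_json` are assumptions, and the dataclass is repeated only so the example is self-contained.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import Optional


@dataclass
class TrainingStatus:
    is_running: bool
    current_epoch: int
    total_epochs: int
    loss: float
    accuracy: float
    last_update: datetime
    model_path: str
    error_message: Optional[str] = None


def status_to_json(status: TrainingStatus) -> str:
    payload = asdict(status)
    # json cannot encode datetime directly, so store it as an ISO 8601 string.
    payload["last_update"] = status.last_update.isoformat()
    return json.dumps(payload)


def status_from_json(text: str) -> TrainingStatus:
    payload = json.loads(text)
    # Reverse the encoding applied in status_to_json.
    payload["last_update"] = datetime.fromisoformat(payload["last_update"])
    return TrainingStatus(**payload)
```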
### Dashboard State Model
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict


@dataclass
class DashboardState:
    is_connected: bool
    last_data_update: datetime
    active_connections: int
    error_count: int
    performance_metrics: Dict[str, float]
```

## Error Handling

### Exception Hierarchy
```python
class UIStabilityError(Exception):
    """Base exception for UI stability issues"""
    pass


class ProcessCommunicationError(UIStabilityError):
    """Error in inter-process communication"""
    pass


class AsyncOperationError(UIStabilityError):
    """Error in async operation handling"""
    pass


class ResourceConflictError(UIStabilityError):
    """Error due to resource conflicts"""
    pass
```

### Error Recovery Strategies

1. **Automatic Retry**: For transient network and file I/O errors
2. **Graceful Degradation**: Fall back to basic functionality when components fail
3. **Process Restart**: Automatically restart failed processes
4. **Circuit Breaker**: Temporarily disable failing components
5. **Rollback**: Revert to the last known good state

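
Two of the strategies above, automatic retry and the circuit breaker, are small enough to sketch directly. The version below is a minimal illustration under assumptions: the names `retry` and `CircuitBreaker`, the exponential backoff schedule, and the thresholds are all hypothetical defaults rather than values taken from this design. The `sleep` and `clock` parameters exist only so the behaviour can be exercised without real waiting.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying transient errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
    raise AssertionError("unreachable")


class CircuitBreaker:
    """Blocks calls to a failing component until a cool-down has passed."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True  # closed: calls pass through
        # Open: allow a trial call only after the cool-down has elapsed.
        return self._clock() - self._opened_at >= self._reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()  # open (or re-open) the breaker
```

A caller would check `allow()` before touching the component, then report the outcome with `record_success()` or `record_failure()`; `retry` wraps individual file or network operations inside that guard.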
### Error Monitoring

- Centralized error logging with structured format
- Real-time error rate monitoring
- Automatic alerting for critical errors
- Error trend analysis and reporting

## Testing Strategy

### Unit Tests
- Test each component in isolation
- Mock external dependencies
- Verify error handling paths
- Test async operation handling

### Integration Tests
- Test inter-process communication
- Verify resource sharing mechanisms
- Test process lifecycle management
- Validate error recovery scenarios

### System Tests
- End-to-end stability testing
- Load testing with concurrent processes
- Failure injection testing
- Performance regression testing

### Monitoring Tests
- Health check endpoint testing
- Metrics collection validation
- Alert system testing
- Dashboard functionality testing

## Performance Considerations

### Resource Optimization
- Minimize memory footprint of each process
- Optimize file I/O operations for data sharing
- Implement efficient data serialization
- Use connection pooling for external services

### Scalability
- Support multiple dashboard instances
- Handle increased data volume gracefully
- Implement efficient caching strategies
- Optimize for high-frequency updates

### Monitoring
- Real-time performance metrics collection
- Resource usage tracking per process
- Response time monitoring
- Throughput measurement

## Security Considerations

### Process Isolation
- Separate user contexts for processes
- Limited file system access permissions
- Network access restrictions
- Resource usage limits

### Data Protection
- Secure file sharing mechanisms
- Data validation and sanitization
- Access control for shared resources
- Audit logging for sensitive operations

### Communication Security
- Encrypted inter-process communication
- Authentication for API endpoints
- Input validation for all interfaces
- Rate limiting for external requests

## Deployment Strategy

### Development Environment
- Local process management scripts
- Development-specific configuration
- Enhanced logging and debugging
- Hot-reload capabilities

### Production Environment
- Systemd service management
- Production configuration templates
- Log rotation and archiving
- Monitoring and alerting setup

### Migration Plan
1. Deploy new process management components
2. Update configuration files
3. Test process isolation functionality
4. Gradually migrate existing deployments
5. Monitor stability improvements
6. Remove legacy components