gogo2/.kiro/specs/ui-stability-fix/design.md
2025-07-22 15:44:59 +03:00

Design Document

Overview

The UI Stability Fix resolves critical stability issues between the dashboard UI and the training processes. The design rests on four measures: complete process isolation, correct async/await handling, resource conflict resolution, and robust error handling. Together they let the dashboard run independently without affecting the stability of the training system.

Architecture

High-Level Architecture

graph TB
    subgraph "Training Process"
        TP[Training Process]
        TM[Training Models]
        TD[Training Data]
        TL[Training Logs]
    end
    
    subgraph "Dashboard Process"
        DP[Dashboard Process]
        DU[Dashboard UI]
        DC[Dashboard Cache]
        DL[Dashboard Logs]
    end
    
    subgraph "Shared Resources"
        SF[Shared Files]
        SC[Shared Config]
        SM[Shared Models]
        SD[Shared Data]
    end
    
    TP --> SF
    DP --> SF
    TP --> SC
    DP --> SC
    TP --> SM
    DP --> SM
    TP --> SD
    DP --> SD
    
    TP -.->|No Direct Connection| DP

Process Isolation Design

The system will implement complete process isolation using:

  1. Separate Python Processes: Dashboard and training run as independent processes
  2. Inter-Process Communication: File-based communication for status and data sharing
  3. Resource Partitioning: Separate resource allocation for each process
  4. Independent Lifecycle Management: Each process can start, stop, and restart independently

Async/Await Error Resolution

The design addresses async issues through:

  1. Proper Event Loop Management: One event loop per process, created at startup and closed at shutdown
  2. Async Context Isolation: Separate async contexts for different components
  3. Coroutine Handling: Every coroutine is awaited; none are created and left unawaited
  4. Exception Propagation: Exceptions raised in async code are caught and propagated to the caller rather than silently dropped

Components and Interfaces

1. Process Manager

Purpose: Manages the lifecycle of both dashboard and training processes

Interface:

from typing import Dict

class ProcessManager:
    def start_training_process(self) -> bool: ...
    def start_dashboard_process(self, port: int = 8050) -> bool: ...
    def stop_training_process(self) -> bool: ...
    def stop_dashboard_process(self) -> bool: ...
    def get_process_status(self) -> Dict[str, str]: ...
    def restart_process(self, process_name: str) -> bool: ...

Implementation Details:

  • Uses subprocess.Popen for process creation
  • Monitors process health with periodic checks
  • Handles process output logging and error capture
  • Implements graceful shutdown with timeout handling
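
The lifecycle steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: the helper names (start_process, is_alive, stop_process) and the log-file argument are hypothetical, and only standard subprocess calls are used.

```python
import subprocess
import sys
from typing import List


def start_process(args: List[str], log_path: str) -> subprocess.Popen:
    """Start a child process and send its stdout/stderr to a log file."""
    # The child inherits the file descriptor, so the parent can close its copy.
    with open(log_path, "ab") as log:
        return subprocess.Popen(args, stdout=log, stderr=subprocess.STDOUT)


def is_alive(proc: subprocess.Popen) -> bool:
    """Health check: poll() returns None while the child is still running."""
    return proc.poll() is None


def stop_process(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Request termination, then force-kill if the child does not exit in time."""
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
    return proc.returncode
```

Writing child output to a file rather than a pipe avoids the case where a chatty child blocks because nobody is draining the pipe. A periodic health check would call is_alive on a timer and restart the process when it returns False.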

2. Isolated Dashboard

Purpose: Provides a completely isolated dashboard that doesn't interfere with training

Interface:

from typing import Any, Dict

class IsolatedDashboard:
    def __init__(self, config: Dict[str, Any]): ...
    def start_server(self, host: str, port: int) -> None: ...
    def stop_server(self) -> None: ...
    def update_data_from_files(self) -> None: ...
    def get_training_status(self) -> Dict[str, Any]: ...

Implementation Details:

  • Runs in separate process with own event loop
  • Reads data from shared files instead of direct memory access
  • Uses file-based communication for training status
  • Implements proper async/await patterns for all operations
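
One way the dashboard might read training status from a shared file without letting a missing or half-written file crash the UI is sketched below. The function name, the returned dictionary shape, and the 30-second staleness threshold are assumptions made for illustration; the design does not specify them.

```python
import json
import time
from pathlib import Path
from typing import Any, Dict, Optional

# Hypothetical threshold: a status file older than this is reported as stale.
STALE_AFTER_SECONDS = 30.0


def load_training_status(path: Path, now: Optional[float] = None) -> Dict[str, Any]:
    """Read the status file and report problems as data instead of raising."""
    now = time.time() if now is None else now
    try:
        modified_at = path.stat().st_mtime
        status = json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return {"available": False, "reason": "status file not found"}
    except (OSError, ValueError) as exc:
        # ValueError covers both invalid JSON and undecodable bytes.
        return {"available": False, "reason": f"unreadable status file: {exc}"}
    return {
        "available": True,
        "stale": (now - modified_at) > STALE_AFTER_SECONDS,
        "status": status,
    }
```

A stale but readable file lets the dashboard keep showing the last known values while flagging that the training process may have stopped writing.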

3. Isolated Training Process

Purpose: Runs training completely isolated from UI components

Interface:

from typing import Any, Dict

class IsolatedTrainingProcess:
    def __init__(self, config: Dict[str, Any]): ...
    def start_training(self) -> None: ...
    def stop_training(self) -> None: ...
    def get_training_metrics(self) -> Dict[str, Any]: ...
    def save_status_to_file(self) -> None: ...

Implementation Details:

  • No UI dependencies or imports
  • Writes status and metrics to shared files
  • Implements proper resource cleanup
  • Uses separate logging configuration

4. Shared Data Manager

Purpose: Manages data sharing between processes through files

Interface:

from typing import Any, Dict

class SharedDataManager:
    def write_training_status(self, status: Dict[str, Any]) -> None: ...
    def read_training_status(self) -> Dict[str, Any]: ...
    def write_market_data(self, data: Dict[str, Any]) -> None: ...
    def read_market_data(self) -> Dict[str, Any]: ...
    def write_model_metrics(self, metrics: Dict[str, Any]) -> None: ...
    def read_model_metrics(self) -> Dict[str, Any]: ...

Implementation Details:

  • Uses JSON files for structured data
  • Implements file locking to prevent corruption
  • Provides atomic write operations
  • Includes data validation and error handling
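
The atomic write mentioned above could be done with a temporary file followed by a rename, as in this sketch. It is one possible approach, not necessarily the one the implementation uses; write_json_atomic is a hypothetical helper name, and the sketch relies on rename atomicity rather than the file locking listed above.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def write_json_atomic(path: Path, payload: Dict[str, Any]) -> None:
    """Write JSON so that readers see either the old file or the new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the file in a single step, overwriting any old version.
        os.replace(tmp_name, str(path))
    except BaseException:
        # Clean up the partial temp file before re-raising.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Because the reader never opens a partially written file, the dashboard does not need to retry on truncated JSON. Note that mkstemp creates files readable only by their owner, so permissions would need adjusting if the two processes run under different users.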

5. Resource Manager

Purpose: Manages resource allocation and prevents conflicts

Interface:

from typing import Dict

class ResourceManager:
    def allocate_gpu_resources(self, process_name: str) -> bool: ...
    def release_gpu_resources(self, process_name: str) -> None: ...
    def check_memory_usage(self) -> Dict[str, float]: ...
    def enforce_resource_limits(self) -> None: ...

Implementation Details:

  • Monitors GPU memory usage per process
  • Implements resource quotas and limits
  • Provides resource conflict detection
  • Includes automatic resource cleanup

6. Async Handler

Purpose: Properly handles all async operations in the dashboard

Interface:

import asyncio
from typing import Any, Coroutine, Dict

class AsyncHandler:
    def __init__(self, loop: asyncio.AbstractEventLoop): ...
    async def handle_orchestrator_connection(self) -> None: ...
    async def handle_cob_integration(self) -> None: ...
    async def handle_trading_decisions(self, decision: Dict) -> None: ...
    def run_async_safely(self, coro: Coroutine) -> Any: ...

Implementation Details:

  • Manages single event loop per process
  • Provides proper exception handling for async operations
  • Implements timeout handling for long-running operations
  • Includes async context management
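
A sketch of how run_async_safely might work is shown below, assuming the event loop runs in a dedicated background thread and synchronous callers (for example UI callbacks) submit coroutines to it. The LoopRunner class, its close method, and the default timeout are hypothetical; the exception class is a local stand-in for the AsyncOperationError defined under Error Handling.

```python
import asyncio
import concurrent.futures
import threading
from typing import Any, Coroutine


class AsyncOperationError(Exception):
    """Local stand-in for the document's AsyncOperationError."""


class LoopRunner:
    """Owns one event loop in a background thread for the whole process."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run_async_safely(self, coro: Coroutine[Any, Any, Any], timeout: float = 5.0) -> Any:
        """Run a coroutine on the shared loop from synchronous code."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError as exc:
            # Cancel the task so it does not keep running after the caller gave up.
            future.cancel()
            raise AsyncOperationError(f"timed out after {timeout}s") from exc
        except Exception as exc:
            raise AsyncOperationError(str(exc)) from exc

    def close(self) -> None:
        """Stop the loop, wait for its thread, and release loop resources."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()
```

Routing every coroutine through a single loop avoids the "no running event loop" and "attached to a different loop" errors that appear when callbacks create loops of their own.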

Data Models

Process Status Model

from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional

@dataclass
class ProcessStatus:
    name: str
    pid: int
    status: str  # 'running', 'stopped', 'error'
    start_time: datetime
    last_heartbeat: datetime
    memory_usage: float
    cpu_usage: float
    error_message: Optional[str] = None

Training Status Model

@dataclass
class TrainingStatus:
    is_running: bool
    current_epoch: int
    total_epochs: int
    loss: float
    accuracy: float
    last_update: datetime
    model_path: str
    error_message: Optional[str] = None
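
Since status is shared through JSON files, the datetime field needs an explicit text form. The sketch below shows one way to round-trip TrainingStatus through JSON using ISO 8601 strings; the helper names status_to_json and status_from_json are hypothetical and not part of the design.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TrainingStatus:
    is_running: bool
    current_epoch: int
    total_epochs: int
    loss: float
    accuracy: float
    last_update: datetime
    model_path: str
    error_message: Optional[str] = None


def status_to_json(status: TrainingStatus) -> str:
    """Serialize the status, storing the timestamp as an ISO 8601 string."""
    payload = asdict(status)
    payload["last_update"] = status.last_update.isoformat()
    return json.dumps(payload)


def status_from_json(text: str) -> TrainingStatus:
    """Rebuild the status, parsing the timestamp back into a datetime."""
    payload = json.loads(text)
    payload["last_update"] = datetime.fromisoformat(payload["last_update"])
    return TrainingStatus(**payload)
```

Using timezone-aware timestamps (for example datetime.now(timezone.utc)) keeps heartbeat comparisons unambiguous when the two processes are configured differently.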

Dashboard State Model

@dataclass
class DashboardState:
    is_connected: bool
    last_data_update: datetime
    active_connections: int
    error_count: int
    performance_metrics: Dict[str, float]

Error Handling

Exception Hierarchy

class UIStabilityError(Exception):
    """Base exception for UI stability issues"""
    pass

class ProcessCommunicationError(UIStabilityError):
    """Error in inter-process communication"""
    pass

class AsyncOperationError(UIStabilityError):
    """Error in async operation handling"""
    pass

class ResourceConflictError(UIStabilityError):
    """Error due to resource conflicts"""
    pass

Error Recovery Strategies

  1. Automatic Retry: For transient network and file I/O errors
  2. Graceful Degradation: Fallback to basic functionality when components fail
  3. Process Restart: Automatic restart of failed processes
  4. Circuit Breaker: Temporarily disable failing components
  5. Rollback: Revert to last known good state
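
Two of these strategies, automatic retry and the circuit breaker, can be sketched briefly. The function and class names, the thresholds, and the backoff schedule are illustrative assumptions rather than values taken from the design.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
) -> T:
    """Retry a transient failure with exponential backoff between attempts."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            time.sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")


class CircuitBreaker:
    """Blocks calls to a failing component until a cool-down has passed."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call may proceed (closed, or cool-down elapsed)."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self._reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._failure_threshold:
            self._opened_at = self._clock()
```

Retry suits short-lived faults such as a file that is briefly unavailable, while the breaker keeps a persistently failing component from being called on every dashboard refresh.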

Error Monitoring

  • Centralized error logging with structured format
  • Real-time error rate monitoring
  • Automatic alerting for critical errors
  • Error trend analysis and reporting

Testing Strategy

Unit Tests

  • Test each component in isolation
  • Mock external dependencies
  • Verify error handling paths
  • Test async operation handling

Integration Tests

  • Test inter-process communication
  • Verify resource sharing mechanisms
  • Test process lifecycle management
  • Validate error recovery scenarios

System Tests

  • End-to-end stability testing
  • Load testing with concurrent processes
  • Failure injection testing
  • Performance regression testing

Monitoring Tests

  • Health check endpoint testing
  • Metrics collection validation
  • Alert system testing
  • Dashboard functionality testing

Performance Considerations

Resource Optimization

  • Minimize memory footprint of each process
  • Optimize file I/O operations for data sharing
  • Implement efficient data serialization
  • Use connection pooling for external services

Scalability

  • Support multiple dashboard instances
  • Handle increased data volume gracefully
  • Implement efficient caching strategies
  • Optimize for high-frequency updates

Monitoring

  • Real-time performance metrics collection
  • Resource usage tracking per process
  • Response time monitoring
  • Throughput measurement

Security Considerations

Process Isolation

  • Separate user contexts for processes
  • Limited file system access permissions
  • Network access restrictions
  • Resource usage limits

Data Protection

  • Secure file sharing mechanisms
  • Data validation and sanitization
  • Access control for shared resources
  • Audit logging for sensitive operations

Communication Security

  • Encrypted inter-process communication
  • Authentication for API endpoints
  • Input validation for all interfaces
  • Rate limiting for external requests

Deployment Strategy

Development Environment

  • Local process management scripts
  • Development-specific configuration
  • Enhanced logging and debugging
  • Hot-reload capabilities

Production Environment

  • Systemd service management
  • Production configuration templates
  • Log rotation and archiving
  • Monitoring and alerting setup

Migration Plan

  1. Deploy new process management components
  2. Update configuration files
  3. Test process isolation functionality
  4. Gradually migrate existing deployments
  5. Monitor stability improvements
  6. Remove legacy components