fix model mappings, dash updates, trading

2025-07-22 15:44:59 +03:00
parent 3e35b9cddb
commit 1a54fb1d56
32 changed files with 6168 additions and 857 deletions
--- a/.kiro/specs/multi-modal-trading-system/design.md
+++ b/.kiro/specs/multi-modal-trading-system/design.md
@@ -430,6 +430,43 @@ The implementation will follow a phased approach:
   - Fix bugs and optimize performance
   - Deploy to production

+## Monitoring and Visualization
+
+### TensorBoard Integration (Future Enhancement)
+
+A TensorBoard integration has been designed to visualize and monitor training in detail:
+
+#### Features
+- **Training Metrics Visualization**: Real-time tracking of model losses, rewards, and performance metrics
+- **Feature Distribution Analysis**: Histograms and statistics of input features to validate data quality
+- **State Quality Monitoring**: Success-rate tracking for comprehensive state building (13,400 features)
+- **Reward Component Analysis**: Detailed breakdown of reward calculations including PnL, confidence, volatility, and order flow
+- **Model Performance Comparison**: Side-by-side comparison of CNN, RL, and orchestrator performance
+
+#### Implementation Status
+- **Completed**: TensorBoardLogger utility class with comprehensive logging methods
+- **Completed**: Integration points in enhanced_rl_training_integration.py
+- **Completed**: Enhanced run_tensorboard.py with improved visualization options
+- **Status**: Ready for deployment when system stability is achieved
+
+#### Usage
+```bash
+# Start TensorBoard dashboard
+python run_tensorboard.py
+
+# Access at http://localhost:6006
+# View training metrics, feature distributions, and model performance
+```
+
+#### Benefits
+- Real-time validation of training process
+- Early detection of training issues
+- Feature importance analysis
+- Model performance comparison
+- Historical training progress tracking
+
+**Note**: TensorBoard integration is currently deprioritized in favor of system stability and core model improvements. It will be activated once the core training system is stable and performing optimally.
+
 ## Conclusion

 This design document outlines the architecture, components, data flow, and implementation details for the Multi-Modal Trading System. The system is designed to be modular, extensible, and robust, with a focus on performance, reliability, and user experience.
--- a/.kiro/specs/ui-stability-fix/design.md
+++ b/.kiro/specs/ui-stability-fix/design.md
@@ -0,0 +1,350 @@
+# Design Document
+
+## Overview
+
+The UI Stability Fix implements a comprehensive solution to resolve critical stability issues between the dashboard UI and training processes. The design focuses on complete process isolation, proper async/await handling, resource conflict resolution, and robust error handling. The solution ensures that the dashboard can operate independently without affecting training system stability.
+
+## Architecture
+
+### High-Level Architecture
+
+```mermaid
+graph TB
+    subgraph "Training Process"
+        TP[Training Process]
+        TM[Training Models]
+        TD[Training Data]
+        TL[Training Logs]
+    end
+    
+    subgraph "Dashboard Process"
+        DP[Dashboard Process]
+        DU[Dashboard UI]
+        DC[Dashboard Cache]
+        DL[Dashboard Logs]
+    end
+    
+    subgraph "Shared Resources"
+        SF[Shared Files]
+        SC[Shared Config]
+        SM[Shared Models]
+        SD[Shared Data]
+    end
+    
+    TP --> SF
+    DP --> SF
+    TP --> SC
+    DP --> SC
+    TP --> SM
+    DP --> SM
+    TP --> SD
+    DP --> SD
+    
+    TP -.->|No Direct Connection| DP
+```
+
+### Process Isolation Design
+
+The system will implement complete process isolation using:
+
+1. **Separate Python Processes**: Dashboard and training run as independent processes
+2. **Inter-Process Communication**: File-based communication for status and data sharing
+3. **Resource Partitioning**: Separate resource allocation for each process
+4. **Independent Lifecycle Management**: Each process can start, stop, and restart independently
+
+### Async/Await Error Resolution
+
+The design addresses async issues through:
+
+1. **Proper Event Loop Management**: Single event loop per process with proper lifecycle
+2. **Async Context Isolation**: Separate async contexts for different components
+3. **Coroutine Handling**: Proper awaiting of all async operations
+4. **Exception Propagation**: Proper async exception handling and propagation
+
+## Components and Interfaces
+
+### 1. Process Manager
+
+**Purpose**: Manages the lifecycle of both dashboard and training processes
+
+**Interface**:
+```python
+from typing import Dict
+
+
+class ProcessManager:
+    # Interface only: method bodies are left as stubs.
+    def start_training_process(self) -> bool: ...
+    def start_dashboard_process(self, port: int = 8050) -> bool: ...
+    def stop_training_process(self) -> bool: ...
+    def stop_dashboard_process(self) -> bool: ...
+    def get_process_status(self) -> Dict[str, str]: ...
+    def restart_process(self, process_name: str) -> bool: ...
+```
+
+**Implementation Details**:
+- Uses subprocess.Popen for process creation
+- Monitors process health with periodic checks
+- Handles process output logging and error capture
+- Implements graceful shutdown with timeout handling
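A minimal sketch of the graceful-shutdown and health-check behavior described above, using only `subprocess` from the standard library. The helper names `stop_gracefully` and `is_alive` are hypothetical and not part of the `ProcessManager` interface; the 5-second default timeout is an assumption.

```python
import subprocess
import sys
from typing import Optional


def is_alive(proc: subprocess.Popen) -> bool:
    """Health check: poll() returns None while the child is still running."""
    return proc.poll() is None


def stop_gracefully(proc: subprocess.Popen, timeout: float = 5.0) -> Optional[int]:
    """Ask the child to exit, then force-kill it if the timeout expires."""
    if not is_alive(proc):
        return proc.returncode
    proc.terminate()  # SIGTERM on POSIX, TerminateProcess on Windows
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # last resort once the grace period is over
        return proc.wait()
```

A launcher could start a child with `subprocess.Popen([sys.executable, "some_script.py"])` (the script name is a placeholder), call `is_alive` on a timer, and call `stop_gracefully` on shutdown.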
+
+### 2. Isolated Dashboard
+
+**Purpose**: Provides a completely isolated dashboard that doesn't interfere with training
+
+**Interface**:
+```python
+from typing import Any, Dict
+
+
+class IsolatedDashboard:
+    # Interface only: method bodies are left as stubs.
+    def __init__(self, config: Dict[str, Any]) -> None: ...
+    def start_server(self, host: str, port: int) -> None: ...
+    def stop_server(self) -> None: ...
+    def update_data_from_files(self) -> None: ...
+    def get_training_status(self) -> Dict[str, Any]: ...
+```
+
+**Implementation Details**:
+- Runs in separate process with own event loop
+- Reads data from shared files instead of direct memory access
+- Uses file-based communication for training status
+- Implements proper async/await patterns for all operations
+
+### 3. Isolated Training Process
+
+**Purpose**: Runs training completely isolated from UI components
+
+**Interface**:
+```python
+from typing import Any, Dict
+
+
+class IsolatedTrainingProcess:
+    # Interface only: method bodies are left as stubs.
+    def __init__(self, config: Dict[str, Any]) -> None: ...
+    def start_training(self) -> None: ...
+    def stop_training(self) -> None: ...
+    def get_training_metrics(self) -> Dict[str, Any]: ...
+    def save_status_to_file(self) -> None: ...
+```
+
+**Implementation Details**:
+- No UI dependencies or imports
+- Writes status and metrics to shared files
+- Implements proper resource cleanup
+- Uses separate logging configuration
+
+### 4. Shared Data Manager
+
+**Purpose**: Manages data sharing between processes through files
+
+**Interface**:
+```python
+from typing import Any, Dict
+
+
+class SharedDataManager:
+    # Interface only: method bodies are left as stubs.
+    def write_training_status(self, status: Dict[str, Any]) -> None: ...
+    def read_training_status(self) -> Dict[str, Any]: ...
+    def write_market_data(self, data: Dict[str, Any]) -> None: ...
+    def read_market_data(self) -> Dict[str, Any]: ...
+    def write_model_metrics(self, metrics: Dict[str, Any]) -> None: ...
+    def read_model_metrics(self) -> Dict[str, Any]: ...
+```
+
+**Implementation Details**:
+- Uses JSON files for structured data
+- Implements file locking to prevent corruption
+- Provides atomic write operations
+- Includes data validation and error handling
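One way the atomic-write and validation points above might look, assuming a single writer per file. The function names are hypothetical. The sketch writes to a temporary file in the same directory and swaps it into place with `os.replace`, so a reader sees either the old file or the new one. Cross-process file locking is deliberately left out here; it would be needed if several processes wrote the same file.

```python
import json
import os
import tempfile
from typing import Any, Dict


def write_json_atomic(path: str, payload: Dict[str, Any]) -> None:
    """Write JSON so that readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic swap of the directory entry
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_json_or_default(path: str, default: Dict[str, Any]) -> Dict[str, Any]:
    """Return the parsed file, or the default when it is missing or invalid."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            data = json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return default
    return data if isinstance(data, dict) else default
```

Returning a default on a missing or malformed file lets the dashboard keep rendering while the training process has not yet written its first status.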
+
+### 5. Resource Manager
+
+**Purpose**: Manages resource allocation and prevents conflicts
+
+**Interface**:
+```python
+from typing import Dict
+
+
+class ResourceManager:
+    # Interface only: method bodies are left as stubs.
+    def allocate_gpu_resources(self, process_name: str) -> bool: ...
+    def release_gpu_resources(self, process_name: str) -> None: ...
+    def check_memory_usage(self) -> Dict[str, float]: ...
+    def enforce_resource_limits(self) -> None: ...
+```
+
+**Implementation Details**:
+- Monitors GPU memory usage per process
+- Implements resource quotas and limits
+- Provides resource conflict detection
+- Includes automatic resource cleanup
+
+### 6. Async Handler
+
+**Purpose**: Properly handles all async operations in the dashboard
+
+**Interface**:
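+```python
+import asyncio
+from typing import Any, Coroutine, Dict
+
+
+class AsyncHandler:
+    # Interface only: method bodies are left as stubs.
+    def __init__(self, loop: asyncio.AbstractEventLoop) -> None: ...
+    async def handle_orchestrator_connection(self) -> None: ...
+    async def handle_cob_integration(self) -> None: ...
+    async def handle_trading_decisions(self, decision: Dict[str, Any]) -> None: ...
+    def run_async_safely(self, coro: Coroutine[Any, Any, Any]) -> Any: ...
+```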
+
+**Implementation Details**:
+- Manages single event loop per process
+- Provides proper exception handling for async operations
+- Implements timeout handling for long-running operations
+- Includes async context management
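The single-loop and timeout points above could be sketched as follows. This is an illustration under assumptions, not the project's `AsyncHandler`: the class name `BackgroundLoop`, the dedicated daemon thread, and the 5-second default timeout are all hypothetical choices. Synchronous callers (for example, UI callbacks) submit coroutines to the one loop via `asyncio.run_coroutine_threadsafe` instead of creating loops of their own.

```python
import asyncio
import concurrent.futures
import threading
from typing import Any, Coroutine


class BackgroundLoop:
    """Owns exactly one event loop, running in a dedicated daemon thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run_async_safely(self, coro: Coroutine[Any, Any, Any], timeout: float = 5.0) -> Any:
        """Run a coroutine on the owned loop from synchronous code and wait for it."""
        if not asyncio.iscoroutine(coro):
            # Fail early with a clear message instead of an opaque asyncio error.
            raise TypeError(f"expected a coroutine object, got {type(coro).__name__}")
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            future.cancel()  # request cancellation of the task on the loop
            raise

    def close(self) -> None:
        """Stop the loop, wait for its thread, then release loop resources."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()
```

Because every coroutine is funneled through the same loop, code that previously called `asyncio.run` or `loop.run_until_complete` from several threads no longer competes over loop ownership.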
+
+## Data Models
+
+### Process Status Model
+```python
+from dataclasses import dataclass
+from datetime import datetime
+from typing import Optional
+
+
+@dataclass
+class ProcessStatus:
+    name: str
+    pid: int
+    status: str  # 'running', 'stopped', 'error'
+    start_time: datetime
+    last_heartbeat: datetime
+    memory_usage: float
+    cpu_usage: float
+    error_message: Optional[str] = None
+```
+
+### Training Status Model
+```python
+from dataclasses import dataclass
+from datetime import datetime
+from typing import Optional
+
+
+@dataclass
+class TrainingStatus:
+    is_running: bool
+    current_epoch: int
+    total_epochs: int
+    loss: float
+    accuracy: float
+    last_update: datetime
+    model_path: str
+    error_message: Optional[str] = None
+```
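Since these models are exchanged through JSON files, the `datetime` fields need an explicit text form. The sketch below repeats the `TrainingStatus` model so it runs on its own and shows one possible round trip using ISO 8601 strings. The helper names `status_to_json` and `status_from_json` are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import Optional


@dataclass
class TrainingStatus:
    is_running: bool
    current_epoch: int
    total_epochs: int
    loss: float
    accuracy: float
    last_update: datetime
    model_path: str
    error_message: Optional[str] = None


def status_to_json(status: TrainingStatus) -> str:
    """Serialize the status, encoding the timestamp as an ISO 8601 string."""
    payload = asdict(status)
    payload["last_update"] = status.last_update.isoformat()
    return json.dumps(payload)


def status_from_json(text: str) -> TrainingStatus:
    """Rebuild the status; raises KeyError or TypeError on malformed input."""
    payload = json.loads(text)
    payload["last_update"] = datetime.fromisoformat(payload["last_update"])
    return TrainingStatus(**payload)
```

Constructing the dataclass from the parsed dictionary doubles as a basic schema check: a missing or unexpected key fails loudly rather than producing a partial status.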
+
+### Dashboard State Model
+```python
+from dataclasses import dataclass
+from datetime import datetime
+from typing import Dict
+
+
+@dataclass
+class DashboardState:
+    is_connected: bool
+    last_data_update: datetime
+    active_connections: int
+    error_count: int
+    performance_metrics: Dict[str, float]
+```
+
+## Error Handling
+
+### Exception Hierarchy
+```python
+class UIStabilityError(Exception):
+    """Base exception for UI stability issues"""
+    pass
+
+class ProcessCommunicationError(UIStabilityError):
+    """Error in inter-process communication"""
+    pass
+
+class AsyncOperationError(UIStabilityError):
+    """Error in async operation handling"""
+    pass
+
+class ResourceConflictError(UIStabilityError):
+    """Error due to resource conflicts"""
+    pass
+```
+
+### Error Recovery Strategies
+
+1. **Automatic Retry**: For transient network and file I/O errors
+2. **Graceful Degradation**: Fallback to basic functionality when components fail
+3. **Process Restart**: Automatic restart of failed processes
+4. **Circuit Breaker**: Temporarily disable failing components
+5. **Rollback**: Revert to last known good state
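As an illustration of the first strategy, here is a small retry helper with exponential backoff. The name `retry_with_backoff`, its defaults, and the choice to retry only on `OSError` (which covers `ConnectionError` and most file I/O failures) are assumptions made for this sketch. The `sleep` parameter is injectable so the delays can be observed without actually waiting.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying transient failures with doubling delays."""
    attempt = 0
    while True:
        try:
            return operation()
        except retry_on:
            if attempt >= retries:
                raise  # retries exhausted: surface the last error to the caller
            # Delay doubles per attempt (0.5, 1.0, 2.0, ...) up to max_delay.
            sleep(min(max_delay, base_delay * (2 ** attempt)))
            attempt += 1
```

Errors outside `retry_on` propagate immediately, which keeps programming mistakes such as `TypeError` from being retried as if they were transient.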
+
+### Error Monitoring
+
+- Centralized error logging with structured format
+- Real-time error rate monitoring
+- Automatic alerting for critical errors
+- Error trend analysis and reporting
+
+## Testing Strategy
+
+### Unit Tests
+- Test each component in isolation
+- Mock external dependencies
+- Verify error handling paths
+- Test async operation handling
+
+### Integration Tests
+- Test inter-process communication
+- Verify resource sharing mechanisms
+- Test process lifecycle management
+- Validate error recovery scenarios
+
+### System Tests
+- End-to-end stability testing
+- Load testing with concurrent processes
+- Failure injection testing
+- Performance regression testing
+
+### Monitoring Tests
+- Health check endpoint testing
+- Metrics collection validation
+- Alert system testing
+- Dashboard functionality testing
+
+## Performance Considerations
+
+### Resource Optimization
+- Minimize memory footprint of each process
+- Optimize file I/O operations for data sharing
+- Implement efficient data serialization
+- Use connection pooling for external services
+
+### Scalability
+- Support multiple dashboard instances
+- Handle increased data volume gracefully
+- Implement efficient caching strategies
+- Optimize for high-frequency updates
+
+### Monitoring
+- Real-time performance metrics collection
+- Resource usage tracking per process
+- Response time monitoring
+- Throughput measurement
+
+## Security Considerations
+
+### Process Isolation
+- Separate user contexts for processes
+- Limited file system access permissions
+- Network access restrictions
+- Resource usage limits
+
+### Data Protection
+- Secure file sharing mechanisms
+- Data validation and sanitization
+- Access control for shared resources
+- Audit logging for sensitive operations
+
+### Communication Security
+- Encrypted inter-process communication
+- Authentication for API endpoints
+- Input validation for all interfaces
+- Rate limiting for external requests
+
+## Deployment Strategy
+
+### Development Environment
+- Local process management scripts
+- Development-specific configuration
+- Enhanced logging and debugging
+- Hot-reload capabilities
+
+### Production Environment
+- Systemd service management
+- Production configuration templates
+- Log rotation and archiving
+- Monitoring and alerting setup
+
+### Migration Plan
+1. Deploy new process management components
+2. Update configuration files
+3. Test process isolation functionality
+4. Gradually migrate existing deployments
+5. Monitor stability improvements
+6. Remove legacy components
--- a/.kiro/specs/ui-stability-fix/requirements.md
+++ b/.kiro/specs/ui-stability-fix/requirements.md
@@ -0,0 +1,111 @@
+# Requirements Document
+
+## Introduction
+
+The UI Stability Fix addresses critical issues where loading the dashboard UI crashes the training process and causes unhandled exceptions. The system currently suffers from async/await handling problems, threading conflicts, resource contention, and improper separation of concerns between the UI and training processes. This fix will ensure the dashboard can run independently without affecting the training system's stability.
+
+## Requirements
+
+### Requirement 1: Async/Await Error Resolution
+
+**User Story:** As a developer, I want the dashboard to properly handle async operations, so that unhandled exceptions don't crash the entire system.
+
+#### Acceptance Criteria
+
+1. WHEN the dashboard initializes THEN it SHALL properly handle all async operations without throwing "An asyncio.Future, a coroutine or an awaitable is required" errors.
+2. WHEN connecting to the orchestrator THEN the system SHALL use proper async/await patterns for all coroutine calls.
+3. WHEN starting COB integration THEN the system SHALL properly manage event loops without conflicts.
+4. WHEN handling trading decisions THEN async callbacks SHALL be properly awaited and handled.
+5. WHEN the dashboard starts THEN it SHALL not create multiple conflicting event loops.
+6. WHEN async operations fail THEN the system SHALL handle exceptions gracefully without crashing.
+
+### Requirement 2: Process Isolation
+
+**User Story:** As a user, I want the dashboard and training processes to run independently, so that UI issues don't affect training stability.
+
+#### Acceptance Criteria
+
+1. WHEN the dashboard starts THEN it SHALL run in a completely separate process from the training system.
+2. WHEN the dashboard crashes THEN the training process SHALL continue running unaffected.
+3. WHEN the training process encounters issues THEN the dashboard SHALL remain functional.
+4. WHEN both processes are running THEN they SHALL communicate only through well-defined interfaces (files, APIs, or message queues).
+5. WHEN either process restarts THEN the other process SHALL continue operating normally.
+6. WHEN resources are accessed THEN there SHALL be no direct shared memory or threading conflicts between processes.
+
+### Requirement 3: Resource Contention Resolution
+
+**User Story:** As a system administrator, I want to eliminate resource conflicts between UI and training, so that both can operate efficiently without interference.
+
+#### Acceptance Criteria
+
+1. WHEN both dashboard and training are running THEN they SHALL not compete for the same GPU resources.
+2. WHEN accessing data files THEN proper file locking SHALL prevent corruption or access conflicts.
+3. WHEN using network resources THEN rate limiting SHALL prevent API conflicts between processes.
+4. WHEN accessing model files THEN proper synchronization SHALL prevent read/write conflicts.
+5. WHEN logging THEN separate log files SHALL be used to prevent write conflicts.
+6. WHEN using temporary files THEN separate directories SHALL be used for each process.
+
+### Requirement 4: Threading Safety
+
+**User Story:** As a developer, I want all threading operations to be safe and properly managed, so that race conditions and deadlocks don't occur.
+
+#### Acceptance Criteria
+
+1. WHEN the dashboard uses threads THEN all shared data SHALL be properly synchronized.
+2. WHEN background updates run THEN they SHALL not interfere with main UI thread operations.
+3. WHEN stopping threads THEN proper cleanup SHALL occur without hanging or deadlocks.
+4. WHEN accessing shared resources THEN proper locking mechanisms SHALL be used.
+5. WHEN threads encounter exceptions THEN they SHALL be handled without crashing the main process.
+6. WHEN the dashboard shuts down THEN all threads SHALL be properly terminated.
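Criteria 3, 5, and 6 above might be satisfied with a worker thread that is controlled by a `threading.Event`. The class `BackgroundUpdater` below is a hypothetical sketch: exceptions raised by the update callback are recorded instead of killing the thread, and `stop` wakes the worker immediately and joins it with a timeout so shutdown cannot hang indefinitely.

```python
import threading
from typing import Callable, List


class BackgroundUpdater:
    """Runs a callback periodically until asked to stop."""

    def __init__(self, update: Callable[[], None], interval: float = 1.0) -> None:
        self._update = update
        self._interval = interval
        self._stop_event = threading.Event()
        self.errors: List[BaseException] = []
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _run(self) -> None:
        while not self._stop_event.is_set():
            try:
                self._update()
            except Exception as exc:
                # Record the failure and keep the worker alive.
                self.errors.append(exc)
            # wait() returns early when stop() sets the event.
            self._stop_event.wait(self._interval)

    def stop(self, timeout: float = 2.0) -> bool:
        """Signal the worker and wait for it; True means it terminated in time."""
        self._stop_event.set()
        self._thread.join(timeout)
        return not self._thread.is_alive()
```

Using `Event.wait` in place of `time.sleep` is the design choice that makes shutdown prompt: the worker does not have to finish a full sleep interval before noticing the stop request.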
+
+### Requirement 5: Error Handling and Recovery
+
+**User Story:** As a user, I want the system to handle errors gracefully and recover automatically, so that temporary issues don't cause permanent failures.
+
+#### Acceptance Criteria
+
+1. WHEN unhandled exceptions occur THEN they SHALL be caught and logged without crashing the process.
+2. WHEN network connections fail THEN the system SHALL retry with exponential backoff.
+3. WHEN data sources are unavailable THEN fallback mechanisms SHALL provide basic functionality.
+4. WHEN memory issues occur THEN the system SHALL free resources and continue operating.
+5. WHEN critical errors happen THEN the system SHALL attempt automatic recovery.
+6. WHEN recovery fails THEN the system SHALL provide clear error messages and graceful degradation.
+
+### Requirement 6: Monitoring and Diagnostics
+
+**User Story:** As a developer, I want comprehensive monitoring and diagnostics, so that I can quickly identify and resolve stability issues.
+
+#### Acceptance Criteria
+
+1. WHEN the system runs THEN it SHALL provide real-time health monitoring for all components.
+2. WHEN errors occur THEN detailed diagnostic information SHALL be logged with timestamps and context.
+3. WHEN performance issues arise THEN resource usage metrics SHALL be available.
+4. WHEN processes communicate THEN message flow SHALL be traceable for debugging.
+5. WHEN the system starts THEN startup diagnostics SHALL verify all components are working correctly.
+6. WHEN stability issues occur THEN automated alerts SHALL notify administrators.
+
+### Requirement 7: Configuration and Control
+
+**User Story:** As a system administrator, I want flexible configuration options, so that I can optimize system behavior for different environments.
+
+#### Acceptance Criteria
+
+1. WHEN configuring the system THEN separate configuration files SHALL be used for dashboard and training processes.
+2. WHEN adjusting resource limits THEN configuration SHALL allow tuning memory, CPU, and GPU usage.
+3. WHEN setting update intervals THEN dashboard refresh rates SHALL be configurable.
+4. WHEN enabling features THEN individual components SHALL be independently controllable.
+5. WHEN debugging THEN log levels SHALL be adjustable without restarting processes.
+6. WHEN deploying THEN environment-specific configurations SHALL be supported.
+
+### Requirement 8: Backward Compatibility
+
+**User Story:** As a user, I want the stability fixes to maintain existing functionality, so that current workflows continue to work.
+
+#### Acceptance Criteria
+
+1. WHEN the fixes are applied THEN all existing dashboard features SHALL continue to work.
+2. WHEN training processes run THEN they SHALL maintain the same interfaces and outputs.
+3. WHEN data is accessed THEN existing data formats SHALL remain compatible.
+4. WHEN APIs are used THEN existing endpoints SHALL continue to function.
+5. WHEN configurations are loaded THEN existing config files SHALL remain valid.
+6. WHEN the system upgrades THEN migration paths SHALL preserve user settings and data.
--- a/.kiro/specs/ui-stability-fix/tasks.md
+++ b/.kiro/specs/ui-stability-fix/tasks.md
@@ -0,0 +1,79 @@
+# Implementation Plan
+
+- [x] 1. Create Shared Data Manager for inter-process communication
+  - Implement JSON-based file sharing with atomic writes and file locking
+  - Create data models for training status, dashboard state, and process status
+  - Add validation and error handling for all data operations
+  - _Requirements: 2.4, 3.4, 5.2_
+
+- [ ] 2. Implement Async Handler for proper async/await management
+  - Create centralized async operation handler with single event loop management
+  - Fix all async/await patterns in dashboard code
+  - Add proper exception handling for async operations with timeout support
+  - _Requirements: 1.1, 1.2, 1.3, 1.6_
+
+- [ ] 3. Create Isolated Training Process
+  - Extract training logic into standalone process without UI dependencies
+  - Implement file-based status reporting and metrics sharing
+  - Add proper resource cleanup and error handling
+  - _Requirements: 2.1, 2.2, 3.1, 4.5_
+
+- [ ] 4. Create Isolated Dashboard Process
+  - Refactor dashboard to run independently with file-based data access
+  - Remove direct memory sharing and threading conflicts with training
+  - Implement proper process lifecycle management
+  - _Requirements: 2.1, 2.3, 4.1, 4.2_
+
+- [ ] 5. Implement Process Manager
+  - Create process lifecycle management with subprocess handling
+  - Add process monitoring, health checks, and automatic restart capabilities
+  - Implement graceful shutdown with proper cleanup
+  - _Requirements: 2.5, 5.5, 6.1, 6.6_
+
+- [ ] 6. Create Resource Manager
+  - Implement GPU resource allocation and conflict prevention
+  - Add memory usage monitoring and resource limits enforcement
+  - Create separate logging and temporary file management
+  - _Requirements: 3.1, 3.2, 3.5, 3.6_
+
+- [ ] 7. Fix Threading Safety Issues
+  - Audit and fix all shared data access with proper synchronization
+  - Implement proper thread cleanup and exception handling
+  - Remove race conditions and deadlock potential
+  - _Requirements: 4.1, 4.2, 4.3, 4.6_
+
+- [ ] 8. Implement Error Handling and Recovery
+  - Add comprehensive exception handling with proper logging
+  - Create automatic retry mechanisms with exponential backoff
+  - Implement fallback mechanisms and graceful degradation
+  - _Requirements: 5.1, 5.2, 5.3, 5.6_
+
+- [ ] 9. Create System Launcher and Configuration
+  - Build unified launcher script for both processes
+  - Create separate configuration files for dashboard and training
+  - Add environment-specific configuration support
+  - _Requirements: 7.1, 7.2, 7.4, 7.6_
+
+- [ ] 10. Add Monitoring and Diagnostics
+  - Implement real-time health monitoring for all components
+  - Create detailed diagnostic logging with structured format
+  - Add performance metrics collection and resource usage tracking
+  - _Requirements: 6.1, 6.2, 6.3, 6.5_
+
+- [ ] 11. Create Integration Tests
+  - Write tests for inter-process communication and data sharing
+  - Test process lifecycle management and error recovery
+  - Validate resource conflict resolution and stability improvements
+  - _Requirements: 5.4, 5.5, 6.4, 8.1_
+
+- [ ] 12. Update Documentation and Migration Guide
+  - Document new architecture and deployment procedures
+  - Create migration guide from existing system
+  - Add troubleshooting guide for common stability issues
+  - _Requirements: 8.2, 8.5, 8.6_