# Strix Halo NPU Integration Guide ## Overview This guide explains how to use AMD's Strix Halo NPU (Neural Processing Unit) to accelerate your neural network trading models on Linux. The NPU provides significant performance improvements for inference workloads, especially for CNNs and transformers. ## Prerequisites - AMD Strix Halo processor - Linux kernel 6.11+ (Ubuntu 24.04 LTS recommended) - AMD Ryzen AI Software 1.5+ - ROCm 6.4.1+ (optional, for GPU acceleration) ## Quick Start ### 1. Install NPU Software Stack ```bash # Run the setup script chmod +x setup_strix_halo_npu.sh ./setup_strix_halo_npu.sh # Reboot to load NPU drivers sudo reboot ``` ### 2. Verify NPU Detection ```bash # Check NPU devices ls /dev/amdxdna* # Run NPU test python3 test_npu.py ``` ### 3. Test Model Integration ```bash # Run comprehensive integration tests python3 test_npu_integration.py ``` ## Architecture ### NPU Acceleration Stack ``` ┌─────────────────────────────────────┐ │ Trading Models │ │ (CNN, Transformer, RL, DQN) │ └─────────────┬───────────────────────┘ │ ┌─────────────▼───────────────────────┐ │ Model Interfaces │ │ (CNNModelInterface, RLAgentInterface) │ └─────────────┬───────────────────────┘ │ ┌─────────────▼───────────────────────┐ │ NPUAcceleratedModel │ │ (ONNX Runtime + DirectML) │ └─────────────┬───────────────────────┘ │ ┌─────────────▼───────────────────────┐ │ Strix Halo NPU │ │ (XDNA Architecture) │ └─────────────────────────────────────┘ ``` ### Key Components 1. **NPUDetector**: Detects NPU availability and capabilities 2. **ONNXModelWrapper**: Wraps ONNX models for NPU inference 3. **PyTorchToONNXConverter**: Converts PyTorch models to ONNX 4. **NPUAcceleratedModel**: High-level interface for NPU acceleration 5. **Enhanced Model Interfaces**: Updated interfaces with NPU support ## Usage Examples ### Basic NPU Acceleration ```python from utils.npu_acceleration import NPUAcceleratedModel import torch.nn as nn # Create your PyTorch model model = YourTradingModel() # Wrap with NPU acceleration npu_model = NPUAcceleratedModel( pytorch_model=model, model_name="trading_model", input_shape=(60, 50) # Your input shape ) # Run inference import numpy as np test_data = np.random.randn(1, 60, 50).astype(np.float32) prediction = npu_model.predict(test_data) ``` ### Using Enhanced Model Interfaces ```python from NN.models.model_interfaces import CNNModelInterface # Create CNN model interface with NPU support cnn_interface = CNNModelInterface( model=your_cnn_model, name="trading_cnn", enable_npu=True, input_shape=(60, 50) ) # Get acceleration info info = cnn_interface.get_acceleration_info() print(f"NPU available: {info['npu_available']}") # Make predictions (automatically uses NPU if available) prediction = cnn_interface.predict(test_data) ``` ### Converting Existing Models ```python from utils.npu_acceleration import PyTorchToONNXConverter # Convert your existing model converter = PyTorchToONNXConverter(your_model) success = converter.convert( output_path="models/your_model.onnx", input_shape=(60, 50), input_names=['trading_features'], output_names=['trading_signals'] ) ``` ## Performance Benefits ### Expected Improvements - **Inference Speed**: 3-6x faster than CPU - **Power Efficiency**: Lower power consumption than GPU - **Latency**: Sub-millisecond inference for small models - **Memory**: Efficient memory usage for NPU-optimized models ### Benchmarking ```python from utils.npu_acceleration import benchmark_npu_vs_cpu # Benchmark your model results = benchmark_npu_vs_cpu( model_path="models/your_model.onnx", test_data=your_test_data, iterations=100 ) print(f"NPU speedup: {results['speedup']:.2f}x") print(f"NPU latency: {results['npu_latency_ms']:.2f} ms") ``` ## Integration with Existing Code ### Orchestrator Integration The orchestrator automatically detects and uses NPU acceleration when available: ```python # In core/orchestrator.py from NN.models.model_interfaces import CNNModelInterface, RLAgentInterface # Models automatically use NPU if available cnn_interface = CNNModelInterface( model=cnn_model, name="trading_cnn", enable_npu=True, # Enable NPU acceleration input_shape=(60, 50) ) ``` ### Dashboard Integration The dashboard shows NPU status and performance metrics: ```python # NPU status is automatically displayed in the dashboard # Check the "Acceleration" section for NPU information ``` ## Troubleshooting ### Common Issues 1. **NPU Not Detected** ```bash # Check kernel version (need 6.11+) uname -r # Check NPU devices ls /dev/amdxdna* # Reboot if needed sudo reboot ``` 2. **ONNX Runtime Issues** ```bash # Reinstall ONNX Runtime with DirectML pip install onnxruntime-directml --force-reinstall ``` 3. **Model Conversion Failures** ```python # Check model compatibility # Some PyTorch operations may not be supported # Use simpler model architectures for NPU ``` ### Debug Mode ```python import logging logging.basicConfig(level=logging.DEBUG) # Enable detailed NPU logging from utils.npu_detector import get_npu_info print(get_npu_info()) ``` ## Best Practices ### Model Optimization 1. **Use ONNX-compatible operations**: Avoid custom PyTorch operations 2. **Optimize input shapes**: Use fixed input shapes when possible 3. **Batch processing**: Process multiple samples together 4. **Model quantization**: Consider INT8 quantization for better performance ### Memory Management 1. **Monitor NPU memory usage**: NPU has limited memory 2. **Use model streaming**: Load/unload models as needed 3. **Optimize batch sizes**: Balance performance vs memory usage ### Error Handling 1. **Always provide fallbacks**: NPU may not always be available 2. **Handle conversion errors**: Some models may not convert properly 3. **Monitor performance**: Ensure NPU is actually faster than CPU ## Advanced Configuration ### Custom ONNX Providers ```python from utils.npu_detector import get_onnx_providers # Get available providers providers = get_onnx_providers() print(f"Available providers: {providers}") # Use specific provider order custom_providers = ['DmlExecutionProvider', 'CPUExecutionProvider'] ``` ### Performance Tuning ```python # Enable ONNX optimizations session_options = ort.SessionOptions() session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL session_options.enable_profiling = True ``` ## Monitoring and Metrics ### Performance Monitoring ```python # Get detailed performance info perf_info = npu_model.get_performance_info() print(f"Providers: {perf_info['providers']}") print(f"Input shapes: {perf_info['input_shapes']}") ``` ### Dashboard Metrics The dashboard automatically displays: - NPU availability status - Inference latency - Memory usage - Provider information ## Future Enhancements ### Planned Features 1. **Automatic model optimization**: Auto-tune models for NPU 2. **Dynamic provider selection**: Choose best provider automatically 3. **Advanced benchmarking**: More detailed performance analysis 4. **Model compression**: Automatic model size optimization ### Contributing To contribute NPU improvements: 1. Test with your specific models 2. Report performance improvements 3. Suggest optimization techniques 4. Contribute to the NPU acceleration utilities ## Support For issues with NPU integration: 1. Check the troubleshooting section 2. Run the integration tests 3. Check AMD documentation for latest updates 4. Verify kernel and driver compatibility --- **Note**: NPU acceleration is most effective for inference workloads. Training is still recommended on GPU or CPU. The NPU excels at real-time trading inference where low latency is critical.