# Strix Halo NPU Integration Guide
## Overview
This guide explains how to use AMD's Strix Halo NPU (Neural Processing Unit) to accelerate your neural network trading models on Linux. The NPU provides significant performance improvements for inference workloads, especially for CNNs and transformers.
## Prerequisites
- AMD Strix Halo processor
- Linux kernel 6.11+ (Ubuntu 24.04 LTS recommended)
- AMD Ryzen AI Software 1.5+
- ROCm 6.4.1+ (optional, for GPU acceleration)
## Quick Start
### 1. Install NPU Software Stack
```bash
# Run the setup script
chmod +x setup_strix_halo_npu.sh
./setup_strix_halo_npu.sh
# Reboot to load NPU drivers
sudo reboot
```
### 2. Verify NPU Detection
```bash
# Check NPU devices
ls /dev/amdxdna*
# Run NPU test
python3 test_npu.py
```
### 3. Test Model Integration
```bash
# Run comprehensive integration tests
python3 test_npu_integration.py
```
## Architecture
### NPU Acceleration Stack
```
┌─────────────────────────────────────┐
│           Trading Models            │
│     (CNN, Transformer, RL, DQN)     │
└─────────────┬───────────────────────┘
┌─────────────▼───────────────────────┐
│          Model Interfaces           │
│(CNNModelInterface, RLAgentInterface)│
└─────────────┬───────────────────────┘
┌─────────────▼───────────────────────┐
│         NPUAcceleratedModel         │
│      (ONNX Runtime + DirectML)      │
└─────────────┬───────────────────────┘
┌─────────────▼───────────────────────┐
│           Strix Halo NPU            │
│         (XDNA Architecture)         │
└─────────────────────────────────────┘
```
### Key Components
1. **NPUDetector**: Detects NPU availability and capabilities
2. **ONNXModelWrapper**: Wraps ONNX models for NPU inference
3. **PyTorchToONNXConverter**: Converts PyTorch models to ONNX
4. **NPUAcceleratedModel**: High-level interface for NPU acceleration
5. **Enhanced Model Interfaces**: Updated interfaces with NPU support
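As a quick orientation, the sketch below wires these components together using only calls that appear later in this guide (`get_npu_info`, `get_onnx_providers`, `NPUAcceleratedModel`); `YourTradingModel` is a placeholder for one of your own PyTorch models.
```python
import numpy as np

from utils.npu_detector import get_npu_info, get_onnx_providers
from utils.npu_acceleration import NPUAcceleratedModel

# 1. Check what the detector reports and which ONNX Runtime providers are exposed
print(get_npu_info())
print(get_onnx_providers())  # e.g. ['DmlExecutionProvider', 'CPUExecutionProvider']

# 2. Wrap a PyTorch model; ONNX conversion happens behind this interface
npu_model = NPUAcceleratedModel(
    pytorch_model=YourTradingModel(),  # placeholder model
    model_name="trading_model",
    input_shape=(60, 50),
)

# 3. Run inference; the wrapper uses the NPU when available, CPU otherwise
sample = np.random.randn(1, 60, 50).astype(np.float32)
signal = npu_model.predict(sample)
```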
## Usage Examples
### Basic NPU Acceleration
```python
import numpy as np
import torch.nn as nn

from utils.npu_acceleration import NPUAcceleratedModel

# Create your PyTorch model
model = YourTradingModel()

# Wrap it with NPU acceleration
npu_model = NPUAcceleratedModel(
    pytorch_model=model,
    model_name="trading_model",
    input_shape=(60, 50)  # Your input shape
)

# Run inference
test_data = np.random.randn(1, 60, 50).astype(np.float32)
prediction = npu_model.predict(test_data)
```
### Using Enhanced Model Interfaces
```python
from NN.models.model_interfaces import CNNModelInterface

# Create a CNN model interface with NPU support
cnn_interface = CNNModelInterface(
    model=your_cnn_model,
    name="trading_cnn",
    enable_npu=True,
    input_shape=(60, 50)
)

# Get acceleration info
info = cnn_interface.get_acceleration_info()
print(f"NPU available: {info['npu_available']}")

# Make predictions (automatically uses the NPU if available)
prediction = cnn_interface.predict(test_data)
```
### Converting Existing Models
```python
from utils.npu_acceleration import PyTorchToONNXConverter

# Convert your existing model
converter = PyTorchToONNXConverter(your_model)
success = converter.convert(
    output_path="models/your_model.onnx",
    input_shape=(60, 50),
    input_names=['trading_features'],
    output_names=['trading_signals']
)
```
## Performance Benefits
### Expected Improvements
- **Inference Speed**: 3-6x faster than CPU
- **Power Efficiency**: Lower power consumption than GPU
- **Latency**: Sub-millisecond inference for small models
- **Memory**: Efficient memory usage for NPU-optimized models
### Benchmarking
```python
from utils.npu_acceleration import benchmark_npu_vs_cpu

# Benchmark your model
results = benchmark_npu_vs_cpu(
    model_path="models/your_model.onnx",
    test_data=your_test_data,
    iterations=100
)

print(f"NPU speedup: {results['speedup']:.2f}x")
print(f"NPU latency: {results['npu_latency_ms']:.2f} ms")
```
## Integration with Existing Code
### Orchestrator Integration
The orchestrator automatically detects and uses NPU acceleration when available:
```python
# In core/orchestrator.py
from NN.models.model_interfaces import CNNModelInterface, RLAgentInterface

# Models automatically use the NPU if it is available
cnn_interface = CNNModelInterface(
    model=cnn_model,
    name="trading_cnn",
    enable_npu=True,  # Enable NPU acceleration
    input_shape=(60, 50)
)
```
### Dashboard Integration
The dashboard shows NPU status and performance metrics:
```python
# NPU status is automatically displayed in the dashboard
# Check the "Acceleration" section for NPU information
```
## Troubleshooting
### Common Issues
1. **NPU Not Detected**
```bash
# Check kernel version (need 6.11+)
uname -r
# Check NPU devices
ls /dev/amdxdna*
# Reboot if needed
sudo reboot
```
2. **ONNX Runtime Issues**
```bash
# Reinstall ONNX Runtime with DirectML
pip install onnxruntime-directml --force-reinstall
```
3. **Model Conversion Failures**
```python
# Some PyTorch operations are not supported by ONNX export;
# a plain torch.onnx.export is a quick compatibility check.
import torch
try:
    torch.onnx.export(your_model, torch.randn(1, 60, 50), "compat_check.onnx")
except Exception as e:
    print(f"Unsupported operation; use a simpler architecture for NPU: {e}")
```
### Debug Mode
```python
import logging
logging.basicConfig(level=logging.DEBUG)
# Enable detailed NPU logging
from utils.npu_detector import get_npu_info
print(get_npu_info())
```
## Best Practices
### Model Optimization
1. **Use ONNX-compatible operations**: Avoid custom PyTorch operations
2. **Optimize input shapes**: Use fixed input shapes when possible
3. **Batch processing**: Process multiple samples together
4. **Model quantization**: Consider INT8 quantization for better performance (see the sketch after this list)
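A minimal sketch of item 4 using ONNX Runtime's dynamic quantizer, assuming the ONNX file exported by `PyTorchToONNXConverter`; whether it helps depends on the model and provider, so benchmark before and after:
```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Weight-only dynamic INT8 quantization of an exported ONNX model;
# the quantized file is smaller and often faster, but verify accuracy.
quantize_dynamic(
    model_input="models/your_model.onnx",
    model_output="models/your_model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```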
### Memory Management
1. **Monitor NPU memory usage**: NPU has limited memory
2. **Use model streaming**: Load/unload models as needed (see the sketch after this list)
3. **Optimize batch sizes**: Balance performance vs memory usage
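Because NPU memory is limited, a simple streaming pattern is to create the ONNX Runtime session only while a model is needed and release it afterwards. This is a generic sketch using plain `onnxruntime`, not a project-specific API:
```python
import gc
import onnxruntime as ort

def run_once(model_path, features, providers):
    """Load a model, run a single inference, then free the session."""
    session = ort.InferenceSession(model_path, providers=providers)
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: features})
    del session  # drop the session so its NPU/CPU memory can be reclaimed
    gc.collect()
    return outputs[0]
```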
### Error Handling
1. **Always provide fallbacks**: NPU may not always be available (a fallback sketch follows this list)
2. **Handle conversion errors**: Some models may not convert properly
3. **Monitor performance**: Ensure NPU is actually faster than CPU
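As mentioned in item 1, keep a CPU path available. A minimal fallback sketch, assuming `npu_model` is an `NPUAcceleratedModel` and `pytorch_model` is the original network (the exact exceptions raised by the wrapper are an assumption):
```python
import numpy as np
import torch

def predict_with_fallback(npu_model, pytorch_model, features: np.ndarray):
    """Try NPU inference first; fall back to the original PyTorch model."""
    try:
        return npu_model.predict(features)
    except Exception as exc:  # NPU missing, conversion failure, runtime error, ...
        print(f"NPU inference failed ({exc}); falling back to PyTorch on CPU")
        with torch.no_grad():
            return pytorch_model(torch.from_numpy(features)).numpy()
```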
## Advanced Configuration
### Custom ONNX Providers
```python
import onnxruntime as ort

from utils.npu_detector import get_onnx_providers

# Get available providers
providers = get_onnx_providers()
print(f"Available providers: {providers}")

# Use a specific provider order when creating a session
custom_providers = ['DmlExecutionProvider', 'CPUExecutionProvider']
session = ort.InferenceSession("models/your_model.onnx", providers=custom_providers)
```
### Performance Tuning
```python
import onnxruntime as ort

# Enable ONNX Runtime graph optimizations and profiling
session_options = ort.SessionOptions()
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session_options.enable_profiling = True

# Pass the options when creating the inference session
session = ort.InferenceSession("models/your_model.onnx", sess_options=session_options)
```
## Monitoring and Metrics
### Performance Monitoring
```python
# Get detailed performance info
perf_info = npu_model.get_performance_info()
print(f"Providers: {perf_info['providers']}")
print(f"Input shapes: {perf_info['input_shapes']}")
```
### Dashboard Metrics
The dashboard automatically displays:
- NPU availability status
- Inference latency
- Memory usage
- Provider information
## Future Enhancements
### Planned Features
1. **Automatic model optimization**: Auto-tune models for NPU
2. **Dynamic provider selection**: Choose best provider automatically
3. **Advanced benchmarking**: More detailed performance analysis
4. **Model compression**: Automatic model size optimization
### Contributing
To contribute NPU improvements:
1. Test with your specific models
2. Report performance improvements
3. Suggest optimization techniques
4. Contribute to the NPU acceleration utilities
## Support
For issues with NPU integration:
1. Check the troubleshooting section
2. Run the integration tests
3. Check AMD documentation for latest updates
4. Verify kernel and driver compatibility
---
**Note**: NPU acceleration is most effective for inference workloads. Training is still recommended on GPU or CPU. The NPU excels at real-time trading inference where low latency is critical.