Strix Halo NPU Integration Guide

Overview

This guide explains how to use AMD's Strix Halo NPU (Neural Processing Unit) to accelerate your neural network trading models on Linux. The NPU provides significant performance improvements for inference workloads, especially for CNNs and transformers.

Prerequisites

  • AMD Strix Halo processor
  • Linux kernel 6.11+ (Ubuntu 24.04 LTS recommended)
  • AMD Ryzen AI Software 1.5+
  • ROCm 6.4.1+ (optional, for GPU acceleration)

Quick Start

1. Install NPU Software Stack

# Run the setup script
chmod +x setup_strix_halo_npu.sh
./setup_strix_halo_npu.sh

# Reboot to load NPU drivers
sudo reboot

2. Verify NPU Detection

# Check NPU devices
ls /dev/amdxdna*

# Run NPU test
python3 test_npu.py

3. Test Model Integration

# Run comprehensive integration tests
python3 test_npu_integration.py

Architecture

NPU Acceleration Stack

┌─────────────────────────────────────────┐
│              Trading Models             │
│       (CNN, Transformer, RL, DQN)       │
└────────────────────┬────────────────────┘
                     │
┌────────────────────▼────────────────────┐
│             Model Interfaces            │
│  (CNNModelInterface, RLAgentInterface)  │
└────────────────────┬────────────────────┘
                     │
┌────────────────────▼────────────────────┐
│           NPUAcceleratedModel           │
│        (ONNX Runtime + DirectML)        │
└────────────────────┬────────────────────┘
                     │
┌────────────────────▼────────────────────┐
│              Strix Halo NPU             │
│           (XDNA Architecture)           │
└─────────────────────────────────────────┘

Key Components

  1. NPUDetector: Detects NPU availability and capabilities
  2. ONNXModelWrapper: Wraps ONNX models for NPU inference
  3. PyTorchToONNXConverter: Converts PyTorch models to ONNX
  4. NPUAcceleratedModel: High-level interface for NPU acceleration
  5. Enhanced Model Interfaces: Updated interfaces with NPU support

Usage Examples

Basic NPU Acceleration

from utils.npu_acceleration import NPUAcceleratedModel
import torch.nn as nn

# Create your PyTorch model
model = YourTradingModel()

# Wrap with NPU acceleration
npu_model = NPUAcceleratedModel(
    pytorch_model=model,
    model_name="trading_model",
    input_shape=(60, 50)  # Your input shape
)

# Run inference
import numpy as np
test_data = np.random.randn(1, 60, 50).astype(np.float32)
prediction = npu_model.predict(test_data)

Using Enhanced Model Interfaces

from NN.models.model_interfaces import CNNModelInterface

# Create CNN model interface with NPU support
cnn_interface = CNNModelInterface(
    model=your_cnn_model,
    name="trading_cnn",
    enable_npu=True,
    input_shape=(60, 50)
)

# Get acceleration info
info = cnn_interface.get_acceleration_info()
print(f"NPU available: {info['npu_available']}")

# Make predictions (automatically uses NPU if available)
prediction = cnn_interface.predict(test_data)

Converting Existing Models

from utils.npu_acceleration import PyTorchToONNXConverter

# Convert your existing model
converter = PyTorchToONNXConverter(your_model)
success = converter.convert(
    output_path="models/your_model.onnx",
    input_shape=(60, 50),
    input_names=['trading_features'],
    output_names=['trading_signals']
)

Performance Benefits

Expected Improvements

  • Inference Speed: 3-6x faster than CPU
  • Power Efficiency: Lower power consumption than GPU
  • Latency: Sub-millisecond inference for small models
  • Memory: Efficient memory usage for NPU-optimized models

Benchmarking

from utils.npu_acceleration import benchmark_npu_vs_cpu

# Benchmark your model
results = benchmark_npu_vs_cpu(
    model_path="models/your_model.onnx",
    test_data=your_test_data,
    iterations=100
)

print(f"NPU speedup: {results['speedup']:.2f}x")
print(f"NPU latency: {results['npu_latency_ms']:.2f} ms")

Integration with Existing Code

Orchestrator Integration

The orchestrator automatically detects and uses NPU acceleration when available:

# In core/orchestrator.py
from NN.models.model_interfaces import CNNModelInterface, RLAgentInterface

# Models automatically use NPU if available
cnn_interface = CNNModelInterface(
    model=cnn_model,
    name="trading_cnn",
    enable_npu=True,  # Enable NPU acceleration
    input_shape=(60, 50)
)

Dashboard Integration

The dashboard shows NPU status and performance metrics:

# NPU status is automatically displayed in the dashboard
# Check the "Acceleration" section for NPU information
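
If you want to surface the same information outside the dashboard (for example in logs or a custom panel), a minimal sketch is shown below. It reuses get_npu_info() from utils.npu_detector (see Debug Mode below) and the get_performance_info() method shown under Performance Monitoring; the dictionary keys are illustrative, not a fixed API.

from utils.npu_detector import get_npu_info

def collect_npu_status(npu_model=None):
    """Assemble NPU status for display; the key names here are illustrative."""
    status = {'npu': get_npu_info()}  # detection and capability info
    if npu_model is not None:
        # Per-model runtime details (providers, input shapes, ...)
        status['model'] = npu_model.get_performance_info()
    return status

print(collect_npu_status())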

Troubleshooting

Common Issues

  1. NPU Not Detected

    # Check kernel version (need 6.11+)
    uname -r
    
    # Check NPU devices
    ls /dev/amdxdna*
    
    # Reboot if needed
    sudo reboot
    
  2. ONNX Runtime Issues

    # Reinstall ONNX Runtime with DirectML
    pip install onnxruntime-directml --force-reinstall
    
  3. Model Conversion Failures

    # Check model compatibility: some PyTorch operations are not supported
    # by the ONNX exporter, so prefer simpler architectures for the NPU
    # (a validation sketch follows after this list)
    
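If a conversion appears to succeed but inference later fails, validating the exported file can help narrow the problem down. A minimal sketch using the standard onnx checker (assumes the onnx package is installed; the path is illustrative):

import onnx

model_path = "models/your_model.onnx"  # illustrative path
onnx.checker.check_model(onnx.load(model_path))  # raises if the graph is invalid
print(f"{model_path} passed ONNX validation")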

Debug Mode

import logging
logging.basicConfig(level=logging.DEBUG)

# Enable detailed NPU logging
from utils.npu_detector import get_npu_info
print(get_npu_info())

Best Practices

Model Optimization

  1. Use ONNX-compatible operations: Avoid custom PyTorch operations
  2. Optimize input shapes: Use fixed input shapes when possible
  3. Batch processing: Process multiple samples together
  4. Model quantization: Consider INT8 quantization for better performance (a minimal sketch follows below)
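
For item 4, ONNX Runtime ships a dynamic quantization utility you can try as a starting point. A minimal sketch (paths are illustrative; verify that the quantized model still runs on your target provider, since operator support varies):

from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize weights to INT8; activations remain float
quantize_dynamic(
    model_input="models/your_model.onnx",        # illustrative paths
    model_output="models/your_model_int8.onnx",
    weight_type=QuantType.QInt8,
)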

Memory Management

  1. Monitor NPU memory usage: NPU has limited memory
  2. Use model streaming: Load/unload models as needed (see the sketch after this list)
  3. Optimize batch sizes: Balance performance vs memory usage
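
For model streaming (item 2), one simple pattern is to create the ONNX Runtime session only when it is needed and drop it afterwards. A minimal sketch, assuming the ONNX file already exists on disk:

import onnxruntime as ort

class StreamedModel:
    """Load the session lazily and release it on demand to free NPU/host memory."""

    def __init__(self, model_path, providers):
        self.model_path = model_path
        self.providers = providers
        self.session = None

    def predict(self, inputs):
        if self.session is None:  # load on first use
            self.session = ort.InferenceSession(self.model_path, providers=self.providers)
        input_name = self.session.get_inputs()[0].name
        return self.session.run(None, {input_name: inputs})

    def unload(self):
        self.session = None  # allow the runtime to release its resources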

Error Handling

  1. Always provide fallbacks: NPU may not always be available (see the sketch after this list)
  2. Handle conversion errors: Some models may not convert properly
  3. Monitor performance: Ensure NPU is actually faster than CPU
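
For item 1, the simplest pattern is to attempt NPU inference and fall back to the original PyTorch model on failure. A minimal sketch, assuming the NPUAcceleratedModel wrapper from the usage examples above (the broad exception handling is for illustration only):

import numpy as np
import torch

def predict_with_fallback(npu_model, pytorch_model, features: np.ndarray):
    """Try NPU inference first; fall back to the plain PyTorch model."""
    try:
        return npu_model.predict(features)
    except Exception as exc:  # e.g. NPU unavailable or conversion failed
        print(f"NPU inference failed ({exc}); falling back to CPU/GPU")
        with torch.no_grad():
            return pytorch_model(torch.from_numpy(features)).numpy()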

Advanced Configuration

Custom ONNX Providers

from utils.npu_detector import get_onnx_providers

# Get available providers
providers = get_onnx_providers()
print(f"Available providers: {providers}")

# Use specific provider order
custom_providers = ['DmlExecutionProvider', 'CPUExecutionProvider']
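
To apply a preferred provider order, pass it when the ONNX Runtime session is created. A minimal sketch (the model path is illustrative; provider names must match what get_onnx_providers() reports as available):

import numpy as np
import onnxruntime as ort

custom_providers = ['DmlExecutionProvider', 'CPUExecutionProvider']

session = ort.InferenceSession(
    "models/your_model.onnx",   # illustrative path
    providers=custom_providers,
)

input_name = session.get_inputs()[0].name
sample = np.random.randn(1, 60, 50).astype(np.float32)
outputs = session.run(None, {input_name: sample})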

Performance Tuning

import onnxruntime as ort

# Enable ONNX Runtime graph optimizations and profiling
session_options = ort.SessionOptions()
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session_options.enable_profiling = True
# Pass these via sess_options when creating the InferenceSession
session = ort.InferenceSession("models/your_model.onnx", sess_options=session_options)

Monitoring and Metrics

Performance Monitoring

# Get detailed performance info
perf_info = npu_model.get_performance_info()
print(f"Providers: {perf_info['providers']}")
print(f"Input shapes: {perf_info['input_shapes']}")

Dashboard Metrics

The dashboard automatically displays:

  • NPU availability status
  • Inference latency
  • Memory usage
  • Provider information

Future Enhancements

Planned Features

  1. Automatic model optimization: Auto-tune models for NPU
  2. Dynamic provider selection: Choose best provider automatically
  3. Advanced benchmarking: More detailed performance analysis
  4. Model compression: Automatic model size optimization

Contributing

To contribute NPU improvements:

  1. Test with your specific models
  2. Report performance improvements
  3. Suggest optimization techniques
  4. Contribute to the NPU acceleration utilities

Support

For issues with NPU integration:

  1. Check the troubleshooting section
  2. Run the integration tests
  3. Check AMD documentation for latest updates
  4. Verify kernel and driver compatibility

Note: NPU acceleration is most effective for inference workloads. Training is still recommended on GPU or CPU. The NPU excels at real-time trading inference where low latency is critical.