Run ONNX models

ONNX (Open Neural Network Exchange) is a standard format for exporting models — typically created in frameworks like PyTorch — so they can run anywhere. On Dragonwing devices you can use ONNX Runtime with AI Engine Direct to execute ONNX models directly on the NPU for maximum performance.

onnxruntime wheel with AI Engine Direct

onnxruntime currently does not publish prebuilt wheels for aarch64 Linux with AI Engine Direct bindings, so you cannot install a QNN-enabled build from PyPI. You can download prebuilt wheels here:

(Install via `pip3 install onnxruntime_qnn-*-linux_aarch64.whl`.)
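After installing the wheel, a quick sanity check is to list the available execution providers and confirm that `QNNExecutionProvider` shows up:

```bash
python3 -c "import onnxruntime; print(onnxruntime.get_available_providers())"
```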

To build a wheel for other onnxruntime or Python versions, see edgeimpulse/onnxruntime-qnn-linux-aarch64.

Preparing your ONNX file

The NPU only supports quantized uint8/int8 models with a fixed input shape. If your model is not quantized, or if it has a dynamic input shape, it will automatically be offloaded to the CPU. Here are some tips on how to prepare your model.

A full-length tutorial for exporting a PyTorch model to ONNX is available in the PyTorch documentation.

Dynamic shapes

If you have a model with dynamic shapes, you'll need to make them fixed first. You can inspect the shape of your network with Netron.

For example, this model has dynamic shapes:

An ONNX model with a dynamic shape. Here the input tensor is named `pixel_values`.

You can set a fixed shape via onnxruntime.tools.make_dynamic_shape_fixed:
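For example, for the model above (the input name `pixel_values` and the 1,3,224,224 shape come from that screenshot; substitute your own model's input name and shape):

```bash
python3 -m onnxruntime.tools.make_dynamic_shape_fixed \
  --input_name pixel_values \
  --input_shape 1,3,224,224 \
  model.onnx model_fixed.onnx
```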

Afterwards, your model has a fixed shape and is ready to run on the NPU.

An ONNX model with a fixed shape.

Quantizing models

The NPU only supports uint8/int8 quantized models. Unsupported models or unsupported layers will automatically be moved back to the CPU. For a guide on quantizing models, see the ONNX Runtime docs: Quantize ONNX Models.
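As a rough sketch of static (QDQ) quantization with ONNX Runtime's Python API; the file names and input name are placeholders, and the random calibration data should be replaced with representative samples from your dataset:

```python
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static,
)

class RandomCalibrationReader(CalibrationDataReader):
    """Feeds a handful of calibration batches to the quantizer.
    Replace the random data with real, representative samples."""
    def __init__(self, input_name="pixel_values", shape=(1, 3, 224, 224), n=16):
        self._batches = iter(
            [{input_name: np.random.rand(*shape).astype(np.float32)} for _ in range(n)]
        )

    def get_next(self):
        return next(self._batches, None)

quantize_static(
    "model_fixed.onnx",       # float32 model with a fixed input shape
    "model_quantized.onnx",   # output: QDQ model with uint8 activations/weights
    RandomCalibrationReader(),
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QUInt8,
)
```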

Note: Don't want to quantize yourself? You can download a range of pre-quantized models from Qualcomm AI Hub, or use Edge Impulse to quantize new or existing models.

Running a model on the NPU (Python)

To offload a model to the NPU, you just need to load the QNNExecutionProvider and pass it when creating the InferenceSession (make sure you use an onnxruntime wheel with AI Engine Direct bindings; see the top of this page). For example:
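A minimal sketch; the model path is a placeholder, and `backend_path` is assumed to point at the HTP backend library (`libQnnHtp.so`) shipped with AI Engine Direct:

```python
import onnxruntime as ort

session = ort.InferenceSession(
    "model_quantized.onnx",
    providers=["QNNExecutionProvider", "CPUExecutionProvider"],
    provider_options=[{"backend_path": "libQnnHtp.so"}, {}],
)
# QNNExecutionProvider should be listed first if the NPU was picked up
print(session.get_providers())
```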

Example: SqueezeNet-1.1 (Python)

Open a terminal on your development board, or an SSH session to it, and:

  1. Create a new venv, and install onnxruntime and Pillow (see the setup commands after this list).

  2. Here's an end-to-end example running SqueezeNet-1.1. Save this file as inference_onnx.py (see the sketch after this list).

    Note: this script has hard-coded quantization parameters. If you swap out the model you might need to change these.

  3. Run the model on the CPU (see the example commands after this list).

  4. Run the model on the NPU.
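Below is a minimal sketch of these four steps. It assumes a quantized SqueezeNet-1.1 model saved as `squeezenet1.1-quantized.onnx` with a uint8 1x3x224x224 input, a test image named `image.jpg`, and a script that picks the execution provider from its first argument (`cpu` or `npu`); all of these names, and the hard-coded input quantization parameters, are placeholders.

Step 1, creating a venv and installing the dependencies (the onnxruntime wheel is the AI Engine Direct build from the top of this page):

```bash
python3 -m venv .venv
source .venv/bin/activate
pip3 install onnxruntime_qnn-*-linux_aarch64.whl pillow numpy
```

Step 2, a sketch of inference_onnx.py:

```python
# inference_onnx.py - sketch: run a quantized SqueezeNet-1.1 ONNX model on the CPU or NPU.
# Assumes the model takes a quantized uint8 NCHW input of shape 1x3x224x224.
import sys
import time

import numpy as np
import onnxruntime as ort
from PIL import Image

MODEL_PATH = "squeezenet1.1-quantized.onnx"  # placeholder: your quantized model
IMAGE_PATH = "image.jpg"                     # placeholder: any test image

# Hard-coded input quantization parameters (placeholders). If you swap out the
# model, read the real scale/zero-point from its input quantization parameters.
INPUT_SCALE = 0.02
INPUT_ZERO_POINT = 128

def load_image(path):
    """Resize to 224x224, normalize, and quantize to uint8 NCHW."""
    img = Image.open(path).convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32) / 255.0
    x = (x - [0.485, 0.456, 0.406]) / [0.229, 0.224, 0.225]  # ImageNet normalization
    x = x.transpose(2, 0, 1)[np.newaxis, :]                  # HWC -> NCHW
    q = np.round(x / INPUT_SCALE) + INPUT_ZERO_POINT
    return np.clip(q, 0, 255).astype(np.uint8)

def main():
    target = sys.argv[1] if len(sys.argv) > 1 else "cpu"
    if target == "npu":
        providers = ["QNNExecutionProvider"]
        provider_options = [{"backend_path": "libQnnHtp.so"}]
    else:
        providers = ["CPUExecutionProvider"]
        provider_options = [{}]

    session = ort.InferenceSession(
        MODEL_PATH, providers=providers, provider_options=provider_options
    )
    input_name = session.get_inputs()[0].name
    x = load_image(IMAGE_PATH)

    session.run(None, {input_name: x})  # warm-up run
    start = time.perf_counter()
    scores = session.run(None, {input_name: x})[0]
    print(f"Inference took {(time.perf_counter() - start) * 1000:.2f} ms on {target}")
    print("Top-5 class indices:", np.argsort(scores[0].flatten())[::-1][:5].tolist())

if __name__ == "__main__":
    main()
```

Steps 3 and 4, running on the CPU and then on the NPU:

```bash
python3 inference_onnx.py cpu
python3 inference_onnx.py npu
```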

As you can see, this model runs significantly faster on the NPU, though there's a slight change in its output.

Example: PyTorch → ONNX → Quantized int8 → inference on NPU

Open a terminal on your development board, or an SSH session to it, and:

  1. Create a new venv, and install onnxruntime and Pillow (see the setup commands after this list).

  2. Here's an end-to-end example running SqueezeNet-1.1 from torchvision. Save this file as inference_pytorch_onnx.py (see the sketch after this list).

  3. Run the model on the CPU (see the example commands after this list).

  4. Run the model on the NPU.
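A minimal sketch of these steps follows. Besides onnxruntime and Pillow it also assumes torch and torchvision (for the pretrained model and the ONNX export), random calibration data, a placeholder image name, and a first command-line argument (`cpu` or `npu`) that picks the execution provider; none of these details come from the original steps.

Step 1, setup:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip3 install onnxruntime_qnn-*-linux_aarch64.whl pillow numpy torch torchvision
```

Step 2, a sketch of inference_pytorch_onnx.py:

```python
# inference_pytorch_onnx.py - sketch: export SqueezeNet-1.1 from torchvision to ONNX
# with a fixed input shape, quantize it to int8 (QDQ), and run it on the CPU or NPU.
import sys
import time

import numpy as np
import onnxruntime as ort
import torch
import torchvision
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static,
)
from PIL import Image

FLOAT_MODEL = "squeezenet1_1.onnx"
QUANT_MODEL = "squeezenet1_1-int8.onnx"
IMAGE_PATH = "image.jpg"  # placeholder: any test image

def export_onnx():
    """Export the pretrained model with a fixed 1x3x224x224 input shape."""
    model = torchvision.models.squeezenet1_1(weights="IMAGENET1K_V1").eval()
    torch.onnx.export(model, torch.randn(1, 3, 224, 224), FLOAT_MODEL,
                      input_names=["pixel_values"], output_names=["logits"])

class RandomCalibrationReader(CalibrationDataReader):
    """Placeholder calibration data - use representative samples for real accuracy."""
    def __init__(self, n=16):
        self._it = iter([{"pixel_values": np.random.rand(1, 3, 224, 224).astype(np.float32)}
                         for _ in range(n)])

    def get_next(self):
        return next(self._it, None)

def preprocess(path):
    """Resize to 224x224 and apply ImageNet normalization (float32 NCHW)."""
    img = Image.open(path).convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32) / 255.0
    x = (x - [0.485, 0.456, 0.406]) / [0.229, 0.224, 0.225]
    return x.transpose(2, 0, 1)[np.newaxis, :].astype(np.float32)

def main():
    target = sys.argv[1] if len(sys.argv) > 1 else "cpu"

    export_onnx()
    quantize_static(FLOAT_MODEL, QUANT_MODEL, RandomCalibrationReader(),
                    quant_format=QuantFormat.QDQ,
                    activation_type=QuantType.QUInt8,
                    weight_type=QuantType.QUInt8)

    if target == "npu":
        providers = ["QNNExecutionProvider"]
        provider_options = [{"backend_path": "libQnnHtp.so"}]
    else:
        providers = ["CPUExecutionProvider"]
        provider_options = [{}]
    session = ort.InferenceSession(QUANT_MODEL, providers=providers,
                                   provider_options=provider_options)

    x = preprocess(IMAGE_PATH)
    session.run(None, {"pixel_values": x})  # warm-up run
    start = time.perf_counter()
    logits = session.run(None, {"pixel_values": x})[0]
    print(f"Inference took {(time.perf_counter() - start) * 1000:.2f} ms on {target}")
    print("Top-5 class indices:", np.argsort(logits[0].flatten())[::-1][:5].tolist())

if __name__ == "__main__":
    main()
```

Steps 3 and 4, running on the CPU and then on the NPU:

```bash
python3 inference_pytorch_onnx.py cpu
python3 inference_pytorch_onnx.py npu
```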

Tips & tricks

Disable CPU fallback

To debug, you might want to disable fallback to the CPU:
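A sketch using the session-level config entry for this (the model path and `backend_path` value are placeholders, as above):

```python
import onnxruntime as ort

options = ort.SessionOptions()
# Raise an error instead of silently falling back to the CPU execution provider
options.add_session_config_entry("session.disable_cpu_ep_fallback", "1")

session = ort.InferenceSession(
    "model_quantized.onnx",
    sess_options=options,
    providers=["QNNExecutionProvider"],
    provider_options=[{"backend_path": "libQnnHtp.so"}],
)
```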

Building new versions of the onnxruntime package

See edgeimpulse/onnxruntime-qnn-linux-aarch64.
