Run LiteRT / TFLite models

LiteRT, formerly known as TensorFlow Lite, is Google's high-performance runtime for on-device AI. You can run existing quantized LiteRT models (in Python or C++) on the NPU of Dragonwing devices with a single line of code, using the LiteRT delegates that are part of AI Engine Direct.

Quantizing models

The NPU only supports uint8/int8 quantized models. Unsupported models or unsupported layers automatically fall back to the CPU. You can use quantization-aware training or post-training quantization to quantize your LiteRT models. Make sure you follow the steps for "Full integer quantization".
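If you're quantizing a model yourself, post-training full integer quantization with the TensorFlow converter looks roughly like the sketch below. The SavedModel path, output file name, and the representative dataset (which should yield a few hundred real samples in your model's input shape) are placeholders you'd replace with your own:

import numpy as np
import tensorflow as tf

# Placeholder: yield real, preprocessed samples in your model's input shape
def representative_dataset():
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("my_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset

# Force full integer quantization so the ops are eligible for the NPU
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("my_model_int8.tflite", "wb") as f:
    f.write(converter.convert())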

Don't want to quantize yourself? You can download a range of pre-quantized models from Qualcomm AI Hub, or use Edge Impulse to quantize new or existing models.

Running a model on the NPU (Python)

To offload a model to the NPU, you just need to load the LiteRT delegate and pass it to the interpreter. For example:

from ai_edge_litert.interpreter import Interpreter, load_delegate

# Load the QNN delegate and target the HTP backend (the NPU)
qnn_delegate = load_delegate("libQnnTFLiteDelegate.so", options={"backend_type": "htp"})

# Pass the delegate to the interpreter; supported ops run on the NPU,
# anything unsupported falls back to the CPU
interpreter = Interpreter(
    model_path=...,
    experimental_delegates=[qnn_delegate]
)
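From here, inference works the same as with any other LiteRT interpreter. A minimal sketch that continues from the interpreter created above (the all-zeros input is a placeholder for real, correctly preprocessed data):

import numpy as np

interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Placeholder input: replace with real data in the model's input shape and dtype
input_data = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], input_data)

interpreter.invoke()
output = interpreter.get_tensor(output_details[0]["index"])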

Running a model on the NPU (C++)

To offload a model to the NPU, you'll first need to add the following compile flags:

Then, you instantiate the LiteRT delegate and pass it to the LiteRT interpreter:

Example: Vision Transformers (Python)

Here's how you can run a Vision Transformer model (downloaded from AI Hub) on both the CPU and the NPU using the LiteRT delegates.

Open a terminal on your development board (or an SSH session to it), and:

  1. Create a new venv, and install the LiteRT runtime and Pillow:

  2. Create inference_vit.py and add the inference code (a sketch of such a script follows this list):

  3. Run the model on the CPU:

  4. Run the model on the NPU:
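The exact contents of inference_vit.py depend on the model file you downloaded from AI Hub. A hypothetical sketch, assuming the model is saved as vit.tflite, a test image named test.jpg, and an illustrative --use-npu flag to toggle the delegate (all of these names are placeholders):

import argparse
import time
import numpy as np
from PIL import Image
from ai_edge_litert.interpreter import Interpreter, load_delegate

parser = argparse.ArgumentParser()
parser.add_argument("--use-npu", action="store_true", help="Offload supported ops to the NPU")
args = parser.parse_args()

# Only load the QNN delegate when targeting the NPU
delegates = []
if args.use_npu:
    delegates.append(load_delegate("libQnnTFLiteDelegate.so", options={"backend_type": "htp"}))

# Placeholder model file name: use the .tflite file you downloaded from AI Hub
interpreter = Interpreter(model_path="vit.tflite", experimental_delegates=delegates)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Resize the (placeholder) test image to the model's input resolution;
# for int8 inputs you may also need to apply the input quantization parameters
_, height, width, _ = input_details[0]["shape"]
img = Image.open("test.jpg").convert("RGB").resize((width, height))
input_data = np.expand_dims(np.array(img), axis=0).astype(input_details[0]["dtype"])

interpreter.set_tensor(input_details[0]["index"], input_data)
start = time.perf_counter()
interpreter.invoke()
print(f"Inference took {(time.perf_counter() - start) * 1000:.1f} ms")

output = interpreter.get_tensor(output_details[0]["index"])
print("Top class index:", int(np.argmax(output)))

With a script like this, step 3 becomes python inference_vit.py and step 4 becomes python inference_vit.py --use-npu (again, the file names and flag are placeholders).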

As you can see, this model runs significantly faster on the NPU, but there's a slight change in the model's output. You can also see that not all layers of this model can run on the NPU ("1382 nodes delegated out of 1633 nodes with 27 partitions").
