
# Run LiteRT / TFLite models

LiteRT, formerly known as TensorFlow Lite, is Google's high-performance runtime for on-device AI. You can run existing quantized LiteRT models (in Python or C++) on the NPU on Dragonwing devices with a single line of code using the LiteRT delegates that are part of AI Engine Direct.

## Quantizing models

The NPU only supports uint8/int8 quantized models. Unsupported models or layers automatically fall back to the CPU. You can use [quantization-aware training](https://www.tensorflow.org/model_optimization/guide/quantization/training_comprehensive_guide) or [post-training quantization](https://ai.google.dev/edge/litert/models/post_training_quantization) to quantize your LiteRT models. Make sure you follow the steps for "Full integer quantization".

{% hint style="info" %}
**Don't want to quantize yourself?** You can download a range of pre-quantized models from [Qualcomm AI Hub](https://aihub.qualcomm.com), or use [Edge Impulse](/qc-ai-test-docs/running-building-ai-models/edge-impulse.md) to quantize new or existing models.
{% endhint %}
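
To make "full integer quantization" more concrete, here is a minimal NumPy sketch of the affine mapping that uint8/int8 tensors use. It is the same formula that the example script later on this page applies to its input and output tensors. The function names and the `scale`/`zero_point` values are illustrative assumptions, not values taken from any particular model.

```python
import numpy as np

def quantize(x, scale, zero_point, dtype=np.uint8):
    """Map float values to integers: q = round(x / scale + zero_point)."""
    info = np.iinfo(dtype)
    q = np.round(x / scale + zero_point)
    # Clip to the representable range of the target integer type
    return np.clip(q, info.min, info.max).astype(dtype)

def dequantize(q, scale, zero_point):
    """Map integers back to floats: x ~= (q - zero_point) * scale."""
    return (q.astype(np.float32) - zero_point) * scale

# Hypothetical parameters for an input in the 0..1 range
scale, zero_point = 1.0 / 255.0, 0

x = np.array([0.0, 0.25, 0.5, 1.0], dtype=np.float32)
q = quantize(x, scale, zero_point)
x_restored = dequantize(q, scale, zero_point)

print(q)                               # integer representation
print(np.abs(x - x_restored).max())    # round-trip error
```

Rounding to the nearest integer step loses at most about half a `scale` per value, which is one source of the small output differences between a float model and its quantized version.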

## Running a model on the NPU (Python)

To offload a model to the NPU, load the LiteRT delegate and pass it to the interpreter. For example:

```py
from ai_edge_litert.interpreter import Interpreter, load_delegate

qnn_delegate = load_delegate("libQnnTFLiteDelegate.so", options={"backend_type": "htp"})
interpreter = Interpreter(
    model_path=...,
    experimental_delegates=[qnn_delegate]
)
```
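
If the same script also has to run on machines without the QNN delegate, one option is to fall back to the CPU when loading fails. The sketch below is one possible approach, not an official API: `make_delegates` and its `load_delegate_fn` parameter are hypothetical names, and the set of exceptions caught is an assumption about how a missing or incompatible library surfaces.

```python
def make_delegates(use_npu, load_delegate_fn=None):
    """Return a list for `experimental_delegates`; an empty list means CPU only.

    `load_delegate_fn` is a hypothetical injection point so the helper can be
    exercised without the real library; by default the LiteRT loader is used.
    """
    if not use_npu:
        return []
    if load_delegate_fn is None:
        # Imported lazily so the helper itself has no hard dependency
        from ai_edge_litert.interpreter import load_delegate as load_delegate_fn
    try:
        return [load_delegate_fn("libQnnTFLiteDelegate.so",
                                 options={"backend_type": "htp"})]
    except (OSError, ValueError, RuntimeError) as err:
        # Assumption: a missing or incompatible library raises one of these
        print(f"Could not load QNN delegate, falling back to CPU: {err}")
        return []

# Demonstration with a stand-in loader that always fails
def broken_loader(path, options=None):
    raise OSError(f"{path}: cannot open shared object file")

print(make_delegates(use_npu=True, load_delegate_fn=broken_loader))  # warning, then []
```

On a device with the delegate installed, the result of `make_delegates(use_npu=True)` can be passed directly as `experimental_delegates` when constructing the `Interpreter`.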

## Running a model on the NPU (C++)

To offload a model to the NPU, you'll first need to add the following compiler and linker flags:

```makefile
CFLAGS += -I${QNN_SDK_ROOT}/include
LDFLAGS += -L${QNN_SDK_ROOT}/lib/aarch64-ubuntu-gcc9.4 -lQnnTFLiteDelegate
```

Then, you instantiate the LiteRT delegate and pass it to the LiteRT interpreter:

```cpp
// == Includes ==
#include "QNN/TFLiteDelegate/QnnTFLiteDelegate.h"

// == Application code ==

// Get your interpreter...
tflite::Interpreter *interpreter = ...;

// Create QNN Delegate options structure.
TfLiteQnnDelegateOptions options = TfLiteQnnDelegateOptionsDefault();

// Set the mandatory backend_type option. All other options have default values.
options.backend_type = kHtpBackend;

// Instantiate delegate. Must not be freed until interpreter is freed.
TfLiteDelegate* delegate = TfLiteQnnDelegateCreate(&options);

TfLiteStatus status = interpreter->ModifyGraphWithDelegate(delegate);
// Check that status == kTfLiteOk
```

## Example: Vision Transformers (Python)

Here's how you can run a Vision Transformer model (downloaded from [AI Hub](https://aihub.qualcomm.com/models/vit)) on both the CPU and the NPU using the LiteRT delegates.

Open a terminal on your development board (directly or over SSH), then:

1. Create a new venv, and install the LiteRT runtime and Pillow:

   ```bash
   mkdir -p litert-demo/
   cd litert-demo/

   python3 -m venv .venv
   source .venv/bin/activate
   pip3 install ai-edge-litert==1.3.0 Pillow
   ```
2. Create `inference_vit.py` and add:

   ```py
   import numpy as np, os, time, sys, urllib.request
   from ai_edge_litert.interpreter import Interpreter, load_delegate
   from PIL import Image

   use_npu = len(sys.argv) >= 2 and sys.argv[1] == '--use-npu'

   def download_file_if_not_exists(path, url):
       if not os.path.exists(path):
           os.makedirs(os.path.dirname(path), exist_ok=True)
           print(f"Downloading {path} from {url}...")
           urllib.request.urlretrieve(url, path)
       return path

   # Paths to your model, labels and test image (downloaded automatically)
   MODEL_PATH = download_file_if_not_exists('models/vit-vit-w8a8.tflite', 'https://cdn.edgeimpulse.com/qc-ai-docs/models/vit-vit-w8a8.tflite')
   LABELS_PATH = download_file_if_not_exists('models/vit-vit-labels.txt', 'https://cdn.edgeimpulse.com/qc-ai-docs/models/vit-vit-labels.txt')
   IMAGE_PATH = download_file_if_not_exists('images/boa-constrictor.jpg', 'https://cdn.edgeimpulse.com/qc-ai-docs/examples/boa-constrictor.jpg')

   # Parse labels file
   with open(LABELS_PATH, 'r') as f:
       labels = [line for line in f.read().splitlines() if line.strip()]

   # Use HTP backend of libQnnTFLiteDelegate.so (NPU) when --use-npu is passed in
   experimental_delegates = []
   if use_npu:
       experimental_delegates = [load_delegate("libQnnTFLiteDelegate.so", options={"backend_type": "htp"})]

   # Load TFLite model and allocate tensors
   interpreter = Interpreter(
       model_path=MODEL_PATH,
       experimental_delegates=experimental_delegates
   )
   interpreter.allocate_tensors()

   # Get input and output tensor details
   input_details = interpreter.get_input_details()
   output_details = interpreter.get_output_details()

   # Load, preprocess and quantize image
   def load_image(path, input_shape):
       # Expected input shape: [1, height, width, channels]
       _, height, width, channels = input_shape

       # Load image
       img = Image.open(path).convert("RGB").resize((width, height))
       img_np = np.array(img, dtype=np.float32)
       # !! Normalize... this model is 0..1 scaled (no further normalization); but that depends on your model !!
       img_np = img_np / 255
       # Add batch dim
       img_np = np.expand_dims(img_np, axis=0)

       scale, zero_point = input_details[0]['quantization']  # (scale, zero_point); scale==0.0 -> unquantized

       # Quantize input if needed
       if input_details[0]['dtype'] == np.float32:
           return img_np
       elif input_details[0]['dtype'] == np.uint8:
           # q = round(x/scale + zp)
           q = np.round(img_np / scale + zero_point)
           return np.clip(q, 0, 255).astype(np.uint8)
       elif input_details[0]['dtype'] == np.int8:
           # Commonly zero_point ≈ 0 (symmetric), but use provided zp anyway
           q = np.round(img_np / scale + zero_point)
           return np.clip(q, -128, 127).astype(np.int8)
       else:
           raise Exception('Unexpected dtype: ' + str(input_details[0]['dtype']))

   input_shape = input_details[0]['shape']
   input_data = load_image(IMAGE_PATH, input_shape)

   # Set tensor and run inference
   interpreter.set_tensor(input_details[0]['index'], input_data)

   # Run once to warmup
   interpreter.invoke()

   # Then run 10x
   start = time.perf_counter()
   for i in range(0, 10):
       interpreter.invoke()
   end = time.perf_counter()

   # Get prediction
   q_output = interpreter.get_tensor(output_details[0]['index'])
   scale, zero_point = output_details[0]['quantization']
   f_output = (q_output.astype(np.float32) - zero_point) * scale

   # Image classification models in AI Hub lack a Softmax() layer at the end of the model, so add it manually
   def softmax(x, axis=-1):
       # subtract max for numerical stability
       x_max = np.max(x, axis=axis, keepdims=True)
       e_x = np.exp(x - x_max)
       return e_x / np.sum(e_x, axis=axis, keepdims=True)

   # show top-5 predictions
   scores = softmax(f_output[0])
   top_k = scores.argsort()[-5:][::-1]
   print("\nTop-5 predictions:")
   for i in top_k:
       print(f"Class {labels[i]}: score={scores[i]}")

   print('')
   print(f'Inference took (on average): {((end - start) * 1000) / 10:.4g}ms. per image')
   ```
3. Run the model on the CPU:

   ```bash
   python3 inference_vit.py

   # INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
   #
   # Top-5 predictions:
   # Class boa constrictor: score=0.47895267605781555
   # Class rock python: score=0.15231961011886597
   # Class night snake: score=0.008282911032438278
   # Class switch: score=0.0025113984011113644
   # Class remote control: score=0.002394335111603141
   #
   # Inference took (on average): 391.1ms. per image
   ```
4. Run the model on the NPU:

   ```bash
   python3 inference_vit.py --use-npu

   # INFO: TfLiteQnnDelegate delegate: 1382 nodes delegated out of 1633 nodes with 27 partitions.
   #
   # INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
   #
   # Top-5 predictions:
   # Class boa constrictor: score=0.39735516905784607
   # Class rock python: score=0.22408385574817657
   # Class night snake: score=0.019640149548649788
   # Class eggnog: score=0.002774509834125638
   # Class cup: score=0.0019864204805344343
   #
   # Inference took (on average): 132.7ms. per image
   ```

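As you can see, this model runs roughly 3x faster on the NPU (132.7 ms versus 391.1 ms per image), but the output changes slightly: the top three classes stay the same while their scores shift. The delegate log also shows that not all layers can run on the NPU: "1382 nodes delegated out of 1633 nodes with 27 partitions" means the remaining 251 nodes fall back to the CPU.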
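
To compare runs across several models, it can help to pull these numbers out programmatically. The following sketch parses the delegate's summary line and computes the speedup from the two averages printed above. `parse_delegation` is a hypothetical helper, and it assumes the log line keeps the exact wording shown in the NPU run, which may differ between delegate versions.

```python
import re

def parse_delegation(log_line):
    """Extract (delegated, total, partitions) from the delegate's INFO line."""
    m = re.search(
        r"(\d+) nodes delegated out of (\d+) nodes with (\d+) partitions",
        log_line,
    )
    if m is None:
        raise ValueError("not a delegation summary line")
    return tuple(int(g) for g in m.groups())

# Log line and timings copied from the two runs above
log_line = ("INFO: TfLiteQnnDelegate delegate: 1382 nodes delegated "
            "out of 1633 nodes with 27 partitions.")
cpu_ms, npu_ms = 391.1, 132.7

delegated, total, partitions = parse_delegation(log_line)

print(f"{delegated / total:.1%} of nodes on the NPU, "
      f"{total - delegated} left on the CPU")   # 84.6% of nodes on the NPU, 251 left on the CPU
print(f"Speedup: {cpu_ms / npu_ms:.2f}x")        # Speedup: 2.95x
```

A high partition count means execution switches between NPU and CPU many times per inference, so the node percentage alone does not predict the speedup.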


