# Run context binaries (.bin/.dlc)
Some models from AI Hub (and from internal Qualcomm releases) are released as context binaries (`.bin` files). Context binaries contain the model plus hardware optimizations, and can be run with Qualcomm tools that directly use the Qualcomm® AI Runtime SDK. Examples are Genie (to run LLMs) and VoiceAI ASR (to run voice transcription), but you can also run context binaries directly from Python using QAI AppBuilder.
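In a nutshell, the QAI AppBuilder flow is: configure the runtime, load the context binary, run inference. A minimal sketch (the model path and input shape here are placeholders; the full Inception-v3 walkthrough below uses real values):

```python
# Minimal sketch of running a context binary with QAI AppBuilder.
# 'my_model.bin' and the input shape are placeholders - see the full example below.
import numpy as np
from qai_appbuilder import QNNConfig, QNNContext, Runtime, LogLevel, ProfilingLevel

# Point the runtime at the QNN libraries and select the NPU (HTP) backend
QNNConfig.Config('/usr/lib', Runtime.HTP, LogLevel.WARN, ProfilingLevel.BASIC)

# Load the context binary, then run inference on a preprocessed float32 tensor
ctx = QNNContext('my_model', 'my_model.bin')
output = ctx.Inference(np.zeros((1, 224, 224, 3), dtype=np.float32))[0]
```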
## Finding supported models
Models in context binary format can be found in a few places:
* Qualcomm AI Hub (note that these come in `.dlc` format - you'll need to convert them to `.bin` files, see below):
    * Under 'Chipset', select:
        * RB3 Gen 2 Vision Kit: 'Qualcomm QCS6490 (Proxy)'
        * RUBIK Pi 3: 'Qualcomm QCS6490 (Proxy)'
        * IQ-9075 EVK: 'Qualcomm QCS9075 (Proxy)'
    * Under 'Runtime', select "Qualcomm® AI Runtime".
* Aplux model zoo (these come as ready-to-use `.bin` files):
    * Under 'Chipset', select:
        * RB3 Gen 2 Vision Kit: 'Qualcomm QCS6490'
        * RUBIK Pi 3: 'Qualcomm QCS6490'
        * IQ-9075 EVK: 'Qualcomm QCS9075'
Note that the NPU only supports quantized models. Floating point models (or layers) will automatically fall back to the CPU.
## From .dlc -> .bin
If your model comes in `.dlc` format (a portable serialized format), you'll first need to convert it to a context binary (`.bin` file). You do this with `qnn-context-binary-generator` (part of the AI Runtime SDK). Open a terminal on your development board, or an SSH session to your development board, and:
Install the AI Runtime SDK - Community Edition:
```bash
# Install the SDK
wget -qO- https://cdn.edgeimpulse.com/qc-ai-docs/device-setup/install_ai_runtime_sdk.sh | bash

# Use the SDK in your current session
source ~/.bash_profile
```

Convert the `.dlc` file into a `.bin` file:

```bash
qnn-context-binary-generator \
    --dlc_path=./inception_v3-inception-v3-w8a8.dlc \
    --output_dir=output/ \
    --backend=/usr/lib/libQnnHtp.so \
    --model=/usr/lib/libQnnModelDlc.so \
    --binary_file=inception_v3-inception-v3-w8a8.bin
```
You should now have `output/inception_v3-inception-v3-w8a8.bin`. Note that this file is not portable: it's tied to both the AI Engine Direct SDK version and your hardware target.
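If you have several `.dlc` files to convert, it's easy to wrap the same command in a script. A sketch (assuming the SDK is installed as above, so `qnn-context-binary-generator` is on your PATH; the `models/` folder is a hypothetical example):

```python
# Batch-convert every .dlc file in a folder to a context binary (.bin).
# Assumes qnn-context-binary-generator is on the PATH and the QNN
# libraries live in /usr/lib (as set up in the steps above).
import pathlib, subprocess

for dlc in pathlib.Path('models').glob('*.dlc'):  # 'models/' is a hypothetical folder
    subprocess.run([
        'qnn-context-binary-generator',
        f'--dlc_path={dlc}',
        '--output_dir=output/',
        '--backend=/usr/lib/libQnnHtp.so',
        '--model=/usr/lib/libQnnModelDlc.so',
        f'--binary_file={dlc.stem}.bin',
    ], check=True)
```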
## Troubleshooting
If conversion fails (e.g. with `Failed to create dlc handle with code 1002 for dlc file`), there might be a discrepancy between the QNN version that created the `.dlc` file and the QNN version on your development board. Run:
```bash
# Find QNN version for DLC file
strings ./inception_v3-inception-v3-w8a8.dlc | egrep -i 'qnn|qairt|version' | head
# "tool": "qairt-converter",
# "converterVersion": "2.37.0.250724175447_124859",

# Find QNN version on your development board
strings /usr/lib/libQnnHtp.so | grep AISW_VERSION
# AISW_VERSION: 2.35.0
```

Here the file was created by QAIRT 2.37.0, but the development board runs 2.35.0. Ask the person that gave you the `.dlc` file for a version tied to QNN 2.35.0 instead.
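You can also run the same check without `strings`. A small Python sketch that scans a file for printable runs mentioning QNN/QAIRT versions, mirroring the pipeline above:

```python
# Pure-Python equivalent of: strings <file> | egrep -i 'qnn|qairt|version'
import re

def version_strings(path, max_results=10):
    data = open(path, 'rb').read()
    # Find runs of 4+ printable ASCII characters, like the strings tool does
    candidates = re.findall(rb'[ -~]{4,}', data)
    hits = [s.decode() for s in candidates
            if re.search(rb'qnn|qairt|version', s, re.IGNORECASE)]
    return hits[:max_results]

print(version_strings('./inception_v3-inception-v3-w8a8.dlc'))  # converter version
print(version_strings('/usr/lib/libQnnHtp.so'))                 # AISW_VERSION
```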
## Example: Inception-v3 (Python)
Here's how you can run an image classification model (downloaded from the Aplux model zoo) on the NPU using QAI AppBuilder. Open a terminal on your development board, or an SSH session to your development board, and:
Build the AppBuilder wheel with QNN bindings:
```bash
# Build dependency
sudo apt update && sudo apt install -y yq

# Clone the repository
git clone https://github.com/quic/ai-engine-direct-helper
cd ai-engine-direct-helper
git submodule update --init --recursive
git checkout v2.38.0

# Create a new venv
python3.12 -m venv .venv
source .venv/bin/activate

# Build the wheel
pip3 install setuptools
python setup.py bdist_wheel

# Deactivate the venv
deactivate

export APPBUILDER_WHEEL=$PWD/dist/qai_appbuilder-2.38.0-cp312-cp312-linux_aarch64.whl
```

Now create a new folder for the application:
```bash
mkdir -p ~/context-binary-demo
cd ~/context-binary-demo

# Create a new venv
python3.12 -m venv .venv
source .venv/bin/activate

# Install the QAI AppBuilder plus some other dependencies
pip3 install $APPBUILDER_WHEEL
pip3 install numpy==2.3.3 Pillow==11.3.0
```
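To sanity-check the install before writing any code, you can query the wheel's version through the standard library (an optional check; run it inside the activated venv):

```python
# Verify that the qai_appbuilder wheel is importable, and print its version
from importlib.metadata import version

import qai_appbuilder  # raises ImportError if the wheel is not installed
print('qai_appbuilder', version('qai_appbuilder'))
```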
Create a new file `context_demo.py` and add:

```python
import os, urllib.request, time, numpy as np
from qai_appbuilder import (QNNContext, Runtime, LogLevel, ProfilingLevel, PerfProfile, QNNConfig)
from PIL import Image

def download_file_if_not_exists(path, url):
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        print(f"Downloading {path} from {url}...")
        urllib.request.urlretrieve(url, path)
    return path

# Paths to your model/labels/test image (will be downloaded automatically)
MODEL_PATH = download_file_if_not_exists('models/inception_v3_w8a8.qcs6490.qnn216.ctx.bin',
    'https://cdn.edgeimpulse.com/qc-ai-docs/models/inception_v3_w8a8.qcs6490.qnn216.ctx.bin')
LABELS_PATH = download_file_if_not_exists('models/inception_v3_labels.txt',
    'https://cdn.edgeimpulse.com/qc-ai-docs/models/inception_v3_labels.txt')
IMAGE_PATH = download_file_if_not_exists('images/samoyed-square.jpg',
    'https://cdn.edgeimpulse.com/qc-ai-docs/example-images/samoyed-square.jpg')

# Parse labels
with open(LABELS_PATH, 'r') as f:
    labels = [line for line in f.read().splitlines() if line.strip()]

# Set up the QNN config (/usr/lib => where all QNN libraries are installed)
QNNConfig.Config('/usr/lib', Runtime.HTP, LogLevel.WARN, ProfilingLevel.BASIC)

# Create a new context (name, path to .bin file)
ctx = QNNContext(os.path.basename(MODEL_PATH), MODEL_PATH)

# Load and preprocess image, input is scaled 0..1 (f32), no need to quantize yourself
def load_image(path, input_shape):
    # Expected input shape: [1, height, width, channels]
    _, height, width, channels = input_shape

    # expects unquantized input 0..1
    img = Image.open(path).convert("RGB").resize((width, height))
    img_np = np.array(img, dtype=np.float32)
    img_np = img_np / 255

    # add batch dim
    img_np = np.expand_dims(img_np, axis=0)
    return img_np

# As far as I know you cannot find the input shape from the .bin file...
# Resolution found at https://aiot.aidlux.com/en/models/detail/9?name=inception&precisionShow=1&soc=2
input_data = load_image(IMAGE_PATH, (1, 299, 299, 3))

# Run inference once to warm up
f_output = ctx.Inference(input_data)[0]

# Then run 10x
start = time.perf_counter()
for i in range(0, 10):
    f_output = ctx.Inference(input_data)[0]
end = time.perf_counter()

# Image classification models in AI Hub lack a Softmax() layer at the end of the model, so apply it manually
def softmax(x, axis=-1):
    # subtract max for numerical stability
    x_max = np.max(x, axis=axis, keepdims=True)
    e_x = np.exp(x - x_max)
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

# Show top-5 predictions
scores = softmax(f_output)
top_k = scores.argsort()[-5:][::-1]

print("\nTop-5 predictions:")
for i in top_k:
    print(f"Class {labels[i]}: score={scores[i]}")

print('')
print(f'Inference took (on average): {((end - start) * 1000) / 10:.4g}ms. per image')
```

Run the example:
```bash
python3 context_demo.py

# Top-5 predictions:
# Class Samoyed: score=0.9563993215560913
# Class Arctic fox: score=0.0022275811061263084
# Class Pomeranian: score=0.001584612182341516
# Class chow: score=0.0008018668740987778
# Class keeshond: score=0.0006536662694998085
#
# Inference took (on average): 12.27ms. per image
```
Great! You have now run a model in context binary format on the NPU.