LLMs using Genie

A select number of Large Language Models (LLMs) and Vision Language Models (VLMs) can run on the NPU on your Dragonwing development board using the Qualcomm Gen AI Inference Extensions (Genie). These models have been ported and optimized by Qualcomm to run as efficiently as possible on this hardware. Genie only supports a subset of manually ported models, so if your favourite model is not listed, see Run LLMs / VLMs using llama.cpp to run models on the GPU as a fallback.


Installing AI Runtime SDK - Community Edition

First install the AI Runtime SDK - Community Edition. Open a terminal on your development board (or an ssh session to it) and run:

# Install the SDK
wget -qO- https://cdn.edgeimpulse.com/qc-ai-docs/device-setup/install_ai_runtime_sdk.sh | bash

# Use the SDK in your current session
source ~/.bash_profile

Finding supported models

Genie-compatible LLM models can be found in a few places:

  • Aplux model zoo:

    1. Under 'Chipset', select:

      • RB3 Gen 2 Vision Kit: 'Qualcomm QCS6490'

      • RUBIK Pi 3: 'Qualcomm QCS6490'

      • IQ-9075 EVK: 'Qualcomm QCS9075'

    2. Under 'NLP', select "Text Generation".

  • Qualcomm AI Hub:

    1. Under 'Chipset', select:

      • RB3 Gen 2 Vision Kit: 'Qualcomm QCS6490 (Proxy)'

      • RUBIK Pi 3: 'Qualcomm QCS6490 (Proxy)'

      • IQ-9075 EVK: 'Qualcomm QCS9075 (Proxy)'

    2. Under 'Domain/Use Case', select "Generative AI".

As an example, let's deploy the Qwen2.5-0.5B-Instruct model, which runs on QCS6490-based Dragonwing development boards like the RUBIK Pi 3 and the RB3 Gen 2 Vision Kit.

Running Qwen2.5-0.5B-Instruct

When you download a model you'll need 3 files (typically bundled together):

  • The quantized model, as one or more QNN context binaries (*.bin).

  • The tokenizer (tokenizer.json).

  • The Genie configuration file (genie_config.json), which tells Genie how to load and run the model.

Let's grab all of these and run Qwen2.5-0.5B-Instruct. Open a terminal on your development board (or an ssh session to it), and:

  1. Download the model onto your development board. Either:

    • Download the model from our CDN (only available for the Qwen model):
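
      A sketch assuming a hypothetical CDN path; use the exact link from these docs:

      # Hypothetical URL for illustration - replace with the actual CDN link
      wget https://cdn.edgeimpulse.com/qc-ai-docs/models/qwen2.5-0.5b-instruct.zip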

    • Download the model from Aplux model zoo:

      1. Sign up for an Aplux account.

      2. Under 'Device', select the QCS6490.

      3. Click "Download Model & Test code".

        Downloading Genie-compatible models for the QCS6490
      4. After downloading, push the ZIP file to your development board over ssh:

        1. Find the IP address of your development board. Run on your development board:
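
          For example, using the standard ip tool:

          # Print the board's IPv4 addresses
          ip -4 addr show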

        2. Push the .zip file. Run from your computer:
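
          For example with scp; the file name, user name and IP address below are placeholders:

          # Run from your computer, replacing the placeholders with your own values
          scp qwen2.5-0.5b-instruct.zip ubuntu@192.168.1.253:~/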

  2. Unzip the model. From your development board:
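
    A sketch; the archive and directory names depend on the bundle you downloaded:

    # Extract the model bundle (names are illustrative)
    unzip qwen2.5-0.5b-instruct.zip -d qwen2.5-0.5b-instruct
    cd qwen2.5-0.5b-instruct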

  3. Run your model:
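
    A sketch using the genie-t2t-run CLI from the AI Runtime SDK. The config file name depends on your bundle, and Qwen2.5 expects its ChatML-style prompt markers:

    # Config file name and prompt format depend on the downloaded bundle
    genie-t2t-run -c genie_config.json \
        -p "<|im_start|>user\nWhat is the capital of the Netherlands?<|im_end|>\n<|im_start|>assistant\n"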

Great! You now have this LLM running under Genie.

Serving a UI or API through QAI AppBuilder

To use Genie models from your application you can use the QAI AppBuilder repository. The AppBuilder repo provides both an OpenAI-compatible chat completions API and a web UI to interact with your model (just like llama.cpp).

  1. Install the AppBuilder:
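
    A minimal sketch; follow the installation steps in the QAI AppBuilder README for your device:

    # Clone the AppBuilder repository, then follow its README for Qualcomm Linux
    git clone https://github.com/quic/ai-engine-direct-helper.git
    cd ai-engine-direct-helper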

  2. Run the Web UI (from the samples/ directory):
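
    The script path below is an assumption; check the samples/ directory for the exact name in your release:

    # Run from the samples/ directory of the AppBuilder repo
    # Hypothetical script path for the Genie web UI
    python genie/python/GenieWebUI.py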

    Now open http://192.168.1.253:8976 (replace with your development board's IP address) in a web browser on your computer to interact with the model. Make sure to select the model first using the "models" dropdown.

    ai-engine-direct-helper WebUI demo
  3. You can also programmatically access this server using the OpenAI Chat Completions API. E.g. from Python:

    1. Start the server (from the samples/ directory):
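
      The script path below is an assumption; check the samples/ directory for the exact name in your release:

      # Run from the samples/ directory of the AppBuilder repo
      # Hypothetical script path for the Genie API service
      python genie/python/GenieAPIService.py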

    2. From a new terminal, create a new venv and install requests:
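
      For example:

      # Create an isolated environment for the client script
      python3 -m venv .venv
      source .venv/bin/activate
      pip install requests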

    3. Create a new file chat.py:
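
      A minimal sketch using requests against the OpenAI-compatible endpoint. The host, port and model name are assumptions: use your board's IP, the port the API service prints, and the model you loaded.

      import requests

      # Host and port are assumptions: use your board's IP and the API service's port
      API_URL = "http://192.168.1.253:8976/v1/chat/completions"

      resp = requests.post(API_URL, json={
          "model": "qwen2.5-0.5b-instruct",  # assumption: the model your server serves
          "messages": [
              {"role": "user", "content": "What is the capital of the Netherlands?"},
          ],
      })
      resp.raise_for_status()

      # Print the assistant's reply from the standard chat completions response shape
      print(resp.json()["choices"][0]["message"]["content"])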

    4. Run chat.py:
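
      # From the venv created in the previous step
      python chat.py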

      (The model name in the response always seems to be 'IBM-Granite'; you can disregard this.)

Tips and tricks

Downloading files from HuggingFace that require authentication

If you want to download files that require permission or authentication, e.g. the tokenizer.json file from Llama-3.2-1B-Instruct:

  1. Go to the model page on HuggingFace, sign in (or sign up), and fill in the form to get access to the model.

  2. Create a new HuggingFace access token with 'Read' permissions at https://huggingface.co/settings/tokens, and configure it on your development board:
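
    One way is to export the token in your shell profile (the token value below is a placeholder):

    # Replace hf_... with your own access token
    echo 'export HF_TOKEN=hf_...' >> ~/.bash_profile
    source ~/.bash_profile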

  3. Once you're granted access, you can download the tokenizer:
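
    For example with wget, using the token set above (the URL follows HuggingFace's resolve/main pattern):

    # Authenticated download of tokenizer.json from the model repository
    wget --header "Authorization: Bearer $HF_TOKEN" \
        "https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/resolve/main/tokenizer.json"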
