# LLMs using Genie

A limited set of Large Language Models (LLMs) and Vision Language Models (VLMs) can run on the NPU of your Dragonwing development board using the [Qualcomm Gen AI Inference Extensions (Genie)](https://www.qualcomm.com/developer/software/gen-ai-inference-extensions). Qualcomm has ported and optimized these models to run as efficiently as possible on the hardware. Genie only supports this subset of manually ported models, so if your favourite model is not listed, see [Run LLMs / VLMs using llama.cpp](https://qc-ai-test.gitbook.io/qc-ai-test-docs/running-building-ai-models/llama-cpp) to run models on the GPU as a fallback.

{% hint style="warning" %}
**Not supported on IQ-9075 EVK:** There are no available models yet for the IQ-9075 EVK. Use [llama.cpp](https://qc-ai-test.gitbook.io/qc-ai-test-docs/running-building-ai-models/llama-cpp) instead.
{% endhint %}

## Installing AI Runtime SDK - Community Edition

First install the [AI Runtime SDK - Community Edition](https://softwarecenter.qualcomm.com/catalog/item/Qualcomm_AI_Runtime_Community). Open the terminal on your development board, or an ssh session to your development board, and run:

```bash
# Install the SDK
wget -qO- https://cdn.edgeimpulse.com/qc-ai-docs/device-setup/install_ai_runtime_sdk.sh | bash

# Use the SDK in your current session
source ~/.bash_profile
```
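
To confirm that the SDK tools are visible in your current session, you can look for the `genie-t2t-run` binary (used later on this page) on your `PATH`. The snippet below is a minimal sketch that assumes the install script places this tool on the `PATH`; the helper name `find_tool` is illustrative and not part of the SDK.

```python
import shutil
from typing import Optional


def find_tool(name: str = "genie-t2t-run") -> Optional[str]:
    """Return the full path of `name` if it is on the PATH, else None."""
    return shutil.which(name)


path = find_tool()
if path:
    print(f"genie-t2t-run found at {path}")
else:
    # Assumption: the SDK environment is loaded from ~/.bash_profile.
    print("genie-t2t-run not found; try `source ~/.bash_profile` first")
```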

## Finding supported models

Genie-compatible LLMs can be found in two places:

* [Aplux model zoo](https://aiot.aidlux.com/en/models):
  1. Under 'Chipset', select:
     * RB3 Gen 2 Vision Kit: 'Qualcomm QCS6490'
     * RUBIK Pi 3: 'Qualcomm QCS6490'
     * IQ-9075 EVK: 'Qualcomm QCS9075'
  2. Under 'NLP', select "Text Generation".
* [Qualcomm AI Hub](https://aihub.qualcomm.com/models?domain=Generative+AI):
  1. Under 'Chipset', select:
     * RB3 Gen 2 Vision Kit: 'Qualcomm QCS6490 (Proxy)'
     * RUBIK Pi 3: 'Qualcomm QCS6490 (Proxy)'
     * IQ-9075 EVK: 'Qualcomm QCS9075 (Proxy)'
  2. Under 'Domain/Use Case', select "Generative AI".

As an example, let's deploy the [Qwen2.5-0.5B-Instruct](https://aiot.aidlux.com/en/models/detail/149?modelType=9\&soc=2) model, which runs on QCS6490-based Dragonwing development boards like the RUBIK Pi 3 and RB3 Gen 2 Vision Kit.

## Running Qwen2.5-0.5B-Instruct

To run a model you need three kinds of files:

* One or more `*.serialized.bin` files - these contain the weights of the model.
* `tokenizer.json` - a serialized configuration file that defines how text is split into tokens, mapping between characters, subwords, and their integer IDs used by an LLM. These can typically be downloaded from the model space on HuggingFace. A list of links for Genie-supported models is on [quic/ai-hub-apps: LLM On-Device Deployment > Prepare Genie configs](https://github.com/quic/ai-hub-apps/tree/main/tutorials/llm_on_genie).
* A Genie config file - with instructions on how to run this model through Genie. These can be found on GitHub for models in AI Hub: [quic/ai-hub-apps: tutorials/llm\_on\_genie/configs/genie](https://github.com/quic/ai-hub-apps/tree/main/tutorials/llm_on_genie/configs/genie).
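
Once a model has been unpacked into a directory, a small script can check that all three kinds of files are present. The following is a minimal sketch, not part of Genie: the helper names `check_model_dir` and `missing_kinds` are hypothetical, and it assumes that the tokenizer file name ends in `tokenizer.json` and that any other `.json` file in the directory is a Genie config.

```python
from pathlib import Path
from typing import Dict, List


def check_model_dir(model_dir: str) -> Dict[str, List[str]]:
    """Group the files in model_dir by the three kinds of files Genie needs."""
    files = [p.name for p in Path(model_dir).iterdir() if p.is_file()]
    weights = sorted(f for f in files if f.endswith(".serialized.bin"))
    tokenizers = sorted(f for f in files if f.lower().endswith("tokenizer.json"))
    # Assumption: every other .json file is treated as a Genie config.
    configs = sorted(
        f for f in files if f.endswith(".json") and f not in tokenizers
    )
    return {"weights": weights, "tokenizer": tokenizers, "config": configs}


def missing_kinds(found: Dict[str, List[str]]) -> List[str]:
    """Return the kinds of files for which nothing was found."""
    return [kind for kind, names in found.items() if not names]
```

Calling `missing_kinds(check_model_dir(path))` on a complete model directory returns an empty list; otherwise it names the kinds that still need to be downloaded.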

Let's grab all of these and run Qwen2.5-0.5B-Instruct. Open the terminal on your development board, or an ssh session to your development board, and follow these steps:

1. Download the model onto your development board. Either:
   * Download the model from our CDN (only available for this Qwen model):

     ```bash
     wget -O qnn229_qcs6490_cl4096.zip https://cdn.edgeimpulse.com/qc-ai-docs/models/qwen2.5_0.5b_instruct_aplux_qnn229_qcs6490_cl4096.zip
     ```
   * Download the model from Aplux model zoo:
     1. Go to the [Aplux model zoo: Qwen2.5-0.5B-Instruct](https://aiot.aidlux.com/en/models/detail/149?modelType=9\&soc=2).
     2. Sign up for an Aplux account.
     3. Under 'Device', select the QCS6490.
     4. Click "Download Model & Test code".

        <figure><img src="https://3580193864-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxM5xrbdbelLSl7uN8oac%2Fuploads%2Fgit-blob-50d65ed63d098563fd3b18b0ed09066ca758b00f%2Faplux1.png?alt=media" alt="Downloading Genie-compatible models for the QCS6490"><figcaption><p>Downloading Genie-compatible models for the QCS6490</p></figcaption></figure>
     5. After downloading, push the ZIP file to your development board over ssh:
        1. Find the IP address of your development board. Run on your development board:

           ```bash
           ifconfig | grep -Eo 'inet (addr:)?([0-9]*\.){3}[0-9]*' | grep -Eo '([0-9]*\.){3}[0-9]*' | grep -v '127.0.0.1'

           # ... Example:
           # 192.168.1.253
           ```
        2. Push the .zip file. Run from your computer:

           ```bash
           scp qnn229_qcs6490_cl4096.zip ubuntu@192.168.1.253:~/qnn229_qcs6490_cl4096.zip
           ```
2. Unzip the model. From your development board:

   ```bash
   mkdir -p ~/genie-models/
   unzip -d ~/genie-models/qwen2.5-0.5b-instruct/ qnn229_qcs6490_cl4096.zip
   rm qnn229_qcs6490_cl4096.zip
   ```
3. Run your model:

   ```bash
   cd ~/genie-models/qwen2.5-0.5b-instruct/

   genie-t2t-run -c ./qwen2.5-0.5b-instruct-htp.json -p '<|im_start|>system
       You are Qwen, created by Alibaba Cloud. You are a helpful assistant that responds in English.<|im_end|><|im_start|>user
       What is the capital of the Netherlands?<|im_end|><|im_start|>assistant'

   # Using libGenie.so version 1.9.0
   #
   # [BEGIN]:
   # The capital of the Netherlands is Amsterdam.[END]
   ```

Great! You now have this LLM running under Genie.
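
The prompt passed to `genie-t2t-run` wraps each message in `<|im_start|>` and `<|im_end|>` tags. If you want to build such prompts from a script, the sketch below shows one way to do it. The helper names `build_qwen_prompt` and `genie_command` are hypothetical, and the newline placement follows the `prompt.conf` layout used later on this page, which differs slightly from the whitespace in the command above; treat the exact whitespace as an assumption.

```python
from typing import List

DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def build_qwen_prompt(user_message: str, system_message: str = DEFAULT_SYSTEM) -> str:
    """Wrap a system and a user message in Qwen's chat tags."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


def genie_command(config_path: str, prompt: str) -> List[str]:
    """Build the argument list for the genie-t2t-run call shown above."""
    return ["genie-t2t-run", "-c", config_path, "-p", prompt]


print(build_qwen_prompt("What is the capital of the Netherlands?"))
```

On the development board, the resulting argument list can be passed to `subprocess.run(genie_command("./qwen2.5-0.5b-instruct-htp.json", prompt), check=True)` from inside the model directory.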

## Serving a UI or API through QAI AppBuilder

To use Genie models from your application you can use the [QAI AppBuilder](https://github.com/quic/ai-engine-direct-helper) repository. The AppBuilder repo provides both an OpenAI-compatible chat completion API and a Web UI to interact with your model (just like [llama.cpp](https://qc-ai-test.gitbook.io/qc-ai-test-docs/running-building-ai-models/llama-cpp)).

{% hint style="warning" %}
**Heavy development:** The AppBuilder is under heavy development. We've tried to pin the versions as much as we can, but using newer versions of the AppBuilder might not work with the instructions below.
{% endhint %}

1. Install the AppBuilder:

   ```bash
   # Build dependency (yq), plus jq, which is used below to patch the config
   sudo apt update && sudo apt install -y yq jq

   # Clone the repository
   git clone https://github.com/quic/ai-engine-direct-helper
   cd ai-engine-direct-helper
   git checkout 92d9cad
   git submodule update --init --recursive

   # Create a new venv
   python3 -m venv .venv
   source .venv/bin/activate

   # Build the wheel
   pip3 install setuptools
   python setup.py bdist_wheel
   pip3 install ./dist/qai_appbuilder-*-linux_aarch64.whl

   # Install other dependencies
   pip3 install \
       uvicorn==0.35.0 \
       pydantic_settings==2.10.1 \
       fastapi==0.116.1 \
       langchain==0.3.27 \
       langchain-core==0.3.75 \
       langchain-community==0.3.29 \
       sse_starlette==3.0.2 \
       pypdf==6.0.0 \
       python-pptx==1.0.2 \
       docx2txt==0.9 \
       openai==1.107.0 \
       json-repair==0.50.1 \
       qai_hub==0.36.0 \
       py3_wget==1.0.13 \
       torch==2.8.0 \
       transformers==4.56.1 \
       gradio==5.44.1 \
       diffusers==0.35.1

   # Directory where you unzipped the weights and config files earlier
   WEIGHTS_DIR=~/genie-models/qwen2.5-0.5b-instruct/
   MODEL_NAME=qwen2_5-0_5b-instruct

   # Create a new directory and link the files
   mkdir -p samples/genie/python/models/$MODEL_NAME
   cd samples/genie/python/models/$MODEL_NAME

   # Patch up config
   cp $WEIGHTS_DIR/*instruct-htp.json config.json
   jq --arg pwd "$PWD" '.dialog.tokenizer.path |= if startswith($pwd + "/") then . else $pwd + "/" + . end' config.json > tmp && mv tmp config.json
   jq --arg pwd "$PWD" '.dialog.engine.backend.extensions |= if startswith($pwd + "/") then . else $pwd + "/" + . end' config.json > tmp && mv tmp config.json
   jq --arg pwd "$PWD" '.dialog.engine.model.binary["ctx-bins"] |= map(if startswith($pwd + "/") then . else $pwd + "/" + . end)' config.json > tmp && mv tmp config.json

   # Symlink other files
   ln -s $WEIGHTS_DIR/*.json .
   ln -s $WEIGHTS_DIR/*okenizer.json tokenizer.json
   ln -s $WEIGHTS_DIR/*.serialized.bin .
   echo "prompt_tags_1: <|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nGive me a short introduction to large language model.
   prompt_tags_2: <|im_end|>\n<|im_start|>assistant\n" > prompt.conf

   # Navigate back to samples/ directory
   cd ../../../..

   # Create empty tokenizer files, otherwise they will be downloaded... (which will fail)
   if [ ! -f genie/python/models/Phi-3.5-mini/tokenizer.json ]; then
       echo '{}' > genie/python/models/Phi-3.5-mini/tokenizer.json
   fi
   if [ ! -f genie/python/models/IBM-Granite-v3.1-8B/tokenizer.json ]; then
       echo '{}' > genie/python/models/IBM-Granite-v3.1-8B/tokenizer.json
   fi
   ```
2. Run the Web UI (from the `samples/` directory):

   ```bash
   # Find the IP address of your development board
   ifconfig | grep -Eo 'inet (addr:)?([0-9]*\.){3}[0-9]*' | grep -Eo '([0-9]*\.){3}[0-9]*' | grep -v '127.0.0.1'

   # ... Example:
   # 192.168.1.253

   # Run the Web UI
   python webui/GenieWebUI.py
   ```

   Now open <http://192.168.1.253:8976> (replace with your IP) in your web browser (on your computer) to interact with the model. Make sure to select the model first using the "models" dropdown.

   <figure><img src="https://3580193864-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxM5xrbdbelLSl7uN8oac%2Fuploads%2Fgit-blob-d22ee694bd20caf302bda5cc6c697f3c8575ada4%2Fgenie-webui.png?alt=media" alt="ai-engine-direct-helper WebUI demo"><figcaption><p>ai-engine-direct-helper WebUI demo</p></figcaption></figure>
3. You can also access the model programmatically through the OpenAI-compatible Chat Completions API. For example, from Python:
   1. Start the server (from the `samples/` directory):

      ```bash
      python genie/python/GenieAPIService.py --modelname "qwen2_5-0_5b-instruct" --loadmodel --profile
      ```
   2. From a *new terminal*, create a new venv and install `requests`:

      ```bash
      mkdir -p ~/genie-api-demo
      cd ~/genie-api-demo

      python3 -m venv .venv
      source .venv/bin/activate
      pip3 install requests
      ```
   3. Create a new file `chat.py`:

      ```python
      import requests

      # if running from your own computer, replace localhost with the IP address of your development board
      url = "http://localhost:8910/v1/chat/completions"

      payload = {
          "model": "qwen2_5-0_5b-instruct",
          "messages": [
              {"role": "system", "content": "You are a helpful assistant."},
              {"role": "user", "content": "Explain Qualcomm in one sentence."}
          ],
          "temperature": 0.7,
          "max_tokens": 200
      }

      response = requests.post(url, headers={ "Content-Type": "application/json" }, json=payload)
      print(response.json())
      ```
   4. Run `chat.py`:

      ```bash
      python3 chat.py

      # {'id': 'genie-llm', 'model': 'IBM-Granite', 'object': 'chat.completion', 'created': 1757512757, 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'Qualcomm is a leading American technology company that designs, manufactures, and markets mobile phone chips and other wireless communication products.', 'tool_call_id': None, 'tool_calls': None}, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 0, 'completion_tokens': 0, 'total_tokens': 0}}
      ```

      (The `model` field in the response seems to always be `IBM-Granite`; you can disregard this.)
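
In most applications you only need the assistant's text rather than the whole response object. The snippet below is a minimal sketch of pulling it out of a response shaped like the one printed above; the helper name `extract_reply` is hypothetical, and `sample` is a trimmed copy of that output.

```python
from typing import Any, Dict


def extract_reply(response: Dict[str, Any]) -> str:
    """Return the assistant text from a chat-completion style response."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    message = choices[0].get("message") or {}
    content = message.get("content")
    if content is None:
        raise ValueError("first choice has no message content")
    return content


# Trimmed copy of the response printed by chat.py above.
sample = {
    "id": "genie-llm",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Qualcomm is a leading American technology company "
                           "that designs, manufactures, and markets mobile phone "
                           "chips and other wireless communication products.",
            },
            "finish_reason": "stop",
        }
    ],
}

print(extract_reply(sample))
```

In `chat.py` this would be used as `extract_reply(response.json())`.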

## Tips and tricks

### Downloading files from HuggingFace that require authentication

If you want to download files, e.g. the `tokenizer.json` file from [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/), that require permission or authentication:

1. Go to [the model page on HuggingFace](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/), sign in (or sign up), and fill in the form to get access to the model.
2. Create a new HuggingFace access token with 'Read' permissions at <https://huggingface.co/settings/tokens>, and configure it on your development board:

   ```bash
   export HF_TOKEN=hf_gs...

   # Optionally add ^ to ~/.bash_profile to ensure it gets loaded automatically in the future.
   ```
3. Once you've been granted access, download the tokenizer:

   ```bash
   wget --header="Authorization: Bearer $HF_TOKEN" https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/resolve/main/tokenizer.json
   ```
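
The same authenticated download can be done from Python using only the standard library. This is a minimal sketch that mirrors the `wget` call above: it reads the token from the `HF_TOKEN` environment variable and sends it as a `Bearer` header. The helper names `build_request` and `download` are hypothetical.

```python
import os
import shutil
import urllib.request

TOKENIZER_URL = (
    "https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/"
    "resolve/main/tokenizer.json"
)


def build_request(url: str, token: str) -> urllib.request.Request:
    """Create a request that carries the access token as a Bearer header."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def download(url: str, dest: str, token: str) -> None:
    """Stream the response body of an authenticated request to dest."""
    with urllib.request.urlopen(build_request(url, token)) as resp:
        with open(dest, "wb") as out:
            shutil.copyfileobj(resp, out)


# Usage (requires network access and an approved access request):
# download(TOKENIZER_URL, "tokenizer.json", os.environ["HF_TOKEN"])
```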
