
# LLMs/VLMs using Llama.cpp

You can run a wide range of Large Language Models (LLMs) and Vision Language Models (VLMs) on your Dragonwing development boards using [llama.cpp](https://github.com/ggml-org/llama.cpp). Models running under llama.cpp run on the *GPU*, not on the *NPU*. You can run a subset of models on the NPU via [GENIE](/qc-ai-test-docs/running-building-ai-models/genie.md).

## Building llama.cpp

You'll need to build some dependencies for llama.cpp. Open a terminal on your development board (or an SSH session to it) and run:

1. Install build dependencies:

   ```bash
   sudo apt update
   sudo apt install -y cmake ninja-build curl libcurl4-openssl-dev build-essential
   ```
2. Install the OpenCL headers and ICD loader library:

   ```bash
   mkdir -p ~/dev/llm

   # Symlink the OpenCL shared library
   sudo rm -f /usr/lib/libOpenCL.so
   sudo ln -s /lib/aarch64-linux-gnu/libOpenCL.so.1.0.0 /usr/lib/libOpenCL.so

   # OpenCL headers
   cd ~/dev/llm
   git clone https://github.com/KhronosGroup/OpenCL-Headers
   cd OpenCL-Headers
   git checkout 5d52989617e7ca7b8bb83d7306525dc9f58cdd46
   mkdir -p build && cd build
   cmake .. -G Ninja \
       -DBUILD_TESTING=OFF \
       -DOPENCL_HEADERS_BUILD_TESTING=OFF \
       -DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF \
       -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
   cmake --build . --target install

   # ICD Loader
   cd ~/dev/llm
   git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader
   cd OpenCL-ICD-Loader
   git checkout 02134b05bdff750217bf0c4c11a9b13b63957b04
   mkdir -p build && cd build
   cmake .. -G Ninja \
       -DCMAKE_BUILD_TYPE=Release \
       -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" \
       -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
   cmake --build . --target install

   # Symlink OpenCL headers
   sudo rm -f /usr/include/CL
   sudo ln -s ~/dev/llm/opencl/include/CL/ /usr/include/CL
   ```
3. Build llama.cpp with the OpenCL backend:

   ```bash
   cd ~/dev/llm

   # Clone repository
   git clone https://github.com/ggml-org/llama.cpp
   cd llama.cpp

   # We've tested this commit explicitly; try master if you want the bleeding edge
   git checkout f6da8cb86a28f0319b40d9d2a957a26a7d875f8c

   # Build
   mkdir -p build
   cd build
   cmake .. -G Ninja \
       -DCMAKE_BUILD_TYPE=Release \
       -DBUILD_SHARED_LIBS=OFF \
       -DGGML_OPENCL=ON
   ninja -j`nproc`
   ```
4. Add the llama.cpp paths to your PATH:

   ```bash
   cd ~/dev/llm/llama.cpp/build/bin

   echo "" >> ~/.bash_profile
   echo "# Begin llama.cpp" >> ~/.bash_profile
   echo "export PATH=\$PATH:$PWD" >> ~/.bash_profile
   echo "# End llama.cpp" >> ~/.bash_profile
   echo "" >> ~/.bash_profile

   # To use the llama.cpp files in your current session
   source ~/.bash_profile
   ```
5. You now have llama.cpp:

   ```bash
   llama-cli --version
   # ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'
   # ggml_opencl: device: 'QUALCOMM Adreno(TM) 635 (OpenCL 3.0 Adreno(TM) 635)'
   # ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: 0808.0.7 Compiler E031.49.02.00
   # ggml_opencl: vector subgroup broadcast support: true
   ```
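
If you'd rather verify the build from a script, a minimal sketch along these lines could check that `llama-cli` is on your PATH and that the OpenCL backend selected a device. The helper names `find_opencl_device` and `check_llama_cli` are hypothetical, and the parsing assumes the `ggml_opencl: device:` log format shown above.

```python
import re
import shutil
import subprocess


def find_opencl_device(log_text):
    """Return the device name from a 'ggml_opencl: device:' log line, or None."""
    match = re.search(r"ggml_opencl: device: '([^']*)'", log_text)
    return match.group(1) if match else None


def check_llama_cli():
    """Run `llama-cli --version` and return the selected OpenCL device, if any."""
    if shutil.which("llama-cli") is None:
        return None  # llama-cli is not on PATH
    result = subprocess.run(["llama-cli", "--version"], capture_output=True, text=True)
    # Assumption: the log lines may land on stdout or stderr, so search both
    return find_opencl_device(result.stdout + result.stderr)


# Sample line copied from the output above
sample = "ggml_opencl: device: 'QUALCOMM Adreno(TM) 635 (OpenCL 3.0 Adreno(TM) 635)'"
print(find_opencl_device(sample))
```

Calling `check_llama_cli()` on the board returns `None` if either the binary or the OpenCL device line is missing.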

### Downloading and quantizing a model

To run GPU-accelerated models you'll want pure 4-bit quantized (`Q4_0`) models in GGUF format (the llama.cpp format, [conversion guide](https://github.com/ggml-org/llama.cpp/discussions/2948)). You can either find pre-quantized models, or quantize a model yourself using `llama-quantize`. For example, for Qwen2-1.5B-Instruct:

```bash
# Download fp16 model (https://huggingface.co/Qwen/Qwen2-1.5B-Instruct-GGUF)
wget https://huggingface.co/Qwen/Qwen2-1.5B-Instruct-GGUF/resolve/main/qwen2-1_5b-instruct-fp16.gguf

# Quantize (pure Q4_0)
llama-quantize --pure qwen2-1_5b-instruct-fp16.gguf qwen2-1_5b-instruct-q4_0-pure.gguf Q4_0
```
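
For a rough sense of what quantization saves, here is a back-of-the-envelope estimate. Q4_0 in ggml stores weights in blocks of 32, each holding one fp16 scale plus 32 packed 4-bit values (18 bytes per block, or 4.5 bits per weight). The parameter count below is an assumption (roughly 1.5 billion for Qwen2-1.5B), and real GGUF files will differ somewhat because metadata and some small tensors are not stored as Q4_0.

```python
Q4_0_BLOCK_WEIGHTS = 32         # weights per Q4_0 block
Q4_0_BLOCK_BYTES = 2 + 32 // 2  # fp16 scale (2 bytes) + 32 packed 4-bit values (16 bytes)


def estimate_bytes(n_params, bits_per_weight):
    """Approximate storage for n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8


q4_0_bits = Q4_0_BLOCK_BYTES * 8 / Q4_0_BLOCK_WEIGHTS  # 4.5 bits per weight
n_params = 1.5e9  # assumed parameter count, for illustration only

fp16_bytes = estimate_bytes(n_params, 16)
q4_0_bytes = estimate_bytes(n_params, q4_0_bits)
print(f"fp16: {fp16_bytes / 1e9:.2f} GB, Q4_0: {q4_0_bytes / 1e9:.2f} GB, "
      f"ratio: {fp16_bytes / q4_0_bytes:.2f}x")
```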

### Running your first LLM using llama-cli

You're now ready to run the LLM via `llama-cli`. It'll automatically offload layers to the GPU:

```bash
llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -no-cnv --no-warmup -b 128 -c 2048 -s 11 -n 128 -p "Knock knock, " -fa off

# ... You'll see:
# load_tensors: offloaded 29/29 layers to GPU
# ...
# Knock knock, 11:59 pm ... rest of the story
```

🚀 You now have an LLM running on the GPU of your device!
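
To confirm from a script that every layer was offloaded, you could parse the `load_tensors` log line shown above. This is a minimal sketch: the helper names are hypothetical, and it assumes the log format stays as in the example output.

```python
import re


def parse_offload(log_text):
    """Return (offloaded, total) from an 'offloaded N/M layers to GPU' line, or None."""
    match = re.search(r"offloaded (\d+)/(\d+) layers to GPU", log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def fully_offloaded(log_text):
    """True only if the log reports that all layers went to the GPU."""
    counts = parse_offload(log_text)
    return counts is not None and counts[0] == counts[1]


print(fully_offloaded("load_tensors: offloaded 29/29 layers to GPU"))
```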

### Serving LLMs using llama-server

Next, you can use `llama-server` to start a web server with a chat interface and an OpenAI-compatible chat completions API.

1. First, find the IP address of your development board:

   ```bash
   ifconfig | grep -Eo 'inet (addr:)?([0-9]*\.){3}[0-9]*' | grep -Eo '([0-9]*\.){3}[0-9]*' | grep -v '127.0.0.1'

   # ... Example:
   # 192.168.1.253
   ```
2. Start the server via:

   ```bash
   llama-server -m ./qwen2-1_5b-instruct-q4_0-pure.gguf --no-warmup -b 128 -c 2048 -s 11 -n 128 --host 0.0.0.0 --port 9876
   ```
3. On your computer, open a web browser and navigate to `http://192.168.1.253:9876` (replace the IP address with the one you found in step 1):

   <figure><img src="/files/TgPM3pgWBqsTe1uA2Axi" alt="Serving LLMs using llama-server"><figcaption><p>Serving LLMs using llama-server</p></figcaption></figure>
4. You can also programmatically access this server using the OpenAI Chat Completions API. E.g. from Python:
   1. Create a new venv and install `requests`:

      ```bash
      python3 -m venv .venv-chat
      source .venv-chat/bin/activate
      pip3 install requests
      ```
   2. Create a new file `chat.py`:

      ```python
      import requests

      # if running from your own computer, replace localhost with the IP address of your development board
      url = "http://localhost:9876/v1/chat/completions"

      payload = {
          "messages": [
              {"role": "system", "content": "You are a helpful assistant."},
              {"role": "user", "content": "Explain Qualcomm in one sentence."}
          ],
          "temperature": 0.7,
          "max_tokens": 200
      }

      response = requests.post(url, headers={ "Content-Type": "application/json" }, json=payload)
      print(response.json())
      ```
   3. Run `chat.py`:

      ```bash
      python3 chat.py

      # ...
      # {'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'role': 'assistant', 'content': 'Qualcomm is a leading global technology company that designs, develops, licenses, and markets semiconductor-based products and mobile platform technologies to major telecommunications and consumer electronics manufacturers worldwide.'}}], 'created': 1757073340, 'model': 'gpt-3.5-turbo', 'system_fingerprint': 'b6362-f6da8cb8', 'object': 'chat.completion', 'usage': {'completion_tokens': 34, 'prompt_tokens': 26, 'total_tokens': 60}, 'id': 'chatcmpl-3O7l005WG1DzN191FTNomJNweHMoH8Is', 'timings': {'prompt_n': 12, 'prompt_ms': 303.581, 'prompt_per_token_ms': 25.298416666666668, 'prompt_per_second': 39.52816546490064, 'predicted_n': 34, 'predicted_ms': 4052.23, 'predicted_per_token_ms': 119.18323529411765, 'predicted_per_second': 8.390441806116632}}
      ```
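
The raw JSON is verbose, so in practice you'll usually want only the reply text and perhaps the generation speed. A minimal sketch, assuming the response shape shown above (the helper name `summarize_chat_response` is hypothetical, and the example dictionary is a shortened copy of that output):

```python
def summarize_chat_response(body):
    """Pull the assistant reply and generation speed out of a chat completion response."""
    reply = body["choices"][0]["message"]["content"]
    # 'timings' appears in the llama-server output above; treat it as optional
    speed = body.get("timings", {}).get("predicted_per_second")
    return reply, speed


# Shortened copy of the example response above (reply text truncated)
example = {
    "choices": [{
        "finish_reason": "stop",
        "index": 0,
        "message": {"role": "assistant",
                    "content": "Qualcomm is a leading global technology company ..."},
    }],
    "usage": {"completion_tokens": 34, "prompt_tokens": 26, "total_tokens": 60},
    "timings": {"predicted_n": 34, "predicted_per_second": 8.390441806116632},
}

reply, speed = summarize_chat_response(example)
print(reply)
print(f"{speed:.2f} tokens/s")
```

In `chat.py` you would call `summarize_chat_response(response.json())` instead of printing the whole body.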

### Serving multi-modal LLMs

You can also use multi-modal LLMs, for example [SmolVLM-500M-Instruct-GGUF](https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF). Download the Q4\_0 quantized weights (or quantize them yourself) together with the CLIP encoder `mmproj-*.gguf` file. For example:

```bash
# Download weights
wget https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/SmolVLM-500M-Instruct-f16.gguf
wget https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/mmproj-SmolVLM-500M-Instruct-f16.gguf

# Quantize model (mmproj- models are not quantizable via llama-quantize, see below)
llama-quantize --pure SmolVLM-500M-Instruct-f16.gguf SmolVLM-500M-Instruct-q4_0-pure.gguf Q4_0

# Serve the model
llama-server -m ./SmolVLM-500M-Instruct-q4_0-pure.gguf --mmproj ./mmproj-SmolVLM-500M-Instruct-f16.gguf --no-warmup -b 128 -c 2048 -s 11 -n 128 --host 0.0.0.0 --port 9876
```

<figure><img src="/files/i1VeS1OG8e4Eo6eyipd5" alt="Serving multi-modal LLMs using llama-server"><figcaption><p>Serving multi-modal LLMs using llama-server</p></figcaption></figure>

{% hint style="info" %}
**CLIP model is still fp16:** The `mmproj` model is still fp16, so processing images will be slow. There's code to quantize the CLIP encoder in [older versions of llama.cpp](https://github.com/ggml-org/llama.cpp/pull/11644).
{% endhint %}
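
You can also send images programmatically. The sketch below assumes the server accepts OpenAI-style `image_url` content parts carrying a base64 data URI on the same `/v1/chat/completions` endpoint used earlier; treat that as an assumption to verify against your llama.cpp build. The helper names and the prompt are hypothetical, and it uses only the Python standard library.

```python
import base64
import json
import urllib.request


def build_image_payload(image_bytes, prompt, mime_type="image/jpeg"):
    """Build a chat completion payload with one text part and one inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
            ],
        }],
        "max_tokens": 128,
    }


def describe_image(path, url="http://localhost:9876/v1/chat/completions"):
    """POST an image file to the server and return the assistant's reply text."""
    with open(path, "rb") as f:
        payload = build_image_payload(f.read(), "Describe this image.")
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["message"]["content"]
```

With the server running, `print(describe_image("photo.jpg"))` would send a local file; replace `localhost` with the board's IP address when running from your own computer.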

## Tips & tricks

### Comparing CPU performance

Add `-ngl 0` to the `llama-*` commands to skip offloading layers to the GPU. Models will run on CPU, and you can compare performance with GPU.

E.g. for the Qwen2-1.5B-Instruct Q4\_0 on RB3 Gen 2 Vision Kit:

**GPU:**

```bash
llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -no-cnv --no-warmup -b 128 -c 2048 -s 11 -n 128 -p "Knock knock, " -fa off

# llama_perf_sampler_print:    sampling time =     225.78 ms /   133 runs   (    1.70 ms per token,   589.06 tokens per second)
# llama_perf_context_print:        load time =    5338.13 ms
# llama_perf_context_print: prompt eval time =     201.32 ms /     5 tokens (   40.26 ms per token,    24.84 tokens per second)
# llama_perf_context_print:        eval time =   13214.35 ms /   127 runs   (  104.05 ms per token,     9.61 tokens per second)
# llama_perf_context_print:       total time =   18958.06 ms /   132 tokens
# llama_perf_context_print:    graphs reused =        122
```

**CPU:**

```bash
llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -no-cnv --no-warmup -b 128 -c 2048 -s 11 -n 128 -p "Knock knock, " -fa off -ngl 0

# llama_perf_sampler_print:    sampling time =      23.47 ms /   133 runs   (    0.18 ms per token,  5666.08 tokens per second)
# llama_perf_context_print:        load time =     677.25 ms
# llama_perf_context_print: prompt eval time =     253.39 ms /     5 tokens (   50.68 ms per token,    19.73 tokens per second)
# llama_perf_context_print:        eval time =   17751.29 ms /   127 runs   (  139.77 ms per token,     7.15 tokens per second)
# llama_perf_context_print:       total time =   18487.26 ms /   132 tokens
# llama_perf_context_print:    graphs reused =        122
```

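Here the GPU generates tokens about 34% faster than the CPU (9.61 vs. 7.15 tokens per second), although the model takes noticeably longer to load on the GPU (5338 ms vs. 677 ms).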
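
If you run this comparison often, you could compute the speedup directly from the `llama_perf_context_print` output rather than by hand. A minimal sketch, assuming the log format shown above (the helper name `eval_tokens_per_second` is hypothetical):

```python
import re


def eval_tokens_per_second(perf_log):
    """Extract tokens/second from the 'eval time' line, skipping 'prompt eval time'."""
    for line in perf_log.splitlines():
        if "eval time" in line and "prompt eval time" not in line:
            match = re.search(r"([\d.]+) tokens per second", line)
            if match:
                return float(match.group(1))
    return None


# Lines copied from the GPU and CPU runs above
gpu_log = "llama_perf_context_print:        eval time =   13214.35 ms /   127 runs   (  104.05 ms per token,     9.61 tokens per second)"
cpu_log = "llama_perf_context_print:        eval time =   17751.29 ms /   127 runs   (  139.77 ms per token,     7.15 tokens per second)"

gpu = eval_tokens_per_second(gpu_log)
cpu = eval_tokens_per_second(cpu_log)
print(f"GPU generation is {(gpu / cpu - 1) * 100:.0f}% faster than CPU")
```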

