Llama.cpp

You can run a wide range of Large Language Models (LLMs) and Vision Language Models (VLMs) on your Dragonwing development boards using llama.cpp. Models running under llama.cpp run on the GPU, not on the NPU; a subset of models can run on the NPU via GENIE.

Building llama.cpp

You'll first need to build a few dependencies for llama.cpp. Open a terminal on your development board (or an SSH session to it) and run:

  1. Install build dependencies:

    sudo apt update
    sudo apt install -y cmake ninja-build curl libcurl4-openssl-dev
  2. Install the OpenCL headers and ICD loader library:

    mkdir -p ~/dev/llm
    
    # Symlink the OpenCL shared library
    sudo rm -f /usr/lib/libOpenCL.so
    sudo ln -s /lib/aarch64-linux-gnu/libOpenCL.so.1.0.0 /usr/lib/libOpenCL.so
    
    # OpenCL headers
    cd ~/dev/llm
    git clone https://github.com/KhronosGroup/OpenCL-Headers
    cd OpenCL-Headers
    git checkout 5d52989617e7ca7b8bb83d7306525dc9f58cdd46
    mkdir -p build && cd build
    cmake .. -G Ninja \
        -DBUILD_TESTING=OFF \
        -DOPENCL_HEADERS_BUILD_TESTING=OFF \
        -DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF \
        -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
    cmake --build . --target install
    
    # ICD Loader
    cd ~/dev/llm
    git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader
    cd OpenCL-ICD-Loader
    git checkout 02134b05bdff750217bf0c4c11a9b13b63957b04
    mkdir -p build && cd build
    cmake .. -G Ninja \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" \
        -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
    cmake --build . --target install
    
    # Symlink OpenCL headers
    sudo rm -f /usr/include/CL
    sudo ln -s ~/dev/llm/opencl/include/CL/ /usr/include/CL
  3. Build llama.cpp with the OpenCL backend:

    cd ~/dev/llm
    
    # Clone repository
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    
    # We've tested this commit explicitly; try master if you want the bleeding edge
    git checkout f6da8cb86a28f0319b40d9d2a957a26a7d875f8c
    
    # Build
    mkdir -p build
    cd build
    cmake .. -G Ninja \
        -DCMAKE_BUILD_TYPE=Release \
        -DBUILD_SHARED_LIBS=OFF \
        -DGGML_OPENCL=ON
    ninja -j`nproc`
  4. Add the llama.cpp binaries to your PATH:

    cd ~/dev/llm/llama.cpp/build/bin
    
    echo "" >> ~/.bash_profile
    echo "# Begin llama.cpp" >> ~/.bash_profile
    echo "export PATH=\$PATH:$PWD" >> ~/.bash_profile
    echo "# End llama.cpp" >> ~/.bash_profile
    echo "" >> ~/.bash_profile
    
    # To use the llama.cpp files in your current session
    source ~/.bash_profile
  5. You now have llama.cpp:

    llama-cli --version
    # ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'
    # ggml_opencl: device: 'QUALCOMM Adreno(TM) 635 (OpenCL 3.0 Adreno(TM) 635)'
    # ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: 0808.0.7 Compiler E031.49.02.00
    # ggml_opencl: vector subgroup broadcast support: true
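
Before moving on, you can confirm that the other tools used in the rest of this guide (llama-server, llama-quantize, llama-bench) were built alongside llama-cli. A minimal check, assuming the directory layout above:

# List the llama.cpp tools this guide relies on
ls ~/dev/llm/llama.cpp/build/bin/ | grep -E '^llama-(cli|server|quantize|bench)$'

# Expect llama-bench, llama-cli, llama-quantize and llama-server in the output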

Downloading and quantizing a model

To run GPU-accelerated models, you'll want pure 4-bit quantized (Q4_0) models in GGUF format (the llama.cpp model format; see the conversion guide). You can either find pre-quantized models, or quantize a model yourself using llama-quantize. For example, for Qwen2-1.5B-Instruct:

  1. Grab Qwen2-1.5B-Instruct in fp16 format from HuggingFace, and quantize using llama-quantize:

    # Download fp16 model
    wget https://huggingface.co/Qwen/Qwen2-1.5B-Instruct-GGUF/resolve/main/qwen2-1_5b-instruct-fp16.gguf
    
    # Quantize (pure Q4_0)
    llama-quantize --pure qwen2-1_5b-instruct-fp16.gguf qwen2-1_5b-instruct-q4_0-pure.gguf Q4_0
  2. Now follow the llama-cli or llama-server instructions below to run this model.
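
As a quick sanity check before loading the model, you can compare file sizes; a pure Q4_0 file should be roughly a quarter to a third the size of the fp16 original (a sketch, assuming the file names above):

# Compare fp16 vs. Q4_0 file sizes
ls -lh qwen2-1_5b-instruct-*.gguf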

Running your first LLM using llama-cli

You're now ready to run the LLM via llama-cli. It'll automatically offload layers to the GPU:

llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -no-cnv --no-warmup -b 128 -c 2048 -s 11 -n 128 -p "Knock knock, " -fa off

# ... You'll see:
# load_tensors: offloaded 29/29 layers to GPU
# ...
# Knock knock, 11:59 pm ... rest of the story

🚀 You now have an LLM running on the GPU of your device!
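
The command above runs a single one-shot completion (-no-cnv disables conversation mode). To chat with the model interactively in your terminal instead, drop -no-cnv and the prompt; a sketch using the same model and context settings:

# Interactive chat; type a message at the prompt, press Ctrl+C to quit
llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf --no-warmup -b 128 -c 2048 -fa off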

Serving LLMs using llama-server

Next, you can use llama-server to start a web server with a chat interface and an OpenAI-compatible chat completions API.

  1. First, find the IP address of your development board:

    ifconfig | grep -Eo 'inet (addr:)?([0-9]*\.){3}[0-9]*' | grep -Eo '([0-9]*\.){3}[0-9]*' | grep -v '127.0.0.1'
    
    # ... Example:
    # 192.168.1.253
  2. Start the server via:

    llama-server -m ./qwen2-1_5b-instruct-q4_0-pure.gguf --no-warmup -b 128 -c 2048 -s 11 -n 128 --host 0.0.0.0 --port 9876
  3. On your computer, open a web browser and navigate to http://192.168.1.253:9876 (replace the IP address with the one you found in step 1):

    Serving LLMs using llama-server
  4. You can also programmatically access this server using the OpenAI Chat Completions API, e.g. from Python (a curl equivalent is sketched after these steps):

    1. Create a new venv and install requests:

      python3 -m venv .venv-chat
      source .venv-chat/bin/activate
      pip3 install requests
    2. Create a new file chat.py:

      import requests
      
      # if running from your own computer, replace localhost with the IP address of your development board
      url = "http://localhost:9876/v1/chat/completions"
      
      payload = {
          "messages": [
              {"role": "system", "content": "You are a helpful assistant."},
              {"role": "user", "content": "Explain Qualcomm in one sentence."}
          ],
          "temperature": 0.7,
          "max_tokens": 200
      }
      
      response = requests.post(url, headers={ "Content-Type": "application/json" }, json=payload)
      print(response.json())
    3. Run chat.py:

      python3 chat.py
      
      # ...
      # {'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'role': 'assistant', 'content': 'Qualcomm is a leading global technology company that designs, develops, licenses, and markets semiconductor-based products and mobile platform technologies to major telecommunications and consumer electronics manufacturers worldwide.'}}], 'created': 1757073340, 'model': 'gpt-3.5-turbo', 'system_fingerprint': 'b6362-f6da8cb8', 'object': 'chat.completion', 'usage': {'completion_tokens': 34, 'prompt_tokens': 26, 'total_tokens': 60}, 'id': 'chatcmpl-3O7l005WG1DzN191FTNomJNweHMoH8Is', 'timings': {'prompt_n': 12, 'prompt_ms': 303.581, 'prompt_per_token_ms': 25.298416666666668, 'prompt_per_second': 39.52816546490064, 'predicted_n': 34, 'predicted_ms': 4052.23, 'predicted_per_token_ms': 119.18323529411765, 'predicted_per_second': 8.390441806116632}}
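
If you prefer not to use Python, the same endpoint can also be called with curl; a sketch of the equivalent request (replace localhost with your development board's IP address when calling from another machine):

curl http://localhost:9876/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain Qualcomm in one sentence."}
        ],
        "temperature": 0.7,
        "max_tokens": 200
    }'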

Serving multi-modal LLMs

You can also serve multi-modal LLMs, for example SmolVLM-500M-Instruct-GGUF. Download both the model weights (pre-quantized Q4_0, or quantize them yourself) and the CLIP encoder mmproj-*.gguf file. For example:

# Download weights
wget https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/SmolVLM-500M-Instruct-f16.gguf
wget https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/mmproj-SmolVLM-500M-Instruct-f16.gguf

# Quantize model (mmproj- models are not quantizable via llama-quantize, see below)
llama-quantize --pure SmolVLM-500M-Instruct-f16.gguf SmolVLM-500M-Instruct-q4_0-pure.gguf Q4_0

# Serve the model
llama-server -m ./SmolVLM-500M-Instruct-q4_0-pure.gguf --mmproj ./mmproj-SmolVLM-500M-Instruct-f16.gguf --no-warmup -b 128 -c 2048 -s 11 -n 128 --host 0.0.0.0 --port 9876
Serving multi-modal LLMs using llama-server

CLIP model is still fp16: The mmproj model remains in fp16, so processing images will be slow. There's code to quantize the CLIP encoder in older versions of llama.cpp.
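
You can also send images to the multi-modal server through the same chat completions endpoint by embedding them as base64 data URIs, following the OpenAI image_url convention. A sketch, assuming a small test.jpg in the current directory and the server started as above (exact multi-modal API support depends on your llama.cpp build):

# Encode a local image (test.jpg is a placeholder; any small JPEG works)
IMG_B64=$(base64 -w0 ./test.jpg)

curl http://localhost:9876/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'"$IMG_B64"'"}}
                ]
            }
        ],
        "max_tokens": 100
    }'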

Tips & tricks

Comparing CPU performance

Add -ngl 0 to the llama-* commands to skip offloading layers to the GPU. Models will then run on the CPU, so you can compare performance against the GPU.

E.g. for Qwen2-1.5B-Instruct Q4_0 on the RB3 Gen 2 Vision Kit:

GPU:

llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -no-cnv --no-warmup -b 128 -c 2048 -s 11 -n 128 -p "Knock knock, " -fa off

# llama_perf_sampler_print:    sampling time =     225.78 ms /   133 runs   (    1.70 ms per token,   589.06 tokens per second)
# llama_perf_context_print:        load time =    5338.13 ms
# llama_perf_context_print: prompt eval time =     201.32 ms /     5 tokens (   40.26 ms per token,    24.84 tokens per second)
# llama_perf_context_print:        eval time =   13214.35 ms /   127 runs   (  104.05 ms per token,     9.61 tokens per second)
# llama_perf_context_print:       total time =   18958.06 ms /   132 tokens
# llama_perf_context_print:    graphs reused =        122

CPU:

llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -no-cnv --no-warmup -b 128 -c 2048 -s 11 -n 128 -p "Knock knock, " -fa off -ngl 0

# llama_perf_sampler_print:    sampling time =      23.47 ms /   133 runs   (    0.18 ms per token,  5666.08 tokens per second)
# llama_perf_context_print:        load time =     677.25 ms
# llama_perf_context_print: prompt eval time =     253.39 ms /     5 tokens (   50.68 ms per token,    19.73 tokens per second)
# llama_perf_context_print:        eval time =   17751.29 ms /   127 runs   (  139.77 ms per token,     7.15 tokens per second)
# llama_perf_context_print:       total time =   18487.26 ms /   132 tokens
# llama_perf_context_print:    graphs reused =        122

Here the GPU generates tokens roughly a third faster than the CPU (9.61 vs. 7.15 tokens per second).
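
For a more structured comparison you can also use llama-bench, which is built alongside llama-cli and reports prompt processing and generation speed in a table. A sketch, using -ngl to control GPU offload:

# GPU: offload all layers
llama-bench -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -ngl 99

# CPU only
llama-bench -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -ngl 0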
