Llama.cpp
You can run a wide range of Large Language Models (LLMs) and Vision Language Models (VLMs) on your Dragonwing development boards using llama.cpp. Models running under llama.cpp run on the GPU, not on the NPU. You can run a subset of models on the NPU via GENIE.
Building llama.cpp
You'll need to build some dependencies for llama.cpp. Open a terminal on your development board (or an SSH session to it) and run:
Install build dependencies:
sudo apt update
sudo apt install -y cmake ninja-build curl libcurl4-openssl-dev
Install the OpenCL headers and ICD loader library:
mkdir -p ~/dev/llm

# Symlink the OpenCL shared library
sudo rm -f /usr/lib/libOpenCL.so
sudo ln -s /lib/aarch64-linux-gnu/libOpenCL.so.1.0.0 /usr/lib/libOpenCL.so

# OpenCL headers
cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-Headers
cd OpenCL-Headers
git checkout 5d52989617e7ca7b8bb83d7306525dc9f58cdd46
mkdir -p build && cd build
cmake .. -G Ninja \
  -DBUILD_TESTING=OFF \
  -DOPENCL_HEADERS_BUILD_TESTING=OFF \
  -DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF \
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install

# ICD Loader
cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader
cd OpenCL-ICD-Loader
git checkout 02134b05bdff750217bf0c4c11a9b13b63957b04
mkdir -p build && cd build
cmake .. -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" \
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install

# Symlink OpenCL headers
sudo rm -f /usr/include/CL
sudo ln -s ~/dev/llm/opencl/include/CL/ /usr/include/CL
Build llama.cpp with the OpenCL backend:
cd ~/dev/llm

# Clone repository
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# We've tested this commit explicitly, you can try master if you want bleeding edge
git checkout f6da8cb86a28f0319b40d9d2a957a26a7d875f8c

# Build
mkdir -p build
cd build
cmake .. -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_OPENCL=ON
ninja -j`nproc`
Add the llama.cpp paths to your PATH:
cd ~/dev/llm/llama.cpp/build/bin
echo "" >> ~/.bash_profile
echo "# Begin llama.cpp" >> ~/.bash_profile
echo "export PATH=\$PATH:$PWD" >> ~/.bash_profile
echo "# End llama.cpp" >> ~/.bash_profile
echo "" >> ~/.bash_profile

# To use the llama.cpp files in your current session
source ~/.bash_profile
You now have llama.cpp:
llama-cli --version

# ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'
# ggml_opencl: device: 'QUALCOMM Adreno(TM) 635 (OpenCL 3.0 Adreno(TM) 635)'
# ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: 0808.0.7 Compiler E031.49.02.00
# ggml_opencl: vector subgroup broadcast support: true
Downloading and quantizing a model
To run GPU-accelerated models you'll want pure 4-bit quantized (Q4_0) models in GGUF format (the llama.cpp format, conversion guide). You can either find pre-quantized models, or quantize a model yourself using llama-quantize. For example, for Qwen2-1.5B-Instruct:
Grab Qwen2-1.5B-Instruct in fp16 format from HuggingFace, and quantize using llama-quantize:

# Download fp16 model
wget https://huggingface.co/Qwen/Qwen2-1.5B-Instruct-GGUF/resolve/main/qwen2-1_5b-instruct-fp16.gguf

# Quantize (pure Q4_0)
llama-quantize --pure qwen2-1_5b-instruct-fp16.gguf qwen2-1_5b-instruct-q4_0-pure.gguf Q4_0
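If you'd rather fetch the model from Python instead of wget, here's a minimal sketch using the huggingface_hub package (an optional alternative, not a required step; it assumes you've run pip3 install huggingface_hub):

# Minimal sketch: download the fp16 GGUF from Hugging Face via huggingface_hub.
# Assumes `pip3 install huggingface_hub`; the wget command above does the same thing.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Qwen/Qwen2-1.5B-Instruct-GGUF",    # repository that hosts the GGUF files
    filename="qwen2-1_5b-instruct-fp16.gguf",   # same file the wget command fetches
    local_dir=".",                              # download into the current directory
)
print(f"Model downloaded to: {path}")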
You can now run this model interactively with llama-cli, or serve it with llama-server, as described below.
Running your first LLM using llama-cli
You're now ready to run the LLM via llama-cli. It'll automatically offload layers to the GPU:
llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -no-cnv --no-warmup -b 128 -c 2048 -s 11 -n 128 -p "Knock knock, " -fa off
# ... you'll see:
# load_tensors: offloaded 29/29 layers to GPU
# ...
# Knock knock, 11:59 pm ... rest of the story
🚀 You now have an LLM running on the GPU of your device!
Serving LLMs using llama-server
Next, you can use llama-server to start a web server with a chat interface, and an OpenAI-compatible chat completions API.
First, find the IP address of your development board:
ifconfig | grep -Eo 'inet (addr:)?([0-9]*\.){3}[0-9]*' | grep -Eo '([0-9]*\.){3}[0-9]*' | grep -v '127.0.0.1'

# ... Example:
# 192.168.1.253
Start the server via:
llama-server -m ./qwen2-1_5b-instruct-q4_0-pure.gguf --no-warmup -b 128 -c 2048 -s 11 -n 128 --host 0.0.0.0 --port 9876
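Optionally, before opening a browser, you can check that the server is up from Python. A minimal sketch, assuming llama-server's /health endpoint and the requests package (replace the IP address with your board's):

# Minimal sketch: check that llama-server is up and the model has finished loading.
# Assumes `pip3 install requests`; replace the IP address with your board's address.
import requests

resp = requests.get("http://192.168.1.253:9876/health", timeout=5)
print(resp.status_code, resp.text)  # expect HTTP 200 once the model is ready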
On your computer, open a web browser and navigate to http://192.168.1.253:9876 (replace the IP address with the one you found in step 1).

You can also programmatically access this server using the OpenAI Chat Completions API, e.g. from Python:
Create a new venv and install requests:

python3 -m venv .venv-chat
source .venv-chat/bin/activate
pip3 install requests
Create a new file chat.py:

import requests

# if running from your own computer, replace localhost with the IP address of your development board
url = "http://localhost:9876/v1/chat/completions"

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Qualcomm in one sentence."}
    ],
    "temperature": 0.7,
    "max_tokens": 200
}

response = requests.post(url, headers={
    "Content-Type": "application/json"
}, json=payload)
print(response.json())
Run chat.py:

python3 chat.py

# ...
# {'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'role': 'assistant', 'content': 'Qualcomm is a leading global technology company that designs, develops, licenses, and markets semiconductor-based products and mobile platform technologies to major telecommunications and consumer electronics manufacturers worldwide.'}}], 'created': 1757073340, 'model': 'gpt-3.5-turbo', 'system_fingerprint': 'b6362-f6da8cb8', 'object': 'chat.completion', 'usage': {'completion_tokens': 34, 'prompt_tokens': 26, 'total_tokens': 60}, 'id': 'chatcmpl-3O7l005WG1DzN191FTNomJNweHMoH8Is', 'timings': {'prompt_n': 12, 'prompt_ms': 303.581, 'prompt_per_token_ms': 25.298416666666668, 'prompt_per_second': 39.52816546490064, 'predicted_n': 34, 'predicted_ms': 4052.23, 'predicted_per_token_ms': 119.18323529411765, 'predicted_per_second': 8.390441806116632}}
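Because the API is OpenAI-compatible, you can also use the official openai Python client instead of raw requests. A minimal sketch, assuming pip3 install openai; the api_key value is a placeholder, since llama-server doesn't require one unless you configure it:

# Minimal sketch: talk to llama-server through the official OpenAI Python client.
# Assumes `pip3 install openai`; the api_key is a dummy value, llama-server ignores it
# unless you started the server with an API key.
from openai import OpenAI

# if running from your own computer, replace localhost with the IP address of your development board
client = OpenAI(base_url="http://localhost:9876/v1", api_key="none")

completion = client.chat.completions.create(
    model="default",  # llama-server serves a single model, so the name is not important
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Qualcomm in one sentence."},
    ],
    temperature=0.7,
    max_tokens=200,
)
print(completion.choices[0].message.content)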
Serving multi-modal LLMs
You can also use multi-modal LLMs. For example SmolVLM-500M-Instruct-GGUF. Download both the Q4_0 quantized weights (or quantize them yourself), and download the CLIP encoder mmproj-*.gguf
file. For example:
# Download weights
wget https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/SmolVLM-500M-Instruct-f16.gguf
wget https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/mmproj-SmolVLM-500M-Instruct-f16.gguf
# Quantize model (mmproj- models are not quantizable via llama-quantize, see below)
llama-quantize --pure SmolVLM-500M-Instruct-f16.gguf SmolVLM-500M-Instruct-q4_0-pure.gguf Q4_0
# Serve the model
llama-server -m ./SmolVLM-500M-Instruct-q4_0-pure.gguf --mmproj ./mmproj-SmolVLM-500M-Instruct-f16.gguf --no-warmup -b 128 -c 2048 -s 11 -n 128 --host 0.0.0.0 --port 9876
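Once the multi-modal server is running, you can send images through the same chat completions endpoint. A minimal sketch with requests, assuming the server accepts OpenAI-style image_url content parts with base64 data URIs (cat.jpg is a placeholder for any image on disk; replace localhost with your board's IP if running from your computer):

# Minimal sketch: send an image plus a question to the multi-modal server.
# Assumes the OpenAI-style `image_url` content part with a base64 data URI is accepted;
# cat.jpg is a placeholder for any image file on disk.
import base64
import requests

# if running from your own computer, replace localhost with the IP address of your development board
url = "http://localhost:9876/v1/chat/completions"

with open("cat.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
    "max_tokens": 100,
}

response = requests.post(url, json=payload)
print(response.json()["choices"][0]["message"]["content"])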

Tips & tricks
Comparing CPU performance
Add -ngl 0 to the llama-* commands to skip offloading layers to the GPU. Models will run on the CPU, and you can compare performance with the GPU.
E.g. for the Qwen2-1.5B-Instruct Q4_0 on RB3 Gen 2 Vision Kit:
GPU:
llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -no-cnv --no-warmup -b 128 -c 2048 -s 11 -n 128 -p "Knock knock, " -fa off
# llama_perf_sampler_print: sampling time = 225.78 ms / 133 runs ( 1.70 ms per token, 589.06 tokens per second)
# llama_perf_context_print: load time = 5338.13 ms
# llama_perf_context_print: prompt eval time = 201.32 ms / 5 tokens ( 40.26 ms per token, 24.84 tokens per second)
# llama_perf_context_print: eval time = 13214.35 ms / 127 runs ( 104.05 ms per token, 9.61 tokens per second)
# llama_perf_context_print: total time = 18958.06 ms / 132 tokens
# llama_perf_context_print: graphs reused = 122
CPU:
llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -no-cnv --no-warmup -b 128 -c 2048 -s 11 -n 128 -p "Knock knock, " -fa off -ngl 0
# llama_perf_sampler_print: sampling time = 23.47 ms / 133 runs ( 0.18 ms per token, 5666.08 tokens per second)
# llama_perf_context_print: load time = 677.25 ms
# llama_perf_context_print: prompt eval time = 253.39 ms / 5 tokens ( 50.68 ms per token, 19.73 tokens per second)
# llama_perf_context_print: eval time = 17751.29 ms / 127 runs ( 139.77 ms per token, 7.15 tokens per second)
# llama_perf_context_print: total time = 18487.26 ms / 132 tokens
# llama_perf_context_print: graphs reused = 122
Here the GPU evaluates tokens roughly 34% faster than the CPU (9.61 vs. 7.15 tokens per second during evaluation).
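If you want to script this comparison, here's a minimal sketch (an illustration, not part of the official tooling) that runs llama-cli twice and pulls the eval tokens-per-second figures out of the llama_perf summary shown above:

# Minimal sketch: run llama-cli with and without GPU offload and compare eval speed.
# Assumes the llama_perf_context_print log format shown above; adjust the regex if it changes.
import re
import subprocess

BASE_CMD = [
    "llama-cli", "-m", "./qwen2-1_5b-instruct-q4_0-pure.gguf",
    "-no-cnv", "--no-warmup", "-b", "128", "-c", "2048",
    "-s", "11", "-n", "128", "-p", "Knock knock, ", "-fa", "off",
]

def eval_tokens_per_second(extra_args):
    # llama.cpp prints its perf summary at the end of the run (usually to stderr)
    result = subprocess.run(BASE_CMD + extra_args, capture_output=True, text=True)
    output = result.stdout + result.stderr
    match = re.search(r"_print:\s+eval time\s+=.*?([\d.]+) tokens per second\)", output)
    return float(match.group(1)) if match else None

gpu = eval_tokens_per_second([])             # default: layers offloaded to the GPU
cpu = eval_tokens_per_second(["-ngl", "0"])  # -ngl 0: keep all layers on the CPU
if gpu and cpu:
    print(f"GPU: {gpu:.2f} t/s, CPU: {cpu:.2f} t/s, speedup: {gpu / cpu:.2f}x")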