Llama.cpp

You can run a wide range of Large Language Models (LLMs) and Vision Language Models (VLMs) on your Dragonwing development boards using llama.cpp. Models running under llama.cpp run on the GPU, not on the NPU; a subset of models can run on the NPU via GENIE.

Building llama.cpp

You'll first need to build a few dependencies for llama.cpp. Open a terminal on your development board (or an SSH session to it) and run:

  1. Install build dependencies:

    sudo apt update
    sudo apt install -y cmake ninja-build curl libcurl4-openssl-dev
  2. Install the OpenCL headers and ICD loader library:

    mkdir -p ~/dev/llm
    
    # Symlink the OpenCL shared library
    sudo rm -f /usr/lib/libOpenCL.so
    sudo ln -s /lib/aarch64-linux-gnu/libOpenCL.so.1.0.0 /usr/lib/libOpenCL.so
    
    # OpenCL headers
    cd ~/dev/llm
    git clone https://github.com/KhronosGroup/OpenCL-Headers
    cd OpenCL-Headers
    git checkout 5d52989617e7ca7b8bb83d7306525dc9f58cdd46
    mkdir -p build && cd build
    cmake .. -G Ninja \
        -DBUILD_TESTING=OFF \
        -DOPENCL_HEADERS_BUILD_TESTING=OFF \
        -DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF \
        -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
    cmake --build . --target install
    
    # ICD Loader
    cd ~/dev/llm
    git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader
    cd OpenCL-ICD-Loader
    git checkout 02134b05bdff750217bf0c4c11a9b13b63957b04
    mkdir -p build && cd build
    cmake .. -G Ninja \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" \
        -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
    cmake --build . --target install
    
    # Symlink OpenCL headers
    sudo rm -f /usr/include/CL
    sudo ln -s ~/dev/llm/opencl/include/CL/ /usr/include/CL
  3. Build llama.cpp with the OpenCL backend:

    cd ~/dev/llm
    
    # Clone repository
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    
    # We've tested this commit explicitly; try master if you want the bleeding edge
    git checkout f6da8cb86a28f0319b40d9d2a957a26a7d875f8c
    
    # Build
    mkdir -p build
    cd build
    cmake .. -G Ninja \
        -DCMAKE_BUILD_TYPE=Release \
        -DBUILD_SHARED_LIBS=OFF \
        -DGGML_OPENCL=ON
    ninja -j`nproc`
  4. Add the llama.cpp binaries to your PATH:

    cd ~/dev/llm/llama.cpp/build/bin
    
    echo "" >> ~/.bash_profile
    echo "# Begin llama.cpp" >> ~/.bash_profile
    echo "export PATH=\$PATH:$PWD" >> ~/.bash_profile
    echo "# End llama.cpp" >> ~/.bash_profile
    echo "" >> ~/.bash_profile
    
    # To use the llama.cpp files in your current session
    source ~/.bash_profile
  5. You now have llama.cpp:

    llama-cli --version
    # ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'
    # ggml_opencl: device: 'QUALCOMM Adreno(TM) 635 (OpenCL 3.0 Adreno(TM) 635)'
    # ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: 0808.0.7 Compiler E031.49.02.00
    # ggml_opencl: vector subgroup broadcast support: true
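
Before moving on, you can confirm that the other tools used in the rest of this guide (llama-server, llama-quantize, llama-bench) were built alongside llama-cli. A minimal check, assuming the directory layout above:

# List the llama.cpp tools this guide relies on
ls ~/dev/llm/llama.cpp/build/bin/ | grep -E '^llama-(cli|server|quantize|bench)$'

# Expect llama-bench, llama-cli, llama-quantize and llama-server in the output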

Downloading and quantizing a model

To run GPU-accelerated models, you'll want pure 4-bit quantized (Q4_0) models in GGUF format (the llama.cpp model format; see the conversion guide). You can either find pre-quantized models, or quantize a model yourself using llama-quantize. For example, for Qwen2-1.5B-Instruct:

  1. Grab Qwen2-1.5B-Instruct in fp16 format from HuggingFace, and quantize using llama-quantize:

    # Download fp16 model
    wget https://huggingface.co/Qwen/Qwen2-1.5B-Instruct-GGUF/resolve/main/qwen2-1_5b-instruct-fp16.gguf
    
    # Quantize (pure Q4_0)
    llama-quantize --pure qwen2-1_5b-instruct-fp16.gguf qwen2-1_5b-instruct-q4_0-pure.gguf Q4_0
  2. Now follow the llama-cli or llama-server instructions below to run this model.
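
As a quick sanity check before loading the model, you can compare file sizes; a pure Q4_0 file should be roughly a quarter to a third the size of the fp16 original (a sketch, assuming the file names above):

# Compare fp16 vs. Q4_0 file sizes
ls -lh qwen2-1_5b-instruct-*.gguf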

Running your first LLM using llama-cli

You're now ready to run the LLM via llama-cli. It'll automatically offload layers to the GPU:

llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -no-cnv --no-warmup -b 128 -c 2048 -s 11 -n 128 -p "Knock knock, " -fa off

# ... You'll see:
# load_tensors: offloaded 29/29 layers to GPU
# ...
# Knock knock, 11:59 pm ... rest of the story

🚀 You now have an LLM running on the GPU of your device!
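
The command above runs a single one-shot completion (-no-cnv disables conversation mode). To chat with the model interactively in your terminal instead, drop -no-cnv and the prompt; a sketch using the same model and context settings:

# Interactive chat; type a message at the prompt, press Ctrl+C to quit
llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf --no-warmup -b 128 -c 2048 -fa off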

Serving LLMs using llama-server

Next, you can use llama-server to start a web server with a chat interface and an OpenAI-compatible chat completions API.

  1. First, find the IP address of your development board:

    ifconfig | grep -Eo 'inet (addr:)?([0-9]*\.){3}[0-9]*' | grep -Eo '([0-9]*\.){3}[0-9]*' | grep -v '127.0.0.1'
    
    # ... Example:
    # 192.168.1.253
  2. Start the server via:

    llama-server -m ./qwen2-1_5b-instruct-q4_0-pure.gguf --no-warmup -b 128 -c 2048 -s 11 -n 128 --host 0.0.0.0 --port 9876
  3. On your computer, open a web browser and navigate to http://192.168.1.253:9876 (replace the IP address with the one you found in step 1):

    Serving LLMs using llama-server
  4. You can also programmatically access this server using the OpenAI Chat Completions API, e.g. from Python (a curl equivalent is sketched after these steps):

    1. Create a new venv and install requests:

      python3 -m venv .venv-chat
      source .venv-chat/bin/activate
      pip3 install requests
    2. Create a new file chat.py:

      import requests
      
      # if running from your own computer, replace localhost with the IP address of your development board
      url = "http://localhost:9876/v1/chat/completions"
      
      payload = {
          "messages": [
              {"role": "system", "content": "You are a helpful assistant."},
              {"role": "user", "content": "Explain Qualcomm in one sentence."}
          ],
          "temperature": 0.7,
          "max_tokens": 200
      }
      
      response = requests.post(url, headers={ "Content-Type": "application/json" }, json=payload)
      print(response.json())
    3. Run chat.py:

      python3 chat.py
      
      # ...
      # {'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'role': 'assistant', 'content': 'Qualcomm is a leading global technology company that designs, develops, licenses, and markets semiconductor-based products and mobile platform technologies to major telecommunications and consumer electronics manufacturers worldwide.'}}], 'created': 1757073340, 'model': 'gpt-3.5-turbo', 'system_fingerprint': 'b6362-f6da8cb8', 'object': 'chat.completion', 'usage': {'completion_tokens': 34, 'prompt_tokens': 26, 'total_tokens': 60}, 'id': 'chatcmpl-3O7l005WG1DzN191FTNomJNweHMoH8Is', 'timings': {'prompt_n': 12, 'prompt_ms': 303.581, 'prompt_per_token_ms': 25.298416666666668, 'prompt_per_second': 39.52816546490064, 'predicted_n': 34, 'predicted_ms': 4052.23, 'predicted_per_token_ms': 119.18323529411765, 'predicted_per_second': 8.390441806116632}}
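
If you prefer not to use Python, the same endpoint can also be called with curl; a sketch of the equivalent request (replace localhost with your development board's IP address when calling from another machine):

curl http://localhost:9876/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain Qualcomm in one sentence."}
        ],
        "temperature": 0.7,
        "max_tokens": 200
    }'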

Serving multi-modal LLMs

You can also serve multi-modal LLMs, for example SmolVLM-500M-Instruct-GGUF. Download both the model weights (pre-quantized Q4_0, or quantize them yourself) and the CLIP encoder mmproj-*.gguf file. For example:

# Download weights
wget https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/SmolVLM-500M-Instruct-f16.gguf
wget https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/mmproj-SmolVLM-500M-Instruct-f16.gguf

# Quantize model (mmproj- models are not quantizable via llama-quantize, see below)
llama-quantize --pure SmolVLM-500M-Instruct-f16.gguf SmolVLM-500M-Instruct-q4_0-pure.gguf Q4_0

# Serve the model
llama-server -m ./SmolVLM-500M-Instruct-q4_0-pure.gguf --mmproj ./mmproj-SmolVLM-500M-Instruct-f16.gguf --no-warmup -b 128 -c 2048 -s 11 -n 128 --host 0.0.0.0 --port 9876
Serving multi-modal LLMs using llama-server

CLIP model is still fp16: The mmproj model remains in fp16, so processing images will be slow. There's code to quantize the CLIP encoder in older versions of llama.cpp.
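
You can also send images to the multi-modal server through the same chat completions endpoint by embedding them as base64 data URIs, following the OpenAI image_url convention. A sketch, assuming a small test.jpg in the current directory and the server started as above (exact multi-modal API support depends on your llama.cpp build):

# Encode a local image (test.jpg is a placeholder; any small JPEG works)
IMG_B64=$(base64 -w0 ./test.jpg)

curl http://localhost:9876/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'"$IMG_B64"'"}}
                ]
            }
        ],
        "max_tokens": 100
    }'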

Tips & tricks

Comparing CPU performance

Add -ngl 0 to the llama-* commands to skip offloading layers to the GPU. Models will then run on the CPU, so you can compare performance against the GPU.

E.g. for Qwen2-1.5B-Instruct Q4_0 on the RB3 Gen 2 Vision Kit:

GPU:

llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -no-cnv --no-warmup -b 128 -c 2048 -s 11 -n 128 -p "Knock knock, " -fa off

# llama_perf_sampler_print:    sampling time =     225.78 ms /   133 runs   (    1.70 ms per token,   589.06 tokens per second)
# llama_perf_context_print:        load time =    5338.13 ms
# llama_perf_context_print: prompt eval time =     201.32 ms /     5 tokens (   40.26 ms per token,    24.84 tokens per second)
# llama_perf_context_print:        eval time =   13214.35 ms /   127 runs   (  104.05 ms per token,     9.61 tokens per second)
# llama_perf_context_print:       total time =   18958.06 ms /   132 tokens
# llama_perf_context_print:    graphs reused =        122

CPU:

llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -no-cnv --no-warmup -b 128 -c 2048 -s 11 -n 128 -p "Knock knock, " -fa off -ngl 0

# llama_perf_sampler_print:    sampling time =      23.47 ms /   133 runs   (    0.18 ms per token,  5666.08 tokens per second)
# llama_perf_context_print:        load time =     677.25 ms
# llama_perf_context_print: prompt eval time =     253.39 ms /     5 tokens (   50.68 ms per token,    19.73 tokens per second)
# llama_perf_context_print:        eval time =   17751.29 ms /   127 runs   (  139.77 ms per token,     7.15 tokens per second)
# llama_perf_context_print:       total time =   18487.26 ms /   132 tokens
# llama_perf_context_print:    graphs reused =        122

Here the GPU generates tokens roughly a third faster than the CPU (9.61 vs. 7.15 tokens per second).
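
For a more structured comparison you can also use llama-bench, which is built alongside llama-cli and reports prompt processing and generation speed in a table. A sketch, using -ngl to control GPU offload:

# GPU: offload all layers
llama-bench -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -ngl 99

# CPU only
llama-bench -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -ngl 0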
