LLMs/VLMs using Llama.cpp

You can run a wide range of Large Language Models (LLMs) and Vision Language Models (VLMs) on your Dragonwing development boards using llama.cpp. Models running under llama.cpp run on the GPU, not on the NPU. You can run a subset of models on the NPU via GENIE.

Building llama.cpp

You'll need to build some dependencies for llama.cpp. Open a terminal on your development board, or an SSH session to it, and run:

  1. Install build dependencies:

    sudo apt update
    sudo apt install -y cmake ninja-build curl libcurl4-openssl-dev build-essential
  2. Install the OpenCL headers and ICD loader library:

    mkdir -p ~/dev/llm
    
    # Symlink the OpenCL shared library
    sudo rm -f /usr/lib/libOpenCL.so
    sudo ln -s /lib/aarch64-linux-gnu/libOpenCL.so.1.0.0 /usr/lib/libOpenCL.so
    
    # OpenCL headers
    cd ~/dev/llm
    git clone https://github.com/KhronosGroup/OpenCL-Headers
    cd OpenCL-Headers
    git checkout 5d52989617e7ca7b8bb83d7306525dc9f58cdd46
    mkdir -p build && cd build
    cmake .. -G Ninja \
        -DBUILD_TESTING=OFF \
        -DOPENCL_HEADERS_BUILD_TESTING=OFF \
        -DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF \
        -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
    cmake --build . --target install
    
    # ICD Loader
    cd ~/dev/llm
    git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader
    cd OpenCL-ICD-Loader
    git checkout 02134b05bdff750217bf0c4c11a9b13b63957b04
    mkdir -p build && cd build
    cmake .. -G Ninja \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" \
        -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
    cmake --build . --target install
    
    # Symlink OpenCL headers
    sudo rm -f /usr/include/CL
    sudo ln -s ~/dev/llm/opencl/include/CL/ /usr/include/CL
  3. Build llama.cpp with the OpenCL backend:

    cd ~/dev/llm
    
    # Clone repository
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    
    # We've tested this commit explicitly, you can try master if you want bleeding edge
    git checkout f6da8cb86a28f0319b40d9d2a957a26a7d875f8c
    
    # Build
    mkdir -p build
    cd build
    cmake .. -G Ninja \
        -DCMAKE_BUILD_TYPE=Release \
        -DBUILD_SHARED_LIBS=OFF \
        -DGGML_OPENCL=ON
    ninja -j`nproc`
  4. Add the llama.cpp paths to your PATH:

    cd ~/dev/llm/llama.cpp/build/bin
    
    echo "" >> ~/.bash_profile
    echo "# Begin llama.cpp" >> ~/.bash_profile
    echo "export PATH=\$PATH:$PWD" >> ~/.bash_profile
    echo "# End llama.cpp" >> ~/.bash_profile
    echo "" >> ~/.bash_profile
    
    # To use the llama.cpp files in your current session
    source ~/.bash_profile
  5. You now have llama.cpp:

    llama-cli --version
    # ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'
    # ggml_opencl: device: 'QUALCOMM Adreno(TM) 635 (OpenCL 3.0 Adreno(TM) 635)'
    # ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: 0808.0.7 Compiler E031.49.02.00
    # ggml_opencl: vector subgroup broadcast support: true

Downloading and quantizing a model

To run GPU-accelerated models you'll want pure 4-bit quantized (Q4_0) models in GGUF format (the llama.cpp format; see the conversion guide). You can either find pre-quantized models, or quantize a model yourself using llama-quantize. For example, for Qwen2-1.5B-Instruct:

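A minimal sketch of one way to do this (the Hugging Face repository and file names below are assumptions; check the model page for the exact names):

    cd ~/dev/llm

    # Download the f16 GGUF weights (or grab a ready-made Q4_0 file if one exists)
    curl -L -O https://huggingface.co/Qwen/Qwen2-1.5B-Instruct-GGUF/resolve/main/qwen2-1_5b-instruct-fp16.gguf

    # Quantize to pure 4-bit (Q4_0)
    llama-quantize qwen2-1_5b-instruct-fp16.gguf qwen2-1_5b-instruct-q4_0.gguf Q4_0
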
Running your first LLM using llama-cli

You're now ready to run the LLM via llama-cli. It'll automatically offload layers to the GPU:

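A sketch, assuming the quantized model file from the previous section (adjust the file name and prompt to your setup):

    cd ~/dev/llm
    llama-cli -m qwen2-1_5b-instruct-q4_0.gguf -p "Write a short poem about embedded Linux" -n 128
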
🚀 You now have an LLM running on the GPU of your device!

Serving LLMs using llama-server

Next, you can use llama-server to start a web server with a chat interface and an OpenAI-compatible chat completions API.

  1. First, find the IP address of your development board:

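    # A sketch: print the board's IP addresses and pick the one on your local network
    hostname -I
    # e.g. 192.168.1.253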
  2. Start the server via:

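    cd ~/dev/llm

    # A sketch: serve the quantized model from earlier on port 9876, reachable from
    # other machines on your network (adjust the model file name to your setup)
    llama-server -m qwen2-1_5b-instruct-q4_0.gguf --host 0.0.0.0 --port 9876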
  3. On your computer, open a web browser and navigate to http://192.168.1.253:9876 (replace the IP address with the one you found in step 1).

  4. You can also programmatically access this server using the OpenAI Chat Completions API. E.g. from Python:

    1. Create a new venv and install requests:

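      # A sketch; run this on the machine you'll call the API from
      python3 -m venv .venv
      source .venv/bin/activate
      pip3 install requests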
    2. Create a new file chat.py:

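      # chat.py -- a minimal sketch of calling llama-server's OpenAI-compatible
      # chat completions endpoint with requests. Replace the IP address with the
      # one you found in step 1; the model name is only a label here.
      import requests

      resp = requests.post(
          "http://192.168.1.253:9876/v1/chat/completions",
          json={
              "model": "qwen2-1.5b-instruct",
              "messages": [
                  {"role": "system", "content": "You are a helpful assistant."},
                  {"role": "user", "content": "Write a haiku about GPUs."},
              ],
          },
          timeout=120,
      )
      resp.raise_for_status()
      print(resp.json()["choices"][0]["message"]["content"])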
    3. Run chat.py:
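
      # Run the script (inside the venv from step 1)
      python3 chat.py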

Serving multi-modal LLMs

You can also use multi-modal LLMs, for example SmolVLM-500M-Instruct-GGUF. Download both the Q4_0 quantized weights (or quantize them yourself) and the CLIP encoder mmproj-*.gguf file. For example:

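A sketch of one way to do this (the repository URL and file names are assumptions; check the model page for the exact names, and note that --mmproj support in llama-server depends on your llama.cpp version):

    cd ~/dev/llm

    # Model weights and CLIP encoder (file names are assumptions)
    curl -L -O https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/SmolVLM-500M-Instruct-f16.gguf
    curl -L -O https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/mmproj-SmolVLM-500M-Instruct-f16.gguf

    # Quantize the weights to pure 4-bit (Q4_0)
    llama-quantize SmolVLM-500M-Instruct-f16.gguf SmolVLM-500M-Instruct-Q4_0.gguf Q4_0

    # Serve the model together with its CLIP encoder
    llama-server -m SmolVLM-500M-Instruct-Q4_0.gguf \
        --mmproj mmproj-SmolVLM-500M-Instruct-f16.gguf \
        --host 0.0.0.0 --port 9876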

CLIP model is still fp16: The mmproj model is still fp16, so processing images will be slow. There's code to quantize the CLIP encoder in older versions of llama.cpp.

Tips & tricks

Comparing CPU performance

Add -ngl 0 to the llama-* commands to skip offloading layers to the GPU. Models will then run on the CPU, letting you compare performance against the GPU.

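A sketch, reusing the Q4_0 model from earlier (only the -ngl 0 flag differs between the two runs):

    # GPU (default)
    llama-cli -m qwen2-1_5b-instruct-q4_0.gguf -p "Write a short poem about embedded Linux" -n 128

    # CPU only
    llama-cli -m qwen2-1_5b-instruct-q4_0.gguf -p "Write a short poem about embedded Linux" -n 128 -ngl 0
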
E.g. for the Qwen2-1.5B-Instruct Q4_0 on the RB3 Gen 2 Vision Kit:

GPU:

CPU:

Here the GPU evaluates tokens ~33% faster than the CPU.
