Whisper

Whisper is OpenAI's general-purpose automatic speech recognition (ASR) model. You can use it for audio transcription, translation, and language identification. You can run Whisper on the NPU of your Dragonwing development board using Qualcomm's VoiceAI ASR, or on the CPU using whisper.cpp.

Running Whisper on the NPU with VoiceAI ASR

  1. Open a terminal on your development board, and set up the base requirements for this example:

    sudo apt install -y cmake pulseaudio-utils
  2. Install the AI Runtime SDK - Community Edition:

    wget -qO- https://cdn.edgeimpulse.com/qc-ai-docs/device-setup/install_ai_runtime_sdk_2.35.sh | bash
  3. Install VoiceAI ASR:

    cd ~/
    
    # VoiceAI ASR is available from https://softwarecenter.qualcomm.com/catalog/item/VoiceAI_ASR (mirrored here for convenience)
    wget https://cdn.edgeimpulse.com/qc-ai-docs/sdk/VoiceAI_ASR_2.1.0.0.zip
    unzip VoiceAI_ASR_2.1.0.0.zip -d voiceai_asr
    
    cd voiceai_asr/2.1.0.0/
    
    # Put the path to VoiceAI ASR in your bash_profile (so it's available under VOICEAI_ROOT)
    echo "" >> ~/.bash_profile
    echo "# Begin VoiceAI ASR" >> ~/.bash_profile
    echo "export VOICEAI_ROOT=$PWD" >> ~/.bash_profile
    echo "# End VoiceAI ASR" >> ~/.bash_profile
    echo "" >> ~/.bash_profile
    
    # Re-load the environment variables
    source ~/.bash_profile
    
    # Symlink Whisper libraries
    cd $VOICEAI_ROOT/whisper_sdk/libs/npu/rpc_libraries/linux/whisper_all_quantized/
    sudo ln -s $PWD/*.so /usr/lib/
  4. Build the voice-ai-ref example:

    cd $VOICEAI_ROOT/whisper_sdk/sampleapp/npu_rpc_linux_sample/voice-ai-ref
    
    # overwrite the main.cpp example
    wget -O src/main.cpp https://cdn.edgeimpulse.com/qc-ai-docs/code/voiceai_ref_2.1.0.0_main.cpp
    
    # Symlink Whisper libraries for build
    mkdir -p libs/arm64-v8a/
    cd libs/arm64-v8a/
    ln -s $VOICEAI_ROOT/whisper_sdk/libs/npu/rpc_libraries/linux/whisper_all_quantized/*.so .
    cd ../../
    
    mkdir -p build
    cd build
    cmake ..
    make -j`nproc`
  5. Download a precompiled Whisper model for the NPU:

    mkdir -p ~/whisper_models/model_qnn_226/
    cd ~/whisper_models/model_qnn_226/
    
    ln -s $QAIRT_SRC_ROOT/lib/hexagon-v68/unsigned/libQnnHtpV68Skel.so .
    ln -s $VOICEAI_ROOT/whisper_sdk/libs/npu/rpc_libraries/assets/speech_float.eai .
    
    # TODO: download decoder_model_htp.bin, encoder_model_htp.bin and vocab.bin into this directory
  6. You can now transcribe WAV files:

    cd $VOICEAI_ROOT/whisper_sdk/sampleapp/npu_rpc_linux_sample/voice-ai-ref/build
    
    # Download sample file
    wget -O jfk.wav https://raw.githubusercontent.com/ggml-org/whisper.cpp/refs/heads/master/samples/jfk.wav
    
    # Transcribe:
    ./voice-ai-ref -f jfk.wav -l en -t transcribe -m ~/whisper_models/model_qnn_226/ | grep -v "No usable logger handle was found" | grep -v "Logs will be sent to"
    
    # ... Expected result:
    # VoiceAIRef final result =  And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country. [language: English]
  7. Or even do live transcription:

    1. Connect a microphone to your development board.

    2. Find the name of your microphone:

      pactl list short sources
      # 49	alsa_output.platform-sound.stereo-fallback.monitor	PipeWire	s24-32le 2ch 48000Hz	SUSPENDED
      # 76	alsa_input.usb-046d_C922_Pro_Stream_Webcam_C72F6EDF-02.analog-stereo	PipeWire	s16le 2ch 32000Hz	SUSPENDED
      
      # To use the USB webcam, use "alsa_input.usb-046d_C922_Pro_Stream_Webcam_C72F6EDF-02.analog-stereo" as the name
    3. Run live transcription:

      ./voice-ai-ref -r -l en -t transcribe -m ~/whisper_models/model_qnn_226/ -d "alsa_input.usb-046d_C922_Pro_Stream_Webcam_C72F6EDF-02.analog-stereo" | grep -v "No usable logger handle was found" | grep -v "Logs will be sent to"
      
      # VoiceAIRef final result =  Hi, this is to see if I can do live transcription on my Rubik Pi. [language: English]
  8. 🚀 You now have fully offline transcription of audio on your development board! VoiceAI ASR does not have bindings for higher-level languages (like Python), so if you want to use Whisper in your application, it's easiest to spawn the voice-ai-ref binary and read the results from stdout.
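
For example, here is a minimal Python sketch of that pattern. The binary and model paths are the ones used in the steps above (adjust them for your setup), and the transcript is pulled from the "VoiceAIRef final result =" line shown in the example output:

    import os
    import subprocess

    # Assumed locations based on the build and model steps above -- adjust for your setup.
    # VOICEAI_ROOT is set in ~/.bash_profile in step 3, so it is inherited from a login shell.
    VOICEAI_ROOT = os.environ['VOICEAI_ROOT']
    BINARY = os.path.join(VOICEAI_ROOT, 'whisper_sdk/sampleapp/npu_rpc_linux_sample/voice-ai-ref/build/voice-ai-ref')
    MODEL_DIR = os.path.expanduser('~/whisper_models/model_qnn_226/')

    def transcribe(wav_path):
        # Same invocation as the WAV file example above; capture everything the binary prints
        proc = subprocess.run(
            [BINARY, '-f', wav_path, '-l', 'en', '-t', 'transcribe', '-m', MODEL_DIR],
            capture_output=True, text=True, check=True)
        # The transcript is printed on a line containing "VoiceAIRef final result ="
        # (the returned text still includes the trailing "[language: ...]" tag)
        for line in proc.stdout.splitlines():
            if 'VoiceAIRef final result =' in line:
                return line.split('=', 1)[1].strip()
        return None

    if __name__ == '__main__':
        import sys
        print(transcribe(sys.argv[1]))  # e.g. python3 transcribe_npu.py jfk.wav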

Running Whisper on the CPU with whisper.cpp

Alternatively, you can run Whisper on the CPU (with lower performance) using whisper.cpp (or any of the other popular Whisper libraries).

Here are the instructions for whisper.cpp. Open a terminal on your development board (or an SSH session to it) and run:

  1. Install build dependencies:

    sudo apt update
    sudo apt install -y libsdl2-dev libsdl2-2.0-0 libasound2-dev

  2. Build whisper.cpp:

    mkdir -p ~/dev/llm/
    cd ~/dev/llm/
    
    git clone https://github.com/ggml-org/whisper.cpp.git
    cd whisper.cpp
    git checkout v1.7.6
    
    # Build (CPU)
    cmake -B build-cpu -DWHISPER_SDL2=ON
    cmake --build build-cpu -j`nproc` --config Release
  3. Add the whisper.cpp build directory to your PATH:

    cd ~/dev/llm/whisper.cpp/build-cpu/bin
    
    echo "" >> ~/.bash_profile
    echo "# Begin whisper.cpp" >> ~/.bash_profile
    echo "export PATH=\$PATH:$PWD" >> ~/.bash_profile
    echo "# End whisper.cpp" >> ~/.bash_profile
    echo "" >> ~/.bash_profile
    
    # To use the whisper.cpp files in your current session
    source ~/.bash_profile
  4. You can now transcribe audio using whisper.cpp (see also the scripted example after these steps):

    # Download model
    cd ~/dev/llm/whisper.cpp
    sh ./models/download-ggml-model.sh tiny.en-q5_1
    
    # Transcribe text
    whisper-cli -m models/ggml-tiny.en-q5_1.bin -f samples/jfk.wav
    
    # [00:00:00.000 --> 00:00:10.480]
    # and so my fellow Americans ask not what your country can do for you ask what you can do for your country
  5. You can also transcribe audio live:

    1. Connect a microphone to your development board.

    2. Find your microphone ID:

      SDL_AUDIODRIVER=alsa whisper-stream -m models/ggml-tiny.en-q5_1.bin
      # init: found 2 capture devices:
      # init:    - Capture device #0: 'qcm6490-rb3-vision-snd-card, '
      # init:    - Capture device #1: 'Yeti Stereo Microphone, USB Audio'
      
      # If you want "Yeti Stereo Microphone, USB Audio" then the ID is 1
    3. Start live transcribing:

      SDL_AUDIODRIVER=alsa whisper-stream -m models/ggml-tiny.en-q5_1.bin -c 1
      
      # main: processing 48000 samples (step = 3.0 sec / len = 10.0 sec / keep = 0.2 sec), 4 threads, lang = en, task = transcribe, timestamps = 0 ...
      # main: n_new_line = 2, no_context = 1
      #
      # [Start speaking]
      # This is a test to see if you can transcribe text live on your Qualcomm device
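
If you want to call whisper.cpp from your own application, the same pattern as with voice-ai-ref works: spawn whisper-cli and parse the timestamped segment lines it prints to stdout. A minimal Python sketch, assuming the model and directory layout from the steps above and that whisper-cli is on your PATH (e.g. via the ~/.bash_profile entry added earlier):

    import os
    import re
    import subprocess

    # Assumed path based on the steps above -- adjust for your setup
    WHISPER_DIR = os.path.expanduser('~/dev/llm/whisper.cpp')

    def transcribe(wav_path, model='models/ggml-tiny.en-q5_1.bin'):
        # whisper-cli prints segments to stdout as: [00:00:00.000 --> 00:00:10.480]   <text>
        proc = subprocess.run(
            ['whisper-cli', '-m', model, '-f', wav_path],
            cwd=WHISPER_DIR, capture_output=True, text=True, check=True)
        segments = []
        for line in proc.stdout.splitlines():
            m = re.match(r'\[.+? --> .+?\]\s*(.*)', line.strip())
            if m and m.group(1):
                segments.append(m.group(1))
        return ' '.join(segments)

    if __name__ == '__main__':
        print(transcribe('samples/jfk.wav'))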

Running Whisper on the GPU with OpenCL

You can also build whisper.cpp binaries that run on the GPU:

  1. First follow the steps in llama.cpp under "Install the OpenCL headers and ICD loader library".

  2. Build a binary with OpenCL:

    cd ~/dev/llm/whisper.cpp
    
    cmake -B build-gpu -DGGML_OPENCL=ON -DWHISPER_SDL2=ON
    cmake --build build-gpu -j`nproc` --config Release
    
    # Find the binary in:
    #     build-gpu/bin/whisper-cli

However, this does not run faster than on the CPU, at least on the QCS6490 (even with Q4_0 quantized weights).
