Run SLMs on Snapdragon devices with NPUs

Learn how to run SLMs on Snapdragon devices with ONNX Runtime.

Models

The models currently supported are:

  • Phi-3.5 mini instruct
  • Llama 3.2 3B

Devices with Snapdragon NPUs require models in a specific size and format.

Instructions to generate models in this format can be found in Build models for Snapdragon.

Once you have built or downloaded the model, place the model assets in a known location (an example layout is shown after this list). These assets consist of:

  • genai_config.json
  • tokenizer.json
  • tokenizer_config.json
  • special_tokens_map.json
  • quantizer.onnx
  • dequantizer.onnx
  • position-processor.onnx
  • a set of transformer model binaries
    • Qualcomm context binaries (*.bin)
    • Context binary metadata (*.json)
    • ONNX wrapper models (*.onnx)
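For example, a Phi-3.5 mini instruct folder might look like the following. This is only an illustrative sketch: the exact names of the transformer model binaries vary by model, so the <transformer_part> entries below are placeholders rather than real file names.

    models\Phi-3.5-mini-instruct\
        genai_config.json
        tokenizer.json
        tokenizer_config.json
        special_tokens_map.json
        quantizer.onnx
        dequantizer.onnx
        position-processor.onnx
        <transformer_part>.onnx      ONNX wrapper model
        <transformer_part>.bin       Qualcomm context binary
        <transformer_part>.json      context binary metadata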

Python application

If your device has Python installed, you can run a simple question-and-answer script to query the model.

Install the runtime

pip install onnxruntime-genai 
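To confirm the package installed correctly, an optional sanity check is to import it from the command line (this only verifies that the module loads):

python -c "import onnxruntime_genai as og; print('onnxruntime-genai imported')"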

Download the script

curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-qa.py -o model-qa.py 

Run the script

This script assumes that the model assets are in a folder called models\Phi-3.5-mini-instruct.

python .\model-qa.py -e cpu -g -v --system_prompt "You are a helpful assistant. Be brief and concise." --chat_template "<|user|>\n{input} <|end|>\n<|assistant|>" -m ..\..\models\Phi-3.5-mini-instruct 
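Only the Phi-3.5 mini instruct invocation is shown above. To query Llama 3.2 3B instead, point -m at the folder containing the Llama assets and swap in a Llama 3 chat template. The command below is a sketch: the folder name is a placeholder and the chat template is assumed from the standard Llama 3 instruct format, so verify it against the tokenizer_config.json shipped with your model.

python .\model-qa.py -e cpu -g -v --system_prompt "You are a helpful assistant. Be brief and concise." --chat_template "<|start_header_id|>user<|end_header_id|>\n{input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>" -m ..\..\models\Llama-3.2-3B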

A look inside the Python script

The complete Python script is published here: https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/model-qa.py. The script uses the API in the following standard way:
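The snippets in the steps below use the og alias for the onnxruntime-genai Python module, which the script imports at the top:

    import onnxruntime_genai as og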

  1. Load the model

    model = og.Model(config) 

    This loads the model into memory. The config object passed in is created from the model folder; see the sketch after this list.

  2. Create pre-processors and tokenize the system prompt

    tokenizer = og.Tokenizer(model)
    tokenizer_stream = tokenizer.create_stream()  # Optional
    system_tokens = tokenizer.encode(system_prompt)

    This creates a tokenizer and a tokenizer stream, which allows tokens to be returned to the user as they are generated. The system prompt is tokenized once up front so it can be prepended to every user prompt in the loop below.

  3. Interactive input loop

    while True:
        # Read prompt
        # Run the generation, streaming the output tokens
  4. Generative loop

    # 1. Pre-process the prompt into tokens
    input_tokens = tokenizer.encode(prompt)

    # 2. Create parameters and generator (KV cache etc) and process the prompt
    params = og.GeneratorParams(model)
    params.set_search_options(**search_options)
    generator = og.Generator(model, params)
    generator.append_tokens(system_tokens + input_tokens)

    # 3. Loop until all output tokens are generated, printing out the decoded token
    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        print(tokenizer_stream.decode(new_token), end="", flush=True)
    print()

    # Delete the generator to free the captured graph before creating another one
    del generator
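The steps above reference a config object and a search_options dictionary that the script sets up before entering the loop. The sketch below shows one way to create them, assuming the og.Config API used by the published script; the model path is a placeholder and the search options are examples, not required values.

    import onnxruntime_genai as og

    # Placeholder: the folder containing genai_config.json and the model binaries
    model_path = r"..\..\models\Phi-3.5-mini-instruct"

    # Describes the model assets and execution provider settings
    config = og.Config(model_path)
    model = og.Model(config)

    # Forwarded to the generator via params.set_search_options(**search_options)
    search_options = {"max_length": 1024, "do_sample": False}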

C++ Application

To run the models on the Snapdragon NPU within a C++ application, use the code from here: https://github.com/microsoft/onnxruntime-genai/tree/main/examples/c.

Building and running this application requires a Windows PC with a Snapdragon NPU, as well as:

  • cmake
  • Visual Studio 2022

Clone the repo

git clone https://github.com/microsoft/onnxruntime-genai
cd onnxruntime-genai\examples\c

Install onnxruntime

The sample currently requires a nightly build of onnxruntime, because it depends on recent changes to QNN support for language models.

Download the nightly version of the ONNX Runtime QNN binaries (the Microsoft.ML.OnnxRuntime.QNN NuGet package) and extract the native libraries:

mkdir onnxruntime-win-arm64-qnn
move Microsoft.ML.OnnxRuntime.QNN.1.22.0-dev-20250225-0548-e46c0d8.nupkg onnxruntime-win-arm64-qnn
cd onnxruntime-win-arm64-qnn
tar xvzf Microsoft.ML.OnnxRuntime.QNN.1.22.0-dev-20250225-0548-e46c0d8.nupkg
copy runtimes\win-arm64\native\* ..\..\..\lib
cd ..

Install onnxruntime-genai

curl -L https://github.com/microsoft/onnxruntime-genai/releases/download/v0.6.0/onnxruntime-genai-0.6.0-win-arm64.zip -o onnxruntime-genai-win-arm64.zip
tar xvf onnxruntime-genai-win-arm64.zip
cd onnxruntime-genai-0.6.0-win-arm64
copy include\* ..\include
copy lib\* ..\lib
cd ..

Build the sample

cmake -A arm64 -S . -B build -DPHI3-QA=ON
cd build
cmake --build . --config Release

Run the sample

cd Release
.\phi3_qa.exe <path_to_model>

A look inside the C++ sample

The C++ application is published here: https://github.com/microsoft/onnxruntime-genai/blob/main/examples/c/src/phi3_qa.cpp. It uses the API in the following standard way:

  1. Load the model

    auto model = OgaModel::Create(*config); 

    This loads the model into memory.

  2. Create pre-processors

    auto tokenizer = OgaTokenizer::Create(*model);
    auto tokenizer_stream = OgaTokenizerStream::Create(*tokenizer);

    This creates a tokenizer and a tokenizer stream, which allows tokens to be returned to the user as they are generated.

  3. Interactive input loop

    while (true) {
        // Read prompt
        // Run the generation, streaming the output tokens
    }
  4. Generative loop

    // 1. Pre-process the prompt into tokens
    auto sequences = OgaSequences::Create();
    tokenizer->Encode(prompt.c_str(), *sequences);

    // 2. Create parameters and generator (KV cache etc) and process the prompt
    auto params = OgaGeneratorParams::Create(*model);
    params->SetSearchOption("max_length", 1024);
    auto generator = OgaGenerator::Create(*model, *params);
    generator->AppendTokenSequences(*sequences);

    // 3. Loop until all output tokens are generated, printing out the decoded token
    while (!generator->IsDone()) {
      generator->GenerateNextToken();

      if (is_first_token) {
        timing.RecordFirstTokenTimestamp();
        is_first_token = false;
      }

      const auto num_tokens = generator->GetSequenceCount(0);
      const auto new_token = generator->GetSequenceData(0)[num_tokens - 1];
      std::cout << tokenizer_stream->Decode(new_token) << std::flush;
    }