Welcome to Intel® NPU Acceleration Library’s documentation!#
The Intel® NPU Acceleration Library is a Python library designed to boost the efficiency of your applications by leveraging the power of the Intel Neural Processing Unit (NPU) to perform high-speed computations on compatible hardware.
Installation#
Check that your system has an available NPU (how-to).
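If you want a quick programmatic check, the sketch below looks for the NPU device node on Linux (an assumption: the Intel ivpu driver exposes the NPU under /dev/accel; on Windows, look for the NPU in Device Manager instead). The npu_device_present helper is hypothetical and not part of the library.

import os
import platform

def npu_device_present() -> bool:
    # Hypothetical helper: on Linux the Intel NPU (ivpu) driver typically
    # exposes the device as an accel node, e.g. /dev/accel/accel0.
    if platform.system() == "Linux" and os.path.isdir("/dev/accel"):
        return any(name.startswith("accel") for name in os.listdir("/dev/accel"))
    # On other platforms, verify the NPU via the system device manager instead.
    return False

print("NPU device node found:", npu_device_present())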
You can install the package on your machine with
pip install intel-npu-acceleration-library
Run a LLaMA model on the NPU#
To run LLM models, you need to install the transformers library
pip install transformers
You are now up and running! You can create a simple script like the one below to run an LLM on the NPU
from transformers import AutoTokenizer, TextStreamer
from intel_npu_acceleration_library import NPUModelForCausalLM
import torch

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

model = NPUModelForCausalLM.from_pretrained(model_id, use_cache=True, dtype=torch.int8).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

query = input("Ask something: ")
prefix = tokenizer(query, return_tensors="pt")["input_ids"]

generation_kwargs = dict(
    input_ids=prefix,
    streamer=streamer,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    max_new_tokens=512,
)

print("Run inference")
_ = model.generate(**generation_kwargs)
Note that you only need to call intel_npu_acceleration_library.compile to offload the heavy computation to the NPU.
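For reference, here is a minimal sketch of that generic path, assuming an arbitrary torch.nn.Module (the toy network and input shapes are illustrative only):

import torch
import intel_npu_acceleration_library

# Toy network standing in for any torch.nn.Module.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

# Offload the heavy computation to the NPU; dtype mirrors the int8 example above.
optimized_model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

with torch.no_grad():
    output = optimized_model(torch.rand(1, 128))
print(output.shape)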
Feel free to check the Usage and LLM pages and the examples folder for additional use cases and examples.
Site map#
Library overview:
Development guide:
API Reference: