- [2024/01] Supported INT4 inference on Intel GPUs including Intel Data Center GPU Max Series (e.g., PVC) and Intel Arc A-Series (e.g., ARC). Check out the examples and scripts.
- [2024/01] Demonstrated Intel Hybrid Copilot at the CES 2024 Great Minds Session "Bringing the Limitless Potential of AI Everywhere".
- [2023/12] Supported QLoRA on CPUs to make fine-tuning on client CPUs possible. Check out the blog and readme for more details.
- [2023/11] Released the top-1 7B-sized LLM NeuralChat-v3-1 and its DPO dataset. Check out the nice video published by WorldofAI.
- [2023/11] Published a 4-bit chatbot demo (based on NeuralChat) available on the Intel Hugging Face Space. Feel free to give it a try! To set up the demo locally, please follow the instructions.
 
```bash
pip install intel-extension-for-transformers
```

For more installation methods, please refer to the Installation Page.
Intel® Extension for Transformers is an innovative toolkit designed to accelerate GenAI/LLM everywhere with the optimal performance of Transformer-based models on various Intel platforms, including Intel Gaudi2, Intel CPU, and Intel GPU. The toolkit provides the following key features and examples:
- Seamless user experience of model compression on Transformer-based models by extending Hugging Face transformers APIs and leveraging Intel® Neural Compressor (a minimal quantization sketch follows this list)
- Advanced software optimizations and a unique compression-aware runtime (released with NeurIPS 2022's papers Fast DistilBERT on CPUs and QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, and NeurIPS 2021's paper Prune Once for All: Sparse Pre-Trained Language Models)
- Optimized Transformer-based model packages such as Stable Diffusion, GPT-J-6B, GPT-NEOX, BLOOM-176B, T5, Flan-T5, and end-to-end workflows such as SetFit-based text classification and document-level sentiment analysis (DLSA)
- NeuralChat, a customizable chatbot framework for creating your own chatbot within minutes by leveraging a rich set of plugins such as Knowledge Retrieval, Speech Interaction, Query Caching, and Security Guardrail. This framework supports Intel Gaudi2/CPU/GPU.
- Inference of Large Language Models (LLMs) in pure C/C++ with weight-only quantization kernels for Intel CPU and Intel GPU (TBD), supporting GPT-NEOX, LLAMA, MPT, FALCON, BLOOM-7B, OPT, ChatGLM2-6B, GPT-J-6B, and Dolly-v2-3B. The AMX, VNNI, AVX512F, and AVX2 instruction sets are supported. We've boosted the performance of Intel CPUs, with a particular focus on the 4th generation Intel Xeon Scalable processor, codenamed Sapphire Rapids.
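
The compression path in the first bullet wraps Intel® Neural Compressor behind the familiar transformers interface. As a rough indication of what happens underneath, below is a minimal sketch of post-training dynamic INT8 quantization using Neural Compressor directly (this is the underlying library, not the extended trainer API itself, and the checkpoint is just an illustrative choice):

```python
from transformers import AutoModelForSequenceClassification
from neural_compressor import PostTrainingQuantConfig, quantization

# Illustrative checkpoint; any PyTorch Hugging Face model can be quantized the same way
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

# Dynamic quantization needs no calibration data: weights are converted to INT8,
# activations are quantized on the fly at inference time
config = PostTrainingQuantConfig(approach="dynamic")
q_model = quantization.fit(model=model, conf=config)

# Save the quantized model for later INT8 inference
q_model.save("./saved_int8_model")
```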
 
| Hardware | Fine-Tuning (Full) | Fine-Tuning (PEFT) | Inference (8-bit) | Inference (4-bit) |
|---|---|---|---|---|
| Intel Gaudi2 | ✔ | ✔ | WIP (FP8) | - |
| Intel Xeon Scalable Processors | ✔ | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
| Intel Xeon CPU Max Series | ✔ | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
| Intel Data Center GPU Max Series | WIP | WIP | WIP (INT8) | ✔ (INT4) |
| Intel Arc A-Series | - | - | WIP (INT8) | ✔ (INT4) |
| Intel Core Processors | - | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
In the table above, "-" means not applicable or not started yet.
| Software | Fine-Tuning (Full) | Fine-Tuning (PEFT) | Inference (8-bit) | Inference (4-bit) |
|---|---|---|---|---|
| PyTorch | 2.0.1+cpu, 2.0.1a0 (gpu) | 2.0.1+cpu, 2.0.1a0 (gpu) | 2.1.0+cpu, 2.0.1a0 (gpu) | 2.1.0+cpu, 2.0.1a0 (gpu) |
| Intel® Extension for PyTorch | 2.1.0+cpu, 2.0.110+xpu | 2.1.0+cpu, 2.0.110+xpu | 2.1.0+cpu, 2.0.110+xpu | 2.1.0+cpu, 2.0.110+xpu |
| Transformers | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) |
| Synapse AI | 1.13.0 | 1.13.0 | 1.13.0 | 1.13.0 |
| Gaudi2 driver | 1.13.0-ee32e42 | 1.13.0-ee32e42 | 1.13.0-ee32e42 | 1.13.0-ee32e42 |
| intel-level-zero-gpu | 1.3.26918.50-736~22.04 | 1.3.26918.50-736~22.04 | 1.3.26918.50-736~22.04 | 1.3.26918.50-736~22.04 |
Please refer to the detailed requirements for CPU, Gaudi2, and Intel GPU.
Below is the sample code to create your chatbot. See more examples.
NeuralChat provides OpenAI-compatible RESTful APIs for chat, so you can use NeuralChat as a drop-in replacement for OpenAI APIs. You can start the NeuralChat server using either a shell command or Python code.
```bash
# Shell Command
neuralchat_server start --config_file ./server/config/neuralchat.yaml
```

```python
# Python Code
from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor

server_executor = NeuralChatServerExecutor()
server_executor(config_file="./server/config/neuralchat.yaml", log_file="./neuralchat.log")
```

The NeuralChat service can be accessed through the OpenAI client library, curl commands, and the requests library. See more in NeuralChat.
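
Once the server is up, any OpenAI-style chat client can talk to it. Below is a minimal sketch using the requests library; the host, port, and model name are assumptions for a typical local deployment and are actually determined by your neuralchat.yaml:

```python
import requests

# Hypothetical local endpoint; the real host/port come from ./server/config/neuralchat.yaml
url = "http://127.0.0.1:8000/v1/chat/completions"
payload = {
    "model": "Intel/neural-chat-7b-v3-1",  # assumed served model name
    "messages": [{"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."}],
}
response = requests.post(url, json=payload, timeout=300)
print(response.json()["choices"][0]["message"]["content"])
```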
```python
from intel_extension_for_transformers.neural_chat import build_chatbot

chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
```
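
The chatbot can also be customized through the plugins mentioned above (Knowledge Retrieval, Speech Interaction, etc.). A minimal sketch of enabling the retrieval plugin is shown below; the exact plugin attribute names and arguments are assumptions based on NeuralChat's plugin configuration and should be checked against the NeuralChat documentation:

```python
from intel_extension_for_transformers.neural_chat import PipelineConfig, build_chatbot, plugins

# Assumed plugin switches: enable knowledge retrieval over a local document folder
plugins.retrieval.enable = True
plugins.retrieval.args["input_path"] = "./docs/"  # hypothetical folder with your documents

config = PipelineConfig(plugins=plugins)
chatbot = build_chatbot(config)
response = chatbot.predict("What does the attached documentation say about INT4 inference?")
```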
Below is the sample code to use the extended Transformers APIs. See more examples.

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs)
```

You can also load low-bit models quantized with the GPTQ/AWQ/RTN/AutoRound algorithms.
```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

# Download a Hugging Face GPTQ/AWQ model or use a locally quantized model
model_name = "PATH_TO_MODEL"  # local path to the model
woq_config = WeightOnlyQuantConfig(use_gptq=True)  # use_awq=True for AWQ; use_autoround=True for AutoRound
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True)
outputs = model.generate(inputs)
```

Below is the sample code for INT4 inference on Intel GPU:

```python
import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
from transformers import AutoTokenizer

device_map = "xpu"
model_name = "Qwen/Qwen-7B"
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device_map)

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map=device_map, load_in_4bit=True)
model = ipex.optimize_transformers(model, inplace=True, dtype=torch.float16, woq=True, device=device_map)

output = model.generate(inputs)
```

Note: Please refer to the example and script for more details.
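
In all of the generation snippets above, `generate` returns token IDs rather than text; a standard Hugging Face call converts them back into a string, e.g. for the `outputs` tensor from the CPU examples:

```python
# Decode the generated token IDs into readable text (standard transformers API)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```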
Below is the sample code to use the extended LangChain APIs. See more examples.
```python
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import VectorStoreRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma

retriever = VectorStoreRetriever(vectorstore=Chroma(...))
retrievalQA = RetrievalQA.from_llm(llm=HuggingFacePipeline(...), retriever=retriever)
```
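
For context, the vectorstore passed to the retriever above has to be built first. The following is a minimal sketch, assuming the extended Chroma class follows the standard LangChain vectorstore interface; the document loader, file name, and embedding model are illustrative choices, not defaults of this toolkit:

```python
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from intel_extension_for_transformers.langchain.vectorstores import Chroma

# Hypothetical corpus and embedding model, for illustration only
docs = TextLoader("intel_xeon_overview.txt").load()
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Build the vector store used by the retriever in the snippet above
vectorstore = Chroma.from_documents(documents=docs, embedding=embeddings)
```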
You can access the validated models, accuracy, and performance from the Release data or the Medium blog.

| OVERVIEW | | | |
|---|---|---|---|
| NeuralChat | Neural Speed | | |
| NEURALCHAT | | | |
| Chatbot on Intel CPU | Chatbot on Intel GPU | Chatbot on Gaudi | |
| Chatbot on Client | More Notebooks | | |
| NEURAL SPEED | | | |
| Neural Speed | Streaming LLM | Low Precision Kernels | Tensor Parallelism |
| LLM COMPRESSION | | | |
| SmoothQuant (INT8) | Weight-only Quantization (INT4/FP4/NF4/INT8) | QLoRA on CPU | |
| GENERAL COMPRESSION | | | |
| Quantization | Pruning | Distillation | Orchestration |
| Neural Architecture Search | Export | Metrics | Objectives |
| Pipeline | Length Adaptive | Early Exit | Data Augmentation |
| TUTORIALS & RESULTS | | | |
| Tutorials | LLM List | General Model List | Model Performance |
- LLM Infinite Inference (up to 4M tokens)
 
streamingLLM_v2.mp4
- LLM QLoRA on Client CPU
 
QLoRA.on.Core.i9-12900.mp4
- CES 2024 Great Minds Keynote: Bringing the Limitless Potential of AI Everywhere: Intel Hybrid Copilot demo (Jan 2024)
- Blog published on Medium: Connect an AI agent with your API: Intel Neural-Chat 7b LLM can replace Open AI Function Calling (Dec 2023)
- NeurIPS'2023 Workshop on Efficient Natural Language and Speech Processing: Efficient LLM Inference on CPUs (Nov 2023)
- Blog published on Hugging Face: Intel Neural-Chat 7b: Fine-Tuning on Gaudi2 for Top LLM Performance (Nov 2023)
- Blog published on VMware: AI without GPUs: A Technical Brief for VMware Private AI with Intel (Nov 2023)
 
- Excellent open-source projects: bitsandbytes, FastChat, fastRAG, ggml, gptq, llama.cpp, lm-evaluation-harness, peft, trl, streamingllm, and many others.
- Thanks to all the contributors.
 
You are welcome to raise any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach out to us, and we look forward to collaborating with you on Intel Extension for Transformers!