- NeuralChat, a customizable chatbot framework under Intel® Extension for Transformers, is now available for you to create your own chatbot within minutes! It supports a rich set of plugins, including Knowledge Retrieval, Speech Interaction, Query Caching, and Security Guardrail, and multiple architectures such as Intel® Xeon® Scalable Processors and Habana® Gaudi® Accelerators. Check out the sample code below and give it a try!
 
```python
# pip install intel-extension-for-transformers
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
```

- 💬 NeuralChat v1.1, a chat model fine-tuned from MPT-7B on a mixed set of instruction datasets, is available on Hugging Face, together with the release of INT8 quantization recipes and benchmark results.
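If you want to experiment with the released NeuralChat v1.1 checkpoint directly, a minimal sketch along these lines should work with the standard Hugging Face transformers API. The model ID used here is an assumption based on the release notes; please verify it on the model card on Hugging Face.

```python
# Minimal sketch: load the NeuralChat v1.1 chat model with plain Hugging Face transformers.
# NOTE: the model ID below is an assumption; verify it on the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Intel/neural-chat-7b-v1-1"  # assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# MPT-based models ship custom modeling code, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Tell me about Intel Xeon Scalable Processors.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```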
 
```bash
pip install intel-extension-for-transformers
```

For more installation methods, please refer to the Installation Page.
Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms, and it is particularly effective on the 4th Gen Intel® Xeon® Scalable processor (codenamed Sapphire Rapids). The toolkit provides the following key features and examples:
- Seamless user experience of model compression on Transformer-based models by extending Hugging Face transformers APIs and leveraging Intel® Neural Compressor
- Advanced software optimizations and a unique compression-aware runtime (released with the NeurIPS 2022 papers Fast DistilBERT on CPUs and QuaLA-MiniLM: A Quantized Length Adaptive MiniLM, and the NeurIPS 2021 paper Prune Once for All: Sparse Pre-Trained Language Models)
- Optimized Transformer-based model packages such as Stable Diffusion, GPT-J-6B, GPT-NeoX, BLOOM-176B, T5, and Flan-T5, and end-to-end workflows such as SetFit-based text classification and document-level sentiment analysis (DLSA)
- NeuralChat, a customizable chatbot framework to create your own chatbot within minutes by leveraging a rich set of plugins and SOTA optimizations
- Inference of Large Language Models (LLMs) in pure C/C++ with weight-only quantization kernels; GPT-NeoX, LLaMA, MPT, Falcon, BLOOM-7B, OPT, ChatGLM2-6B, GPT-J-6B, and Dolly-v2-3B are already enabled (see the sketch after this list)
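As a rough illustration of the weight-only quantization path, the sketch below loads an LLM through a Transformers-like Python API with 4-bit weight-only quantization. The `intel_extension_for_transformers.transformers.AutoModelForCausalLM` import path and the `load_in_4bit` argument are assumptions based on later releases; please check the LLM runtime documentation for the API available in your version.

```python
# Sketch of weight-only quantized LLM inference (API names are assumptions; see the LLM runtime docs).
from transformers import AutoTokenizer
# Assumed: the toolkit exposes a drop-in AutoModelForCausalLM that routes to the C/C++ LLM runtime.
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# load_in_4bit=True applies INT4 weight-only quantization before inference (assumed flag).
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

inputs = tokenizer("Once upon a time, there existed a little girl,", return_tensors="pt").input_ids
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```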
 
```python
from datasets import load_dataset, load_metric
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

raw_datasets = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
raw_datasets = raw_datasets.map(
    lambda e: tokenizer(e['sentence'], truncation=True, padding='max_length', max_length=128),
    batched=True,
)
```

```python
from intel_extension_for_transformers.transformers import QuantizationConfig, metrics, objectives
from intel_extension_for_transformers.transformers.trainer import NLPTrainer

config = AutoConfig.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", num_labels=2)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", config=config
)
model.config.label2id = {0: 0, 1: 1}
model.config.id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}
# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
trainer = NLPTrainer(
    model=model,
    train_dataset=raw_datasets["train"],
    eval_dataset=raw_datasets["validation"],
    tokenizer=tokenizer,
)
q_config = QuantizationConfig(metrics=[metrics.Metric(name="eval_loss", greater_is_better=False)])
model = trainer.quantize(quant_config=q_config)

input = tokenizer("I like Intel Extension for Transformers", return_tensors="pt")
output = model(**input).logits.argmax().item()
```

For more quick samples, please refer to the Get Started Page. For more validated examples, please refer to the Supported Model Matrix.
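To turn the predicted class index into a readable label and keep the quantized model around for later use, something along these lines should work; `save_model` is inherited from the underlying `transformers.Trainer`, and the output directory name is purely illustrative.

```python
# Map the predicted class index to its label (same mapping as set on model.config above).
id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}
print(f"Prediction: {id2label[output]}")

# Persist the quantized model for later use (illustrative output path).
trainer.save_model("./quantized-distilbert-sst2")
```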
| Model | FP32 | BF16 | INT8 | 
|---|---|---|---|
| EleutherAI/gpt-j-6B | 4163.67 (ms) | 1879.61 (ms) | 1612.24 (ms) | 
| CompVis/stable-diffusion-v1-4 | 10.33 (s) | 3.02 (s) | N/A | 
Note: For the GPT-J-6B software/hardware configuration, please refer to text-generation. For the Stable Diffusion software/hardware configuration, please refer to text-to-image.
| OVERVIEW | | | |
|---|---|---|---|
| Model Compression | NeuralChat | Neural Engine | Kernel Libraries |
| MODEL COMPRESSION | | | |
| Quantization | Pruning | Distillation | Orchestration |
| Neural Architecture Search | Export | Metrics/Objectives | Pipeline |
| NEURAL ENGINE | | | |
| Model Compilation | Custom Pattern | Deployment | Profiling |
| KERNEL LIBRARIES | | | |
| Sparse GEMM Kernels | Custom INT8 Kernels | Profiling | Benchmark |
| ALGORITHMS | | | |
| Length Adaptive | Data Augmentation | | |
| TUTORIALS AND RESULTS | | | |
| Tutorials | Supported Models | Model Performance | Kernel Performance |
- Blog published on Medium: Faster Stable Diffusion Inference with Intel Extension for Transformers (July 2023)
- Blog on Intel Developer News: The Moat Is Trust, Or Maybe Just Responsible AI (July 2023)
- Blog on Intel Developer News: Create Your Own Custom Chatbot (July 2023)
- Blog on Intel Developer News: Accelerate Llama 2 with Intel AI Hardware and Software Optimizations (July 2023)
- arXiv: An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs (June 2023)
- Blog published on Medium: Simplify Your Custom Chatbot Deployment (June 2023)
- Blog published on Medium: Create Your Own Custom Chatbot (April 2023)
 
View Full Publication List.
You are welcome to raise any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach out to us; we look forward to collaborating with you on Intel Extension for Transformers!