Commit e6ecb21 — Add documentation for LLM quantization recipes (#1095)
Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>
This repository was archived by the owner on Oct 25, 2024. It is now read-only.

# Step-by-Step recipes for LLM quantization

This document provides step-by-step instructions for running large language models (LLMs) on 4th Gen Intel® Xeon® Scalable Processors (codenamed Sapphire Rapids) with [PyTorch](https://pytorch.org/) and [Intel® Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch).

The script [run_generation.py](./run_generation.py) provides two quantization approaches (SmoothQuant and weight-only quantization) based on [Intel® Neural Compressor](https://github.com/intel/neural-compressor) and reports last-token prediction accuracy via `trainer`.
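SmoothQuant moves quantization difficulty from activations to weights: each input channel `j` gets a scale `s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)`, activations are divided by `s` and the matching weight rows multiplied by `s`, so the matmul output is unchanged while activation outliers shrink. A minimal numeric sketch of this identity in pure Python (illustrative only, not the Intel® Neural Compressor implementation; `alpha` here corresponds to the `--alpha` flag in the recipes below):

```python
# SmoothQuant scale migration: (X / s) @ (s * W) == X @ W.
# Per input channel j: s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha).

def smooth_scales(X, W, alpha=0.5):
    """Per-channel smoothing scales for activations X (n x k) and weights W (k x m)."""
    k = len(W)
    act_max = [max(abs(X[i][j]) for i in range(len(X))) for j in range(k)]
    w_max = [max(abs(w) for w in W[j]) for j in range(k)]
    return [act_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in range(k)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

X = [[6.0, 0.5], [-4.0, 0.25]]       # toy activations with one outlier channel
W = [[0.1, 0.2], [0.8, -0.4]]
s = smooth_scales(X, W, alpha=0.85)

Xs = [[x / s[j] for j, x in enumerate(row)] for row in X]   # smoothed activations
Ws = [[w * s[j] for w in W[j]] for j in range(len(W))]      # compensated weights

ref = matmul(X, W)
out = matmul(Xs, Ws)
# The output is numerically identical; only the value ranges moved.
assert all(abs(a - b) < 1e-9 for ra, rb in zip(ref, out) for a, b in zip(ra, rb))
```

A larger `alpha` migrates more difficulty onto the weights, which is why the outlier-heavy models in these recipes use values around 0.85–0.9.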
# Validated Models

| Model Name                                             |
| :----------------------------------------------------: |
| [EleutherAI/gpt-j-6b](#eleutheraigpt-j-6b)             |
| [facebook/opt-1.3b](#facebookopt-13b)                  |
| [facebook/opt-30b](#facebookopt-30b)                   |
| [meta-llama/Llama-2-7b-hf](#meta-llamallama-2-7b-hf)   |
| [meta-llama/Llama-2-13b-hf](#meta-llamallama-2-13b-hf) |
| [meta-llama/Llama-2-70b-hf](#meta-llamallama-2-70b-hf) |
| [tiiuae/falcon-40b](#tiiuaefalcon-40b)                 |
# Prerequisite

```bash
# Install Intel® Extension for Transformers (ITREX) from source at the validated commit
git clone https://github.com/intel/intel-extension-for-transformers.git
cd intel-extension-for-transformers
git checkout a4aba8ddb07c9b744b6ac106502ec059e0c47960
pip install -r requirements.txt
pip install -v .

# Install the example requirements and pinned dependencies
cd examples/huggingface/pytorch/text-generation/quantization
pip install -r requirements.txt
pip install neural-compressor==2.4.1
pip install transformers==4.32.0
pip install torch==2.1.1+cpu --index-url https://download.pytorch.org/whl/cpu
pip install intel-extension-for-pytorch==2.1.100
pip uninstall lm_eval -y
pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@cc9778fbe4fa1a709be2abed9deb6180fd40e7e2
```
# Run Quantization and evaluate INT8 accuracy

## EleutherAI/gpt-j-6b

### SmoothQuant

```bash
python run_generation.py \
  --model EleutherAI/gpt-j-6b \
  --output_dir ./saved_results \
  --trust_remote_code True \
  --fallback_add \
  --tasks lambada_openai \
  --int8 --sq --accuracy \
  --batch_size 1 \
  --alpha 0.85
```
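The `--accuracy` flag with `--tasks lambada_openai` reports last-token prediction accuracy: the model must predict the final token of each passage, and a sample counts as correct only when the argmax over the vocabulary matches it. A toy sketch of that metric (hedged, not the lm-evaluation-harness code):

```python
# Last-token prediction accuracy: the prediction is the argmax over the
# vocabulary at the final position; score 1 if it matches the held-out token.

def last_token_accuracy(logits_batch, targets):
    """logits_batch: per-sample vocabulary logits at the last position."""
    correct = sum(
        1 for logits, tgt in zip(logits_batch, targets)
        if max(range(len(logits)), key=logits.__getitem__) == tgt
    )
    return correct / len(targets)

# Toy batch: 3 samples over a 4-token vocabulary.
logits = [
    [0.1, 2.0, -1.0, 0.3],   # argmax = 1
    [1.5, 0.2, 0.1, 0.0],    # argmax = 0
    [0.0, 0.1, 0.2, 3.0],    # argmax = 3
]
targets = [1, 2, 3]
print(last_token_accuracy(logits, targets))  # 2 of 3 correct
```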

### Weight-Only Quantization

```bash
python run_generation.py \
  --model EleutherAI/gpt-j-6b \
  --output_dir ./saved_results \
  --woq \
  --accuracy
```
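`--woq` quantizes only the weights (activations stay in floating point), typically with a round-to-nearest (RTN) style scheme; treating RTN as the default here is our assumption, since the command passes no further weight-only options. A per-channel INT8 RTN sketch in pure Python (illustrative, not the Neural Compressor kernel):

```python
# Round-to-nearest (RTN) weight-only quantization, per output channel:
# q = clamp(round(w / scale), -128, 127), with scale = max|w| / 127.
# Activations remain fp; weights are dequantized as q * scale at matmul time.

def rtn_quantize_row(row, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1                      # 127 for INT8
    scale = max(abs(w) for w in row) / qmax or 1.0    # guard all-zero rows
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]

row = [0.6, -1.834, 0.3, 2.54]
q, scale = rtn_quantize_row(row)
deq = dequantize_row(q, scale)
err = max(abs(a - b) for a, b in zip(row, deq))
# The reconstruction error is bounded by half a quantization step (scale / 2).
print(q, err)
```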

## facebook/opt-1.3b

### SmoothQuant

```bash
python run_generation.py \
  --model facebook/opt-1.3b \
  --output_dir ./saved_results \
  --trust_remote_code True \
  --tasks lambada_openai \
  --int8 --sq --accuracy \
  --batch_size 1 \
  --alpha 0.9
```

### Weight-Only Quantization

```bash
python run_generation.py \
  --model facebook/opt-1.3b \
  --output_dir ./saved_results \
  --woq \
  --accuracy
```

## facebook/opt-30b

### SmoothQuant

```bash
python run_generation.py \
  --model facebook/opt-30b \
  --output_dir ./saved_results \
  --trust_remote_code True \
  --tasks lambada_openai \
  --int8 --sq --accuracy \
  --batch_size 1 \
  --alpha 0.5
```

### Weight-Only Quantization

```bash
python run_generation.py \
  --model facebook/opt-30b \
  --output_dir ./saved_results \
  --woq \
  --accuracy
```

## meta-llama/Llama-2-7b-hf

### SmoothQuant

```bash
python run_generation.py \
  --model meta-llama/Llama-2-7b-hf \
  --output_dir ./saved_results \
  --trust_remote_code True \
  --calib_len 2048 \
  --fallback_add \
  --calib_shuffle False \
  --tasks lambada_openai \
  --int8 --sq --accuracy \
  --batch_size 1 \
  --recipes "{'smooth_quant': True, 'smooth_quant_args': {'alpha': 'auto', 'folding': False, 'default_alpha': 0.8, 'auto_alpha_args': {'alpha_min': 0.8, 'alpha_max': 0.99, 'alpha_step': 0.01, 'shared_criterion': 'mean'}}}"
```
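The `--recipes` value is one shell-quoted string containing a Python-literal dictionary; with `'alpha': 'auto'`, the tuner searches alpha over `[alpha_min, alpha_max]` in `alpha_step` increments. Assuming the string is parsed as a Python literal, its structure and the implied search grid can be inspected like this (a sketch, not the script's own parsing code):

```python
import ast

# The --recipes value from the command above, as one string.
recipes_str = ("{'smooth_quant': True, 'smooth_quant_args': {'alpha': 'auto', "
               "'folding': False, 'default_alpha': 0.8, 'auto_alpha_args': "
               "{'alpha_min': 0.8, 'alpha_max': 0.99, 'alpha_step': 0.01, "
               "'shared_criterion': 'mean'}}}")

recipes = ast.literal_eval(recipes_str)   # literals only, unlike eval()
args = recipes['smooth_quant_args']['auto_alpha_args']

# Candidate alphas the auto-tuner would sweep under these settings.
n_steps = round((args['alpha_max'] - args['alpha_min']) / args['alpha_step'])
candidates = [round(args['alpha_min'] + i * args['alpha_step'], 2)
              for i in range(n_steps + 1)]
print(len(candidates), candidates[0], candidates[-1])  # 20 0.8 0.99
```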

### Weight-Only Quantization

```bash
python run_generation.py \
  --model meta-llama/Llama-2-7b-hf \
  --output_dir ./saved_results \
  --woq \
  --accuracy
```

## meta-llama/Llama-2-13b-hf

### SmoothQuant

```bash
python run_generation.py \
  --model meta-llama/Llama-2-13b-hf \
  --output_dir ./saved_results \
  --trust_remote_code True \
  --calib_len 1024 \
  --fallback_add \
  --calib_shuffle False \
  --tasks lambada_openai \
  --int8 --sq --accuracy \
  --batch_size 1 \
  --recipes "{'smooth_quant': True, 'smooth_quant_args': {'alpha': 'auto', 'folding': False, 'default_alpha': 0.8, 'auto_alpha_args': {'alpha_min': 0.75, 'alpha_max': 0.99, 'alpha_step': 0.01, 'shared_criterion': 'max'}}}"
```
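Compared with the 7B recipe, this one sets `shared_criterion` to `'max'` rather than `'mean'`: when several layers share one alpha, per-layer losses are reduced by that criterion before the lowest-loss alpha is chosen (our reading of the option; the sketch below is illustrative, not Neural Compressor's code). `'max'` favors the worst layer, `'mean'` the average:

```python
# Picking a shared alpha: per-layer losses for each candidate alpha are reduced
# with the shared criterion, then the alpha with the lowest reduced loss wins.

def pick_alpha(losses_per_alpha, criterion="mean"):
    """losses_per_alpha: {alpha: [loss_layer0, loss_layer1, ...]}."""
    reduce_fn = {"mean": lambda xs: sum(xs) / len(xs), "max": max}[criterion]
    return min(losses_per_alpha, key=lambda a: reduce_fn(losses_per_alpha[a]))

# Toy per-layer losses for two candidate alphas across three shared layers.
losses = {
    0.80: [0.05, 0.05, 0.70],   # best on average, one bad layer
    0.90: [0.30, 0.30, 0.35],   # worse mean, better worst case
}
print(pick_alpha(losses, "mean"))  # 0.8: lowest average loss
print(pick_alpha(losses, "max"))   # 0.9: lowest worst-case loss
```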

### Weight-Only Quantization

```bash
python run_generation.py \
  --model meta-llama/Llama-2-13b-hf \
  --output_dir ./saved_results \
  --woq \
  --accuracy
```

## meta-llama/Llama-2-70b-hf

### SmoothQuant

```bash
python run_generation.py \
  --model meta-llama/Llama-2-70b-hf \
  --output_dir ./saved_results \
  --trust_remote_code True \
  --tasks lambada_openai \
  --int8 --sq --accuracy \
  --batch_size 1 \
  --alpha 0.8
```

### Weight-Only Quantization

```bash
python run_generation.py \
  --model meta-llama/Llama-2-70b-hf \
  --output_dir ./saved_results \
  --woq \
  --accuracy
```

## tiiuae/falcon-40b

This model requires a newer `transformers` release than the pin installed above:

```bash
pip install transformers==4.33.3 # for tiiuae/falcon-40b
```

### SmoothQuant

```bash
python run_generation.py \
  --model tiiuae/falcon-40b \
  --output_dir ./saved_results \
  --trust_remote_code True \
  --tasks lambada_openai \
  --int8 --sq --accuracy \
  --batch_size 1 \
  --alpha 0.9
```

### Weight-Only Quantization

```bash
python run_generation.py \
  --model tiiuae/falcon-40b \
  --output_dir ./saved_results \
  --woq \
  --accuracy
```
