
Commit 31938b5

Merge pull request #3 from codewithdark-git/quant-optim-docs-tests
Refactor: Improve Quantization Suite & Benchmarking
2 parents 9f7d2f4 + d2666f3 commit 31938b5

File tree

11 files changed (+1007 additions, -307 deletions)


docs/api_reference/quantization.rst

Lines changed: 22 additions & 7 deletions
@@ -60,21 +60,23 @@ GGUF provides an efficient format with CTransformers integration:
         model=model,
         bits=4,              # Quantization bits
         group_size=32,       # Group size
-        use_packed=True      # Enable weight packing
+        use_packed=True,     # Enable weight packing
+        cpu_offload=False,   # Offload to CPU
     )

     # Quantize model
-    quantized_model = quantizer.quantize()
+    quantized_model = quantizer.quantize()  # Calibration data can optionally be passed here

     # Export to GGUF format
     quantizer.convert_to_gguf("model-q4.gguf")

 Choosing the Right Method
 ------------------------

-- **GPTQ**: Best for highest accuracy with slightly slower quantization
+- **GPTQ**: Best for highest accuracy, with slightly slower quantization. QuantLLM's GPTQ implementation computes Hessian information, which is used mainly for activation-based weight reordering when `actorder=True`. The iterative weight updates against the full Hessian inverse described in the canonical GPTQ literature may not be fully implemented in the current layer quantization step; the system logs a warning when the Hessian is computed but not used this way.
 - **AWQ**: Best balance of speed and accuracy, good for general use
-- **GGUF**: Best for deployment and inference with CTransformers
+- **GGUF**: Best for deployment and inference with CTransformers. Key parameters include:
+  - `cpu_offload: bool = False`: If True, attempts to offload parts of the computation and model data to CPU memory, reducing GPU memory usage at the cost of speed.

 Resource Requirements
 ------------------
@@ -95,8 +97,21 @@ Common Parameters
 All quantizers support these common parameters:

 - **bits**: Number of quantization bits (2-8)
-- **group_size**: Size of quantization groups
-- **calibration_data**: Data used for computing statistics
+- **group_size**: Size of quantization groups (behavior can vary; e.g. -1 selects per-tensor quantization in AWQ, while positive values set the group size for GPTQ/GGUF)
+- **calibration_data**: Data used for computing statistics (optional for some GGUF modes, recommended for the others)
+- **device**: The primary computation device ('cpu' or 'cuda') used by the quantizer
+
+Specific parameters for GPTQ:
+- **actorder**: Enables activation ordering, which can improve accuracy.
+- **use_triton**: Enables the use of Triton kernels. Note: although this flag is present, custom Triton kernels for accelerating GPTQ's core quantization algorithm (such as Hessian computation or iterative weight updates) are not currently integrated into `GPTQQuantizer`. General model optimization kernels from `quantllm.quant.kernels` may be applicable separately.
+
+Specific parameters for AWQ:
+- **zero_point**: Enables zero-point computation for activations.
+- **version**: Specifies the AWQ algorithm version.
+
+Specific parameters for GGUF:
+- **use_packed**: Enables weight packing for a smaller model size.
+- **cpu_offload**: If True, offloads parts of the computation/model to CPU, reducing GPU memory usage. Defaults to False.

 Example Workflow
 --------------
@@ -133,4 +148,4 @@ Here's a complete example of quantizing a model:
     inputs = tokenizer("Hello, world!", return_tensors="pt")
     outputs = quantized_model(**inputs)

-For more detailed examples, see the `examples/quantization_examples.py` file in the repository.
+For a detailed example, refer to the 'Example Workflow' section presented earlier in this document.
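As a quick end-to-end illustration of the GGUF workflow documented above (a sketch only: the `from quantllm.quant import GGUFQuantizer` import path, the calibration-data shape, and the model name are assumptions, not part of this diff):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from quantllm.quant import GGUFQuantizer  # assumed import path

    model_name = "facebook/opt-125m"  # placeholder; any small causal LM works for a smoke test
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Optional calibration batch (token ids); recommended for better statistics.
    calibration_data = tokenizer(
        ["Hello, world!", "Quantization reduces model size."],
        return_tensors="pt", padding=True
    ).input_ids

    quantizer = GGUFQuantizer(
        model=model,
        bits=4,              # Quantization bits (2-8)
        group_size=32,       # Group size
        use_packed=True,     # Enable weight packing
        cpu_offload=False,   # Keep computation on the GPU if one is available
    )

    quantized_model = quantizer.quantize(calibration_data)  # calibration data is optional
    quantizer.convert_to_gguf("model-q4.gguf")              # export for CTransformers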

quantllm/quant/awq.py

Lines changed: 86 additions & 72 deletions
@@ -23,6 +23,22 @@ def __init__(
         batch_size: int = 2,
         device: Optional[Union[str, torch.device]] = None
     ):
+        """
+        Initializes the AWQQuantizer.
+
+        Args:
+            model (PreTrainedModel): The model to be quantized.
+            bits (int, optional): Number of bits for quantization. Defaults to 4.
+            group_size (int, optional): Size of the quantization group. Defaults to 128.
+            zero_point (bool, optional): Whether to use zero-point quantization for activations. Defaults to True.
+            scale_dtype (str, optional): Data type for scales. Defaults to "fp32".
+            version (str, optional): AWQ algorithm version (e.g., "v1", "v2"). Defaults to "v2".
+            enable_mnn_kernel (bool, optional): Whether to enable the MNN kernel (if applicable). Defaults to False.
+            batch_size (int, optional): Batch size for calibration data processing. Defaults to 2.
+            device (Optional[Union[str, torch.device]], optional):
+                The device for quantization operations ('cpu', 'cuda', etc.).
+                Inherited from BaseQuantizer. Defaults to None (auto-detection).
+        """
         super().__init__(model=model, bits=bits, device=device)
         self.group_size = group_size
         self.zero_point = zero_point
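For reference, a minimal construction sketch based on the docstring above (illustrative only: the `from quantllm.quant.awq import AWQQuantizer` import and the calibration call follow the signatures shown in this diff, but defaults and helper behavior may differ in the released package):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from quantllm.quant.awq import AWQQuantizer  # module path as in this file

    model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder model
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

    quantizer = AWQQuantizer(
        model=model,
        bits=4,            # 2-8 bits supported
        group_size=128,    # rows per quantization group
        zero_point=True,   # asymmetric quantization with per-group zero-points
        version="v2",
        batch_size=2,      # calibration batch size
        device="cuda" if torch.cuda.is_available() else "cpu",
    )

    # Token ids used as calibration data for the activation-statistics hooks.
    calib = tokenizer(["Hello, world!"], return_tensors="pt").input_ids
    quantized_model = quantizer.quantize(calibration_data=calib)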
@@ -35,12 +51,6 @@ def __init__(
         self.act_scales = {}
         self.weight_scales = {}

-    def _clear_memory(self):
-        """Clear GPU memory and run garbage collection."""
-        if torch.cuda.is_available():
-            torch.cuda.empty_cache()
-        gc.collect()
-
     def quantize(
         self,
         calibration_data: Optional[torch.Tensor] = None,
@@ -65,15 +75,12 @@ def quantize(
                 batch = calibration_data[step:end_idx]

                 # Collect statistics for this batch
-                self._collect_activation_stats(batch)
+                self._collect_activation_stats(batch)  # Removed num_steps argument

                 # Clean up batch
                 del batch
                 self._clear_memory()

-        # Process collected statistics
-        self._process_activation_stats()
-
         # Quantize the model layer by layer
         for name, module in self.model.named_modules():
             if isinstance(module, nn.Linear):
@@ -97,8 +104,7 @@ def quantize(
         return self.model
     def _collect_activation_stats(
         self,
-        data: torch.Tensor,
-        num_steps: int
+        data: torch.Tensor  # Removed num_steps parameter
     ):
         """Collect activation statistics for each layer."""

@@ -124,61 +130,58 @@ def fn(module, input, output):
                     module.register_forward_hook(hook_fn(name))
                 )

-        # Run calibration in smaller batches
+        # Run calibration (forward pass on the provided data batch)
         with torch.no_grad():
-            batch_size = 2  # Small batch size to prevent OOM
-            for step in range(num_steps):
-                # Clear CUDA cache periodically
-                if step % 10 == 0:
-                    torch.cuda.empty_cache()
-
-                # Process a small batch
-                start_idx = (step * batch_size) % len(data)
-                end_idx = min(start_idx + batch_size, len(data))
-                batch = data[start_idx:end_idx]
-
-                # Move batch to appropriate device
-                device = next(self.model.parameters()).device
-                batch = batch.to(device)
-
-                self.model(batch)
-
-                # Move batch back to CPU to free GPU memory
-                batch = batch.cpu()
-
+            # Ensure data is on the primary device for model processing
+            data_on_device = move_to_device(data, self.device_manager.primary_device)
+            self.model(data_on_device)
+            # Data can be moved back to CPU if it's large and memory is a concern,
+            # but hooks should have already captured necessary info to CPU.
+            # For simplicity here, we assume hooks manage CPU transfer if needed.
+            # del data_on_device  # Optionally delete if memory is very tight
+
         # Remove hooks
         for handle in handles:
             handle.remove()

-        # Move model to CPU temporarily to free GPU memory
-        self.model = self.model.cpu()
-        torch.cuda.empty_cache()
+        # model is already on self.device_manager.primary_device from the quantize method's perspective
+        # or moved by prepare_calibration_data.
+        # The processing of act_scales should happen after all batches are processed.
+        # However, the current structure calls this per batch.
+        # For now, let's keep the quantile calculation here, but ideally, it would be after the main loop in `quantize`.
+        # To avoid issues with model device, let's ensure model is on CPU for this CPU-bound operation,
+        # then move it back if it was on GPU.

+        original_model_device = self.model.device  # Store original device
+        self.model = move_to_device(self.model, torch.device('cpu'))
+        self._clear_memory()
+
         # Process collected statistics on CPU
         for name in self.act_scales:
-            scales = torch.stack(self.act_scales[name])
-            # Use 99.9th percentile for more robust statistics
-            self.act_scales[name] = torch.quantile(scales, 0.999)
-
-        # Move model back to GPU
-        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-        self.model = self.model.to(device)
-
-        # Process collected statistics
-        for name in self.act_scales:
-            scales = torch.stack(self.act_scales[name])
-            # Use 99.9th percentile for more robust statistics
-            self.act_scales[name] = torch.quantile(scales, 0.999)
+            if self.act_scales[name]:  # Ensure list is not empty
+                scales_list = self.act_scales[name]
+                # If scales_list contains tensors that are not on CPU, move them.
+                # Assuming they are already on CPU due to `scale.cpu()` in hook.
+                scales_tensor = torch.stack(scales_list)
+                self.act_scales[name] = torch.quantile(scales_tensor, 0.999)
+            else:
+                # Handle cases where a layer might not have collected scales (e.g. not used in forward pass)
+                self.logger.log_warning(f"No activation scales collected for layer {name}. Using default scale of 1.0.")
+                self.act_scales[name] = torch.tensor(1.0, device='cpu')  # Default to a CPU tensor
+
+        # Restore model to its original device
+        self.model = move_to_device(self.model, original_model_device)
+        # The duplicated block of "Process collected statistics" is now removed.

     def _quantize_layer(
         self,
         layer: nn.Linear,
         act_scale: torch.Tensor
     ) -> QuantizedLinear:
         """Quantize a single layer using AWQ."""
-        device = next(layer.parameters()).device
-
-        # Initialize quantized layer
+        target_device = self.device_manager.primary_device
+
+        # Initialize quantized layer and move to target device
         quantized = QuantizedLinear(
             layer.in_features,
             layer.out_features,
@@ -193,40 +196,46 @@ def _quantize_layer(
                 format="awq"
             )
         )
-
-        # Copy bias if exists
+        quantized = move_to_device(quantized, target_device)
+
+        # Ensure layer parameters are on the target_device for computation
+        layer = move_to_device(layer, target_device)
+
+        # Copy bias if exists, ensuring it's on the target device
         if layer.bias is not None:
-            quantized.bias.data.copy_(layer.bias.data)
+            quantized.bias.data.copy_(layer.bias.data)  # Bias already on target_device due to layer move

         # Get weight matrix
-        W = layer.weight.data.clone()
+        W = layer.weight.data.clone()  # W is on target_device

-        # Scale weights by activation scale
-        W = W / act_scale.view(1, -1)
+        # Ensure act_scale is on the same device as W before division
+        act_scale_on_device = move_to_device(act_scale, W.device)
+        W = W / act_scale_on_device.view(1, -1)

         # Compute quantization scales per group
+        # All computations for scales and zero_points should happen on target_device
         if self.group_size > 0:
             n_groups = W.shape[0] // self.group_size
             W_groups = W.view(n_groups, self.group_size, -1)

-            scales = []
-            zero_points = [] if self.zero_point else None
+            scales_list = []  # Renamed from scales to scales_list
+            zero_points_list = [] if self.zero_point else None  # Renamed

             for idx in range(n_groups):
                 group = W_groups[idx]
                 max_abs = torch.max(torch.abs(group))
-                scale = (2 ** (self.bits - 1) - 1) / max_abs
-                scales.append(scale)
+                current_scale = (2 ** (self.bits - 1) - 1) / max_abs  # Renamed from scale
+                scales_list.append(current_scale)

                 if self.zero_point:
-                    zero_point = -(torch.max(group) + torch.min(group)) / 2 * scale
-                    zero_points.append(zero_point)
+                    current_zero_point = -(torch.max(group) + torch.min(group)) / 2 * current_scale  # Renamed
+                    zero_points_list.append(current_zero_point)

-            scales = torch.stack(scales)
+            scales = torch.stack(scales_list)
             if self.zero_point:
-                zero_points = torch.stack(zero_points)
+                zero_points = torch.stack(zero_points_list)
             else:
-                zero_points = torch.zeros_like(scales)
+                zero_points = torch.zeros_like(scales, device=target_device)  # Ensure on target_device
         else:
             max_abs = torch.max(torch.abs(W), dim=1)[0]
             scales = (2 ** (self.bits - 1) - 1) / max_abs
@@ -235,18 +244,23 @@ def _quantize_layer(
                 min_vals = torch.min(W, dim=1)[0]
                 zero_points = -(max_vals + min_vals) / 2 * scales
             else:
-                zero_points = torch.zeros_like(scales)
+                zero_points = torch.zeros_like(scales, device=target_device)  # Ensure on target_device

         # Quantize weights
+        # W, scales, zero_points are on target_device
         W_quant = torch.round(W * scales.view(-1, 1) - zero_points.view(-1, 1))
+        W_quant = W_quant.to(torch.int8)  # Cast to int8

         # Store quantized weights and parameters
-        quantized.weight_quantized.copy_(W_quant.to(torch.int8))
-        quantized.weight_scale.copy_(1.0 / scales)
-        quantized.weight_zero_point.copy_(zero_points)
+        # quantized module and its buffers are already on target_device
+        quantized.weight_quantized.copy_(W_quant)  # W_quant is already on target_device and int8
+        quantized.weight_scale.copy_(1.0 / scales)  # scales is on target_device
+        quantized.weight_zero_point.copy_(zero_points)  # zero_points is on target_device

         # Store additional AWQ-specific information
+        # Ensure act_scale is on the same device as the quantized layer's parameters
         if hasattr(quantized, 'act_scale'):
-            quantized.act_scale.copy_(act_scale)
+            # act_scale_on_device was already computed and is on target_device
+            quantized.act_scale.copy_(act_scale_on_device)

         return quantized
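The per-group arithmetic in `_quantize_layer` is easier to follow in isolation: a symmetric scale is derived from each group's maximum absolute value, an optional zero-point recentres the group, weights are rounded and cast to int8, and dequantization inverts the affine map. The standalone sketch below mirrors that math; the helper name is illustrative only, and it omits the activation-scale division and the `QuantizedLinear` packaging done by the real quantizer:

    import torch

    def awq_group_quantize(W: torch.Tensor, bits: int = 4, group_size: int = 128, zero_point: bool = True):
        """Per-group quantization mirroring _quantize_layer's arithmetic (activation scaling omitted)."""
        n_groups = W.shape[0] // group_size
        W_groups = W.view(n_groups, group_size, -1)

        # One scale (and optional zero-point) per group of rows.
        max_abs = W_groups.abs().amax(dim=(1, 2))
        scales = (2 ** (bits - 1) - 1) / max_abs
        if zero_point:
            zps = -(W_groups.amax(dim=(1, 2)) + W_groups.amin(dim=(1, 2))) / 2 * scales
        else:
            zps = torch.zeros_like(scales)

        # Expand the group-level values to one entry per row so they broadcast against W.
        scales_full = scales.view(-1, 1).repeat_interleave(group_size, dim=0)
        zps_full = zps.view(-1, 1).repeat_interleave(group_size, dim=0)

        W_q = torch.round(W * scales_full - zps_full).to(torch.int8)
        return W_q, scales_full, zps_full

    # Round-trip check: dequantize with W ≈ (W_q + zero_point) / scale.
    W = torch.randn(256, 64)
    W_q, scales_full, zps_full = awq_group_quantize(W, bits=4, group_size=128)
    W_hat = (W_q.float() + zps_full) / scales_full
    print("max abs reconstruction error:", (W - W_hat).abs().max().item())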
