docs/api_reference/quantization.rst
GGUF provides an efficient format with CTransformers integration:

.. code-block:: python

        model=model,
        bits=4,              # Quantization bits
        group_size=32,       # Group size
        use_packed=True,     # Enable weight packing
        cpu_offload=False,   # Offload to CPU
    )

    # Quantize model
    quantized_model = quantizer.quantize()  # Calibration data can be optionally passed here

    # Export to GGUF format
    quantizer.convert_to_gguf("model-q4.gguf")
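
The exported file can then be loaded with CTransformers for inference. The following is a minimal sketch, assuming the exported ``model-q4.gguf`` uses a standard GGUF layout and a LLaMA-style base architecture (the ``model_type`` value below is an assumption):

.. code-block:: python

    from ctransformers import AutoModelForCausalLM

    # Load the file produced by convert_to_gguf(); model_type must match the
    # base architecture (assumed to be LLaMA-style here).
    llm = AutoModelForCausalLM.from_pretrained("model-q4.gguf", model_type="llama")

    # Quick generation to sanity-check the quantized model.
    print(llm("Quantization reduces model size by", max_new_tokens=32))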

Choosing the Right Method
-------------------------

- **GPTQ**: Best for highest accuracy, with slightly slower quantization. GPTQ in QuantLLM computes Hessian information, which is used primarily for activation-based weight reordering when ``actorder=True`` (a library-independent sketch of this reordering follows the list). The fully iterative weight updates with the Hessian inverse described in the canonical GPTQ literature may not be implemented in the current layer quantization step; a warning is logged when the Hessian is computed but not used in this way.
- **AWQ**: Best balance of speed and accuracy, good for general use
- **GGUF**: Best for deployment and inference with CTransformers. Key parameters include:

  - ``cpu_offload: bool = False``: If True, attempts to offload parts of the computation and model data to CPU memory, reducing GPU memory usage at the cost of speed.
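
To make the GPTQ note above concrete, here is a minimal, library-independent sketch of activation-based weight reordering: columns are processed in order of decreasing Hessian diagonal, which is what ``actorder=True`` refers to. The tensor names and shapes are illustrative and not part of QuantLLM's API:

.. code-block:: python

    import torch

    def actorder_permutation(x: torch.Tensor) -> torch.Tensor:
        """Column order for activation-ordered quantization.

        x holds calibration activations of shape (n_samples, in_features).
        The GPTQ Hessian is proportional to x.T @ x; only its diagonal is
        needed to decide the quantization order.
        """
        hessian_diag = (x * x).sum(dim=0)           # diag(x.T @ x)
        return torch.argsort(hessian_diag, descending=True)

    x = torch.randn(128, 64)          # calibration activations (illustrative)
    w = torch.randn(32, 64)           # layer weight (out_features, in_features)

    perm = actorder_permutation(x)
    w_reordered = w[:, perm]          # quantize columns in this order
    inv_perm = torch.argsort(perm)    # restores the original layout afterwards
    assert torch.equal(w_reordered[:, inv_perm], w)
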
Resource Requirements
---------------------

Common Parameters
-----------------

All quantizers support these common parameters:

- **bits**: Number of quantization bits (2-8)
- **group_size**: Size of quantization groups (behavior can vary; e.g., -1 for per-tensor in AWQ, specific positive values for GPTQ/GGUF grouping); see the sketch after these lists for what grouping means in practice
- **calibration_data**: Data used for computing statistics (optional for some GGUF modes, but recommended for others)
- **device**: Specifies the primary computation device (``'cpu'`` or ``'cuda'``) for the quantizer.
- **use_triton**: Enables the use of Triton kernels. Note that while this flag is present, custom Triton kernels specifically for accelerating GPTQ's core quantization algorithm (such as Hessian computation or iterative weight updates) are not currently integrated into ``GPTQQuantizer``; general model optimization kernels from ``quantllm.quant.kernels`` may be applicable separately.

Specific parameters for AWQ:

- **zero_point**: Enables zero-point computation for activations.
- **version**: Specifies the AWQ algorithm version.

Specific parameters for GGUF:

- **use_packed**: Enables weight packing for smaller model size.
- **cpu_offload**: If True, offloads parts of computation/model to CPU, reducing GPU memory. Defaults to False.
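
For intuition about how ``bits`` and ``group_size`` interact, here is a minimal, library-independent sketch of symmetric group-wise weight quantization; it illustrates the idea and is not QuantLLM's internal implementation:

.. code-block:: python

    import torch

    def groupwise_quantize(w: torch.Tensor, bits: int = 4, group_size: int = 32):
        """Symmetric per-group quantization of a 2-D weight tensor.

        Each row is split into groups of `group_size` columns and every
        group gets its own scale, so smaller groups track local weight
        ranges more closely at the cost of storing more scales.
        """
        out_features, in_features = w.shape
        assert in_features % group_size == 0
        qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit symmetric

        groups = w.reshape(out_features, in_features // group_size, group_size)
        scales = groups.abs().amax(dim=-1, keepdim=True) / qmax
        scales = scales.clamp(min=1e-8)            # avoid division by zero
        q = torch.clamp(torch.round(groups / scales), -qmax - 1, qmax)
        return q.to(torch.int8), scales

    w = torch.randn(16, 64)
    q, scales = groupwise_quantize(w, bits=4, group_size=32)
    w_hat = (q.float() * scales).reshape_as(w)     # dequantized approximation
    print((w - w_hat).abs().max())                 # worst-case quantization error
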
Example Workflow
----------------

Here's a complete example of quantizing a model:
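
A minimal sketch of such a workflow, assuming ``GPTQQuantizer`` can be imported from ``quantllm`` and accepts the common parameters listed above (the import path and exact constructor signature are assumptions):

.. code-block:: python

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # NOTE: the import path below is an assumption made for illustration.
    from quantllm import GPTQQuantizer

    model_name = "facebook/opt-125m"
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # A handful of tokenized sentences serves as calibration data.
    texts = ["Quantization trades a little accuracy for a lot of memory."]
    calibration_data = [tokenizer(t, return_tensors="pt").input_ids for t in texts]

    quantizer = GPTQQuantizer(
        model=model,
        bits=4,                              # 2-8 supported
        group_size=128,                      # per-group scaling
        calibration_data=calibration_data,   # statistics for quantization
        device="cuda" if torch.cuda.is_available() else "cpu",
    )
    quantized_model = quantizer.quantize()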