Refactor: Improve Quantization Suite & Benchmarking
This commit introduces a series of enhancements to the quantization library and its benchmarking capabilities. Key improvements include:

1. **Core Refactoring:**
   * Standardized `BaseQuantizer` and `DeviceManager` usage across the AWQ, GPTQ, and GGUF quantizers for improved consistency and reduced code duplication.

2. **Quantizer Enhancements & Fixes:**
   * **AWQ:** Fixed bugs in activation-statistics collection, removed redundant code, and ensured robust device handling.
   * **GPTQ:** Added extensive logging to clarify `use_triton` status, Hessian matrix size, and how the Hessian is currently used in the quantization algorithm. Ensured device consistency.
   * **GGUF:** Fully integrated a `cpu_offload` parameter to allow CPU offloading during quantization and GGUF file conversion, which helps significantly in low-GPU-memory scenarios. Ensured robust device handling.

3. **Benchmarking Utility:**
   * `QuantizationBenchmark` now provides more granular performance metrics, including detailed timings for individual steps (model copy, quantizer init, quantization, inference) and peak memory usage (GB) at various stages.

4. **Unit Tests:**
   * Added a comprehensive suite of unit tests for the AWQ, GPTQ, and GGUF quantizers. Tests cover various parameters (bits, group_size, method-specific options), CPU/GPU execution, output consistency, and features such as GGUF conversion and `cpu_offload`.

5. **Documentation:**
   * Updated the API reference (`quantization.rst`) and code docstrings to reflect all changes, new features, and clarifications (e.g., GGUF's `cpu_offload`, GPTQ's Triton/Hessian usage, new benchmark metrics).
   * Added missing `__init__` docstrings to all quantizer classes.
   * Resolved a dangling reference to an example file in the documentation.

These changes aim to make the quantization library more robust, understandable, memory-efficient (especially for GGUF), and maintainable, while providing better tools for performance analysis.
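The description above mentions per-step timings and peak-memory figures without showing how such numbers are gathered. The following is a minimal sketch of that kind of instrumentation using only standard PyTorch utilities; it is an illustration, not the actual `QuantizationBenchmark` implementation, and the `run_quantizer` / `run_inference` callables are hypothetical placeholders rather than QuantLLM functions.

```python
# Hedged sketch: per-stage wall-clock timing and peak CUDA memory (GB),
# similar in spirit to the granular metrics described above.
import copy
import time

import torch


def measure_stage(label, fn, metrics):
    """Run fn, recording its wall-clock time and peak CUDA memory (GB)."""
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    result = fn()
    metrics[f"{label}_time_s"] = time.perf_counter() - start
    if torch.cuda.is_available():
        metrics[f"{label}_peak_mem_gb"] = torch.cuda.max_memory_allocated() / 1e9
    return result


def benchmark_quantization(model, run_quantizer, run_inference):
    """Collect per-stage timings and memory; run_quantizer/run_inference are placeholders."""
    metrics = {}
    model_copy = measure_stage("model_copy", lambda: copy.deepcopy(model), metrics)
    quantized = measure_stage("quantization", lambda: run_quantizer(model_copy), metrics)
    measure_stage("inference", lambda: run_inference(quantized), metrics)
    return metrics
```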
docs/api_reference/quantization.rst (22 additions, 7 deletions)
@@ -60,21 +60,23 @@ GGUF provides an efficient format with CTransformers integration:
     model=model,
     bits=4,  # Quantization bits
     group_size=32,  # Group size
-    use_packed=True  # Enable weight packing
+    use_packed=True,  # Enable weight packing
+    cpu_offload=False,  # Offload to CPU
 )

 # Quantize model
-quantized_model = quantizer.quantize()
+quantized_model = quantizer.quantize()  # Calibration data can be optionally passed here

 # Export to GGUF format
 quantizer.convert_to_gguf("model-q4.gguf")

 Choosing the Right Method
 ------------------------

-- **GPTQ**: Best for highest accuracy with slightly slower quantization
+- **GPTQ**: Best for highest accuracy with slightly slower quantization. The GPTQ method in QuantLLM involves computing Hessian matrix information. This information is primarily used for activation-based weight reordering when `actorder=True`. Users should note that the detailed iterative weight updates using the full Hessian inverse, as found in some canonical GPTQ literature, may not be fully implemented in the current layer quantization step. The system logs warnings if the Hessian is computed but not fully utilized in this manner.
 - **AWQ**: Best balance of speed and accuracy, good for general use
-- **GGUF**: Best for deployment and inference with CTransformers
+- **GGUF**: Best for deployment and inference with CTransformers. Key parameters include:
+  - `cpu_offload: bool = False`: If True, attempts to offload parts of the computation and model data to CPU memory, reducing GPU memory usage at the cost of speed. Defaults to False.

 Resource Requirements
 ------------------
@@ -95,8 +97,21 @@ Common Parameters
 All quantizers support these common parameters:

 - **bits**: Number of quantization bits (2-8)
-- **group_size**: Size of quantization groups
-- **calibration_data**: Data used for computing statistics
+- **group_size**: Size of quantization groups (behavior can vary; e.g., -1 for per-tensor in AWQ, specific positive values for GPTQ/GGUF grouping)
+- **calibration_data**: Data used for computing statistics (optional for some GGUF modes, but recommended for others)
+- **device**: Specifies the primary computation device ('cpu' or 'cuda') for the quantizer.
+- **use_triton**: Enables the use of Triton kernels. Note: While this flag is present, custom Triton kernels specifically for accelerating GPTQ's core quantization algorithm (like Hessian computation or iterative weight updates) are not currently integrated into `GPTQQuantizer`. General model optimization kernels from `quantllm.quant.kernels` might be applicable separately.
+
+Specific parameters for AWQ:
+- **zero_point**: Enables zero-point computation for activations.
+- **version**: Specifies the AWQ algorithm version.
+
+Specific parameters for GGUF:
+- **use_packed**: Enables weight packing for smaller model size.
+- **cpu_offload**: If True, offloads parts of computation/model to CPU, reducing GPU memory. Defaults to False.

 Example Workflow
 --------------
@@ -133,4 +148,4 @@ Here's a complete example of quantizing a model:
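For readers unfamiliar with the Hessian usage referenced in the GPTQ notes above, the sketch below shows the standard GPTQ-style bookkeeping: accumulating the Hessian `H = 2 * sum(X X^T)` from calibration activations and deriving the `actorder` permutation from its diagonal. This illustrates the general technique only; it is not taken from QuantLLM's `GPTQQuantizer`.

```python
# Hedged sketch of GPTQ-style Hessian accumulation and actorder reordering.
import torch


def accumulate_hessian(hessian, activations):
    """activations: (n_samples, in_features) inputs seen by the layer."""
    x = activations.float()
    return hessian + 2.0 * (x.T @ x)  # shape: (in_features, in_features)


def actorder_permutation(hessian):
    """Columns with larger diag(H) (more activation energy) are quantized first."""
    return torch.argsort(torch.diag(hessian), descending=True)


# Usage: reorder weight columns before quantization, then undo the permutation.
in_features, out_features = 64, 32
H = torch.zeros(in_features, in_features)
H = accumulate_hessian(H, torch.randn(128, in_features))
perm = actorder_permutation(H)

W = torch.randn(out_features, in_features)
W_reordered = W[:, perm]                   # quantize columns in this order
inverse_perm = torch.argsort(perm)
W_restored = W_reordered[:, inverse_perm]  # restores the original column order
```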
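Similarly, the `cpu_offload` behavior documented for GGUF can be pictured as layer-by-layer movement between CPU and GPU: only the layer currently being processed occupies GPU memory. The sketch below uses only PyTorch primitives and assumes a hypothetical `quantize_layer` callback; it is not QuantLLM's actual offload routine.

```python
# Hedged sketch of the cpu_offload idea: trade speed for peak GPU memory by
# moving one layer at a time onto the GPU during quantization.
import torch


def quantize_with_cpu_offload(model, quantize_layer, device="cuda"):
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            module.to(device)             # bring a single layer onto the GPU
            quantize_layer(name, module)  # quantize it in place (placeholder callback)
            module.to("cpu")              # release GPU memory before the next layer
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
    return model
```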