Refactor: Improve Quantization Suite & Benchmarking #3
This commit introduces a series of enhancements to the quantization library and its benchmarking capabilities.
Key improvements include:
Core Refactoring:
- `BaseQuantizer` and `DeviceManager` usage across AWQ, GPTQ, and GGUF quantizers for improved consistency and reduced code duplication.

Quantizer Enhancements & Fixes:
- `use_triton` status, Hessian matrix size, and the current utilization of the Hessian in the quantization algorithm. Ensured device consistency.
- `cpu_offload` parameter to allow for CPU offloading during quantization and GGUF file conversion, significantly aiding in low-GPU memory scenarios. Ensured robust device handling.

Benchmarking Utility:
- `QuantizationBenchmark` now provides more granular performance metrics, including detailed timings for various steps (model copy, quantizer init, quantization, inference) and peak memory usage (GB) at various stages.

Unit Tests:
- `cpu_offload`.

Documentation:
- `quantization.rst` and code docstrings to reflect all changes, new features, and clarifications (e.g., GGUF's `cpu_offload`, GPTQ's Triton/Hessian usage, new benchmark metrics).
- `__init__` docstrings to all quantizer classes.

These changes aim to make the quantization library more robust, understandable, memory-efficient (especially GGUF), and maintainable, while providing better tools for performance analysis.
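For readers who want a feel for the per-step metrics described above, here is a minimal sketch of how stage timings and peak memory could be collected. It is not the PR's `QuantizationBenchmark` implementation, which is not shown here. `StageProfiler` and `run_benchmark` are hypothetical names, the workload is a toy list of floats rather than a model, and `tracemalloc` measures Python heap usage as a CPU-only stand-in. A GPU benchmark would more likely read device counters such as `torch.cuda.max_memory_allocated()`.

```python
import time
import tracemalloc
from contextlib import contextmanager

BYTES_PER_GB = 1024 ** 3


class StageProfiler:
    """Hypothetical helper: records wall-clock time and peak memory per named stage."""

    def __init__(self):
        self.results = {}

    @contextmanager
    def stage(self, name):
        # Start a fresh trace so the peak reflects only this stage's allocations.
        tracemalloc.start()
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            _, peak_bytes = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            self.results[name] = {
                "seconds": elapsed,
                "peak_gb": peak_bytes / BYTES_PER_GB,
            }


def run_benchmark():
    """Toy workload that mirrors the four stages named in the description."""
    profiler = StageProfiler()

    with profiler.stage("model_copy"):
        weights = [float(i) for i in range(50_000)]

    with profiler.stage("quantizer_init"):
        # Symmetric int8-style scale, chosen only for illustration.
        scale = max(abs(w) for w in weights) / 127

    with profiler.stage("quantization"):
        quantized = [round(w / scale) for w in weights]

    with profiler.stage("inference"):
        # Dequantize and reduce, standing in for a forward pass.
        _ = sum(q * scale for q in quantized)

    return profiler.results


if __name__ == "__main__":
    for stage_name, metrics in run_benchmark().items():
        print(f"{stage_name}: {metrics['seconds']:.4f} s, peak {metrics['peak_gb']:.6f} GB")
```

Wrapping each stage in a context manager keeps the measurement code out of the workload itself, so adding or reordering stages does not change how metrics are recorded.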