QuantLLM is a Python library designed for developers, researchers, and teams who want to fine-tune and deploy large language models (LLMs) efficiently using 4-bit and 8-bit quantization techniques. It provides a modular and flexible framework for:
- Loading and quantizing models with advanced configurations
- LoRA / QLoRA-based fine-tuning with customizable parameters
- Dataset management with preprocessing and splitting
- Training and evaluation with comprehensive metrics
- Model checkpointing and versioning
- Hugging Face Hub integration for model sharing
The goal of QuantLLM is to democratize LLM training, especially in low-resource environments, while keeping the workflow intuitive, modular, and production-ready.
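As a taste of the workflow QuantLLM streamlines, here is a minimal sketch of 4-bit loading using the underlying Transformers and bitsandbytes stack directly (not QuantLLM's own API); the model ID and settings are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit quantization with bfloat16 compute, as popularized by QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM works
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers across available GPUs/CPU automatically
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```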
| Feature | Description | 
|---|---|
| ✅ Quantized Model Loading | Load Hugging Face models in 4-bit or 8-bit precision with various quantization techniques (AWQ, GPTQ, GGUF) and customizable settings | 
| ✅ Advanced Dataset Management | Load, preprocess, and split datasets with flexible configurations | 
| ✅ LoRA / QLoRA Fine-Tuning | Memory-efficient fine-tuning with customizable LoRA parameters (see the sketch after this table) | 
| ✅ Comprehensive Training | Advanced training loop with mixed precision, gradient accumulation, and early stopping | 
| ✅ Model Evaluation | Flexible evaluation with custom metrics and batch processing | 
| ✅ Checkpoint Management | Save, resume, and manage training checkpoints with versioning | 
| ✅ Hub Integration | Push models and checkpoints to Hugging Face Hub with authentication | 
| ✅ Configuration Management | YAML/JSON config support for reproducible experiments | 
| ✅ Logging and Monitoring | Comprehensive logging and Weights & Biases integration | 
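The LoRA / QLoRA feature builds on PEFT (acknowledged below); a minimal sketch using PEFT directly, assuming `model` is the 4-bit model from the earlier loading example:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the k-bit quantized base model for training (casts norm layers to
# fp32, enables input gradients), then attach trainable low-rank adapters.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections; model-specific
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```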
```bash
pip install quantllm
```

For detailed usage examples and API documentation, please refer to our documentation.
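QuantLLM's training loop covers mixed precision, gradient accumulation, and early stopping; a rough equivalent built on the Transformers `Trainer` looks like the sketch below, where `model`, `train_ds`, and `val_ds` are placeholders:

```python
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size of 32
    bf16=True,                       # mixed precision (use fp16=True on older GPUs)
    evaluation_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,     # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,                     # e.g. the PEFT model from above
    args=args,
    train_dataset=train_ds,          # placeholder datasets
    eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```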
**Minimum:**
- CPU: 4+ cores
- RAM: 16GB
- Storage: 20GB free space
- Python: 3.10+ (see the compatibility table below)
**Recommended:**
- GPU: NVIDIA GPU with 8GB+ VRAM
- RAM: 32GB
- Storage: 50GB+ SSD
- CUDA: 11.7+
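A quick way to check whether your machine meets the GPU recommendation, using plain PyTorch:

```python
import torch

# Report the first GPU's name and total VRAM against the 8GB+ recommendation.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    print("OK" if vram_gb >= 8 else "Below the recommended 8GB of VRAM")
else:
    print("No CUDA GPU detected; CPU-only workflows will be slow.")
```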
| Model Size | 4-bit (GPU RAM) | 8-bit (GPU RAM) | CPU RAM (min) | 
|---|---|---|---|
| 3B params | ~6GB | ~9GB | 16GB | 
| 7B params | ~12GB | ~18GB | 32GB | 
| 13B params | ~20GB | ~32GB | 64GB | 
| 70B params | ~90GB | ~140GB | 256GB | 
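These figures budget for activations, KV cache, and fine-tuning overhead, so they sit well above the weight-only floor of roughly params × bits / 8 bytes; a back-of-envelope helper (illustrative arithmetic, not a QuantLLM function):

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Weight-only footprint in GB: params * (bits / 8) bytes."""
    return params_billion * 1e9 * bits / (8 * 1024**3)

# A 7B model at 4-bit needs ~3.3 GB for weights alone; activations, KV cache,
# adapters, and optimizer state account for the rest of the ~12 GB above.
print(f"{weight_memory_gb(7, 4):.1f} GB")
```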
| QuantLLM | Python | PyTorch | Transformers | CUDA | 
|---|---|---|---|---|
| latest | ≥3.10 | ≥2.0.0 | ≥4.30.0 | ≥11.7 | 
- Multi-GPU training support
- AutoML for hyperparameter tuning
- Additional advanced quantization algorithms and techniques
- Custom model architecture support
- Enhanced logging and visualization
- Model compression techniques
- Deployment optimizations
We welcome contributions! Please see our CONTRIBUTE.md for guidelines and setup instructions.
This project is licensed under the MIT License - see the LICENSE file for details.
- Hugging Face for their amazing Transformers library
- bitsandbytes for quantization
- PEFT for parameter-efficient fine-tuning
- Weights & Biases for experiment tracking
- GitHub Issues: Create an issue
- Documentation: Read the docs
- Discord: Join our community
- Email: support@quantllm.ai