
QuantLLM is a Python library designed for developers, researchers, and teams who want to fine-tune and deploy large language models (LLMs) efficiently using 4-bit and 8-bit quantization techniques.


🧠 QuantLLM v2.0

🚀 One Line to Rule Them All


Load → Quantize → Fine-tune → Export · Any LLM · One Line Each

Quick Start · Features · Examples · Models · Docs


🤔 Why QuantLLM?

| Challenge | Without QuantLLM | With QuantLLM |
|---|---|---|
| Loading a 7B model | 50+ lines of config | `turbo("model")` |
| Quantization setup | Complex BitsAndBytes config | Automatic |
| Fine-tuning | LoRA config + Trainer setup | `model.finetune(data)` |
| GGUF export | Manual llama.cpp workflow | `model.export("gguf")` |
| Memory management | Manual offloading code | Built-in |

QuantLLM handles the complexity so you can focus on building.


⚡ Quick Start

Installation

```bash
# From GitHub (recommended)
pip install git+https://github.com/codewithdark-git/QuantLLM.git

# With all features
pip install "quantllm[full] @ git+https://github.com/codewithdark-git/QuantLLM.git"
```

Your First Model in 3 Lines

```python
from quantllm import turbo

# Load with automatic 4-bit quantization, Flash Attention, and optimal settings
model = turbo("meta-llama/Llama-3-8B")

# Generate text
print(model.generate("Explain quantum computing in simple terms"))
```

That's it. QuantLLM automatically:

  • ✅ Detects your GPU and memory
  • ✅ Chooses optimal quantization (4-bit on most GPUs)
  • ✅ Enables Flash Attention 2 if available
  • ✅ Configures batch size and memory management
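Under the hood, that detection amounts to a handful of standard `torch.cuda` queries plus a check for optional packages. A minimal illustrative sketch, not QuantLLM's internal code:

```python
import torch

# Illustrative sketch only, not QuantLLM's actual detection logic.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    try:
        import flash_attn  # noqa: F401  (Flash Attention 2 needs this package)
        print("Flash Attention 2 available")
    except ImportError:
        print("Flash Attention 2 not installed")
else:
    print("No CUDA GPU detected, falling back to CPU")
```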

✨ Features

🎯 Ultra-Simple API

```python
from quantllm import turbo

# One line - everything automatic
model = turbo("mistralai/Mistral-7B")

# Override if needed
model = turbo("Qwen/Qwen2-7B", bits=4, max_length=8192)
```

⚡ Speed Optimizations

  • Triton Kernels - Fused dequant+matmul
  • torch.compile - Graph optimization
  • Flash Attention 2 - Fast attention
  • Weight Caching - No re-dequantization
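QuantLLM enables these automatically. For comparison, here is roughly what the same optimizations look like when wired up by hand with plain transformers; a hedged sketch that assumes flash-attn and accelerate are installed, not the library's actual code path:

```python
import torch
from transformers import AutoModelForCausalLM

# Manual equivalent of "Flash Attention 2 + torch.compile" on a vanilla model.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",                        # requires accelerate
)
model = torch.compile(model)  # graph-level optimization
```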

🧠 45+ Model Architectures

Llama 2/3, Mistral, Mixtral, Qwen/Qwen2, Phi-1/2/3, Gemma, Falcon, GPT-NeoX, StableLM, ChatGLM, Yi, DeepSeek, InternLM, Baichuan, StarCoder, BLOOM, OPT, MPT...

📦 6 Export Formats

  • GGUF - llama.cpp, Ollama, LM Studio
  • ONNX - ONNX Runtime, TensorRT
  • SafeTensors - HuggingFace
  • MLX - Apple Silicon
  • AWQ - AutoAWQ
  • PyTorch - Standard .pt

🔧 Zero-Config Smart Defaults

  • Hardware auto-detection (GPU, memory, capabilities)
  • Optimal quantization selection
  • Automatic batch size calculation
  • Memory-aware loading
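The selection step can be pictured as a simple rule over available VRAM. The thresholds below are illustrative assumptions only, not QuantLLM's actual policy:

```python
def pick_bits(vram_gb: float) -> int:
    """Toy heuristic for choosing a quantization width from free VRAM."""
    if vram_gb >= 48:
        return 16  # plenty of memory: run in half precision
    if vram_gb >= 16:
        return 8   # 8-bit fits a 7B model comfortably
    return 4       # 4-bit for entry-level GPUs

print(pick_bits(8.0))   # 4
print(pick_bits(24.0))  # 8
```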

💾 Memory Optimizations

  • Dynamic CPU ↔ GPU offloading
  • Gradient checkpointing
  • CPU optimizer states
  • Layer-wise memory tracking
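QuantLLM applies these for you. As a point of reference, the manual equivalent with transformers and accelerate looks roughly like this; a sketch that assumes accelerate is installed, not the library's internals:

```python
from transformers import AutoModelForCausalLM

# Manual version of the memory features above (sketch only).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    device_map="auto",         # accelerate decides the CPU/GPU split
    offload_folder="offload",  # spill weights to disk when RAM runs short
)
model.gradient_checkpointing_enable()  # trade compute for activation memory
```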

🎮 Usage Examples

Chat with Any Model

```python
from quantllm import turbo

model = turbo("meta-llama/Llama-3-8B")

# Simple generation
response = model.generate(
    "Write a Python function to calculate fibonacci numbers",
    max_new_tokens=200,
    temperature=0.7,
)
print(response)

# Chat format
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
]
response = model.chat(messages)
print(response)
```

Fine-Tune with Your Data

```python
from quantllm import turbo

model = turbo("mistralai/Mistral-7B")

# Simple - everything auto-configured
model.finetune("training_data.json", epochs=3)

# Advanced - full control
model.finetune(
    "training_data.json",
    epochs=5,
    learning_rate=2e-4,
    lora_r=32,
    lora_alpha=64,
    batch_size=4,
    output_dir="./fine-tuned-model",
)
```

Supported data formats:

[ {"instruction": "What is Python?", "output": "Python is a programming language..."}, {"text": "Full text for language modeling"}, {"prompt": "Question here", "completion": "Answer here"} ]

Export to Multiple Formats

```python
from quantllm import turbo

model = turbo("microsoft/phi-3-mini")

# GGUF for llama.cpp / Ollama / LM Studio
model.export("gguf", "phi3-q4.gguf", quantization="Q4_K_M")
# GGUF quantization options:
# Q2_K, Q3_K_S, Q3_K_M, Q4_0, Q4_K_S, Q4_K_M, Q5_0, Q5_K_M, Q6_K, Q8_0

# ONNX for TensorRT / ONNX Runtime
model.export("onnx", "phi3.onnx")

# SafeTensors for HuggingFace
model.export("safetensors", "./phi3-hf/")

# MLX for Apple Silicon Macs
model.export("mlx", "./phi3-mlx/", quantization="4bit")
```
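The exported GGUF file is a standard llama.cpp artifact, so it can also be loaded outside QuantLLM. For example, with the separately installed llama-cpp-python package (a usage sketch, not part of QuantLLM's API):

```python
from llama_cpp import Llama

# Load the file produced by model.export("gguf", ...) above.
llm = Llama(model_path="phi3-q4.gguf", n_ctx=4096)
out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```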

Push to HuggingFace Hub

```python
from quantllm import turbo
from quantllm.hub import QuantLLMHubManager

# Load and fine-tune
model = turbo("microsoft/phi-2")
model.finetune("my_data.json", epochs=3)

# Set up the Hub manager
manager = QuantLLMHubManager(
    repo_id="username/my-fine-tuned-model",
    hf_token="your_token",
)

# Track training
manager.track_hyperparameters({
    "learning_rate": 0.001,
    "epochs": 3,
    "base_model": "microsoft/phi-2",
})

# Save and push
manager.save_final_model(model.model, format="safetensors")
manager.push()
```
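Once pushed, the repository behaves like any other HuggingFace model, so it can be pulled back with plain transformers. A sketch, assuming the LoRA weights were merged into the base model before saving (otherwise peft is needed to load the adapter):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the fine-tuned model pushed in the previous step.
repo = "username/my-fine-tuned-model"  # placeholder repo id from above
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)
```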

🧠 Supported Models

QuantLLM supports 45+ model architectures out of the box:

| Category | Models |
|---|---|
| Llama Family | Llama 2, Llama 3, CodeLlama |
| Mistral Family | Mistral 7B, Mixtral 8x7B |
| Qwen Family | Qwen, Qwen2, Qwen2-MoE |
| Microsoft | Phi-1, Phi-2, Phi-3 |
| Google | Gemma, Gemma 2 |
| Falcon | Falcon 7B/40B/180B |
| Code Models | StarCoder, StarCoder2, CodeGen |
| Chinese | ChatGLM, Yi, Baichuan, InternLM |
| Other | DeepSeek, StableLM, MPT, BLOOM, OPT, GPT-NeoX |

📦 Installation Options

```bash
# Basic installation
pip install git+https://github.com/codewithdark-git/QuantLLM.git

# With GGUF export support
pip install "quantllm[gguf] @ git+https://github.com/codewithdark-git/QuantLLM.git"

# With Triton kernels (Linux only)
pip install "quantllm[triton] @ git+https://github.com/codewithdark-git/QuantLLM.git"

# With Flash Attention
pip install "quantllm[flash] @ git+https://github.com/codewithdark-git/QuantLLM.git"

# Full installation (all features)
pip install "quantllm[full] @ git+https://github.com/codewithdark-git/QuantLLM.git"

# Hub lifecycle (for HuggingFace integration)
pip install git+https://github.com/codewithdark-git/huggingface-lifecycle.git
```

💻 Hardware Requirements

| Configuration | RAM | GPU VRAM | Recommended For |
|---|---|---|---|
| 🟢 CPU Only | 8GB+ | None | Testing, small models (1-3B) |
| 🔵 Entry GPU | 16GB | 6-8GB | 7B models (4-bit) |
| 🟣 Mid-Range | 32GB | 12-24GB | 13B-30B models |
| 🟠 High-End | 64GB+ | 24-80GB | 70B+ models |

Tested GPUs

  • NVIDIA: RTX 3060, 3070, 3080, 3090, 4070, 4080, 4090, A100, H100
  • AMD: RX 7900 XTX (with ROCm)
  • Apple: M1, M2, M3 (via MLX export)

📚 Documentation

| Resource | Description |
|---|---|
| 📖 Examples | Working code examples |
| 📚 API Reference | Full API documentation |
| 🎓 Tutorials | Step-by-step guides |
| 🐛 Issues | Report bugs |

🏗️ Architecture

```
quantllm/
├── core/                      # Core functionality
│   ├── turbo_model.py         # Main TurboModel API
│   ├── smart_config.py        # Auto-configuration
│   ├── hardware.py            # Hardware detection
│   ├── compilation.py         # torch.compile integration
│   ├── flash_attention.py     # Flash Attention 2
│   ├── memory.py              # Memory optimization
│   ├── training.py            # Training utilities
│   └── export.py              # Universal exporter
├── kernels/                   # Custom kernels
│   └── triton/                # Triton fused kernels
├── quant/                     # Quantization
│   ├── gguf_converter.py      # GGUF export (45 models)
│   └── quantization_engine.py
├── hub/                       # HuggingFace integration
│   └── hf_manager.py          # Lifecycle management
└── utils/                     # Utilities
```

🤝 Contributing

We welcome contributions! Here's how to get started:

```bash
# Clone the repository
git clone https://github.com/codewithdark-git/QuantLLM.git
cd QuantLLM

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black quantllm/
isort quantllm/
```

Areas for Contribution

  • 🆕 New model architecture support
  • 🔧 Performance optimizations
  • 📚 Documentation improvements
  • 🐛 Bug fixes
  • ✨ New export formats

📈 Benchmarks

Coming soon! We're working on comprehensive benchmarks comparing:

  • Inference speed vs vanilla transformers
  • Memory usage comparisons
  • Quantization quality metrics
  • Export format performance

📜 License

MIT License - see LICENSE for details.

