Load → Quantize → Fine-tune → Export · Any LLM · One Line Each
Quick Start • Features • Examples • Models • Docs
| Challenge | Without QuantLLM | With QuantLLM |
|---|---|---|
| Loading 7B model | 50+ lines of config | `turbo("model")` |
| Quantization setup | Complex BitsAndBytes config | Automatic |
| Fine-tuning | LoRA config + Trainer setup | `model.finetune(data)` |
| GGUF export | Manual llama.cpp workflow | `model.export("gguf")` |
| Memory management | Manual offloading code | Built-in |
QuantLLM handles the complexity so you can focus on building.
```bash
# From GitHub (recommended)
pip install git+https://github.com/codewithdark-git/QuantLLM.git

# With all features
pip install "quantllm[full] @ git+https://github.com/codewithdark-git/QuantLLM.git"
```

```python
from quantllm import turbo

# Load with automatic 4-bit quantization, Flash Attention, optimal settings
model = turbo("meta-llama/Llama-3-8B")

# Generate text
print(model.generate("Explain quantum computing in simple terms"))
```

That's it. QuantLLM automatically:
- ✅ Detects your GPU and memory
- ✅ Chooses optimal quantization (4-bit on most GPUs)
- ✅ Enables Flash Attention 2 if available
- ✅ Configures batch size and memory management
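Under the hood, this auto-configuration boils down to a hardware probe along these lines. The snippet below is only an illustrative sketch in plain PyTorch, not QuantLLM's actual internals, and the 24 GB threshold is an assumption:

```python
import importlib.util
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    # Assumed heuristic: 4-bit quantization on smaller GPUs, 8-bit when VRAM is plentiful
    bits = 4 if vram_gb < 24 else 8
    # One reasonable availability check: is the flash_attn package importable?
    flash = importlib.util.find_spec("flash_attn") is not None
    print(f"{props.name}: {vram_gb:.0f} GB VRAM -> {bits}-bit, flash_attention={flash}")
else:
    print("No CUDA GPU detected -> CPU fallback, best suited to 1-3B models")
```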
```python
# One line - everything automatic
model = turbo("mistralai/Mistral-7B")

# Override if needed
model = turbo("Qwen/Qwen2-7B", bits=4, max_length=8192)
```
Supported out of the box: Llama 2/3, Mistral, Mixtral, Qwen/Qwen2, Phi-1/2/3, Gemma, Falcon, GPT-NeoX, StableLM, ChatGLM, Yi, DeepSeek, InternLM, Baichuan, StarCoder, BLOOM, OPT, MPT, and more.
Text generation and chat:

```python
from quantllm import turbo

model = turbo("meta-llama/Llama-3-8B")

# Simple generation
response = model.generate(
    "Write a Python function to calculate fibonacci numbers",
    max_new_tokens=200,
    temperature=0.7,
)
print(response)

# Chat format
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
]
response = model.chat(messages)
print(response)
```

Fine-tuning:

```python
from quantllm import turbo

model = turbo("mistralai/Mistral-7B")

# Simple - everything auto-configured
model.finetune("training_data.json", epochs=3)

# Advanced - full control
model.finetune(
    "training_data.json",
    epochs=5,
    learning_rate=2e-4,
    lora_r=32,
    lora_alpha=64,
    batch_size=4,
    output_dir="./fine-tuned-model",
)
```
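For reference, `lora_r` and `lora_alpha` are the standard LoRA hyperparameters. A rough equivalent written directly against the `peft` library looks like the sketch below (illustrative only, not QuantLLM's internal code; `target_modules` is an assumption that varies by architecture):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=32,                                  # rank of the LoRA update matrices (lora_r)
    lora_alpha=64,                         # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],   # assumption: attention projections only
)
peft_model = get_peft_model(base, lora_cfg)
peft_model.print_trainable_parameters()    # only the adapter weights are trainable
```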
[ {"instruction": "What is Python?", "output": "Python is a programming language..."}, {"text": "Full text for language modeling"}, {"prompt": "Question here", "completion": "Answer here"} ]from quantllm import turbo model = turbo("microsoft/phi-3-mini") # GGUF for llama.cpp / Ollama / LM Studio model.export("gguf", "phi3-q4.gguf", quantization="Q4_K_M") # GGUF quantization options: # Q2_K, Q3_K_S, Q3_K_M, Q4_0, Q4_K_S, Q4_K_M, Q5_0, Q5_K_M, Q6_K, Q8_0 # ONNX for TensorRT / ONNX Runtime model.export("onnx", "phi3.onnx") # SafeTensors for HuggingFace model.export("safetensors", "./phi3-hf/") # MLX for Apple Silicon Macs model.export("mlx", "./phi3-mlx/", quantization="4bit")from quantllm import turbo from quantllm.hub import QuantLLMHubManager # Load and fine-tune model = turbo("microsoft/phi-2") model.finetune("my_data.json", epochs=3) # Setup Hub manager manager = QuantLLMHubManager( repo_id="username/my-fine-tuned-model", hf_token="your_token", ) # Track training manager.track_hyperparameters({ "learning_rate": 0.001, "epochs": 3, "base_model": "microsoft/phi-2", }) # Save and push manager.save_final_model(model.model, format="safetensors") manager.push()QuantLLM supports 45+ model architectures out of the box:
| Category | Models |
|---|---|
| Llama Family | Llama 2, Llama 3, CodeLlama |
| Mistral Family | Mistral 7B, Mixtral 8x7B |
| Qwen Family | Qwen, Qwen2, Qwen2-MoE |
| Microsoft | Phi-1, Phi-2, Phi-3 |
| Google | Gemma, Gemma 2 |
| Falcon | Falcon 7B/40B/180B |
| Code Models | StarCoder, StarCoder2, CodeGen |
| Chinese | ChatGLM, Yi, Baichuan, InternLM |
| Other | DeepSeek, StableLM, MPT, BLOOM, OPT, GPT-NeoX |
```bash
# Basic installation
pip install git+https://github.com/codewithdark-git/QuantLLM.git

# With GGUF export support
pip install "quantllm[gguf] @ git+https://github.com/codewithdark-git/QuantLLM.git"

# With Triton kernels (Linux only)
pip install "quantllm[triton] @ git+https://github.com/codewithdark-git/QuantLLM.git"

# With Flash Attention
pip install "quantllm[flash] @ git+https://github.com/codewithdark-git/QuantLLM.git"

# Full installation (all features)
pip install "quantllm[full] @ git+https://github.com/codewithdark-git/QuantLLM.git"

# Hub lifecycle (for HuggingFace integration)
pip install git+https://github.com/codewithdark-git/huggingface-lifecycle.git
```

| Configuration | RAM | GPU VRAM | Recommended For |
|---|---|---|---|
| 🟢 CPU Only | 8GB+ | None | Testing, small models (1-3B) |
| 🔵 Entry GPU | 16GB | 6-8GB | 7B models (4-bit) |
| 🟣 Mid-Range | 32GB | 12-24GB | 13B-30B models |
| 🟠 High-End | 64GB+ | 24-80GB | 70B+ models |
- NVIDIA: RTX 3060, 3070, 3080, 3090, 4070, 4080, 4090, A100, H100
- AMD: RX 7900 XTX (with ROCm)
- Apple: M1, M2, M3 (via MLX export)
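To see which of the configurations above matches your machine, you can query system RAM and GPU VRAM directly (plain PyTorch plus the third-party `psutil` package; neither depends on QuantLLM):

```python
import psutil
import torch

print(f"System RAM: {psutil.virtual_memory().total / 1024**3:.0f} GB")

if torch.cuda.is_available():
    vram = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU VRAM:   {vram:.0f} GB ({torch.cuda.get_device_name(0)})")
else:
    print("No CUDA GPU detected - the CPU-only row applies")
```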
| Resource | Description |
|---|---|
| 📖 Examples | Working code examples |
| 📚 API Reference | Full API documentation |
| 🎓 Tutorials | Step-by-step guides |
| 🐛 Issues | Report bugs |
```
quantllm/
├── core/                      # Core functionality
│   ├── turbo_model.py         # Main TurboModel API
│   ├── smart_config.py        # Auto-configuration
│   ├── hardware.py            # Hardware detection
│   ├── compilation.py         # torch.compile integration
│   ├── flash_attention.py     # Flash Attention 2
│   ├── memory.py              # Memory optimization
│   ├── training.py            # Training utilities
│   └── export.py              # Universal exporter
├── kernels/                   # Custom kernels
│   └── triton/                # Triton fused kernels
├── quant/                     # Quantization
│   ├── gguf_converter.py      # GGUF export (45 models)
│   └── quantization_engine.py
├── hub/                       # HuggingFace integration
│   └── hf_manager.py          # Lifecycle management
└── utils/                     # Utilities
```

We welcome contributions! Here's how to get started:
```bash
# Clone the repository
git clone https://github.com/codewithdark-git/QuantLLM.git
cd QuantLLM

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black quantllm/
isort quantllm/
```

Areas where contributions are especially welcome:

- 🆕 New model architecture support
- 🔧 Performance optimizations
- 📚 Documentation improvements
- 🐛 Bug fixes
- ✨ New export formats
Coming soon! We're working on comprehensive benchmarks comparing:
- Inference speed vs vanilla transformers
- Memory usage comparisons
- Quantization quality metrics
- Export format performance
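Until those numbers are published, a rough do-it-yourself comparison is straightforward: time the same prompt through `turbo()` and through vanilla `transformers`. The sketch below is illustrative only (single prompt, no warm-up, wall-clock timing), not a rigorous benchmark:

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer
from quantllm import turbo

prompt = "Explain quantum computing in simple terms"

# QuantLLM path (auto-quantized)
qmodel = turbo("microsoft/phi-2")
start = time.perf_counter()
qmodel.generate(prompt, max_new_tokens=128)
print(f"QuantLLM:     {time.perf_counter() - start:.2f}s for 128 new tokens")

# Vanilla transformers path (full precision)
tok = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
inputs = tok(prompt, return_tensors="pt")
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=128)
print(f"Transformers: {time.perf_counter() - start:.2f}s for 128 new tokens")
```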
MIT License - see LICENSE for details.
Made with ❤️ by Dark Coder