
QuantLLM is a Python library designed for developers, researchers, and teams who want to fine-tune and deploy large language models (LLMs) efficiently using 4-bit and 8-bit quantization techniques.


🧠 QuantLLM v2.0

🚀 One Line to Rule Them All


Load → Quantize → Fine-tune → Export · Any LLM · One Line Each

Quick Start · Features · Examples · Models · Docs


🤔 Why QuantLLM?

| Challenge | Without QuantLLM | With QuantLLM |
|---|---|---|
| Loading a 7B model | 50+ lines of config | `turbo("model")` |
| Quantization setup | Complex BitsAndBytes config | Automatic |
| Fine-tuning | LoRA config + Trainer setup | `model.finetune(data)` |
| GGUF export | Manual llama.cpp workflow | `model.export("gguf")` |
| Memory management | Manual offloading code | Built-in |

QuantLLM handles the complexity so you can focus on building.


⚡ Quick Start

Installation

```bash
# From GitHub (recommended)
pip install git+https://github.com/codewithdark-git/QuantLLM.git

# With all features
pip install "quantllm[full] @ git+https://github.com/codewithdark-git/QuantLLM.git"
```

Your First Model in 3 Lines

```python
from quantllm import turbo

# Load with automatic 4-bit quantization, Flash Attention, and optimal settings
model = turbo("meta-llama/Llama-3-8B")

# Generate text
print(model.generate("Explain quantum computing in simple terms"))
```

That's it. QuantLLM automatically:

  • ✅ Detects your GPU and memory
  • ✅ Chooses optimal quantization (4-bit on most GPUs)
  • ✅ Enables Flash Attention 2 if available
  • ✅ Configures batch size and memory management
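Under the hood, that detection amounts to a handful of standard `torch.cuda` queries plus a check for optional packages. A minimal illustrative sketch, not QuantLLM's internal code:

```python
import torch

# Illustrative sketch only, not QuantLLM's actual detection logic.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    try:
        import flash_attn  # noqa: F401  (Flash Attention 2 needs this package)
        print("Flash Attention 2 available")
    except ImportError:
        print("Flash Attention 2 not installed")
else:
    print("No CUDA GPU detected, falling back to CPU")
```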

✨ Features

🎯 Ultra-Simple API

```python
from quantllm import turbo

# One line - everything automatic
model = turbo("mistralai/Mistral-7B")

# Override if needed
model = turbo("Qwen/Qwen2-7B", bits=4, max_length=8192)
```

⚡ Speed Optimizations

  • Triton Kernels - Fused dequant+matmul
  • torch.compile - Graph optimization
  • Flash Attention 2 - Fast attention
  • Weight Caching - No re-dequantization
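QuantLLM enables these automatically. For comparison, here is roughly what the same optimizations look like when wired up by hand with plain transformers; a hedged sketch that assumes flash-attn and accelerate are installed, not the library's actual code path:

```python
import torch
from transformers import AutoModelForCausalLM

# Manual equivalent of "Flash Attention 2 + torch.compile" on a vanilla model.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",                        # requires accelerate
)
model = torch.compile(model)  # graph-level optimization
```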

🧠 45+ Model Architectures

Llama 2/3, Mistral, Mixtral, Qwen/Qwen2, Phi-1/2/3, Gemma, Falcon, GPT-NeoX, StableLM, ChatGLM, Yi, DeepSeek, InternLM, Baichuan, StarCoder, BLOOM, OPT, MPT...

📦 6 Export Formats

  • GGUF - llama.cpp, Ollama, LM Studio
  • ONNX - ONNX Runtime, TensorRT
  • SafeTensors - HuggingFace
  • MLX - Apple Silicon
  • AWQ - AutoAWQ
  • PyTorch - Standard .pt

🔧 Zero-Config Smart Defaults

  • Hardware auto-detection (GPU, memory, capabilities)
  • Optimal quantization selection
  • Automatic batch size calculation
  • Memory-aware loading
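The selection step can be pictured as a simple rule over available VRAM. The thresholds below are illustrative assumptions only, not QuantLLM's actual policy:

```python
def pick_bits(vram_gb: float) -> int:
    """Toy heuristic for choosing a quantization width from free VRAM."""
    if vram_gb >= 48:
        return 16  # plenty of memory: run in half precision
    if vram_gb >= 16:
        return 8   # 8-bit fits a 7B model comfortably
    return 4       # 4-bit for entry-level GPUs

print(pick_bits(8.0))   # 4
print(pick_bits(24.0))  # 8
```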

💾 Memory Optimizations

  • Dynamic CPU ↔ GPU offloading
  • Gradient checkpointing
  • CPU optimizer states
  • Layer-wise memory tracking
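QuantLLM applies these for you. As a point of reference, the manual equivalent with transformers and accelerate looks roughly like this; a sketch that assumes accelerate is installed, not the library's internals:

```python
from transformers import AutoModelForCausalLM

# Manual version of the memory features above (sketch only).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    device_map="auto",         # accelerate decides the CPU/GPU split
    offload_folder="offload",  # spill weights to disk when RAM runs short
)
model.gradient_checkpointing_enable()  # trade compute for activation memory
```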

🎮 Usage Examples

Chat with Any Model

```python
from quantllm import turbo

model = turbo("meta-llama/Llama-3-8B")

# Simple generation
response = model.generate(
    "Write a Python function to calculate fibonacci numbers",
    max_new_tokens=200,
    temperature=0.7,
)
print(response)

# Chat format
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
]
response = model.chat(messages)
print(response)
```

Fine-Tune with Your Data

```python
from quantllm import turbo

model = turbo("mistralai/Mistral-7B")

# Simple - everything auto-configured
model.finetune("training_data.json", epochs=3)

# Advanced - full control
model.finetune(
    "training_data.json",
    epochs=5,
    learning_rate=2e-4,
    lora_r=32,
    lora_alpha=64,
    batch_size=4,
    output_dir="./fine-tuned-model",
)
```

Supported data formats:

[ {"instruction": "What is Python?", "output": "Python is a programming language..."}, {"text": "Full text for language modeling"}, {"prompt": "Question here", "completion": "Answer here"} ]

Export to Multiple Formats

```python
from quantllm import turbo

model = turbo("microsoft/phi-3-mini")

# GGUF for llama.cpp / Ollama / LM Studio
model.export("gguf", "phi3-q4.gguf", quantization="Q4_K_M")
# GGUF quantization options:
# Q2_K, Q3_K_S, Q3_K_M, Q4_0, Q4_K_S, Q4_K_M, Q5_0, Q5_K_M, Q6_K, Q8_0

# ONNX for TensorRT / ONNX Runtime
model.export("onnx", "phi3.onnx")

# SafeTensors for HuggingFace
model.export("safetensors", "./phi3-hf/")

# MLX for Apple Silicon Macs
model.export("mlx", "./phi3-mlx/", quantization="4bit")
```
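The exported GGUF file is a standard llama.cpp artifact, so it can also be loaded outside QuantLLM. For example, with the separately installed llama-cpp-python package (a usage sketch, not part of QuantLLM's API):

```python
from llama_cpp import Llama

# Load the file produced by model.export("gguf", ...) above.
llm = Llama(model_path="phi3-q4.gguf", n_ctx=4096)
out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```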

Push to HuggingFace Hub

```python
from quantllm import turbo
from quantllm.hub import QuantLLMHubManager

# Load and fine-tune
model = turbo("microsoft/phi-2")
model.finetune("my_data.json", epochs=3)

# Set up the Hub manager
manager = QuantLLMHubManager(
    repo_id="username/my-fine-tuned-model",
    hf_token="your_token",
)

# Track training
manager.track_hyperparameters({
    "learning_rate": 0.001,
    "epochs": 3,
    "base_model": "microsoft/phi-2",
})

# Save and push
manager.save_final_model(model.model, format="safetensors")
manager.push()
```
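Once pushed, the repository behaves like any other HuggingFace model, so it can be pulled back with plain transformers. A sketch, assuming the LoRA weights were merged into the base model before saving (otherwise peft is needed to load the adapter):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the fine-tuned model pushed in the previous step.
repo = "username/my-fine-tuned-model"  # placeholder repo id from above
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)
```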

🧠 Supported Models

QuantLLM supports 45+ model architectures out of the box:

| Category | Models |
|---|---|
| Llama Family | Llama 2, Llama 3, CodeLlama |
| Mistral Family | Mistral 7B, Mixtral 8x7B |
| Qwen Family | Qwen, Qwen2, Qwen2-MoE |
| Microsoft | Phi-1, Phi-2, Phi-3 |
| Google | Gemma, Gemma 2 |
| Falcon | Falcon 7B/40B/180B |
| Code Models | StarCoder, StarCoder2, CodeGen |
| Chinese | ChatGLM, Yi, Baichuan, InternLM |
| Other | DeepSeek, StableLM, MPT, BLOOM, OPT, GPT-NeoX |

📦 Installation Options

```bash
# Basic installation
pip install git+https://github.com/codewithdark-git/QuantLLM.git

# With GGUF export support
pip install "quantllm[gguf] @ git+https://github.com/codewithdark-git/QuantLLM.git"

# With Triton kernels (Linux only)
pip install "quantllm[triton] @ git+https://github.com/codewithdark-git/QuantLLM.git"

# With Flash Attention
pip install "quantllm[flash] @ git+https://github.com/codewithdark-git/QuantLLM.git"

# Full installation (all features)
pip install "quantllm[full] @ git+https://github.com/codewithdark-git/QuantLLM.git"

# Hub lifecycle (for HuggingFace integration)
pip install git+https://github.com/codewithdark-git/huggingface-lifecycle.git
```

💻 Hardware Requirements

| Configuration | RAM | GPU VRAM | Recommended For |
|---|---|---|---|
| 🟢 CPU Only | 8GB+ | None | Testing, small models (1-3B) |
| 🔵 Entry GPU | 16GB | 6-8GB | 7B models (4-bit) |
| 🟣 Mid-Range | 32GB | 12-24GB | 13B-30B models |
| 🟠 High-End | 64GB+ | 24-80GB | 70B+ models |

Tested GPUs

  • NVIDIA: RTX 3060, 3070, 3080, 3090, 4070, 4080, 4090, A100, H100
  • AMD: RX 7900 XTX (with ROCm)
  • Apple: M1, M2, M3 (via MLX export)

📚 Documentation

| Resource | Description |
|---|---|
| 📖 Examples | Working code examples |
| 📚 API Reference | Full API documentation |
| 🎓 Tutorials | Step-by-step guides |
| 🐛 Issues | Report bugs |

🏗️ Architecture

```
quantllm/
├── core/                      # Core functionality
│   ├── turbo_model.py         # Main TurboModel API
│   ├── smart_config.py        # Auto-configuration
│   ├── hardware.py            # Hardware detection
│   ├── compilation.py         # torch.compile integration
│   ├── flash_attention.py     # Flash Attention 2
│   ├── memory.py              # Memory optimization
│   ├── training.py            # Training utilities
│   └── export.py              # Universal exporter
├── kernels/                   # Custom kernels
│   └── triton/                # Triton fused kernels
├── quant/                     # Quantization
│   ├── gguf_converter.py      # GGUF export (45 models)
│   └── quantization_engine.py
├── hub/                       # HuggingFace integration
│   └── hf_manager.py          # Lifecycle management
└── utils/                     # Utilities
```

🤝 Contributing

We welcome contributions! Here's how to get started:

```bash
# Clone the repository
git clone https://github.com/codewithdark-git/QuantLLM.git
cd QuantLLM

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black quantllm/
isort quantllm/
```

Areas for Contribution

  • 🆕 New model architecture support
  • 🔧 Performance optimizations
  • 📚 Documentation improvements
  • 🐛 Bug fixes
  • ✨ New export formats

📈 Benchmarks

Coming soon! We're working on comprehensive benchmarks comparing:

  • Inference speed vs vanilla transformers
  • Memory usage comparisons
  • Quantization quality metrics
  • Export format performance

📜 License

MIT License - see LICENSE for details.

