A high-throughput and memory-efficient inference and serving engine for LLMs
Easy, fast, and cost-efficient LLM serving for everyone.
Easy
Deploy the widest range of open-source models on any hardware. Includes a drop-in OpenAI-compatible API for instant integration (see the example after these highlights).
Fast
Maximize throughput with PagedAttention. Advanced scheduling and continuous batching ensure peak GPU utilization.
Cost Efficient
Slash inference costs by maximizing hardware efficiency. We make high-performance LLMs affordable and accessible to everyone.
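As a minimal sketch of the OpenAI-compatible API mentioned above, the snippet below queries a locally running vLLM server with the official `openai` Python client. The port and model name are assumptions that depend on how the server was started; by default, `vllm serve <model>` listens on port 8000.

```python
from openai import OpenAI

# Point the official OpenAI client at a local vLLM server.
# Assumes the server was started with `vllm serve <model>` on the default port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # placeholder: use whichever model the server loaded
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```

Because the endpoint mirrors the OpenAI API, existing clients and SDKs can typically be pointed at vLLM by changing only the base URL and API key.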
Quick Start
Select your preferences and run the install command. Stable is the most recently tested and supported version of vLLM; Nightly is available if you want the latest builds. A minimal install-and-run example follows the notes below.
📦 Requires Python 3.10+. Python 3.12+ recommended.
⚡ We recommend uv for faster and more reliable installation.
🔧 For other platforms, see docs.vllm.ai
🎉 See what's new in the latest release
💡 Compatible with all CUDA 12.x versions (12.0 - 12.9)
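As a quick smoke test after installation (for example via `pip install vllm` or `uv pip install vllm`), the offline inference sketch below loads a small model and generates completions for a couple of prompts. The model name is only illustrative; any supported Hugging Face model ID can be used.

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
# Sampling settings for generation; adjust to taste.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load a small model for a quick test (illustrative choice).
llm = LLM(model="facebook/opt-125m")

# Generate completions for all prompts in a single batched call.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")
```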
Sponsors
vLLM is a community project. Compute resources for development and testing are provided by the following organizations. Thank you for your support!
Cash Donations
Compute Resources
We collect donations through GitHub and OpenCollective and plan to use the funds to support the development, maintenance, and adoption of vLLM.
Universal Compatibility
One engine, endless possibilities. Run any model on any hardware.
Open Models
Latest trending open-source models, optimized & production-ready
Got questions?
We're here to help.
Whether you're just getting started or debugging a complex deployment, our community is open to everyone. No question is too basic!
Resources
Explore recipes, benchmarks, and roadmap