vLLM

The high-throughput and memory-efficient inference and serving engine for LLMs

Easy, fast, and cost-efficient LLM serving for everyone.

Documentation

Easy

Deploy the widest range of open-source models on any hardware. Includes a drop-in OpenAI-compatible API for instant integration.
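Because the server exposes an OpenAI-compatible endpoint, the official openai Python client works unchanged. A minimal sketch, assuming a local server started with vllm serve Qwen/Qwen2.5-1.5B-Instruct (the model name and default port 8000 are illustrative):

from openai import OpenAI

# Point the client at the local vLLM server's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # no real key is needed for a local server
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # must match the model the server loaded
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
)
print(response.choices[0].message.content)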

Fast

Maximize throughput with PagedAttention. Advanced scheduling and continuous batching ensure peak GPU utilization.
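As an illustration of the same engine in offline mode, the sketch below batches a handful of prompts through vLLM's Python API; the model name is an assumption, and scheduling, batching, and KV-cache paging are handled automatically:

from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "The fastest land animal is",
    "vLLM stands for",
]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

# The model name here is illustrative; any supported Hugging Face model works.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
outputs = llm.generate(prompts, sampling_params)  # prompts are batched together

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)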

Cost Efficient

Slash inference costs by maximizing hardware efficiency. We make high-performance LLMs affordable and accessible to everyone.

Quick Start

Select your preferences and run the install command. Stable is the most recently tested and supported version of vLLM; Nightly provides the latest builds if you want cutting-edge changes.

📦 Requires Python 3.10+. Python 3.12+ recommended.

⚡ We recommend uv for faster and more reliable installation.

🔧 For other platforms, see docs.vllm.ai

🎉 See what's new in the latest release

Build: Stable | Nightly
Platform: CUDA | ROCm | XPU | CPU
Package: Python (uv) | Python | Docker
CUDA Version: CUDA 12.9 | CUDA 13.0

Run this command:

uv pip install vllm

💡 Compatible with all CUDA 12.x versions (12.0 - 12.9)
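Once installed, a quick sanity check is to import the package and print its version (assuming the install completed without errors):

import vllm

print(vllm.__version__)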

Sponsors

vLLM is a community project. Our compute resources for development and testing are supported by the following organizations. Thank you for your support!

Cash Donations

a16z
Sequoia Capital
Skywork AI
ZhenFund

Compute Resources

Alibaba Cloud
AMD
Anyscale
AWS
Crusoe Cloud
Google Cloud
IBM
Intel
Lambda Lab
Nebius
Novita AI
NVIDIA
Red Hat
Roblox
RunPod
UC Berkeley
Volcengine

We collect donations through GitHub and OpenCollective. The funds are used to support the development, maintenance, and adoption of vLLM.

Everyone is welcome!

Got questions? We're here to help.

Whether you're just getting started or debugging a complex deployment, our community is open to everyone. No question is too basic!

Fast & friendly responses
Active maintainers