
Commit a60f29d

vLLM AI inference serving example
1 parent 0598f07 commit a60f29d

File tree

3 files changed: +215 -0 lines changed

ai/vllm-deployment/README.md

Lines changed: 145 additions & 0 deletions
# AI Inference with vLLM on Kubernetes

## Purpose / What You'll Learn

This example demonstrates how to deploy a server for AI inference using [vLLM](https://docs.vllm.ai/en/latest/) on Kubernetes. You’ll learn how to:

- Set up a vLLM inference server with a model downloaded from [Hugging Face](https://huggingface.co/).
- Expose the inference endpoint using a Kubernetes `Service`.
- Set up port forwarding from your local machine to the inference `Service` in the Kubernetes cluster.
- Send a sample prediction request to the server using `curl`.

---
## 📚 Table of Contents

- [Prerequisites](#prerequisites)
- [Detailed Steps & Explanation](#detailed-steps--explanation)
- [Verification / Seeing it Work](#verification--seeing-it-work)
- [Configuration Customization](#configuration-customization)
- [Cleanup](#cleanup)
- [Further Reading / Next Steps](#further-reading--next-steps)

---
## Prerequisites

- A Kubernetes cluster with access to Nvidia GPUs (tested with a GKE Autopilot cluster, v1.32).
- A Hugging Face account token with permission to access the model (example model: `google/gemma-3-1b-it`).
- `kubectl` configured to communicate with the cluster and available in your `PATH`.
- The `curl` binary in your `PATH`.
#### Cloud Provider Prerequisites

##### Google Kubernetes Engine (GKE)
* Uncomment the GKE-specific `label`(s) and `nodeSelector`(s) in `vllm-deployment.yaml`.
##### Elastic Kubernetes Service (EKS)
* **GPU-Enabled Nodes**: Your cluster must have a node group with GPU-enabled EC2 instances (e.g., instances from the `p` or `g` families, such as `p3.2xlarge` or `g4dn.xlarge`).
* **Nvidia Device Plugin**: You must have the Nvidia device plugin for Kubernetes installed in your cluster. This is typically deployed as a DaemonSet and is responsible for exposing the `nvidia.com/gpu` resource from the nodes to the Kubernetes scheduler. Without this plugin, Kubernetes won't be aware of the GPUs, and your pods will fail to schedule.
##### Azure Kubernetes Service (AKS)
* **GPU-Enabled Node Pool**: Your cluster must have a node pool that uses GPU-enabled virtual machines. In Azure, these are typically from the NC, ND, or NV series (e.g., `Standard_NC6s_v3`).
* **Nvidia Device Plugin**: The Nvidia device plugin for Kubernetes must be installed on your cluster. This plugin is usually deployed as a DaemonSet and is responsible for discovering the GPUs on each node and exposing the `nvidia.com/gpu` resource to the Kubernetes scheduler. Without this plugin, Kubernetes will not be aware of the GPUs, and any pod requesting them will remain in a `Pending` state. A minimal install-and-verify sketch (applicable to both EKS and AKS) follows below.
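
If your cluster does not already have the device plugin, the sketch below shows one common approach using the upstream NVIDIA `k8s-device-plugin` DaemonSet. The manifest URL and version tag are examples only; check the project's releases for the current manifest path and version.

```bash
# Install the Nvidia device plugin DaemonSet (example version tag; check the
# NVIDIA/k8s-device-plugin releases for the current one and its manifest path)
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml

# Confirm the GPU nodes now advertise the nvidia.com/gpu resource
kubectl describe nodes | grep "nvidia.com/gpu"
```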
---
## Detailed Steps & Explanation

1. Store your Hugging Face token in a Kubernetes `Secret` so the deployment can download the model:

```bash
# The env var HF_TOKEN contains your Hugging Face account token
kubectl create secret generic hf-secret \
  --from-literal=hf_token=$HF_TOKEN
```
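
To confirm the `Secret` exists before moving on, you can optionally run:

```bash
kubectl get secret hf-secret
```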

2. Apply the vLLM deployment:

```bash
kubectl apply -f vllm-deployment.yaml
```

- Wait for the deployment to become available and watch the vLLM pod(s) come up (if a pod stays in `Pending`, see the troubleshooting note after the expected output below):

```bash
kubectl wait --for=condition=Available --timeout=900s deployment/vllm-gemma-deployment
kubectl get pods -l app=gemma-server -w
```

- View the vLLM pod logs:

```bash
kubectl logs -f -l app=gemma-server
```

Expected output:

```
INFO: Automatically detected platform cuda.
...
INFO [launcher.py:34] Route: /v1/chat/completions, Methods: POST
...
INFO: Started server process [13]
INFO: Waiting for application startup.
INFO: Application startup complete.
Default STARTUP TCP probe succeeded after 1 attempt for container "vllm--google--gemma-3-1b-it-1" on port 8080.
...
```
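
If a pod stays in `Pending` (commonly because no node exposes the `nvidia.com/gpu` resource), inspecting its scheduling events usually shows why:

```bash
# Show events for the vLLM pod(s), including scheduling failures
kubectl describe pod -l app=gemma-server
```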

3. Create the service:

```bash
# ClusterIP service on port 8080 in front of the vLLM deployment
kubectl apply -f vllm-service.yaml
```
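
Optionally, confirm the `Service` is backed by the running pod once it is ready:

```bash
# The ENDPOINTS column should list the pod IP on port 8080
kubectl get endpoints vllm-service
```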

## Verification / Seeing it Work

1. Forward local requests to the vLLM service (this command blocks, so run it in a separate terminal):

```bash
# Forward a local port (e.g., 8080) to the service port (e.g., 8080)
kubectl port-forward service/vllm-service 8080:8080
```
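
Before sending a full chat request, you can optionally check that the server is up; the OpenAI-compatible `/v1/models` endpoint should list the served model:

```bash
curl http://localhost:8080/v1/models
```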

2. Send a request to the locally forwarded port:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Explain Quantum Computing in simple terms."}],
    "max_tokens": 100
  }'
```

Expected output (or similar):

```json
{"id":"chatcmpl-462b3e153fd34e5ca7f5f02f3bcb6b0c","object":"chat.completion","created":1753164476,"model":"google/gemma-3-1b-it","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Okay, let’s break down quantum computing in a way that’s hopefully understandable without getting lost in too much jargon. Here's the gist:\n\n**1. Classical Computers vs. Quantum Computers:**\n\n* **Classical Computers:** These are the computers you use every day – laptops, phones, servers. They store information as *bits*. A bit is like a light switch: it's either on (1) or off (0). Everything a classical computer does – from playing games","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":16,"total_tokens":116,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}
```
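
If you have `jq` installed, you can pipe the same request through it to print just the generated text from a response like the one above:

```bash
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-3-1b-it", "messages": [{"role": "user", "content": "Explain Quantum Computing in simple terms."}], "max_tokens": 100}' \
  | jq -r '.choices[0].message.content'
```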
---

## Configuration Customization

- Update `MODEL_ID` in the deployment manifest to serve a different model (make sure your Hugging Face access token has permission to access it).
- Change the number of vLLM pod replicas in the deployment manifest (or scale imperatively, as shown below).
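
As a quick alternative to editing the manifest, the replica count can also be changed imperatively (note that this drifts from the file until you update it):

```bash
# Scale the vLLM deployment to 3 replicas (each replica requests its own GPU)
kubectl scale deployment/vllm-gemma-deployment --replicas=3
```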
---

## Cleanup

```bash
kubectl delete -f vllm-service.yaml
kubectl delete -f vllm-deployment.yaml
kubectl delete secret hf-secret
```
---

## Further Reading / Next Steps

- [vLLM AI Inference Server](https://docs.vllm.ai/en/latest/)
- [Hugging Face Security Tokens](https://huggingface.co/docs/hub/en/security-tokens)
ai/vllm-deployment/vllm-deployment.yaml

Lines changed: 58 additions & 0 deletions
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        # Labels for better functionality within GKE.
        # ai.gke.io/model: gemma-3-1b-it
        # ai.gke.io/inference-server: vllm
        # examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: vllm/vllm-openai:latest
        resources:
          requests:
            cpu: "2"
            memory: "10Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "2"
            memory: "10Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: "1"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=1
        - --host=0.0.0.0
        - --port=8080
        env:
        # 1 billion parameter model (smallest Gemma model)
        - name: MODEL_ID
          value: google/gemma-3-1b-it
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      # GKE-specific node selectors to ensure a particular (Nvidia L4) GPU.
      # nodeSelector:
      #   cloud.google.com/gke-accelerator: nvidia-l4
      #   cloud.google.com/gke-gpu-driver-version: latest
```
ai/vllm-deployment/vllm-service.yaml

Lines changed: 12 additions & 0 deletions
```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080
```
