
Commit a60f29d

vLLM AI inference serving example
1 parent 0598f07 commit a60f29d

File tree

3 files changed: +215 -0 lines changed

ai/vllm-deployment/README.md

Lines changed: 145 additions & 0 deletions
# AI Inference with vLLM on Kubernetes

## Purpose / What You'll Learn

This example demonstrates how to deploy a server for AI inference using [vLLM](https://docs.vllm.ai/en/latest/) on Kubernetes. You’ll learn how to:

- Set up a vLLM inference server with a model downloaded from [Hugging Face](https://huggingface.co/).
- Expose the inference endpoint using a Kubernetes `Service`.
- Set up port forwarding from your local machine to the inference `Service` in the Kubernetes cluster.
- Send a sample prediction request to the server using `curl`.

---
## 📚 Table of Contents

- [Prerequisites](#prerequisites)
- [Detailed Steps & Explanation](#detailed-steps--explanation)
- [Verification / Seeing it Work](#verification--seeing-it-work)
- [Configuration Customization](#configuration-customization)
- [Cleanup](#cleanup)
- [Further Reading / Next Steps](#further-reading--next-steps)

---
## Prerequisites

- A Kubernetes cluster with access to Nvidia GPUs (tested with a GKE Autopilot cluster, v1.32).
- A Hugging Face account token with permission to access the model (example model: `google/gemma-3-1b-it`).
- `kubectl` configured to communicate with the cluster and available in your `PATH`.
- The `curl` binary in your `PATH`.
#### Cloud Provider Prerequisites

##### Google Kubernetes Engine (GKE)
* Uncomment the GKE-specific `label`(s) and `nodeSelector`(s) in `vllm-deployment.yaml`.
##### Elastic Kubernetes Service (EKS)
* **GPU-Enabled Nodes**: Your cluster must have a node group with GPU-enabled EC2 instances (e.g., instances from the `p` or `g` families, such as `p3.2xlarge` or `g4dn.xlarge`).
* **Nvidia Device Plugin**: You must have the Nvidia device plugin for Kubernetes installed in your cluster. This is typically deployed as a DaemonSet and is responsible for exposing the `nvidia.com/gpu` resource from the nodes to the Kubernetes scheduler. Without this plugin, Kubernetes won't be aware of the GPUs, and your pods will fail to schedule.
##### Azure Kubernetes Service (AKS)
* **GPU-Enabled Node Pool**: Your cluster must have a node pool that uses GPU-enabled virtual machines. In Azure, these are typically from the NC, ND, or NV series (e.g., `Standard_NC6s_v3`).
* **Nvidia Device Plugin**: The Nvidia device plugin for Kubernetes must be installed on your cluster. This plugin is usually deployed as a DaemonSet and is responsible for discovering the GPUs on each node and exposing the `nvidia.com/gpu` resource to the Kubernetes scheduler. Without this plugin, Kubernetes will not be aware of the GPUs, and any pod requesting them will remain in a `Pending` state. A minimal install-and-verify sketch (applicable to both EKS and AKS) follows below.
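
If your cluster does not already have the device plugin, the sketch below shows one common approach using the upstream NVIDIA `k8s-device-plugin` DaemonSet. The manifest URL and version tag are examples only; check the project's releases for the current manifest path and version.

```bash
# Install the Nvidia device plugin DaemonSet (example version tag; check the
# NVIDIA/k8s-device-plugin releases for the current one and its manifest path)
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml

# Confirm the GPU nodes now advertise the nvidia.com/gpu resource
kubectl describe nodes | grep "nvidia.com/gpu"
```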
---
## Detailed Steps & Explanation

1. Store your Hugging Face token in a Kubernetes `Secret` so the deployment can download the model:

```bash
# The env var HF_TOKEN contains your Hugging Face account token
kubectl create secret generic hf-secret \
  --from-literal=hf_token=$HF_TOKEN
```
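
To confirm the `Secret` exists before moving on, you can optionally run:

```bash
kubectl get secret hf-secret
```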

2. Apply the vLLM deployment:

```bash
kubectl apply -f vllm-deployment.yaml
```

- Wait for the deployment to become available and watch the vLLM pod(s) come up (if a pod stays in `Pending`, see the troubleshooting note after the expected output below):

```bash
kubectl wait --for=condition=Available --timeout=900s deployment/vllm-gemma-deployment
kubectl get pods -l app=gemma-server -w
```

- View the vLLM pod logs:

```bash
kubectl logs -f -l app=gemma-server
```

Expected output:

```
INFO: Automatically detected platform cuda.
...
INFO [launcher.py:34] Route: /v1/chat/completions, Methods: POST
...
INFO: Started server process [13]
INFO: Waiting for application startup.
INFO: Application startup complete.
Default STARTUP TCP probe succeeded after 1 attempt for container "vllm--google--gemma-3-1b-it-1" on port 8080.
...
```
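
If a pod stays in `Pending` (commonly because no node exposes the `nvidia.com/gpu` resource), inspecting its scheduling events usually shows why:

```bash
# Show events for the vLLM pod(s), including scheduling failures
kubectl describe pod -l app=gemma-server
```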

3. Create the service:

```bash
# ClusterIP service on port 8080 in front of the vLLM deployment
kubectl apply -f vllm-service.yaml
```
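
Optionally, confirm the `Service` is backed by the running pod once it is ready:

```bash
# The ENDPOINTS column should list the pod IP on port 8080
kubectl get endpoints vllm-service
```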

## Verification / Seeing it Work

1. Forward local requests to the vLLM service (this command blocks, so run it in a separate terminal):

```bash
# Forward a local port (e.g., 8080) to the service port (e.g., 8080)
kubectl port-forward service/vllm-service 8080:8080
```
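
Before sending a full chat request, you can optionally check that the server is up; the OpenAI-compatible `/v1/models` endpoint should list the served model:

```bash
curl http://localhost:8080/v1/models
```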

2. Send a request to the locally forwarded port:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Explain Quantum Computing in simple terms."}],
    "max_tokens": 100
  }'
```

Expected output (or similar):

```json
{"id":"chatcmpl-462b3e153fd34e5ca7f5f02f3bcb6b0c","object":"chat.completion","created":1753164476,"model":"google/gemma-3-1b-it","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Okay, let’s break down quantum computing in a way that’s hopefully understandable without getting lost in too much jargon. Here's the gist:\n\n**1. Classical Computers vs. Quantum Computers:**\n\n* **Classical Computers:** These are the computers you use every day – laptops, phones, servers. They store information as *bits*. A bit is like a light switch: it's either on (1) or off (0). Everything a classical computer does – from playing games","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":16,"total_tokens":116,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}
```
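
If you have `jq` installed, you can pipe the same request through it to print just the generated text from a response like the one above:

```bash
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-3-1b-it", "messages": [{"role": "user", "content": "Explain Quantum Computing in simple terms."}], "max_tokens": 100}' \
  | jq -r '.choices[0].message.content'
```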
---

## Configuration Customization

- Update `MODEL_ID` in the deployment manifest to serve a different model (make sure your Hugging Face access token has permission to access it).
- Change the number of vLLM pod replicas in the deployment manifest (or scale imperatively, as shown below).
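
As a quick alternative to editing the manifest, the replica count can also be changed imperatively (note that this drifts from the file until you update it):

```bash
# Scale the vLLM deployment to 3 replicas (each replica requests its own GPU)
kubectl scale deployment/vllm-gemma-deployment --replicas=3
```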
---

## Cleanup

```bash
kubectl delete -f vllm-service.yaml
kubectl delete -f vllm-deployment.yaml
kubectl delete secret hf-secret
```
---

## Further Reading / Next Steps

- [vLLM AI Inference Server](https://docs.vllm.ai/en/latest/)
- [Hugging Face Security Tokens](https://huggingface.co/docs/hub/en/security-tokens)
ai/vllm-deployment/vllm-deployment.yaml

Lines changed: 58 additions & 0 deletions
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        # Labels for better functionality within GKE.
        # ai.gke.io/model: gemma-3-1b-it
        # ai.gke.io/inference-server: vllm
        # examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: vllm/vllm-openai:latest
        resources:
          requests:
            cpu: "2"
            memory: "10Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "2"
            memory: "10Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: "1"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=1
        - --host=0.0.0.0
        - --port=8080
        env:
        # 1 billion parameter model (smallest Gemma model)
        - name: MODEL_ID
          value: google/gemma-3-1b-it
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      # GKE-specific node selectors to ensure a particular (Nvidia L4) GPU.
      # nodeSelector:
      #   cloud.google.com/gke-accelerator: nvidia-l4
      #   cloud.google.com/gke-gpu-driver-version: latest
```
ai/vllm-deployment/vllm-service.yaml

Lines changed: 12 additions & 0 deletions
```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080
```
