Serve an LLM with multiple GPUs in GKE

This tutorial demonstrates how to deploy and serve a large language model (LLM) using multiple GPUs on GKE for efficient and scalable inference. You create a GKE cluster that uses multiple L4 GPUs and you prepare infrastructure to serve any of the following models:

- Llama 3 70b
- Mixtral 8x7b
- Falcon 40b

The required number of GPUs varies with the data format of the model. In this tutorial, each model uses two L4 GPUs. To learn more, see Calculate the amount of GPUs.

This tutorial is intended for machine learning (ML) engineers, platform admins and operators, and data and AI specialists who are interested in using Kubernetes container orchestration capabilities to serve LLMs. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE user roles and tasks.

Before reading this page, ensure that you're familiar with the following:

Prepare your environment

In the Google Cloud console, start a Cloud Shell instance.
Set the default environment variables:
```
gcloud config set project PROJECT_ID
gcloud config set billing/quota_project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export CONTROL_PLANE_LOCATION=us-central1
```
Replace PROJECT_ID with your Google Cloud project ID.

Note: If your Cloud Shell instance disconnects at any point during the tutorial, repeat the preceding step to set the environment variables again.
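Because a reconnected Cloud Shell session starts without the exported variables, a small guard can confirm that they are still set before you run later commands. The following is a minimal sketch; the `require_env` helper is a hypothetical name introduced here for illustration and is not part of gcloud.

```shell
# Hypothetical helper: fail fast when a variable this tutorial relies on is unset or empty.
require_env() {
  for name in "$@"; do
    # Indirectly read the variable whose name is stored in $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing environment variable: $name" >&2
      return 1
    fi
  done
}

# Usage: if this reports a missing variable, repeat the export commands.
# require_env PROJECT_ID CONTROL_PLANE_LOCATION
```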

Create a GKE cluster and node pool

You can serve LLMs on GPUs in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.

Autopilot

In Cloud Shell, run the following command:

gcloud container clusters create-auto l4-demo \
  --project=${PROJECT_ID} \
  --location=${CONTROL_PLANE_LOCATION} \
  --release-channel=rapid

GKE creates an Autopilot cluster with CPU and GPU nodes as requested by the deployed workloads.

Configure kubectl to communicate with your cluster:

gcloud container clusters get-credentials l4-demo --location=${CONTROL_PLANE_LOCATION}

Standard

In Cloud Shell, run the following command to create a Standard cluster that uses Workload Identity Federation for GKE:

gcloud container clusters create l4-demo \
  --location ${CONTROL_PLANE_LOCATION} \
  --workload-pool ${PROJECT_ID}.svc.id.goog \
  --enable-image-streaming \
  --node-locations=${CONTROL_PLANE_LOCATION}-a \
  --machine-type n2d-standard-4 \
  --num-nodes 1 --min-nodes 1 --max-nodes 5 \
  --release-channel=rapid

The cluster creation might take several minutes.

Run the following command to create a node pool for your cluster:
```
gcloud container node-pools create g2-standard-24 --cluster l4-demo \
  --location ${CONTROL_PLANE_LOCATION} \
  --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest \
  --machine-type g2-standard-24 \
  --enable-autoscaling --enable-image-streaming \
  --num-nodes=0 --min-nodes=0 --max-nodes=3 \
  --node-locations ${CONTROL_PLANE_LOCATION}-a,${CONTROL_PLANE_LOCATION}-c \
  --spot
```
GKE creates the following resources for the LLM:
- A public Google Kubernetes Engine (GKE) Standard edition cluster.
- A node pool with g2-standard-24 machine type scaled down to 0 nodes. You aren't charged for any GPUs until you launch Pods that request GPUs. This node pool provisions Spot VMs, which are priced lower than the default standard Compute Engine VMs and provide no guarantee of availability. You can remove the --spot flag from this command, and the cloud.google.com/gke-spot node selector in the text-generation-inference.yaml config to use on-demand VMs.
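If you choose on-demand VMs, the node pool command loses its `--spot` flag and the manifest loses its Spot node selector. The manifest half of that change could be sketched as a small filter. The `strip_spot_selector` helper and the output file name in the usage comment are hypothetical, and the sketch assumes the selector sits on a line of its own, as it does in the manifests in this tutorial.

```shell
# Hypothetical helper: drop the Spot node selector line from a manifest read on stdin.
strip_spot_selector() {
  sed '/cloud\.google\.com\/gke-spot:/d'
}

# Usage (after you create text-generation-inference.yaml later in this tutorial):
# strip_spot_selector < text-generation-inference.yaml > text-generation-inference-on-demand.yaml
```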

Configure kubectl to communicate with your cluster:

gcloud container clusters get-credentials l4-demo --location=${CONTROL_PLANE_LOCATION}

Prepare your workload

This section shows how to set up your workload depending on the model you want to use. This tutorial uses Kubernetes Deployments to deploy the model. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.

Llama 3 70b

Set the default environment variables:
```
export HF_TOKEN=HUGGING_FACE_TOKEN 
```
Replace HUGGING_FACE_TOKEN with your Hugging Face token.

Create a Kubernetes Secret for the Hugging Face token:

kubectl create secret generic l4-demo \
  --from-literal=HUGGING_FACE_TOKEN=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -

Create the following text-generation-inference.yaml Deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-1.ubuntu2204.py310
        resources:
          requests:
            cpu: "10"
            memory: "60Gi"
            nvidia.com/gpu: "2"
          limits:
            cpu: "10"
            memory: "60Gi"
            nvidia.com/gpu: "2"
        env:
        - name: MODEL_ID
          value: meta-llama/Meta-Llama-3-70B-Instruct
        - name: NUM_SHARD
          value: "2"
        - name: MAX_INPUT_TOKENS
          value: "2048"
        - name: PORT
          value: "8080"
        - name: QUANTIZE
          value: bitsandbytes-nf4
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: l4-demo
              key: HUGGING_FACE_TOKEN
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        # mountPath is set to /tmp as it's the path where the HUGGINGFACE_HUB_CACHE environment
        # variable in the TGI DLCs is set to instead of the default /data set within the TGI default image.
        # i.e. where the downloaded model from the Hub will be stored
        - mountPath: /tmp
          name: ephemeral-volume
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      - name: ephemeral-volume
        ephemeral:
          volumeClaimTemplate:
            metadata:
              labels:
                type: ephemeral
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: "premium-rwo"
              resources:
                requests:
                  storage: 150Gi
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-l4"
        cloud.google.com/gke-spot: "true"

In this manifest:

- NUM_SHARD must be 2 because the model requires two NVIDIA L4 GPUs.
- QUANTIZE is set to bitsandbytes-nf4, which means that the model is loaded in 4 bits instead of 32 bits. This reduces the amount of GPU memory needed and improves the inference speed. However, the model accuracy can decrease. To learn how to calculate the number of GPUs to request, see Calculate the amount of GPUs.

Apply the manifest:

kubectl apply -f text-generation-inference.yaml

The output is similar to the following:

deployment.apps/llm created

Verify the status of the model:

kubectl get deploy

The output is similar to the following:

NAME   READY   UP-TO-DATE   AVAILABLE   AGE
llm    1/1     1            1           20m
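If you want to script this check instead of reading the table by eye, the READY column can be parsed directly. The following is a minimal sketch that assumes the column layout shown in the sample output; the `deployment_ready` helper is a hypothetical name.

```shell
# Hypothetical helper: succeed only when the named Deployment reports all replicas ready.
# Reads the table printed by `kubectl get deploy` on stdin.
deployment_ready() {
  awk -v name="$1" '
    $1 == name {
      split($2, counts, "/")   # the READY column looks like "1/1"
      if (counts[2] > 0 && counts[1] == counts[2]) ready = 1
    }
    END { exit (ready ? 0 : 1) }
  '
}

# Usage:
# kubectl get deploy | deployment_ready llm && echo "llm is ready"
```

For interactive use, `kubectl rollout status deployment/llm` is another way to wait until the rollout completes.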

View the logs from the running deployment:

kubectl logs -l app=llm

The output is similar to the following:

{"timestamp":"2024-03-09T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291}
{"timestamp":"2024-03-09T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328}
{"timestamp":"2024-03-09T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329}
{"timestamp":"2024-03-09T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}
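Because each log entry is a single JSON object, simple text filters are enough to pick out the lines of interest. The sketch below treats the "Connected" message from the sample output as a rough sign that the router is up, which is an assumption rather than a documented readiness signal. Both helper names are hypothetical.

```shell
# Hypothetical helper: succeed when the logs read on stdin contain the "Connected" message.
router_connected() {
  grep -q '"message":"Connected"'
}

# Hypothetical helper: print only WARN-level log lines (prints nothing if there are none).
warn_lines() {
  grep '"level":"WARN"' || true
}

# Usage (--tail=-1 requests all lines; with a label selector kubectl shows only recent lines by default):
# kubectl logs -l app=llm --tail=-1 | router_connected && echo "router is up"
# kubectl logs -l app=llm --tail=-1 | warn_lines
```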

Mixtral 8x7b

Set the default environment variables:
```
export HF_TOKEN=HUGGING_FACE_TOKEN 
```
Replace HUGGING_FACE_TOKEN with your Hugging Face token.

Create a Kubernetes Secret for the Hugging Face token:

kubectl create secret generic l4-demo \
  --from-literal=HUGGING_FACE_TOKEN=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -

Create the following text-generation-inference.yaml Deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
        resources:
          requests:
            cpu: "5"
            memory: "40Gi"
            nvidia.com/gpu: "2"
          limits:
            cpu: "5"
            memory: "40Gi"
            nvidia.com/gpu: "2"
        env:
        - name: MODEL_ID
          value: mistralai/Mixtral-8x7B-Instruct-v0.1
        - name: NUM_SHARD
          value: "2"
        - name: PORT
          value: "8080"
        - name: QUANTIZE
          value: bitsandbytes-nf4
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: l4-demo
              key: HUGGING_FACE_TOKEN
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        # mountPath is set to /tmp as it's the path where the HF_HOME environment
        # variable in the TGI DLCs is set to instead of the default /data set within the TGI default image.
        # i.e. where the downloaded model from the Hub will be stored
        - mountPath: /tmp
          name: ephemeral-volume
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      - name: ephemeral-volume
        ephemeral:
          volumeClaimTemplate:
            metadata:
              labels:
                type: ephemeral
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: "premium-rwo"
              resources:
                requests:
                  storage: 100Gi
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-l4"
        cloud.google.com/gke-spot: "true"

In this manifest:

- NUM_SHARD must be 2 because the model requires two NVIDIA L4 GPUs.
- QUANTIZE is set to bitsandbytes-nf4, which means that the model is loaded in 4 bits instead of 32 bits. This reduces the amount of GPU memory needed and improves the inference speed. However, the model accuracy can decrease. To learn how to calculate the number of GPUs to request, see Calculate the amount of GPUs.

Apply the manifest:

kubectl apply -f text-generation-inference.yaml

The output is similar to the following:

deployment.apps/llm created

Verify the status of the model:
```
watch kubectl get deploy 
```
When the Deployment is ready, the output is similar to the following:
```
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
llm    1/1     1            1           10m
```
To exit the watch, press Ctrl+C.

View the logs from the running deployment:

kubectl logs -l app=llm

The output is similar to the following:

{"timestamp":"2024-03-09T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291}
{"timestamp":"2024-03-09T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328}
{"timestamp":"2024-03-09T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329}
{"timestamp":"2024-03-09T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}

Falcon 40b

Create the following text-generation-inference.yaml Deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.1-4.ubuntu2204.py310
        resources:
          requests:
            cpu: "10"
            memory: "60Gi"
            nvidia.com/gpu: "2"
          limits:
            cpu: "10"
            memory: "60Gi"
            nvidia.com/gpu: "2"
        env:
        - name: MODEL_ID
          value: tiiuae/falcon-40b-instruct
        - name: NUM_SHARD
          value: "2"
        - name: PORT
          value: "8080"
        - name: QUANTIZE
          value: bitsandbytes-nf4
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        # mountPath is set to /data as it's the path where the HUGGINGFACE_HUB_CACHE environment
        # variable points to in the TGI container image i.e. where the downloaded model from the Hub will be
        # stored
        - mountPath: /data
          name: ephemeral-volume
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      - name: ephemeral-volume
        ephemeral:
          volumeClaimTemplate:
            metadata:
              labels:
                type: ephemeral
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: "premium-rwo"
              resources:
                requests:
                  storage: 175Gi
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-l4"
        cloud.google.com/gke-spot: "true"

In this manifest:

- NUM_SHARD must be 2 because the model requires two NVIDIA L4 GPUs.
- QUANTIZE is set to bitsandbytes-nf4, which means that the model is loaded in 4 bits instead of 32 bits. This reduces the amount of GPU memory needed and improves the inference speed. However, the model accuracy can decrease. To learn how to calculate the number of GPUs to request, see Calculate the amount of GPUs.

Apply the manifest:

kubectl apply -f text-generation-inference.yaml

The output is similar to the following:

deployment.apps/llm created

Verify the status of the model:
```
watch kubectl get deploy 
```
When the deployment is ready, the output is similar to the following:
```
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
llm    1/1     1            1           10m
```
To exit the watch, press Ctrl+C.

View the logs from the running deployment:

kubectl logs -l app=llm

The output is similar to the following:

{"timestamp":"2024-03-09T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291}
{"timestamp":"2024-03-09T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328}
{"timestamp":"2024-03-09T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329}
{"timestamp":"2024-03-09T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}

Create a Service of type ClusterIP

Expose your Pods internally within the cluster so they can be discovered and accessed by other applications.

Create the following llm-service.yaml manifest:

apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080

Apply the manifest:
```
kubectl apply -f llm-service.yaml 
```
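Before you deploy the chat interface, it can help to confirm that the Service reaches the model. The sketch below assumes the TGI `/generate` endpoint, which accepts a JSON body with `inputs` and `parameters` fields, and a local port-forward on port 8080. The helper names and the local port are hypothetical choices made for illustration.

```shell
# Hypothetical helper: build a JSON body for the TGI /generate endpoint.
# Escapes backslashes and double quotes; multi-line prompts are not handled.
tgi_payload() {
  prompt=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"inputs":"%s","parameters":{"max_new_tokens":%d}}' "$prompt" "${2:-64}"
}

# Hypothetical helper: send a prompt through a local port-forward (port 8080 is an assumption).
query_llm() {
  curl -s -X POST "http://localhost:8080/generate" \
    -H 'Content-Type: application/json' \
    -d "$(tgi_payload "$@")"
}

# Usage:
# kubectl port-forward service/llm-service 8080:80 &
# query_llm "What is Kubernetes?" 50
```

The port-forward maps local port 8080 to port 80 of llm-service, which in turn targets port 8080 on the Pods.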

Deploy a chat interface

Use Gradio to build a web application that lets you interact with your model. Gradio is a Python library that has a ChatInterface wrapper that creates user interfaces for chatbots.

Llama 3 70b

Create a file named gradio.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio
  labels:
    app: gradio
spec:
  strategy:
    type: Recreate
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio
        image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
        resources:
          requests:
            cpu: "512m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "512Mi"
        env:
        - name: CONTEXT_PATH
          value: "/generate"
        - name: HOST
          value: "http://llm-service"
        - name: LLM_ENGINE
          value: "tgi"
        - name: MODEL_ID
          value: "meta-llama/Meta-Llama-3-70B-Instruct"
        - name: USER_PROMPT
          value: "<|begin_of_text|><|start_header_id|>user<|end_header_id|> prompt <|eot_id|><|start_header_id|>assistant<|end_header_id|>"
        - name: SYSTEM_PROMPT
          value: "prompt <|eot_id|>"
        ports:
        - containerPort: 7860
---
apiVersion: v1
kind: Service
metadata:
  name: gradio-service
spec:
  type: LoadBalancer
  selector:
    app: gradio
  ports:
  - port: 80
    targetPort: 7860

Apply the manifest:
```
kubectl apply -f gradio.yaml 
```

Find the external IP address of the Service:

kubectl get svc

The output is similar to the following:

NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m

Copy the external IP address from the EXTERNAL-IP column.
View the model interface from your web browser by using the external IP address with the exposed port:
```
http://EXTERNAL_IP 
```
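Instead of copying the address by hand, you could extract it from the command output. The following is a minimal sketch that assumes the column layout shown in the sample output; the `external_ip` helper is a hypothetical name. While the load balancer is still being provisioned, the column shows `<pending>` rather than an address.

```shell
# Hypothetical helper: print the EXTERNAL-IP column for a Service
# from `kubectl get svc` output read on stdin.
external_ip() {
  awk -v name="$1" '$1 == name { print $4 }'
}

# Usage:
# EXTERNAL_IP=$(kubectl get svc | external_ip gradio-service)
# echo "http://${EXTERNAL_IP}"
```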

Mixtral 8x7b

Create a file named gradio.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio
  labels:
    app: gradio
spec:
  strategy:
    type: Recreate
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio
        image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
        resources:
          requests:
            cpu: "512m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "512Mi"
        env:
        - name: CONTEXT_PATH
          value: "/generate"
        - name: HOST
          value: "http://llm-service"
        - name: LLM_ENGINE
          value: "tgi"
        - name: MODEL_ID
          value: "mixtral-8x7b"
        - name: USER_PROMPT
          value: "[INST] prompt [/INST]"
        - name: SYSTEM_PROMPT
          value: "prompt"
        ports:
        - containerPort: 7860
---
apiVersion: v1
kind: Service
metadata:
  name: gradio-service
spec:
  type: LoadBalancer
  selector:
    app: gradio
  ports:
  - port: 80
    targetPort: 7860

Apply the manifest:
```
kubectl apply -f gradio.yaml 
```

Find the external IP address of the Service:

kubectl get svc

The output is similar to the following:

NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m

Copy the external IP address from the EXTERNAL-IP column.
View the model interface from your web browser by using the external IP address with the exposed port:
```
http://EXTERNAL_IP 
```

Falcon 40b

Create a file named gradio.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio
  labels:
    app: gradio
spec:
  strategy:
    type: Recreate
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio
        image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
        resources:
          requests:
            cpu: "512m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "512Mi"
        env:
        - name: CONTEXT_PATH
          value: "/generate"
        - name: HOST
          value: "http://llm-service"
        - name: LLM_ENGINE
          value: "tgi"
        - name: MODEL_ID
          value: "falcon-40b-instruct"
        - name: USER_PROMPT
          value: "User: prompt"
        - name: SYSTEM_PROMPT
          value: "Assistant: prompt"
        ports:
        - containerPort: 7860
---
apiVersion: v1
kind: Service
metadata:
  name: gradio-service
spec:
  type: LoadBalancer
  selector:
    app: gradio
  ports:
  - port: 80
    targetPort: 7860

Apply the manifest:
```
kubectl apply -f gradio.yaml 
```

Find the external IP address of the Service:

kubectl get svc

The output is similar to the following:

NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m

Copy the external IP address from the EXTERNAL-IP column.
View the model interface from your web browser by using the external IP address with the exposed port:
```
http://EXTERNAL_IP 
```

Calculate the amount of GPUs

The number of GPUs depends on the value of the QUANTIZE flag. In this tutorial, QUANTIZE is set to bitsandbytes-nf4, which means that the model is loaded in 4 bits.

A 70 billion parameter model requires a minimum of 40 GB of GPU memory: 70 billion parameters at 4 bits each take 35 GB (70 billion x 4 bits = 35 GB), plus about 5 GB of overhead. A single L4 GPU, with 24 GB of memory, doesn't have enough memory for this. Therefore, the examples in this tutorial use two L4 GPUs (2 x 24 GB = 48 GB). This configuration is sufficient for running Falcon 40b or Llama 3 70b on L4 GPUs.
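The same arithmetic can be sketched as a small shell function. The `gpus_needed` helper is a hypothetical name, and the 5 GB overhead and 24 GB per L4 GPU are the figures used in this section rather than measured values.

```shell
# Hypothetical helper: estimate how many GPUs are needed to hold the model weights.
# Arguments: parameters (billions), bits per parameter, overhead (GB), memory per GPU (GB).
gpus_needed() {
  awk -v params="$1" -v bits="$2" -v overhead="$3" -v gpu_mem="$4" 'BEGIN {
    weights_gb = params * bits / 8          # 1 billion parameters at 8 bits is about 1 GB
    total_gb = weights_gb + overhead
    gpus = int(total_gb / gpu_mem)
    if (gpus * gpu_mem < total_gb) gpus++   # round up to whole GPUs
    print gpus
  }'
}

# Llama 3 70b at 4 bits with 5 GB of overhead on 24 GB L4 GPUs:
# 35 GB + 5 GB = 40 GB, which needs two GPUs.
gpus_needed 70 4 5 24   # prints 2
```

With the same overhead, Falcon 40b at 4 bits comes to 25 GB, which also rounds up to two L4 GPUs.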
