Serve an LLM using TPUs on GKE with KubeRay

This tutorial shows how to serve a large language model (LLM) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with the Ray Operator add-on, and the vLLM serving framework.

In this tutorial, you can serve LLMs such as Llama 3 8B Instruct or Mistral 7B Instruct on a single-host TPU v5e slice, or Llama 3.1 70B on a single-host TPU Trillium (v6e) slice.

This guide is for generative AI customers, new and existing GKE users, ML engineers, MLOps (DevOps) engineers, and platform administrators who are interested in using Kubernetes container orchestration capabilities to serve models with Ray and vLLM on TPUs.

Background

This section describes the key technologies used in this guide.

GKE managed Kubernetes service

Google Cloud offers a wide range of services, including GKE, which is well-suited to deploying and managing AI/ML workloads. GKE is a managed Kubernetes service that simplifies deploying, scaling, and managing containerized applications. GKE provides the necessary infrastructure, including scalable resources, distributed computing, and efficient networking, to handle the computational demands of LLMs.

To learn more about key Kubernetes concepts, see Start learning about Kubernetes. To learn more about GKE and how it helps you scale, automate, and manage Kubernetes, see GKE overview.

Ray operator

The Ray Operator add-on on GKE provides an end-to-end AI/ML platform for serving, training, and fine-tuning machine learning workloads. In this tutorial, you use Ray Serve, a framework in Ray, to serve popular LLMs from Hugging Face.

TPUs

TPUs are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning and AI models built using frameworks such as TensorFlow, PyTorch, and JAX.

This tutorial covers serving LLMs on TPU v5e or TPU Trillium (v6e) nodes, with TPU topologies configured based on each model's requirements for serving prompts with low latency.

vLLM

vLLM is a highly optimized open source LLM serving framework that can increase serving throughput on TPUs, with features such as:

  • Optimized transformer implementation with PagedAttention
  • Continuous batching to improve the overall serving throughput
  • Tensor parallelism and distributed serving on multiple TPUs

To learn more, refer to the vLLM documentation.

Objectives

This tutorial covers the following steps:

  1. Create a GKE cluster with a TPU node pool.
  2. Deploy a RayCluster custom resource with a single-host TPU slice. GKE deploys the RayCluster custom resource as Kubernetes Pods.
  3. Serve an LLM.
  4. Interact with the models.

You can optionally configure the following model serving resources and techniques that the Ray Serve framework supports:

  • Deploy a RayService custom resource.
  • Compose multiple models with model composition.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
  • Create a Hugging Face account, if you don't already have one.
  • Ensure that you have a Hugging Face token.
  • Ensure that you have access to the Hugging Face model that you want to use. You usually gain this access by signing an agreement and requesting access from the model owner on the Hugging Face model page.
  • Ensure that you have the following IAM roles (an example grant command follows this list):
    • roles/container.admin
    • roles/iam.serviceAccountAdmin
    • roles/container.clusterAdmin
    • roles/artifactregistry.writer
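
If any of these roles are missing, a project owner can grant them. The following is a minimal sketch that grants one role to your user account; PROJECT_ID and USER_EMAIL are placeholders for your project ID and Google Account, and you repeat the command for each role in the list:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/container.clusterAdmin"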

Prepare your environment

  1. Check that you have enough quota in your Google Cloud project for a single-host TPU v5e or a single-host TPU Trillium (v6e). To manage your quota, see TPU quotas.

  2. In the Google Cloud console, start a Cloud Shell instance:
    Open Cloud Shell

  3. Clone the sample repository:

    git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
    cd kubernetes-engine-samples
  4. Navigate to the working directory:

    cd ai-ml/gke-ray/rayserve/llm 
  5. Set the default environment variables for the GKE cluster creation:

    Llama-3-8B-Instruct

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc

    Replace the following:

    • HUGGING_FACE_TOKEN: your Hugging Face access token.
    • REGION: the region where you have TPU quota. Ensure that the TPU version that you want to use is available in this region. To learn more, see TPU availability in GKE.
    • ZONE: the zone with available TPU quota.
    • VLLM_IMAGE: the vLLM TPU image. You can use the public docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 image or build your own TPU image.

    Mistral-7B

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3"
    export TOKENIZER_MODE=mistral
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc

    Replace the following:

    • HUGGING_FACE_TOKEN: your Hugging Face access token.
    • REGION: the region where you have TPU quota. Ensure that the TPU version that you want to use is available in this region. To learn more, see TPU availability in GKE.
    • ZONE: the zone with available TPU quota.
    • VLLM_IMAGE: the vLLM TPU image. You can use the public docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 image or build your own TPU image.

    Llama 3.1 70B

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="meta-llama/Llama-3.1-70B"
    export MAX_MODEL_LEN=8192
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc

    Replace the following:

    • HUGGING_FACE_TOKEN: your Hugging Face access token.
    • REGION: the region where you have TPU quota. Ensure that the TPU version that you want to use is available in this region. To learn more, see TPU availability in GKE.
    • ZONE: the zone with available TPU quota.
    • VLLM_IMAGE: the vLLM TPU image. You can use the public docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 image or build your own TPU image.
  6. Pull down the vLLM container image:

    sudo usermod -aG docker ${USER}
    newgrp docker
    docker pull ${VLLM_IMAGE}

Create a cluster

You can serve an LLM on TPUs with Ray in a GKE Autopilot or Standard cluster by using the Ray Operator add-on.

Best practices:

Use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.

Use Cloud Shell to create an Autopilot or Standard cluster:

Autopilot

  1. Create a GKE Autopilot cluster with the Ray Operator add-on enabled:

    gcloud container clusters create-auto ${CLUSTER_NAME} \
        --enable-ray-operator \
        --release-channel=rapid \
        --location=${COMPUTE_REGION}

Standard

  1. Create a Standard cluster with the Ray Operator add-on enabled:

    gcloud container clusters create ${CLUSTER_NAME} \
        --release-channel=rapid \
        --location=${COMPUTE_ZONE} \
        --workload-pool=${PROJECT_ID}.svc.id.goog \
        --machine-type="n1-standard-4" \
        --addons=RayOperator,GcsFuseCsiDriver
  2. Create a single-host TPU slice node pool:

    Llama-3-8B-Instruct

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct5lp-hightpu-8t \
        --num-nodes=1

    GKE creates a TPU v5e node pool with a ct5lp-hightpu-8t machine type.

    Mistral-7B

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct5lp-hightpu-8t \
        --num-nodes=1

    GKE creates a TPU v5e node pool with a ct5lp-hightpu-8t machine type.

    Llama 3.1 70B

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct6e-standard-8t \
        --num-nodes=1

    GKE creates a TPU v6e node pool with a ct6e-standard-8t machine type.
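
    You can optionally confirm that the node pool was created and is ready before you continue. This check applies to any of the model options above:

    gcloud container node-pools describe tpu-1 \
        --cluster=${CLUSTER_NAME} \
        --location=${COMPUTE_ZONE} \
        --format="value(status)"

    The command prints RUNNING when the node pool is ready.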

Configure kubectl to communicate with your cluster

To configure kubectl to communicate with your cluster, run the following command:

Autopilot

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_REGION}

Standard

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_ZONE}
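
For either mode, you can optionally confirm that kubectl is pointed at the new cluster by listing its nodes:

kubectl get nodes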

Create a Kubernetes Secret for Hugging Face credentials

To create a Kubernetes Secret that contains the Hugging Face token, run the following command:

kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl --namespace ${NAMESPACE} apply -f -
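
Optionally, confirm that the Secret exists; the token value is stored base64-encoded:

kubectl --namespace ${NAMESPACE} get secret hf-secret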

Create a Cloud Storage bucket

To accelerate the vLLM deployment startup time and minimize required disk space per node, use the Cloud Storage FUSE CSI driver to mount the downloaded model and compilation cache to the Ray nodes.

In Cloud Shell, run the following command:

gcloud storage buckets create gs://${GSBUCKET} \
    --uniform-bucket-level-access

This command creates a Cloud Storage bucket to store the model files you download from Hugging Face.
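
You can optionally confirm that the bucket was created:

gcloud storage buckets describe gs://${GSBUCKET} --format="value(name)"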

Set up a Kubernetes ServiceAccount to access the bucket

  1. Create the Kubernetes ServiceAccount:

    kubectl create serviceaccount ${KSA_NAME} \
        --namespace ${NAMESPACE}
  2. Grant the Kubernetes ServiceAccount read-write access to the Cloud Storage bucket:

    gcloud storage buckets add-iam-policy-binding gs://${GSBUCKET} \
        --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA_NAME}" \
        --role "roles/storage.objectUser"

    GKE creates the following resources for the LLM:

    1. A Cloud Storage bucket to store the downloaded model and the compilation cache. A Cloud Storage FUSE CSI driver reads the content of the bucket.
    2. Volumes with file caching enabled and the parallel download feature of Cloud Storage FUSE.
    Best practice:

    Use a file cache backed by tmpfs or Hyperdisk / Persistent Disk depending on the expected size of the model contents, for example, weight files. In this tutorial, you use Cloud Storage FUSE file cache backed by RAM.
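
    Optionally, verify the binding by inspecting the bucket's IAM policy. The principal for your Kubernetes ServiceAccount should be listed with the roles/storage.objectUser role:

    gcloud storage buckets get-iam-policy gs://${GSBUCKET}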

Deploy a RayCluster custom resource

Deploy a RayCluster custom resource, which typically consists of one head Pod and multiple worker Pods.

Llama-3-8B-Instruct

Create the RayCluster custom resource to deploy the Llama 3 8B instruction tuned model by completing the following steps:

  1. Inspect the ray-cluster.tpu-v5e-singlehost.yaml manifest:

    apiVersion: ray.io/v1 kind: RayCluster metadata:  name: vllm-tpu spec:  headGroupSpec:  rayStartParams: {}  template:  metadata:  annotations:  gke-gcsfuse/volumes: "true"  gke-gcsfuse/cpu-limit: "0"  gke-gcsfuse/memory-limit: "0"  gke-gcsfuse/ephemeral-storage-limit: "0"  spec:  serviceAccountName: $KSA_NAME  containers:  - name: ray-head  image: $VLLM_IMAGE  imagePullPolicy: IfNotPresent  resources:  limits:  cpu: "2"  memory: 8G  requests:  cpu: "2"  memory: 8G  env:  - name: HUGGING_FACE_HUB_TOKEN  valueFrom:  secretKeyRef:  name: hf-secret  key: hf_api_token  - name: VLLM_XLA_CACHE_PATH  value: "/data"  ports:  - containerPort: 6379  name: gcs  - containerPort: 8265  name: dashboard  - containerPort: 10001  name: client  - containerPort: 8000  name: serve  - containerPort: 8471  name: slicebuilder  - containerPort: 8081  name: mxla  volumeMounts:  - name: gcs-fuse-csi-ephemeral  mountPath: /data  - name: dshm  mountPath: /dev/shm  volumes:  - name: gke-gcsfuse-cache  emptyDir:  medium: Memory  - name: dshm  emptyDir:  medium: Memory  - name: gcs-fuse-csi-ephemeral  csi:  driver: gcsfuse.csi.storage.gke.io  volumeAttributes:  bucketName: $GSBUCKET  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"  workerGroupSpecs:  - groupName: tpu-group  replicas: 1  minReplicas: 1  maxReplicas: 1  numOfHosts: 1  rayStartParams: {}  template:  metadata:  annotations:  gke-gcsfuse/volumes: "true"  gke-gcsfuse/cpu-limit: "0"  gke-gcsfuse/memory-limit: "0"  gke-gcsfuse/ephemeral-storage-limit: "0"  spec:  serviceAccountName: $KSA_NAME  containers:  - name: ray-worker  image: $VLLM_IMAGE  imagePullPolicy: IfNotPresent  resources:  limits:  cpu: "100"  google.com/tpu: "8"  ephemeral-storage: 40G  memory: 200G  requests:  cpu: "100"  google.com/tpu: "8"  ephemeral-storage: 40G  memory: 200G  env:  - name: VLLM_XLA_CACHE_PATH  value: "/data"  - name: HUGGING_FACE_HUB_TOKEN  valueFrom:  secretKeyRef:  name: hf-secret  key: hf_api_token  volumeMounts:  - name: gcs-fuse-csi-ephemeral  mountPath: /data  - name: dshm  mountPath: /dev/shm  volumes:  - name: gke-gcsfuse-cache  emptyDir:  medium: Memory  - name: dshm  emptyDir:  medium: Memory  - name: gcs-fuse-csi-ephemeral  csi:  driver: gcsfuse.csi.storage.gke.io  volumeAttributes:  bucketName: $GSBUCKET  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"  nodeSelector:  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice  cloud.google.com/gke-tpu-topology: 2x4
  2. Apply the manifest:

    envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f - 

    The envsubst command replaces the environment variables in the manifest.

GKE creates a RayCluster custom resource with a workergroup that contains a TPU v5e single-host in a 2x4 topology.

Mistral-7B

Create the RayCluster custom resource to deploy the Mistral-7B model by completing the following steps:

  1. Inspect the ray-cluster.tpu-v5e-singlehost.yaml manifest:

    apiVersion: ray.io/v1 kind: RayCluster metadata:  name: vllm-tpu spec:  headGroupSpec:  rayStartParams: {}  template:  metadata:  annotations:  gke-gcsfuse/volumes: "true"  gke-gcsfuse/cpu-limit: "0"  gke-gcsfuse/memory-limit: "0"  gke-gcsfuse/ephemeral-storage-limit: "0"  spec:  serviceAccountName: $KSA_NAME  containers:  - name: ray-head  image: $VLLM_IMAGE  imagePullPolicy: IfNotPresent  resources:  limits:  cpu: "2"  memory: 8G  requests:  cpu: "2"  memory: 8G  env:  - name: HUGGING_FACE_HUB_TOKEN  valueFrom:  secretKeyRef:  name: hf-secret  key: hf_api_token  - name: VLLM_XLA_CACHE_PATH  value: "/data"  ports:  - containerPort: 6379  name: gcs  - containerPort: 8265  name: dashboard  - containerPort: 10001  name: client  - containerPort: 8000  name: serve  - containerPort: 8471  name: slicebuilder  - containerPort: 8081  name: mxla  volumeMounts:  - name: gcs-fuse-csi-ephemeral  mountPath: /data  - name: dshm  mountPath: /dev/shm  volumes:  - name: gke-gcsfuse-cache  emptyDir:  medium: Memory  - name: dshm  emptyDir:  medium: Memory  - name: gcs-fuse-csi-ephemeral  csi:  driver: gcsfuse.csi.storage.gke.io  volumeAttributes:  bucketName: $GSBUCKET  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"  workerGroupSpecs:  - groupName: tpu-group  replicas: 1  minReplicas: 1  maxReplicas: 1  numOfHosts: 1  rayStartParams: {}  template:  metadata:  annotations:  gke-gcsfuse/volumes: "true"  gke-gcsfuse/cpu-limit: "0"  gke-gcsfuse/memory-limit: "0"  gke-gcsfuse/ephemeral-storage-limit: "0"  spec:  serviceAccountName: $KSA_NAME  containers:  - name: ray-worker  image: $VLLM_IMAGE  imagePullPolicy: IfNotPresent  resources:  limits:  cpu: "100"  google.com/tpu: "8"  ephemeral-storage: 40G  memory: 200G  requests:  cpu: "100"  google.com/tpu: "8"  ephemeral-storage: 40G  memory: 200G  env:  - name: VLLM_XLA_CACHE_PATH  value: "/data"  - name: HUGGING_FACE_HUB_TOKEN  valueFrom:  secretKeyRef:  name: hf-secret  key: hf_api_token  volumeMounts:  - name: gcs-fuse-csi-ephemeral  mountPath: /data  - name: dshm  mountPath: /dev/shm  volumes:  - name: gke-gcsfuse-cache  emptyDir:  medium: Memory  - name: dshm  emptyDir:  medium: Memory  - name: gcs-fuse-csi-ephemeral  csi:  driver: gcsfuse.csi.storage.gke.io  volumeAttributes:  bucketName: $GSBUCKET  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"  nodeSelector:  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice  cloud.google.com/gke-tpu-topology: 2x4
  2. Apply the manifest:

    envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f - 

    The envsubst command replaces the environment variables in the manifest.

GKE creates a RayCluster custom resource with a workergroup that contains a TPU v5e single-host in a 2x4 topology.

Llama 3.1 70B

Create the RayCluster custom resource to deploy the Llama 3.1 70B model by completing the following steps:

  1. Inspect the ray-cluster.tpu-v6e-singlehost.yaml manifest:

    apiVersion: ray.io/v1 kind: RayCluster metadata:  name: vllm-tpu spec:  headGroupSpec:  rayStartParams: {}  template:  metadata:  annotations:  gke-gcsfuse/volumes: "true"  gke-gcsfuse/cpu-limit: "0"  gke-gcsfuse/memory-limit: "0"  gke-gcsfuse/ephemeral-storage-limit: "0"  spec:  serviceAccountName: $KSA_NAME  containers:  - name: ray-head  image: $VLLM_IMAGE  imagePullPolicy: IfNotPresent  resources:  limits:  cpu: "2"  memory: 8G  requests:  cpu: "2"  memory: 8G  env:  - name: HUGGING_FACE_HUB_TOKEN  valueFrom:  secretKeyRef:  name: hf-secret  key: hf_api_token  - name: VLLM_XLA_CACHE_PATH  value: "/data"  ports:  - containerPort: 6379  name: gcs  - containerPort: 8265  name: dashboard  - containerPort: 10001  name: client  - containerPort: 8000  name: serve  - containerPort: 8471  name: slicebuilder  - containerPort: 8081  name: mxla  volumeMounts:  - name: gcs-fuse-csi-ephemeral  mountPath: /data  - name: dshm  mountPath: /dev/shm  volumes:  - name: gke-gcsfuse-cache  emptyDir:  medium: Memory  - name: dshm  emptyDir:  medium: Memory  - name: gcs-fuse-csi-ephemeral  csi:  driver: gcsfuse.csi.storage.gke.io  volumeAttributes:  bucketName: $GSBUCKET  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"  workerGroupSpecs:  - groupName: tpu-group  replicas: 1  minReplicas: 1  maxReplicas: 1  numOfHosts: 1  rayStartParams: {}  template:  metadata:  annotations:  gke-gcsfuse/volumes: "true"  gke-gcsfuse/cpu-limit: "0"  gke-gcsfuse/memory-limit: "0"  gke-gcsfuse/ephemeral-storage-limit: "0"  spec:  serviceAccountName: $KSA_NAME  containers:  - name: ray-worker  image: $VLLM_IMAGE  imagePullPolicy: IfNotPresent  resources:  limits:  cpu: "100"  google.com/tpu: "8"  ephemeral-storage: 40G  memory: 200G  requests:  cpu: "100"  google.com/tpu: "8"  ephemeral-storage: 40G  memory: 200G  env:  - name: HUGGING_FACE_HUB_TOKEN  valueFrom:  secretKeyRef:  name: hf-secret  key: hf_api_token  - name: VLLM_XLA_CACHE_PATH  value: "/data"  volumeMounts:  - name: gcs-fuse-csi-ephemeral  mountPath: /data  - name: dshm  mountPath: /dev/shm  volumes:  - name: gke-gcsfuse-cache  emptyDir:  medium: Memory  - name: dshm  emptyDir:  medium: Memory  - name: gcs-fuse-csi-ephemeral  csi:  driver: gcsfuse.csi.storage.gke.io  volumeAttributes:  bucketName: $GSBUCKET  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"  nodeSelector:  cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice  cloud.google.com/gke-tpu-topology: 2x4
  2. Apply the manifest:

    envsubst < tpu/ray-cluster.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f - 

    The envsubst command replaces the environment variables in the manifest.

GKE creates a RayCluster custom resource with a workergroup that contains a TPU v6e single-host in a 2x4 topology.
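
Regardless of which model you chose, you can optionally confirm that the Ray head and worker Pods were scheduled. This sketch assumes that KubeRay labels the Pods with ray.io/cluster=vllm-tpu; if your KubeRay version uses different labels, list all Pods in the namespace instead:

kubectl --namespace ${NAMESPACE} get pods -l ray.io/cluster=vllm-tpu

The worker Pod remains in Pending until a TPU node is available.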

Connect to the RayCluster custom resource

After the RayCluster custom resource is created, you can connect to the RayCluster resource and start serving the model.

  1. Verify that GKE created the RayCluster Service:

    kubectl --namespace ${NAMESPACE} get raycluster/vllm-tpu \
        --output wide

    The output is similar to the following:

    NAME       DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   TPUS   STATUS   AGE   HEAD POD IP       HEAD SERVICE IP
    vllm-tpu   1                 1                   ###    ###G     0      8      ready    ###   ###.###.###.###   ###.###.###.###

    Wait until the STATUS is ready and the HEAD POD IP and HEAD SERVICE IP columns have an IP address.

  2. Establish port-forwarding sessions to the Ray head:

    pkill -f "kubectl .* port-forward .* 8265:8265" pkill -f "kubectl .* port-forward .* 10001:10001" kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null & kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 10001:10001 2>&1 >/dev/null & 
  3. Verify that the Ray client can connect to the remote RayCluster custom resource:

    docker run --net=host -it ${VLLM_IMAGE} \
        ray list nodes --address http://localhost:8265

    The output is similar to the following:

    ======== List: YYYY-MM-DD HH:MM:SS.NNNNNN ========
    Stats:
    ------------------------------
    Total: 2

    Table:
    ------------------------------
        NODE_ID     NODE_IP          IS_HEAD_NODE  STATE  STATE_MESSAGE  NODE_NAME        RESOURCES_TOTAL                 LABELS
     0  XXXXXXXXXX  ###.###.###.###  True          ALIVE                 ###.###.###.###  CPU: 2.0                        ray.io/node_id: XXXXXXXXXX
                                                                                          memory: #.### GiB
                                                                                          node:###.###.###.###: 1.0
                                                                                          node:__internal_head__: 1.0
                                                                                          object_store_memory: #.### GiB
     1  XXXXXXXXXX  ###.###.###.###  False         ALIVE                 ###.###.###.###  CPU: 100.0                      ray.io/node_id: XXXXXXXXXX
                                                                                          TPU: 8.0
                                                                                          TPU-v#e-8-head: 1.0
                                                                                          accelerator_type:TPU-V#E: 1.0
                                                                                          memory: ###.### GiB
                                                                                          node:###.###.###.###: 1.0
                                                                                          object_store_memory: ##.### GiB
                                                                                          tpu-group-0: 1.0

Deploy the model with vLLM

To deploy a specific model with vLLM, follow these instructions.

Llama-3-8B-Instruct

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct"}}'

Mistral-7B

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --env TOKENIZER_MODE=${TOKENIZER_MODE} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.3", "TOKENIZER_MODE": "mistral"}}'

Llama 3.1 70B

docker run \
    --env MAX_MODEL_LEN=${MAX_MODEL_LEN} \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MAX_MODEL_LEN": "8192", "MODEL_ID": "meta-llama/Meta-Llama-3.1-70B"}}'

View the Ray Dashboard

You can view your Ray Serve deployment and relevant logs from the Ray Dashboard.

  1. Click the Web Preview button on the top right of the Cloud Shell taskbar.
  2. Click Change port and set the port number to 8265.
  3. Click Change and Preview.
  4. On the Ray Dashboard, click the Serve tab.

After the Serve deployment has a HEALTHY status, the model is ready to begin processing inputs.
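
If you prefer the command line to the dashboard, the following sketch reuses the vLLM image to query the same status. It assumes the Ray Serve CLI in the image supports the --address flag and that the port-forwarding session to port 8265 from the previous section is still active:

docker run --net=host -it ${VLLM_IMAGE} \
    serve status --address http://localhost:8265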

Serve the model

This guide highlights models that support text generation, a technique that allows text content creation from a prompt.

Llama-3-8B-Instruct

  1. Set up port forwarding to the server:

    pkill -f "kubectl .* port-forward .* 8000:8000" kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null & 
  2. Send a prompt to the Serve endpoint:

    curl -X POST http://localhost:8000/v1/generate \
        -H "Content-Type: application/json" \
        -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

Mistral-7B

  1. Set up port forwarding to the server:

    pkill -f "kubectl .* port-forward .* 8000:8000" kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null & 
  2. Send a prompt to the Serve endpoint:

    curl -X POST http://localhost:8000/v1/generate \
        -H "Content-Type: application/json" \
        -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

Llama 3.1 70B

  1. Set up port forwarding to the server:

    pkill -f "kubectl .* port-forward .* 8000:8000" kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null & 
  2. Send a prompt to the Serve endpoint:

    curl -X POST http://localhost:8000/v1/generate \
        -H "Content-Type: application/json" \
        -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
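
    For any of the models above, if jq is available in your shell (it is preinstalled in Cloud Shell), you can pretty-print the JSON response; this request is otherwise identical to the previous one:

    curl -s -X POST http://localhost:8000/v1/generate \
        -H "Content-Type: application/json" \
        -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}' | jq .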

Additional configuration

You can optionally configure the following model serving resources and techniques that the Ray Serve framework supports:

Deploy a RayService

You can deploy the same models from this tutorial by using a RayService custom resource.

  1. Delete the RayCluster custom resource that you created in this tutorial:

    kubectl --namespace ${NAMESPACE} delete raycluster/vllm-tpu 
  2. Create the RayService custom resource to deploy a model:

    Llama-3-8B-Instruct

    1. Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

      apiVersion: ray.io/v1 kind: RayService metadata:  name: vllm-tpu spec:  serveConfigV2: |  applications:  - name: llm  import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model  deployments:  - name: VLLMDeployment  num_replicas: 1  runtime_env:  working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"  env_vars:  MODEL_ID: "$MODEL_ID"  MAX_MODEL_LEN: "$MAX_MODEL_LEN"  DTYPE: "$DTYPE"  TOKENIZER_MODE: "$TOKENIZER_MODE"  TPU_CHIPS: "8"  rayClusterConfig:  headGroupSpec:  rayStartParams: {}  template:  metadata:  annotations:  gke-gcsfuse/volumes: "true"  gke-gcsfuse/cpu-limit: "0"  gke-gcsfuse/memory-limit: "0"  gke-gcsfuse/ephemeral-storage-limit: "0"  spec:  serviceAccountName: $KSA_NAME  containers:  - name: ray-head  image: $VLLM_IMAGE  imagePullPolicy: IfNotPresent  ports:  - containerPort: 6379  name: gcs  - containerPort: 8265  name: dashboard  - containerPort: 10001  name: client  - containerPort: 8000  name: serve  env:  - name: HUGGING_FACE_HUB_TOKEN  valueFrom:  secretKeyRef:  name: hf-secret  key: hf_api_token  - name: VLLM_XLA_CACHE_PATH  value: "/data"  resources:  limits:  cpu: "2"  memory: 8G  requests:  cpu: "2"  memory: 8G  volumeMounts:  - name: gcs-fuse-csi-ephemeral  mountPath: /data  - name: dshm  mountPath: /dev/shm  volumes:  - name: gke-gcsfuse-cache  emptyDir:  medium: Memory  - name: dshm  emptyDir:  medium: Memory  - name: gcs-fuse-csi-ephemeral  csi:  driver: gcsfuse.csi.storage.gke.io  volumeAttributes:  bucketName: $GSBUCKET  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"  workerGroupSpecs:  - groupName: tpu-group  replicas: 1  minReplicas: 1  maxReplicas: 1  numOfHosts: 1  rayStartParams: {}  template:  metadata:  annotations:  gke-gcsfuse/volumes: "true"  gke-gcsfuse/cpu-limit: "0"  gke-gcsfuse/memory-limit: "0"  gke-gcsfuse/ephemeral-storage-limit: "0"  spec:  serviceAccountName: $KSA_NAME  containers:  - name: ray-worker  image: $VLLM_IMAGE  imagePullPolicy: IfNotPresent  resources:  limits:  cpu: "100"  google.com/tpu: "8"  ephemeral-storage: 40G  memory: 200G  requests:  cpu: "100"  google.com/tpu: "8"  ephemeral-storage: 40G  memory: 200G  env:  - name: JAX_PLATFORMS  value: "tpu"  - name: HUGGING_FACE_HUB_TOKEN  valueFrom:  secretKeyRef:  name: hf-secret  key: hf_api_token  - name: VLLM_XLA_CACHE_PATH  value: "/data"  volumeMounts:  - name: gcs-fuse-csi-ephemeral  mountPath: /data  - name: dshm  mountPath: /dev/shm  volumes:  - name: gke-gcsfuse-cache  emptyDir:  medium: Memory  - name: dshm  emptyDir:  medium: Memory  - name: gcs-fuse-csi-ephemeral  csi:  driver: gcsfuse.csi.storage.gke.io  volumeAttributes:  bucketName: $GSBUCKET  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"  nodeSelector:  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f - 

      The envsubst command replaces the environment variables in the manifest.

      GKE creates a RayService with a workergroup that contains a TPU v5e single-host in a 2x4 topology.

    Mistral-7B

    1. Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

      apiVersion: ray.io/v1 kind: RayService metadata:  name: vllm-tpu spec:  serveConfigV2: |  applications:  - name: llm  import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model  deployments:  - name: VLLMDeployment  num_replicas: 1  runtime_env:  working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"  env_vars:  MODEL_ID: "$MODEL_ID"  MAX_MODEL_LEN: "$MAX_MODEL_LEN"  DTYPE: "$DTYPE"  TOKENIZER_MODE: "$TOKENIZER_MODE"  TPU_CHIPS: "8"  rayClusterConfig:  headGroupSpec:  rayStartParams: {}  template:  metadata:  annotations:  gke-gcsfuse/volumes: "true"  gke-gcsfuse/cpu-limit: "0"  gke-gcsfuse/memory-limit: "0"  gke-gcsfuse/ephemeral-storage-limit: "0"  spec:  serviceAccountName: $KSA_NAME  containers:  - name: ray-head  image: $VLLM_IMAGE  imagePullPolicy: IfNotPresent  ports:  - containerPort: 6379  name: gcs  - containerPort: 8265  name: dashboard  - containerPort: 10001  name: client  - containerPort: 8000  name: serve  env:  - name: HUGGING_FACE_HUB_TOKEN  valueFrom:  secretKeyRef:  name: hf-secret  key: hf_api_token  - name: VLLM_XLA_CACHE_PATH  value: "/data"  resources:  limits:  cpu: "2"  memory: 8G  requests:  cpu: "2"  memory: 8G  volumeMounts:  - name: gcs-fuse-csi-ephemeral  mountPath: /data  - name: dshm  mountPath: /dev/shm  volumes:  - name: gke-gcsfuse-cache  emptyDir:  medium: Memory  - name: dshm  emptyDir:  medium: Memory  - name: gcs-fuse-csi-ephemeral  csi:  driver: gcsfuse.csi.storage.gke.io  volumeAttributes:  bucketName: $GSBUCKET  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"  workerGroupSpecs:  - groupName: tpu-group  replicas: 1  minReplicas: 1  maxReplicas: 1  numOfHosts: 1  rayStartParams: {}  template:  metadata:  annotations:  gke-gcsfuse/volumes: "true"  gke-gcsfuse/cpu-limit: "0"  gke-gcsfuse/memory-limit: "0"  gke-gcsfuse/ephemeral-storage-limit: "0"  spec:  serviceAccountName: $KSA_NAME  containers:  - name: ray-worker  image: $VLLM_IMAGE  imagePullPolicy: IfNotPresent  resources:  limits:  cpu: "100"  google.com/tpu: "8"  ephemeral-storage: 40G  memory: 200G  requests:  cpu: "100"  google.com/tpu: "8"  ephemeral-storage: 40G  memory: 200G  env:  - name: JAX_PLATFORMS  value: "tpu"  - name: HUGGING_FACE_HUB_TOKEN  valueFrom:  secretKeyRef:  name: hf-secret  key: hf_api_token  - name: VLLM_XLA_CACHE_PATH  value: "/data"  volumeMounts:  - name: gcs-fuse-csi-ephemeral  mountPath: /data  - name: dshm  mountPath: /dev/shm  volumes:  - name: gke-gcsfuse-cache  emptyDir:  medium: Memory  - name: dshm  emptyDir:  medium: Memory  - name: gcs-fuse-csi-ephemeral  csi:  driver: gcsfuse.csi.storage.gke.io  volumeAttributes:  bucketName: $GSBUCKET  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"  nodeSelector:  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f - 

      The envsubst command replaces the environment variables in the manifest.

      GKE creates a RayService with a workergroup containing a TPU v5e single-host in a 2x4 topology.

    Llama 3.1 70B

    1. Inspect the ray-service.tpu-v6e-singlehost.yaml manifest:

      apiVersion: ray.io/v1 kind: RayService metadata:  name: vllm-tpu spec:  serveConfigV2: |  applications:  - name: llm  import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model  deployments:  - name: VLLMDeployment  num_replicas: 1  runtime_env:  working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"  env_vars:  MODEL_ID: "$MODEL_ID"  MAX_MODEL_LEN: "$MAX_MODEL_LEN"  DTYPE: "$DTYPE"  TOKENIZER_MODE: "$TOKENIZER_MODE"  TPU_CHIPS: "8"  rayClusterConfig:  headGroupSpec:  rayStartParams: {}  template:  metadata:  annotations:  gke-gcsfuse/volumes: "true"  gke-gcsfuse/cpu-limit: "0"  gke-gcsfuse/memory-limit: "0"  gke-gcsfuse/ephemeral-storage-limit: "0"  spec:  serviceAccountName: $KSA_NAME  containers:  - name: ray-head  image: $VLLM_IMAGE  imagePullPolicy: IfNotPresent  ports:  - containerPort: 6379  name: gcs  - containerPort: 8265  name: dashboard  - containerPort: 10001  name: client  - containerPort: 8000  name: serve  env:  - name: HUGGING_FACE_HUB_TOKEN  valueFrom:  secretKeyRef:  name: hf-secret  key: hf_api_token  - name: VLLM_XLA_CACHE_PATH  value: "/data"  resources:  limits:  cpu: "2"  memory: 8G  requests:  cpu: "2"  memory: 8G  volumeMounts:  - name: gcs-fuse-csi-ephemeral  mountPath: /data  - name: dshm  mountPath: /dev/shm  volumes:  - name: gke-gcsfuse-cache  emptyDir:  medium: Memory  - name: dshm  emptyDir:  medium: Memory  - name: gcs-fuse-csi-ephemeral  csi:  driver: gcsfuse.csi.storage.gke.io  volumeAttributes:  bucketName: $GSBUCKET  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"  workerGroupSpecs:  - groupName: tpu-group  replicas: 1  minReplicas: 1  maxReplicas: 1  numOfHosts: 1  rayStartParams: {}  template:  metadata:  annotations:  gke-gcsfuse/volumes: "true"  gke-gcsfuse/cpu-limit: "0"  gke-gcsfuse/memory-limit: "0"  gke-gcsfuse/ephemeral-storage-limit: "0"  spec:  serviceAccountName: $KSA_NAME  containers:  - name: ray-worker  image: $VLLM_IMAGE  imagePullPolicy: IfNotPresent  resources:  limits:  cpu: "100"  google.com/tpu: "8"  ephemeral-storage: 40G  memory: 200G  requests:  cpu: "100"  google.com/tpu: "8"  ephemeral-storage: 40G  memory: 200G  env:  - name: JAX_PLATFORMS  value: "tpu"  - name: HUGGING_FACE_HUB_TOKEN  valueFrom:  secretKeyRef:  name: hf-secret  key: hf_api_token  - name: VLLM_XLA_CACHE_PATH  value: "/data"  volumeMounts:  - name: gcs-fuse-csi-ephemeral  mountPath: /data  - name: dshm  mountPath: /dev/shm  volumes:  - name: gke-gcsfuse-cache  emptyDir:  medium: Memory  - name: dshm  emptyDir:  medium: Memory  - name: gcs-fuse-csi-ephemeral  csi:  driver: gcsfuse.csi.storage.gke.io  volumeAttributes:  bucketName: $GSBUCKET  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"  nodeSelector:  cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < tpu/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f - 

      The envsubst command replaces the environment variables in the manifest.

    GKE creates the RayService custom resource, along with an underlying RayCluster custom resource on which the Ray Serve application is deployed.

  3. Verify the status of the RayService resource:

    kubectl --namespace ${NAMESPACE} get rayservices/vllm-tpu 

    Wait for the Service status to change to Running:

    NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
    vllm-tpu   Running          1
  4. Retrieve the name of the RayCluster head service:

    SERVICE_NAME=$(kubectl --namespace=${NAMESPACE} get rayservices/vllm-tpu \
        --template={{.status.activeServiceStatus.rayClusterStatus.head.serviceName}})
  5. Establish port-forwarding sessions to the Ray head to view the Ray dashboard:

    pkill -f "kubectl .* port-forward .* 8265:8265" kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null & 
  6. View the Ray Dashboard.

  7. Serve the model.

  8. Clean up the RayService resource:

    kubectl --namespace ${NAMESPACE} delete rayservice/vllm-tpu 

Compose multiple models with model composition

Model composition is a technique for composing multiple models into a single application.

In this section, you use a GKE cluster to compose two models, Llama 3 8B IT and Gemma 7B IT, into a single application:

  • The first model is the assistant model that answers questions asked in the prompt.
  • The second model is the summarizer model. The output of the assistant model is chained into the input of the summarizer model. The final result is the summarized version of the response from the assistant model.
  1. Get access to the Gemma model by completing the following steps:

    1. Sign in to the Kaggle platform, sign the license consent agreement, and get a Kaggle API token. In this tutorial, you use a Kubernetes Secret for the Kaggle credentials.
    2. Access the model consent page on Kaggle.com.
    3. Sign in to Kaggle, if you haven't done so already.
    4. Click Request Access.
    5. In the Choose Account for Consent section, select Verify via Kaggle Account to use your Kaggle account for granting consent.
    6. Accept the model Terms and Conditions.
  2. Set up your environment:

    export ASSIST_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct
    export SUMMARIZER_MODEL_ID=google/gemma-7b-it
  3. For Standard clusters, create an additional single-host TPU slice node pool:

    gcloud container node-pools create tpu-2 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=MACHINE_TYPE \
        --num-nodes=1

    Replace MACHINE_TYPE with one of the following machine types:

    • ct5lp-hightpu-8t to provision TPU v5e.
    • ct6e-standard-8t to provision TPU v6e.

    Autopilot clusters automatically provision the required nodes.

  4. Deploy the RayService resource based on the TPU version that you want to use:

    TPU v5e

    1. Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

      apiVersion: ray.io/v1 kind: RayService metadata:  name: vllm-tpu spec:  serveConfigV2: |  applications:  - name: llm  route_prefix: /  import_path: ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model  deployments:  - name: MultiModelDeployment  num_replicas: 1  runtime_env:  working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"  env_vars:  ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"  SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"  TPU_CHIPS: "16"  TPU_HEADS: "2"  rayClusterConfig:  headGroupSpec:  rayStartParams: {}  template:  metadata:  annotations:  gke-gcsfuse/volumes: "true"  gke-gcsfuse/cpu-limit: "0"  gke-gcsfuse/memory-limit: "0"  gke-gcsfuse/ephemeral-storage-limit: "0"  spec:  serviceAccountName: $KSA_NAME  containers:  - name: ray-head  image: $VLLM_IMAGE  resources:  limits:  cpu: "2"  memory: 8G  requests:  cpu: "2"  memory: 8G  ports:  - containerPort: 6379  name: gcs-server  - containerPort: 8265  name: dashboard  - containerPort: 10001  name: client  - containerPort: 8000  name: serve  env:  - name: HUGGING_FACE_HUB_TOKEN  valueFrom:  secretKeyRef:  name: hf-secret  key: hf_api_token  - name: VLLM_XLA_CACHE_PATH  value: "/data"  volumeMounts:  - name: gcs-fuse-csi-ephemeral  mountPath: /data  - name: dshm  mountPath: /dev/shm  volumes:  - name: gke-gcsfuse-cache  emptyDir:  medium: Memory  - name: dshm  emptyDir:  medium: Memory  - name: gcs-fuse-csi-ephemeral  csi:  driver: gcsfuse.csi.storage.gke.io  volumeAttributes:  bucketName: $GSBUCKET  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"  workerGroupSpecs:  - replicas: 2  minReplicas: 1  maxReplicas: 2  numOfHosts: 1  groupName: tpu-group  rayStartParams: {}  template:  metadata:  annotations:  gke-gcsfuse/volumes: "true"  gke-gcsfuse/cpu-limit: "0"  gke-gcsfuse/memory-limit: "0"  gke-gcsfuse/ephemeral-storage-limit: "0"  spec:  serviceAccountName: $KSA_NAME  containers:  - name: llm  image: $VLLM_IMAGE  env:  - name: HUGGING_FACE_HUB_TOKEN  valueFrom:  secretKeyRef:  name: hf-secret  key: hf_api_token  - name: VLLM_XLA_CACHE_PATH  value: "/data"  resources:  limits:  cpu: "100"  google.com/tpu: "8"  ephemeral-storage: 40G  memory: 200G  requests:  cpu: "100"  google.com/tpu: "8"  ephemeral-storage: 40G  memory: 200G  volumeMounts:  - name: gcs-fuse-csi-ephemeral  mountPath: /data  - name: dshm  mountPath: /dev/shm  volumes:  - name: gke-gcsfuse-cache  emptyDir:  medium: Memory  - name: dshm  emptyDir:  medium: Memory  - name: gcs-fuse-csi-ephemeral  csi:  driver: gcsfuse.csi.storage.gke.io  volumeAttributes:  bucketName: $GSBUCKET  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"  nodeSelector:  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < model-composition/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f - 

    TPU v6e

    1. Inspect the ray-service.tpu-v6e-singlehost.yaml manifest:

      apiVersion: ray.io/v1 kind: RayService metadata:  name: vllm-tpu spec:  serveConfigV2: |  applications:  - name: llm  route_prefix: /  import_path: ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model  deployments:  - name: MultiModelDeployment  num_replicas: 1  runtime_env:  working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"  env_vars:  ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"  SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"  TPU_CHIPS: "16"  TPU_HEADS: "2"  rayClusterConfig:  headGroupSpec:  rayStartParams: {}  template:  metadata:  annotations:  gke-gcsfuse/volumes: "true"  gke-gcsfuse/cpu-limit: "0"  gke-gcsfuse/memory-limit: "0"  gke-gcsfuse/ephemeral-storage-limit: "0"  spec:  serviceAccountName: $KSA_NAME  containers:  - name: ray-head  image: $VLLM_IMAGE  resources:  limits:  cpu: "2"  memory: 8G  requests:  cpu: "2"  memory: 8G  ports:  - containerPort: 6379  name: gcs-server  - containerPort: 8265  name: dashboard  - containerPort: 10001  name: client  - containerPort: 8000  name: serve  env:  - name: HUGGING_FACE_HUB_TOKEN  valueFrom:  secretKeyRef:  name: hf-secret  key: hf_api_token  - name: VLLM_XLA_CACHE_PATH  value: "/data"  volumeMounts:  - name: gcs-fuse-csi-ephemeral  mountPath: /data  - name: dshm  mountPath: /dev/shm  volumes:  - name: gke-gcsfuse-cache  emptyDir:  medium: Memory  - name: dshm  emptyDir:  medium: Memory  - name: gcs-fuse-csi-ephemeral  csi:  driver: gcsfuse.csi.storage.gke.io  volumeAttributes:  bucketName: $GSBUCKET  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"  workerGroupSpecs:  - replicas: 2  minReplicas: 1  maxReplicas: 2  numOfHosts: 1  groupName: tpu-group  rayStartParams: {}  template:  metadata:  annotations:  gke-gcsfuse/volumes: "true"  gke-gcsfuse/cpu-limit: "0"  gke-gcsfuse/memory-limit: "0"  gke-gcsfuse/ephemeral-storage-limit: "0"  spec:  serviceAccountName: $KSA_NAME  containers:  - name: llm  image: $VLLM_IMAGE  env:  - name: HUGGING_FACE_HUB_TOKEN  valueFrom:  secretKeyRef:  name: hf-secret  key: hf_api_token  - name: VLLM_XLA_CACHE_PATH  value: "/data"  resources:  limits:  cpu: "100"  google.com/tpu: "8"  ephemeral-storage: 40G  memory: 200G  requests:  cpu: "100"  google.com/tpu: "8"  ephemeral-storage: 40G  memory: 200G  volumeMounts:  - name: gcs-fuse-csi-ephemeral  mountPath: /data  - name: dshm  mountPath: /dev/shm  volumes:  - name: gke-gcsfuse-cache  emptyDir:  medium: Memory  - name: dshm  emptyDir:  medium: Memory  - name: gcs-fuse-csi-ephemeral  csi:  driver: gcsfuse.csi.storage.gke.io  volumeAttributes:  bucketName: $GSBUCKET  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"  nodeSelector:  cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < model-composition/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f - 
  5. Wait for the status of the RayService resource to change to Running:

    kubectl --namespace ${NAMESPACE} get rayservice/vllm-tpu 

    The output is similar to the following:

    NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
    vllm-tpu   Running          2

    In this output, the Running status indicates that the RayService resource is ready.

  6. Confirm that GKE created the Service for the Ray Serve application:

    kubectl --namespace ${NAMESPACE} get service/vllm-tpu-serve-svc 

    The output is similar to the following:

    NAME                 TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)    AGE
    vllm-tpu-serve-svc   ClusterIP   ###.###.###.###   <none>        8000/TCP   ###
  7. Establish port-forwarding sessions to the Ray head:

    pkill -f "kubectl .* port-forward .* 8265:8265" pkill -f "kubectl .* port-forward .* 8000:8000" kubectl --namespace ${NAMESPACE} port-forward service/vllm-tpu-serve-svc 8265:8265 2>&1 >/dev/null & kubectl --namespace ${NAMESPACE} port-forward service/vllm-tpu-serve-svc 8000:8000 2>&1 >/dev/null & 
  8. Send a request to the model:

    curl -X POST http://localhost:8000/ \
        -H "Content-Type: application/json" \
        -d '{"prompt": "What is the most popular programming language for machine learning and why?", "max_tokens": 1000}'

    The output is similar to the following:

     {"text": [" used in various data science projects, including building machine learning models, preprocessing data, and visualizing results.\n\nSure, here is a single sentence summarizing the text:\n\nPython is the most popular programming language for machine learning and is widely used in data science projects, encompassing model building, data preprocessing, and visualization."]} 

Build and deploy the TPU image

This tutorial uses hosted TPU images from vLLM. vLLM provides a Dockerfile.tpu file that builds vLLM on top of the required PyTorch XLA base image, which includes the TPU dependencies. However, you can also build and deploy your own TPU image for finer-grained control over the contents of your Docker image.

  1. Create a Docker repository to store the container images for this guide:

    gcloud artifacts repositories create vllm-tpu --repository-format=docker --location=${COMPUTE_REGION} && \
    gcloud auth configure-docker ${COMPUTE_REGION}-docker.pkg.dev
  2. Clone the vLLM repository:

    git clone https://github.com/vllm-project/vllm.git
    cd vllm
  3. Build the image:

    docker build -f ./docker/Dockerfile.tpu . -t vllm-tpu 
  4. Tag the TPU image with your Artifact Registry name:

    export VLLM_IMAGE=${COMPUTE_REGION}-docker.pkg.dev/${PROJECT_ID}/vllm-tpu/vllm-tpu:TAG
    docker tag vllm-tpu ${VLLM_IMAGE}

    Replace TAG with the name of the tag that you want to define. If you don't specify a tag, Docker applies the default latest tag.

  5. Push the image to Artifact Registry:

    docker push ${VLLM_IMAGE} 
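
    Optionally, confirm that the image is available in your repository:

    gcloud artifacts docker images list ${COMPUTE_REGION}-docker.pkg.dev/${PROJECT_ID}/vllm-tpu

    Because the VLLM_IMAGE variable now points at this image, the manifests in this tutorial use your custom image when you re-apply them.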

Delete the individual resources

If you used an existing project and you don't want to delete it, you can delete the individual resources.

  1. Delete the RayCluster custom resource:

    kubectl --namespace ${NAMESPACE} delete rayclusters vllm-tpu 
  2. Delete the Cloud Storage bucket:

    gcloud storage rm -r gs://${GSBUCKET} 
  3. Delete the Artifact Registry repository:

    gcloud artifacts repositories delete vllm-tpu \
        --location=${COMPUTE_REGION}
  4. Delete the cluster:

    gcloud container clusters delete ${CLUSTER_NAME} \
        --location=LOCATION

    Replace LOCATION with any of the following environment variables:

    • For Autopilot clusters, use COMPUTE_REGION.
    • For Standard clusters, use COMPUTE_ZONE.

Delete the project

If you deployed the tutorial in a new Google Cloud project, and if you no longer need the project, then delete it by completing the following steps:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next