Stream Models into GPU Memory with the Run:ai Model Streamer

Accelerating model loads from object storage into GPU memory for vLLM inference is dramatically simpler with the Run:ai Model Streamer. The streamer reads multiple files in parallel from a Cloud Storage bucket and stages them through CPU memory into GPU memory. To use it from a Google Cloud server, all that's needed is passing in the appropriate authentication method and adding the flag --load-format=runai_streamer; if tensor-parallel-size is set >1, also add --model-loader-extra-config='{"distributed":true}' to enable distributed streaming mode.
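As a sketch, a launch across two GPUs with both flags might look like the following (the bucket path is a placeholder; replace it with your own):

vllm serve gs://[your bucket]/gemma-3-4b-it \
  --load-format=runai_streamer \
  --tensor-parallel-size=2 \
  --model-loader-extra-config='{"distributed":true}'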

The following steps will guide you through installing the latest version of vLLM and deploying a model that’s loaded from Google Cloud Storage with the model streamer:

Stream a Model from Google Cloud Storage with Run:ai Model Streamer

To use the model streamer, install vLLM with the Run:ai extension:

pip3 install vllm[runai]

Next, you will need a model in Cloud Storage. If needed, you can use the script below to transfer a model from a Hugging Face repo to a bucket you have access to:

Transfer a Model from Hugging Face to Google Cloud Storage (Optional)

  1. Make sure the CLI session you run the script from is logged in to Google Cloud with gcloud init and to Hugging Face with hf auth login. You may need to install these libraries:
pip3 install google-cloud-storage
pip3 install huggingface_hub
  2. Replace the variables at the top of the script below with your own values and save it as hf-2-gc.py:
import os

from huggingface_hub import snapshot_download
from google.cloud import storage

# Change the values below to match your Google Cloud and Hugging Face model locations
repo_id = "google/gemma-3-4b-it"
local_dir = "/tmp"  # local download directory; files under this path will be uploaded
gcs_bucket_name = ""
gcs_prefix = "gemma-3-4b-it"


def download_model_then_upload_model():
    # Ensure you're logged in to Hugging Face
    # You'll need to run huggingface-cli login first or set HF_TOKEN
    # Download model
    print(f"Downloading model {repo_id}...")
    snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        local_dir_use_symlinks=False,  # Important for full download
        resume_download=True,          # Resume interrupted downloads
        max_workers=10,                # Parallel downloads
    )
    upload_model()


def upload_model():
    # Upload model to GCS
    # Initialize Google Cloud Storage client
    print(gcs_bucket_name)
    storage_client = storage.Client()
    bucket = storage_client.bucket(gcs_bucket_name)
    print(f"Uploading to GCS bucket {gcs_bucket_name}...")
    for root, _, files in os.walk(local_dir):
        for file in files:
            local_path = os.path.join(root, file)
            relative_path = os.path.relpath(local_path, local_dir)
            blob_path = os.path.join(gcs_prefix, relative_path)
            blob = bucket.blob(blob_path)
            blob.upload_from_filename(local_path)
            print(f"Uploaded: {local_path} to {blob_path}")
    print("Model download and upload complete!")


download_model_then_upload_model()
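With the variables filled in, the script can be run directly from the same shell where you logged in to gcloud and Hugging Face:

python3 hf-2-gc.py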

Stream Your Model in the CLI

  1. Run the command below to start a vLLM inference server that loads a model from Google Cloud Storage with the Run:ai Model Streamer:
vllm serve gs://models-usc/gemma-3-4b-it --load-format=runai_streamer 
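Once the server reports it is ready, you can confirm the model is loaded by listing the served models on the OpenAI-compatible endpoint (this assumes the default port 8000):

curl http://localhost:8000/v1/models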

Stream Your Model in GKE

  1. If using GKE, you will need to enable Workload Identity on the cluster; a sketch of the gcloud command is shown below.
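The sketch below enables Workload Identity on an existing regional cluster; the cluster name, region, and project ID are placeholders you will need to replace:

export CLUSTER_NAME="[cluster name]"
export REGION="[region]"
export PROJECT_ID="[project ID]"

gcloud container clusters update $CLUSTER_NAME \
  --region=$REGION \
  --workload-pool=$PROJECT_ID.svc.id.goog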

  2. Next, you will need to create a Kubernetes service account in the cluster:

export SERVICE_ACCOUNT="[service account name]" 

kubectl create serviceaccount $SERVICE_ACCOUNT

  3. Grant the storage.bucketViewer and storage.objectUser IAM roles on the bucket to this service account:
export BUCKET="[bucket name]"
export PROJECT_NUMBER="[project number]"
export PROJECT_ID="[project ID]"

gcloud storage buckets add-iam-policy-binding gs://$BUCKET \
  --member principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/default/sa/$SERVICE_ACCOUNT \
  --role roles/storage.bucketViewer

gcloud storage buckets add-iam-policy-binding gs://$BUCKET \
  --member principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/default/sa/$SERVICE_ACCOUNT \
  --role roles/storage.objectUser
  4. You can use the example deployment spec below as a guide to launch the Model Streamer in vLLM on GKE using Workload Identity. You will need to replace serviceAccountName with the value in your SERVICE_ACCOUNT variable. Note: if using a tensor-parallel-size >1, add the flag --model-loader-extra-config='{"distributed":true}' to enable distributed streaming:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-streamer
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-streamer
  template:
    metadata:
      labels:
        app: vllm-streamer
    spec:
      serviceAccountName: gcs-access
      containers:
      - name: inference-server
        image: vllm/vllm-openai:nightly
        command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        args:
        - --model=gs://models-usc/gemma-3-4b-it
        - --load-format=runai_streamer
        - --disable-log-requests
        - --max-num-batched-tokens=512
        - --max-num-seqs=128
        - --max-model-len=2048
        - --tensor-parallel-size=1
        ports:
        - containerPort: 8000
          name: metrics
        readinessProbe:
          failureThreshold: 600
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 10
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
        volumeMounts:
        # Mount the shared-memory volume declared in volumes below
        - mountPath: /dev/shm
          name: dshm
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      volumes:
      - emptyDir:
          medium: Memory
        name: dshm
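To try the deployment, you can apply the spec, port-forward to the server, and query the OpenAI-compatible endpoint. The filename vllm-streamer.yaml is just an assumed name for the spec above:

kubectl apply -f vllm-streamer.yaml
kubectl port-forward deployment/vllm-streamer 8000:8000
curl http://localhost:8000/v1/models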

Further Information