How-To Configure a ClusterServingRuntime v1.3

Prerequisite: Access to the Hybrid Manager UI with AI Factory enabled. See /edb-postgres-ai/1.3/hybrid-manager/ai-factory/.

This guide explains how to configure a ClusterServingRuntime in KServe. A ClusterServingRuntime defines the environment used to serve your AI models — specifying container image, resource settings, environment variables, and supported model formats.

For Hybrid Manager users, configuring runtimes is a core step toward enabling Model Serving — see Model Serving in Hybrid Manager.

Goal

Configure a ClusterServingRuntime so it can be used by InferenceServices to deploy models.

Estimated time

5–10 minutes.

What you will accomplish

  • Define a ClusterServingRuntime YAML manifest.
  • Apply it to your Kubernetes cluster.
  • Enable reusable serving configuration for one or more models.

What this unlocks

  • Supports consistent deployment of models using a standard runtime definition.
  • Allows for centralized control over serving images and resource profiles.
  • Required step for deploying NVIDIA NIM containers with KServe.

Prerequisites

  • Kubernetes cluster with KServe installed.
  • Access to container image registry with the desired model server image.
  • NVIDIA GPU node pool configured (if using GPU-based models).
  • (If required) Kubernetes secret configured for API keys (e.g., build.nvidia.com); see the example commands after this list.
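
If you need to create those secrets, commands like the following work. The secret names (nvidia-nim-secrets for the NGC API key, edb-cred for pulling the image) and the default namespace are chosen here to match the example manifest in step 1; the angle-bracket values are placeholders you replace with your own:

kubectl create secret generic nvidia-nim-secrets \
  --from-literal=NGC_API_KEY=<your-ngc-api-key> \
  -n default

kubectl create secret docker-registry edb-cred \
  --docker-server=<registry-server> \
  --docker-username=<registry-username> \
  --docker-password=<registry-password> \
  -n default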

For background concepts, see:

Steps

1. Create ClusterServingRuntime YAML

Create a file named ClusterServingRuntime.yaml.

Example:

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: nvidia-nim-llama-3.1-8b-instruct-1.3.3
  namespace: default
spec:
  containers:
    - env:
        - name: NIM_CACHE_PATH
          value: /tmp
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: nvidia-nim-secrets
              key: NGC_API_KEY
      image: upmdev.azurecr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
      name: kserve-container
      ports:
        - containerPort: 8000
          protocol: TCP
      resources:
        limits:
          cpu: "12"
          memory: 64Gi
        requests:
          cpu: "12"
          memory: 64Gi
      volumeMounts:
        - mountPath: /dev/shm
          name: dshm
  imagePullSecrets:
    - name: edb-cred
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: nvidia-nim-llama-3.1-8b-instruct
      priority: 1
      version: "1.3.3"
  volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 16Gi
      name: dshm

Key fields explained:

  • containers.image: The model server container (e.g., NVIDIA NIM image).
  • resources: CPU, memory, and GPU requirements (the example above requests CPU and memory only; a GPU sketch follows this list).
  • NGC_API_KEY: Secret reference for NVIDIA models.
  • supportedModelFormats: The model format names and versions this runtime can serve; an InferenceService's modelFormat references one of these names to select the runtime.
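
The example manifest requests only CPU and memory. If you are serving a GPU-based model (per the GPU node pool prerequisite), the resources block would typically also declare GPUs; a minimal sketch, assuming one NVIDIA GPU per replica:

resources:
  limits:
    cpu: "12"
    memory: 64Gi
    nvidia.com/gpu: "1"
  requests:
    cpu: "12"
    memory: 64Gi
    nvidia.com/gpu: "1"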

2. Apply the ClusterServingRuntime

Run:

kubectl apply -f ClusterServingRuntime.yaml
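
On success, kubectl confirms the created resource, along the lines of:

clusterservingruntime.serving.kserve.io/nvidia-nim-llama-3.1-8b-instruct-1.3.3 created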

3. Verify the deployed ClusterServingRuntime

Run:

kubectl get ClusterServingRuntime
Output:

NAME                                     AGE
nvidia-nim-llama-3.1-8b-instruct-1.3.3   1m

You can inspect full details with:

kubectl get ClusterServingRuntime <name> -o yaml
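
To quickly confirm which model formats the runtime advertises, a jsonpath query also works (using the runtime name from step 1):

kubectl get ClusterServingRuntime nvidia-nim-llama-3.1-8b-instruct-1.3.3 \
  -o jsonpath='{.spec.supportedModelFormats[*].name}'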

4. Reference runtime in InferenceService

When you create your InferenceService, reference this runtime:

runtime: nvidia-nim-llama-3.1-8b-instruct-1.3.3
modelFormat:
  name: nvidia-nim-llama-3.1-8b-instruct
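
In a full InferenceService manifest, these fields sit under spec.predictor.model. A minimal sketch, using an illustrative service name (the complete NIM walkthrough is in the guide linked below):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-1-8b-instruct
  namespace: default
spec:
  predictor:
    model:
      modelFormat:
        name: nvidia-nim-llama-3.1-8b-instruct
      runtime: nvidia-nim-llama-3.1-8b-instruct-1.3.3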

See Deploy an NVIDIA NIM container with KServe.

Notes

  • Runtimes are reusable — you can deploy multiple models referencing the same ClusterServingRuntime.
  • Use meaningful names and version fields in supportedModelFormats for traceability.
  • You can update a runtime by editing and re-applying the YAML.

Next steps