KServe Support on NIM Operator#

About KServe Support#

The NIM Operator supports both raw deployment and serverless deployment of NIM through KServe on Kubernetes clusters, including Red Hat OpenShift Container Platform.

NIM Operator with KServe provides two additional benefits:

  1. Intelligent Caching with NIMCache to reduce initial inference time and autoscaling latency, resulting in faster and more responsive deployments.

  2. NeMo Microservices support to evaluate, guardrail, and enhance AI systems across key metrics such as latency, accuracy, cost, and compliance.

The Operator configures KServe deployments through the InferenceService resource, which manages deployment, upgrades, ingress, and autoscaling of NIM.

Diagram 1. NIM Operator and KServe interaction

KServe Deployment Modes Comparison#

| Category | KServe Serverless (with Knative) | Raw Deployment |
|---|---|---|
| Autoscaling | Scales pods automatically based on request load (speed and traffic); can scale all the way to zero when unused. Configured by providing Knative annotations in the NIMService spec.annotations field. | Uses Horizontal/Vertical Pod Autoscalers with custom NIM metrics; cannot scale to zero by default. Configured by providing the spec.scale.hpa config in the NIMService. |
| Upgrades | Every new model version creates a new revision; a portion of traffic can be routed to new versions for testing, and rollback is as simple as shifting traffic back. Managed automatically by KServe, with no special parameters required in the NIMService API. Refer to the Knative documentation on gradual rollouts and traffic management for configuring canary rollouts. | Uses rolling updates; rollback is manual if something goes wrong. Not configurable: only the RollingUpdate strategy of the native Kubernetes Deployment is supported. |
| Ingress | Uses the Knative gateway (such as Istio); gives each model revision a stable domain/URL, with built-in secure connections and a queue to handle overload. Managed automatically by KServe, with no special parameters added. The domain name must be configured during KServe installation (default example.com). | Exposed using a Kubernetes Service plus Ingress/LoadBalancer (NGINX, Istio, and so on); security (mTLS, certificates) must be set up manually. Managed automatically by KServe, with no special parameters added. The domain name must be configured during KServe installation (default example.com). |
| NIM Metrics | Metrics and tracing are built in through Knative, KServe, and the service mesh. No extra configuration in the NIMService is required. | Requires your own monitoring stack, for example Prometheus, Grafana, or OpenTelemetry. Enable a ServiceMonitor by setting the NIMService spec.metrics.enabled field to true and providing the spec.metrics.serviceMonitor config. |
| GPU Resources | Passed using the spec.resources field. | Passed using the spec.resources field. |
| Multi-Node | Not supported. | Not supported. |
| Dynamic Resource Allocation (DRA) | Not supported. | Not supported. |
| Tolerations | Passed through spec.tolerations in NIMService; feature gates must be enabled in the Knative config. | Passed through spec.tolerations in NIMService. |
| RuntimeClassName | Passed through spec.runtimeClassName in NIMService; feature gates must be enabled in the Knative config. | Passed through spec.runtimeClassName in NIMService. |
| NodeSelectors | Passed through spec.nodeSelectors in NIMService; feature gates must be enabled in the Knative config. | Passed through spec.nodeSelectors in NIMService. |
| Custom Scheduler Name | Passed through spec.schedulerName in NIMService; feature gates must be enabled in the Knative config. | Passed through spec.schedulerName in NIMService. |
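As a concrete illustration of the two autoscaling models in the table, the following abridged NIMService fragments sketch each configuration. The metadata name is hypothetical and the field values are illustrative; the full manifests appear in the deployment examples later on this page.

# Serverless: Knative autoscaling, expressed as annotations.
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: my-nim          # hypothetical name
spec:
  inferencePlatform: kserve
  annotations:
    autoscaling.knative.dev/target: "10"     # target in-flight requests per pod
    autoscaling.knative.dev/min-scale: "1"   # keep one pod warm (no scale to zero)
---
# Raw deployment: HPA-based autoscaling, expressed as spec.scale.
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: my-nim          # hypothetical name
spec:
  inferencePlatform: kserve
  annotations:
    serving.kserve.io/deploymentMode: 'RawDeployment'
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 3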

Select Your Deployment Environment and Type#

  • Raw Deployment on Standard Kubernetes: #raw-deployment-example-on-standard-kubernetes

  • Raw Deployment on Red Hat OpenShift: #raw-deployment-example-on-red-hat-openshift

  • Serverless Deployment on Standard Kubernetes: #serverless-deployment-example-on-standard-kubernetes

Note

This documentation uses kubectl for Kubernetes examples and oc for OpenShift examples. Both tools provide similar functionality for their respective platforms.

Raw Deployment Example on Standard Kubernetes#

Summary#

For raw deployment of NIM through KServe on a standard Kubernetes installation, follow these steps:

  1. Install KServe.

  2. Optional: Create NIM Cache.

  3. Deploy NIM through KServe.

1. Install KServe in Raw Deployment Mode#

Note

For security, consider downloading and reviewing the script before execution in production environments.

Run the following command to execute the KServe quick install script:

$ curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.15/hack/quick_install.sh" | bash -s - -r 

For more information, refer to Getting Started with KServe.
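In line with the security note above, one way to review the script before running it is to download it first and then execute it locally; this is equivalent to the piped command (the -r flag selects raw deployment mode):

$ curl -sO "https://raw.githubusercontent.com/kserve/kserve/release-0.15/hack/quick_install.sh"
$ less quick_install.sh   # review the script contents
$ bash quick_install.sh -r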

The following components are deployed by the KServe quick install script:


KServe

  • Installs KServe CRDs, such as InferenceService.

  • Installs KServe controller in the kserve namespace.

  • Provides model serving, autoscaling, and inference management on Kubernetes.

Gateway API CRDs

  • Installs the Gateway API CustomResourceDefinitions: GatewayClass, Gateway, GRPCRoute, HTTPRoute, ReferenceGrant.

  • Provides modern networking primitives for routing traffic into services.

Istio (Service Mesh)

Deployed into the istio-system namespace with Helm charts:

  • istio-base: core Istio CRDs and cluster-scoped resources.

  • istiod: Istio control plane (pilot, configuration, service discovery).

  • istio-ingressgateway: data plane ingress gateway for external traffic.

Cert-Manager

  • Installed in the cert-manager namespace.

  • Handles TLS certificate provisioning and management.

  • Required for automatic HTTPS and secure communication.

Verify the installation by checking each component:

  1. KServe

    $ kubectl get pods -n kserve 
    Example output
    NAME                                         READY   STATUS    RESTARTS   AGE
    kserve-controller-manager-85768d7b78-lrmss   2/2     Running   0          2m56s
  2. Istio System

    $ kubectl get pods -n istio-system 
    Example output
    NAME                                    READY   STATUS    RESTARTS   AGE
    istio-ingressgateway-74949b4866-hk5nf   1/1     Running   0          3m50s
    istiod-77bbc7c8bb-qkwdt                 1/1     Running   0          4m1s
  3. Cert Manager

    $ kubectl get pods -n cert-manager 
    Example output
    NAME                                       READY   STATUS    RESTARTS   AGE
    cert-manager-74b7f6cbbc-7mrrw              1/1     Running   0          4m18s
    cert-manager-cainjector-58c9d76cb8-t8bxb   1/1     Running   0          4m18s
    cert-manager-webhook-5875b545cf-dhhfb      1/1     Running   0          4m18s

Note

To uninstall KServe, follow the instructions in Uninstalling KServe.

2. Optional: Create NIM Cache#

Note

Refer to prerequisites for more information on using NIM Cache.

NGC model source:

  1. Create a file, such as nimcache.yaml, with contents like the following sample manifest:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      name: meta-llama-3-2-1b-instruct
      namespace: nim-service
    spec:
      source:
        ngc:
          modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
          pullSecret: ngc-secret
          authSecret: ngc-api-secret
          model:
            engine: "tensorrt"
            tensorParallelism: "1"
      storage:
        pvc:
          create: true
          storageClass:
          size: "50Gi"
          volumeAccessMode: ReadWriteOnce
  2. Apply the manifest:

    $ kubectl apply -n nim-service -f nimcache.yaml 
Hugging Face model source:

  1. Create a file, such as nimcache.yaml, with contents like the following sample manifest:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      name: nim-cache-multi-llm
      namespace: nim-service
    spec:
      source:
        hf:
          endpoint: "https://huggingface.co"
          namespace: "nvidia"
          authSecret: hf-api-secret
          modelPuller: nvcr.io/nim/nvidia/llm-nim:1.12
          pullSecret: ngc-secret
          modelName: "Llama-3.1-Nemotron-Nano-8B-v1"
      storage:
        pvc:
          create: true
          storageClass: ''
          size: "50Gi"
          volumeAccessMode: ReadWriteOnce
  2. Apply the manifest:

    $ kubectl apply -n nim-service -f nimcache.yaml 
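After applying a NIMCache manifest, you can optionally wait for caching to complete before deploying the NIMService. A quick check, assuming the NGC example name above (the NIMCache status should report a state of Ready when caching completes):

$ kubectl get nimcache -n nim-service
$ kubectl get nimcache -n nim-service meta-llama-3-2-1b-instruct -o jsonpath='{.status.state}'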

3. Deploy NIM through KServe as a Raw Deployment#

Single-model NIM, using the NIM cache created above:

  1. Create a file, such as nimservice.yaml, with contents like the following sample manifest:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: meta-llama-3-2-1b-instruct
      namespace: nim-service
    spec:
      inferencePlatform: kserve
      annotations:
        serving.kserve.io/deploymentMode: 'RawDeployment'
      scale:
        enabled: true
        hpa:
          minReplicas: 1
          maxReplicas: 3
          metrics:
          - type: "Resource"
            resource:
              name: "cpu"
              target:
                type: "Utilization"
                averageUtilization: 80
      image:
        repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
        tag: "1.12.0"
        pullPolicy: IfNotPresent
        pullSecrets:
        - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: meta-llama-3-2-1b-instruct
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
          cpu: "12"
          memory: 32Gi
        requests:
          nvidia.com/gpu: 1
          cpu: "4"
          memory: 6Gi
      expose:
        service:
          type: ClusterIP
          port: 8000
  2. Apply the manifest for raw deployment:

    $ kubectl create -f nimservice.yaml -n nim-service 
  3. Verify that the inference service has been created:

    1. View inference service details:

      $ kubectl get inferenceservice -n nim-service -o yaml 
      Example output
      apiVersion: v1
      items:
      - apiVersion: serving.kserve.io/v1beta1
        kind: InferenceService
        metadata:
          annotations:
            nvidia.com/last-applied-hash: 757475558b
            nvidia.com/parent-spec-hash: b978f49f7
            openshift.io/required-scc: nonroot
            serving.kserve.io/autoscalerClass: hpa
            serving.kserve.io/deploymentMode: RawDeployment
            serving.kserve.io/enable-metric-aggregation: "true"
            serving.kserve.io/enable-prometheus-scraping: "true"
            temp: temp-1
          creationTimestamp: "2025-07-25T14:58:06Z"
          finalizers:
          - inferenceservice.finalizers
          generation: 1
          labels:
            app.kubernetes.io/instance: meta-llama-3-2-1b-instruct
            app.kubernetes.io/managed-by: k8s-nim-operator
            app.kubernetes.io/name: meta-llama-3-2-1b-instruct
            app.kubernetes.io/operator-version: ""
            app.kubernetes.io/part-of: nim-service
            networking.kserve.io/visibility: cluster-local
            temp2: temp-2
          name: meta-llama-3-2-1b-instruct
          namespace: nim-service
          ownerReferences:
          - apiVersion: apps.nvidia.com/v1alpha1
            blockOwnerDeletion: true
            controller: true
            kind: NIMService
            name: meta-llama-3-2-1b-instruct
            uid: aaa7e95a-81e4-404a-a1de-6dce898f9937
          resourceVersion: "43983337"
          uid: c92a7aa9-4954-4959-86b3-ad8aa6c39ca8
        spec:
          predictor:
            annotations:
              openshift.io/required-scc: nonroot
              serving.kserve.io/deploymentMode: RawDeployment
              temp: temp-1
            containers:
            - env:
              - name: MY_ENV
                value: my-value
              - name: NIM_CACHE_PATH
                value: /model-store
              - name: NGC_API_KEY
                valueFrom:
                  secretKeyRef:
                    key: NGC_API_KEY
                    name: ngc-api-secret
              - name: OUTLINES_CACHE_DIR
                value: /tmp/outlines
              - name: NIM_SERVER_PORT
                value: "8000"
              - name: NIM_HTTP_API_PORT
                value: "8000"
              - name: NIM_JSONL_LOGGING
                value: "1"
              - name: NIM_LOG_LEVEL
                value: INFO
              image: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.8
              imagePullPolicy: IfNotPresent
              livenessProbe:
                failureThreshold: 3
                httpGet:
                  path: /v1/health/live
                  port: api
                initialDelaySeconds: 15
                periodSeconds: 10
                successThreshold: 1
                timeoutSeconds: 1
              name: kserve-container
              ports:
              - containerPort: 8000
                name: api
                protocol: TCP
              readinessProbe:
                failureThreshold: 3
                httpGet:
                  path: /v1/health/ready
                  port: api
                initialDelaySeconds: 15
                periodSeconds: 10
                successThreshold: 1
                timeoutSeconds: 1
              resources:
                limits:
                  cpu: "12"
                  memory: 32Gi
                  nvidia.com/gpu: "1"
                requests:
                  cpu: "12"
                  memory: 32Gi
                  nvidia.com/gpu: "1"
              startupProbe:
                failureThreshold: 30
                httpGet:
                  path: /v1/health/ready
                  port: api
                initialDelaySeconds: 30
                periodSeconds: 10
                successThreshold: 1
                timeoutSeconds: 1
              volumeMounts:
              - mountPath: /model-store
                name: model-store
              - mountPath: /dev/shm
                name: dshm
            deploymentStrategy:
              rollingUpdate:
                maxSurge: 0
                maxUnavailable: 25%
              type: RollingUpdate
            imagePullSecrets:
            - name: ngc-secret
            labels:
              app: meta-llama-3-2-1b-instruct
            maxReplicas: 3
            minReplicas: 1
            scaleMetric: cpu
            scaleMetricType: Utilization
            scaleTarget: 80
            securityContext:
              fsGroup: 2500
              runAsGroup: 2500
              runAsUser: 2500
            serviceAccountName: meta-llama-3-2-1b-instruct
            volumes:
            - emptyDir:
                medium: Memory
              name: dshm
            - name: model-store
              persistentVolumeClaim:
                claimName: meta-llama-3-2-1b-instruct-pvc
        status:
          address:
            url: http://meta-llama-3-2-1b-instruct-predictor.nim-service.svc.cluster.local
          components:
            predictor:
              url: http://meta-llama-3-2-1b-instruct-predictor-nim-service.example.com
          conditions:
          - lastTransitionTime: "2025-07-25T14:58:06Z"
            status: "True"
            type: IngressReady
          - lastTransitionTime: "2025-07-25T14:58:06Z"
            status: "True"
            type: PredictorReady
          - lastTransitionTime: "2025-07-25T14:58:06Z"
            status: "True"
            type: Ready
          deploymentMode: RawDeployment
          modelStatus:
            copies:
              failedCopies: 0
              totalCopies: 1
            states:
              activeModelState: Loaded
              targetModelState: Loaded
              transitionStatus: UpToDate
          observedGeneration: 1
          url: http://meta-llama-3-2-1b-instruct-nim-service.example.com
      kind: List
      metadata:
        resourceVersion: ""
    2. View inference service status:

      $ kubectl get inferenceservice -n nim-service 
      Example output
      NAME                         URL                                                          READY   AGE
      meta-llama-3-2-1b-instruct   http://meta-llama-3-2-1b-instruct-nim-service.example.com   True    100s
    3. View NIM Service status:

      $ kubectl get nimservice -n nim-service -o json | jq .items[0].status 
      Example output
      {  "conditions": [  {  "lastTransitionTime": "2025-07-25T15:02:09Z",  "message": "",  "reason": "Ready",  "status": "True",  "type": "Ready"  },  {  "lastTransitionTime": "2025-07-25T14:58:06Z",  "message": "",  "reason": "Ready",  "status": "False",  "type": "Failed"  }  ],  "model": {  "clusterEndpoint": "http://meta-llama-3-2-1b-instruct-predictor.nim-service.svc.cluster.local",  "externalEndpoint": "http://meta-llama-3-2-1b-instruct-nim-service.example.com",  "name": "meta/llama-3.2-1b-instruct"  },  "state": "Ready" } 
  4. Verify that the HPA has been created:

    $ kubectl get hpa -n nim-service meta-llama-3-2-1b-instruct-predictor -o yaml 
Multi-LLM NIM:

  1. Create a file, such as nimservice.yaml, with contents like the following sample manifest:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: nim-service-multi-llm
      namespace: nim-service
    spec:
      inferencePlatform: kserve
      annotations:
        serving.kserve.io/deploymentMode: 'RawDeployment'
      image:
        repository: nvcr.io/nim/nvidia/llm-nim
        tag: "1.12"
        pullPolicy: IfNotPresent
        pullSecrets:
        - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: nim-cache-multi-llm
          profile: 'tensorrt_llm'
      resources:
        limits:
          nvidia.com/gpu: 1
          cpu: "12"
          memory: 32Gi
        requests:
          nvidia.com/gpu: 1
          cpu: "4"
          memory: 6Gi
      expose:
        service:
          type: ClusterIP
          port: 8000
  2. Apply the manifest for raw deployment:

    $ kubectl create -f nimservice.yaml -n nim-service 
  3. Verify that the inference service has been created:

    $ kubectl get inferenceservice -n nim-service nim-service-multi-llm -o yaml 
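To send a test request from inside the cluster, you can use the cluster-local predictor endpoint reported in the NIMService status. This sketch mirrors the OpenShift inference example later on this page and assumes the single-model example above:

$ kubectl -n nim-service run curltest --rm -i --image=curlimages/curl --restart=Never -- \
    curl -s http://meta-llama-3-2-1b-instruct-predictor.nim-service.svc.cluster.local/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model":"meta/llama-3.2-1b-instruct","messages":[{"role":"user","content":"Hello!"}]}'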

Raw Deployment Example on Red Hat OpenShift#

Summary#

For raw deployment of NIM through KServe using Red Hat OpenShift, follow these steps:

  1. Install KServe using OpenShift interface.

  2. Create NIM Cache using OpenShift.

  3. Deploy NIM through KServe using OpenShift.

1. Install KServe in Raw Deployment Mode Using OpenShift#

  1. Install the OpenShift AI Operator by following the instructions on the Red Hat website for installing the single-model serving platform.

    Figure 1. OpenShift web console

    Figure 2. Interface to install OpenShift AI Operator

    Follow the steps for standard deployment (OpenShift’s term for raw deployment mode). Select the following settings:

    • Update channel: stable

    • Version: 2.22.1

    • Operator recommended namespace: redhat-ods-operator

    Note

    There is also an advanced mode, which is equivalent to serverless; however, the NIM Operator does not support it in this release.

  2. Create an instance of Data Science Cluster (DSC):

    Figure 3. Interface to create instance of DSC

    Use the following YAML to create the DSC:

    apiVersion: datasciencecluster.opendatahub.io/v1
    kind: DataScienceCluster
    metadata:
      labels:
        app.kubernetes.io/created-by: rhods-operator
        app.kubernetes.io/instance: default-dsc
        app.kubernetes.io/managed-by: kustomize
        app.kubernetes.io/name: datasciencecluster
        app.kubernetes.io/part-of: rhods-operator
      name: default-dsc
    spec:
      components:
        codeflare:
          managementState: Managed
        kserve:
          defaultDeploymentMode: RawDeployment
          managementState: Managed
          nim:
            managementState: Managed
          rawDeploymentServiceConfig: Headed
          serving:
            ingressGateway:
              certificate:
                type: OpenshiftDefaultIngress
            managementState: Removed
            name: knative-serving
        modelregistry:
          registriesNamespace: rhoai-model-registries
        feastoperator: {}
        trustyai: {}
        ray: {}
        kueue: {}
        workbenches:
          workbenchNamespace: rhods-notebooks
        dashboard: {}
        modelmeshserving: {}
        llamastackoperator: {}
        datasciencepipelines: {}
        trainingoperator: {}
  3. Create an instance of DSCInitialization (DSI):

    Figure 4. Interface to create instance of DSI

    Use the following YAML to create the DSI:

    apiVersion: dscinitialization.opendatahub.io/v1
    kind: DSCInitialization
    metadata:
      name: default-dsci
    spec:
      applicationsNamespace: redhat-ods-applications
      serviceMesh:
        controlPlane:
          metricsCollection: Istio
          name: data-science-smcp
          namespace: istio-system
        managementState: Removed
  4. Verify that the KServe controller is running: in the OpenShift web console, go to Workloads > Pods and select the redhat-ods-applications project.
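Alternatively, a sketch of verifying from the CLI, assuming the default resource names used above:

$ oc get pods -n redhat-ods-applications | grep kserve
$ oc get datasciencecluster default-dsc
$ oc get dscinitialization default-dsci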

2. Create NIM Cache Using OpenShift#

Note

Refer to prerequisites for more information on using NIM Cache.

  1. Create a file, such as nimcache.yaml, with contents like the following example:

    # NIM Cache for OpenShift
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      name: meta-llama3-1b-instruct
      namespace: nim-service
    spec:
      source:
        ngc:
          authSecret: ngc-api-secret
          model:
            engine: tensorrt_llm
            tensorParallelism: '1'
          modelPuller: 'nvcr.io/nim/meta/llama-3.2-1b-instruct:1.8.3'
          pullSecret: ngc-secret
      storage:
        pvc:
          create: true
          size: 50Gi
          volumeAccessMode: ReadWriteOnce
  2. Apply the manifest:

    $ oc apply -n nim-service -f nimcache.yaml 
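As in the standard Kubernetes example, you can optionally wait for the cache to become ready before deploying; this check assumes the manifest name above and that the NIMCache status reports a state of Ready when caching completes:

$ oc get nimcache -n nim-service meta-llama3-1b-instruct -o jsonpath='{.status.state}'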

3. Deploy NIM Through KServe as a Raw Deployment Using OpenShift#

  1. Create a file, such as nimservice.yaml, with contents like the following example:

    # NIM Service Raw Deployment Using OpenShift
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: meta-llama-3-2-1b-instruct
      namespace: nim-service
    spec:
      annotations:
        serving.kserve.io/deploymentMode: RawDeployment
      expose:
        service:
          port: 8000
          type: ClusterIP
      scale:
        enabled: true
        hpa:
          maxReplicas: 3
          metrics:
          - resource:
              name: cpu
              target:
                averageUtilization: 80
                type: Utilization
            type: Resource
          minReplicas: 1
      inferencePlatform: kserve
      authSecret: ngc-api-secret
      image:
        pullPolicy: IfNotPresent
        pullSecrets:
        - ngc-secret
        repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
        tag: 1.8.3
      storage:
        nimCache:
          name: meta-llama3-1b-instruct
      resources:
        limits:
          nvidia.com/gpu: 1
          cpu: "12"
          memory: 32Gi
        requests:
          nvidia.com/gpu: 1
          cpu: "4"
          memory: 6Gi
      replicas: 1
  2. Apply the manifest for raw deployment:

    $ oc create -f nimservice.yaml -n nim-service 
  3. Verify that the inference service has been created:

    1. View inference service details:

      $ oc get inferenceservice -n nim-service -o json | jq .items[0].status 
      Example output
      {  "address": {  "url": "http://meta-llama-3-2-1b-instruct-predictor.nim-service.svc.cluster.local"  },  "components": {  "predictor": {}  },  "conditions": [  {  "lastTransitionTime": "2025-08-15T05:59:00Z",  "status": "True",  "type": "IngressReady"  },  {  "lastTransitionTime": "2025-08-15T05:59:00Z",  "status": "True",  "type": "PredictorReady"  },  {  "lastTransitionTime": "2025-08-15T05:59:00Z",  "status": "True",  "type": "Ready"  },  {  "lastTransitionTime": "2025-08-15T05:58:59Z",  "severity": "Info",  "status": "False",  "type": "Stopped"  }  ],  "deploymentMode": "RawDeployment",  "modelStatus": {  "copies": {  "failedCopies": 0,  "totalCopies": 1  },  "states": {  "activeModelState": "Loaded",  "targetModelState": "Loaded"  },  "transitionStatus": "UpToDate"  },  "observedGeneration": 1,  "url": "http://meta-llama-3-2-1b-instruct-predictor.nim-service.svc.cluster.local" } 
    2. View NIM Service status:

      $ oc get nimservice -n nim-service meta-llama-3-2-1b-instruct -o json | jq .status 
      Example output
      {  "conditions": [  {  "lastTransitionTime": "2025-08-15T05:59:50Z",  "message": "",  "reason": "Ready",  "status": "True",  "type": "Ready"  },  {  "lastTransitionTime": "2025-08-15T05:58:59Z",  "message": "",  "reason": "Ready",  "status": "False",  "type": "Failed"  }  ],  "model": {  "clusterEndpoint": "http://meta-llama-3-2-1b-instruct-predictor.nim-service.svc.cluster.local",  "externalEndpoint": "http://meta-llama-3-2-1b-instruct-predictor.nim-service.svc.cluster.local",  "name": "meta/llama-3.2-1b-instruct"  },  "state": "Ready" } 
  4. Run inference:

    $ oc -n nim-service run curltest --rm -i --image=curlimages/curl --restart=Never -- \
        curl -s http://meta-llama-3-2-1b-instruct-predictor.nim-service.svc.cluster.local/v1/chat/completions \
        -H 'Content-Type: application/json' \
        -d '{"model":"meta/llama-3.2-1b-instruct","messages":[{"role":"user","content":"Hello!"}]}'
    Example output
    {"id":"chat-2d3c821536514598b931783033a7e7e7","object":"chat.completion","created":1755239534,"model":"meta/llama-3.2-1b-instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! How can I help you today?"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":12,"total_tokens":21,"completion_tokens":9},"prompt_logprobs":null} 

Serverless Deployment Example on Standard Kubernetes#

Summary#

For serverless deployment of NIM through KServe on a standard Kubernetes installation, follow these steps:

  1. Install KServe in Serverless Mode.

  2. Enable PVC support.

  3. Optional: Create NIM Cache with Serverless Deployment.

  4. Deploy NIM through KServe as a Serverless Deployment.

1. Install KServe in Serverless Mode#

Note

For security, consider downloading and reviewing the script before execution in production environments.

Run the following command to execute the KServe quick install script:

$ curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.15/hack/quick_install.sh" | bash 

For more information, refer to Getting Started with KServe.

The following components are deployed by the KServe quick install script:


KServe

  • Installs KServe CRDs, such as InferenceService.

  • Installs KServe controller in the kserve namespace.

  • Provides model serving, autoscaling, and inference management on Kubernetes.

Gateway API CRDs

  • Installs the Gateway API CustomResourceDefinitions: GatewayClass, Gateway, GRPCRoute, HTTPRoute, ReferenceGrant.

  • Provides modern networking primitives for routing traffic into services.

Istio (Service Mesh)

Deployed into the istio-system namespace with Helm charts:

  • istio-base: core Istio CRDs and cluster-scoped resources.

  • istiod: Istio control plane (pilot, configuration, service discovery).

  • istio-ingressgateway: data plane ingress gateway for external traffic.

Cert-Manager

  • Installed in the cert-manager namespace.

  • Handles TLS certificate provisioning and management.

  • Required for automatic HTTPS and secure communication.

Knative

  • Installs the Knative Operator in knative-serving namespace.

  • Deploys a KnativeServing custom resource.

  • Provides serverless deployment and scaling primitives used by KServe.
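As with the raw-deployment install, you can spot-check each component after the script completes; in serverless mode, also confirm that the pods in the knative-serving namespace are running (namespaces as listed above):

$ kubectl get pods -n kserve
$ kubectl get pods -n istio-system
$ kubectl get pods -n cert-manager
$ kubectl get pods -n knative-serving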

Note

To uninstall KServe, follow the instructions in Uninstalling KServe.

2. Enable PVC Support#

Run the following command to enable persistent volume support:

$ kubectl patch --namespace knative-serving configmap/config-features \
    --type merge \
    --patch '{"data":{"kubernetes.podspec-persistent-volume-claim": "enabled", "kubernetes.podspec-persistent-volume-write": "enabled"}}'

These extensions enable persistent volume support:

  • kubernetes.podspec-persistent-volume-claim: Enables persistent volumes (PVs) with Knative Serving

  • kubernetes.podspec-persistent-volume-write: Provides write access to those PVs
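To confirm that the patch took effect, one way is to read the two flags back from the ConfigMap; both should print enabled:

$ kubectl get configmap config-features -n knative-serving \
    -o jsonpath='{.data.kubernetes\.podspec-persistent-volume-claim}{"\n"}{.data.kubernetes\.podspec-persistent-volume-write}{"\n"}'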

3. Optional: Create NIM Cache With Serverless Deployment#

Note

Refer to prerequisites for more information on using NIM Cache.

  1. Create a file, such as nimcache.yaml, with contents like the following sample manifest:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      labels:
        app.kubernetes.io/name: k8s-nim-operator
      name: meta-llama-3-2-1b-instruct
      namespace: nim-service
    spec:
      source:
        ngc:
          modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
          pullSecret: ngc-secret
          authSecret: ngc-api-secret
          model:
            engine: "tensorrt"
            tensorParallelism: "1"
      storage:
        pvc:
          create: true
          size: "50Gi"
          volumeAccessMode: ReadWriteOnce
  2. Apply the manifest:

    $ kubectl apply -n nim-service -f nimcache.yaml 

4. Deploy NIM through KServe as a Serverless Deployment#

Note

Autoscaling and ingress are handled directly by KServe, not through NIMService configuration.

  1. Create a file, such as nimservice.yaml, with contents like the following sample manifest:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: meta-llama-3-2-1b-instruct
      namespace: nim-service
    spec:
      inferencePlatform: kserve
      annotations:
        # Knative concurrency-based autoscaling (default).
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/metric: concurrency
        # Target 10 requests in-flight per pod.
        autoscaling.knative.dev/target: "10"
        # Disable scale to zero with a min scale of 1.
        autoscaling.knative.dev/min-scale: "1"
        # Limit scaling to 10 pods.
        autoscaling.knative.dev/max-scale: "10"
      image:
        repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
        tag: "1.12.0"
        pullPolicy: IfNotPresent
        pullSecrets:
        - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: meta-llama-3-2-1b-instruct
          profile: ''
      resources:
        limits:
          nvidia.com/gpu: 1
          cpu: "12"
          memory: 32Gi
        requests:
          nvidia.com/gpu: 1
          cpu: "4"
          memory: 6Gi
      replicas: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
  2. Apply the manifest for serverless deployment:

    $ kubectl create -f nimservice.yaml -n nim-service 
  3. Verify that the inference service has been created:

    $ kubectl get inferenceservice -n nim-service meta-llama-3-2-1b-instruct -o yaml
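Because serverless mode runs on Knative, you can also inspect the underlying Knative Service and its revisions; the exact resource names that KServe generates may differ from this sketch:

$ kubectl get ksvc -n nim-service
$ kubectl get revisions -n nim-service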

Uninstalling KServe#

If you installed KServe using the quick install script (https://raw.githubusercontent.com/kserve/kserve/release-0.15/hack/quick_install.sh), use the following commands to uninstall it:

helm uninstall --ignore-not-found istio-ingressgateway -n istio-system
helm uninstall --ignore-not-found istiod -n istio-system
helm uninstall --ignore-not-found istio-base -n istio-system
echo "😀 Successfully uninstalled Istio"
helm uninstall --ignore-not-found cert-manager -n cert-manager
echo "😀 Successfully uninstalled Cert Manager"
helm uninstall --ignore-not-found keda -n keda
echo "😀 Successfully uninstalled KEDA"
kubectl delete --ignore-not-found=true KnativeServing knative-serving -n knative-serving --wait=True --timeout=300s || true
helm uninstall --ignore-not-found knative-operator -n knative-serving
echo "😀 Successfully uninstalled Knative"
helm uninstall --ignore-not-found kserve -n kserve
helm uninstall --ignore-not-found kserve-crd -n kserve
echo "😀 Successfully uninstalled KServe"
kubectl delete --ignore-not-found=true namespace istio-system
kubectl delete --ignore-not-found=true namespace cert-manager
kubectl delete --ignore-not-found=true namespace kserve