Managing NIM Services#

About NIM Services#

A NIM service is a Kubernetes custom resource, nimservices.apps.nvidia.com. The NIMService spec is the same for non-LLM NIM, LLM-specific NIM, and Multi-LLM NIM, except for the repository name, as shown in the following examples. You create and delete NIM service resources to manage NVIDIA NIM microservices.
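
For example, after the NIM Operator and its CRDs are installed, you can inspect the custom resource definition and list existing NIM services with standard kubectl commands (the -A flag lists all namespaces):

$ kubectl get crd nimservices.apps.nvidia.com
$ kubectl get nimservices.apps.nvidia.com -A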

Refer to the sample manifest below that matches your scenario:

# NIMService for Non-LLM
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: nv-rerankqa-mistral-4b-v3
spec:
  image:
    repository: nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3
    tag: 1.0.2
    pullPolicy: IfNotPresent
    pullSecrets:
    - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: nv-rerankqa-mistral-4b-v3
      profile: ''
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000

# NIMService for LLM-specific
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3.1-8b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model: # Include the model object to describe the LLM-specific model you want to pull from NGC
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    nimCache:
      name: llama-3.1-8b-instruct
      profile: ''
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000

# NIMService for Multi-LLM NGC NIMCache
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: ngc-nim-service-multi
spec:
  image:
    repository: nvcr.io/nim/nvidia/llm-nim
    tag: "1.11.0"
    pullPolicy: IfNotPresent
    pullSecrets:
    - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: ngc-nim-cache-multi
      profile: tensorrt_llm
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000

Refer to the following list of commonly modified fields. Each entry shows the field, a description, and its default value.

spec.annotations
    Specifies user-supplied annotations to add to the pod.
    Default: None

spec.authSecret (required)
    Specifies the name of a generic secret that contains NGC_API_KEY. Learn more about image pull secrets.
    Default: None

spec.draResources
    Specifies the list of DRA resource claims to use for the NIMService deployment or leader worker set (see the illustrative sketch after this list).
    Refer to Dynamic Resource Allocation (DRA) Support for NIM for more information.
    Default: None

spec.draResources.claimCreationSpec
    Specifies the spec used to auto-generate a DRA resource claim template.
    Default: None

spec.draResources.claimCreationSpec.devices.attributeSelectors
    Defines the criteria that must be satisfied by the device attributes of a device.
    Default: None

spec.draResources.claimCreationSpec.devices.attributeSelectors.key
    Specifies the name of the device attribute. This is either a qualified name or a simple name. If it is a simple name, it is assumed to be prefixed with the DRA driver name.
    For example, "gpu.nvidia.com/productName" is equivalent to "productName" if the driver name is "gpu.nvidia.com". Otherwise, they are treated as two different attributes.
    Default: None

spec.draResources.claimCreationSpec.devices.attributeSelectors.op
    Specifies the operator to use for comparing the device attribute value. Supported operators are:
    • Equal: The device attribute value must be equal to the value specified in the selector.
    • NotEqual: The device attribute value must not be equal to the value specified in the selector.
    • GreaterThan: The device attribute value must be greater than the value specified in the selector.
    • GreaterThanOrEqual: The device attribute value must be greater than or equal to the value specified in the selector.
    • LessThan: The device attribute value must be less than the value specified in the selector.
    • LessThanOrEqual: The device attribute value must be less than or equal to the value specified in the selector.
    Default: Equal

spec.draResources.claimCreationSpec.devices.attributeSelectors.value
    Specifies the value to compare against the device attribute.
    Default: None

spec.draResources.claimCreationSpec.devices.capacitySelectors
    Defines the criteria that must be satisfied by the device capacity of a device.
    Default: None

spec.draResources.claimCreationSpec.devices.capacitySelectors.key
    Specifies the name of the resource. This is either a qualified name or a simple name. If it is a simple name, it is assumed to be prefixed with the DRA driver name.
    For example, "gpu.nvidia.com/memory" is equivalent to "memory" if the driver name is "gpu.nvidia.com". Otherwise, they are treated as two different resources.
    Default: None

spec.draResources.claimCreationSpec.devices.capacitySelectors.op
    Specifies the operator to use for comparing against the device capacity. Supported operators are:
    • Equal: The resource quantity must be equal to the value specified in the selector.
    • NotEqual: The resource quantity must not be equal to the value specified in the selector.
    • GreaterThan: The resource quantity must be greater than the value specified in the selector.
    • GreaterThanOrEqual: The resource quantity must be greater than or equal to the value specified in the selector.
    • LessThan: The resource quantity must be less than the value specified in the selector.
    • LessThanOrEqual: The resource quantity must be less than or equal to the value specified in the selector.
    Default: Equal

spec.draResources.claimCreationSpec.devices.capacitySelectors.value
    Specifies the resource quantity to compare against.
    Default: None

spec.draResources.claimCreationSpec.devices.celExpressions
    Specifies a list of CEL expressions that must be satisfied by the DRA device.
    Default: None

spec.draResources.claimCreationSpec.devices.count
    Specifies the number of devices to request.
    Default: 1

spec.draResources.claimCreationSpec.devices.deviceClassName
    Specifies the DeviceClass to inherit configuration and selectors from.
    Default: gpu.nvidia.com

spec.draResources.claimCreationSpec.devices.driverName
    Specifies the name of the DRA driver providing the capacity information. Must be a DNS subdomain.
    Default: gpu.nvidia.com

spec.draResources.claimCreationSpec.devices.name
    Specifies the name of the device request to use in the generated claim spec. Must be a valid DNS_LABEL.
    Default: None

spec.draResources.requests
    Specifies the list of requests in the referenced ResourceClaim or ResourceClaimTemplate to make available to the model container of the NIMService pods.
    If empty, everything from the claim is made available; otherwise, only the result of this subset of requests.
    Default: None

spec.draResources.resourceClaimName
    Specifies the name of a ResourceClaim object in the same namespace as the NIMService.
    Exactly one of resourceClaimName and resourceClaimTemplateName must be set.
    Default: None

spec.draResources.resourceClaimTemplateName
    Specifies the name of a ResourceClaimTemplate object in the same namespace as the pods for this NIMService.
    The template is used to create a new ResourceClaim, which is bound to the pods created for this NIMService.
    Exactly one of resourceClaimName and resourceClaimTemplateName must be set.
    Default: None

spec.env
    Specifies environment variables to set in the NIM microservice container.
    Default: None

spec.expose.ingress.enabled
    When set to true, the Operator creates a Kubernetes Ingress resource for the NIM microservice. Specify the ingress specification in the spec.expose.ingress.spec field.
    If you have an ingress controller, a specification like the following sample configures an ingress for the v1/chat/completions endpoint.

    ingress:
      enabled: true
      spec:
        ingressClassName: nginx
        rules:
        - host: demo.nvidia.example.com
          http:
            paths:
            - backend:
                service:
                  name: meta-llama3-8b-instruct
                  port:
                    number: 8000
              path: /v1/chat/completions
              pathType: Prefix

    Default: false

spec.expose.service.grpcPort
    Specifies the Triton Inference Server gRPC service port number. Use this port only for non-LLM NIM microservices that run a Triton gRPC Inference Server.
    Default: None

spec.expose.service.metricsPort
    Specifies the Triton Inference Server metrics port number for a non-LLM NIM microservice. Use this port only for non-LLM NIM microservices that run a separate Triton Inference Server metrics endpoint.
    Default: None

spec.expose.service.port (required)
    Specifies the network port number for the NIM microservice. The most commonly used value is 8000.
    Default: None

spec.expose.service.type
    Specifies the Kubernetes service type to create for the NIM microservice.
    Default: ClusterIP

spec.groupID
    Specifies the group for the pods. This value is used to set the security context of the pod in the runAsGroup and fsGroup fields.
    Default: 2000

spec.image
    Specifies the repository, tag, pull policy, and pull secret for the container image.
    Default: None

spec.labels
    Specifies user-supplied labels to add to the pod.
    Default: None

spec.metrics.enabled
    When set to true, the Operator configures a Prometheus service monitor for the service. Specify the service monitor specification in the spec.metrics.serviceMonitor field.
    Default: false

spec.multiNode
    Defines the configuration for a multi-node NIMService.
    Refer to Multi-Node NIM Deployment for more information.
    Default: None

spec.multiNode.backendType
    Specifies the backend type for the multi-node NIMService. Currently, only LWS is supported.
    Default: lws

spec.multiNode.gpusPerPod
    Specifies the number of GPUs for each instance. In most cases, this should match resources.limits.nvidia.com/gpu.
    Default: 1

spec.multiNode.mpi
    Specifies the MPI configuration for a NIMService that uses LeaderWorkerSet.
    Default: None

spec.multiNode.size
    Specifies the number of pods to create for the multi-node NIMService.
    Default: 1

spec.nodeSelector
    Specifies node selector labels to schedule the service.
    Default: None

spec.proxy.certConfigMap
    Specifies the name of the ConfigMap with CA certificates for your proxy server.
    Default: None

spec.proxy.httpProxy
    Specifies the address of a proxy server to use for outbound HTTP requests.
    Default: None

spec.proxy.httpsProxy
    Specifies the address of a proxy server to use for outbound HTTPS requests.
    Default: None

spec.proxy.noProxy
    Specifies a comma-separated list of domain names, IP addresses, or IP ranges for which proxying is bypassed.
    Default: None

spec.resources
    Specifies the resource requirements for the pods.
    Default: None

spec.replicas
    Specifies the desired number of pods in the replica set for the NIM microservice.
    Default: 1

spec.runtimeClassName
    Specifies the container runtime class name to use for running NIM with NVIDIA GPUs allocated. If not set, the default nvidia runtime class, created by the NVIDIA GPU Operator, is assigned automatically.
    Default: None

spec.scale.enabled
    When set to true, the Operator creates a Kubernetes horizontal pod autoscaler for the NIM microservice. Specify the HPA specification in the spec.scale.hpa field.
    The spec.scale.hpa field supports the following subfields: minReplicas, maxReplicas, metrics, and behavior. These fields correspond to the same fields in a horizontal pod autoscaler resource specification.
    Default: false

spec.storage.nimCache
    Specifies the name of the NIM cache that has the cached model profiles for the NIM microservice. Specify values for the name subfield and, optionally, the profile subfield. This field has precedence over the spec.storage.pvc field. Refer to Displaying Cached Model Profiles for details on viewing the available model profile names in a NIM cache. Supported profiles for Multi-LLM NIM are trtllm, sglang, and vllm.
    Default: None

spec.storage.pvc
    If you did not create a NIM cache resource to download and cache your model, you can specify this field to download model profiles. This field has the following subfields: annotations, create, name, size, storageClass, volumeAccessMode, and subPath.
    To have the Operator create a PVC for the model profiles, specify pvc.create: true. Refer to Example: Create a PVC Instead of Using a NIM Cache.
    Default: None

spec.storage.readOnly
    When set to true, the Operator mounts the PVC from either the pvc or nimCache specification as read-only.
    Default: false

spec.storage.sharedMemorySizeLimit
    Specifies the maximum size of the shared memory volume (emptyDir) that NIM uses for fast model runtime read and write operations. If not specified, the NIM Operator creates an emptyDir with no limit.
    Default: None

spec.schedulerName
    Specifies a custom scheduler to use for NIM deployments. If no scheduler is specified, your configured default scheduler is used. This can be the Kubernetes default-scheduler or a custom default scheduler, for example, if you have configured Run:ai as the default scheduler for the NIM Operator namespace.
    Default: Kubernetes default scheduler

spec.tolerations
    Specifies the tolerations for the pods.
    Default: None

spec.userID
    Specifies the user ID for the pod. This value is used to set the security context of the pod in the runAsUser field.
    Default: 1000
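
The DRA fields are easiest to read in context. The following snippet is an illustrative sketch only: the claim template name gpu-claim-template and the request name gpu are hypothetical and would reference a ResourceClaimTemplate that you create separately. Refer to Dynamic Resource Allocation (DRA) Support for NIM for authoritative examples.

spec:
  draResources:
  - resourceClaimTemplateName: gpu-claim-template  # hypothetical ResourceClaimTemplate in the same namespace
    requests:
    - gpu  # request name inside the claim; omit requests to expose everything from the claim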

Prerequisites#

  • Optional: Added a NIM cache resource for the NIM microservice. If you created a NIM cache resource, specify its name in the spec.storage.nimCache.name field. You can confirm that the cache is ready as shown after this list.

    If you prefer to have the service download a model directly to storage, refer to Example: Create a PVC Instead of Using a NIM Cache for a sample manifest.
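
If you created a NIM cache, you can confirm that it reports a ready status before creating the service, for example:

$ kubectl get nimcaches.apps.nvidia.com -n nim-service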

Procedure#

  1. Create a file, such as service.yaml, with contents like one of the following sample manifests:

    # NIMService for Non-LLM
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: nv-rerankqa-mistral-4b-v3
    spec:
      image:
        repository: nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3
        tag: 1.0.2
        pullPolicy: IfNotPresent
        pullSecrets:
        - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: nv-rerankqa-mistral-4b-v3
          profile: ''
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000

    # NIMService for LLM-specific
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: llama-3.1-8b-instruct
    spec:
      source:
        ngc:
          modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
          pullSecret: ngc-secret
          authSecret: ngc-api-secret
          model: # Include the model object to describe the LLM-specific model you want to pull from NGC
            engine: tensorrt_llm
            tensorParallelism: "1"
      storage:
        nimCache:
          name: llama-3.1-8b-instruct
          profile: ''
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000

    # NIMService for Multi-LLM NGC NIMCache
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: ngc-nim-service-multi
    spec:
      image:
        repository: nvcr.io/nim/nvidia/llm-nim
        tag: "1.11.0"
        pullPolicy: IfNotPresent
        pullSecrets:
        - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: ngc-nim-cache-multi
          profile: tensorrt_llm
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
  2. Apply the manifest:

    $ kubectl apply -n nim-service -f service.yaml 
  3. Optional: View information about the NIM services:

    $ kubectl describe nimservices.apps.nvidia.com -n nim-service 

    Partial Output

    ...
    Conditions:
      Last Transition Time:  2024-08-12T19:09:43Z
      Message:               Deployment is ready
      Reason:                Ready
      Status:                True
      Type:                  Ready
      Last Transition Time:  2024-08-12T19:09:43Z
      Message:
      Reason:                Ready
      Status:                False
      Type:                  Failed
    State:                   Ready
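
Because the custom resource reports a Ready condition, you can also wait for the service from a script. The following command is a sketch; replace <nim-service-name> with the name of your NIM service:

$ kubectl wait --for=condition=Ready -n nim-service \
    nimservices.apps.nvidia.com/<nim-service-name> --timeout=600s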

Verification#

  1. Start a pod that has access to the curl command. Substitute any pod that has the command and meets your organization’s security requirements:

    $ kubectl run --rm -it -n default curl --image=curlimages/curl:latest -- ash 

    After the pod starts, you are connected to the ash shell in the pod.

  2. Connect to the chat completions endpoint on the NIM for LLMs container.

    The command connects to the service in the nim-service namespace, meta-llama3-8b-instruct.nim-service, and specifies the model to use, meta/llama-3.1-8b-instruct. Replace these values if you use a different service name, namespace, or model. To find the model name, refer to Displaying Model Status.

    curl -X "POST" \  'http://meta-llama3-8b-instruct.nim-service:8000/v1/chat/completions' \  -H 'Accept: application/json' \  -H 'Content-Type: application/json' \  -d '{  "model": "meta/llama-3.1-8b-instruct",  "messages": [  {  "content":"What should I do for a 4 day vacation at Cape Hatteras National Seashore?",  "role": "user"  }],  "top_p": 1,  "n": 1,  "max_tokens": 1024,  "stream": false,  "frequency_penalty": 0.0,  "stop": ["STOP"]  }' 
  3. Press Ctrl+D to exit and delete the pod.

Displaying Model Status#

  1. View the .status.model field of the custom resource.

    Replace meta-llama3-8b-instruct with the name of your NIM service.

    $ kubectl get nimservice.apps.nvidia.com -n nim-service \
        meta-llama3-8b-instruct -o=jsonpath="{.status.model}" | jq .

    Example Output

    {
      "clusterEndpoint": "",
      "externalEndpoint": "",
      "name": "meta/llama-3.1-8b-instruct"
    }

Configuring Horizontal Pod Autoscaling#

Prerequisites#

  • Prometheus installed on your cluster. Refer to the Observability page for details on installing and configuring Prometheus for the NIM Operator.
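
If you do not already run Prometheus, one common approach is the community kube-prometheus-stack and prometheus-adapter Helm charts; the adapter exposes the custom metrics API that the horizontal pod autoscaler queries. This is a sketch only; follow the Observability page for the configuration recommended for the NIM Operator.

$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update
$ helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
    -n monitoring --create-namespace
$ helm install prometheus-adapter prometheus-community/prometheus-adapter -n monitoring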

Autoscaling NIM for LLMs#

NVIDIA NIM for LLMs provides several service metrics. Refer to Observability in the NVIDIA NIM for LLMs documentation for information about the metrics.

  1. Create a file, such as service-hpa.yaml, or update your NIMService manifest to include the spec.metrics and spec.scale configuration, as in the following example:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: meta-llama3-8b-instruct
    spec:
      image:
        repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
        tag: 1.3.3
        pullPolicy: IfNotPresent
        pullSecrets:
        - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: meta-llama3-8b-instruct
          profile: ''
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
      metrics:
        enabled: true
        serviceMonitor:
          additionalLabels:
            release: kube-prometheus-stack
      scale:
        enabled: true
        hpa:
          maxReplicas: 2
          minReplicas: 1
          metrics:
          - type: Object
            object:
              metric:
                name: gpu_cache_usage_perc
              describedObject:
                apiVersion: v1
                kind: Service
                name: meta-llama3-8b-instruct
              target:
                type: Value
                value: "0.5"
  2. Apply the manifest:

    $ kubectl apply -n nim-service -f service-hpa.yaml 
  3. Annotate the service resource related to NIM for LLMs:

    $ kubectl annotate -n nim-service svc meta-llama3-8b-instruct prometheus.io/scrape=true 

    Prometheus might require several minutes to begin collecting metrics from the service.

  4. Optional: Confirm Prometheus collects the metrics.

    • If you have access to the Prometheus dashboard, search for a service metric such as gpu_cache_usage_perc.

    • You can query Prometheus Adapter:

      $ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/nim-service/services/*/gpu_cache_usage_perc" | jq . 

      Example Output

      {
        "kind": "MetricValueList",
        "apiVersion": "custom.metrics.k8s.io/v1beta1",
        "metadata": {},
        "items": [
          {
            "describedObject": {
              "kind": "Service",
              "namespace": "nim-service",
              "name": "meta-llama3-8b-instruct",
              "apiVersion": "/v1"
            },
            "metricName": "gpu_cache_usage_perc",
            "timestamp": "2024-09-12T15:14:20Z",
            "value": "0",
            "selector": null
          }
        ]
      }
  5. Optional: Confirm the horizontal pod autoscaler resource is created:

    $ kubectl get hpa -n nim-service 

    Example Output

    NAME                      REFERENCE                            TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
    meta-llama3-8b-instruct   Deployment/meta-llama3-8b-instruct   0/500m    1         2         1          40s
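
To see why the autoscaler scales up or down, you can describe the horizontal pod autoscaler; the resource name matches the NIMService name:

$ kubectl describe hpa -n nim-service meta-llama3-8b-instruct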

Sample Manifests#

Example: Create a PVC Instead of Using a NIM Cache#

As an alternative to creating NIM cache resources to download and cache NIM model profiles, you can have the Operator create a PVC so that the NIM service downloads and runs a NIM model profile directly.

Create and apply a manifest like the following example:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
    - ngc-secret
  authSecret: ngc-api-secret
  replicas: 1
  storage:
    pvc:
      create: true
      storageClass: <storage-class-name>
      size: 10Gi
      volumeAccessMode: ReadWriteMany
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
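
After you apply the manifest, you can confirm that a PVC was created and that the pod starts once the model profile download completes:

$ kubectl get pvc -n nim-service
$ kubectl get pods -n nim-service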

Example: Air-Gapped Environment#

For air-gapped environments, you must download the model profiles for the NIM microservices from a host that has internet access. You must manually create a PVC and then transfer the model profile files into the PVC.
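
A minimal sketch of such a PVC follows; the access mode, storage class, and size are illustrative and must match what your cluster and model profiles require:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <existing-pvc-name>
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: <storage-class-name>
  resources:
    requests:
      storage: 50Gi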

Typically, the Operator determines the PVC name by dereferencing it from the NIM cache resource. When there is no NIM cache resource, such as an air-gapped environment, you must specify the PVC name.

Create and apply a manifest like the following example:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
    - ngc-secret
  authSecret: ngc-api-secret
  replicas: 1
  storage:
    pvc:
      name: <existing-pvc-name>
    readOnly: true
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
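
Before you apply the manifest above, populate the PVC with the model profiles that you downloaded on an internet-connected host. One approach, sketched below, uses a temporary pod that mounts the PVC together with kubectl cp; the pod name, image, and local path are illustrative, and you should adapt the destination path to the layout your NIM microservice expects:

apiVersion: v1
kind: Pod
metadata:
  name: pvc-loader
spec:
  containers:
  - name: loader
    image: busybox:1.36
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: model-store
      mountPath: /model-store
  volumes:
  - name: model-store
    persistentVolumeClaim:
      claimName: <existing-pvc-name>

Apply the pod, copy the profiles, and remove the pod when the copy completes:

$ kubectl apply -n nim-service -f pvc-loader.yaml
$ kubectl cp ./model-profiles nim-service/pvc-loader:/model-store
$ kubectl delete pod -n nim-service pvc-loader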

Deleting a NIM Service#

To delete a NIM service, perform the following steps.

  1. View the NIM services custom resources:

    $ kubectl get nimservices.apps.nvidia.com -A 

    Example Output

    NAMESPACE     NAME                      STATUS   AGE
    nim-service   meta-llama3-8b-instruct   Ready    2024-08-12T17:16:05Z
  2. Delete the custom resource:

    $ kubectl delete nimservice -n nim-service meta-llama3-8b-instruct 

    If the Operator created the PVC when you created the NIM cache, the Operator deletes the PVC and the cached model profiles. You can determine if the Operator created the PVC by running a command like the following example:

    $ kubectl get nimcaches.apps.nvidia.com -n nim-service \
        -o=jsonpath='{range .items[*]}{.metadata.name}: {.spec.storage.pvc.create}{"\n"}{end}'

    Example Output

    meta-llama3-8b-instruct: true
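
A PVC that you created yourself, such as in an air-gapped environment, is typically left in place when you delete the NIM service. If you no longer need the cached model profiles, you can delete it separately; replace <pvc-name> with your PVC name:

$ kubectl delete pvc -n nim-service <pvc-name>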