
GPU Max 1100 doesn't support querying the available free memory #2158

@sharvil10

Description

Describe the bug
We have a K8s cluster with 2 nodes, each with 8 GPU Max 1100s. We installed GPU plugin v0.34.0 to expose the GPUs to the cluster. However, when we run the following command in a pod, it fails with the error below:

python -c "import torch; print(torch.xpu.mem_get_info(torch.xpu.current_device())[1])"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/vllm/lib64/python3.12/site-packages/torch/xpu/memory.py", line 194, in mem_get_info
    return torch._C._xpu_getMemoryInfo(device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The device (Intel(R) Data Center GPU Max 1100) doesn't support querying the available free memory. You can file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize its implementation.
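As an interim workaround, a guarded query can fall back to the device's total memory. A minimal sketch (not from the report), assuming torch.xpu.mem_get_info returns a (free, total) tuple like its CUDA counterpart and that torch.xpu.get_device_properties exposes total_memory:

import torch

def xpu_memory_info(device=None):
    # Return (free_bytes, total_bytes); free_bytes is None when the
    # runtime cannot query free memory, as in the failing pod.
    if device is None:
        device = torch.xpu.current_device()
    try:
        return torch.xpu.mem_get_info(device)
    except RuntimeError:
        # Fall back to the static device property; total memory is still
        # reported even when the free-memory query is unsupported.
        total = torch.xpu.get_device_properties(device).total_memory
        return None, total

print(xpu_memory_info())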

However, the same command works in a container started directly (outside Kubernetes) on the same node. The pod definition and the container command are attached below.

sudo nerdctl run -it \
  --name xpu \
  -e VLLM_LOGGING_LEVEL=DEBUG \
  -e http_proxy=http://proxy-dmz.intel.com:912 \
  -e https_proxy=http://proxy-dmz.intel.com:912 \
  -e no_proxy="localhost,127.0.0.1,.maas,10.0.0.0/8,172.16.0.0/16,192.168.0.0/16,134.134.0.0/16,.maas-internal,.svc" \
  --device /dev/dri \
  --entrypoint="bash" \
  --privileged \
  ghcr.io/llm-d/llm-d-xpu:v0.3.0
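For debugging, it may help to compare which DRM nodes each environment actually sees. A small diagnostic sketch (not from the report, path assumed) that can be run in both the pod and the nerdctl container:

import os

# List the DRM device nodes visible to this process; a difference between
# the pod and the privileged container would point at the device plugin's
# mounts rather than at PyTorch itself.
for name in sorted(os.listdir("/dev/dri")):
    path = os.path.join("/dev/dri", name)
    st = os.stat(path)
    print(f"{path}: major={os.major(st.st_rdev)} minor={os.minor(st.st_rdev)} "
          f"rw={os.access(path, os.R_OK | os.W_OK)}")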

To Reproduce

apiVersion: v1
kind: Pod
metadata:
  name: xpu
  namespace: default
spec:
  containers:
  - name: xpu
    image: ghcr.io/llm-d/llm-d-xpu:v0.3.0
    imagePullPolicy: Always
    command:
    - bash
    stdin: true
    tty: true
    env:
    - name: VLLM_LOGGING_LEVEL
      value: DEBUG
    - name: http_proxy
      value: http://proxy-dmz.intel.com:912
    - name: https_proxy
      value: http://proxy-dmz.intel.com:912
    - name: no_proxy
      value: "localhost,127.0.0.1,.maas,10.0.0.0/8,172.16.0.0/16,192.168.0.0/16,134.134.0.0/16,.maas-internal,.svc"
    resources:
      limits:
        gpu.intel.com/i915: "8"
      requests:
        gpu.intel.com/i915: "8"
  restartPolicy: Never

The same pod works when privileged: true is set and the gpu.intel.com/i915 requests and limits are removed.
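One possible culprit, an assumption not confirmed in this issue: free-memory queries on Intel GPUs go through the Level Zero Sysman API, which is commonly enabled via the ZES_ENABLE_SYSMAN=1 environment variable. If the non-privileged pod loses that path, setting the variable before the driver initializes is worth testing. A speculative sketch:

import os

# Speculative: ZES_ENABLE_SYSMAN must be set before the Level Zero driver
# initializes, i.e. before torch is imported, for sysman-backed queries
# such as mem_get_info to have a chance of working.
os.environ.setdefault("ZES_ENABLE_SYSMAN", "1")

import torch  # driver initialization happens during/after this import

dev = torch.xpu.current_device()
print(torch.xpu.mem_get_info(dev))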

Expected behavior
The free-memory query should work in a pod that receives its GPU through the device plugin, just as it does in a plain container on the same node.

System (please complete the following information):

  • OS version: [e.g. Ubuntu 22.04]
  • Kernel version: [e.g. Linux 5.15]
  • Device plugins version: v0.34.0
  • Hardware info: Intel Data Center GPU Max 1100 (8 per node)

Labels

bug (Something isn't working), gpu (GPU device plugin related issue)
