
GPU Max 1100 doesn't support querying the available free memory #2158

@sharvil10

Description

Describe the bug
We have a K8s cluster with 2 nodes, each with 8 GPU Max 1100s. We installed GPU plugin v0.34.0 to expose the GPUs to the cluster. However, when we run the following command in a pod, it fails with the error below:

python -c "import torch; print(torch.xpu.mem_get_info(torch.xpu.current_device())[1])"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/vllm/lib64/python3.12/site-packages/torch/xpu/memory.py", line 194, in mem_get_info
    return torch._C._xpu_getMemoryInfo(device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The device (Intel(R) Data Center GPU Max 1100) doesn't support querying the available free memory. You can file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize its implementation.
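As an interim workaround, a guarded query can fall back to the device's total memory. A minimal sketch (not from the report), assuming torch.xpu.mem_get_info returns a (free, total) tuple like its CUDA counterpart and that torch.xpu.get_device_properties exposes total_memory:

import torch

def xpu_memory_info(device=None):
    # Return (free_bytes, total_bytes); free_bytes is None when the
    # runtime cannot query free memory, as in the failing pod.
    if device is None:
        device = torch.xpu.current_device()
    try:
        return torch.xpu.mem_get_info(device)
    except RuntimeError:
        # Fall back to the static device property; total memory is still
        # reported even when the free-memory query is unsupported.
        total = torch.xpu.get_device_properties(device).total_memory
        return None, total

print(xpu_memory_info())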

However, the same command works in a container started directly (outside Kubernetes) on the same node. The pod definition and the container command are attached below.

sudo nerdctl run -it \
  --name xpu \
  -e VLLM_LOGGING_LEVEL=DEBUG \
  -e http_proxy=http://proxy-dmz.intel.com:912 \
  -e https_proxy=http://proxy-dmz.intel.com:912 \
  -e no_proxy="localhost,127.0.0.1,.maas,10.0.0.0/8,172.16.0.0/16,192.168.0.0/16,134.134.0.0/16,.maas-internal,.svc" \
  --device /dev/dri \
  --entrypoint="bash" \
  --privileged \
  ghcr.io/llm-d/llm-d-xpu:v0.3.0
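For debugging, it may help to compare which DRM nodes each environment actually sees. A small diagnostic sketch (not from the report, path assumed) that can be run in both the pod and the nerdctl container:

import os

# List the DRM device nodes visible to this process; a difference between
# the pod and the privileged container would point at the device plugin's
# mounts rather than at PyTorch itself.
for name in sorted(os.listdir("/dev/dri")):
    path = os.path.join("/dev/dri", name)
    st = os.stat(path)
    print(f"{path}: major={os.major(st.st_rdev)} minor={os.minor(st.st_rdev)} "
          f"rw={os.access(path, os.R_OK | os.W_OK)}")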

To Reproduce

apiVersion: v1
kind: Pod
metadata:
  name: xpu
  namespace: default
spec:
  containers:
  - name: xpu
    image: ghcr.io/llm-d/llm-d-xpu:v0.3.0
    imagePullPolicy: Always
    command:
    - bash
    stdin: true
    tty: true
    env:
    - name: VLLM_LOGGING_LEVEL
      value: DEBUG
    - name: http_proxy
      value: http://proxy-dmz.intel.com:912
    - name: https_proxy
      value: http://proxy-dmz.intel.com:912
    - name: no_proxy
      value: "localhost,127.0.0.1,.maas,10.0.0.0/8,172.16.0.0/16,192.168.0.0/16,134.134.0.0/16,.maas-internal,.svc"
    resources:
      limits:
        gpu.intel.com/i915: "8"
      requests:
        gpu.intel.com/i915: "8"
  restartPolicy: Never

The same pod works when privileged: true is set and the gpu.intel.com/i915 requests and limits are removed.
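One possible culprit, an assumption not confirmed in this issue: free-memory queries on Intel GPUs go through the Level Zero Sysman API, which is commonly enabled via the ZES_ENABLE_SYSMAN=1 environment variable. If the non-privileged pod loses that path, setting the variable before the driver initializes is worth testing. A speculative sketch:

import os

# Speculative: ZES_ENABLE_SYSMAN must be set before the Level Zero driver
# initializes, i.e. before torch is imported, for sysman-backed queries
# such as mem_get_info to have a chance of working.
os.environ.setdefault("ZES_ENABLE_SYSMAN", "1")

import torch  # driver initialization happens during/after this import

dev = torch.xpu.current_device()
print(torch.xpu.mem_get_info(dev))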

Expected behavior
The free-memory query should work in a pod that receives its GPU through the device plugin, just as it does in a plain container on the same node.

System (please complete the following information):

  • OS version: [e.g. Ubuntu 22.04]
  • Kernel version: [e.g. Linux 5.15]
  • Device plugins version: v0.34.0
  • Hardware info: Intel Data Center GPU Max 1100 (8 per node)

Labels

bug (Something isn't working), gpu (GPU device plugin related issue)
