Closed
Labels
bug (Something isn't working), gpu (GPU device plugin related issue)
Description
Describe the bug
We have a K8s cluster with 2 nodes, each with 8 GPU Max 1100s. We installed GPU plugin v0.34.0 to access the GPUs in the cluster. However, when we run the following command in the pod, it fails with the error below:

```
python -c "import torch; print(torch.xpu.mem_get_info(torch.xpu.current_device())[1])"
```
```
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/vllm/lib64/python3.12/site-packages/torch/xpu/memory.py", line 194, in mem_get_info
    return torch._C._xpu_getMemoryInfo(device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The device (Intel(R) Data Center GPU Max 1100) doesn't support querying the available free memory. You can file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize its implementation.
```

However, the same command works in a container on the same node. The container command and the pod definition are attached below.
```shell
sudo nerdctl run -it \
  --name xpu \
  -e VLLM_LOGGING_LEVEL=DEBUG \
  -e http_proxy=http://proxy-dmz.intel.com:912 \
  -e https_proxy=http://proxy-dmz.intel.com:912 \
  -e no_proxy="localhost,127.0.0.1,.maas,10.0.0.0/8,172.16.0.0/16,192.168.0.0/16,134.134.0.0/16,.maas-internal,.svc" \
  --device /dev/dri \
  --entrypoint="bash" \
  --privileged \
  ghcr.io/llm-d/llm-d-xpu:v0.3.0
```

To Reproduce
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: xpu
  namespace: default
spec:
  containers:
    - name: xpu
      image: ghcr.io/llm-d/llm-d-xpu:v0.3.0
      imagePullPolicy: Always
      command:
        - bash
      stdin: true
      tty: true
      env:
        - name: VLLM_LOGGING_LEVEL
          value: DEBUG
        - name: http_proxy
          value: http://proxy-dmz.intel.com:912
        - name: https_proxy
          value: http://proxy-dmz.intel.com:912
        - name: no_proxy
          value: "localhost,127.0.0.1,.maas,10.0.0.0/8,172.16.0.0/16,192.168.0.0/16,134.134.0.0/16,.maas-internal,.svc"
      resources:
        limits:
          gpu.intel.com/i915: "8"
        requests:
          gpu.intel.com/i915: "8"
  restartPolicy: Never
```

This same pod works when `privileged: true` is set and the `gpu.intel.com/i915` request and limits are removed.
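One experiment that may help isolate the cause (a sketch under assumptions, not a confirmed fix): the working `nerdctl` run differs from the failing pod in two ways at once, `--privileged` and direct `/dev/dri` passthrough. Keeping the device plugin's `gpu.intel.com/i915` resource request while adding a privileged `securityContext` separates the two variables. The pod name and the interpretation in the comments are assumptions for illustration.

```yaml
# Hypothetical test pod: identical GPU allocation via the device plugin,
# but with a privileged securityContext mirroring the working nerdctl flags.
# If mem_get_info succeeds only in this configuration, the failure is a
# permissions/capabilities issue, not a device-enumeration issue.
apiVersion: v1
kind: Pod
metadata:
  name: xpu-privileged-test   # assumed name, for illustration only
  namespace: default
spec:
  containers:
    - name: xpu
      image: ghcr.io/llm-d/llm-d-xpu:v0.3.0
      command: ["bash"]
      stdin: true
      tty: true
      securityContext:
        privileged: true   # assumption: mirrors --privileged in the container run
      resources:
        limits:
          gpu.intel.com/i915: "8"
        requests:
          gpu.intel.com/i915: "8"
  restartPolicy: Never
```

If the query also fails here, the next variable to test would be the device nodes themselves (the plugin mounts individual devices rather than the whole `/dev/dri` directory).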
Expected behavior
Querying GPU memory from the pod should work, just as it does in a container on the same node.
System (please complete the following information):
- OS version: [e.g. Ubuntu 22.04]
- Kernel version: [e.g. Linux 5.15]
- Device plugins version: [e.g. v0.34.0]
- Hardware info: [e.g. Intel dGPU Max 1100]