ehfd October 11, 2025, 9:19am 1

Quoted GitHub issue (opened 18 Aug 2025, 07:42 UTC):
### Summary

We are experiencing a critical regression with NVENC hardware encoding when using NVIDIA driver version `570.x` in a multi-GPU Kubernetes environment. On a node with four identical GPUs, any containerized application managed by the GPU Operator can only use the NVENC encoder successfully if it is scheduled on the last enumerated GPU (e.g., GPU 3 of 4). Pods scheduled on any other GPU (0, 1, or 2) fail to initialize the encoder.

This issue is a clear regression, as the entire setup works perfectly with the `550.x` driver series. Host-level encoding works on all cards, and we have confirmed there is **no** driver version mismatch between the host and the container. The problem appears to be specific to how the 570.x driver exposes NVENC capabilities to the containerized environment in a multi-GPU configuration.

### Environment Details

* **Hardware:**
  * **CPU:** AMD Ryzen Threadripper 7970X (32-Cores)
  * **GPU:** 4 x NVIDIA GeForce RTX 4080 SUPER
  * **Motherboard:** ASUSTeK Pro WS TRX50-SAGE WIFI
* **Software:**
  * **Orchestrator:** Kubernetes
  * **GPU Management:** NVIDIA GPU Operator
  * **Host Driver (Problematic):** `570.x` (e.g., 570.124.06)
  * **Host Driver (Working):** `550.x` series
  * **Container:** Using a container with correctly matched user-space libraries for the host driver.
  * **Application:** An Unreal Engine-based rendering service, and standard `ffmpeg`.

### Steps to Reproduce

1. Configure a Kubernetes node with multiple identical GPUs (e.g., 4x 4080 SUPER) and install NVIDIA host driver `570.x`.
2. Deploy the NVIDIA GPU Operator.
3. Deploy a Kubernetes `Deployment` that requests a single GPU (`spec.containers.resources.limits: nvidia.com/gpu: 1`).
4. Ensure pods from the Deployment are scheduled on different physical GPUs (e.g., GPU 0, GPU 1, etc.).
5. Inside a pod scheduled on any GPU *except the last one*, attempt to initialize an NVENC session using any application (`ffmpeg`, custom code, etc.).

### Expected Behavior

The containerized application should be able to successfully initialize the NVENC hardware encoder and perform video encoding, regardless of which physical GPU (0, 1, 2, or 3) is assigned to the pod.

### Actual Behavior

1. **Consistent Failure on first N-1 GPUs:** NVENC initialization fails on pods assigned to GPU 0, GPU 1, and GPU 2.
2. **Consistent Success on the last GPU:** A pod that is scheduled on GPU 3 works perfectly and can encode video without issue.
3. **Application-Agnostic Failure:** The issue is not tied to our application. A standard `ffmpeg` command inside a failing pod reproduces the error perfectly:

   ```bash
   $ ffmpeg -f lavfi -i testsrc=size=1920x1080:rate=30 -t 10 -c:v h264_nvenc -f null -
   ...
   [h264_nvenc @ 0x55de29791c00] OpenEncodeSessionEx failed: unsupported device (2): (no details)
   [h264_nvenc @ 0x55de29791c00] No capable devices found
   Error initializing output stream 0:0 -- Error while opening encoder for output stream #0:0
   ```

4. Our Unreal Engine application logs corresponding errors:

   ```
   LogAVCodecs: Error: Error Creating: Failed to create encoder [NVENC 2]
   LogPixelStreaming: Error: Could not create encoder.
   ```

### Troubleshooting and Analysis

* **This is a clear regression,** as downgrading the host driver to the `550.x` series resolves the issue completely on the exact same hardware and software stack.
* The issue is **specific to the container environment.** Running `ffmpeg` with NVENC directly on the host OS works correctly for all 4 GPUs simultaneously.
* The problem is tied to the **logical GPU index**, not a specific faulty card. Physically swapping the GPUs does not change the behavior; the failure always occurs on the first N-1 logical GPUs.
* Based on this evidence, the behavior strongly suggests a bug in the `570.x` driver or a related component of the GPU Operator toolkit. The issue likely lies in the enumeration or initialization process for NVENC capabilities when exposing them to a container in a multi-GPU system.

### Workaround

The only known workaround is to **downgrade the NVIDIA host driver to a version in the 550.x series.**
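For anyone who wants to reproduce the host-side half of that comparison, a per-GPU sweep along the following lines should confirm whether NVENC works on every card outside the container. This is a sketch, not part of the quoted report; the four-GPU count follows the hardware list above, and `CUDA_VISIBLE_DEVICES` is used only to restrict enumeration for each run.

```bash
# Host-side NVENC smoke test, one GPU at a time (sketch; assumes 4 GPUs as above).
for i in 0 1 2 3; do
  echo "=== GPU ${i} ==="
  CUDA_VISIBLE_DEVICES=${i} ffmpeg -hide_banner -f lavfi \
    -i testsrc=size=1920x1080:rate=30 -t 5 -c:v h264_nvenc -f null - \
    && echo "GPU ${i}: NVENC OK" || echo "GPU ${i}: NVENC FAILED"
done
```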
Quoted GitHub issue (opened 30 Jul 2025, 01:07 UTC):
So, this is a weird and pretty specific problem. I am at a loss at what I can do next, because I am unsure whether this is an issue with NVIDIA or with ffmpeg (nvdec specifically). The issue has been observed in the Frigate TensorRT images and the Ubuntu CUDA images with ffmpeg 7.

This is in relation to ffmpeg crashing (due to not finding a CUDA device) when using multiple NVIDIA GPUs while trying to use any index other than '0'. I have tried to expose only specific devices using the NVIDIA_VISIBLE_DEVICES env var and assigning them by index or GPU-UUID. The weird part is that I can load ONNX models onto GPU 1, which kind of leads towards this being something specific to ffmpeg.

I am opening this issue to seek advice and to see if there are any other users having this issue.

Error from ffmpeg output:

```logs
2025-07-29 16:57:53.773105842 [2025-07-29 16:57:53] ffmpeg.alley.detect ERROR : [AVHWDeviceContext @ 0x5dfde7cb58c0] cu->cuDeviceGet(&hwctx->internal->cuda_device, device_idx) failed -> CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2025-07-29 16:57:53.773621224 [2025-07-29 16:57:53] ffmpeg.alley.detect ERROR : Device creation failed: -542398533.
2025-07-29 16:57:53.774103329 [2025-07-29 16:57:53] ffmpeg.alley.detect ERROR : [vist#0:0/h264 @ 0x5dfde7c75f40] [dec:h264 @ 0x5dfde7d32440] No device available for decoder: device type cuda needed for codec h264.
2025-07-29 16:57:53.774607022 [2025-07-29 16:57:53] ffmpeg.alley.detect ERROR : [vist#0:0/h264 @ 0x5dfde7c75f40] [dec:h264 @ 0x5dfde7d32440] Hardware device setup failed for decoder: Generic error in an external library
2025-07-29 16:57:53.775085378 [2025-07-29 16:57:53] ffmpeg.alley.detect ERROR : [vost#0:0/rawvideo @ 0x5dfde7c80000] Error initializing a simple filtergraph
2025-07-29 16:57:53.775575817 [2025-07-29 16:57:53] ffmpeg.alley.detect ERROR : Error opening output file pipe:.
2025-07-29 16:57:53.776048525 [2025-07-29 16:57:53] ffmpeg.alley.detect ERROR : Error opening output files: Generic error in an external library
```

Edit: Other issues where this error occurs:

- https://github.com/blakeblackshear/frigate/discussions/18018
- https://github.com/blakeblackshear/frigate/discussions/18722

I have also opened a ticket on the ffmpeg bug tracker: https://trac.ffmpeg.org/ticket/11694
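One quick check inside an affected container, before pointing at the driver, is to see how many CUDA ordinals are actually visible: when NVIDIA_VISIBLE_DEVICES narrows a container to one GPU, that GPU is re-enumerated as CUDA ordinal 0 inside the container, so asking ffmpeg for ordinal 1 yields exactly this `invalid device ordinal` error even on a healthy driver. A minimal sketch of that check (not from the quoted report):

```bash
# Inside the affected container: list the GPUs the container can see, then try
# to create a CUDA hwdevice explicitly on ordinals 0 and 1 (sketch).
nvidia-smi -L
ffmpeg -hide_banner -init_hw_device cuda=cu0:0 -f lavfi -i nullsrc -frames:v 1 -f null -
ffmpeg -hide_banner -init_hw_device cuda=cu1:1 -f lavfi -i nullsrc -frames:v 1 -f null -
```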
Quoted GitHub issue (opened 6 Jun 2025, 13:21 UTC):
### Describe the bug

When deploying GPU-bound pods using the NVIDIA device plugin (`nvidia-device-plugin` Helm chart v0.17.1), **FFmpeg NVENC fails inside the container unless the assigned GPU is mounted at the path `/dev/nvidiaN` where `N` matches its `index` in `nvidia-smi`.**

This issue occurs **only when using `deviceListStrategy: volume-mounts`**, which is required for secure GPU isolation in our multi-tenant environment. Using `envvar` is not an option, as users can override `NVIDIA_VISIBLE_DEVICES` in untrusted Docker images.

As a result, **only pods where the assigned GPU's `nvidia-smi` index matches the container path `/dev/nvidiaN` succeed. All others fail with `unsupported device` errors in FFmpeg**.

### Helm values

```yaml
deviceIDStrategy: uuid
deviceListStrategy: volume-mounts
runtimeClassName: nvidia
```

### Root cause

NVENC appears to rely on the assumption that:

```
/dev/nvidiaN  <->  GPU with index N from `nvidia-smi`
```

If this alignment is broken (e.g. the GPU with `index: 0` is mounted as `/dev/nvidia5`), the encoder fails:

```
[h264_nvenc @ 0x637317ea8e80] OpenEncodeSessionEx failed: unsupported device (2): (no details)
[h264_nvenc @ 0x637317ea8e80] No capable devices found
```

This behavior is reproducible and consistent across all tested environments.

### Host configuration

* 6× NVIDIA RTX 4090 (UUID-assigned, known-good hardware)
* Host `/dev/nvidia[0-5]` layout matches `nvidia-smi` output
* `nvidia-smi`, CUDA, and NVENC work fine directly on host
* Issue **only occurs inside container** when mount path/index diverge from `nvidia-smi`

### Working pod example

* GPU UUID: `GPU-46b5dd79-...`
* `nvidia-smi index`: `0`
* Mounted as: `/dev/nvidia0`
* `ffmpeg -c:v h264_nvenc` works

### Failing pod example

* GPU UUID: `GPU-dada647b-...`
* `nvidia-smi index`: `0`
* Mounted as: `/dev/nvidia5`
* `ffmpeg -c:v h264_nvenc` fails with:

```text
[h264_nvenc @ 0x637317ea8e80] OpenEncodeSessionEx failed: unsupported device (2): (no details)
[h264_nvenc @ 0x637317ea8e80] No capable devices found
```

### Additional observations

* All expected character devices (`nvidia[0-9]`, `nvidiactl`, `uvm`, etc.) are present inside the pod.
* The mounted `/dev/nvidiaX` files have correct major/minor numbers.
* The issue **only depends on the alignment between `nvidia-smi index` and the mounted path**.
* The `Device Minor:` in `/proc/driver/nvidia/gpus/.../information` **does not determine NVENC success**; only the mount path does.

### Expected behavior

All GPUs assigned to a container should be fully usable via NVENC, regardless of physical or logical index, **as long as the device is properly mounted**. The device plugin should ensure that **`/dev/nvidiaN` always maps to the GPU with `nvidia-smi index N`**, or NVENC workloads will fail.

### Environment

* **Host OS:** Ubuntu 22.04
* **GPUs:** 6× NVIDIA RTX 4090
* **Container runtime:** containerd
* **Kubernetes:** v1.32.x (K3s)
* **NVIDIA Driver:** 570.133.20 (also tested with 575)
* **NVIDIA device plugin:** v0.17.1 (Helm)
* **nvidia-container-runtime:** 3.14.0-1
* **nvidia-container-toolkit:** 1.17.6-1
* **NVIDIA_DRIVER_CAPABILITIES:** `compute,video,utility,graphics,display` (set in the deployment image)
* **FFmpeg:** NVENC-enabled build (confirmed working directly on host)

### Steps to reproduce

1. Deploy multiple pods with:

   ```yaml
   resources:
     limits:
       nvidia.com/gpu: 1
   ```

2. Inside each pod, run:

   ```bash
   nvidia-smi --query-gpu=gpu_uuid,index,name --format=csv,noheader
   ls -l /dev/nvidia[0-9]
   ffmpeg -hide_banner -f lavfi -i testsrc=duration=3:size=1280x720:rate=30 -c:v h264_nvenc -y /tmp/test.mp4
   ```

3. Observe:
   * If `/dev/nvidiaN` matches the `index: N` reported by `nvidia-smi`, encoding works.
   * If not, FFmpeg fails.

### Suggested improvement

Ensure the device plugin **mounts GPU devices inside the pod at the `/dev/nvidiaN` path where `N` is the GPU's index reported by `nvidia-smi`**. This will restore NVENC compatibility and likely benefit other workloads that rely on this path/index alignment.

### Partial workaround

None identified. Detecting the mismatch inside user space (via `nvidia-smi` + `ls -l /dev/nvidia*`) lets us fail fast, but does not resolve the root problem: NVENC will still fail to initialize.
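The fail-fast detection mentioned in that partial workaround could be implemented roughly as below. This is a sketch of one possible implementation, under the assumption that with the volume-mounts strategy only the assigned GPU's `/dev/nvidiaN` node is mounted into the pod.

```bash
#!/usr/bin/env bash
# Sketch: abort early if the assigned GPU's nvidia-smi index has no matching
# /dev/nvidiaN node in the pod (the alignment NVENC appears to depend on).
set -u

idx=$(nvidia-smi --query-gpu=index --format=csv,noheader | head -n1 | tr -d ' ')
if [ -e "/dev/nvidia${idx}" ]; then
  echo "OK: nvidia-smi index ${idx} is mounted as /dev/nvidia${idx}"
else
  echo "MISMATCH: no /dev/nvidia${idx} node for nvidia-smi index ${idx}; NVENC will likely fail" >&2
  ls -l /dev/nvidia[0-9]* >&2 || true
  exit 1
fi
```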
Quoted GitHub issue (opened 16 Apr 2025, 11:15 UTC):
### NVIDIA Open GPU Kernel Modules Version

570.86.16

### Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

- [x] I confirm that this does not happen with the proprietary driver package.

### Operating System and Version

Ubuntu 22.04.5 LTS

### Kernel Release

5.15.0-113-generic

### Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

- [x] I am running on a stable kernel release.

### Hardware: GPU

NVIDIA GeForce RTX 4090

### Describe the bug

After installing version 570.86.16 of the open GPU kernel modules, I encountered an error when using the NVENC functionality inside a container with the NVIDIA Container Runtime. The error indicates that the device is unsupported. However, when running the application directly on the host machine, the NVENC feature works correctly. I confirm that this does not happen with the proprietary driver package.

### To Reproduce

`docker run --runtime=nvidia --gpus '"device=0,1"' jrottenberg/ffmpeg:4.1-nvidia -report -loglevel debug -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -preset fast -y /tmp/test_output.mp4`

### Bug Incidence

Always

### nvidia-bug-report.log.gz

[nvidia-bug-report.log.gz](https://github.com/user-attachments/files/19776094/nvidia-bug-report.log.gz)

### More Info

Similar issues were #104 and #378, but the nvenc problem occurs again in the new version of the driver.
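To map which device ordinals can actually open an NVENC session in a plain Docker setup like that reproduction, the same command can be repeated once per device. This is a sketch that reuses the image and arguments from the reproduction above and assumes two GPUs, matching `device=0,1`.

```bash
# Run the NVENC smoke test against each GPU individually (sketch).
for dev in 0 1; do
  echo "=== device ${dev} ==="
  docker run --rm --runtime=nvidia --gpus "device=${dev}" \
    jrottenberg/ffmpeg:4.1-nvidia \
    -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 \
    -c:v h264_nvenc -preset fast -y /tmp/test_output.mp4 \
    && echo "device ${dev}: NVENC OK" || echo "device ${dev}: NVENC FAILED"
done
```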
The above are critical issues in which NVENC and NVDEC work on only one GPU in multi-GPU setups with the NVIDIA Container Toolkit on driver versions newer than 565, i.e. 570 and later.
This is in relation to NVENC crashing (due to not finding a CUDA device) when using multiple NVIDIA GPUs while trying to use any index other than '0'. Many attempts were made to expose only specific devices using the NVIDIA_VISIBLE_DEVICES environment variable and to assign them by index or GPU-UUID.
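For reference, those two exposure forms correspond to something like the following Docker invocations (a sketch; the CUDA base image tag is an assumption and the UUID is a placeholder for a value printed by `nvidia-smi -L`):

```bash
# Expose a single GPU by index, then by UUID (sketch; placeholder UUID).
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 \
  nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=GPU-00000000-0000-0000-0000-000000000000 \
  nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
```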
Only one GPU works (it may be the first GPU, the last GPU, or anything in between), and every other GPU fails in FFmpeg:
    [h264_nvenc @ 0x] OpenEncodeSessionEx failed: unsupported device (2): (no details)
    [h264_nvenc @ 0x] No capable devices found

Moreover, GStreamer also fails in a similar way whenever FFmpeg fails:

    nvh264encoder gstnvh264encoder.cpp:2158:gst_nv_h264_encoder_register_cuda:<cudacontext0> Failed to open session
    nvh265encoder gstnvh265encoder.cpp:2196:gst_nv_h265_encoder_register_cuda:<cudacontext0> Failed to open session
    nvenc gstnvenc.c:685:gst_nv_enc_register: NvEncOpenEncodeSessionEx failed: codec h264, device 0, error code 2
    nvenc gstnvenc.c:685:gst_nv_enc_register: NvEncOpenEncodeSessionEx failed: codec h265, device 0, error code 2

The above is on driver version 580.82.07 with five NVIDIA Titan Xp GPUs.
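For completeness, the GStreamer side can be exercised directly as well (a sketch, assuming the nvcodec plugin from gst-plugins-bad is present in the container):

```bash
# The element may be missing entirely when session registration fails at plugin load.
gst-inspect-1.0 nvh264enc
# A minimal encode pipeline that exercises NVENC through GStreamer (sketch).
gst-launch-1.0 videotestsrc num-buffers=300 ! video/x-raw,width=1280,height=720 \
  ! nvh264enc ! h264parse ! fakesink
```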
Driver versions 565 and 550 work fine, but this is a regression in driver version 570 and higher; therefore, I am bringing this up on the forum for the driver team.
This is widely known to happen in Kubernetes, but it may also happen in Docker.
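A minimal Kubernetes reproduction might look like the sketch below. Assumptions: the GPU Operator or device plugin is already installed, the `jrottenberg/ffmpeg:4.1-nvidia` image from the issue above is available, and the NVIDIA runtime is the cluster default; some setups additionally need `runtimeClassName: nvidia` as in the quoted Helm values.

```bash
# Create a one-shot pod that requests a single GPU and runs an NVENC smoke test (sketch).
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nvenc-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: ffmpeg
    image: jrottenberg/ffmpeg:4.1-nvidia
    args: ["-f", "lavfi", "-i", "testsrc=duration=5:size=1280x720:rate=30",
           "-c:v", "h264_nvenc", "-f", "null", "-"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# Once the pod has run to completion (or failed), inspect the encoder output:
kubectl logs pod/nvenc-smoke-test
```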
CC @amrits @generix
ktsong October 23, 2025, 6:03am 2

We are also closely monitoring this issue. Based on our tests, ffmpeg NVENC functionality within K8s pods works well on Tesla T4 nodes with multiple GPU cards. Since the issue NVENC Fails in Kubernetes Pods on all but the last GPU with Driver 570.x or 580.x · Issue #1249 · NVIDIA/nvidia-container-toolkit · GitHub has indicated stable performance on V100 GPUs, and considering the current findings, we suspect there may be driver-level issues affecting NVENC support for the GeForce series, particularly models like the 3060, 4090, and 5090. We're continuing to look into this and will provide updates as we learn more.
It has been confirmed that driver version 565.57.01 does not have this issue, but both the 570 and 580 series are affected. What is the current status regarding this problem?