I have been using the container nvcr.io/nvidia/pytorch:21.07-py3 daily for the last few weeks without any issues whatsoever. Last night I accepted an Ubuntu update prompt, and now that container no longer starts. The error message is:
$ nvidia-docker run -p 8888:8888 --ipc=host --gpus all -it --rm -v /home/orca/:/home/t/ nvcr.io/nvidia/pytorch:21.07-py3
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: signal: segmentation fault (core dumped), stdout: , stderr:: unknown.

I get the same error message when trying docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi.
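If it helps narrow things down, a plain container run without --gpus should show whether the Docker daemon itself is still healthy and the failure is isolated to the NVIDIA prestart hook (the image below is just an arbitrary smoke test):

# No --gpus flag, so the NVIDIA hook is not involved; if this works while the
# --gpus runs above segfault, the problem is in libnvidia-container, not Docker
$ docker run --rm ubuntu:18.04 echo "plain docker is fine"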
Looking through my update logs (/var/log/apt/history.log), it seems the following update is what broke functionality:
Start-Date: 2021-08-16 19:49:10
Commandline: aptdaemon role='role-commit-packages' sender=':1.148'
Upgrade:
  libnvidia-container-tools:amd64 (1.4.0-1, 1.5.0~rc.1-1),
  openssl:amd64 (1.1.1-1ubuntu2.1~18.04.9, 1.1.1-1ubuntu2.1~18.04.10),
  libnvidia-container1:amd64 (1.4.0-1, 1.5.0~rc.1-1),
  gir1.2-snapd-1:amd64 (1.49-0ubuntu0.18.04.2, 1.58-0ubuntu0.18.04.0),
  libsnapd-glib1:amd64 (1.49-0ubuntu0.18.04.2, 1.58-0ubuntu0.18.04.0),
  libssl1.1:amd64 (1.1.1-1ubuntu2.1~18.04.9, 1.1.1-1ubuntu2.1~18.04.10),
  nvidia-container-toolkit:amd64 (1.5.1-1, 1.6.0~rc.1-1)
End-Date: 2021-08-16 19:49:13
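In other words, the NVIDIA container packages were bumped from release versions to release candidates (1.5.0~rc.1-1 and 1.6.0~rc.1-1). If the older versions are still in the repository, my plan is to downgrade and hold them, roughly like this (version strings taken from the log above; I haven't verified they are still available):

# Confirm which libnvidia-container / toolkit versions are currently installed
$ dpkg -l | grep nvidia-container

# Check which versions the configured repositories still offer
$ apt-cache policy libnvidia-container1 libnvidia-container-tools nvidia-container-toolkit

# Possible downgrade back to the previously working versions (only if still published)
$ sudo apt-get install --allow-downgrades \
    libnvidia-container1=1.4.0-1 \
    libnvidia-container-tools=1.4.0-1 \
    nvidia-container-toolkit=1.5.1-1

# Optionally hold them so the next unattended upgrade doesn't pull the RCs back in
$ sudo apt-mark hold libnvidia-container1 libnvidia-container-tools nvidia-container-toolkit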
Here is the output of nvidia-smi:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   51C    P8    30W / 320W |   1070MiB / 10014MiB |     17%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1191      G   /usr/lib/xorg/Xorg                 36MiB |
|    0   N/A  N/A      1312      G   /usr/bin/gnome-shell              103MiB |
|    0   N/A  N/A      2130      G   /usr/lib/xorg/Xorg                452MiB |
|    0   N/A  N/A      2299      G   /usr/bin/gnome-shell               86MiB |
|    0   N/A  N/A      2679      G   ...AAAAAAAAA= --shared-files       99MiB |
|    0   N/A  N/A      3470      G   /usr/lib/firefox/firefox          249MiB |
|    0   N/A  N/A      3600      G   /usr/lib/firefox/firefox            3MiB |
|    0   N/A  N/A      3761      G   /usr/lib/firefox/firefox            3MiB |
|    0   N/A  N/A      9373      G   /usr/lib/firefox/firefox            3MiB |
|    0   N/A  N/A      9515      G   /usr/lib/firefox/firefox            3MiB |
+-----------------------------------------------------------------------------+

Finally, nvidia-container-cli no longer seems to work:
$ nvidia-container-cli info
Segmentation fault (core dumped)
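In case a more detailed trace is useful, I can also run the CLI with its debug options (-k loads the kernel modules, -d redirects the library's debug log, here to the terminal) and post the output:

# Dump libnvidia-container's debug log to the terminal while querying GPU info
$ nvidia-container-cli -k -d /dev/tty info

Thanks for any help in fixing this issue!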