Install CUDA 10.0 on Ubuntu 16.04 (for DGX-1)

Hi All,

I am trying to install CUDA 10.0 on Ubuntu 16.04 running on a DGX-1 server.
I followed the “runfile installation” instructions at https://docs.nvidia.com/cuda/archive/10.0/cuda-installation-guide-linux/index.html#runfile.

After step 4.2.6 (“Reboot the system to reload the graphical interface”), I checked the CUDA version as follows:

nvcc --version 

which returns:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

However, when I run:

nvidia-smi 

it returns:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running. 

I then went to step 4.4 (Device Node Verification) and found that the device files /dev/nvidia* don’t exist.
I tried to create them manually, but running:

sudo /sbin/modprobe nvidia 

returns:

modprobe: ERROR: could not insert 'nvidia': Exec format error 

Please help me solve this problem. Thanks!


Other details.

lspci | grep -i nvidia
06:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
07:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
85:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
86:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
89:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
8a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)

uname -m && cat /etc/*release
x86_64
DGX_NAME="DGX Server"
DGX_PRETTY_NAME="NVIDIA DGX Server"
DGX_SWBUILD_DATE="2018-03-20"
DGX_SWBUILD_VERSION="3.1.6"
DGX_COMMIT_ID="1b0f58ecbf989820ce745a9e4836e1de5eea6cfd"
DGX_SERIAL_NUMBER=QTFCOU8280021
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.6 LTS"
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

gcc --version
gcc (GCC) 5.4.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

uname -r
4.4.0-142-generic

cat /proc/version
Linux version 4.4.0-142-generic (buildd@lgw01-amd64-033) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019

dpkg -l | grep nvidia
ii dgx-peer-mem-loader 1.1-10 amd64 Ensure nvidia is loaded before nv_peer_mem

DGX-1 software is mostly installed and maintained through the package manager. You can use a runfile installer, but you’ll need to be aware of the conflicts inherent in mixing the two methods. These conflicts are documented in the CUDA Linux installation guide, in the section “Handle Conflicting Installation Methods”.
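To see which install methods have left traces on the machine, a read-only check along these lines may help. It is only a sketch: the helper names nvidia_packages and runfile_uninstallers are made up for illustration, and the two uninstaller paths follow the X.Y naming pattern given in the CUDA 10.0 install guide, so they are assumptions about where a default runfile install would have put them.

```shell
#!/bin/sh
# Read-only: reports traces of each install method and changes nothing.

# Print the names of installed packages whose dpkg line mentions nvidia or cuda.
# Reads `dpkg -l` style text on stdin so it can be tried on sample output.
nvidia_packages() {
    awk '$1 == "ii" && tolower($0) ~ /nvidia|cuda/ { print $2 }'
}

# Print each runfile uninstaller that exists under the given root (default /).
# Paths are assumed from the install guide's default locations for CUDA 10.0.
runfile_uninstallers() {
    root=${1:-}
    for f in "$root/usr/bin/nvidia-uninstall" \
             "$root/usr/local/cuda-10.0/bin/uninstall_cuda_10.0.pl"; do
        [ -e "$f" ] && echo "$f"
    done
    return 0
}

echo "package-manager traces:"
if command -v dpkg >/dev/null 2>&1; then
    dpkg -l | nvidia_packages
fi

echo "runfile traces:"
runfile_uninstallers
```

If both lists come back non-empty, the two methods have been mixed and the conflict section of the guide applies.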

In short, the CUDA 10.0 toolkit is installed, but your driver install is broken. To fix this, you’ll need to clean up and remove all traces of the previous installations.
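Before removing anything, it can be useful to see how the driver is broken. One read-only check is to compare the kernel that the nvidia module was built for with the running kernel. This is a sketch under an assumption: a kernel mismatch is one common cause of modprobe's "Exec format error", but nothing in the output above confirms that it is the cause here. The helper name vermagic_matches_kernel is hypothetical; modinfo -F vermagic and uname -r are standard tools.

```shell
#!/bin/sh
# Read-only check: nothing here loads, removes, or rebuilds a module.

# Succeed if the first field of a vermagic string equals the given kernel release.
vermagic_matches_kernel() {
    [ "${1%% *}" = "$2" ]
}

kernel=$(uname -r)
echo "running kernel: $kernel"

# modinfo -F prints a single field of the module's metadata.
if vermagic=$(modinfo -F vermagic nvidia 2>/dev/null); then
    echo "nvidia vermagic: $vermagic"
    if vermagic_matches_kernel "$vermagic" "$kernel"; then
        echo "module targets the running kernel"
    else
        echo "module targets a different kernel"
    fi
else
    echo "modinfo found no nvidia module for $kernel"
fi
```

Whatever this reports, the cleanup itself should follow the install guide's conflict section, which lists the removal command for each method (for example sudo /usr/bin/nvidia-uninstall for a runfile driver install, and sudo apt-get --purge remove <package_name> for packages).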