SLURM able to send CPU, but not GPU jobs

Question

I've built out a slurm "cluster" (just a job server and single compute server at the moment) and am trying to run jobs on it. I can run CPU jobs just fine, it sends them to the machine and runs them. However, when I try to run a GPU job, it never sends it to the machine. According to the slurmctld logs, it finds the node and considers it usable. It just never sends the job to the machine.

Here are the conf files and the logs:

slurm.conf

# slurm.conf file generated by configurator.html. # Put this file on all nodes of your cluster. # See the slurm.conf man page for more information. # ClusterName=cluster1 SlurmctldHost=jobserver #SlurmctldHost= # #DisableRootJobs=NO #EnforcePartLimits=NO #Epilog= #EpilogSlurmctld= #FirstJobId=1 #MaxJobId=67043328 #GresTypes= #GroupUpdateForce=0 #GroupUpdateTime=600 #JobFileAppend=0 #JobRequeue=1 #JobSubmitPlugins=lua #KillOnBadExit=0 #LaunchType=launch/slurm #Licenses=foo*4,bar #MailProg=/bin/mail #MaxJobCount=10000 #MaxStepCount=40000 #MaxTasksPerNode=512 MpiDefault=none #MpiParams=ports=#-# #PluginDir= #PlugStackConfig= #PrivateData=jobs ProctrackType=proctrack/cgroup #Prolog= #PrologFlags= #PrologSlurmctld= #PropagatePrioProcess=0 #PropagateResourceLimits= #PropagateResourceLimitsExcept= #RebootProgram= ReturnToService=1 SlurmctldPidFile=/var/run/slurmctld.pid SlurmctldPort=6817 SlurmdPidFile=/var/run/slurmd.pid SlurmdPort=6818 SlurmdSpoolDir=/var/spool/slurmd SlurmUser=slurm #SlurmdUser=root #SrunEpilog= #SrunProlog= StateSaveLocation=/var/spool/slurmctld SwitchType=switch/none #TaskEpilog= TaskPlugin=task/affinity,task/cgroup #TaskProlog= #TopologyPlugin=topology/tree #TmpFS=/tmp #TrackWCKey=no #TreeWidth= #UnkillableStepProgram= #UsePAM=0 # # # TIMERS #BatchStartTimeout=10 #CompleteWait=0 #EpilogMsgTime=2000 #GetEnvTimeout=2 #HealthCheckInterval=0 #HealthCheckProgram= InactiveLimit=0 KillWait=30 #MessageTimeout=10 #ResvOverRun=0 MinJobAge=300 #OverTimeLimit=0 SlurmctldTimeout=120 SlurmdTimeout=300 #UnkillableStepTimeout=60 #VSizeFactor=0 Waittime=0 # # # SCHEDULING #DefMemPerCPU=0 #MaxMemPerCPU=0 #SchedulerTimeSlice=30 SchedulerType=sched/backfill SelectType=select/cons_tres SelectTypeParameters=CR_Core_Memory # # # JOB PRIORITY #PriorityFlags= #PriorityType=priority/basic #PriorityDecayHalfLife= #PriorityCalcPeriod= #PriorityFavorSmall= #PriorityMaxAge= #PriorityUsageResetPeriod= #PriorityWeightAge= #PriorityWeightFairshare= #PriorityWeightJobSize= #PriorityWeightPartition= #PriorityWeightQOS= # # # LOGGING AND ACCOUNTING #AccountingStorageEnforce=0 #AccountingStorageHost= #AccountingStoragePass= #AccountingStoragePort= AccountingStorageTRES=gres/gpu AccountingStorageType=accounting_storage/slurmdbd #AccountingStorageUser= #AccountingStoreFlags= #JobCompHost= #JobCompLoc= #JobCompParams= #JobCompPass= #JobCompPort= JobCompType=jobcomp/none #JobCompUser= #JobContainerType=job_container/none JobAcctGatherFrequency=30 JobAcctGatherType=jobacct_gather/cgroup SlurmctldDebug=debug5 SlurmctldLogFile=/var/log/slurmctld.log SlurmdDebug=debug5 SlurmdLogFile=/var/log/slurmd.log #SlurmSchedLogFile= #SlurmSchedLogLevel= #DebugFlags= # # # POWER SAVE SUPPORT FOR IDLE NODES (optional) #SuspendProgram= #ResumeProgram= #SuspendTimeout= #ResumeTimeout= #ResumeRate= #SuspendExcNodes= #SuspendExcParts= #SuspendRate= #SuspendTime= # # # COMPUTE NODES GresTypes=gpu NodeName=cpu02 Gres=gpu:rtx_4000_sff:1 CPUs=112 Sockets=2 CoresPerSocket=28 ThreadsPerCore=2 RealMemory=772637 State=UNKNOWN # Partitions PartitionName=1gpu Nodes=cpu02 Default=YES MaxTime=INFINITE State=UP

(in production I will change debug5 to info)

cgroup.conf

### # # Slurm cgroup support configuration file # # See man slurm.conf and man cgroup.conf for further # information on cgroup configuration parameters #-- CgroupPlugin=cgroup/v1 CgroupAutomount=yes ConstrainCores=yes ConstrainDevices=yes ConstrainRAMSpace=yes

gres.conf (on compute node)

# GPU list NodeName=cpu02 Name=gpu Type=rtx_4000_sff File=/dev/nvidia0

sinfo -N -o "%N %G"

NODELIST GRES cpu02 gpu:1(S:1)

slurmctld.log

[2023-10-20T17:40:56.004] debug3: Writing job id 69 to header record of job_state file [2023-10-20T17:40:56.054] debug2: Processing RPC: REQUEST_RESOURCE_ALLOCATION from UID=0 [2023-10-20T17:40:56.054] debug3: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=0 [2023-10-20T17:40:56.054] debug3: _set_hostname: Using auth hostname for alloc_node: jobserver [2023-10-20T17:40:56.054] debug3: JobDesc: user_id=0 JobId=N/A partition=(null) name=neofetch [2023-10-20T17:40:56.054] debug3: cpus=1-4294967294 pn_min_cpus=-1 core_spec=-1 [2023-10-20T17:40:56.054] debug3: Nodes=1-[1] Sock/Node=65534 Core/Sock=65534 Thread/Core=65534 [2023-10-20T17:40:56.054] debug3: pn_min_memory_job=18446744073709551615 pn_min_tmp_disk=-1 [2023-10-20T17:40:56.054] debug3: immediate=0 reservation=(null) [2023-10-20T17:40:56.054] debug3: features=(null) batch_features=(null) cluster_features=(null) prefer=(null) [2023-10-20T17:40:56.054] debug3: req_nodes=(null) exc_nodes=(null) [2023-10-20T17:40:56.054] debug3: time_limit=-1--1 priority=-1 contiguous=0 shared=-1 [2023-10-20T17:40:56.054] debug3: kill_on_node_fail=-1 script=(null) [2023-10-20T17:40:56.054] debug3: argv="neofetch" [2023-10-20T17:40:56.054] debug3: stdin=(null) stdout=(null) stderr=(null) [2023-10-20T17:40:56.054] debug3: work_dir=/var/log alloc_node:sid=jobserver:2638 [2023-10-20T17:40:56.054] debug3: power_flags= [2023-10-20T17:40:56.054] debug3: resp_host=127.0.0.1 alloc_resp_port=44165 other_port=35589 [2023-10-20T17:40:56.054] debug3: dependency=(null) account=(null) qos=(null) comment=(null) [2023-10-20T17:40:56.054] debug3: mail_type=0 mail_user=(null) nice=0 num_tasks=-1 open_mode=0 overcommit=-1 acctg_freq=(null) [2023-10-20T17:40:56.054] debug3: network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null) [2023-10-20T17:40:56.054] debug3: end_time= signal=0@0 wait_all_nodes=1 cpu_freq= [2023-10-20T17:40:56.054] debug3: ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1 ntasks_per_tres=-1 [2023-10-20T17:40:56.054] debug3: mem_bind=0:(null) plane_size:65534 [2023-10-20T17:40:56.054] debug3: array_inx=(null) [2023-10-20T17:40:56.054] debug3: burst_buffer=(null) [2023-10-20T17:40:56.054] debug3: mcs_label=(null) [2023-10-20T17:40:56.054] debug3: deadline=Unknown [2023-10-20T17:40:56.054] debug3: bitflags=0x1e000000 delay_boot=4294967294 [2023-10-20T17:40:56.054] debug3: TRES_per_job=gres:gpu:1 [2023-10-20T17:40:56.054] debug3: assoc_mgr_fill_in_user: found correct user: root(0) [2023-10-20T17:40:56.054] debug5: assoc_mgr_fill_in_assoc: looking for assoc of user=root(0), acct=root, cluster=cluster1, partition=1gpu [2023-10-20T17:40:56.054] debug3: assoc_mgr_fill_in_assoc: found correct association of user=root(0), acct=root, cluster=cluster1, partition=1gpu to assoc=2 acct=root [2023-10-20T17:40:56.054] debug3: found correct qos [2023-10-20T17:40:56.054] debug2: found 1 usable nodes from config containing cpu02 [2023-10-20T17:40:56.054] debug2: NodeSet for JobId=69 [2023-10-20T17:40:56.054] debug2: NodeSet[0] Nodes:cpu02 NodeWeight:1 Flags:0 FeatureBits:0 SchedWeight:511 [2023-10-20T17:40:56.054] debug3: _pick_best_nodes: JobId=69 idle_nodes 1 share_nodes 1 [2023-10-20T17:40:56.054] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69 [2023-10-20T17:40:56.054] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69 [2023-10-20T17:40:56.054] sched: _slurm_rpc_allocate_resources JobId=69 NodeList=(null) usec=253 [2023-10-20T17:40:58.518] debug: sched: Running job scheduler for default depth. [2023-10-20T17:40:58.518] debug2: found 1 usable nodes from config containing cpu02 [2023-10-20T17:40:58.518] debug2: NodeSet for JobId=69 [2023-10-20T17:40:58.518] debug2: NodeSet[0] Nodes:cpu02 NodeWeight:1 Flags:0 FeatureBits:0 SchedWeight:511 [2023-10-20T17:40:58.518] debug3: _pick_best_nodes: JobId=69 idle_nodes 1 share_nodes 1 [2023-10-20T17:40:58.518] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69 [2023-10-20T17:40:58.518] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69 [2023-10-20T17:40:58.518] debug3: sched: JobId=69. State=PENDING. Reason=Resources. Priority=4294901743. Partition=1gpu. [2023-10-20T17:40:59.518] debug: Spawning ping agent for cpu02 [2023-10-20T17:40:59.518] debug2: Spawning RPC agent for msg_type REQUEST_PING [2023-10-20T17:40:59.518] debug2: Tree head got back 0 looking for 1 [2023-10-20T17:40:59.518] debug3: Tree sending to cpu02 [2023-10-20T17:40:59.520] debug2: Tree head got back 1 [2023-10-20T17:40:59.523] debug2: node_did_resp cpu02 [2023-10-20T17:41:01.004] debug3: create_mmap_buf: loaded file `/var/spool/slurmctld/job_state` as buf_t [2023-10-20T17:41:01.004] debug3: Writing job id 70 to header record of job_state file [2023-10-20T17:41:02.496] debug: sched/backfill: _attempt_backfill: beginning [2023-10-20T17:41:02.496] debug: sched/backfill: _attempt_backfill: 1 jobs to backfill [2023-10-20T17:41:02.496] debug2: sched/backfill: _attempt_backfill: entering _try_sched for JobId=69. [2023-10-20T17:41:02.496] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69 [2023-10-20T17:41:02.496] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69 [2023-10-20T17:41:24.316] debug2: Processing RPC: REQUEST_JOB_INFO from UID=0 [2023-10-20T17:41:24.316] debug3: assoc_mgr_fill_in_user: found correct user: root(0) [2023-10-20T17:41:24.316] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=0 [2023-10-20T17:41:24.316] debug2: _slurm_rpc_dump_partitions, size=221 usec=3 [2023-10-20T17:41:25.532] debug2: Testing job time limits and checkpoints [2023-10-20T17:41:50.271] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0 [2023-10-20T17:41:50.271] debug3: validate_node_specs: validating nodes cpu02 in state: IDLE [2023-10-20T17:41:50.271] debug2: _slurm_rpc_node_registration complete for cpu02 usec=16 [2023-10-20T17:41:50.497] debug: sched/backfill: _attempt_backfill: beginning [2023-10-20T17:41:50.497] debug: sched/backfill: _attempt_backfill: 1 jobs to backfill [2023-10-20T17:41:50.497] debug2: sched/backfill: _attempt_backfill: entering _try_sched for JobId=69. [2023-10-20T17:41:50.497] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69 [2023-10-20T17:41:50.497] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69 [2023-10-20T17:41:55.547] debug2: Testing job time limits and checkpoints [2023-10-20T17:41:55.547] debug2: Performing purge of old job records [2023-10-20T17:41:55.547] debug: sched: Running job scheduler for full queue. [2023-10-20T17:41:55.547] debug2: found 1 usable nodes from config containing cpu02 [2023-10-20T17:41:55.547] debug2: NodeSet for JobId=69 [2023-10-20T17:41:55.547] debug2: NodeSet[0] Nodes:cpu02 NodeWeight:1 Flags:0 FeatureBits:0 SchedWeight:511 [2023-10-20T17:41:55.547] debug3: _pick_best_nodes: JobId=69 idle_nodes 1 share_nodes 1 [2023-10-20T17:41:55.547] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69 [2023-10-20T17:41:55.547] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69 [2023-10-20T17:41:55.547] debug3: sched: JobId=69. State=PENDING. Reason=Resources. Priority=4294901743. Partition=1gpu. [2023-10-20T17:41:56.570] debug2: Processing RPC: REQUEST_JOB_INFO from UID=0 [2023-10-20T17:41:56.570] debug3: assoc_mgr_fill_in_user: found correct user: root(0) [2023-10-20T17:41:56.571] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=0 [2023-10-20T17:41:56.571] debug2: _slurm_rpc_dump_partitions, size=221 usec=4 [2023-10-20T17:41:57.522] debug2: Processing RPC: REQUEST_JOB_INFO from UID=0 [2023-10-20T17:41:57.522] debug3: assoc_mgr_fill_in_user: found correct user: root(0) [2023-10-20T17:41:57.522] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=0 [2023-10-20T17:41:57.522] debug2: _slurm_rpc_dump_partitions, size=221 usec=4 [2023-10-20T17:42:15.822] debug2: Processing RPC: REQUEST_JOB_INFO from UID=0 [2023-10-20T17:42:15.822] debug3: assoc_mgr_fill_in_user: found correct user: root(0) [2023-10-20T17:42:15.823] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=0 [2023-10-20T17:42:15.823] debug2: _slurm_rpc_dump_partitions, size=221 usec=4 [2023-10-20T17:42:20.497] debug: sched/backfill: _attempt_backfill: beginning [2023-10-20T17:42:20.497] debug: sched/backfill: _attempt_backfill: 1 jobs to backfill [2023-10-20T17:42:20.497] debug2: sched/backfill: _attempt_backfill: entering _try_sched for JobId=69. [2023-10-20T17:42:20.497] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69 [2023-10-20T17:42:20.497] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69 [2023-10-20T17:42:25.563] debug2: Testing job time limits and checkpoints [2023-10-20T17:42:55.579] debug2: Testing job time limits and checkpoints [2023-10-20T17:42:55.579] debug2: Performing purge of old job records [2023-10-20T17:42:55.579] debug: sched: Running job scheduler for full queue. [2023-10-20T17:42:55.579] debug2: found 1 usable nodes from config containing cpu02 [2023-10-20T17:42:55.579] debug2: NodeSet for JobId=69 [2023-10-20T17:42:55.579] debug2: NodeSet[0] Nodes:cpu02 NodeWeight:1 Flags:0 FeatureBits:0 SchedWeight:511 [2023-10-20T17:42:55.579] debug3: _pick_best_nodes: JobId=69 idle_nodes 1 share_nodes 1 [2023-10-20T17:42:55.579] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69 [2023-10-20T17:42:55.579] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69 [2023-10-20T17:42:55.579] debug3: sched: JobId=69. State=PENDING. Reason=Resources. Priority=4294901743. Partition=1gpu. [2023-10-20T17:43:25.594] debug2: Testing job time limits and checkpoints

slurmd.log

[2023-10-20T17:41:50.258] debug3: Trying to load plugin /usr/local/lib/slurm/gres_gpu.so [2023-10-20T17:41:50.259] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Gres GPU plugin type:gres/gpu version:0x170206 [2023-10-20T17:41:50.259] debug: gres/gpu: init: loaded [2023-10-20T17:41:50.259] debug3: Success. [2023-10-20T17:41:50.259] debug3: _merge_gres2: From gres.conf, using gpu:rtx_4000_sff:1:/dev/nvidia0 [2023-10-20T17:41:50.259] debug3: Trying to load plugin /usr/local/lib/slurm/gpu_generic.so [2023-10-20T17:41:50.259] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:GPU Generic plugin type:gpu/generic version:0x170206 [2023-10-20T17:41:50.259] debug: gpu/generic: init: init: GPU Generic plugin loaded [2023-10-20T17:41:50.259] debug3: Success. [2023-10-20T17:41:50.259] Gres Name=gpu Type=rtx_4000_sff Count=1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT [2023-10-20T17:41:50.259] debug3: Trying to load plugin /usr/local/lib/slurm/topology_none.so [2023-10-20T17:41:50.259] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:topology NONE plugin type:topology/none version:0x170206 [2023-10-20T17:41:50.259] topology/none: init: topology NONE plugin loaded [2023-10-20T17:41:50.259] debug3: Success. [2023-10-20T17:41:50.259] debug3: Trying to load plugin /usr/local/lib/slurm/route_default.so [2023-10-20T17:41:50.259] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:route default plugin type:route/default version:0x170206 [2023-10-20T17:41:50.259] route/default: init: route default plugin loaded [2023-10-20T17:41:50.259] debug3: Success. [2023-10-20T17:41:50.259] debug2: Gathering cpu frequency information for 112 cpus [2023-10-20T17:41:50.262] debug: Resource spec: No specialized cores configured by default on this node [2023-10-20T17:41:50.262] debug: Resource spec: Reserved system memory limit not configured for this node [2023-10-20T17:41:50.262] debug3: Trying to load plugin /usr/local/lib/slurm/proctrack_cgroup.so [2023-10-20T17:41:50.262] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Process tracking via linux cgroup freezer subsystem type:proctrack/cgroup version:0x170206 [2023-10-20T17:41:50.263] debug3: cgroup/v1: xcgroup_create_slurm_cg: slurm cgroup /slurm successfully created for ns freezer [2023-10-20T17:41:50.263] debug3: Success. [2023-10-20T17:41:50.263] debug3: Trying to load plugin /usr/local/lib/slurm/task_cgroup.so [2023-10-20T17:41:50.263] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Tasks containment cgroup plugin type:task/cgroup version:0x170206 [2023-10-20T17:41:50.263] debug: task/cgroup: init: Tasks containment cgroup plugin loaded [2023-10-20T17:41:50.263] debug3: Success. [2023-10-20T17:41:50.263] debug3: Trying to load plugin /usr/local/lib/slurm/task_affinity.so [2023-10-20T17:41:50.263] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:task affinity plugin type:task/affinity version:0x170206 [2023-10-20T17:41:50.263] debug3: task/affinity: slurm_getaffinity: sched_getaffinity(0) = 0xffffffffffffffffffffffffffff [2023-10-20T17:41:50.263] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffffffffffffffff [2023-10-20T17:41:50.263] debug3: Success. [2023-10-20T17:41:50.263] debug: spank: opening plugin stack /etc/slurm/plugstack.conf [2023-10-20T17:41:50.263] debug3: Trying to load plugin /usr/local/lib/slurm/cred_munge.so [2023-10-20T17:41:50.263] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge credential signature plugin type:cred/munge version:0x170206 [2023-10-20T17:41:50.263] cred/munge: init: Munge credential signature plugin loaded [2023-10-20T17:41:50.263] debug3: Success. [2023-10-20T17:41:50.263] debug3: slurmd initialization successful [2023-10-20T17:41:50.265] slurmd version 23.02.6 started [2023-10-20T17:41:50.265] debug3: finished daemonize [2023-10-20T17:41:50.265] debug3: cred_unpack: job 66 ctime:1697822916 revoked:1697822916 expires:1697823036 [2023-10-20T17:41:50.265] debug3: not appending expired job 66 state [2023-10-20T17:41:50.265] debug3: destroying job 66 state [2023-10-20T17:41:50.265] debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_energy_none.so [2023-10-20T17:41:50.265] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:AcctGatherEnergy NONE plugin type:acct_gather_energy/none version:0x170206 [2023-10-20T17:41:50.265] debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded [2023-10-20T17:41:50.265] debug3: Success. [2023-10-20T17:41:50.265] debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_profile_none.so [2023-10-20T17:41:50.265] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:AcctGatherProfile NONE plugin type:acct_gather_profile/none version:0x170206 [2023-10-20T17:41:50.265] debug: acct_gather_profile/none: init: AcctGatherProfile NONE plugin loaded [2023-10-20T17:41:50.265] debug3: Success. [2023-10-20T17:41:50.265] debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_interconnect_none.so [2023-10-20T17:41:50.266] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:AcctGatherInterconnect NONE plugin type:acct_gather_interconnect/none version:0x170206 [2023-10-20T17:41:50.266] debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded [2023-10-20T17:41:50.266] debug3: Success. [2023-10-20T17:41:50.266] debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_filesystem_none.so [2023-10-20T17:41:50.266] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:AcctGatherFilesystem NONE plugin type:acct_gather_filesystem/none version:0x170206 [2023-10-20T17:41:50.266] debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded [2023-10-20T17:41:50.266] debug3: Success. [2023-10-20T17:41:50.266] debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf) [2023-10-20T17:41:50.266] debug3: Trying to load plugin /usr/local/lib/slurm/jobacct_gather_cgroup.so [2023-10-20T17:41:50.266] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Job accounting gather cgroup plugin type:jobacct_gather/cgroup version:0x170206 [2023-10-20T17:41:50.266] debug3: cgroup/v1: xcgroup_create_slurm_cg: slurm cgroup /slurm successfully created for ns memory [2023-10-20T17:41:50.266] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'memory.use_hierarchy' set to '1' for '/sys/fs/cgroup/memory' [2023-10-20T17:41:50.267] debug3: cgroup/v1: xcgroup_create_slurm_cg: slurm cgroup /slurm successfully created for ns cpuacct [2023-10-20T17:41:50.267] debug: jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded [2023-10-20T17:41:50.267] debug3: Success. [2023-10-20T17:41:50.267] debug3: Trying to load plugin /usr/local/lib/slurm/job_container_none.so [2023-10-20T17:41:50.267] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:job_container none plugin type:job_container/none version:0x170206 [2023-10-20T17:41:50.267] debug: job_container/none: init: job_container none plugin loaded [2023-10-20T17:41:50.267] debug3: Success. [2023-10-20T17:41:50.267] debug3: Trying to load plugin /usr/local/lib/slurm/prep_script.so [2023-10-20T17:41:50.267] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Script PrEp plugin type:prep/script version:0x170206 [2023-10-20T17:41:50.267] debug3: Success. [2023-10-20T17:41:50.267] debug3: Trying to load plugin /usr/local/lib/slurm/core_spec_none.so [2023-10-20T17:41:50.267] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Null core specialization plugin type:core_spec/none version:0x170206 [2023-10-20T17:41:50.267] debug3: Success. [2023-10-20T17:41:50.267] debug3: Trying to load plugin /usr/local/lib/slurm/switch_none.so [2023-10-20T17:41:50.268] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:switch NONE plugin type:switch/none version:0x170206 [2023-10-20T17:41:50.268] debug: switch/none: init: switch NONE plugin loaded [2023-10-20T17:41:50.268] debug3: Success. [2023-10-20T17:41:50.268] debug3: Trying to load plugin /usr/local/lib/slurm/switch_cray_aries.so [2023-10-20T17:41:50.268] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:switch Cray/Aries plugin type:switch/cray_aries version:0x170206 [2023-10-20T17:41:50.268] debug: switch Cray/Aries plugin loaded. [2023-10-20T17:41:50.268] debug3: Success. [2023-10-20T17:41:50.268] debug: MPI: Loading all types [2023-10-20T17:41:50.268] debug3: Trying to load plugin /usr/local/lib/slurm/mpi_cray_shasta.so [2023-10-20T17:41:50.268] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:mpi Cray Shasta plugin type:mpi/cray_shasta version:0x170206 [2023-10-20T17:41:50.268] debug3: Success. [2023-10-20T17:41:50.268] debug3: Trying to load plugin /usr/local/lib/slurm/mpi_pmi2.so [2023-10-20T17:41:50.268] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:mpi PMI2 plugin type:mpi/pmi2 version:0x170206 [2023-10-20T17:41:50.268] debug3: Success. [2023-10-20T17:41:50.268] debug3: Trying to load plugin /usr/local/lib/slurm/mpi_none.so [2023-10-20T17:41:50.268] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:mpi none plugin type:mpi/none version:0x170206 [2023-10-20T17:41:50.268] debug3: Success. [2023-10-20T17:41:50.268] debug2: No mpi.conf file (/etc/slurm/mpi.conf) [2023-10-20T17:41:50.268] debug3: Successfully opened slurm listen port 6818 [2023-10-20T17:41:50.268] slurmd started on Fri, 20 Oct 2023 17:41:50 +0000 [2023-10-20T17:41:50.269] CPUs=112 Boards=1 Sockets=2 Cores=28 Threads=2 Memory=772637 TmpDisk=1855467 Uptime=77781 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null) [2023-10-20T17:41:50.270] debug: _handle_node_reg_resp: slurmctld sent back 11 TRES. [2023-10-20T17:41:50.270] debug3: _registration_engine complete [2023-10-20T17:44:19.623] debug3: in the service_connection [2023-10-20T17:44:19.623] debug2: Start processing RPC: REQUEST_PING [2023-10-20T17:44:19.623] debug2: Processing RPC: REQUEST_PING [2023-10-20T17:44:19.624] debug2: Finish processing RPC: REQUEST_PING

No jobs are being run on cpu02, so the resources are available. The job server just doesn't seem to be communicating with the compute node when a GPU is requested.

Any help figuring this out would be appreciated.

UCBKurt · Accepted Answer · 2023-10-20 22:49:12Z

0

The fix is to remove the typecast of the GPU (Type=rtx_4000_sff).

edited Oct 20, 2023 at 22:49

answered Oct 20, 2023 at 20:02

UCBKurt

12 bronze badges

You mean removing Type=rtx_4000_sff from gres.conf file on the compute node, right?

U880D
– U880D

2023-10-20 20:54:16 +00:00
Commented Oct 20, 2023 at 20:54
@U880D Correct, Removing the Type= argument

UCBKurt
– UCBKurt

2023-10-20 22:48:55 +00:00
Commented Oct 20, 2023 at 22:48

Add a comment |

Stack Exchange Network

SLURM able to send CPU, but not GPU jobs

1 Answer 1

You must log in to answer this question.

Hot Network Questions

SLURM able to send CPU, but not GPU jobs

1 Answer 1

You must log in to answer this question.

Related

Hot Network Questions