Commit e54d8a1

Merge pull request #1 from jirikraus/10-H_Device-initiated_Communication_with_NVSHMEM
Added Hands-On Task 10: Device-initiated Communication with NVSHMEM
2 parents 778447c + 7e6b738 commit e54d8a1

4 files changed: +746 -0 lines changed

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
# SC21 Tutorial: Efficient Distributed GPU Programming for Exascale

- Time: Sunday, 14 November 2021 8AM - 5PM CST
- Location: *online*
- Program Link: https://sc21.supercomputing.org/presentation/?id=tut138&sess=sess188


## Hands-On 10: Device-initiated Communication with NVSHMEM

### Task 0: Using NVSHMEM device API

#### Description

The purpose of this task is to use the NVSHMEM device API instead of MPI to implement a multi-GPU Jacobi solver. The starting point of this task is the MPI variant of the Jacobi solver. You need to work on the `TODOs` in `jacobi.cu`:

- Initialize NVSHMEM (same as in Hands-On 8-H):
- Include NVSHMEM headers.
- Initialize and shut down NVSHMEM using `MPI_COMM_WORLD`.
- Allocate work arrays `a` and `a_new` from the NVSHMEM symmetric heap. Take care to pass in a consistent size!
- Calculate halo/boundary row index of top and bottom neighbors.
- Add necessary inter-PE synchronization.
- Modify `jacobi_kernel`
- Pass in halo/boundary row index of top and bottom neighbors.
- Use `nvshmem_float_p` to directly push values needed by top and bottom neighbors from the kernel.
- Remove no longer needed MPI communication.
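
The index bookkeeping in the list above can be checked on the host before touching the kernel. The following C++ sketch is not taken from `jacobi.cu`. It assumes a row-wise split of the `ny - 2` interior rows across PEs, periodic top/bottom neighbors, and a local array layout of one top halo row, the owned rows, and one bottom halo row. The names `Layout`, `make_layout` and `symmetric_elems` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of jacobi.cu.
// Local array layout assumed per PE: row 0 = top halo,
// rows [iy_start, iy_end) = owned rows, row iy_end = bottom halo.
struct Layout {
    int chunk_size;       // interior rows owned by this PE
    int chunk_size_high;  // upper bound on the chunk of any PE
    int top_pe;           // PE owning the rows above ours (periodic)
    int bottom_pe;        // PE owning the rows below ours (periodic)
    int iy_start;         // first owned row (local index)
    int iy_end;           // one past the last owned row (local index)
    int top_halo_row;     // row on top_pe that receives our first owned row
    int bottom_halo_row;  // row on bottom_pe that receives our last owned row
};

Layout make_layout(int ny, int mype, int npes) {
    assert(npes > 0 && mype >= 0 && mype < npes && ny - 2 >= npes);
    Layout l{};
    const int chunk_size_low = (ny - 2) / npes;
    l.chunk_size_high = chunk_size_low + 1;
    // Number of PEs (the first ones) that get the smaller chunk.
    const int num_low = npes * chunk_size_low + npes - (ny - 2);
    auto chunk_of = [&](int pe) {
        return pe < num_low ? chunk_size_low : chunk_size_low + 1;
    };
    l.chunk_size = chunk_of(mype);
    l.top_pe = mype > 0 ? mype - 1 : npes - 1;
    l.bottom_pe = (mype + 1) % npes;
    l.iy_start = 1;
    l.iy_end = l.iy_start + l.chunk_size;
    // Our first row lands in the bottom halo of the top neighbor,
    // whose position depends on the neighbor's chunk size, not ours.
    l.top_halo_row = chunk_of(l.top_pe) + 1;
    // Our last row lands in the top halo of the bottom neighbor.
    l.bottom_halo_row = 0;
    return l;
}

// Element count to request from the symmetric heap. It is derived from
// chunk_size_high, so every PE passes the same value even though the
// chunks differ in size.
std::size_t symmetric_elems(int nx, const Layout& l) {
    return static_cast<std::size_t>(nx) *
           static_cast<std::size_t>(l.chunk_size_high + 2);
}
```

Under these assumptions, the kernel thread that updates column `ix` of row `iy_start` would push its new value with `nvshmem_float_p(a_new + top_halo_row * nx + ix, new_val, top_pe)`, and the thread updating row `iy_end - 1` would do the same with `bottom_halo_row` and `bottom_pe`. Sizing the allocation from `chunk_size_high` rather than the local `chunk_size` is what keeps the size consistent across PEs.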
26+
27+
Compile with
28+
29+
``` {.bash}
30+
make
31+
```
32+
33+
Submit your compiled application to the batch system with
34+
35+
``` {.bash}
36+
make run
37+
```
38+
39+
Study the performance by glimpsing at the profile generated with
40+
`make profile`. For `make run` and `make profile` the environment variable `NP` can be set to change the number of processes.
41+
### Task 1: Use `nvshmemx_float_put_nbi_block`

#### Description

This is an optional task: use `nvshmemx_float_put_nbi_block` instead of `nvshmem_float_p` for more efficient multi-node execution. There are no TODOs prepared. Use the solution of Task 0 as a starting point. Some tips:

- You only need to change `jacobi_kernel`.
- Switching to a 1-dimensional CUDA block can simplify the task.
- The difficult part is calculating the right offsets and size for the call to `nvshmemx_float_put_nbi_block`.
- If a CUDA block needs to communicate data with `nvshmemx_float_put_nbi_block`, all threads in that block need to call `nvshmemx_float_put_nbi_block`.
- The [`nvshmem_opt`](https://github.com/NVIDIA/multi-gpu-programming-models/blob/master/nvshmem_opt/jacobi.cu#L154) variant in the [Multi GPU Programming Models GitHub repository](https://github.com/NVIDIA/multi-gpu-programming-models) implements the same strategy.
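
The offset and size calculation mentioned in the tips can be prototyped on the host. This is a minimal C++ sketch, not the code from `nvshmem_opt`. It assumes a 1-dimensional thread block along x, interior columns `1 .. nx - 2`, and that block `b` starts at column `b * blockDim.x + 1`. `PutArgs` and `block_put_args` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: arguments one thread block would pass to
// nvshmemx_float_put_nbi_block when sending its part of a boundary row.
struct PutArgs {
    std::size_t src_offset;  // element offset of the source in the local array
    std::size_t dst_offset;  // element offset of the destination on the remote PE
    std::size_t nelems;      // number of floats this block transfers
};

PutArgs block_put_args(int nx, int block_idx_x, int block_dim_x,
                       int src_row, int dst_row) {
    assert(nx >= 2 && block_dim_x > 0 && block_idx_x >= 0);
    // First interior column handled by this block (column 0 is a boundary).
    const int block_ix = block_idx_x * block_dim_x + 1;
    // Interior columns left from block_ix up to, but excluding, column nx - 1.
    const int remaining = nx - 1 - block_ix;
    // The last block may be only partially filled, or empty.
    const int nelems = std::max(0, std::min(block_dim_x, remaining));
    return {static_cast<std::size_t>(src_row) * nx + block_ix,
            static_cast<std::size_t>(dst_row) * nx + block_ix,
            static_cast<std::size_t>(nelems)};
}
```

In the kernel, every thread of a block that owns part of a boundary row would then call `nvshmemx_float_put_nbi_block(a_new + dst_offset, a_new + src_offset, nelems, pe)` with identical arguments, after the block has written its values to `a_new`. Here `src_row` is the boundary row being sent and `dst_row` is the halo row index on the neighboring PE. Clamping `nelems` handles the case where `nx - 2` is not a multiple of the block width.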
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
NP ?= 4
NVCC=nvcc
JSC_SUBMIT_CMD ?= srun --gres=gpu:4 --ntasks-per-node 4
CUDA_HOME ?= /usr/local/cuda
ifndef NVSHMEM_HOME
$(error NVSHMEM_HOME is not set)
endif
ifndef MPI_HOME
$(error MPI_HOME is not set)
endif
GENCODE_SM30:= -gencode arch=compute_30,code=sm_30
GENCODE_SM35:= -gencode arch=compute_35,code=sm_35
GENCODE_SM37:= -gencode arch=compute_37,code=sm_37
GENCODE_SM50:= -gencode arch=compute_50,code=sm_50
GENCODE_SM52:= -gencode arch=compute_52,code=sm_52
GENCODE_SM60 := -gencode arch=compute_60,code=sm_60
GENCODE_SM70 := -gencode arch=compute_70,code=sm_70
GENCODE_SM80 := -gencode arch=compute_80,code=sm_80 -gencode arch=compute_80,code=compute_80
GENCODE_FLAGS:= $(GENCODE_SM70) $(GENCODE_SM80)
ifdef DISABLE_CUB
NVCC_FLAGS = -Xptxas --optimize-float-atomics
else
NVCC_FLAGS = -DHAVE_CUB
endif
NVCC_FLAGS += -dc -Xcompiler -fopenmp -lineinfo -DUSE_NVTX -lnvToolsExt $(GENCODE_FLAGS) -std=c++14 -I$(NVSHMEM_HOME)/include -I$(MPI_HOME)/include
NVCC_LDFLAGS = -ccbin=mpic++ -L$(NVSHMEM_HOME)/lib -lnvshmem -L$(MPI_HOME)/lib -lmpi -L$(CUDA_HOME)/lib64 -lcuda -lcudart -lnvToolsExt
jacobi: Makefile jacobi.cu
	$(NVCC) $(NVCC_FLAGS) jacobi.cu -c -o jacobi.o
	$(NVCC) $(GENCODE_FLAGS) jacobi.o -o jacobi $(NVCC_LDFLAGS)

.PHONY: clean
clean:
	rm -f jacobi jacobi.o *.nsys-rep jacobi.*.compute-sanitizer.log

sanitize: jacobi
	$(JSC_SUBMIT_CMD) -n $(NP) compute-sanitizer --log-file jacobi.%q{SLURM_PROCID}.compute-sanitizer.log ./jacobi -niter 10

run: jacobi
	$(JSC_SUBMIT_CMD) -n $(NP) ./jacobi

profile: jacobi
	$(JSC_SUBMIT_CMD) -n $(NP) nsys profile --trace=mpi,cuda,nvtx -o jacobi.%q{SLURM_PROCID} ./jacobi -niter 10
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
#!/usr/bin/make -f
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
TASKDIR = ../../tasks/10-H_Device-initiated_Communication_with_NVSHMEM
SOLUTIONDIR = ../../solutions/10-H_Device-initiated_Communication_with_NVSHMEM

PROCESSFILES = jacobi.cu
COPYFILES = Makefile Instructions.ipynb Instructions.md


TASKPROCCESFILES = $(addprefix $(TASKDIR)/,$(PROCESSFILES))
TASKCOPYFILES = $(addprefix $(TASKDIR)/,$(COPYFILES))
SOLUTIONPROCCESFILES = $(addprefix $(SOLUTIONDIR)/,$(PROCESSFILES))
SOLUTIONCOPYFILES = $(addprefix $(SOLUTIONDIR)/,$(COPYFILES))

.PHONY: all task
all: task
task: ${TASKPROCCESFILES} ${TASKCOPYFILES} ${SOLUTIONPROCCESFILES} ${SOLUTIONCOPYFILES}


${TASKPROCCESFILES}: $(PROCESSFILES)
	mkdir -p $(TASKDIR)/
	cppp -USOLUTION $(notdir $@) $@

${SOLUTIONPROCCESFILES}: $(PROCESSFILES)
	mkdir -p $(SOLUTIONDIR)/
	cppp -DSOLUTION $(notdir $@) $@


${TASKCOPYFILES}: $(COPYFILES)
	mkdir -p $(TASKDIR)/
	cp $(notdir $@) $@

${SOLUTIONCOPYFILES}: $(COPYFILES)
	mkdir -p $(SOLUTIONDIR)/
	cp $(notdir $@) $@

%.ipynb: %.md
	pandoc $< -o $@
	# add metadata so this is seen as python
	jq -s '.[0] * .[1]' $@ ../template.json | sponge $@
