Add H3 #4
@@ -0,0 +1,52 @@
# SC21 Tutorial: Efficient Distributed GPU Programming for Exascale

- Time: Sunday, 14 November 2021, 8AM - 5PM CST
- Location: *online*
- Program Link: https://sc21.supercomputing.org/presentation/?id=tut138&sess=sess188


## Hands-On 5: Multi-GPU parallelization with CUDA-aware MPI

## Task 1: Parallelize the Jacobi solver for multiple GPUs using CUDA-aware MPI

#### Description

The purpose of this task is to use CUDA-aware MPI to parallelize a Jacobi solver. The starting point is a skeleton `jacobi.cu`, in which the CUDA kernel and some basic setup functions are already defined.
There is also a single-GPU version against which the performance and numerical results are compared.
Take some time to get familiar with the code. Some functions (like the NVTX calls) will be explained in the next tutorial and can be ignored for now (e.g. the `PUSH` and `POP` macros).
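
The `PUSH` and `POP` macros are thin wrappers around the NVTX range API that the Makefile pulls in via `-DUSE_NVTX` and `-lnvToolsExt`. A minimal sketch of what such macros typically look like (the exact definitions in `jacobi.cu` may differ):

``` {.cpp}
#ifdef USE_NVTX
#include <nvToolsExt.h>
// Open a named range that shows up on the profiler timeline, and close it again.
#define PUSH(name) nvtxRangePushA(name)
#define POP nvtxRangePop()
#else
// Without NVTX the annotations compile away to nothing.
#define PUSH(name)
#define POP
#endif
```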
Once you are familiar with the code, work on the `TODO`s in `jacobi.cu` (illustrative sketches of the individual steps follow after the list):

- Initialize the MPI application
  - Include the MPI header file
  - Determine the local rank and the number of MPI processes
  - Query the number of GPUs visible to the calling process
  - Use a local communicator to assign one GPU to each MPI process
  - Finalize MPI at the end of the application
- Compute the 1-D domain decomposition
  - Compute the local chunk size to distribute the (ny-2) lines among the processes
    - In case `(ny-2)%size != 0`, the last process should calculate the remaining rows
  - Determine the global (`iy_start_global`, `iy_end_global`) and local (`iy_start`, `iy_end`) start and end points in the 2-dimensional grid
- Use MPI to exchange the boundaries
  - Compute the top and the bottom neighbor
    - We use reflecting/periodic boundaries on top and bottom, so the top neighbor of rank 0 is rank (size-1) and the bottom neighbor of rank (size-1) is rank 0
  - Use `MPI_Sendrecv` to exchange data between the neighbors
    - Use the self-defined `MPI_REAL_TYPE`; this allows an easy switch between single and double precision
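
The sketches below are illustrative only; variable names that do not appear in the skeleton (such as `local_comm`, `local_rank`, `num_devices`) are assumptions, and the exact placement is given by the `TODO` markers in `jacobi.cu`. First, MPI initialization and the per-node GPU assignment via a local communicator:

``` {.cpp}
#include <mpi.h>

MPI_Init(&argc, &argv);

int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

// Split MPI_COMM_WORLD into per-node communicators so that ranks sharing a
// node can be mapped to distinct GPUs.
MPI_Comm local_comm;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
                    &local_comm);
int local_rank;
MPI_Comm_rank(local_comm, &local_rank);
MPI_Comm_free(&local_comm);

// One GPU per process: query the visible devices and pick one by local rank.
int num_devices = 0;
cudaGetDeviceCount(&num_devices);
cudaSetDevice(local_rank % num_devices);

/* ... Jacobi solver ... */

MPI_Finalize();
```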
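Next, one way to write the naive 1-D decomposition of the (ny-2) interior rows, where the last rank picks up the remainder (the index conventions in the skeleton may differ):

``` {.cpp}
// Each rank gets (ny - 2) / size rows; the remainder goes to the last rank.
int chunk_size = (ny - 2) / size;

// Global row range of this rank (+1 because row 0 is the fixed boundary row).
int iy_start_global = rank * chunk_size + 1;
int iy_end_global   = iy_start_global + chunk_size - 1;
if (rank == size - 1) iy_end_global += (ny - 2) % size;

// Local row range: local row 0 and local row (iy_end + 1) are halo rows
// that are filled by the boundary exchange.
int iy_start = 1;
int iy_end   = iy_end_global - iy_start_global + 1;
```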
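Finally, the boundary exchange with `MPI_Sendrecv` and the self-defined `MPI_REAL_TYPE`. The buffer name `a_new`, the row width `nx`, and the exact halo offsets are assumptions about the skeleton's layout; with CUDA-aware MPI, `a_new` can be a device pointer:

``` {.cpp}
// Periodic boundaries: rank 0 wraps to (size - 1) at the top and vice versa.
const int top    = rank > 0 ? rank - 1 : size - 1;
const int bottom = rank < size - 1 ? rank + 1 : 0;

// Send the first interior row up and receive the bottom halo row from below.
MPI_Sendrecv(a_new + iy_start * nx, nx, MPI_REAL_TYPE, top, 0,
             a_new + (iy_end + 1) * nx, nx, MPI_REAL_TYPE, bottom, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
// Send the last interior row down and receive the top halo row from above.
MPI_Sendrecv(a_new + iy_end * nx, nx, MPI_REAL_TYPE, bottom, 0,
             a_new + (iy_start - 1) * nx, nx, MPI_REAL_TYPE, top, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
```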

Compile with

``` {.bash}
make
```

Submit your compiled application to the batch system with

``` {.bash}
make run
```

## Task 2: Optimize load balancing

- The work distribution of the first task is not ideal, because the process with the last rank may have to compute significantly more rows than all the others. In this task the load distribution is therefore optimized (see the sketch after this list).
- Compute the `chunk_size` such that each rank gets either (ny - 2) / size or (ny - 2) / size + 1 rows.
- Compute how many processes get (ny - 2) / size rows and how many get (ny - 2) / size + 1 rows.
- Adapt the computation of `iy_start_global`.
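
A sketch of one way to compute this balanced distribution (variable names other than `chunk_size` and `iy_start_global` are illustrative; `rank`, `size`, and `ny` come from the MPI setup and grid definition):

``` {.cpp}
// Balanced split: the first 'rem' ranks get one extra row, so all chunk
// sizes differ by at most one.
int base = (ny - 2) / size;
int rem  = (ny - 2) % size;
int chunk_size = base + (rank < rem ? 1 : 0);

// Ranks before this one contribute 'base' rows each, plus one extra row
// for each of them that lies among the first 'rem' ranks.
int iy_start_global = rank * base + (rank < rem ? rank : rem) + 1;
int iy_end_global   = iy_start_global + chunk_size - 1;
```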
@@ -0,0 +1,40 @@
# Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
NP ?= 4
NVCC=nvcc
JSC_SUBMIT_CMD ?= srun --gres=gpu:4 --ntasks-per-node 4
CUDA_HOME ?= /usr/local/cuda
ifndef MPI_HOME
$(error MPI_HOME is not set)
endif
GENCODE_SM30 := -gencode arch=compute_30,code=sm_30
GENCODE_SM35 := -gencode arch=compute_35,code=sm_35
GENCODE_SM37 := -gencode arch=compute_37,code=sm_37
GENCODE_SM50 := -gencode arch=compute_50,code=sm_50
GENCODE_SM52 := -gencode arch=compute_52,code=sm_52
GENCODE_SM60 := -gencode arch=compute_60,code=sm_60
GENCODE_SM70 := -gencode arch=compute_70,code=sm_70
GENCODE_SM80 := -gencode arch=compute_80,code=sm_80 -gencode arch=compute_80,code=compute_80
GENCODE_FLAGS := $(GENCODE_SM70) $(GENCODE_SM80)
ifdef DISABLE_CUB
NVCC_FLAGS = -Xptxas --optimize-float-atomics
else
NVCC_FLAGS = -DHAVE_CUB
endif
NVCC_FLAGS += -dc -Xcompiler -fopenmp -lineinfo -DUSE_NVTX -lnvToolsExt $(GENCODE_FLAGS) -std=c++14 -I$(MPI_HOME)/include
NVCC_LDFLAGS = -ccbin=mpic++ -L$(NVSHMEM_HOME) -L$(MPI_HOME)/lib -lmpi -L$(CUDA_HOME)/lib64 -lcuda -lcudart -lnvToolsExt
jacobi: Makefile jacobi.cu
	$(NVCC) $(NVCC_FLAGS) jacobi.cu -c -o jacobi.o
	$(NVCC) $(GENCODE_FLAGS) jacobi.o -o jacobi $(NVCC_LDFLAGS)

.PHONY: clean
clean:
	rm -f jacobi jacobi.o *.nsys-rep jacobi.*.compute-sanitizer.log

sanitize: jacobi
	$(JSC_SUBMIT_CMD) -n $(NP) compute-sanitizer --log-file jacobi.%q{SLURM_PROCID}.compute-sanitizer.log ./jacobi -niter 10

run: jacobi
	$(JSC_SUBMIT_CMD) -n $(NP) ./jacobi

profile: jacobi
	$(JSC_SUBMIT_CMD) -n $(NP) nsys profile --trace=mpi,cuda,nvtx -o jacobi.%q{SLURM_PROCID} ./jacobi -niter 10
@@ -0,0 +1,51 @@
#!/usr/bin/make -f
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
TASKDIR = ../../tasks/H3-Multi-GPU-parallelization
SOLUTIONDIR = ../../solutions/H3-Multi-GPU-parallelization
OPT_SOLUTIONDIR = ../../solutions/H3-Multi-GPU-parallelization_opt

PROCESSFILES = jacobi.cu
COPYFILES = Makefile Instructions.ipynb Instructions.md


TASKPROCCESFILES = $(addprefix $(TASKDIR)/,$(PROCESSFILES))
TASKCOPYFILES = $(addprefix $(TASKDIR)/,$(COPYFILES))
SOLUTIONPROCCESFILES = $(addprefix $(SOLUTIONDIR)/,$(PROCESSFILES))
OPT_SOLUTIONPROCCESFILES = $(addprefix $(OPT_SOLUTIONDIR)/,$(PROCESSFILES))
SOLUTIONCOPYFILES = $(addprefix $(SOLUTIONDIR)/,$(COPYFILES))
OPT_SOLUTIONCOPYFILES = $(addprefix $(OPT_SOLUTIONDIR)/,$(COPYFILES))


.PHONY: all task
all: task
task: ${TASKPROCCESFILES} ${TASKCOPYFILES} ${SOLUTIONPROCCESFILES} ${SOLUTIONCOPYFILES} ${OPT_SOLUTIONPROCCESFILES} ${OPT_SOLUTIONCOPYFILES}


${TASKPROCCESFILES}: $(PROCESSFILES)
	mkdir -p $(TASKDIR)/
	cppp -USOLUTION -USOLUTION_OPT $(notdir $@) $@

${SOLUTIONPROCCESFILES}: $(PROCESSFILES)
	mkdir -p $(SOLUTIONDIR)/
	cppp -DSOLUTION -USOLUTION_OPT $(notdir $@) $@

${OPT_SOLUTIONPROCCESFILES}: $(PROCESSFILES)
	mkdir -p $(OPT_SOLUTIONDIR)/
	cppp -DSOLUTION -DSOLUTION_OPT $(notdir $@) $@

${TASKCOPYFILES}: $(COPYFILES)
	mkdir -p $(TASKDIR)/
	cp $(notdir $@) $@

${SOLUTIONCOPYFILES}: $(COPYFILES)
	mkdir -p $(SOLUTIONDIR)/
	cp $(notdir $@) $@

${OPT_SOLUTIONCOPYFILES}: $(COPYFILES)
	mkdir -p $(OPT_SOLUTIONDIR)/
	cp $(notdir $@) $@

%.ipynb: %.md
	pandoc $< -o $@
	# add metadata so this is seen as python
	jq -s '.[0] * .[1]' $@ ../template.json | sponge $@