KAUST Supercomputing Laboratory
King Abdullah University of Science and Technology

Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II

George Markomanolis
Computational Scientist
June 23rd, 2016

Outline
❖  Introduction
❖  Test case
❖  Reveal

Introduction - Components of CrayPat
❖  Module perftools-base
  •  pat_build – Instruments the program to be analyzed.
  •  pat_report – Generates text reports from the performance data captured during program execution and exports data for use in other programs.
  •  Cray Apprentice2 – A graphical analysis tool that can be used to visualize and explore the performance data captured during program execution.
  •  Reveal – A graphical source code analysis tool that can be used to correlate performance analysis data with annotated source code listings, to identify key opportunities for optimization (it works only with the Cray compiler).

Case study
❖  Application from the seismic group, related to an acoustic wave solver
  •  Why this application? A user asked for it
  •  MPI application
  •  Tested on 3 nodes with 96 cores in total on Shaheen II

Prepare for the tutorial
  •  Connect to Shaheen II and copy the material:
    §  ssh -X username@shaheen.kaust.edu.sa
    §  cp /scratch/tmp/model_reveal.tgz .
    §  tar zxvf model_reveal.tgz
    §  cd model_reveal
  •  Reservation name: k1056_141

Reveal
A tool to port your application to the OpenMP programming model

Reveal
❖  Reveal is Cray’s next-generation integrated performance analysis and code optimization tool.
  •  Source code navigation using whole program analysis (data provided by the Cray compilation environment only)
  •  Coupling with performance data collected during execution by CrayPAT, to understand which high-level serial loops could benefit from parallelism
  •  Enhanced loopmark listing functionality
  •  Dependency information for targeted loops
  •  Assists users in optimizing code by providing variable scoping feedback and suggested compiler directives

Prepare for Reveal
❖  Load Perftools
  •  module unload darshan
  •  module load perftools-base/6.3.2
  •  module load perftools/6.3.2
❖  Execute the MPI version
  •  cd model_reveal
  •  make clean
  •  make
  •  In the submit.sh file, change the account to your account number and submit the job
    §  sbatch submit.sh
  •  tail -n 10 testdata.XXX.err
    §  Reported time: 1m46.361s
❖  Reservation: k1056_141

Prepare the application for Reveal
❖  Compile the version for the Reveal tool
  •  make clean -f Makefile_reveal
  •  In the Makefile_reveal file:
    §  $(CC) -h profile_generate -hpl=data.pl -h noomp $< -o $@ $(CFLAGS)
    §  ${CC} -h profile_generate -hpl=data.pl -h noomp -c $< CrayData.c
    §  Reveal needs the object files, so modify the Makefile if needed
  •  make -f Makefile_reveal
  •  The folder data.pl is created in the current folder
  •  Instrument your application
    §  pat_build -w CrayData.exe
    §  The new executable is called CrayData.exe+pat; use it in submit.sh in place of the original executable

Submit the job for the Reveal tool
❖  Submit your job script and do not forget the reservation name (--reservation=…)
  •  sbatch submit.sh
❖  A performance file (extension .xf) is created; if not, something went wrong in the previous steps
❖  Generate the report and the ap2 file
  •  pat_report -o report.txt CrayData.exe+pat+58072-37t.xf
❖  Execute Reveal
  •  reveal data.pl CrayData.exe+pat+58072-37t.ap2

Reveal – Loop Performance
Reveal – Scoping
Reveal – Program View
Reveal – Function View
Reveal – Array View
Reveal – Compiler Messages
Reveal – Loop Performance
Reveal – Scoping Tool
Reveal – Scoping Results
Reveal – OpenMP pragmas
Reveal – Inserted OpenMP pragmas

Clean the code from unresolved issues and observe OpenMP pragmas
❖  vim CrayData.c
❖  Remove the lines containing "unresolved", but only if you are sure.

    #pragma omp parallel for default(none) private (i1,i2,u) shared (nxpad,nzpad)

    #pragma omp parallel for default(none) private (ix,ib,ibz) shared (nxpad,nb,nzpad,bndr,p0) lastprivate (w)

Check an OpenMP pragma and its validation

    #pragma omp parallel for default(none) private (ix,ib,ibz) shared (nxpad,nb,nzpad,bndr,p0) lastprivate (w)
    for(ix=0; ix<nxpad; ix++) {
        for(ib=0; ib<nb; ib++) {
            w = bndr[nb-ib-1];
            ibz = nzpad-ib-1;
            p0[ix][ib ] *= w; /* top sponge */
            p0[ix][ibz] *= w; /* bottom sponge */
        }
    }
    for(ib=0; ib<nb; ib++) {
        ibx = nxpad-ib-1;
        for(iz=0; iz<nzpad; iz++) {
            p0[ib ][iz] *= w; /* left sponge */
            p0[ibx][iz] *= w; /* right sponge */
        }
    }

Clean the code from unresolved issues, compile and run
❖  vim CrayData.c
❖  Remove the lines containing "unresolved", but only if you are sure.
❖  Compile your application with MPI and OpenMP
  •  make -f Makefile_omp
  •  The new executable is called CrayData_omp.exe
  •  Comment out the active srun line in submit.sh and uncomment the next srun call.
  •  Also uncomment the line with OMP_NUM_THREADS=2
  •  Now we will execute the application with 48 MPI processes (ntasks) and 2 threads per MPI process (cpus-per-task)
  •  srun --ntasks=48 --ntasks-per-node=16 --ntasks-per-socket=8 --hint=nomultithread --cpus-per-task=2 ./CrayData_omp.exe

Different cases and results
❖  Results for 2 threads
  •  Change accordingly:
    §  export OMP_NUM_THREADS=2
    §  srun --ntasks=48 --ntasks-per-node=16 --ntasks-per-socket=8 --hint=nomultithread --cpus-per-task=2 ./CrayData_omp.exe
  •  51.211s (2.86X)
❖  Results for 4 threads
  •  Change accordingly:
    §  export OMP_NUM_THREADS=4
    §  srun --ntasks=24 --ntasks-per-node=8 --ntasks-per-socket=4 --hint=nomultithread --cpus-per-task=4 ./CrayData_omp.exe
  •  24.815s (5.9X)

Different cases and results
❖  Results for 8 threads
  •  12.222s (11.98X)
❖  Results for 16 threads
  •  Change accordingly:
    §  export OMP_NUM_THREADS=16
    §  srun --ntasks=6 --ntasks-per-node=2 --ntasks-per-socket=1 --hint=nomultithread --cpus-per-task=16 ./CrayData_omp.exe
  •  8.895s (16.45X)

The original version was improved 19.19 times
[Bar chart: Execution time, Time (in sec.)]
  •  Original version: 170.67
  •  Optimized MPI version: 106.36
  •  MPI+OpenMP: 8.895

Validation
[Figure panels: Original version; Optimized MPI+OpenMP]

Summary
❖  Reveal is an easy-to-use tool
❖  The user should be careful, though, and pay attention to the compiler messages
❖  You can achieve a large speedup with this tool
❖  We need to investigate more complicated applications
