Multiple failures running collective-big-count tests with OMPI main branch and 'han' collective component #10187

@drwootton

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

OpenMPI main branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from current main branch (3/22/22)

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

git submodule status
1b86a35db2816ee9c0f3a41988005a2ba7d29adb 3rd-party/openpmix (v1.1.3-3481-g1b86a35d)
91f791e209ccbdfb4b8647900d292ef51d52f37d 3rd-party/prrte (psrvr-v2.0.0rc1-4319-g91f791e2)

Please describe the system on which you are running

  • Operating system/version: RHEL 8.4
  • Computer hardware: Single Power8 node
  • Network type: Localhost

Details of the problem

I ran the set of self-checking tests from ompi-tests-public/collective-big-count with collective components specified as --mca coll_han_priority 100 --mca coll han,basic,sm,self,inter,libnbc

The following testcases had failures. The remaining testcases were successful:

  • test_allgather_uniform_count
  • test_alltoall_uniform_count
  • test_gather_uniform_count
  • test_scatter_uniform_count

The tests were compiled by running make in the directory containing the source files

The following environment variables were set for all tests:

BIGCOUNT_HOSTS : -np 3
BIGCOUNT_MEMORY_PERCENT : 70
BIGCOUNT_MEMORY_DIFF : 10

The following command fails with a self-check error message indicating invalid data, followed by a SIGSEGV.

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll_han_priority 100 --mca coll han,basic,sm,self,inter,libnbc /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count 

The error and traceback are:

Results from MPI_Iallgather(int x 6442450941 = 25769803764 or 24.0 GB): MPI_IN_PLACE
Rank 1: ERROR: DI in 2147483647 of 6442450941 slots ( 33.3 % wrong)
Rank 0: ERROR: DI in 2147483647 of 6442450941 slots ( 33.3 % wrong)
Rank 2: PASSED
_____________________
Adjust count to fit in memory: 2147483647 x 60.0% = 1288490188
Root : payload 61847529024 57.6 GB = 16 dt x 1288490188 count x 3 peers x 1.0 inflation
Peer : payload 61847529024 57.6 GB = 16 dt x 1288490188 count x 3 peers x 1.0 inflation
Total : payload 185542587072 172.8 GB = 57.6 GB root + 57.6 GB x 2 local peers
_____________________
Results from MPI_Iallgather(double _Complex x 3865470564 = 61847529024 or 57.6 GB): MPI_IN_PLACE
[c656f6n01:1747159] *** Process received signal ***
[c656f6n01:1747159] Signal: Segmentation fault (11)
[c656f6n01:1747159] Signal code: Address not mapped (1)
[c656f6n01:1747159] Failing at address: 0x1ff9a2999990
[c656f6n01:1747159] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:1747159] [ 1] /usr/lib64/libc.so.6(+0xb083c)[0x20000074083c]
[c656f6n01:1747159] [ 2] /u/dwootton/ompi-master/lib/libmpi.so.0(+0x3f85e8)[0x2000004785e8]
[c656f6n01:1747159] [ 3] /u/dwootton/ompi-master/lib/libmpi.so.0(mca_pml_ob1_send_request_schedule_once+0x2a4)[0x20000047c034]
[c656f6n01:1747159] [ 4] /u/dwootton/ompi-master/lib/libmpi.so.0(+0x3ea700)[0x20000046a700]
[c656f6n01:1747159] [ 5] /u/dwootton/ompi-master/lib/libmpi.so.0(+0x3ea7c0)[0x20000046a7c0]
[c656f6n01:1747159] [ 6] /u/dwootton/ompi-master/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_ack+0x2e4)[0x20000046cca8]
[c656f6n01:1747159] [ 7] /u/dwootton/ompi-master/lib/libopen-pal.so.0(mca_btl_sm_poll_handle_frag+0x200)[0x2000009f076c]
[c656f6n01:1747159] [ 8] /u/dwootton/ompi-master/lib/libopen-pal.so.0(+0xf0890)[0x2000009f0890]
[c656f6n01:1747159] [ 9] /u/dwootton/ompi-master/lib/libopen-pal.so.0(+0xf0c08)[0x2000009f0c08]
[c656f6n01:1747159] [10] /u/dwootton/ompi-master/lib/libopen-pal.so.0(opal_progress+0x5c)[0x20000093d5b0]
[c656f6n01:1747159] [11] /u/dwootton/ompi-master/lib/libmpi.so.0(+0xded50)[0x20000015ed50]
[c656f6n01:1747159] [12] /u/dwootton/ompi-master/lib/libmpi.so.0(ompi_request_default_wait+0x38)[0x20000015edfc]
[c656f6n01:1747159] [13] /u/dwootton/ompi-master/lib/libmpi.so.0(MPI_Wait+0x194)[0x200000258e38]
[c656f6n01:1747159] [14] /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count[0x100030d8]
[c656f6n01:1747159] [15] /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count[0x100029d0]
[c656f6n01:1747159] [16] /usr/lib64/libc.so.6(+0x24c78)[0x2000006b4c78]
[c656f6n01:1747159] [17] /usr/lib64/libc.so.6(__libc_start_main+0xb4)[0x2000006b4e64]
[c656f6n01:1747159] *** End of error message ***
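As a sanity check on the numbers in the output above (not a diagnosis of the failure), the reported sizes are consistent with a per-rank count of 2147483647 (2^31 - 1) across 3 ranks. The sketch below recomputes them. It assumes 4 bytes per int and 16 bytes per double _Complex (the latter matching "16 dt" in the log), and it assumes the log's "GB" means 2^30 bytes. The 60.0% factor is copied from the log line, and all variable and function names are illustrative.

```python
INT_MAX = 2**31 - 1   # 2147483647, the base count shown in the log
NUM_RANKS = 3         # from BIGCOUNT_HOSTS = "-np 3"
GIB = 1024**3         # assumption: the log's "GB" is 2^30 bytes


def payload(count, type_size):
    """Return (bytes, size in 2^30-byte units rounded to 1 decimal)."""
    nbytes = count * type_size
    return nbytes, round(nbytes / GIB, 1)


# int case: 3 ranks x INT_MAX elements, 4 bytes each (assumed sizeof(int))
total_count = NUM_RANKS * INT_MAX
int_bytes, int_gib = payload(total_count, 4)

# double _Complex case: count scaled by the 60.0% shown in the log
adjusted = int(INT_MAX * 0.60)
dc_bytes, dc_gib = payload(NUM_RANKS * adjusted, 16)

# one rank's worth of wrong slots, as a share of the whole buffer
wrong_pct = round(100 * INT_MAX / total_count, 1)

print(total_count, int_bytes, int_gib)
print(adjusted, dc_bytes, dc_gib)
print(wrong_pct)
```

The 33.3% figure reported for ranks 0 and 1 is exactly one rank's share (2147483647 of 6442450941 slots) of the gathered buffer.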

The following command fails with a self-check error message indicating invalid data, followed by a SIGSEGV.

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll_han_priority 100 --mca coll han,basic,sm,self,inter,libnbc /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_alltoall_uniform_count 

The error message and traceback are:

Results from MPI_Alltoall(int x 6442450941 = 25769803764 or 24.0 GB): MPI_IN_PLACE
Rank 1: ERROR: DI in 2147483647 of 2147483647 slots ( 100.0 % wrong)
Rank 2: ERROR: DI in 4294967294 of 2147483647 slots ( 200.0 % wrong)
Rank 0: ERROR: DI in 5637111799 of 2147483647 slots ( 262.5 % wrong)
_____________________
Adjust count to fit in memory: 2147483647 x 60.0% = 1288490188
Root : payload 61847529024 57.6 GB = 16 dt x 1288490188 count x 3 peers x 1.0 inflation
Peer : payload 61847529024 57.6 GB = 16 dt x 1288490188 count x 3 peers x 1.0 inflation
Total : payload 185542587072 172.8 GB = 57.6 GB root + 57.6 GB x 2 local peers
[c656f6n01:1748510] *** Process received signal ***
[c656f6n01:1748510] Signal: Segmentation fault (11)
[c656f6n01:1748510] Signal code: Address not mapped (1)
[c656f6n01:1748510] Failing at address: 0x1ff9a2999990
[c656f6n01:1748510] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:1748510] [ 1] /usr/lib64/libc.so.6(+0xb083c)[0x20000074083c]
[c656f6n01:1748510] [ 2] /u/dwootton/ompi-master/lib/libmpi.so.0(mca_coll_base_alltoall_intra_basic_inplace+0x22c)[0x2000002b3c94]
[c656f6n01:1748510] [ 3] /u/dwootton/ompi-master/lib/libmpi.so.0(ompi_coll_base_alltoall_intra_basic_linear+0x8c)[0x2000002b5684]
[c656f6n01:1748510] [ 4] /u/dwootton/ompi-master/lib/libmpi.so.0(PMPI_Alltoall+0x538)[0x200000193cd4]
[c656f6n01:1748510] [ 5] /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_alltoall_uniform_count[0x10002dd0]
[c656f6n01:1748510] [ 6] /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_alltoall_uniform_count[0x1000289c]
[c656f6n01:1748510] [ 7] /usr/lib64/libc.so.6(+0x24c78)[0x2000006b4c78]
[c656f6n01:1748510] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xb4)[0x2000006b4e64]
[c656f6n01:1748510] *** End of error message ***

The following command fails with a self-check error message indicating invalid data, followed by an abort (SIGABRT) after glibc reports "double free or corruption (out)".

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll_han_priority 100 --mca coll han,basic,sm,self,inter,libnbc /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_gather_uniform_count 

The error and traceback are:

_____________________
Results from MPI_Igather(int x 6442450941 = 25769803764 or 24.0 GB):
Rank 0: ERROR: DI in 4294967292 of 6442450941 slots ( 66.7 % wrong)
_____________________
Results from MPI_Igather(double _Complex x 6442450941 = 103079215056 or 96.0 GB):
Rank 0: ERROR: DI in 4294967292 of 6442450941 slots ( 66.7 % wrong)
double free or corruption (out)
[c656f6n01:1748925] *** Process received signal ***
[c656f6n01:1748925] Signal: Aborted (6)
[c656f6n01:1748925] Signal code: (-6)
[c656f6n01:1748925] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:1748925] [ 1] /usr/lib64/libc.so.6(gsignal+0xd8)[0x2000006d44d8]
[c656f6n01:1748925] [ 2] /usr/lib64/libc.so.6(abort+0x164)[0x2000006b462c]
[c656f6n01:1748925] [ 3] /usr/lib64/libc.so.6(+0x908bc)[0x2000007208bc]
[c656f6n01:1748925] [ 4] /usr/lib64/libc.so.6(+0x9b828)[0x20000072b828]
[c656f6n01:1748925] [ 5] /usr/lib64/libc.so.6(+0x9e0ec)[0x20000072e0ec]
[c656f6n01:1748925] [ 6] ./test_gather_uniform_count[0x100030b0]
[c656f6n01:1748925] [ 7] ./test_gather_uniform_count[0x10002920]
[c656f6n01:1748925] [ 8] /usr/lib64/libc.so.6(+0x24c78)[0x2000006b4c78]
[c656f6n01:1748925] [ 9] /usr/lib64/libc.so.6(__libc_start_main+0xb4)[0x2000006b4e64]
[c656f6n01:1748925] *** End of error message ***

The following command fails with a self-check error message indicating invalid data

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll_han_priority 100 --mca coll han,basic,sm,self,inter,libnbc /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_scatter_uniform_count 

The error message is

____________________
Results from MPI_Iscatter(int x 6442450941 = 25769803764 or 24.0 GB):
Rank 2: ERROR: DI in 2147483647 of 2147483647 slots ( 100.0 % wrong)
Rank 1: PASSED
Rank 0: PASSED
____________________
Results from MPI_Iscatter(double _Complex x 6442450941 = 103079215056 or 96.0 GB):
Rank 2: ERROR: DI in 2147483647 of 2147483647 slots ( 100.0 % wrong)
Rank 1: PASSED
Rank 0: PASSED
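The four failing invocations above differ only in the test binary, so they can be generated from one template. The following is a minimal sketch, assuming the binaries were built in the current directory and using the environment values listed earlier. `build_command`, `TESTS`, and `ENV` are illustrative names, not part of the test suite. The script only prints the command lines and does not launch anything.

```python
import shlex

# The four test binaries that failed in this report.
TESTS = [
    "test_allgather_uniform_count",
    "test_alltoall_uniform_count",
    "test_gather_uniform_count",
    "test_scatter_uniform_count",
]

# Values of the environment variables used for all tests in this report.
ENV = {"BIGCOUNT_MEMORY_PERCENT": "70", "BIGCOUNT_MEMORY_DIFF": "10"}


def build_command(test_binary, hosts="-np 3"):
    """Build the mpirun argument list used in this report for one test."""
    cmd = ["mpirun", *shlex.split(hosts)]
    for name, value in ENV.items():
        cmd += ["-x", f"{name}={value}"]   # export each variable to the ranks
    cmd += [
        "--mca", "btl", "^openib",
        "--mca", "coll_han_priority", "100",
        "--mca", "coll", "han,basic,sm,self,inter,libnbc",
        test_binary,
    ]
    return cmd


for test in TESTS:
    # shlex.join quotes arguments such as ^openib so the line is shell-safe
    print(shlex.join(build_command(f"./{test}")))
```

Each printed line can be pasted into a shell as is, or the argument list can be handed to `subprocess.run` to execute the tests one after another.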
