In doing some manual testing, I'm seeing segv's in about 60% of my runs on master when run on 2 nodes, ppn=16, with mca=tcp,self (no vader), and with btl_tcp_progress_thread=1.
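For reference, that configuration corresponds to an invocation roughly like the following (the hostnames and test binary are placeholders, not the exact command I ran; the relevant parts are the two `--mca` parameters):

```
mpirun -np 32 --host node1,node2 --npernode 16 \
    --mca btl tcp,self \
    --mca btl_tcp_progress_thread 1 \
    ./some_mpi_test
```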
The core file stack traces are varied, but they all have a few things in common:
- Thread 4 is somewhere in `ompi_mpi_finalize()`.
  - Sometimes it's deep inside the loop of closing frameworks/components in MPI_FINALIZE.
  - Other times it's closing down MPI_Info.
  - ...etc.
- Threads 2 and 3 look like ORTE/PMIx/whatever progress threads.
- Thread 1 looks to be the thread that caused the segv; it always has a `bt` something like this:

```
(gdb) bt
#0  0x00002aaab9fcec5b in __divtf3 (a=<invalid float value>, b=-nan(0xffffffffffffffff)) at ../../../libgcc/soft-fp/divtf3.c:47
#1  0x00002aaab9fc5d60 in ?? ()
#2  0x0000000000000000 in ?? ()
```

`__divtf3` looks to be a gcc-internal division function (https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html).
Is it possible that the TCP progress thread has not been shut down properly / is still running, and the TCP BTL component got dlclose()ed? That could lead to Badness like this.
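To make that suspicion concrete, here is a minimal, self-contained sketch (not Open MPI code; the DSO path and symbol name are made up) of the ordering bug I have in mind: a progress thread keeps calling into a component's DSO after that DSO has been dlclose()ed, so the thread ends up executing over unmapped or re-used pages and the resulting backtrace is garbage, much like the `__divtf3` frame above.

```c
/* Minimal sketch (NOT Open MPI code): a progress thread still running
 * inside a component's DSO when that DSO is dlclose()ed.  The DSO path
 * and symbol name below are placeholders. */
#include <dlfcn.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static int (*progress_fn)(void);       /* points into the dlopen()ed DSO */
static volatile int keep_running = 1;

/* Analogue of the BTL TCP progress thread: spins calling component code. */
static void *progress_loop(void *arg)
{
    (void) arg;
    while (keep_running) {
        progress_fn();                  /* executes text mapped from the DSO */
    }
    return NULL;
}

int main(void)
{
    /* Placeholder component; stands in for the real BTL component DSO. */
    void *handle = dlopen("./fake_btl_component.so", RTLD_NOW);
    if (NULL == handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return EXIT_FAILURE;
    }

    progress_fn = (int (*)(void)) dlsym(handle, "component_progress");
    if (NULL == progress_fn) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        return EXIT_FAILURE;
    }

    pthread_t tid;
    if (0 != pthread_create(&tid, NULL, progress_loop, NULL)) {
        perror("pthread_create");
        return EXIT_FAILURE;
    }

    /* The suspected bug: the component is unloaded while the progress
     * thread is still running.  Correct shutdown order would be:
     *     keep_running = 0;  pthread_join(tid, NULL);  dlclose(handle);  */
    dlclose(handle);

    /* The progress thread's next call into the (now unmapped) DSO faults;
     * the SIGSEGV backtrace shows whatever now occupies those addresses. */
    pthread_join(tid, NULL);
    return EXIT_SUCCESS;
}
```

If that's what's happening, the fix would be to stop and join the TCP progress thread before the BTL framework close path unloads the component.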
This is happening on master; I have not checked any release branches to see if/where else it is happening.
FYI: @bosilca @bwbarrett