TCP BTL progress thread segv's #5902

@jsquyres

Description

In doing some manual testing, I'm seeing segvs in about 60% of my runs on master when running on 2 nodes, ppn=16, with mca=tcp,self (no vader) and btl_tcp_progress_thread=1.

The core file stack traces are varied, but they all have a few things in common:

  1. Thread 4 is somewhere in ompi_mpi_finalize().
    • Sometimes it's deep inside the loop of closing frameworks/components in MPI_FINALIZE.
    • Other times it's closing down MPI_Info.
    • ...etc.
  2. Threads 2 and 3 look like ORTE/PMIX/whatever progress threads.
  3. Thread 1 looks to be the thread that caused the segv; it always has a bt something like this:
(gdb) bt
#0  0x00002aaab9fcec5b in __divtf3 (a=<invalid float value>, b=-nan(0xffffffffffffffff)) at ../../../libgcc/soft-fp/divtf3.c:47
#1  0x00002aaab9fc5d60 in ?? ()
#2  0x0000000000000000 in ?? ()

__divtf3 looks to be a gcc-internal division function (https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html).

Is it possible that the TCP progress thread has not been shut down properly / is still running, and the TCP BTL component got dlclose()ed? That could lead to Badness like this.

This is happening on master; I have not checked any release branches to see if/where else it is happening.

FYI: @bosilca @bwbarrett
