odls/base: Fix abormal cleanup when app is wrapped #3337

jjhursey · 2017-04-12T20:53:35Z

The scenario is that we have a wrapper process placed before the MPI application:

 mpirun -np 2 wrapper ./hello_c

Wrapper can be as simple as:

 #!/bin/bash -e
 eval "$@"
 exit 0

If hello_c crashes and wrapper detects it, then wrapper will
exit with a non-zero exit status. The orted will notice that and
start killing all local processes.
- The orted will send SIGKILL to the wrapper process, and that
  process will terminate and leave hello_c running. hello_c
  will continue to run (in this test case it will wait in MPI_Finalize)
  and the job will appear to hang.
This commit does two things, each of which fixes this scenario.
1. After killing the process, mark it as not alive since we are not
  going to wait on it. This prevents orted_cmd from seeing the process
  as alive and waiting for it to complete (note that the pid is set
  to 0, so we wouldn't be able to mark it correctly later even if
  we did get a notice).
2. Instead of sending the SIGKILL signal to just the PID of wrapper,
  send it to -PID so that the kernel will send the signal to the
  whole process group under wrapper as well. This will cause the
  hello_c program to terminate as well.
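
As an illustration of item 2, here is a minimal, self-contained sketch of the difference between signalling one PID and signalling its process group. It is not the odls code: `launch_wrapped`, `kill_group`, and `all_exited` are hypothetical helpers, and the "wrapper" and "app" are plain forked stand-ins that just sleep. The sketch assumes the wrapper is the leader of its own process group, which is what makes `kill(-pid, ...)` reach the app.

```c
#include <poll.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a stand-in "wrapper" that leads a new process
 * group and forks a stand-in "app" inside that group. Both hold the write
 * end of a pipe, so the read end reports end-of-file only after both have
 * exited. Returns the wrapper's pid, or -1 on failure. */
static pid_t launch_wrapped(int *liveness_fd)
{
    int fds[2];
    if (pipe(fds) != 0) return -1;

    pid_t wrapper = fork();
    if (wrapper < 0) { close(fds[0]); close(fds[1]); return -1; }

    if (wrapper == 0) {
        close(fds[0]);
        setpgid(0, 0);                 /* wrapper becomes group leader */
        pid_t app = fork();            /* app inherits the wrapper's group */
        if (app < 0) _exit(1);
        if (app == 0) {
            char ready = 1;            /* tell the launcher the app is up */
            if (write(fds[1], &ready, 1) != 1) _exit(1);
        }
        for (;;) pause();              /* wrapper and app both just sleep */
    }

    setpgid(wrapper, wrapper);         /* repeat in the parent to avoid a race */
    close(fds[1]);
    char ready;
    if (read(fds[0], &ready, 1) != 1) {   /* app never started */
        kill(-wrapper, SIGKILL);
        waitpid(wrapper, NULL, 0);
        close(fds[0]);
        return -1;
    }
    *liveness_fd = fds[0];
    return wrapper;
}

/* A negative pid asks the kernel to signal every member of that group. */
static int kill_group(pid_t pgid)
{
    return kill(-pgid, SIGKILL);
}

/* Returns 1 once every holder of the pipe's write end has exited,
 * or 0 if some process is still alive after timeout_ms. */
static int all_exited(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    char c;
    if (poll(&p, 1, timeout_ms) <= 0) return 0;   /* timeout or error */
    return read(fd, &c, 1) == 0;                  /* 0 bytes: end-of-file */
}
```

In this sketch, `kill(wrapper, SIGKILL)` alone leaves the app holding the pipe open, which mirrors the hang described above, while `kill_group(wrapper)` closes it.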

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>

rhc54 · 2017-04-12T20:55:40Z

Didn't we make a change at some not-too-distant point so we no longer put the apps in their own process group? I seem to recall @gpaulsen asking us to do so, and I thought we did?

jjhursey · 2017-04-13T21:36:19Z

Hmm. Maybe I was testing against an old version of the code. I can no longer reproduce this issue with master or the release branches. A few months ago I was able to reliably reproduce it, so maybe this was fixed differently since then.

I'm going to close this PR, and if it comes back up we can revive it.

I think the ORTE_FLAG_UNSET(cd->child, ORTE_PROC_FLAG_ALIVE); is still a good thing to do though.
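
To make the effect of clearing the alive flag concrete, here is a rough sketch using stand-in definitions. The struct, the flag values, and the `FLAG_*` macros below are hypothetical and are not the real ORTE definitions. They only mimic the idea of a per-child bit flag that a shutdown path consults before deciding whether to wait.

```c
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical stand-ins, not the real ORTE types or flag values. */
#define PROC_FLAG_ALIVE 0x0001u
#define PROC_FLAG_LOCAL 0x0002u

typedef struct {
    pid_t    pid;
    uint16_t flags;
} child_t;

#define FLAG_SET(c, f)   ((c)->flags |= (f))
#define FLAG_UNSET(c, f) ((c)->flags &= (uint16_t)~(f))
#define FLAG_TEST(c, f)  (((c)->flags & (f)) != 0)

/* Record that a child was sent SIGKILL and will not be waited on.
 * The pid is zeroed as well, so a later exit notice could not be
 * matched back to this entry. That is why the flag is cleared here. */
static void mark_killed(child_t *c)
{
    FLAG_UNSET(c, PROC_FLAG_ALIVE);
    c->pid = 0;
}

/* Count the children a shutdown path would still wait for. */
static int children_still_alive(const child_t *kids, int n)
{
    int alive = 0;
    for (int i = 0; i < n; i++) {
        if (FLAG_TEST(&kids[i], PROC_FLAG_ALIVE)) alive++;
    }
    return alive;
}
```

If the flag were left set after the kill, a check like `children_still_alive` would keep reporting the killed child and the caller could wait indefinitely.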

rhc54 · 2017-04-13T22:12:39Z

Sounds reasonable to me. Given timing, I'll do it on your behalf.

jjhursey added bug Target: v2.0.x Target: v2.x Target: v3.0.x labels Apr 12, 2017

jjhursey assigned rhc54 and gpaulsen Apr 12, 2017

jjhursey requested a review from rhc54 April 12, 2017 20:53

jjhursey closed this Apr 13, 2017

rhc54 mentioned this pull request Apr 14, 2017

On behalf of Josh, ensure we flag that the child is no longer alive since we are killing it with SIGKILL #3350

Merged

jjhursey mentioned this pull request Jun 27, 2017

Need to signal -pgrp to get to all members of a process group. #3773

Merged

jjhursey deleted the fix/wrapped-prog-term branch March 9, 2021 15:22
