@jjhursey
Member

  • The scenario is that we have a wrapper process placed before the MPI application:
 mpirun -np 2 wrapper ./hello_c
  • Wrapper can be as simple as:
#!/bin/bash -e
eval "$@"
exit 0
  • If hello_c crashes and wrapper detects it, then wrapper will
    exit with a non-zero exit status. The orted will notice that and
    start killing all local processes.
    • The orted will send SIGKILL to the wrapper process, and that
      process will terminate and leave hello_c running. hello_c
      will continue to run (in this test case it will wait in MPI_Finalize)
      and the job will seem to hang.
  • This commit does two things, each of which fixes this scenario on its own.
    1. After killing the process, mark it as not alive, since we are not
      going to wait on it. This prevents orted_cmd from seeing the process
      as alive and waiting for it to complete (note that the pid is set
      to 0, so we would not be able to mark it correctly later even if
      we did get a notice).
    2. Instead of sending the SIGKILL signal to just the PID of wrapper,
      send it to -PID so that the kernel delivers the signal to the
      whole process group under wrapper as well. This will cause the
      hello_c program to terminate as well.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
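
To illustrate the difference between signalling the wrapper's PID and signalling -PID, here is a minimal standalone sketch. It is not ORTE code: `app_dies_after_kill` is a hypothetical helper, the "wrapper" and "application" are stand-in processes that just block, and the sketch assumes the wrapper is the leader of its own process group (which is the only case where `kill(-PID, ...)` reaches the grandchild).

```c
#include <poll.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo helper (not ORTE code).
 * Starts a stand-in "wrapper" that leads its own process group and forks a
 * stand-in "application" which blocks forever. Then sends SIGKILL either to
 * the wrapper alone (kill_group == 0) or to its whole group (kill_group != 0).
 * Returns 1 if the application terminated, 0 if it survived, -1 on error. */
static int app_dies_after_kill(int kill_group)
{
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }

    pid_t wrapper = fork();
    if (wrapper < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (wrapper == 0) {
        /* Wrapper: become a process-group leader before forking the app,
         * so the app inherits the wrapper's process group. */
        close(fds[0]);
        if (setpgid(0, 0) != 0) {
            _exit(1);
        }
        pid_t app = fork();
        if (app < 0) {
            _exit(1);
        }
        if (app == 0) {
            /* Application: report readiness, keep the write end open,
             * and block forever (like a rank stuck in MPI_Finalize). */
            char ready_byte = 'r';
            if (write(fds[1], &ready_byte, 1) != 1) {
                _exit(1);
            }
            for (;;) {
                pause();
            }
        }
        /* Only the application should hold the write end from here on. */
        close(fds[1]);
        for (;;) {
            pause();
        }
    }

    /* Parent: plays the role of the daemon that kills local processes. */
    close(fds[1]);
    char ready;
    if (read(fds[0], &ready, 1) != 1) {
        /* The wrapper or application failed to start; clean up. */
        kill(wrapper, SIGKILL);
        waitpid(wrapper, NULL, 0);
        close(fds[0]);
        return -1;
    }

    /* A negative PID addresses every process in that process group. */
    kill(kill_group ? -wrapper : wrapper, SIGKILL);
    waitpid(wrapper, NULL, 0);

    /* The pipe is now empty, so poll() only reports an event once every
     * writer is gone, i.e. once the application has terminated. */
    struct pollfd pfd = { .fd = fds[0], .events = POLLIN };
    int app_dead = (poll(&pfd, 1, 500) > 0);

    /* Clean up a surviving application so the demo leaves nothing behind. */
    kill(-wrapper, SIGKILL);
    close(fds[0]);
    return app_dead;
}
```

Calling `app_dies_after_kill(0)` models the hang described above: the wrapper is gone but the application survives. Calling `app_dies_after_kill(1)` models item 2: the application is terminated together with the wrapper. If the wrapper is not a process-group leader, `-PID` does not name a valid group, so this approach depends on how the processes are launched.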
@rhc54
Contributor

rhc54 commented Apr 12, 2017

Didn't we make a change at some not-too-distant point so we no longer put the apps in their own process group? I seem to recall @gpaulsen asking us to do so, and I thought we did?

@jjhursey
Member Author

Hmm. Maybe I was testing against an old version of the code. I can no longer reproduce this issue with master or the release branches. A few months ago I was able to reliably reproduce it, so maybe this was fixed differently since then.

I'm going to close this PR, and if it comes back up we can revive it.

I think the ORTE_FLAG_UNSET(cd->child, ORTE_PROC_FLAG_ALIVE); is still a good thing to do though.
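
As a rough sketch of why clearing the flag matters: if a killed child keeps its "alive" bit, any later scan that waits for alive children will still count it, even though its pid has been zeroed and no exit notice can be matched to it. The types and macros below (`demo_proc_t`, `DEMO_FLAG_*`, `demo_mark_killed`, `demo_count_alive`) are hypothetical stand-ins written for this illustration; they are not the ORTE definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for a per-process record and its flag helpers. */
typedef struct {
    int pid;
    unsigned flags;
} demo_proc_t;

#define DEMO_FLAG_ALIVE 0x1u
#define DEMO_FLAG_SET(p, f)   ((p)->flags |= (f))
#define DEMO_FLAG_UNSET(p, f) ((p)->flags &= ~(f))
#define DEMO_FLAG_TEST(p, f)  (((p)->flags & (f)) != 0)

/* After sending the kill signal we do not wait on the child, so record
 * that fact right away: zero the pid and clear the alive bit together. */
static void demo_mark_killed(demo_proc_t *p)
{
    p->pid = 0;
    DEMO_FLAG_UNSET(p, DEMO_FLAG_ALIVE);
}

/* Count the children that a later "wait for completion" pass would
 * still consider outstanding. */
static size_t demo_count_alive(const demo_proc_t *procs, size_t n)
{
    size_t alive = 0;
    for (size_t i = 0; i < n; i++) {
        if (DEMO_FLAG_TEST(&procs[i], DEMO_FLAG_ALIVE)) {
            alive++;
        }
    }
    return alive;
}
```

With two children initially marked alive, calling `demo_mark_killed` on one of them leaves `demo_count_alive` reporting one outstanding child instead of two.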

@jjhursey jjhursey closed this Apr 13, 2017
@rhc54
Contributor

rhc54 commented Apr 13, 2017

Sounds reasonable to me. Given timing, I'll do it on your behalf.
