
Conversation

@artpol84
Copy link
Contributor

The fact that an application proc called Abort (read: it failed) doesn't
mean that the ORTE subsystem has failed - on the contrary, ORTE is doing
its job by gracefully exiting the whole application.

An orted exiting with non-zero status creates a problem for at least
plm/slurm environments, where orteds are launched via srun with the
"--kill-on-bad-exit" flag. If one of the orteds exits with non-zero
status, slurm will immediately kill all the other orteds. As a result
we see a lot of leftover files in the /tmp directory.

@artpol84
Copy link
Contributor Author

We have been chasing this problem since last year.
@miked-mellanox @jladd-mlnx @karasevb @yosefe @alinask

@artpol84 artpol84 force-pushed the orte_cleanup/master branch from a0363df to f68882a Compare April 12, 2017 15:56
@rhc54
Copy link
Contributor

rhc54 commented Apr 12, 2017

I'm not necessarily opposed to this change, but I am disturbed a bit by the implication that slurm is no longer behaving as expected or documented. The --kill-on-bad-exit option is supposed to result in slurm hitting each of the remaining processes with a SIGTERM, which the orted traps so it can clean up. We rely on that as well for direct launch so the apps can clean up.

Is that no longer happening? Or is the orted not properly trapping it, or failing to cleanup after it does trap the signal?

@artpol84
Copy link
Contributor Author

Honestly, I read this option literally, thinking that SIGKILL would be sent.
However, I can say that removing "--kill-on-bad-exit" from plm/slurm fixes the leftover problem.

@artpol84
Copy link
Contributor Author

I will check that now.

@rhc54
Copy link
Contributor

rhc54 commented Apr 12, 2017

Could you please check with slurm to see what happened to kill-on-bad-exit? I realize this fixes a symptom, but we should understand the source of the problem. If the orted isn't properly handling SIGTERM, then that's the issue we should resolve.

One simple way of testing: just have the SIGTERM trap print out "OUCH".
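
For illustration, a minimal standalone version of that test might look like this (a sketch, not the actual orted trap):

```c
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Trap SIGTERM and print "OUCH". If slurm sends SIGTERM before SIGKILL,
 * the message appears; if it jumps straight to SIGKILL, it never does. */
static void on_term(int sig)
{
    (void)sig;
    write(STDERR_FILENO, "OUCH\n", 5); /* write() is async-signal-safe */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_term;
    sigaction(SIGTERM, &sa, NULL);
    pause(); /* block until a signal arrives */
    return 0;
}
```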

@rhc54
Copy link
Contributor

rhc54 commented Apr 12, 2017

Just saw your note - thx!

@artpol84
Copy link
Contributor Author

artpol84 commented Apr 12, 2017

And it seems it has been like that "forever" - at least since 2012:
https://github.com/SchedMD/slurm/blame/master/src/plugins/launch/slurm/launch_slurm.c#L723:L727

@artpol84
Copy link
Contributor Author

Double-checked at runtime: launched on 3 nodes through a batch script (the first node is controlled by mpirun).
I put a sleep-loop into the abort functions and attached to both orteds. Then I released one of them, and this is the output from the other:

```
(gdb) thr 4
[Switching to thread 4 (Thread 0x7ffff11c3700 (LWP 16365))]
#0  0x00007ffff6583efd in nanosleep () from /usr/lib64/libc.so.6
(gdb) c
Continuing.
[Thread 0x7fffef78b700 (LWP 16367) exited]
[Thread 0x7ffff09c2700 (LWP 16366) exited]
[Thread 0x7ffff7fb1740 (LWP 16359) exited]
Program terminated with signal SIGKILL, Killed.
The program no longer exists.
```

I believe GDB would have shown SIGTERM if it had arrived first.

@artpol84
Copy link
Contributor Author

artpol84 commented Apr 12, 2017

Checked the SLURM srun man page.
The section on "--kill-on-bad-exit" doesn't clearly specify what behavior should be expected:

```
-K, --kill-on-bad-exit[=0|1]
    Controls whether or not to terminate a job if any task exits with a
    non-zero exit code. If this option is not specified, the default action
    will be based upon the Slurm configuration parameter of KillOnBadExit.
    If this option is specified, it will take precedence over KillOnBadExit.
    An option argument of zero will not terminate the job. A non-zero
    argument or no argument will terminate the job. Note: This option takes
    precedence over the -W, --wait option to terminate the job immediately
    if a task exits with a non-zero exit code. Since this option's argument
    is optional, for proper parsing the single letter option must be
    followed immediately with the value and not include a space between
    them. For example "-K1" and not "-K 1". This option applies to job
    allocations.
```

So we can't say that SLURM doesn't behave as documented - perhaps just not as expected. I personally read the documentation exactly as it behaves.

In contrast, the "--time" section explicitly says what behavior to expect:

```
-t, --time=<time>
    Set a limit on the total run time of the job allocation. If the
    requested time limit exceeds the partition's time limit, the job will
    be left in a PENDING state (possibly indefinitely). The default time
    limit is the partition's default time limit. When the time limit is
    reached, each task in each job step is sent SIGTERM followed by
    SIGKILL. The interval between signals is specified by the Slurm
    configuration parameter KillWait. The OverTimeLimit configuration
    parameter may permit the job to run longer than scheduled. Time
    resolution is one minute and second values are rounded up to the next
    minute. A time limit of zero requests that no time limit be imposed.
    Acceptable time formats include "minutes", "minutes:seconds",
    "hours:minutes:seconds", "days-hours", "days-hours:minutes" and
    "days-hours:minutes:seconds". This option applies to job and step
    allocations.
```
@artpol84
Copy link
Contributor Author

artpol84 commented Apr 12, 2017

I talked to @dannyauble. He thinks we shouldn't use "--kill-on-bad-exit" if we want the normal sequence of cleanup signals: SIGCONT, SIGTERM, SIGKILL. With "--kill-on-bad-exit" it is an immediate SIGKILL; without it, it is the normal sequence of signals, which allows cleanup.
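
For reference, a small standalone harness along these lines could verify which of those signals actually arrive (hypothetical test code, not part of ORTE; SIGKILL itself cannot be caught):

```c
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Run this under srun next to a deliberately failing task and observe
 * which signals slurm delivers. If the process dies without printing
 * anything, only SIGKILL was sent. */
static void trace(int sig)
{
    /* both strings are 12 bytes; write() is async-signal-safe */
    const char *msg = (sig == SIGCONT) ? "got SIGCONT\n" : "got SIGTERM\n";
    write(STDERR_FILENO, msg, 12);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = trace;
    sigaction(SIGCONT, &sa, NULL);
    sigaction(SIGTERM, &sa, NULL);
    for (;;)
        pause(); /* SIGKILL, which cannot be caught, ends the process */
    return 0;
}
```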

@artpol84 artpol84 force-pushed the orte_cleanup/master branch from f87bb79 to a952ff2 Compare April 12, 2017 17:56
@rhc54
Copy link
Contributor

rhc54 commented Apr 12, 2017

settle down there, my friend - moe is looking at this now as he believes there should have been a sigterm. so let's let those guys noodle on this a bit before we jump around too much.

@artpol84
Copy link
Contributor Author

I've updated this PR.
I still think we shouldn't set the orted exit status to whatever the proc passes to PMIx_Abort, because the RTE is not the application itself. If the app aborted but ORTE cleaned everything up, I believe the orted should exit with status 0.
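
Roughly what I have in mind (a sketch with made-up names, not the actual ORTE code):

```c
#include <stdlib.h>

/* Made-up helper standing in for the RTE's own cleanup
 * (session directories, fds, ...); returns 0 on success. */
static int rte_cleanup(void)
{
    return 0;
}

int main(void)
{
    int app_abort_status = 1; /* status the app proc passed to PMIx_Abort */
    (void)app_abort_status;   /* reported via the job-level status instead */

    /* The daemon's own exit code reflects only whether the RTE did its
     * job, not whether the application aborted. */
    return rte_cleanup() == 0 ? 0 : 1;
}
```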

@rhc54
Copy link
Contributor

rhc54 commented Apr 12, 2017

there's a reason why we chose to do it - let's let schedmd think about this a bit

@artpol84
Copy link
Contributor Author

Sure - no rush with this PR.

@artpol84
Copy link
Contributor Author

BTW - what was the reason?

@artpol84 artpol84 force-pushed the orte_cleanup/master branch 2 times, most recently from 9ef8746 to 876959a Compare April 12, 2017 18:48
@rhc54
Copy link
Contributor

rhc54 commented Apr 12, 2017

We sometimes fail to cleanly order all the orteds to die - for example, if one of them hangs or fails and we lose the communication path downstream of it. That was a problem in the past, hence the use of that option. When we made the change, we always got a SIGTERM first - but I don't recall the last time we checked.

For direct launch there is no other option - the procs all create session directories, and if one fails you need SLURM to be a little friendlier with the others, or else we leave droppings.

@artpol84
Copy link
Contributor Author

artpol84 commented Apr 12, 2017

If I understood @dannyauble correctly, if you don't specify the --kill-on-bad-exit option, you get the desired behavior.
I mean that if one of the tasks fails, SLURM will eventually issue SIGCONT, SIGTERM, SIGKILL to all the other tasks. You can use the "-W" option of srun to tell SLURM the delay between the first task's exit and the cleanup sequence:

```
-W, --wait=<seconds>
    Specify how long to wait after the first task terminates before
    terminating all remaining tasks. A value of 0 indicates an unlimited
    wait (a warning will be issued after 60 seconds). The default value is
    set by the WaitTime parameter in the slurm configuration file (see
    slurm.conf(5)). This option can be useful to insure that a job is
    terminated in a timely fashion in the event that one or more tasks
    terminate prematurely. Note: The -K, --kill-on-bad-exit option takes
    precedence over -W, --wait to terminate the job immediately if a task
    exits with a non-zero exit code. This option applies to job
    allocations.
```
@artpol84
Copy link
Contributor Author

Maybe this is the right way to go, and this is what @dannyauble suggested as well.

@artpol84
Copy link
Contributor Author

I'm going to double-check this later today.

@rhc54
Copy link
Contributor

rhc54 commented Apr 12, 2017

I truly don't think that is what we want. While it might be okay for the orteds, since mpirun synchronizes their normal termination, it would definitely not work for directly launched application procs, as they can terminate at very different times. For example, it isn't uncommon for every rank but rank 0 to leave while that rank continues on for quite some time as it saves results to a parallel file system.

I think Moe realizes that this is something they need to fix, and that the current behavior is not what he intended.

@artpol84
Copy link
Contributor Author

I was talking specifically about plm/slurm.
Can you clarify what you mean by directly launched procs? Is this direct srun launch? In that case we don't need any OMPI modifications, do we?

@rhc54
Copy link
Contributor

rhc54 commented Apr 13, 2017

Direct launch means applications launched via srun instead of mpirun. No, we don't need any OMPI changes there, but the point is that the bad behavior of SLURM causes our users equal problems in that use-case. So we want SchedMD to fix the problem.

It's important to remember that we have a varied community of users out there that depend on us to make things work correctly, for all the ways they use our software. They don't understand that something is a slurm vs ompi issue - all they see is that running OMPI leaves droppings behind.

In this case, fixing it at the root cause solves both use-cases. So let's concentrate on helping SchedMD to do the right thing. We'll still need to coordinate with them on how to resolve the problem for all those installations running earlier versions, but that's a separate discussion.

@artpol84
Copy link
Contributor Author

Sounds good, thank you.

@rhc54
Copy link
Contributor

rhc54 commented Apr 13, 2017

Per Moe:

> I have confirmed the logic is as the user described. Slurm does send a SIGCONT/SIGTERM/SIGKILL when a job reaches the end of its time limit, but if someone explicitly runs Slurm's scancel command or the --kill-on-bad-exit is triggered, then only SIGKILL is sent. I've modified the code to change the behavior in our next major release (late 2017), but the patch should go into older versions of Slurm pretty cleanly. With this change the --kill-on-bad-exit triggers the SIGCONT/SIGTERM/SIGKILL signals. Here is the commit: https://github.com/SchedMD/slurm/commit/8ed8098b8b8838567a8c348869119d1ec38fa581

I'm okay with having the orted not set its exit code, as you are quite correct that the orted itself didn't have a problem. I would like to hold off a bit on removing kill-on-bad-exit until we better understand the implications - given the slurm change, it might not be necessary.

@artpol84 artpol84 force-pushed the orte_cleanup/master branch from 876959a to 4af7a08 Compare April 13, 2017 16:58
@artpol84
Copy link
Contributor Author

The PR is updated; I removed the kill-on-bad-exit portion.

Copy link
Contributor

@rhc54 rhc54 left a comment


Thanks for catching it, and your patience.

@rhc54 rhc54 merged commit fbf714d into open-mpi:master Apr 13, 2017