
Conversation

@artpol84
Copy link
Contributor

The fact that an application proc called Abort (read: it failed) doesn't
mean that the ORTE subsystem has failed - on the contrary, ORTE is doing
its job by gracefully exiting the whole application.

An orted exiting with non-zero status creates a problem for at least
plm/slurm environments, where orteds are launched via srun with the
"--kill-on-bad-exit" flag. If one of the orteds exits with non-zero
status, slurm will immediately kill all the other orteds. As a result
we see a lot of leftover files in the /tmp directory.

@artpol84
Copy link
Contributor Author

We have been chasing this problem since last year.
@miked-mellanox @jladd-mlnx @karasevb @yosefe @alinask

@artpol84 artpol84 force-pushed the orte_cleanup/master branch from a0363df to f68882a Compare April 12, 2017 15:56
@rhc54
Copy link
Contributor

rhc54 commented Apr 12, 2017

I'm not necessarily opposed to this change, but I am disturbed a bit by the implication that slurm is no longer behaving as expected or documented. The --kill-on-bad-exit option is supposed to result in slurm hitting each of the remaining processes with a SIGTERM, which the orted traps so it can clean up. We rely on that as well for direct launch so the apps can clean up.

Is that no longer happening? Or is the orted not properly trapping it, or failing to cleanup after it does trap the signal?

@artpol84
Copy link
Contributor Author

Honestly, I read this option literally, thinking that SIGKILL would be sent.
However, I can say that removing "--kill-on-bad-exit" from plm/slurm fixes the leftover problem.

@artpol84
Copy link
Contributor Author

I will check that now.

@rhc54
Copy link
Contributor

rhc54 commented Apr 12, 2017

Could you please check with slurm to see what happened to kill-on-bad-exit? I realize this fixes a symptom, but we should understand the source of the problem. If the orted isn't properly handling SIGTERM, then that's the issue we should resolve.

One simple way of testing: just have the SIGTERM trap print out "OUCH".
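
For illustration, a minimal standalone version of that test might look like this (a sketch, not the actual orted trap):

```c
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Trap SIGTERM and print "OUCH". If slurm sends SIGTERM before SIGKILL,
 * the message appears; if it jumps straight to SIGKILL, it never does. */
static void on_term(int sig)
{
    (void)sig;
    write(STDERR_FILENO, "OUCH\n", 5); /* write() is async-signal-safe */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_term;
    sigaction(SIGTERM, &sa, NULL);
    pause(); /* block until a signal arrives */
    return 0;
}
```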

@rhc54
Copy link
Contributor

rhc54 commented Apr 12, 2017

Just saw your note - thx!

@artpol84
Copy link
Contributor Author

artpol84 commented Apr 12, 2017

And it seems it has been like that "forever" - at least since 2012:
https://github.com/SchedMD/slurm/blame/master/src/plugins/launch/slurm/launch_slurm.c#L723:L727

@artpol84
Copy link
Contributor Author

Double-checked at runtime: launched on 3 nodes through a batch script (the first node is controlled by mpirun).
I put a sleep-loop into the abort functions and attached to both orteds. Then I released one of them, and this is the output from the other:

```
(gdb) thr 4
[Switching to thread 4 (Thread 0x7ffff11c3700 (LWP 16365))]
#0  0x00007ffff6583efd in nanosleep () from /usr/lib64/libc.so.6
(gdb) c
Continuing.
[Thread 0x7fffef78b700 (LWP 16367) exited]
[Thread 0x7ffff09c2700 (LWP 16366) exited]
[Thread 0x7ffff7fb1740 (LWP 16359) exited]
Program terminated with signal SIGKILL, Killed.
The program no longer exists.
```

I believe GDB would have shown SIGTERM if it had arrived first.

@artpol84
Copy link
Contributor Author

artpol84 commented Apr 12, 2017

Checked the SLURM srun man page.
The section on "--kill-on-bad-exit" doesn't clearly specify what behavior should be expected:

```
-K, --kill-on-bad-exit[=0|1]
    Controls whether or not to terminate a job if any task exits with a
    non-zero exit code. If this option is not specified, the default action
    will be based upon the Slurm configuration parameter of KillOnBadExit.
    If this option is specified, it will take precedence over KillOnBadExit.
    An option argument of zero will not terminate the job. A non-zero
    argument or no argument will terminate the job. Note: This option takes
    precedence over the -W, --wait option to terminate the job immediately
    if a task exits with a non-zero exit code. Since this option's argument
    is optional, for proper parsing the single letter option must be
    followed immediately with the value and not include a space between
    them. For example "-K1" and not "-K 1". This option applies to job
    allocations.
```

So we can't say that SLURM doesn't behave as documented - perhaps just not as expected. I personally read the documentation exactly as it behaves.

In contrast, the "--time" section explicitly says what behavior to expect:

```
-t, --time=<time>
    Set a limit on the total run time of the job allocation. If the
    requested time limit exceeds the partition's time limit, the job will
    be left in a PENDING state (possibly indefinitely). The default time
    limit is the partition's default time limit. When the time limit is
    reached, each task in each job step is sent SIGTERM followed by
    SIGKILL. The interval between signals is specified by the Slurm
    configuration parameter KillWait. The OverTimeLimit configuration
    parameter may permit the job to run longer than scheduled. Time
    resolution is one minute and second values are rounded up to the next
    minute. A time limit of zero requests that no time limit be imposed.
    Acceptable time formats include "minutes", "minutes:seconds",
    "hours:minutes:seconds", "days-hours", "days-hours:minutes" and
    "days-hours:minutes:seconds". This option applies to job and step
    allocations.
```
@artpol84
Copy link
Contributor Author

artpol84 commented Apr 12, 2017

I talked to @dannyauble. He thinks we shouldn't use "--kill-on-bad-exit" if we want the normal sequence of cleanup signals: SIGCONT, SIGTERM, SIGKILL. With "--kill-on-bad-exit" it is an immediate SIGKILL; without it, it is the normal sequence of signals, which allows cleanup.
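
For reference, a small standalone harness along these lines could verify which of those signals actually arrive (hypothetical test code, not part of ORTE; SIGKILL itself cannot be caught):

```c
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Run this under srun next to a deliberately failing task and observe
 * which signals slurm delivers. If the process dies without printing
 * anything, only SIGKILL was sent. */
static void trace(int sig)
{
    /* both strings are 12 bytes; write() is async-signal-safe */
    const char *msg = (sig == SIGCONT) ? "got SIGCONT\n" : "got SIGTERM\n";
    write(STDERR_FILENO, msg, 12);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = trace;
    sigaction(SIGCONT, &sa, NULL);
    sigaction(SIGTERM, &sa, NULL);
    for (;;)
        pause(); /* SIGKILL, which cannot be caught, ends the process */
    return 0;
}
```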

@artpol84 artpol84 force-pushed the orte_cleanup/master branch from f87bb79 to a952ff2 Compare April 12, 2017 17:56
@rhc54
Copy link
Contributor

rhc54 commented Apr 12, 2017

settle down there, my friend - moe is looking at this now as he believes there should have been a sigterm. so let's let those guys noodle on this a bit before we jump around too much.

@artpol84
Copy link
Contributor Author

I've updated this PR.
I still think we shouldn't set the orted exit status to whatever the proc passes to PMIx_Abort, because the RTE is not the application itself. If the app aborted but ORTE cleaned everything up, I believe the orted should exit with status 0.
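
Roughly what I have in mind (a sketch with made-up names, not the actual ORTE code):

```c
#include <stdlib.h>

/* Made-up helper standing in for the RTE's own cleanup
 * (session directories, fds, ...); returns 0 on success. */
static int rte_cleanup(void)
{
    return 0;
}

int main(void)
{
    int app_abort_status = 1; /* status the app proc passed to PMIx_Abort */
    (void)app_abort_status;   /* reported via the job-level status instead */

    /* The daemon's own exit code reflects only whether the RTE did its
     * job, not whether the application aborted. */
    return rte_cleanup() == 0 ? 0 : 1;
}
```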

@rhc54
Copy link
Contributor

rhc54 commented Apr 12, 2017

there's a reason why we chose to do it - let's let schedmd think about this a bit

@artpol84
Copy link
Contributor Author

Sure - no rush with this PR.

@artpol84
Copy link
Contributor Author

BTW - what was the reason?

@artpol84 artpol84 force-pushed the orte_cleanup/master branch 2 times, most recently from 9ef8746 to 876959a Compare April 12, 2017 18:48
@rhc54
Copy link
Contributor

rhc54 commented Apr 12, 2017

We sometimes fail to cleanly order all the orteds to die - for example, if one of them hangs or fails and we lose the communication path downstream of it. That was a problem in the past, hence the use of that option. When we made the change, we always got a SIGTERM first - but I don't recall the last time we checked.

For direct launch there is no other option - the procs all create session directories, and if one fails you need SLURM to be a little friendlier with the others, or else we leave droppings.

@artpol84
Copy link
Contributor Author

artpol84 commented Apr 12, 2017

If I understood @dannyauble correctly, if you don't specify the --kill-on-bad-exit option, you get the desired behavior.
I mean that if one of the tasks fails, SLURM will eventually issue SIGCONT, SIGTERM, SIGKILL to all the other tasks. You can use the "-W" option of srun to tell SLURM the delay between the first task's exit and the cleanup sequence:

```
-W, --wait=<seconds>
    Specify how long to wait after the first task terminates before
    terminating all remaining tasks. A value of 0 indicates an unlimited
    wait (a warning will be issued after 60 seconds). The default value is
    set by the WaitTime parameter in the slurm configuration file (see
    slurm.conf(5)). This option can be useful to insure that a job is
    terminated in a timely fashion in the event that one or more tasks
    terminate prematurely. Note: The -K, --kill-on-bad-exit option takes
    precedence over -W, --wait to terminate the job immediately if a task
    exits with a non-zero exit code. This option applies to job
    allocations.
```
@artpol84
Copy link
Contributor Author

Maybe this is the right way to go, and this is what @dannyauble suggested as well.

@artpol84
Copy link
Contributor Author

I'm going to double-check this later today.

@rhc54
Copy link
Contributor

rhc54 commented Apr 12, 2017

I truly don't think that is what we want. While it might be okay for the orteds, since mpirun synchronizes their normal termination, it would definitely not work for directly launched application procs, as they can terminate at very different times. For example, it isn't uncommon for every rank but rank 0 to leave while that rank continues on for quite some time as it saves results to a parallel file system.

I think Moe realizes that this is something they need to fix, and that the current behavior is not what he intended.

@artpol84
Copy link
Contributor Author

I was talking specifically about plm/slurm.
Can you clarify what you mean by directly launched procs? Is this direct srun launch? In that case we don't need any OMPI modifications, do we?

@rhc54
Copy link
Contributor

rhc54 commented Apr 13, 2017

Direct launch means applications launched via srun instead of mpirun. No, we don't need any OMPI changes there, but the point is that the bad behavior of SLURM causes our users equal problems in that use-case. So we want SchedMD to fix the problem.

It's important to remember that we have a varied community of users out there that depend on us to make things work correctly, for all the ways they use our software. They don't understand that something is a slurm vs ompi issue - all they see is that running OMPI leaves droppings behind.

In this case, fixing it at the root cause solves both use-cases. So let's concentrate on helping SchedMD to do the right thing. We'll still need to coordinate with them on how to resolve the problem for all those installations running earlier versions, but that's a separate discussion.

@artpol84
Copy link
Contributor Author

Sounds good, thank you.

@rhc54
Copy link
Contributor

rhc54 commented Apr 13, 2017

Per Moe:

> I have confirmed the logic is as the user described. Slurm does send a SIGCONT/SIGTERM/SIGKILL when a job reaches the end of its time limit, but if someone explicitly runs Slurm's scancel command or the --kill-on-bad-exit is triggered, then only SIGKILL is sent. I've modified the code to change the behavior in our next major release (late 2017), but the patch should go into older versions of Slurm pretty cleanly. With this change the --kill-on-bad-exit triggers the SIGCONT/SIGTERM/SIGKILL signals. Here is the commit: https://github.com/SchedMD/slurm/commit/8ed8098b8b8838567a8c348869119d1ec38fa581

I'm okay with having the orted not set its exit code, as you are quite correct that the orted itself didn't have a problem. I would like to hold off a bit on removing kill-on-bad-exit until we better understand the implications - given the slurm change, it might not be necessary.

@artpol84 artpol84 force-pushed the orte_cleanup/master branch from 876959a to 4af7a08 Compare April 13, 2017 16:58
@artpol84
Copy link
Contributor Author

The PR is updated; I removed the kill-on-bad-exit portion.

Copy link
Contributor

@rhc54 rhc54 left a comment


Thanks for catching it, and your patience.

@rhc54 rhc54 merged commit fbf714d into open-mpi:master Apr 13, 2017