orte/pmix: Do not set orted exit status to one from proc abort #3331
| We have been chasing this problem since last year. |
a0363df to f68882a Compare | I'm not necessarily opposed to this change, but I am a bit disturbed by the implication that slurm is no longer behaving as expected or documented. The --kill-on-bad-exit option is supposed to result in slurm hitting each of the remaining processes with a SIGTERM, which the orted traps so it can clean up. We rely on that as well for direct launch so the apps can clean up. Is that no longer happening? Or is the orted not properly trapping it, or failing to clean up after it does trap the signal? |
| Honestly, I read this option literally and assumed that SIGKILL would be sent. |
| I will check that now. |
| Could you please check with slurm to see what happened to kill-on-bad-exit? I realize this fixes a symptom, but we should understand the source of the problem. If the orted isn't properly handling SIGTERM, then that's the issue we should resolve. One simple way of testing: just have the SIGTERM trap print out "OUCH". |
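The "print OUCH from the SIGTERM trap" test suggested above can be tried in isolation. The following is a minimal standalone sketch in Python, not orted code: the handler name `on_sigterm` and the `received` list are hypothetical, and the process signals itself purely to exercise the trap.

```python
import os
import signal

received = []  # hypothetical log of every signal number the handler sees


def on_sigterm(signum, frame):
    # Stand-in for a daemon's SIGTERM trap: announce the signal and record it.
    print("OUCH", flush=True)
    received.append(signum)


# Install the trap, then deliver SIGTERM to this process to exercise it.
signal.signal(signal.SIGTERM, on_sigterm)
os.kill(os.getpid(), signal.SIGTERM)
```

If "OUCH" never appears when the launcher tears the job down, then either no SIGTERM was delivered or the process was killed by a signal that cannot be trapped, such as SIGKILL.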
| Just saw your note - thx! |
| And it seems it has been like that "forever", at least since 2012: |
| Double-checked at runtime: launched on 3 nodes through a batch script (the first will be controlled by mpirun). I believe that GDB would show SIGTERM if it came first. |
| Checked the SLURM srun man page: So we can't say that SLURM does not behave as documented, though it probably does not behave as expected. I personally read it exactly as it behaves. By contrast, the "--time" section explicitly says what behavior one should expect: |
| I talked to @dannyauble. He thinks that we shouldn't use "--kill-on-bad-exit" if we want to get a normal sequence of cleanup signals: |
f87bb79 to a952ff2 Compare | settle down there, my friend - moe is looking at this now as he believes there should have been a sigterm. so let's let those guys noodle on this a bit before we jump around too much. |
| I've updated this PR. |
| there's a reason why we chose to do it - let's let schedmd think about this a bit |
| Sure - no rush with this PR. |
| BTW - what was the reason? |
The fact that an application proc called Abort (read: failed) doesn't mean that the ORTE subsystem has failed; on the contrary, it is doing its work to gracefully exit the whole application. An orted exiting with non-zero status creates a problem for at least plm/slurm environments, where orteds are launched via `srun` with the "--kill-on-bad-exit" flag. If one of the orteds exits with non-zero status, slurm immediately kills all the other orteds. As a result we see a lot of leftover files in the `/tmp` directory. Signed-off-by: Artem Polyakov <artpol84@gmail.com>
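The reasoning in this commit message can be restated as a toy model. This is a sketch under stated assumptions, not the actual ORTE or slurm logic: both function names are hypothetical, and the launcher rule is only the behavior described in this thread (any non-zero daemon exit causes the remaining daemons to be killed).

```python
from typing import List


def orted_exit_status(daemon_failed: bool, app_proc_aborted: bool) -> int:
    """Toy model of the proposed behavior for a daemon's exit status."""
    # app_proc_aborted is intentionally ignored: tearing down an aborted
    # application is normal daemon work, not a daemon failure.
    return 1 if daemon_failed else 0


def kill_on_bad_exit_triggers(statuses: List[int]) -> bool:
    """Toy model of the launcher: any non-zero daemon exit kills the rest."""
    return any(status != 0 for status in statuses)


# An application abort alone no longer makes the launcher kill the other daemons.
statuses = [orted_exit_status(False, True), 0, 0]
print(kill_on_bad_exit_triggers(statuses))
```

In this model the other daemons are killed only when a daemon itself fails, which is the case the option was meant to cover.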
9ef8746 to 876959a Compare | We sometimes fail to cleanly order all the orteds to die - for example, if one of them hangs or fails and we lose communication path downstream of it. It was a problem in the past, hence the use of that option. When we made the change, we always got a SIGTERM first - but I don't recall last time we ever checked it. For direct launch, there is no other option - the procs all create session directories, and if one fails then you need SLURM to be a little more friendly with the others or else we leave droppings. |
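Why the choice of signal matters for "droppings" can be shown with a small experiment. This is a hedged sketch, not Open MPI code: the child process creates a hypothetical session directory (prefix `demo-session-`) and removes it from a SIGTERM trap, and the helper `run_and_kill` is an illustrative name.

```python
import os
import shutil
import signal
import subprocess
import sys

# Child program: create a session directory and remove it when SIGTERM arrives.
CHILD = r"""
import shutil, signal, sys, tempfile, time
session_dir = tempfile.mkdtemp(prefix="demo-session-")
def cleanup(signum, frame):
    shutil.rmtree(session_dir, ignore_errors=True)
    sys.exit(0)
signal.signal(signal.SIGTERM, cleanup)
print(session_dir, flush=True)  # printed only after the trap is installed
while True:
    time.sleep(0.1)
"""


def run_and_kill(sig):
    """Start the child, send it `sig`, and report whether its directory survived."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE, text=True)
    session_dir = proc.stdout.readline().strip()
    proc.send_signal(sig)
    proc.wait()
    proc.stdout.close()
    leftover = os.path.isdir(session_dir)
    if leftover:
        shutil.rmtree(session_dir)  # tidy up after the demonstration
    return leftover
```

Calling `run_and_kill(signal.SIGTERM)` should report no leftover directory, while `run_and_kill(signal.SIGKILL)` should report one, because SIGKILL cannot be trapped and the cleanup code never runs.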
| If I understood @dannyauble correctly, if you don't specify |
| Maybe this is the right way to go, and this is what @dannyauble suggested as well. |
| I'm going to double-check this later today |
| I truly don't think that is what we want. While it might be okay for the orteds, since mpirun synchronizes their normal termination, it would definitely not work for application procs that are directly launched, as they can terminate at very different times. For example, it isn't uncommon for all but rank=0 to leave while that rank continues on for quite some time as it saves the results to a parallel file system. I think Moe realizes that this is something they need to fix, and that the current behavior is not what he intended. |
| I was talking specifically about plm/slurm. |
| Direct launch means applications launched via srun instead of mpirun. No, we don't need any OMPI changes there, but the point is that the bad behavior of SLURM causes our users equal problems in that use-case. So we want SchedMD to fix the problem. It's important to remember that we have a varied community of users out there that depend on us to make things work correctly, for all the ways they use our software. They don't understand that something is a slurm vs ompi issue - all they see is that running OMPI leaves droppings behind. In this case, fixing it at the root cause solves both use-cases. So let's concentrate on helping SchedMD to do the right thing. We'll still need to coordinate with them on how to resolve the problem for all those installations running earlier versions, but that's a separate discussion. |
| Sounds good, thank you. |
| Per Moe: I'm okay with having the orted not set its exit code as you are quite correct that it didn't have the problem. I would like to hold off on removing kill-on-bad-exit a bit until we better understand the implications - given the slurm change, it might not be necessary. |
876959a to 4af7a08 Compare | The PR is updated; I removed the kill-on-bad-exit portion. |
Thanks for catching it, and your patience.