- Notifications
You must be signed in to change notification settings - Fork 536
ENH: Large SGE jobs were overloading on qstat #728
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@satra This is now cleaned up and has a single patch set for review. |
@satra Can this patch set now be merged? It GREATLY improved performance and citizenship on our shared cluster. I was able to run 127 simultaneous nipype instances without over taxing the cluster server or the submit host with this patch. Hans |
nipype/pipeline/plugins/sge.py Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we switch these to logging and use workflow DEBUG logging?
just making a github note that this work needs to be carried over to the other schedulers. |
@satra Thanks for the feedback. I'll make those recommended changes later this week and test. |
Is there a reason for all the camelCase method names? |
Habit from other projects. |
Would it be too much to ask for a fix before merge? Otherwise it's just going to show up on my wall of PEP8 shame at some point. |
Sure. I'm really swamped right now, but I hope to get back to this during Hans From: Michael Waskom notifications@github.com Would it be too much to ask for a fix before merge? Otherwise it's just ‹ |
@mwaskom What is the convention for variable names? Do those also need to be converted from CamelCase to underscore_separated? |
Yep. Pretty much the only thing capitals should be used for are class names, which are in CapCase. If you run |
@hjmjohnson - @chrisfilo is planning to release today - is it ok to punt this to a future update? i'm potentially open to merging this with a promise of the fixes in the near future. |
Satra, I've been testing and it looks good, but I really want to ensure that my I've also been able to add better debugging and simplify the logic a some I think it is best to wait on this. Are there others that really need it? Hans From: Satrajit Ghosh notifications@github.com @hjmjohnson https://github.com/hjmjohnson - @chrisfilo ‹ |
@satra I think this is now ready for review and merging. |
@hjmjohnson - thanks for this. i'm going to merge this an open an issue that this idea needs to be generalized across the other schedulers. i have to say the ability to switch schedulers has been such a boost - i was recently given access to a slurm cluster and a few changed lines and i was up and running - pretty sweet! |
nipype/pipeline/plugins/base.py Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pep8 - column width
When large jobs were run with SGE, the accumulation of qstat queries was causing a massive load on the cluster server, and was affecting overal system performance. This modification queries based on userid, stores information about all that users jobs (both running and recently finished), subsequent queries are then addressed by looking at the cached values first, then only updating with a system qstat call periodically. This change so that qstat is called on a more reasonable replication time frame. User can supply externally supplied version of qmake. Prevent DOS style of attacks on the batch processing server by preventing continuous queries by many jobs. This was affecting the performance of the entire server and the excess load of querying when jobs were done in the niavie way was affecting the performance of dispatching new jobs. The following two scripts can be used as plugin_args to provide cached versions of qmake suitable for running huge jobs. plugin_args=dict(template=JOB_SCRIPT, qsub_args="-S /bin/bash -cwd -pe smp 1-12 -l h_vmem=19G,mem_free=9G -o /dev/null -e /dev/null " + CLUSTER_QUEUE, qstatProgramPath=qstat_immediate.sh, qstatCachedProgramPath=qstat_cached.sh)) =qstat_cached.sh=================================================== \#!/bin/bash \# \author Hans J. Johnson \# This file provides a wrapper around qstat to ensure that \# the qstat server is not overloaded with too many requests \#debug_file=/dev/null debug_file=/tmp/TESTINGLOG qstat_cache=/tmp/qstat.xml echo "USING EXTERNAL QSTAT: $@" >> ${debug_file} 2>&1 older_than_60_sec=$( find $(dirname ${qstat_cache}) -maxdepth 1 -name $(basename ${qstat_cache}) -mmin $(echo 5/60 |bc -l) ) if [ -z "${older_than_60_sec}" ]; then DoQstatNow=$(dirname ${0})/qstat_immediate.sh ${DoQstatNow} $@ else echo "using cache $(date)" >> ${debug_file} 2>&1 cat ${qstat_cache} fi =================================================================== =qstat_immediate.sh=============================================== \#!/bin/bash \# \author Hans J. Johnson \# This file provides a wrapper around qstat to ensure that \# the qstat server is not overloaded with too many requests \#debug_file=/dev/null debug_file=/tmp/TESTINGLOG qstat_cache=/tmp/qstat.xml echo "USING EXTERNAL QSTAT: $@" >> ${debug_file} 2>&1 echo "Refreshing $(date)" >> ${debug_file} 2>&1 cacheUpdated=0; while [ ${cacheUpdated} -eq 0 ]; do if [ ! -f ${qstat_cache}_lock ]; then touch ${qstat_cache}_lock DoQstatNow=$(which qstat) ${DoQstatNow} $@ > ${qstat_cache}_tmp 2>&1 mv ${qstat_cache}_tmp ${qstat_cache} rm ${qstat_cache}_lock let cacheUpdated=1 else echo "Waiting for contention lock $(date)" >> ${debug_file} 2>&1 sleep 1 fi done cat ${qstat_cache} ===================================================================
@hjmjohnson - could you merge my pull request to your branch? |
fix: parenthesis fixes
Statra, Done. Hans From: Satrajit Ghosh <notifications@github.commailto:notifications@github.com> @hjmjohnsonhttps://github.com/hjmjohnson - could you merge my pull request to your branch? — Notice: This UI Health Care e-mail (including attachments) is covered by the Electronic Communications Privacy Act, 18 U.S.C. 2510-2521, is confidential and may be legally privileged. If you are not the intended recipient, you are hereby notified that any retention, dissemination, distribution, or copying of this communication is strictly prohibited. Please reply to the sender that you have received the message in error, then delete it. Thank you. |
ENH: Large SGE jobs were overloading on qstat
When large jobs were run with SGE, the accumulation of
qstat queries was causing a massive load on the cluster
server, and was affecting overal system performance.
This modification queries based on userid, stores information
about all that users jobs (both running and recently finished),
subsequent queries are then addressed by looking at the cached
values first, then only updating with a system qstat call
periodically. This change so that qstat is called on a
more reasonable replication time frame.
User can supply externally supplied version of qmake.
Prevent DOS style of attacks on the batch processing
server by preventing continuous queries by many jobs.
This was affecting the performance of the entire server
and the excess load of querying when jobs were done in the
niavie way was affecting the performance of dispatching new
jobs.
The following two scripts can be used as plugin_args to
provide cached versions of qmake suitable for running
huge jobs.
qstat_cached.sh