Skip to content

Conversation

hjmjohnson
Copy link
Contributor

When large jobs were run with SGE, the accumulation of
qstat queries was causing a massive load on the cluster
server, and was affecting overal system performance.

This modification queries based on userid, stores information
about all that users jobs (both running and recently finished),
subsequent queries are then addressed by looking at the cached
values first, then only updating with a system qstat call
periodically. This change so that qstat is called on a
more reasonable replication time frame.

User can supply externally supplied version of qmake.

Prevent DOS style of attacks on the batch processing
server by preventing continuous queries by many jobs.

This was affecting the performance of the entire server
and the excess load of querying when jobs were done in the
niavie way was affecting the performance of dispatching new
jobs.

The following two scripts can be used as plugin_args to
provide cached versions of qmake suitable for running
huge jobs.

plugin_args=dict(template=JOB_SCRIPT, qsub_args="-S /bin/bash -cwd -pe smp 1-12 -l h_vmem=19G,mem_free=9G -o /dev/null -e /dev/null " + CLUSTER_QUEUE, qstatProgramPath=qstat_immediate.sh, qstatCachedProgramPath=qstat_cached.sh)) =qstat_immediate.sh=============================================== \#!/bin/bash \# \author Hans J. Johnson \# This file provides a wrapper around qstat to ensure that \# the qstat server is not overloaded with too many requests \#debug_file=/dev/null debug_file=/tmp/TESTINGLOG qstat_cache=/tmp/qstat.xml echo "USING EXTERNAL QSTAT: $@" >> ${debug_file} 2>&1 older_than_60_sec=$( find $(dirname ${qstat_cache}) -maxdepth 1 -name $(basename ${qstat_cache}) -mmin $(echo 5/60 |bc -l) ) if [ -z "${older_than_60_sec}" ]; then DoQstatNow=$(dirname ${0})/qstat_immediate.sh ${DoQstatNow} $@ else echo "using cache $(date)" >> ${debug_file} 2>&1 cat ${qstat_cache} fi 

qstat_cached.sh

\#!/bin/bash \# \author Hans J. Johnson \# This file provides a wrapper around qstat to ensure that \# the qstat server is not overloaded with too many requests \#debug_file=/dev/null debug_file=/tmp/TESTINGLOG qstat_cache=/tmp/qstat.xml echo "USING EXTERNAL QSTAT: $@" >> ${debug_file} 2>&1 echo "Refreshing $(date)" >> ${debug_file} 2>&1 cacheUpdated=0; while [ ${cacheUpdated} -eq 0 ]; do if [ ! -f ${qstat_cache}_lock ]; then touch ${qstat_cache}_lock DoQstatNow=$(which qstat) ${DoQstatNow} $@ > ${qstat_cache}_tmp 2>&1 mv ${qstat_cache}_tmp ${qstat_cache} rm ${qstat_cache}_lock let cacheUpdated=1 else echo "Waiting for contention lock $(date)" >> ${debug_file} 2>&1 sleep 1 fi done cat ${qstat_cache} 

@hjmjohnson
Copy link
Contributor Author

@satra This is now cleaned up and has a single patch set for review.

@hjmjohnson
Copy link
Contributor Author

@satra Can this patch set now be merged? It GREATLY improved performance and citizenship on our shared cluster. I was able to run 127 simultaneous nipype instances without over taxing the cluster server or the submit host with this patch.

Hans

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we switch these to logging and use workflow DEBUG logging?

@satra
Copy link
Member

satra commented Dec 3, 2013

just making a github note that this work needs to be carried over to the other schedulers.

@hjmjohnson
Copy link
Contributor Author

@satra Thanks for the feedback. I'll make those recommended changes later this week and test.

@mwaskom
Copy link
Member

mwaskom commented Dec 4, 2013

Is there a reason for all the camelCase method names?

@hjmjohnson
Copy link
Contributor Author

Habit from other projects.

@mwaskom
Copy link
Member

mwaskom commented Dec 4, 2013

Would it be too much to ask for a fix before merge? Otherwise it's just going to show up on my wall of PEP8 shame at some point.

@hjmjohnson
Copy link
Contributor Author

Sure. I'm really swamped right now, but I hope to get back to this during
the holidays.

Hans

From: Michael Waskom notifications@github.com
Reply-To: nipy/nipype
<reply+i-22950785-5e9aafe358e809aed7fcf09ece3e94bcc658c703-313970@reply.gith
ub.com>
Date: Wednesday, December 4, 2013 1:53 PM
To: nipy/nipype nipype@noreply.github.com
Cc: Hans Johnson hans.j.johnson@gmail.com
Subject: Re: [nipype] ENH: Large SGE jobs were overloading on qstat (#728)

Would it be too much to ask for a fix before merge? Otherwise it's just
going to show up on my wall of PEP8 shame at some point.


Reply to this email directly or view it on GitHub
#728 (comment) .

@hjmjohnson
Copy link
Contributor Author

@mwaskom What is the convention for variable names? Do those also need to be converted from CamelCase to underscore_separated?

@mwaskom
Copy link
Member

mwaskom commented Dec 4, 2013

Yep. Pretty much the only thing capitals should be used for are class names, which are in CapCase.

If you run pep8 on the module, I would expect it to complain about most of these things. Here's the relevant section of PEP8 itself.

@satra
Copy link
Member

satra commented Dec 19, 2013

@hjmjohnson - @chrisfilo is planning to release today - is it ok to punt this to a future update? i'm potentially open to merging this with a promise of the fixes in the near future.

@hjmjohnson
Copy link
Contributor Author

Satra,

I've been testing and it looks good, but I really want to ensure that my
version works with more testing. I'm 90 % sure I have a working version,
but I won't have time to do better testing until after the holidays.

I've also been able to add better debugging and simplify the logic a some
more, but that is still on my private branch.

I think it is best to wait on this. Are there others that really need it?

Hans

From: Satrajit Ghosh notifications@github.com
Reply-To: nipy/nipype
<reply+i-22950785-5e9aafe358e809aed7fcf09ece3e94bcc658c703-313970@reply.gith
ub.com>
Date: Thursday, December 19, 2013 8:52 AM
To: nipy/nipype nipype@noreply.github.com
Cc: Hans Johnson hans.j.johnson@gmail.com
Subject: Re: [nipype] ENH: Large SGE jobs were overloading on qstat (#728)

@hjmjohnson https://github.com/hjmjohnson - @chrisfilo
https://github.com/chrisfilo is planning to release today - is it ok to
punt this to a future update? i'm potentially open to merging this with a
promise of the fixes in the near future.


Reply to this email directly or view it on GitHub
#728 (comment) .

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling fcf3690 on BRAINSia:EnhanceQSubProcessing into 99240e0 on nipy:master.

@hjmjohnson
Copy link
Contributor Author

@satra I think this is now ready for review and merging.

@satra
Copy link
Member

satra commented Feb 10, 2014

@hjmjohnson - thanks for this. i'm going to merge this an open an issue that this idea needs to be generalized across the other schedulers. i have to say the ability to switch schedulers has been such a boost - i was recently given access to a slurm cluster and a few changed lines and i was up and running - pretty sweet!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pep8 - column width

hjmjohnson and others added 5 commits February 10, 2014 08:30
When large jobs were run with SGE, the accumulation of qstat queries was causing a massive load on the cluster server, and was affecting overal system performance. This modification queries based on userid, stores information about all that users jobs (both running and recently finished), subsequent queries are then addressed by looking at the cached values first, then only updating with a system qstat call periodically. This change so that qstat is called on a more reasonable replication time frame. User can supply externally supplied version of qmake. Prevent DOS style of attacks on the batch processing server by preventing continuous queries by many jobs. This was affecting the performance of the entire server and the excess load of querying when jobs were done in the niavie way was affecting the performance of dispatching new jobs. The following two scripts can be used as plugin_args to provide cached versions of qmake suitable for running huge jobs. plugin_args=dict(template=JOB_SCRIPT, qsub_args="-S /bin/bash -cwd -pe smp 1-12 -l h_vmem=19G,mem_free=9G -o /dev/null -e /dev/null " + CLUSTER_QUEUE, qstatProgramPath=qstat_immediate.sh, qstatCachedProgramPath=qstat_cached.sh)) =qstat_cached.sh=================================================== \#!/bin/bash \# \author Hans J. Johnson \# This file provides a wrapper around qstat to ensure that \# the qstat server is not overloaded with too many requests \#debug_file=/dev/null debug_file=/tmp/TESTINGLOG qstat_cache=/tmp/qstat.xml echo "USING EXTERNAL QSTAT: $@" >> ${debug_file} 2>&1 older_than_60_sec=$( find $(dirname ${qstat_cache}) -maxdepth 1 -name $(basename ${qstat_cache}) -mmin $(echo 5/60 |bc -l) ) if [ -z "${older_than_60_sec}" ]; then DoQstatNow=$(dirname ${0})/qstat_immediate.sh ${DoQstatNow} $@ else echo "using cache $(date)" >> ${debug_file} 2>&1 cat ${qstat_cache} fi =================================================================== =qstat_immediate.sh=============================================== \#!/bin/bash \# \author Hans J. Johnson \# This file provides a wrapper around qstat to ensure that \# the qstat server is not overloaded with too many requests \#debug_file=/dev/null debug_file=/tmp/TESTINGLOG qstat_cache=/tmp/qstat.xml echo "USING EXTERNAL QSTAT: $@" >> ${debug_file} 2>&1 echo "Refreshing $(date)" >> ${debug_file} 2>&1 cacheUpdated=0; while [ ${cacheUpdated} -eq 0 ]; do if [ ! -f ${qstat_cache}_lock ]; then touch ${qstat_cache}_lock DoQstatNow=$(which qstat) ${DoQstatNow} $@ > ${qstat_cache}_tmp 2>&1 mv ${qstat_cache}_tmp ${qstat_cache} rm ${qstat_cache}_lock let cacheUpdated=1 else echo "Waiting for contention lock $(date)" >> ${debug_file} 2>&1 sleep 1 fi done cat ${qstat_cache} ===================================================================
@satra
Copy link
Member

satra commented Feb 16, 2014

@hjmjohnson - could you merge my pull request to your branch?

fix: parenthesis fixes
@hjmjohnson
Copy link
Contributor Author

Statra,

Done.

Hans

From: Satrajit Ghosh <notifications@github.commailto:notifications@github.com>
Reply-To: nipy/nipype <reply@reply.github.commailto:reply@reply.github.com>
Date: Sunday, February 16, 2014 at 9:50 AM
To: nipy/nipype <nipype@noreply.github.commailto:nipype@noreply.github.com>
Cc: Hans Johnson <hans.j.johnson@gmail.commailto:hans.j.johnson@gmail.com>
Subject: Re: [nipype] ENH: Large SGE jobs were overloading on qstat (#728)

@hjmjohnsonhttps://github.com/hjmjohnson - could you merge my pull request to your branch?


Reply to this email directly or view it on GitHubhttps://github.com//pull/728#issuecomment-35199908.


Notice: This UI Health Care e-mail (including attachments) is covered by the Electronic Communications Privacy Act, 18 U.S.C. 2510-2521, is confidential and may be legally privileged. If you are not the intended recipient, you are hereby notified that any retention, dissemination, distribution, or copying of this communication is strictly prohibited. Please reply to the sender that you have received the message in error, then delete it. Thank you.


@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling e4dca02 on BRAINSia:EnhanceQSubProcessing into 99240e0 on nipy:master.

satra added a commit that referenced this pull request Feb 19, 2014
ENH: Large SGE jobs were overloading on qstat
@satra satra merged commit 699df35 into nipy:master Feb 19, 2014
@hjmjohnson hjmjohnson deleted the EnhanceQSubProcessing branch March 14, 2014 14:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

4 participants