[gridengine users] qstat sometimes doesn't report all currently running jobs
reuti at staff.uni-marburg.de
Fri Feb 12 20:30:20 UTC 2016
On 12.02.2016 at 20:34, Wagner, Justin wrote:
> I’m running SoGE v8.1.0 and we notice from time to time that qstat doesn’t always report all of the jobs that are currently executing.
Sometimes there is a small gap between the states "qw", "t" and "r". I would assume that the transfer between states is not atomic.
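If such a gap is the cause, a polling application could tolerate it by treating a job as finished only after it has been absent from several consecutive qstat polls. The following is a minimal sketch of that idea, not a tested integration. The function names, the threshold of 3, and the assumption that each job row in the default qstat listing starts with a numeric job ID are illustrative.

```python
def parse_job_ids(qstat_output):
    """Collect job IDs from plain `qstat` output.

    Assumption: job rows start with a numeric job ID, while the
    header and separator lines do not.
    """
    ids = set()
    for line in qstat_output.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            ids.add(fields[0])
    return ids


def update_missing_counts(missing, tracked_ids, visible_ids, threshold=3):
    """Return the tracked jobs that count as finished after this poll.

    missing: dict mapping job ID -> number of consecutive polls in
             which the job was absent (updated in place).
    A job that reappears has its counter reset, so a short gap
    between states does not trigger a false "finished".
    """
    finished = set()
    for job_id in tracked_ids:
        if job_id in visible_ids:
            missing[job_id] = 0
        else:
            missing[job_id] = missing.get(job_id, 0) + 1
            if missing[job_id] >= threshold:
                finished.add(job_id)
    return finished
```

The poller would call `parse_job_ids` on each qstat result and pass the set to `update_missing_counts`, acting only on the IDs it returns.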
> Is this a known issue with qstat? Is there a fix for it in newer versions of the code?
> Here is the context:
> This causes a problem for systems we have that poll qstat to determine whether jobs have completed. In fact I’ve never noticed the problem when running qstat myself; it only seems to present itself to applications that are periodically polling qstat.
Would "-hold_jid <wc_job_list>" help? The list can also include job names and wildcards.
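As a rough illustration, a follow-up job could be submitted with a hold on its predecessors, so that the scheduler does the waiting and no polling is needed. The sketch below only builds the argument list. The helper name, the script name `post.sh`, and the job name pattern `prep_*` are hypothetical. It assumes the entries of the job list are joined with commas.

```python
def qsub_with_hold(script, predecessors):
    """Build a qsub command line that holds `script` until all
    predecessor jobs (IDs, names, or wildcard patterns) are done."""
    if not predecessors:
        raise ValueError("at least one predecessor job is required")
    # Assumption: <wc_job_list> is a comma-separated list.
    job_list = ",".join(str(p) for p in predecessors)
    return ["qsub", "-hold_jid", job_list, script]


# Hypothetical example: run post.sh after job 4711 and all jobs
# whose names match prep_*.
argv = qsub_with_hold("post.sh", [4711, "prep_*"])
```

The resulting list can be handed to `subprocess.run(argv, check=True)`. Passing a list rather than a shell string keeps the wildcard from being expanded by the shell.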
> I know there is a workaround where you can query qacct via “qacct -j job_number” to see if the job is done, so I don’t necessarily need any suggested workarounds.
It could also mean that the job got rescheduled (in this case you end up with several entries in the accounting file for one and the same job), not to mention parallel jobs, where you can get a bunch of entries (depending on the PE setting "accounting_summary").
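A script that relies on qacct would therefore have to cope with more than one record per job. Here is a small sketch of how the records might be separated and counted, assuming that "qacct -j" prints each record after a line of "=" characters as "key value" lines and that the job number appears under the key "jobnumber". The function names are illustrative.

```python
def split_qacct_records(text):
    """Split `qacct -j` output into one dict per accounting record.

    Assumption: every record starts after a line of '=' characters
    and consists of 'key value' lines.
    """
    records = []
    current = None
    for line in text.splitlines():
        if line.startswith("====="):
            current = {}
            records.append(current)
            continue
        if current is None or not line.strip():
            continue
        key, _, value = line.strip().partition(" ")
        current[key] = value.strip()
    return records


def entries_per_job(records):
    """Count accounting records per job number."""
    counts = {}
    for record in records:
        job_id = record.get("jobnumber")
        if job_id is not None:
            counts[job_id] = counts.get(job_id, 0) + 1
    return counts
```

A count above one for a job number would point to a rescheduled job or to a parallel job without an accounting summary, so the presence of a single record should not be read as "the job is done" without further checks.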
What's the goal behind it? Do you want to start another job or start something external?
Besides looking into DRMAA, there is also a small tool `qevent` to trigger an external script in case a job/task finishes.
> I’m simply asking if this is a known issue with qstat, and if there is a fix for it in newer versions of the code.