[gridengine users] qstat sometimes doesn't report all currently running jobs
jwagner at ciena.com
Fri Feb 12 21:29:23 UTC 2016
The purpose is to notify the centralized "monitoring" process that one of its workers has completed, so that the central process can go analyze the results from the worker.
qevent looks interesting and may be a more robust alternative to our current qacct implementation.
From: Reuti [mailto:reuti at staff.uni-marburg.de]
Sent: Friday, February 12, 2016 12:30 PM
To: Wagner, Justin
Cc: users at gridengine.org
Subject: Re: [gridengine users] qstat sometimes doesn't report all currently running jobs
On 12.02.2016 at 20:34, Wagner, Justin wrote:
> I'm running SoGE v8.1.0 and we notice from time to time that qstat doesn't always report all of the jobs that are currently executing.
Sometimes there is a small gap between the states "qw", "t" and "r". I would assume that the transfer between states is not atomic.
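
If the state transition really is non-atomic, a poller could treat a job as finished only after it has been missing from several consecutive qstat snapshots. The following is a minimal sketch under that assumption; the function name `finished_jobs` and the threshold `misses_required` are hypothetical, and each snapshot is assumed to be a set of job IDs already extracted from qstat output.

```python
def finished_jobs(snapshots, tracked, misses_required=3):
    """Return tracked job IDs absent from the last `misses_required` snapshots.

    snapshots: list of sets of job IDs, oldest first (one set per qstat poll).
    tracked:   job IDs the monitoring process is waiting for.
    """
    recent = snapshots[-misses_required:]
    if len(recent) < misses_required:
        # Not enough polls yet to tell a real completion from a brief gap.
        return set()
    return {job for job in tracked
            if all(job not in snap for snap in recent)}
```

A job that briefly vanishes from a single poll and then reappears is not reported, at the cost of a delay of a few poll intervals before a real completion is noticed.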
> Is this a known issue with qstat? Is there a fix for it in newer versions of the code?
> Here is the context:
> This causes a problem for systems of ours that poll qstat to determine whether jobs have completed. I've never noticed the problem when running qstat myself; it seems to present itself only to applications that poll qstat periodically.
Would "-hold_jid <wc_job_list>" help? It can also include job names and wildcards.
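
As a rough illustration of the -hold_jid idea, the sketch below builds a qsub command line for a follow-up job that stays on hold until the listed jobs have finished. The helper names and the script name `analyze.sh` are hypothetical; `-hold_jid` and `-terse` are standard qsub options, but check the qsub man page for your version.

```python
import subprocess

def build_followup_cmd(worker_jobs, script):
    """Return a qsub argument list that holds `script` until all worker jobs end.

    worker_jobs may mix job IDs, job names and wildcard patterns.
    """
    hold_list = ",".join(str(job) for job in worker_jobs)
    return ["qsub", "-terse", "-hold_jid", hold_list, script]

def submit_followup(worker_jobs, script, run=subprocess.run):
    """Submit the follow-up job and return what qsub printed (the job ID with -terse)."""
    result = run(build_followup_cmd(worker_jobs, script),
                 capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

For example, `submit_followup([4711, "worker_*"], "analyze.sh")` would queue the analysis step behind job 4711 and every job whose name matches `worker_*`, so nothing needs to poll qstat at all.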
> I know there is a workaround where you can query qacct via "qacct -j job_number" to see if the job is done, so I don't necessarily need any suggested workarounds.
It could also mean that the job got rescheduled (in this case you end up with several entries in the accounting file for one and the same job), not to mention parallel jobs, where you can get a bunch of entries (depending on the PE setting "accounting_summary").
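
To make the multiple-entries caveat concrete, here is a minimal sketch that splits "qacct -j" output into one dictionary per accounting record, so a caller can see how many entries exist before trusting any one of them. It assumes each record starts with a separator line of "=" characters followed by "name value" lines; the function name `parse_qacct` is hypothetical.

```python
def parse_qacct(text):
    """Split `qacct -j <job>` output into a list of {field: value} records."""
    records = []
    current = None
    for line in text.splitlines():
        if line.startswith("====="):
            # A separator line opens a new accounting record.
            current = {}
            records.append(current)
        elif current is not None and line.strip():
            key, _, value = line.strip().partition(" ")
            current[key] = value.strip()
    return records
```

More than one record for a job ID then signals a rescheduled or parallel job, and the caller has to decide which entry (for example the last one) reflects the final outcome.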
What's the goal behind it? Do you want to start another job or start something external?
Besides looking into DRMAA, there is also a small tool `qevent` to trigger an external script in case a job/task finishes.
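
A handler script for qevent might look roughly like the sketch below. It assumes qevent is started along the lines of "qevent -trigger JB_END /path/to/handler.py" and that it calls the script with the event name, the job ID and the task ID as positional arguments; please verify both against the qevent man page of your installation. The names `parse_event` and `main` are hypothetical.

```python
import sys

def parse_event(argv):
    """Split the handler arguments into (event, job_id, task_id)."""
    if len(argv) < 2:
        raise ValueError("expected: <event> <job_id> [<task_id>]")
    event, job_id = argv[0], int(argv[1])
    task_id = int(argv[2]) if len(argv) > 2 else None
    return event, job_id, task_id

def main(argv):
    """Report the finished job; a real handler would notify the monitoring process here."""
    event, job_id, task_id = parse_event(argv)
    print(f"{event}: job {job_id} task {task_id} finished")
    return 0
```

When used as the trigger script, the entry point would call `main(sys.argv[1:])`; the print statement stands in for whatever mechanism tells the central process to go analyze the worker's results.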
> I'm simply asking if this is a known issue with qstat, and if there is a fix for it in newer versions of the code.