[gridengine users] SoGE 8.1.8 - qsub issue when using a requestable variable and a parallel environment - need your help.
yuribu at mellanox.com
Fri Oct 30 09:40:39 UTC 2015
Hello distinguished forum members,
We recently needed to submit jobs with qsub that request both the requestable variable hostname and a parallel environment.
For example, if we submit an 'xterm' job:
* $SGE_ROOT/bin/lx-amd64/qsub -V -cwd -b y -l hostname=host_in_grid -pe somePe 1 xterm
This kind of request makes the scheduler behave strangely. Each submission ends in one of the following states:
1. The xterm job opens as expected.
2. There is a very long delay, and then the xterm opens.
3. The job enters the 'qw' state with messages similar to the following:
cannot run because it exceeds limit "/////" in rule "some_rule/1"
cannot run in PE "somePe" because it only offers 0 slots
In all of the above states, "host_in_grid" has enough free slots, and the quota rule "some_rule" is not related in any way to the consumable/requestable variable in the job submission request.
If we remove the "some_rule" quota from the SGE quotas, the message names another rule instead and again states that its limit was exceeded.
NOTE: the somePe parallel environment has enough free slots; it is initially defined with 999 slots.
Basically, these "cannot run" messages do not reflect the real reason why the job cannot run, since all conditions are actually met. This is very confusing. Why does this happen?
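For anyone trying to reproduce this, the state described above can be inspected with the standard SGE client commands. The sketch below is a dry run: it only prints the commands unless SGE_DIAG_RUN=1 is set on a host with the SGE client tools installed. The job id 12345, the variable SGE_DIAG_RUN, and the function name diag_commands are placeholders chosen for illustration, not values from our cluster. Note that qstat -j reports scheduling reasons only when schedd_job_info is enabled in the scheduler configuration.

```shell
#!/bin/sh
# Dry-run sketch: lists diagnostic commands for a pending job.
# Arguments are hypothetical placeholders: job id, host name, PE name.

diag_commands() {
    job_id=$1
    host=$2
    pe=$3
    # Scheduler's stated reasons for leaving the job pending
    echo "qstat -j $job_id"
    # PE definition, including its slots and allocation_rule
    echo "qconf -sp $pe"
    # All resource quota sets, to compare with the rule named in the message
    echo "qconf -srqs"
    # Queue instances and slot usage on the requested host
    echo "qhost -q -h $host"
}

diag_commands "${1:-12345}" "${2:-host_in_grid}" "${3:-somePe}" |
while read -r cmd; do
    if [ "${SGE_DIAG_RUN:-0}" = "1" ]; then
        # Execute the command; stdin is redirected so it cannot eat the pipe
        $cmd </dev/null
    else
        # Default: only show what would be run
        echo "$cmd"
    fi
done
```

Run without arguments it prints the four commands for the placeholder values; pass a real job id, host, and PE name as the three arguments to adapt it.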
We also found a workaround that avoids the requestable variable "hostname"; the following form ALWAYS works:
$SGE_ROOT/bin/lx-amd64/qsub -V -cwd -b y -q host_in_grid -pe testpe 1 xterm
Any ideas why this strange behavior occurs? Is this some kind of bug? How can it be resolved?
We appreciate your help.
Yuri Burmachenko | Sr. Engineer | IT | Mellanox Tech
Work: +972 74 7236386 | Cell +972 54 7542188 |Fax: +972 4 959 3245