[gridengine users] qalter not successful
Kevin Buckley
kevin.buckley.ecs.vuw.ac.nz at gmail.com
Sun Jun 17 23:01:36 UTC 2012
> I'll let you know what happens,
I got a chance to try things out on a Xen mimic of the grid, and
starting up a new execd does seem to allow one to carry on using the
resource on which you have orphaned any jobs by taking out the original
execd.
A full write-up of my testing can be found here
http://homepages.ecs.vuw.ac.nz/~kevin/forSGE/Extending_Grid_Engine_Runtimes_with_an_execd_softstop.html
but the salient points follow to keep things in the thread.
In between the softstop and the restart, replace the execute host's
configuration, which just had these defaults
execd_spool_dir /var/opt/gridengine/default/spool
gid_range 20000-20100
by creating a local conf for it
qconf -mconf localnode
with new values as follows
execd_spool_dir /var/opt/gridengine/default/spool2
gid_range 20101-20200
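
The new gid_range above does not overlap the old one, presumably so that
the new execd cannot hand out a supplementary group ID that is still
attached to an orphaned job. A minimal sketch of checking that, and of
writing the two settings to a file, might look like this. The function
names and the temporary directory are illustrative only, and loading the
file with qconf -Mconf (rather than editing interactively with -mconf) is
an assumption to check against qconf(1) for your version.

```shell
#!/bin/sh
# Succeed (exit status 0) when two "low-high" GID ranges share an ID.
ranges_overlap() {
    a_lo=${1%-*}; a_hi=${1#*-}
    b_lo=${2%-*}; b_hi=${2#*-}
    [ "$a_lo" -le "$b_hi" ] && [ "$b_lo" -le "$a_hi" ]
}

# Write a local configuration file for one execution host.
# $1 = output file, $2 = spool directory, $3 = gid_range
write_local_conf() {
    printf 'execd_spool_dir %s\ngid_range %s\n' "$2" "$3" > "$1"
}

old_range=20000-20100
new_range=20101-20200

if ranges_overlap "$old_range" "$new_range"; then
    echo "gid ranges overlap; choose another range" >&2
else
    conf_dir=$(mktemp -d)
    # File is named after the host, here the "localnode" used above.
    write_local_conf "$conf_dir/localnode" \
        /var/opt/gridengine/default/spool2 "$new_range"
    cat "$conf_dir/localnode"
    # Assumption: the file could then be loaded non-interactively with
    #   qconf -Mconf "$conf_dir/localnode"
fi
```

Nothing here touches the running cluster; it only prints the file that
would be loaded.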
The restart even creates the new spool directory.
A qstat still shows the job on that node with a slot taken
# qstat -f -u \*
queuename qtype resv/used/tot. load_avg arch states
-------------------------------------------------------------------------------
all.q@scifachpc-c01n03.local BIP 0/0/1 0.00 lx24-amd64
-------------------------------------------------------------------------------
all.q@scifachpc-c01n04.local BIP 0/1/1 0.00 lx24-amd64
7 0.55500 qsub2.sh buckleke r 06/17/2012 12:00:05 1
A pstree shows a new execd tree and the orphaned job
+-sge_execd---4*[{sge_execd}]
+-sge_shepherd---sh---sleep
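
In the pstree above the orphaned shepherd is recognisable because it no
longer hangs off an sge_execd. A rough sketch of spotting such processes
from ps output follows; the function name and the PIDs in the sample are
made up for illustration, and it assumes the process names appear exactly
as sge_execd and sge_shepherd.

```shell
#!/bin/sh
# Read "PID PPID COMMAND" lines on stdin and print the PIDs of
# sge_shepherd processes whose parent is not an sge_execd.
find_orphaned_shepherds() {
    awk '
        { comm[$1] = $3; ppid[$1] = $2; order[++n] = $1 }
        END {
            for (i = 1; i <= n; i++) {
                p = order[i]
                if (comm[p] == "sge_shepherd" && comm[ppid[p]] != "sge_execd")
                    print p
            }
        }'
}

# Sample input with invented PIDs: 1877 was reparented to init,
# 2210 belongs to the new execd (2101).
find_orphaned_shepherds <<'EOF'
1 0 init
2101 1 sge_execd
1877 1 sge_shepherd
1878 1877 sh
2210 2101 sge_shepherd
2211 2210 sh
EOF
```

On a live node the input could come from
ps -e -o pid= -o ppid= -o comm=
piped into the function.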
Even altering the configuration to add another slot works
# qstat -f -u \*
queuename qtype resv/used/tot. load_avg arch states
-------------------------------------------------------------------------------
all.q@scifachpc-c01n03.local BIP 0/0/1 0.00 lx24-amd64
-------------------------------------------------------------------------------
all.q@scifachpc-c01n04.local BIP 0/1/2 0.00 lx24-amd64
7 0.55500 qsub2.sh buckleke r 06/17/2012 12:00:05 1
Submitting another job to the same queue sees
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
7 0.55500 qsub2.sh buckleke r 06/17/2012 12:00:05 all.q@scifachpc-c01n04.local 1
8 0.55500 qsub3.sh buckleke r 06/17/2012 12:07:05 all.q@scifachpc-c01n04.local 1
with the pstree showing both
+-sge_execd-+-sge_shepherd---sh---sleep
| +-4*[{sge_execd}]
+-sge_shepherd---sh---sleep
with the Grid Engine now believing that both slots are used
# qstat -f -u \*
queuename qtype resv/used/tot. load_avg arch states
-------------------------------------------------------------------------------
all.q@scifachpc-c01n03.local BIP 0/0/1 0.00 lx24-amd64
-------------------------------------------------------------------------------
all.q@scifachpc-c01n04.local BIP 0/2/2 0.01 lx24-amd64
7 0.55500 qsub2.sh buckleke r 06/17/2012 12:00:05 1
8 0.55500 qsub3.sh buckleke r 06/17/2012 12:07:05 1
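
The used/total slot counts that these listings are being read for can be
pulled out mechanically. The following is only a sketch: the function
name is invented, and it assumes queue lines of the form queue@host
followed by a resv/used/tot. field, as qstat -f prints them.

```shell
#!/bin/sh
# Print "queue-instance used total" for every queue line found in
# `qstat -f` output on stdin; job lines and separators are skipped.
slot_usage() {
    awk '$1 ~ /@/ && $3 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ {
        split($3, s, "/")      # s[1]=reserved, s[2]=used, s[3]=total
        print $1, s[2], s[3]
    }'
}

slot_usage <<'EOF'
queuename qtype resv/used/tot. load_avg arch states
-------------------------------------------------------------------------------
all.q@scifachpc-c01n03.local BIP 0/0/1 0.00 lx24-amd64
-------------------------------------------------------------------------------
all.q@scifachpc-c01n04.local BIP 0/2/2 0.01 lx24-amd64
7 0.55500 qsub2.sh buckleke r 06/17/2012 12:00:05 1
8 0.55500 qsub3.sh buckleke r 06/17/2012 12:07:05 1
EOF
```

For the listing above this reports 0 of 1 slots used on c01n03 and 2 of 2
on c01n04. Against a live qmaster the input would be qstat -f -u \*
piped into the function.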
Eventually, the newer job stops as normal, yet the qmaster thinks the
old one is still running, even though it has finished
# qstat -f -u \*
queuename qtype resv/used/tot. load_avg arch states
-------------------------------------------------------------------------------
all.q@scifachpc-c01n03.local BIP 0/0/1 0.00 lx24-amd64
-------------------------------------------------------------------------------
all.q@scifachpc-c01n04.local BIP 0/1/2 0.00 lx24-amd64
7 0.55500 qsub2.sh buckleke r 06/17/2012 12:00:05 1
and the Grid Engine knows nothing about it finishing either
# qacct -j 7
error: job id 7 not found
and nor does the user looking for their job
$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
7 0.55500 qsub2.sh buckleke r 06/17/2012 12:00:05 all.q@scifachpc-c01n04.local 1
even though that job has run its course on the node we mangled, with a
pstree there now only showing
+-sge_execd---4*[{sge_execd}]
To get back to the "original" environment, we "softstop" the new
execd, although, with no jobs running on it, we could just "stop"
it.
Modify the execd's conf back to what it was (in this case, the
defaults, so we could just delete the local config).
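
As a dry-run sketch only, the steps back to the original setup could be
listed like this. The function name and the control-script path are
hypothetical (the real path depends on the installation), the final
start of an execd is assumed rather than stated above, and qconf -dconf
is used on the assumption that deleting the local config is acceptable
because the original values were the defaults.

```shell
#!/bin/sh
# Print (do not run) the commands for returning a host to the defaults.
# $1 = execution host name, $2 = execd control script (path is assumed)
print_restore_steps() {
    host=$1
    ctl=$2
    echo "$ctl softstop"        # stop the temporary execd, leave jobs alone
    echo "qconf -dconf $host"   # drop the host's local configuration
    echo "$ctl start"           # assumed: start an execd on the old settings
}

print_restore_steps localnode /etc/init.d/sgeexecd
```

Printing the commands rather than running them leaves room to check each
one against the local installation first.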
The system now thinks the job that was orphaned finished when it did
(after 10 minutes)
qsub_time Sun Jun 17 11:59:53 2012
start_time Sun Jun 17 12:00:05 2012
end_time Sun Jun 17 12:10:05 2012
This will get my user out of a major bind, so thanks to all for the
insight and feedback.
Kevin Buckley
ECS, VUW, NZ