[gridengine users] rebooting nodes nicely - what happened?

Michael Stauffer mgstauff at gmail.com
Tue Mar 1 22:44:39 UTC 2016


SoGE 8.1.8

Hi,

I need to reboot my compute nodes after the glibc patch, and wanted to do
so nicely, i.e. wait for each node's jobs to finish before rebooting. I've
done this before and it worked, but now my setup is a little more
complicated and I changed my reinstall script.

I have a queue for qsub jobs and one for qlogin. Each is assigned a
different number of cores per node, so that some nodes always have at least
a couple of cores available for qlogin sessions, and some nodes are used only
for qsub jobs.

However, my reinstall script (taken from the SGE examples, listed below)
works by submitting a job that requests all the cores on a node, so it only
runs once the other jobs have completed. So I created a new queue called
reboot.q that is allotted all cores on all nodes. My understanding was that
the queues would cooperatively manage resources, so if a node was using, for
example, 8 cores for jobs on my qsub queue, then my reboot job requesting
16 cores would wait until those jobs finished.

But when I ran my script, all nodes rebooted for reinstall immediately. I
guess I don't understand things correctly? Can someone set me straight? How
do I do a node reboot only after jobs have finished under these
circumstances?

script:

ME=`hostname`
EXECHOSTS=`qconf -sel`

for TARGETHOST in $EXECHOSTS; do
        if [ "$ME" == "$TARGETHOST" ]; then
                echo "Skipping $ME. This is the submission host"
        else
                numprocs=`qconf -se $TARGETHOST | \
                        awk '/^processors/ {print $2}'`
                /opt/rocks/bin/rocks set host boot $TARGETHOST action=install
                qsub -p 1024 -pe unihost $numprocs \
                        -binding linear:${numprocs} \
                        -q reboot.q@$TARGETHOST \
                        /root/admin/scripts/sge-reboot.qsub
                echo "Set $TARGETHOST for Reinstallation"
        fi
done
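
On the expectation that the queues would share a node's cores: in Grid
Engine the slot count of each queue instance is tracked separately, so a
reboot.q job can start on a node whose other queues are still busy unless
something caps the host as a whole. One such cap is a host-level "slots"
entry in the exec host's complex_values. The following is only a sketch for
checking whether that entry exists. The function name host_slots_limit is
made up for illustration, and it assumes the complex_values entry in the
`qconf -se` output sits on a single line.

```shell
# Read `qconf -se <host>` output on stdin and print the host-level slots
# limit from complex_values, or "none" if no such limit is defined.
# Assumption: complex_values is on one line (no backslash continuation).
host_slots_limit() {
    awk '
        $1 == "complex_values" {
            n = split($2, kv, ",")
            for (i = 1; i <= n; i++) {
                if (kv[i] ~ /^slots=/) {
                    sub(/^slots=/, "", kv[i])
                    print kv[i]
                    found = 1
                }
            }
        }
        END { if (!found) print "none" }
    '
}
```

It would be used as `qconf -se $TARGETHOST | host_slots_limit`. A result of
"none" would suggest that nothing at the host level stops the combined slots
of several queues from exceeding the core count.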

Thanks

-M