[gridengine users] GE2011.11 and ge6.2u5

Michael Coffman michael.coffman at avagotech.com
Fri Jun 15 16:01:18 UTC 2012


I am trying to update my sge_execd and sge_shepherd binaries. Based on
recent emails, I figured I could drop the GE2011.11 bits into place and
they would work fine. However, I am having issues:

My grid environment is:

Current Version - SGE - 6.2u5
SGE_CELL=ftcrnd
SGE_ROOT=/opt/grid-6.2u5
SGE_CLUSTER_NAME=ftcrnd

Binary path is /opt/grid/bin/lx24-amd64.

I had to make a link in /opt/grid/bin for linux-x64 to get things to work.

I used the following commands, and the update did go in live: the running
processes seemed happy and everything appeared to be working fine:

gbits=/opt/sa/tmp/gbits
service sgeexecd softstop
cd /opt/grid/bin
ln -s lx24-amd64 linux-x64
cd lx24-amd64
mv sge_shepherd  sge_shepherd.old
mv sge_execd  sge_execd.old
cp $gbits/sge_shepherd .
cp $gbits/sge_execd .
service sgeexecd start
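
One thing worth checking after a swap like the one above: plain cp (without -p) does not preserve the ownership or mode of the file being replaced, so the new binaries may not match the old ones. Whether that has anything to do with the failures is only a guess. The sketch below assumes GNU stat (stat -c) and is meant to be run from the lx24-amd64 directory; same_perms is a hypothetical helper, not part of Grid Engine.

```shell
# Hypothetical helper (not part of Grid Engine): succeeds when two files
# have the same octal mode and the same owner:group.
# Assumes GNU coreutils stat (the -c format option).
same_perms() {
    [ "$(stat -c '%a %U:%G' -- "$1")" = "$(stat -c '%a %U:%G' -- "$2")" ]
}

# Compare each replaced binary with its .old copy, if both are present
# in the current directory.
for b in sge_shepherd sge_execd; do
    if [ -e "$b" ] && [ -e "$b.old" ]; then
        same_perms "$b" "$b.old" || echo "$b: mode/owner differs from $b.old"
    fi
done
```

No output means the mode and ownership of each new binary match its .old copy.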

Since yesterday, though, I have had a couple of jobs fail and put their
queues into an error state.

Mail from the failing job:
Shepherd error:
06/14/2012 21:29:37 [20339:8436]: can't open file job_pid: Permission denied

From the qmaster messages file:
06/14/2012 21:29:39|worker|gemaster|W|job 3885.1 failed on host
cs428.ftc.avagotech.net general before job because: 06/14/2012 21:29:37
[20339:8436]: can't open file job_pid: Permission denied

I checked the job_pid file of a currently running job on the system that had
the above errors. Permissions down the entire tree seem fine, and here is
the job_pid file:

-rw-r--r-- 1 grid  grid       6 Jun 14 17:40 job_pid
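
For reference, the kind of "down the entire tree" permission check described above can be sketched as a small shell function. perm_walk is a hypothetical helper, not a Grid Engine tool, and the path passed to it is a placeholder for wherever the job's job_pid file actually lives.

```shell
# Hypothetical helper (not part of Grid Engine): list mode and ownership
# of a file and of every directory above it, up to the root.
perm_walk() {
    p=$1
    while :; do
        ls -ld -- "$p"
        # Stop at the filesystem root, or at "." for a relative path.
        case $p in
            /|.) break ;;
        esac
        p=$(dirname -- "$p")
    done
}

# Example (placeholder path; substitute the real location of job_pid):
#   perm_walk /path/to/active_jobs/JOBID.TASKID/job_pid
```

A directory anywhere in that listing without search (x) permission for the user opening the file would be enough to produce "Permission denied", even when job_pid itself is readable.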

Any clues? Is the path perhaps hard-coded into sge_shepherd for this
file?

Thanks.
-- 
-MichaelC