[gridengine users] Configure gridengine on CentOS 6.3

Petter Gustad gridengine at gustad.com
Wed Nov 7 19:59:44 UTC 2012


From: Reuti <reuti at Staff.Uni-Marburg.DE>
Subject: Re: [gridengine users] Configure gridengine on CentOS 6.3
Date: Wed, 7 Nov 2012 20:26:54 +0100

> Am 07.11.2012 um 18:49 schrieb Petter Gustad:
> 
>> From: Reuti <reuti at Staff.Uni-Marburg.DE>
>> Subject: Re: [gridengine users] Configure gridengine on CentOS 6.3
>> Date: Wed, 7 Nov 2012 16:37:22 +0100
>> 
>>> Am 07.11.2012 um 15:46 schrieb Petter Gustad:
>>> 
>>>>> From: Reuti <reuti at staff.uni-marburg.de>
>>>>> Subject: Re: [gridengine users] Configure gridengine on CentOS 6.3
>>>>> Date: Tue, 30 Oct 2012 11:27:49 +0100
>>>>> 
>>>>>> Just use the version you have already in the shared /usr/sge or your
>>>>>> particular mountpoint.
>>>>> 
>>>>> I should probably try this first, at least to verify that it's
>>>>> working. But later I would like to migrate to the CentOS on all my
>>>>> exechosts and leave the installation to somebody else.
>>>> 
>>>> I did this and it worked out fine on the first machine I migrated.
>>>> However, on the next set of machines I run into the problem where the
>>>> submitted job will cause the queue to go into the error state.
>>>> 
>>>> I observe that:
>>>> 
>>>> 1) The job will not start
>>>> 2) The queue will be marked with the 'E' state
>>>> 3) I get an e-mail saying
>>>>   Shepherd pe_hostfile:
>>>>   node 1 queue@node UNDEFINED
>>>> 4) The node will log the following in the spool/node/messages file:
>>>>   11/07/2012 15:33:07|  main|node|E|shepherd of job 48548.1 exited with exit status = 11
>>>> 5) qstat -j jobnumber returns
>>>> 
>>>>   error reason    1:          11/07/2012 15:33:06 [555:29681]: unable to find job file "/work/gridengine/spool/node/job_scr
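
For completeness: the reason behind the 'E' state in 2) can be queried, and
the state can be cleared again once the cause is fixed. A minimal sketch, with
'all.q@node' standing in for the affected queue instance:

  # show the full queue listing with the reason for the E(rror) state
  $ qstat -f -explain E

  # clear the error state of the queue instance after fixing the cause
  $ qmod -cq 'all.q@node'
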
>> 
>> Is this output always truncated,
> 
> Yes.

OK. Good.

> 
>> or could this be the source of the problem?
> 
> No.
> 
> 
>>> This looks like an unusual path for the spool directory. The name of the node should be included.
>> 
>> I've substituted the string "node" for the actual node name. The path appears
>> to be the same on all the nodes, hence I just used "node".
> 
> Good.
> 
> 
>>> $ qconf -sconf
>>> 
>>> (at the top there is something like: execd_spool_dir              /var/spool/sge; the directory for the particular node will be created automatically when the execd starts up)
>> 
>> This will show the spool directory on the qmaster, which is different
> 
> No, it's the global setting for the execd spool directory. It can be overridden per host in case you have different paths on individual nodes.
> 
> If all nodes are the same, you can even delete all the local definitions which were listed in `qconf -sconfl`.
> 
> NB: The location of the qmaster spool directory is recorded in "/usr/sge/default/common/bootstrap" (adjust the path for your installation), e.g. for me: "qmaster_spool_dir       /var/spool/sge/qmaster"
> 
> 
>> from the above. But for all the nodes this is /work/gridengine/spool.
> 
> Yes, but if you check the directory /work/gridengine/spool there should be a per-node level below it, e.g. /work/gridengine/spool/node001 or whatever. Is this directory readable by the sgeadmin user account?

That was the problem. Thanks! This directory was readable by the
gridengine account only. By making it world readable I was able to
submit a job. The permissions also differed between the working and
the non-working nodes.
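
In case it helps somebody else, the check and the change boiled down to
roughly the following (a rough sketch; node001 stands in for the actual node
name, and stricter permission bits than world readable may well suffice):

  # compare ownership and permissions of the per-node spool directory
  # on a working and a non-working node
  $ ls -ld /work/gridengine/spool/node001

  # open it up so the shepherd can traverse the directory and read the
  # job files (what I did; adjust to taste)
  $ chmod o+rx /work/gridengine/spool/node001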


> 
>>> $ qconf -sconfl
>>> 
>>> (this lists all hosts which have a local configuration [if any are present at all]), then for the particular node:
>>> 
>>> $ qconf -sconf node42
>>> 
>>> and check the path to the execd_spool_dir.
>> 
>> They are all identical. If I do something like:
>> 
>> qconf -sconf good-node > /tmp/good-node
>> qconf -sconf bad-node > /tmp/bad-node
>> 
>> and diff the two, the only difference will be the hostname part.
>> 
>> All the nodes are using spool on a local filesystem located at
>> /work/gridengine/spool
>> 
>> 
>> The only difference I see on the bad nodes is that there is a "." at
>> the end of the permissions on the spool directory, so I think this
>> might be related to SELinux. I'll have to investigate this further.
> 
> Yep. It means access is limited by another facility, just as a "+" indicates an ACL.
> 
> I suggest switching off SELinux.
> -- Reuti
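
For reference, confirming whether SELinux is the facility behind the trailing
"." and switching it off on CentOS 6 goes roughly like this (a sketch; a "+"
in the same position would point at a POSIX ACL instead):

  # current SELinux mode and the security context on the spool directory
  $ getenforce
  $ ls -ldZ /work/gridengine/spool/node001

  # switch to permissive mode for a quick test (as root; not persistent
  # across reboots)
  $ setenforce 0

  # to disable it permanently, set SELINUX=disabled in /etc/selinux/config
  # and reboot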


Best regards
//Petter

