[gridengine users] Jobs on qw state and exec node on au state

Bill Bryce bbryce at univa.com
Mon May 30 18:14:27 UTC 2016


Okay,

Can you run any qconf commands, such as ‘qconf -sconf’?  Also try having a look at the messages files for the execution daemons.  They should be under $SGE_ROOT/default/spool/, which contains directories for the master and the exec hosts (if you have this installed in a shared filesystem environment).  You can check both the qmaster messages file and the execd messages files in those directories.
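If it helps, here is a rough Python sketch for dumping the tail of every messages file in one go.  It assumes the layout <spool>/<qmaster or hostname>/messages under $SGE_ROOT/$SGE_CELL/spool; the helper names find_messages_files and tail are made up for this example, and distribution packages may keep the spool somewhere else, so adjust the path to your install.

```python
import os
from pathlib import Path


def find_messages_files(spool_dir):
    """Return every <spool_dir>/<name>/messages file, sorted by path."""
    spool = Path(spool_dir)
    if not spool.is_dir():
        return []
    return sorted(p for p in spool.glob("*/messages") if p.is_file())


def tail(path, n=20):
    """Return the last n lines (n > 0) of a text file as a list of strings."""
    with open(path, errors="replace") as fh:
        return fh.read().splitlines()[-n:]


if __name__ == "__main__":
    # Assumption: spool lives at $SGE_ROOT/$SGE_CELL/spool (cell "default").
    sge_root = os.environ.get("SGE_ROOT", "")
    cell = os.environ.get("SGE_CELL", "default")
    spool_dir = Path(sge_root) / cell / "spool"
    for messages in find_messages_files(spool_dir):
        print(f"==> {messages} <==")
        for line in tail(messages, 10):
            print(line)
```

Run it on a host that can see the spool directory.  If it prints nothing, the spool is probably local to each node rather than shared, and you will need to look on the nodes themselves.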

A question.  Do you have the qmaster running on one host or on many?  I noticed that your ps output for compute010 shows it running a qmaster.

Another thing to check is whether all nodes can contact the qmaster machine, i.e. that the networking is configured properly.  Also make sure that the host naming is correct: either configure DNS properly or set up an /etc/hosts file on all nodes so that the IP to host name mapping is consistent across the cluster.  Grid Engine is very picky about host names.
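A minimal Python sketch of both checks, in case it is useful.  The function names check_host and can_connect are made up for this example.  Port 6444 is only the commonly used sge_qmaster port, so treat it as an assumption and confirm it against your SGE_QMASTER_PORT setting or /etc/services.  Comparing only the short host names is also a simplification.

```python
import socket
import sys


def check_host(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve name to an IP, resolve the IP back, and compare the names.

    Returns (ip, reverse_name, consistent).
    """
    try:
        ip = forward(name)
    except OSError:
        return (None, None, False)
    try:
        reverse_name = reverse(ip)[0]
    except OSError:
        return (ip, None, False)
    # Compare short names so "compute001" matches "compute001.example.org".
    same = reverse_name.split(".")[0].lower() == name.split(".")[0].lower()
    return (ip, reverse_name, same)


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Usage: python check_hosts.py <qmaster-host> <node> [<node> ...]
    qmaster, nodes = sys.argv[1], sys.argv[2:]
    # Assumption: qmaster listens on 6444; change this if your cluster differs.
    print("qmaster reachable:", can_connect(qmaster, 6444))
    for node in [qmaster] + nodes:
        print(node, check_host(node))
```

Running it on several nodes and comparing the output should show quickly whether one of them maps a name to a different address, or back to a different name, than the others do.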



> On May 30, 2016, at 1:36 PM, Radhouane Aniba <aradwen at gmail.com> wrote:
> 
> Hi Bill
> 
> Yes I am sure
> 
> This is what I have when I login to one of the nodes and do
> 
> ubuntu@compute010:~$ ps -ef | grep sge_
> sgeadmin  1254     1  0 May28 ?        00:00:39 /usr/lib/gridengine/sge_qmaster
> sgeadmin  1446     1  0 May28 ?        00:00:22 /usr/lib/gridengine/sge_execd
> ubuntu    2552  2527  0 17:36 pts/0    00:00:00 grep --color=auto sge_
> 
> 
> On Mon, May 30, 2016 at 10:33 AM, Bill Bryce <bbryce at univa.com> wrote:
> Hi Rad,
> 
> Are you sure that the execution daemons are running on your compute nodes?  Can you log in to one of the nodes, say ‘compute001’, and do a ps looking for the execd?  When an execd is functioning normally it reports the load, memory, etc., and none of your nodes are showing that.
> 
> Regards,
> 
> Bill.
> 
>> On May 30, 2016, at 1:20 PM, Radhouane Aniba <aradwen at gmail.com> wrote:
>> 
>> Hello all,
>> 
>> I am trying to submit a simple "hello world" job to test a Grid Engine cluster (I have used it before with no problems).
>> 
>> The problem is that my job is waiting in the queue forever
>> 
>> The qhost command shows a weird state for the compute nodes:
>> 
>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>> -------------------------------------------------------------------------------
>> global                  -               -     -       -       -       -       -
>> compute001              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute002              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute003              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute004              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute005              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute006              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute007              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute008              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute009              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute010              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute011              lx26-amd64      4     -   31.4G       -     0.0       -
>> 
>> In normal times, even when the compute nodes are not in use, I used to see values in the LOAD and MEMUSE columns.
>> 
>> I am not an SGE person but I am familiar with all the commands, so any help would be much appreciated.
>> 
>> The qstat -f command shows all my nodes in au state. I've been reading a lot about it and I understand it's an alarm state (overloaded?).
>> 
>> The only heavy activity I had on the head node was a script downloading 19T of data. Could the head node be the problem and not the compute nodes?
>> 
>> sge_execd is working on all the compute/exec nodes :/
>> 
>> --
>> Rad
> 
> William Bryce | VP Products
> Univa Corporation, Toronto
> E: bbryce at univa.com | D: 647-9742841 | Toll-Free (800) 370-5320
> W: Univa.com | FB: facebook.com/univa.corporation | T: twitter.com/Grid_Engine
> 
> 
> 
> --
> Radhouane Aniba
> Bioinformatics Scientist
> BC Cancer Agency, Vancouver, Canada

William Bryce | VP Products
Univa Corporation, Toronto
E: bbryce at univa.com | D: 647-9742841 | Toll-Free (800) 370-5320
W: Univa.com | FB: facebook.com/univa.corporation | T: twitter.com/Grid_Engine