[gridengine users] Jobs on qw state and exec node on au state

Bill Bryce bbryce at univa.com
Mon May 30 19:00:41 UTC 2016


So typically with Grid Engine you need to select one machine as the ‘master’ machine in the cluster (you can have backups, but they run a ‘shadow_master’, so don’t worry about that for now).  The qmaster needs to be on one host that all the nodes can communicate with over the network.  Each node must have a fully qualified name in order to communicate in a Grid Engine cluster, so being able to ping one host from another does not mean that you have the host names set up properly.  For example, if I have a host compute010 it has to have a fully qualified name such as compute010.example.com (example.com is used here because it always works in examples, but don’t use it in your real cluster - you should have something at your site), and that fully qualified name needs to map to an IP address for the machine.

This means that when I am on machine compute010 I can run:

# hostname

# cat /etc/hosts

and the names should match.

If I do a

# ping compute010.example.com

it should return the IP address of the machine.

And you definitely cannot map the loopback address to the hostname, i.e. 127.0.0.1 to compute010 - that won’t work.  The messages file below from host compute010 indicates that it can’t communicate with the master on frontend001.  So if the physical networking is not messed up and the machines have IP addresses, then either name resolution is broken, something else is running on the port that Grid Engine wants to use, or a firewall is sitting between the hosts and blocking the communication.
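A quick way to sanity-check that mapping is to look the name up in the hosts file and make sure it does not resolve to the loopback address.  This is only a sketch - the hostnames and the sample file are made up for illustration; on a real node you would point it at /etc/hosts and pass `hostname -f`:

```shell
#!/bin/sh
# Sketch: check that a host name in a hosts file maps to a real
# interface address, not 127.0.0.1.  On a real node run:
#   check_mapping /etc/hosts "$(hostname -f)"
check_mapping() {
    hosts_file=$1; name=$2
    # First address whose entry lists this name (comment lines ignored).
    ip=$(awk -v n="$name" '$1 !~ /^#/ {
            for (i = 2; i <= NF; i++) if ($i == n) { print $1; exit }
         }' "$hosts_file")
    if [ -z "$ip" ]; then
        echo "MISSING: $name has no entry"
    elif [ "$ip" = "127.0.0.1" ]; then
        echo "BAD: $name maps to the loopback address"
    else
        echo "OK: $name -> $ip"
    fi
}

# Demo against a made-up sample file showing both a good and a bad entry.
cat > /tmp/sample_hosts <<'EOF'
127.0.0.1   localhost
127.0.0.1   compute011
10.0.0.10   compute010.example.com compute010
EOF
check_mapping /tmp/sample_hosts compute010.example.com
check_mapping /tmp/sample_hosts compute011
```

The demo prints an OK line for compute010 and flags compute011, whose name is mapped to 127.0.0.1 - exactly the configuration that breaks Grid Engine.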

Basically, check everything you can between the two hosts to make sure they can communicate properly.
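One concrete check is whether the qmaster’s TCP port answers from an exec node.  A small sketch, assuming the stock sge_qmaster port 6444 (verify against /etc/services or your install, since sites differ) and using frontend001, the master host from this thread:

```shell
#!/bin/sh
# Sketch: probe a TCP port on another host.  A refusal or timeout means
# the daemon is down, the port is wrong, or a firewall is in the way.
port_open() {
    host=$1; port=$2
    # /dev/tcp redirection is a bash feature, so invoke bash explicitly;
    # timeout keeps a firewalled (dropped) connection from hanging.
    if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "OK: $host:$port is reachable"
    else
        echo "FAIL: $host:$port refused or timed out (daemon down, wrong port, or firewall)"
    fi
}

# On an exec node you would run something like:
#   port_open frontend001 6444
```

If ping works but this fails, the problem is the service or a firewall rather than basic IP connectivity.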

Regards,

Bill.

> On May 30, 2016, at 2:24 PM, Radhouane Aniba <aradwen at gmail.com> wrote:
> 
> Ok here is what I have
> 
> connected to one node compute010
> 
> qconf -sconf gives me this
> 
> 
> #global:
> execd_spool_dir              /var/spool/gridengine/execd
> mailer                       /usr/bin/mail
> xterm                        /usr/bin/xterm
> load_sensor                  none
> prolog                       none
> epilog                       none
> shell_start_mode             posix_compliant
> login_shells                 bash,sh,ksh,csh,tcsh
> min_uid                      0
> min_gid                      0
> user_lists                   none
> xuser_lists                  none
> projects                     none
> xprojects                    none
> enforce_project              false
> enforce_user                 auto
> load_report_time             00:00:40
> max_unheard                  00:05:00
> reschedule_unknown           00:00:00
> loglevel                     log_warning
> administrator_mail           root
> set_token_cmd                none
> pag_cmd                      none
> token_extend_time            none
> shepherd_cmd                 none
> qmaster_params               none
> execd_params                 none
> reporting_params             accounting=true reporting=false \
>                              flush_time=00:00:15 joblog=false sharelog=00:00:00
> finished_jobs                100
> gid_range                    65400-65500
> max_aj_instances             2000
> max_aj_tasks                 75000
> max_u_jobs                   0
> max_jobs                     0
> auto_user_oticket            0
> auto_user_fshare             0
> auto_user_default_project    none
> auto_user_delete_time        86400
> delegated_file_staging       false
> reprioritize                 0
> rlogin_daemon                /usr/sbin/sshd -i
> rlogin_command               /usr/bin/ssh
> qlogin_daemon                /usr/sbin/sshd -i
> qlogin_command               /usr/share/gridengine/qlogin-wrapper
> rsh_daemon                   /usr/sbin/sshd -i
> rsh_command                  /usr/bin/ssh
> jsv_url                      none
> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
> 
> 
> the message in spool :
> 
> 
> ubuntu at compute010:~$ more /var/spool/gridengine/execd/compute010/messages
> 05/02/2016 18:10:11|  main|compute010|E|can't find connection
> 05/02/2016 18:10:11|  main|compute010|E|can't get configuration from qmaster -- backgrounding
> 05/04/2016 16:58:28|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
> 05/18/2016 17:10:36|  main|compute010|W|can't register at qmaster "frontend001": abort qmaster registration due to communication errors
> 05/18/2016 17:37:55|  main|compute010|I|controlled shutdown 6.2u5
> 05/18/2016 17:46:28|  main|compute010|E|can't find connection
> 05/18/2016 17:46:28|  main|compute010|E|can't get configuration from qmaster -- backgrounding
> 05/18/2016 17:46:31|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
> 05/20/2016 14:27:40|  main|compute010|I|controlled shutdown 6.2u5
> 05/22/2016 17:00:28|  main|compute010|E|can't find connection
> 05/22/2016 17:00:28|  main|compute010|E|can't get configuration from qmaster -- backgrounding
> 05/22/2016 17:01:38|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
> 05/28/2016 03:59:31|  main|compute010|I|controlled shutdown 6.2u5
> 05/28/2016 03:59:49|  main|compute010|W|local configuration compute010 not defined - using global configuration
> 05/28/2016 03:59:49|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
> 05/30/2016 17:41:50|  main|compute010|W|can't register at qmaster "compute010": abort qmaster registration due to communication errors
> 05/30/2016 17:41:50|  main|compute010|E|commlib error: got select error (Connection refused)
> 05/30/2016 17:42:14|  main|compute010|I|controlled shutdown 6.2u5
> 05/30/2016 17:58:58|  main|compute010|W|local configuration compute010 not defined - using global configuration
> 05/30/2016 17:58:58|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
> 
> I had the qmaster running on all nodes before, with no problem (master and executors).
> When I kill sge_qmaster on the node, sge_execd stops working because it's not able to connect to the master.
> 
> A ping from the node to the frontend node shows that it is visible, though.
> 
> :/
> 
> On Mon, May 30, 2016 at 11:14 AM, Bill Bryce <bbryce at univa.com> wrote:
> Okay,
> 
> Can you run any qconf commands, such as ‘qconf -sconf’?  Try having a look at the messages files for the execution daemons.  They should be in
> 
> $SGE_ROOT/default/spool/, where there are directories for the master and exec hosts (if you have this installed in a shared filesystem environment).  You can check both the qmaster messages file and the execd messages files in those directories.
> 
> A question: do you have the qmaster running on one host or on many?  I noticed that your ps output for compute010 shows it running a qmaster.
> 
> Another thing to check is whether all nodes can contact the qmaster machine, i.e. that the networking is configured properly.  You can also make sure that the host naming is correct: either configure DNS properly or configure an /etc/hosts file on all nodes so the IP-to-hostname mapping is consistent across the cluster.  Grid Engine is very picky about host names.
> 
> 
> 
>> On May 30, 2016, at 1:36 PM, Radhouane Aniba <aradwen at gmail.com> wrote:
>> 
>> Hi Bill
>> 
>> Yes I am sure
>> 
>> This is what I have when I login to one of the nodes and do
>> 
>> ubuntu at compute010:~$ ps -ef | grep sge_
>> sgeadmin  1254     1  0 May28 ?        00:00:39 /usr/lib/gridengine/sge_qmaster
>> sgeadmin  1446     1  0 May28 ?        00:00:22 /usr/lib/gridengine/sge_execd
>> ubuntu    2552  2527  0 17:36 pts/0    00:00:00 grep --color=auto sge_
>> 
>> 
>> On Mon, May 30, 2016 at 10:33 AM, Bill Bryce <bbryce at univa.com> wrote:
>> Hi Rad,
>> 
>> Are you sure that the execution daemons are running on your compute nodes?  Can you log in to one of the nodes, say ‘compute001’, and do a ps looking for the execd?  When an execd is functioning normally it reports the load, memory, etc. - none of your nodes are showing that.
>> 
>> Regards,
>> 
>> Bill.
>> 
>>> On May 30, 2016, at 1:20 PM, Radhouane Aniba <aradwen at gmail.com> wrote:
>>> 
>>> Hello all,
>>> 
>>> I am trying to submit a simple "hello world" to test a gridengine (I used it before with no problems)
>>> 
>>> The problem is that my job is waiting in the queue forever
>>> 
>>> The qhost command shows a weird state for the compute nodes
>>> 
>>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>> -------------------------------------------------------------------------------
>>> global                  -               -     -       -       -       -       -
>>> compute001              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute002              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute003              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute004              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute005              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute006              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute007              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute008              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute009              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute010              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute011              lx26-amd64      4     -   31.4G       -     0.0       -
>>> 
>>> In normal times, even when the compute nodes are not used, I used to see information in the LOAD and MEMUSE columns.
>>> 
>>> I am not an SGE person, but I am familiar with all the commands; any help would be much appreciated.
>>> 
>>> The qstat -f command shows all my nodes in au state.  I've been reading a lot about it and I understand it's an alarm state (overloaded?).
>>> 
>>> The only heavy activity I had on the head node was a script downloading 19T of data; could the head node be the problem and not the compute nodes?
>>> 
>>> sge_execd is working on all the compute/exec nodes :/
>>> 
>>> --
>>> Rad
>>> _______________________________________________
>>> users mailing list
>>> users at gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
>> 
>> William Bryce | VP Products
>> Univa Corporation, Toronto
>> E: bbryce at univa.com | D: 647-9742841 | Toll-Free (800) 370-5320
>> W: Univa.com | FB: facebook.com/univa.corporation | T: twitter.com/Grid_Engine
>> 
>> 
>> 
>> --
>> Radhouane Aniba
>> Bioinformatics Scientist
>> BC Cancer Agency, Vancouver, Canada
> 
> 
> 
> 

William Bryce | VP Products
Univa Corporation, Toronto
E: bbryce at univa.com | D: 647-9742841 | Toll-Free (800) 370-5320
W: Univa.com | FB: facebook.com/univa.corporation | T: twitter.com/Grid_Engine

