[gridengine users] Jobs on qw state and exec node on au state

Radhouane Aniba aradwen at gmail.com
Mon May 30 18:24:54 UTC 2016


OK, here is what I have.

I connected to one node, compute010.

qconf -sconf gives me this:


#global:
execd_spool_dir              /var/spool/gridengine/execd
mailer                       /usr/bin/mail
xterm                        /usr/bin/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 bash,sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           root
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 none
reporting_params             accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs                100
gid_range                    65400-65500
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
auto_user_oticket            0
auto_user_fshare             0
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 0
rlogin_daemon                /usr/sbin/sshd -i
rlogin_command               /usr/bin/ssh
qlogin_daemon                /usr/sbin/sshd -i
qlogin_command               /usr/share/gridengine/qlogin-wrapper
rsh_daemon                   /usr/sbin/sshd -i
rsh_command                  /usr/bin/ssh
jsv_url                      none
jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
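
For anyone who wants to compare these settings between nodes, here is a minimal sketch that turns `qconf -sconf`-style text into a dictionary. It is in Python, which is my own choice and not part of the Grid Engine tooling. The name parse_sge_conf and the shortened SAMPLE are hypothetical, and the sketch only understands backslash continuations, so lines wrapped by a mail client would have to be rejoined first.

```python
def parse_sge_conf(text):
    """Parse `qconf -sconf`-style output into a dict of name -> value."""
    # First pass: join lines ending in a backslash with the line that follows.
    joined = []
    pending = ""
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.endswith("\\"):
            pending += line[:-1].rstrip() + " "
            continue
        joined.append(pending + line.strip())
        pending = ""
    if pending:
        joined.append(pending.strip())

    # Second pass: split each logical line into a name and the rest.
    conf = {}
    for line in joined:
        if not line or line.startswith("#"):
            continue  # skip blanks and the "#global:" header
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1] if len(parts) > 1 else ""
    return conf


# Shortened sample taken from the output above (hypothetical subset).
SAMPLE = """\
#global:
execd_spool_dir              /var/spool/gridengine/execd
load_report_time             00:00:40
max_unheard                  00:05:00
reporting_params             accounting=true reporting=false \\
                             flush_time=00:00:15 joblog=false sharelog=00:00:00
"""

conf = parse_sge_conf(SAMPLE)
print(conf["load_report_time"], conf["max_unheard"])
print(conf["reporting_params"])
```

The two values printed first are the ones I would look at here: load_report_time is how often an execd reports load, and max_unheard is how long the qmaster waits without hearing from an execd before it marks the host's queues as unknown.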


The messages file in the spool directory:


ubuntu@compute010:~$ more /var/spool/gridengine/execd/compute010/messages
05/02/2016 18:10:11|  main|compute010|E|can't find connection
05/02/2016 18:10:11|  main|compute010|E|can't get configuration from qmaster -- backgrounding
05/04/2016 16:58:28|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
05/18/2016 17:10:36|  main|compute010|W|can't register at qmaster "frontend001": abort qmaster registration due to communication errors
05/18/2016 17:37:55|  main|compute010|I|controlled shutdown 6.2u5
05/18/2016 17:46:28|  main|compute010|E|can't find connection
05/18/2016 17:46:28|  main|compute010|E|can't get configuration from qmaster -- backgrounding
05/18/2016 17:46:31|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
05/20/2016 14:27:40|  main|compute010|I|controlled shutdown 6.2u5
05/22/2016 17:00:28|  main|compute010|E|can't find connection
05/22/2016 17:00:28|  main|compute010|E|can't get configuration from qmaster -- backgrounding
05/22/2016 17:01:38|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
05/28/2016 03:59:31|  main|compute010|I|controlled shutdown 6.2u5
05/28/2016 03:59:49|  main|compute010|W|local configuration compute010 not defined - using global configuration
05/28/2016 03:59:49|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
05/30/2016 17:41:50|  main|compute010|W|can't register at qmaster "compute010": abort qmaster registration due to communication errors
05/30/2016 17:41:50|  main|compute010|E|commlib error: got select error (Connection refused)
05/30/2016 17:42:14|  main|compute010|I|controlled shutdown 6.2u5
05/30/2016 17:58:58|  main|compute010|W|local configuration compute010 not defined - using global configuration
05/30/2016 17:58:58|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
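
To make a log like this easier to scan, here is a rough Python sketch that splits each entry into its fields and lists which qmaster name the execd tried to register with. The field layout (timestamp|thread|host|level|message) is inferred from the lines above, not from any Grid Engine documentation, and the names parse_messages and qmaster_targets are hypothetical. Lines that do not start with a timestamp are treated as the wrapped tail of the previous message.

```python
import re

# Layout inferred from the log above: "MM/DD/YYYY HH:MM:SS| thread|host|level|message"
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|\s*([^|]*)\|([^|]*)\|([A-Z])\|(.*)$"
)
REGISTER_RE = re.compile(r'can\'t register at qmaster "([^"]+)"')


def parse_messages(text):
    """Return one dict per log entry, rejoining wrapped continuation lines."""
    entries = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw)
        if match:
            ts, thread, host, level, msg = match.groups()
            entries.append(
                {"time": ts, "thread": thread.strip(), "host": host,
                 "level": level, "message": msg}
            )
        elif entries and raw.strip():
            # Not a new entry: append to the previous message.
            entries[-1]["message"] += " " + raw.strip()
    return entries


def qmaster_targets(entries):
    """List the qmaster names found in failed-registration warnings, in order."""
    names = []
    for entry in entries:
        found = REGISTER_RE.search(entry["message"])
        if found:
            names.append(found.group(1))
    return names


# Three entries copied from the log above, in their wrapped form.
SAMPLE = """\
05/18/2016 17:10:36|  main|compute010|W|can't register at qmaster
"frontend001": abort qmaster registration due to communication errors
05/30/2016 17:41:50|  main|compute010|W|can't register at qmaster
"compute010": abort qmaster registration due to communication errors
05/30/2016 17:41:50|  main|compute010|E|commlib error: got select error
(Connection refused)
"""

entries = parse_messages(SAMPLE)
print(qmaster_targets(entries))
print([e["level"] for e in entries])
```

On the sample this shows that the execd was trying to register at "frontend001" on 05/18 but at "compute010" on 05/30, which may be worth checking against where the qmaster is supposed to run.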

I had the qmaster running on all nodes before (master and executors), with
no problem. When I kill sge_qmaster on the node, sge_execd stops working
because it is not able to connect to the master.

A ping from the node to the frontend node shows that it is reachable, though.

:/
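
Since ping only shows that the host answers ICMP, while the log says "Connection refused", a TCP-level check might tell more. Below is a minimal Python sketch of such a check. The function name can_connect is hypothetical, and the demo talks to a throwaway local socket so it runs anywhere; on a real node one would call it with the frontend's name and the qmaster port. 6444 is the port conventionally registered for sge_qmaster, but that this cluster uses it is an assumption (the actual value comes from SGE_QMASTER_PORT or /etc/services).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Try a plain TCP connection; return (ok, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Covers refused connections, timeouts and name-resolution failures.
        return False, str(exc)


# Self-contained demo: listen on an ephemeral local port, check it,
# then close the listener and check again.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

open_ok, open_reason = can_connect("127.0.0.1", port)
server.close()
closed_ok, closed_reason = can_connect("127.0.0.1", port)

print("listener up:  ", open_ok, open_reason)
print("listener down:", closed_ok, closed_reason)

# On a compute node (assumed host name and port, adjust to the cluster):
#   can_connect("frontend001", 6444)
```

If the call against the frontend fails while ping succeeds, that would point to the qmaster not listening there (or a firewall), and not to basic network reachability.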

On Mon, May 30, 2016 at 11:14 AM, Bill Bryce <bbryce at univa.com> wrote:

> Okay,
>
> Can you run any qconf commands, such as ‘qconf -sconf’?  Try having a look
> at the messages files for the execution daemons.  They should be in
>
> $SGE_ROOT/default/spool/ and in there are directories for the master and
> exec hosts (if you have this installed in a shared filesystem
> environment).  You can check both the qmaster messages file and the execd
> messages files in those directories.
>
> A question.  Do you have the qmaster running on one host or on many?  I
> noticed that you have the ps output for compute010 and it is running a
> qmaster.
>
> Another thing you can check is whether all nodes can contact the qmaster
> machine, i.e. whether the networking is configured properly.  You can also
> make sure that the host naming is correct: either configure DNS properly or
> configure an /etc/hosts file on all nodes so the IP to host name mapping is
> consistent across the cluster.  Grid Engine is very picky about host names.
>
>
>
> On May 30, 2016, at 1:36 PM, Radhouane Aniba <aradwen at gmail.com> wrote:
>
> Hi Bill
>
> Yes, I am sure.
>
> This is what I get when I log in to one of the nodes and run:
>
> ubuntu@compute010:~$ ps -ef | grep sge_
> sgeadmin  1254     1  0 May28 ?        00:00:39 /usr/lib/gridengine/sge_qmaster
> sgeadmin  1446     1  0 May28 ?        00:00:22 /usr/lib/gridengine/sge_execd
> ubuntu    2552  2527  0 17:36 pts/0    00:00:00 grep --color=auto sge_
>
>
> On Mon, May 30, 2016 at 10:33 AM, Bill Bryce <bbryce at univa.com> wrote:
>
>> Hi Rad,
>>
>> Are you sure that the execution daemons are running on your compute
>> nodes?  Can you log in to one of the nodes, say ‘compute001’, and do a ps
>> looking for the execd?  When an execd is functioning normally it reports
>> the load, memory, etc.  None of your nodes are showing that.
>>
>> Regards,
>>
>> Bill.
>>
>> On May 30, 2016, at 1:20 PM, Radhouane Aniba <aradwen at gmail.com> wrote:
>>
>> Hello all,
>>
>> I am trying to submit a simple "hello world" job to test a Grid Engine
>> installation (I have used it before with no problems).
>>
>> The problem is that my job waits in the queue forever.
>>
>> The qhost command shows a weird state for the compute nodes:
>>
>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>> -------------------------------------------------------------------------------
>> global                  -               -     -       -       -       -       -
>> compute001              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute002              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute003              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute004              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute005              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute006              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute007              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute008              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute009              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute010              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute011              lx26-amd64      4     -   31.4G       -     0.0       -
>>
>> Normally, even when the compute nodes are idle, I see some information in
>> the LOAD and MEMUSE columns.
>>
>> I am not an SGE person, but I am familiar with all the commands; any help
>> would be much appreciated.
>>
>> The qstat -f command shows all my nodes in the au state. I've been reading a
>> lot about it and I understand it's an alarm state (overloaded?).
>>
>> The only heavy activity I had on the head node was a script downloading
>> 19T of data. Could the head node be the problem and not the compute nodes?
>> sge_execd is running on all the compute/exec nodes :/
>>
>> --
>> *Rad*
>> _______________________________________________
>> users mailing list
>> users at gridengine.org
>> https://gridengine.org/mailman/listinfo/users
>>
>>
>> William Bryce | VP Products
>> Univa Corporation, Toronto
>> E: bbryce at univa.com | D: 647-9742841 | Toll-Free (800) 370-5320
>> W: Univa.com <http://univa.com/> | FB: facebook.com/univa.corporation |
>> T: twitter.com/Grid_Engine
>>
>>
>
>
> --
> *Radhouane Aniba*
> *Bioinformatics Scientist*
> *BC Cancer Agency, Vancouver, Canada*
>
>


-- 
*Radhouane Aniba*
*Bioinformatics Scientist*
*BC Cancer Agency, Vancouver, Canada*