[gridengine users] help

jan roels janroels at gmail.com
Thu Nov 22 13:42:24 UTC 2012


I work on an NFS share that is also available on the node. I'm currently
testing with only one node, so it's unique...
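
For reference, the spool location and host naming can be checked roughly like this (a sketch; the spool path assumes the stock Debian layout under /var/spool/gridengine):

# where does the execd spool? (should normally be a local directory)
qconf -sconf | grep execd_spool_dir

# is that directory on local disk or on the NFS share?
df -T /var/spool/gridengine/execd

# does the node name resolve consistently?
hostname
getent hosts camilla node0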


2012/11/22 Reuti <reuti at staff.uni-marburg.de>

> On 22.11.2012 at 14:29, jan roels wrote:
>
> > I tried it with the root account and with another account... both give the same error.
>
> Is the directory local on "camilla", and is the node name unique?
>
> -- Reuti
>
>
> >
> > 2012/11/22 Reuti <reuti at staff.uni-marburg.de>
> > On 22.11.2012 at 12:30, jan roels wrote:
> >
> > > Hi,
> > >
> > > qstat -j <jobid> didn't show the full error message; this is the full error message:
> > >
> > > 11/22/2012 12:26:11|  main|camilla|E|shepherd of job 76.226 exited with exit status = 27
> > > 11/22/2012 12:26:11|  main|camilla|E|can't open usage file "active_jobs/76.226/usage" for job 76.226: No such file or directory
> > > 11/22/2012 12:26:11|  main|camilla|E|11/22/2012 12:26:10 [0:11412]: execvlp(/var/spool/gridengine/execd/camilla/job_scripts/76, "/var/spool/gridengine/execd/camilla/job_scripts/76") failed: No such file or directory
> >
> > Could be a permission problem. Everyone needs read access to this directory, as the job script is executed from there.
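> >
> > Something along these lines can be used to check that (a sketch; adjust the spool path to your installation, and 'someuser' is just a placeholder for the job owner):
> >
> > # every path component down to job_scripts must be readable/searchable
> > namei -l /var/spool/gridengine/execd/camilla/job_scripts
> >
> > # try reading one of the spooled job scripts as an ordinary user
> > sudo -u someuser cat /var/spool/gridengine/execd/camilla/job_scripts/76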
> >
> > -- Reuti
> >
> >
> > >
> > >
> > > 2012/11/22 jan roels <janroels at gmail.com>
> > > Hi,
> > >
> > > Do you guys know what this error could be:
> > >
> > > error reason    2:          11/22/2012 11:12:25 [0:31220]: execvlp(/var/spool/gridengine/execd/node0/job_scripts/69, "/var/spool
> > > error reason    3:          11/22/2012 11:12:25 [0:31221]: execvlp(/var/spool/gridengine/execd/node0/job_scripts/69, "/var/spool
> > >
> > > This goes on as long as it's running... and my job state went to:
> > >
> > >      69 0.50000 SA         root         Eqw   11/22/2012 09:12:05     1 1-500:1
> > >      69 0.00000 SA         root         qw    11/22/2012 09:12:05     1 501-4200:1
> > >
> > > This is the script I was running:
> > >
> > > #!/bin/bash
> > > #$-cwd
> > > #$-N SA
> > > #$-t 1-4200:1
> > >
> > > /var/software/packages/Mathematica/7.0/Executables/math -run "teller=$SGE_TASK_ID;<< ModelCaCO31.m"
> > >
> > > Hope somebody can fix the problem.
> > >
> > > Kind Regards
> > >
> > >
> > > 2012/11/14 Reuti <reuti at staff.uni-marburg.de>
> > > On 14.11.2012 at 10:08, jan roels wrote:
> > >
> > > > I got it working again; there was already an execd process running that needed to be killed, and then I restarted the services.
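> > > >
> > > > (i.e. something along the lines of the following; the exact commands may differ:)
> > > >
> > > > # kill the stale daemon, then restart the service cleanly
> > > > pkill -f sge_execd
> > > > /etc/init.d/gridengine-exec restart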
> > > >
> > > > I'm trying to run a script now:
> > > >
> > > >
> > > > #!/bin/bash
> > > > #$-cwd
> > > > #$-N SA
> > > > #$-S /bin/sh
> > > > #$-t 1-4200:
> > >
> > > Don't run scripts as root. If something goes wrong it might trash your machine(s).
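> > >
> > > Better to submit from an ordinary account, e.g. something like this ('someuser' is just a placeholder):
> > >
> > > # switch to a normal account...
> > > sudo -u someuser -i
> > > # ...and from that shell submit from the shared working directory
> > > cd /nfs/share/sge && qsub HistDisCaCO31.sh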
> > >
> > >
> > > > /var/software/packages/Mathematica/7.0/Executables/math -run "teller=$SGE_TASK_ID;<< ModelCaCO31.m"
> > > >
> > > > but it gives the following output:
> > > >
> > > > stdin: is not a tty
> > >
> > > It's just a warning - unless someone complains, I would suggest ignoring it.
> > >
> > >
> > > > and this is the output of my qstat -f:
> > > >
> > > > queuename                      qtype resv/used/tot. load_avg arch          states
> > > > ---------------------------------------------------------------------------------
> > > > main.q@camilla.UGent.be        BIP   0/1/1          0.70     lx26-amd64
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 1
> > > > ---------------------------------------------------------------------------------
> > > > main.q@node0                   BIP   0/24/24        27.71    lx26-amd64
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 2
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 3
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 4
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 5
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 6
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 7
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 8
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 9
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 10
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 11
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 12
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 13
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 14
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 15
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 16
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 17
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 18
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 19
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 20
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 21
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 22
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 23
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 24
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 25
> > > >
> > > > ############################################################################
> > > >  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> > > > ############################################################################
> > > >      35 0.50000 SA         root         qw    11/14/2012 09:57:38     1 26-4200:1
> > > >
> > > >
> > > > root@camilla:/nfs/share/sge#  qstat -explain c -j 35
> > > > ==============================================================
> > > > job_number:                 35
> > > > exec_file:                  job_scripts/35
> > > > submission_time:            Wed Nov 14 09:57:38 2012
> > > > owner:                      root
> > > > uid:                        0
> > > > group:                      root
> > > > gid:                        0
> > > > sge_o_home:                 /root
> > > > sge_o_log_name:             root
> > > > sge_o_path:                 /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
> > > > sge_o_shell:                /bin/bash
> > > > sge_o_workdir:              /nfs/share/sge
> > > > sge_o_host:                 camilla
> > > > account:                    sge
> > > > cwd:                        /nfs/share/sge
> > > > mail_list:                  root at camilla
> > > > notify:                     FALSE
> > > > job_name:                   SA
> > > > jobshare:                   0
> > > > shell_list:                 NONE:/bin/sh
> > > > env_list:
> > > > script_file:                HistDisCaCO31.sh
> > > > job-array tasks:            1-4200:1
> > > > usage    1:                 cpu=00:05:20, mem=105.16135 GBs, io=0.01537, vmem=1.110G, maxvmem=1.110G
> > > > usage    2:                 cpu=00:04:17, mem=179.44371 GBs, io=0.01395, vmem=3.643G, maxvmem=3.643G
> > > > usage    3:                 cpu=00:04:37, mem=191.69532 GBs, io=0.01394, vmem=3.657G, maxvmem=3.657G
> > > > usage    4:                 cpu=00:04:34, mem=188.12645 GBs, io=0.01394, vmem=3.655G, maxvmem=3.655G
> > > > usage    5:                 cpu=00:04:16, mem=180.18292 GBs, io=0.01394, vmem=3.636G, maxvmem=3.636G
> > > > usage    6:                 cpu=00:04:22, mem=183.47616 GBs, io=0.01394, vmem=3.644G, maxvmem=3.644G
> > > > usage    7:                 cpu=00:04:15, mem=179.89624 GBs, io=0.01400, vmem=3.640G, maxvmem=3.640G
> > > > usage    8:                 cpu=00:04:55, mem=207.28643 GBs, io=0.01394, vmem=3.669G, maxvmem=3.669G
> > > > usage    9:                 cpu=00:04:27, mem=184.86707 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
> > > > usage   10:                 cpu=00:04:14, mem=179.09446 GBs, io=0.01394, vmem=3.635G, maxvmem=3.635G
> > > > usage   11:                 cpu=00:04:47, mem=195.80372 GBs, io=0.01400, vmem=3.668G, maxvmem=3.668G
> > > > usage   12:                 cpu=00:04:49, mem=203.43895 GBs, io=0.01394, vmem=3.665G, maxvmem=3.665G
> > > > usage   13:                 cpu=00:04:45, mem=196.67175 GBs, io=0.01394, vmem=3.663G, maxvmem=3.663G
> > > > usage   14:                 cpu=00:04:24, mem=185.68047 GBs, io=0.01400, vmem=3.648G, maxvmem=3.648G
> > > > usage   15:                 cpu=00:04:40, mem=195.96253 GBs, io=0.01394, vmem=3.656G, maxvmem=3.656G
> > > > usage   16:                 cpu=00:04:11, mem=179.84016 GBs, io=0.01394, vmem=3.633G, maxvmem=3.633G
> > > > usage   17:                 cpu=00:04:43, mem=196.21689 GBs, io=0.01394, vmem=3.662G, maxvmem=3.662G
> > > > usage   18:                 cpu=00:04:37, mem=197.39875 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
> > > > usage   19:                 cpu=00:04:35, mem=191.55982 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
> > > > usage   20:                 cpu=00:04:26, mem=191.62928 GBs, io=0.01394, vmem=3.643G, maxvmem=3.643G
> > > > usage   21:                 cpu=00:04:42, mem=197.87398 GBs, io=0.01394, vmem=3.660G, maxvmem=3.660G
> > > > usage   22:                 cpu=00:04:36, mem=193.43107 GBs, io=0.01394, vmem=3.652G, maxvmem=3.652G
> > > > usage   23:                 cpu=00:04:32, mem=193.12103 GBs, io=0.01394, vmem=3.652G, maxvmem=3.652G
> > > > usage   24:                 cpu=00:04:25, mem=186.56485 GBs, io=0.01400, vmem=3.644G, maxvmem=3.644G
> > > > usage   25:                 cpu=00:04:51, mem=201.81706 GBs, io=0.01400, vmem=3.669G, maxvmem=3.669G
> > > > scheduling info:            queue instance "main.q@camilla" dropped because it is full
> > > >                             queue instance "main.q@node0" dropped because it is full
> > > >                             All queues dropped because of overload or full
> > > >                             not all array task may be started due to 'max_aj_instances'
> > >
> > > The machine is just full.
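> > >
> > > If more tasks of one array job should run concurrently once slots are free, the global 'max_aj_instances' setting is the relevant knob (a sketch):
> > >
> > > # show the current limit on simultaneously running tasks per array job
> > > qconf -sconf | grep max_aj_instances
> > >
> > > # edit the global configuration (opens an editor)
> > > qconf -mconf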
> > >
> > > -- Reuti
> > >
> > >
> > > > You guys know how this can be solved?
> > > >
> > > >
> > > >
> > > > 2012/11/13 Reuti <reuti at staff.uni-marburg.de>
> > > > On 13.11.2012 at 13:42, jan roels wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I followed the following tutorial:
> > > > >
> > > > >
> > > > > http://verahill.blogspot.be/2012/06/setting-up-sun-grid-engine-with-three.html on how to install SGE. It all went fine on my master node, but on my exec node I have some trouble.
> > > > >
> > > > > First it gave the following error:
> > > > >
> > > > > 11/13/2012 13:44:43|  main|node0|E|communication error for "node0/execd/1" running on port 6445: "can't bind socket"
> > > >
> > > > Is there already something running on this port - any older version of the execd?
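> > > >
> > > > A quick way to check, e.g.:
> > > >
> > > > # what is currently listening on the execd port?
> > > > netstat -tlnp | grep 6445
> > > >
> > > > # any leftover execd processes?
> > > > ps -ef | grep sge_execd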
> > > >
> > > >
> > > > > 11/13/2012 13:44:44|  main|node0|E|commlib error: can't bind socket (no additional information available)
> > > > > 11/13/2012 13:45:12|  main|node0|C|abort qmaster registration due to communication errors
> > > > > 11/13/2012 13:45:14|  main|node0|W|daemonize error: child exited before sending daemonize state
> > > > >
> > > > > but then I killed the process and restarted gridengine-execd, and then I get the following:
> > > > >
> > > > > /etc/init.d/gridengine-exec restart
> > > > > * Restarting Sun Grid Engine Execution Daemon sge_execd     error: can't resolve host name
> > > > > error: can't get configuration from qmaster -- backgrounding
> > > > >
> > > > > What can I do to fix this?
> > > >
> > > > Any firewall on the machines? Ports 6444 and 6445 need to be excluded from any filtering (i.e. left open).
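> > > >
> > > > For example, with iptables something like this (a sketch):
> > > >
> > > > # check whether the ports are being filtered, then allow them
> > > > iptables -L -n | grep 644
> > > > iptables -A INPUT -p tcp --dport 6444:6445 -j ACCEPT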
> > > >
> > > > -- Reuti
> > > >
> > > >
> > > >
> > >
> > >
> > >
> >
> >
>
>