[gridengine users] execution node installation error

Reuti reuti at staff.uni-marburg.de
Thu Oct 15 12:56:19 UTC 2015


The spool directory is created when the execd starts. So if there are problems in this spool directory it can simply be removed, and it will be recreated on the next restart.
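
As a rough sketch, the configured spool location can be pulled out of the `qconf -sconf` output like this. The function name `spool_dir_from_conf` is made up for illustration, and I'm assuming each setting appears on one line as "key value":

```shell
# Print the value of execd_spool_dir from `qconf -sconf`-style text on stdin.
# Assumption: each setting is on its own line as "<key> <value>".
spool_dir_from_conf() {
    awk '$1 == "execd_spool_dir" { print $2; exit }'
}

# On a machine with the SGE tools installed, one would run:
#   qconf -sconf | spool_dir_from_conf
```

The per-host directory would then be expected below the printed path, named after the execution node.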

Is there any file in /tmp on the exechost with "execd" in its name? If execd runs into problems during startup, that may be the only output you get.
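
Something along these lines would list such files. This is only a sketch: `find_execd_files` is a made-up helper name, and `-maxdepth` assumes GNU or BSD find:

```shell
# List regular files directly inside a directory whose names contain "execd".
# The directory defaults to /tmp; pass another path as the first argument.
find_execd_files() {
    dir="${1:-/tmp}"
    find "$dir" -maxdepth 1 -type f -name '*execd*' 2>/dev/null
}

# Run on the exechost:
find_execd_files /tmp
```

If it prints anything, the contents of those files are worth reading first.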

-- Reuti


> On 15.10.2015 at 14:52, Hatem Elshazly <hmelshazly at gmail.com> wrote:
> 
> Yes it is.
> 
> Why do you think that the exec dirs weren't created? All the permissions and ownerships are granted.
> I'm using the inst_sge_sc script to do the installation on EC2 instances, rather than apt-get gridengine-exec, because I want the installation to run in noninteractive mode, but it seems that I'm missing something.
> 
> On Thu, Oct 15, 2015 at 2:38 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> 
> > On 15.10.2015 at 14:33, Hatem Elshazly <hmelshazly at gmail.com> wrote:
> >
> > It is in state qw.
> >
> > The home directory is mounted.
> >
> > I used the qalter command; it produces this output:
> > instance "node" dropped because it is temporarily not available
> > I checked the firewalls and all of them are dropped, and the daemons are listening on the ports on the master and execution nodes.
> >
> > I noticed that there is no directory in /opt/sge/default/spool/. Shouldn't a directory with the name of the execution node be created in this path?
> 
> Yes.
> 
> Is the $SGE_ROOT shared too?
> 
> The location of the spool directory for the exechosts can be checked in `qconf -sconf` ("execd_spool_dir").
> 
> -- Reuti
> 
> 
> >
> > -- Shazly
> >
> > On Thu, Oct 15, 2015 at 11:45 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
> > Hi,
> >
> > > On 15.10.2015 at 01:16, Hatem Elshazly <hmelshazly at gmail.com> wrote:
> > >
> > > Hi there,
> > >
> > > I'm having a problem getting an execution host to work. The master node doesn't seem to detect the execution node; when I submit a job it stalls in the queue.
> >
> > Is it in state "qw" or "t"?
> >
> > $ qalter -w v <job_id>
> >
> > will check whether the job could be started in an empty cluster in the current configuration.
> >
> > Is the home directory shared in the cluster, so that the user's home directory can be accessed?
> >
> >
> > > Both daemons are running on the master and the execution node. I added the execution node to the queue, made sure the ports are open, and can ssh without a password from/to both nodes
> >
> > It's not necessary to have passphraseless SSH in the cluster. Even parallel jobs can run without this setting. In fact, I allow SSH access to nodes only for admin staff.
> >
> >
> > > , sge_root and sge_cell are open to read and write. The strange thing is that when I change the ncpu of the execution node, the change is reflected in the qhost output on the master node.
> >
> > You mean "num_proc"? This should be seen as a read-only value and it's normally not necessary to adjust it. The slot count in the queues is independent of this setting.
> >
> > -- Reuti
> >
> >
> > > This is the output of the qhost command (arch and mem are NA although I set them in the node's values):
> > >
> > > HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> > > -------------------------------------------------------------------------------
> > > global                  -               -     -       -       -       -       -
> > > node001                 -               1     -       -       -       -       -
> > > master                  linux-x64       1  0.01    3.7G  157.8M     0.0     0.0
> > >
> > >
> > > Any suggestions on what might be wrong are really appreciated.
> > >
> > > Thanks.
> > > _______________________________________________
> > > users mailing list
> > > users at gridengine.org
> > > https://gridengine.org/mailman/listinfo/users
> >
> >
> 
> 