[gridengine users] SGE-6.2u5: Sudden death of *all* cluster jobs.
Reuti
reuti at staff.uni-marburg.de
Mon Mar 14 13:37:33 UTC 2011
On 14.03.2011 at 12:45, Erik Soyez wrote:
> Good day,
>
> one of our customers suffered an incident that I've never seen before.
>
> On Friday night all jobs running on the cluster died within a few hours:
> ------------------------------------------------------------------------
> Application: CFX
> Integration: Tight
> ------------------------------------------------------------------------
>
> Afterwards no new jobs could be submitted until all execds(!) had
> been restarted. Unfortunately I could not look at the cluster myself
> when it happened, so I have to rely on log files etc., which do not
> seem to fit together. Sorry for this lengthy email, but I need some
> hint to understand what's going on:
>
What are the settings of rsh_command and rsh_daemon in SGE's configuration?
-- Reuti
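
For anyone following along, the entries in question can be read from the output of `qconf -sconf`. Below is a minimal sketch of pulling them out of that text, assuming the usual "name, whitespace, value" layout of the output. The helper names (`pick_settings`, `read_global_conf`) and the `builtin` values in SAMPLE are purely illustrative; the actual rsh settings on the affected cluster are not known from this thread. Only the gid_range value is taken from the mail below.

```python
import subprocess

# Configuration entries of interest for this thread.
KEYS = ("rsh_command", "rsh_daemon", "gid_range")


def pick_settings(conf_text, keys=KEYS):
    """Return {name: value} for the requested entries in `qconf -sconf` text."""
    found = {}
    for line in conf_text.splitlines():
        fields = line.split(None, 1)  # name, then the rest of the line
        if len(fields) == 2 and fields[0] in keys:
            found[fields[0]] = fields[1].strip()
    return found


def read_global_conf():
    """Run `qconf -sconf` on a submit/admin host and return its output."""
    result = subprocess.run(
        ["qconf", "-sconf"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Illustrative sample only: the rsh values here are placeholders, not the
# poster's real configuration.
SAMPLE = """#global:
rsh_command                  builtin
rsh_daemon                   builtin
gid_range                    20000-20100
"""

print(pick_settings(SAMPLE))
```

On a real cluster, `pick_settings(read_global_conf())` would be the call to try; host-local overrides, if any, would need `qconf -sconf <hostname>` instead.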
> ------------------------------------------------------------------------
> In "qmaster/messages" (job 18690 did not even exist at that time):
> ------------------------------------------------------------------------
> 03/14/2011 08:28:54|worker|xxxxxxxxx1|E|unable to find job 18690 from the scheduler order package
> 03/14/2011 08:28:54|schedu|xxxxxxxxx1|E|unable to find job 18690 from the scheduler order package
>
>> qstat -j 18690
> Following jobs do not exist:
> 18690
>
>> qacct -j 18690
> error: job id 18690 not found
>
>> cat /opt/sge/6.2u5/default/spool/qmaster/jobseqnum
> 18718
> ------------------------------------------------------------------------
>
>
> E-Mail:
> ------------------------------------------------------------------------
> GE 6.2u5: Job 17765 failed
> ------------------------------------------------------------------------
> :
> :
> failed assumedly before job:can not find an unused add_grp_id
> Shepherd pe_hostfile:
> xxxxxx208.xxxxx.xxxxx.xxx 4 standard@xxxxxx208.xxxxx.xxxxx.xxx UNDEFINED
> xxxxxx207.xxxxx.xxxxx.xxx 4 standard@xxxxxx207.xxxxx.xxxxx.xxx UNDEFINED
> xxxxxx205.xxxxx.xxxxx.xxx 4 standard@xxxxxx205.xxxxx.xxxxx.xxx UNDEFINED
> xxxxxx204.xxxxx.xxxxx.xxx 4 standard@xxxxxx204.xxxxx.xxxxx.xxx UNDEFINED
> ------------------------------------------------------------------------
>
> But:
> ------------------------------------------------------------------------
> gid_range 20000-20100
> ------------------------------------------------------------------------
> This should be more than enough, shouldn't it?
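
To put a number on "more than enough": the quoted range covers 101 group IDs. A minimal sketch of the count follows, assuming the comma-separated "n-m" or single-number syntax that sge_conf(5) describes for gid_range; the function name `parse_gid_range` is made up for illustration. As far as I understand, each execd draws from this range separately and holds one ID per running job or tightly integrated task on its host, so the size matters per host rather than cluster-wide.

```python
def parse_gid_range(spec):
    """Expand a gid_range string such as "20000-20100" into a sorted ID list.

    Accepts a comma-separated list of "n-m" ranges or single numbers.
    """
    gids = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        low, sep, high = part.partition("-")
        first = int(low)
        last = int(high) if sep else first  # a bare "n" means "n-n"
        if last < first:
            raise ValueError("descending range: " + part)
        gids.update(range(first, last + 1))
    return sorted(gids)


print(len(parse_gid_range("20000-20100")))  # prints 101
```

With 4 slots per host in the pe_hostfile above, 101 IDs would normally be plenty, which suggests IDs were not being released rather than the range being too small. That is an inference from the numbers, not something the logs confirm.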
>
>
>
> The application log files show some totally different error messages:
>
> One outfile:
> ------------------------------------------------------------------------
> +--------------------------------------------------------------------+
> | Warning! |
> | |
> | /opt/sge/6.2u5/mpi/rsh connection to host |
> | xxxxxx206.xxxxx.xxxxx.xxx produces the following output after the |
> | output of the command: |
> | |
> | TRUE |
> | |
> | This may cause problems spawning parallel slaves. |
> +--------------------------------------------------------------------+
> ------------------------------------------------------------------------
>
> Another outfile:
> ------------------------------------------------------------------------
> +--------------------------------------------------------------------+
> | An error has occurred in cfx5solve: |
> | |
> | Remote connection to xxxxxx216.xxxxx.xxxxx.xxx was terminated due |
> | to a timeout. It was interrupted by signal TERM (15) It gave the |
> | following output: |
> | |
> | /opt/sge/6.2u5/bin/lx26-amd64/qrsh -inherit -nostdin xxxxxx216- |
> | .xxxxx.xxxxx.xxx echo TRUE |
> | error: got no connection within 60 seconds. "Timeout occured w- |
> | hile waiting for connection" |
> :
> :
> :
> ------------------------------------------------------------------------
>
> Any ideas if these are different minor problems or one major problem?
>
> Many thanks!
>
> Erik Soyez.
>
>