[gridengine users] Intermittent MPI problem on OGS 2011.11

Reuti reuti at staff.uni-marburg.de
Wed Jun 18 20:20:18 UTC 2014


Hi,

On 18.06.2014 at 20:45, Connell, Jesse wrote:

> We've been having a seemingly random problem with MPI jobs on our install
> of Open Grid Scheduler 2011.11.  For some varying length of time from when
> the execd processes start up, MPI jobs running across multiple hosts will
> run fine.  Then, at some point, they will start failing at the mpirun
> step, and will keep failing until execd is restarted on the affected
> hosts.  They then work again, before eventually failing, and so on.  If I
> increase the SGE debug level before calling mpirun in my job script, I see
> things like this:
> 
>   842  11556         main     ../clients/qsh/qsh.c 1840 executing task of

qsh? Are you using an X11 session?

-- Reuti


> job 6805430 failed: failed sending task to execd@<hostname>: got send error
> 
> ...but nothing more interesting that I can see.  (I also get the same sort
> of "send error" message from mpirun itself if I use its --mca
> ras_gridengine_debug --mca ras_gridengine_verbose flags, but nothing
> else.)  Jobs that run on multiple cores on a single host are fine, but
> ones that try to start up workers on additional hosts fail.  Since
> restarting execd makes it work again, I assumed the problem was on that
> end, and tried dumping verbose log output for execd (using dl 10) to a
> file.  But, despite many thousands of lines, I can't spot anything that
> looks different when the jobs start failing from when they are working, as
> far as execd is concerned.  Ordinary grid jobs (no parallel environment)
> continue to run fine no matter what.
> 
> So for now, I'm stumped!  Any other ideas of what to look for, or thoughts
> of what the unpredictable off-and-on behavior could possibly mean?  Thanks
> in advance,
> 
> Jesse
> 
> P.S.  This is on CentOS 6, with its openmpi 1.5.4 package.



