[gridengine users] preventing Myrinet endpoint starvation

Riccardo Murri riccardo.murri at uzh.ch
Mon Aug 11 20:54:36 UTC 2014


Hello,

we're running an ageing cluster, which was initially built a few years
ago with Myrinet as its high-performance interconnect.  The cluster
has recently acquired some new "fat" nodes with 32 cores, and things
have started to break: apparently the Myrinet MX kernel module only
allows 16 endpoints, but MPI processes allocate one MX endpoint per
process. So on a fat node, 16 processes out of 32 are not able to
communicate over Myrinet, and die with an error.
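
To make the arithmetic concrete, here is a minimal sketch. It assumes a job fills every core of a node with one MPI process and that each process needs exactly one MX endpoint; the function name and the default limit of 16 are illustrative only, not part of MX or SGE.

```python
def starved_processes(procs_on_node: int, endpoints_per_node: int = 16) -> int:
    """Return how many MPI processes on one node cannot get an MX endpoint.

    Assumption: one endpoint per process, and a fixed per-node endpoint limit.
    """
    # Processes beyond the endpoint limit are the ones that fail.
    return max(0, procs_on_node - endpoints_per_node)


if __name__ == "__main__":
    # Hypothetical node sizes: older nodes vs. the new 32-core "fat" nodes.
    for cores in (8, 16, 32):
        print(f"{cores} processes -> {starved_processes(cores)} without an endpoint")
```

With these assumptions only the 32-core nodes are affected, which matches what we see.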

Is there a way I can tell SGE that there are only 16 endpoints on a
node, so it would not allocate more than 16 MPI processes to a single
node?  (This seems to call for a per-node consumable, which AFAIK does
not exist.)

Thanks for any suggestion!

Riccardo

--
Riccardo Murri
http://www.s3it.uzh.ch/about/team/

S3IT: Services and Support for Science IT
University of Zurich
Winterthurerstrasse 190, CH-8057 Zürich (Switzerland)
Tel: +41 44 635 4222
Fax: +41 44 635 6888



