[gridengine users] Parallel GE jobs on 48-way nodes
reuti at staff.uni-marburg.de
Tue Oct 11 11:55:52 UTC 2011
On 10.10.2011 at 20:46, Gerald Ragghianti wrote:
> We have a cluster consisting of 48-core compute nodes where we need to run parallel (MPI) jobs across nodes. There is a hardware limitation on the QDR Infiniband cards that limits the available hardware contexts to 16 per card. We have to ensure that we don't over-subscribe these hardware contexts because parallel jobs without available contexts will crash. The difficulty is that the contexts needed for a job are a function of the number of compute nodes the job uses, not the number of job slots.
If I understand you correctly, you are looking for something like a complex with "consumable HOST" (instead of JOB or YES), i.e. one that is consumed once on each exechost the job uses, independent of the total number of slots granted on that machine. Unfortunately, this has been discussed before but is not implemented yet.
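To make the difference between the three accounting modes concrete, here is a small Python sketch. It is not Grid Engine code: the function name `debit`, the mode strings, and the host names are made up for illustration. The YES and JOB branches follow my reading of the complex(5) man page (per slot, and once on the master queue's host, respectively); the HOST branch models the behaviour asked for above, which does not exist in Grid Engine.

```python
# Illustrative model only; none of these names come from Grid Engine itself.
CONTEXT_LIMIT = 16  # hardware contexts per InfiniBand card, from the original post


def debit(allocation, request, mode, master_host=None):
    """Return the amount debited on each host for one job.

    allocation:  dict mapping host name -> slots granted on that host
    request:     amount of the consumable requested by the job
    mode:        "YES" (per slot), "JOB" (once, on the master host),
                 or "HOST" (once per used host; hypothetical, not implemented)
    """
    if mode == "YES":
        return {host: request * slots for host, slots in allocation.items()}
    if mode == "JOB":
        # Assumption: the first host in the allocation is the master host
        # unless one is named explicitly.
        master = master_host if master_host is not None else next(iter(allocation))
        return {host: (request if host == master else 0) for host in allocation}
    if mode == "HOST":
        return {host: request for host in allocation}
    raise ValueError(f"unknown consumable mode: {mode}")


# A 48-slot MPI job spread over two nodes, requesting one context.
job = {"node01": 24, "node02": 24}
for mode in ("YES", "JOB", "HOST"):
    per_host = debit(job, 1, mode)
    fits = all(amount <= CONTEXT_LIMIT for amount in per_host.values())
    print(mode, per_host, "fits" if fits else "exceeds limit")
```

With a per-slot consumable the job would be charged 24 contexts on each node, which exceeds the limit of 16; with JOB only the master host is charged, so the second node is not protected; only the per-host mode charges each used node exactly once.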
> We don't want to make each node dedicated to a single job because we also want to be able to run smaller multi-threaded and single-slot jobs. If we assume (for now) that we allow each parallel job to use all 16 contexts on each compute node, how can we ensure that no other parallel jobs will be allocated to these nodes?
You mean each job may consume either 1 or all 16 contexts on an exechost? How do you decide which case applies?