[gridengine users] Resource quotas and parallel jobs across multiple queues
reuti at staff.uni-marburg.de
Fri Jan 13 01:39:31 UTC 2012
Am 12.01.2012 um 23:52 schrieb Brendan Moloney:
>>> All the queues are on the same machines. I am not sure which "algorithm" you refer to.
>> I refer to SGE's internal algorithm for collecting slots from various queues.
>>> As mentioned, the scheduler sorts by sequence number so the queues are checked in shortest to longest order.
>> Not for parallel jobs. Only the allocation_rule is used (except for $pe_slots).
>> Does your observation fit the aspects of parallel jobs at the end of the above link?
> There is definitely still some interaction between the scheduler configuration and the PE allocation rule. The allocation rule for the "mpi" PE is $round_robin. When I run this example successfully (with the per-node slot limits done through complex values), the grid engine does round-robin allocation in short.q (animal and kermit get 12 slots, piggy gets 8), followed by round-robin allocation in long.q (animal and kermit get 4 slots).
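For reference, the working per-host variant mentioned above sets the slot count directly on each exec host (a sketch; the host name and the count of 16 are invented, and the excerpt shows only the relevant line of the definition you would edit with `qconf -me <host>`):

```
# excerpt of an exec host definition, e.g. `qconf -me animal`
hostname        animal
complex_values  slots=16
```

With slots defined per host like this, the limit applies across all queues on that host, which is what made the round-robin allocation behave as described.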
>> Interesting. Collecting slots from different queues has some implications anyway:
>> - the name of the $TMPDIR depends on the name of the queue, hence it's not the same on all nodes
> This should not be an issue for correctly written software, right?
This depends on what you define as "correctly":
Case 1: you have no queuing system, and users are asked to create something like /scratch/reuti/foobar17 by hand on all nodes for a particular job. You pass this value as an argument to `mpiexec` and are quite happy that the application forwards it internally to all nodes. Setting it in ~/.profile at ssh login instead would mean changing the profile for each `mpiexec` run. Even if only /scratch/reuti has to be created as a one-time setup, it's the same on all nodes, so there is no need to set any variable.
Case 2: you have a queuing system and want to use $TMPDIR - it must be the one on each node, not the one forwarded from the master node of the parallel job as in case 1. Whether this works depends on whether the software honors something like $TMP or $TMPDIR, or behaves as in case 1.
Case 3: the software just uses $PWD for its scratch data. Hence you do a `cd $TMPDIR` on the master node, and this path will also be used on all slave nodes. If the directory isn't there, you are out of luck, or you use only /tmp (or your home directory) and lose SGE's handling of $TMPDIR.
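Case 3 in job-script form might look like this (a sketch; the PE name and solver are invented for illustration):

```
#!/bin/sh
#$ -pe mpi 16
# the master node's $TMPDIR becomes $PWD, and mpiexec starts the slave
# processes in the same working directory - which only works if that
# directory also exists on every slave node
cd "$TMPDIR" || exit 1
mpiexec ./solver     # solver writes its scratch data to $PWD
```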
In fact, this was tricky with some applications under Codine 5.3 - no cluster queues yet, and although $TMPDIR was created on the slave nodes, it had a different name on each of them, as every queue had a unique name like node01.long.q, node02.long.q (with only one host per queue)... IIRC I made a loop across the involved nodes to create a symbolic link, with a name of my choosing, to Codine's created $TMPDIR. Oh dear, long ago...
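A minimal sketch of that old workaround: loop over the nodes of the parallel job and create a symbolic link with one common name pointing at each node's queue-specific $TMPDIR. The hostfile contents, the /tmp/&lt;jobid&gt;.1.&lt;queue&gt; naming, and LINKNAME are assumptions for illustration; DRYRUN=echo only prints the ssh commands instead of executing them.

```shell
# sample $PE_HOSTFILE contents (host, slots, queue, processor range);
# inside a real job SGE points $PE_HOSTFILE at such a file
cat > pe_hostfile <<'EOF'
node01 4 node01.long.q UNDEFINED
node02 4 node02.long.q UNDEFINED
EOF

JOB_ID=${JOB_ID:-4711}                     # set by SGE in a real job
LINKNAME=/scratch/${USER:-reuti}/job_tmpdir  # invented common name
DRYRUN=echo                                # remove to actually run ssh

while read host slots queue rest; do
    # each queue had a unique per-host name, so $TMPDIR differed per
    # node; /tmp/$JOB_ID.1.$queue mimics that layout (an assumption)
    $DRYRUN ssh "$host" ln -s "/tmp/$JOB_ID.1.$queue" "$LINKNAME"
done < pe_hostfile > dryrun.log

cat dryrun.log
```

With DRYRUN unset, the loop would create the link on every node, after which the software only ever needs the one common path.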
>> - `qrsh -inherit ...` can't distinguish between the granted queues:
> I don't think this will affect us. We only run MPI programs with a tightly integrated MPICH2 or SMP programs with the allocation rule set to $pe_slots.
> So is it safe to say that I have found a bug?
I think so. The limit in the RQS should be handled as you expect, especially as everything works, as you note, when you set individual slot counts in the exechost definitions instead.
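For reference, an RQS of the shape under discussion might look like this (a hypothetical reconstruction - the rule name and the limit of 16 slots are invented, not taken from your posting):

```
{
   name         slots_per_host
   description  Cap the total slots per exec host across all queues
   enabled      TRUE
   limit        hosts {*} to slots=16
}
```

Such a rule is edited with `qconf -mrqs` and should count slots from short.q and long.q together per host.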
> It seems like my original RQS should work.
> Or at least doing qsub with '-w e' should fail immediately instead of allowing the job to wait in 'qw' state forever.
That check covers the case of "no suitable queue"; here, however, the scheduler first finds a possible assignment and only fails to collect the slots later on.