[gridengine users] serial and mpi jobs running on the same nodes

Reuti reuti at staff.uni-marburg.de
Tue Jan 27 13:53:15 UTC 2015

On 27.01.2015 at 14:25, Ursula Winkler <ursula.winkler at uni-graz.at> wrote:
> On 01/27/2015 11:54 AM, Reuti wrote:
>> Hi,
>>> On 27.01.2015 at 09:26, Ursula Winkler <ursula.winkler at uni-graz.at> wrote:
>>> On 01/26/2015 10:03 PM, Reuti wrote:
>>>>> I'm trying to find a solution for an environment running serial jobs as well as mpi jobs on
>>>>> 6 hosts where each host has 32 cores/slots. Due to the small number of nodes, assigning
>>>>> each sort of job to separate nodes (e.g. nodes 1-2 for serial, nodes 3-6 for mpi jobs) is
>>>>> not an option, especially because the serial:mpi ratio varies quite a bit.
>>>>> I tried setting up 2 queues with "serial" as a subordinate queue to "mpi". - But that
>>>>> only avoids waste if the mpi job(s) use ~ 32 slots per host. Otherwise there are serial
>>>>> jobs which could run but persist unnecessarily in a suspended state, because
>>>>> the whole queue "serial" is suspended.
>>>>> The other possible option would be the subordination of slots, but that doesn't work either,
>>>>> because the scheduler (concerning subordination) is obviously not capable of figuring out
>>>>> how many slots an mpi job is actually requesting, and so stubbornly suspends only one
>>>>> serial job - which of course causes core oversubscription.
>>>>> Does somebody have an idea how to solve this problem in a satisfying way?
>>>> Why not submit all jobs to one and the same queue?
>>>> It might be good to provide suitable scheduler settings:
>>>> $ qconf -ssconf
>>>> ...
>>>> max_reservation                   20
>>>> default_duration                  8760:00:00
>>>> and submit the parallel jobs with "-R y" to avoid starvation. To use backfilling properly, a value for h_rt needs to be provided during submission too.
>>>> -- Reuti
>>> Hi,
>>> I hoped I could avoid that. On all the other clusters we have separate nodes for each queue and that works fine without runtime limits/requests. I wanted to provide the same (usage) conditions also on the new cluster, but ok, if it can't be done...
>> And you are submitting to queues then (this would be more a Torque-style submission)?
> yes. 
>> You could also use hostgroups to have different parts of the cluster which you can address. What was the idea behind having different queues for different parts of the cluster?
> I must figure out how the cluster will be used in the future. At the moment there are a bunch of single processor jobs (more than could run at a given time; running times >= 4 d) and several mpi jobs (running times < 1 d and up to 32 cores; the cluster has much memory but no high performance network, so it doesn't make sense to use more than one host per job). Core/host reservation is at the moment not really an option because of the long running times of the serial jobs and the fact that most of the mpi jobs are still in a testing mode.
> Thank you. Hostgroups could be an option, and (at the moment maybe) a limit on the number of jobs per user, I'll see.
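A per-user job limit could be expressed as a resource quota set. Just a sketch - the quota name and the value 64 are made up, adjust to taste:

```shell
# Create a resource quota set with "qconf -arqs"; the editor will open
# and the rule could look like this (hypothetical name and limit):
#
# {
#    name         max_slots_per_user
#    description  "limit every user to 64 slots in total"
#    enabled      TRUE
#    limit        users {*} to slots=64
# }
#
# "users {*}" applies the limit to each user individually,
# not to all users combined.
```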

With two queues and sorting by sequence number you could fill the cluster with serial jobs from one side and parallel jobs from the other. As long as the allocation rule is $pe_slots it should honor the sequence number.

There were some posts that the sequence number is not honored for $pe_slots (as for parallel jobs in general). Nevertheless, routing serial jobs preferably to one side could leave complete machines free.
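A minimal sketch of such a setup - the host names node01..node06 and queue names are assumptions, not taken from your cluster:

```shell
# Tell the scheduler to sort queue instances by sequence number
# instead of by load ("qconf -msconf"):
#
#   queue_sort_method                 seqno
#
# In the serial queue ("qconf -mq serial.q") number the hosts ascending,
# so serial jobs fill the cluster from node01 upward:
#
#   seq_no   0,[node01=1],[node02=2],[node03=3],[node04=4],[node05=5],[node06=6]
#
# In the parallel queue ("qconf -mq mpi.q") number them the other way
# round, so mpi jobs fill the cluster from node06 downward:
#
#   seq_no   0,[node06=1],[node05=2],[node04=3],[node03=4],[node02=5],[node01=6]
```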

>> To me it sounds like the setup of the new cluster (shared usage per exechosts) is different from the goal in the other clusters where only a single queue was set up on each machine (resp. a single queue for each part of the cluster)
> yes. All the other clusters have many nodes with up to 12 cores/host, as well as Infiniband, but much less memory, and so are intensely used by mpi jobs which need less memory (so a separate queue makes sense).

You could also request the memory during submission, and the jobs will be routed to machines where resources are available for them.
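For example - the memory values, the PE name "smp" and the script names below are placeholders, and this assumes h_vmem is set up as a consumable on the exec hosts:

```shell
# Serial job requesting 8 GB; dispatched only to a host where
# that much of the h_vmem consumable is still free:
qsub -l h_vmem=8G serial_job.sh

# MPI job on a single host (hypothetical PE "smp"), 4 GB per slot,
# with reservation and a runtime so backfilling can work:
qsub -pe smp 32 -l h_vmem=4G -R y -l h_rt=24:00:00 mpi_job.sh
```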

-- Reuti
