[gridengine users] hard_queue_list & parallel environment error
Reuti
reuti at staff.uni-marburg.de
Wed May 16 08:26:58 UTC 2012
On 16.05.2012 at 09:44, Arturo wrote:
> What I want is for a user to be able to request 128 slots on just 2 nodes of 64 cores each. Using the parallel environment
> $ qsub -q conmat -pe foobar 5 submit.sh
> makes the script run in 5 slots, but they could be assigned to 5 different nodes.
The selection of nodes depends on the allocation_rule in the PE setting. It's true that this selection criterion has to be set up by the admin, and not by the user as in Torque.
But a second variable slots_free won't help - how should this variable be used to select any particular number of hosts?
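To make that limitation concrete: a consumable that is charged once per granted slot only caps how many slots of a parallel job fit on a host; it does not choose hosts. The following is a minimal sketch in plain Python arithmetic, not Grid Engine code. The function name is hypothetical, and the per-slot request of 4 is an assumption (the thread only states slotsfree=8 on node045 and a 4-slot PE request). Under that assumption the numbers match the "only offers 2 slots" message quoted further down.

```python
def slots_offered(host_capacity: int, per_slot_request: int, queue_slots: int) -> int:
    """Slots a host can offer when a consumable is charged once per granted slot."""
    if per_slot_request <= 0:
        # Nothing is consumed, so only the queue's slot limit applies.
        return queue_slots
    return min(queue_slots, host_capacity // per_slot_request)


# Assumed request: -pe openmpi 4 -l slotsfree=4 against slotsfree=8 on a 64-slot host.
offered = slots_offered(host_capacity=8, per_slot_request=4, queue_slots=64)
print(offered)       # 2
print(offered >= 4)  # False: the 4-slot job cannot start on this host alone
```

With a per-slot request of 1 the same host would offer 8 slots, which is why such a consumable behaves like a second slot counter and says nothing about the number of hosts.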
> If I try to use the built in complex_value slots:
> $ qsub -q conmat -l slots=5 submit.sh
To get 5 slots on a single machine, the allocation_rule can be set to $pe_slots. If you have 64-core machines and want to be sure to get 2 and only 2 machines for a request of 128 slots, the "exclusive" complex feature can be used, requested in addition to the allocation_rule $fill_up.
Another option would be a fixed allocation_rule of 64. But then you are limited to multiples of 64, of course.
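The effect of these three rules on host selection can be illustrated with a small simulation. This is only a sketch in Python under stated assumptions, not the scheduler's actual algorithm: the function name `allocate` is hypothetical, hosts are taken in the order given (the real scheduler sorts them itself), and the free-slot counts are made up for the example.

```python
from typing import Dict, Optional


def allocate(rule: str, requested: int, free: Dict[str, int]) -> Optional[Dict[str, int]]:
    """Return {host: granted slots}, or None if the request cannot be satisfied.

    Hosts are considered in dict order; this ordering is an assumption.
    """
    if rule == "$pe_slots":
        # All slots must come from a single host.
        for host, n in free.items():
            if n >= requested:
                return {host: requested}
        return None
    if rule == "$fill_up":
        # Take as many slots as possible from each host before moving on.
        grant: Dict[str, int] = {}
        remaining = requested
        for host, n in free.items():
            take = min(n, remaining)
            if take > 0:
                grant[host] = take
                remaining -= take
            if remaining == 0:
                return grant
        return None
    # Otherwise the rule is a fixed integer: exactly that many slots per host.
    per_host = int(rule)
    if requested % per_host != 0:
        return None
    hosts = [h for h, n in free.items() if n >= per_host][: requested // per_host]
    if len(hosts) * per_host < requested:
        return None
    return {h: per_host for h in hosts}


# Hypothetical free-slot counts; only the node names come from this thread.
free = {"node045": 64, "node046": 64, "node047": 40}
print(allocate("64", 128, free))        # {'node045': 64, 'node046': 64}
print(allocate("64", 100, free))        # None
print(allocate("$pe_slots", 5, free))   # {'node045': 5}
busy = {"node045": 50, "node046": 64, "node047": 40}
print(allocate("$fill_up", 128, busy))  # {'node045': 50, 'node046': 64, 'node047': 14}
```

The last call shows why $fill_up alone does not guarantee exactly 2 machines: when a host is partly occupied, the job spills onto a third one. Requesting the hosts exclusively avoids that case.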
-- Reuti
> gives an error so I can't use it:
>
> Unable to run job: "job" denied: use parallel environments instead of requesting slots explicitly.
> Exiting.
>
> So, I have created a new complex_value, slotsfree, similar to slots, but one that I can use as a requestable and consumable.
>
> But now, thanks to William, we have observed that slotsfree consumption is multiplied by the slots configured in the PE.
>
> Do you understand what my problem is?
>
> Ideally I would like to specify how many slots to use (128 for example) and in how many different nodes, but without specifying explicitly in which nodes.
>
> Many thanks for your help!!!
>
> On 15/05/12 16:32, Reuti wrote:
>> On 15.05.2012 at 16:23, Arturo wrote:
>>
>>> <snip>
>>>
>>>
>>> It doesn't matter to which queue I submit the script.
>>>
>>> I would use the built-in slots complex, but when I use it, it gives me this error:
>>>
>>> qsub -q conmat -l slots=5 submit.sh
>>> Unable to run job: "job" denied: use parallel environments instead of requesting slots explicitly.
>>> Exiting.
>> As the message says: you are not requesting a PE with the proper slot count?
>>
>> $ qsub -q conmat -pe foobar 5 submit.sh
>>
>> -- Reuti
>>
>>
>>> Regards
>>>
>>> On 15/05/12 16:12, William Hay wrote:
>>>> Ok that makes more sense. The queue instance on node045 is called
>>>> conmat not test. If test only exists as a single slot on each of
>>>> node046 and node047
>>>> then when you request -q test you are restricting it to those two
>>>> slots which isn't enough for a 4 slot job.
>>>> We would really need the full output of qstat -f to be sure though.
>>>>
>>>>
>>>> William
>>>> On 15 May 2012 14:42, Arturo <artginer at bifi.es> wrote:
>>>>
>>>>> More info:
>>>>>
>>>>> output of qstat -f
>>>>>
>>>>> ---------------------------------------------------------------------------------
>>>>>
>>>>> conmat@node045.cm.cluster BIP 0/0/64 0.00 lx26-amd64
>>>>> ---------------------------------------------------------------------------------
>>>>>
>>>>> ############################################################################
>>>>> - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>>>>> ############################################################################
>>>>> 74550 0.60500 test arturo qw 05/15/2012 15:26:50 4
>>>>>
>>>>> qconf -sq test |grep slot
>>>>>
>>>>> slots 64
>>>>>
>>>>>
>>>>> qconf -sp openmpi |grep slots
>>>>>
>>>>> slots 99999
>>>>> urgency_slots min
>>>>>
>>>>> Regards
>>>>>
>>>>> On 15/05/12 15:39, Arturo wrote:
>>>>>
>>>>> Hi William,
>>>>>
>>>>> you were right, it was running on several nodes:
>>>>>
>>>>> 74545 0.60500 test arturo r 05/15/2012 15:17:46 conmat@node045.cm.cluster MASTER
>>>>>                                                 conmat@node045.cm.cluster SLAVE
>>>>> 74545 0.60500 test arturo r 05/15/2012 15:17:46 test@node046.cm.cluster SLAVE
>>>>> 74545 0.60500 test arturo r 05/15/2012 15:17:46 test@node047.cm.cluster SLAVE
>>>>>
>>>>> Well, looking more closely, the problem is that I created a complex value
>>>>> "slotsfree", consumable and requestable, and assigned it to node045 with
>>>>> the value:
>>>>> slotsfree=8 (for example).
>>>>>
>>>>> If I submit a job using a parallel environment to this node without
>>>>> configuring this complex_value, it works perfectly.
>>>>> And when I submit a job without using a PE to this node, but with this
>>>>> complex_value configured, it also works,
>>>>> but when I submit the same job using both a PE and the complex_value, it
>>>>> doesn't work, and the output only says this:
>>>>>
>>>>> cannot run in PE "openmpi" because it only offers 2 slots
>>>>>
>>>>>
>>>>> Is it clearer now? Why does it not work if the PE is configured without a slot
>>>>> limitation, the node has 64 slots, and the slotsfree value is greater than 4?
>>>>>
>>>>> Thanks for your help.
>>>>>
>>>>> Regards
>>>>> Arturo
>>>>>
>>>>>
>>>>> On 15/05/12 14:33, William Hay wrote:
>>>>>
>>>>> On 15 May 2012 13:05, Arturo <artginer at bifi.es> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a very strange behaviour when I try to use a parallel environment
>>>>> with hard_queue_list option.
>>>>>
>>>>> In my script I have a parallel configuration:
>>>>>
>>>>> #$ -pe openmpi 4
>>>>>
>>>>> and if I submit the script in the following way, it works and runs in the
>>>>> queue instance test@node045
>>>>>
>>>>> qsub script.sh
>>>>>
>>>>> But if I submit the script using the hard_queue_list, it doesn't run:
>>>>>
>>>>> qsub -q test script.sh
>>>>>
>>>>> With this error:
>>>>>
>>>>> cannot run in PE "openmpi" because it only offers 2 slots
>>>>>
>>>>> Obviously, the node is always empty. What may be wrong?
>>>>>
>>>>> It's hard to diagnose what's going on without knowing more about your
>>>>> configuration.
>>>>> Are you certain the entire job is running in the queue instance
>>>>> test@node045 when you submit without a queue list?
>>>>> One possibility is that queue test@node045 has only two slots. The
>>>>> master slot of the job plus one slave runs
>>>>> in test@node045 while the remaining slots run elsewhere.
>>>>>
>>>>> When the job is running what output do you get from qstat -g t?
>>>>>
>>>>> William
>>>>>