[gridengine users] Resource Reservation logging

Txema Heredia txema.llistes at gmail.com
Mon Oct 7 15:33:44 UTC 2013


I needed to increase the priority of one user's jobs and I wasn't 
able to do so. No matter how many times I issued qalter -p 1024 -u user, 
the waiting queue remained the same.
I have just restarted the sge_qmaster daemon, et voilà: the jobs got 
their proper priority and every job that was able to run was scheduled. 
After that simple restart, my cluster is now using 292 (+36 reserved) 
slots out of 320 total.
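
For reference, the two commands involved were along these lines (the 
startup-script path assumes a default installation; adjust for your 
init system):

    # raise all pending jobs of the user to the highest POSIX priority
    # (values above 0 require operator or manager rights)
    qalter -p 1024 -u user

    # restart only the qmaster daemon, leaving execds and jobs alone
    $SGE_ROOT/$SGE_CELL/common/sgemaster stop
    $SGE_ROOT/$SGE_CELL/common/sgemaster start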

So it seems all this is a matter of qmaster degradation. This raises 
further questions, like how the qmaster could degrade this far only 
4 days after turning on the reservations...

Thanks to all,

Txema


On 07/10/13 16:18, Txema Heredia wrote:
> On 07/10/13 16:12, Reuti wrote:
>> On 07.10.2013 at 16:09, Txema Heredia wrote:
>>
>>> On 07/10/13 16:00, Reuti wrote:
>>>> On 07.10.2013 at 15:59, Txema Heredia wrote:
>>>>
>>>>> On 07/10/13 14:58, Reuti wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 07.10.2013 at 13:15, Txema Heredia wrote:
>>>>>>
>>>>>>> The problem is that, right now, making h_rt mandatory is not an 
>>>>>>> option. So we need to work on the assumption that all jobs will 
>>>>>>> last to infinity and beyond.
>>>>>>>
>>>>>>> Right now, the scheduler configuration is:
>>>>>>> max_reservation 50
>>>>>>> default_duration 24:00:00
>>>>>>>
>>>>>>> Over the weekend, most of the parallel (and -R y) jobs started 
>>>>>>> running, but now there is something fishy in my queues:
>>>>>>>
>>>>>>> The first 3 jobs in my waiting queue belong to user1. All 3 jobs 
>>>>>>> request -pe mpich_round 12, -R y and -l h_vmem=4G (h_vmem is set 
>>>>>>> to consumable = YES, not JOB).
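
A submission of that shape would look roughly like the following (the 
script name is only a placeholder):

    # 12-way MPI job with a reservation and 4G of h_vmem per slot
    qsub -pe mpich_round 12 -R y -l h_vmem=4G run_mpi_job.sh
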
>>>>>> What amount of memory did you specify in the exechost 
>>>>>> definition, i.e. what's physically in the machine?
>>>>>>
>>>>>> -- Reuti
>>>>> 26 nodes have 96 GB of RAM. One node has 48 GB.
>>>> And you defined it at the exechost level under "complex_values"? - 
>>>> Reuti
>>> Yes, on all nodes.
>>> # qconf -se c0-0 | grep h_vmem
>>> complex_values        local_disk=400G,slots=12,h_vmem=96G
>> Good, what is the definition of the requested PE - any special 
>> "allocation_rule"?
>>
> Round robin
>
> # qconf -sp mpich_round
> pe_name            mpich_round
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
> stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
>
>
>>> PS: I've been told that there are some problems with local_disk, but 
>>> currently no job is making use of it
>> It may be a custom load sensor; it's nothing SGE provides by default.
>>
> Yes, it's simply a consumable attribute that does nothing. I have just 
> been told that host-defined consumable attributes + parallel 
> environments sometimes don't behave properly (over-requesting and the 
> like), but that shouldn't apply here because none of the jobs is using 
> it. We can ignore it.
>
>>>>> Nodes currently have between 4 and 10 free slots and between 26 
>>>>> and 82.1 GB of free memory.
>>>>>
>>>>> The first jobs in my waiting queue (after the 3 reserving ones) 
>>>>> request a measly 0.9G, 3G and 12G, all with slots=1 and -R n. None 
>>>>> of them is scheduled. But if I manually increase their priority so 
>>>>> they are placed BEFORE the 3 -R y jobs, they are scheduled immediately.
>>>>>
>>>>>>> This user already has one such job running. User1 has an RQS 
>>>>>>> that limits him to 12 slots in the whole cluster, so the 3 
>>>>>>> waiting jobs will not be able to run until the first one finishes.
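
For reference, the rule set in question would look something like the 
sketch below (the rule name and description are assumptions; the real 
definition can be checked with qconf -srqs):

    {
       name         limit_user1_slots
       description  Cap user1 at 12 slots cluster-wide
       enabled      TRUE
       limit        users user1 to slots=12
    }
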
>>>>>>>
>>>>>>> This is the current schedule log:
>>>>>>>
>>>>>>> # grep "::::\|RESERVING" schedule | tail -200 | grep 
>>>>>>> "::::\|Q:all" | tail -37 | sort
>>>>>>> ::::::::
>>>>>>> 2734185:1:RESERVING:1381142325:86460:Q:all.q at compute-0-1.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734185:1:RESERVING:1381142325:86460:Q:all.q at compute-0-2.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734185:1:RESERVING:1381142325:86460:Q:all.q at compute-0-6.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734185:1:RESERVING:1381142325:86460:Q:all.q at compute-0-7.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734185:1:RESERVING:1381142325:86460:Q:all.q at compute-0-8.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734185:1:RESERVING:1381142325:86460:Q:all.q at compute-0-9.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734185:1:RESERVING:1381142325:86460:Q:all.q at compute-1-0.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734185:1:RESERVING:1381142325:86460:Q:all.q at compute-1-10.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734185:1:RESERVING:1381142325:86460:Q:all.q at compute-1-11.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734185:1:RESERVING:1381142325:86460:Q:all.q at compute-1-5.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734185:1:RESERVING:1381142325:86460:Q:all.q at compute-1-6.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734185:1:RESERVING:1381142325:86460:Q:all.q at compute-1-7.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734186:1:RESERVING:1381228785:86460:Q:all.q at compute-0-1.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734186:1:RESERVING:1381228785:86460:Q:all.q at compute-0-2.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734186:1:RESERVING:1381228785:86460:Q:all.q at compute-0-6.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734186:1:RESERVING:1381228785:86460:Q:all.q at compute-0-7.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734186:1:RESERVING:1381228785:86460:Q:all.q at compute-0-8.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734186:1:RESERVING:1381228785:86460:Q:all.q at compute-0-9.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734186:1:RESERVING:1381228785:86460:Q:all.q at compute-1-0.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734186:1:RESERVING:1381228785:86460:Q:all.q at compute-1-10.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734186:1:RESERVING:1381228785:86460:Q:all.q at compute-1-11.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734186:1:RESERVING:1381228785:86460:Q:all.q at compute-1-5.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734186:1:RESERVING:1381228785:86460:Q:all.q at compute-1-6.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734186:1:RESERVING:1381228785:86460:Q:all.q at compute-1-7.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734187:1:RESERVING:1381315245:86460:Q:all.q at compute-0-1.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734187:1:RESERVING:1381315245:86460:Q:all.q at compute-0-2.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734187:1:RESERVING:1381315245:86460:Q:all.q at compute-0-6.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734187:1:RESERVING:1381315245:86460:Q:all.q at compute-0-7.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734187:1:RESERVING:1381315245:86460:Q:all.q at compute-0-8.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734187:1:RESERVING:1381315245:86460:Q:all.q at compute-0-9.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734187:1:RESERVING:1381315245:86460:Q:all.q at compute-1-0.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734187:1:RESERVING:1381315245:86460:Q:all.q at compute-1-10.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734187:1:RESERVING:1381315245:86460:Q:all.q at compute-1-11.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734187:1:RESERVING:1381315245:86460:Q:all.q at compute-1-5.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734187:1:RESERVING:1381315245:86460:Q:all.q at compute-1-6.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734187:1:RESERVING:1381315245:86460:Q:all.q at compute-1-7.local:slots:1.000000 
>>>>>>>
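
To total up what these RESERVING lines amount to per job, a small awk 
over the same file does the trick; this is only a sketch, assuming the 
default schedule-file location and the MONITOR=1 field layout seen 
above (field 1 = job id, last field = utilization of the listed resource):

    # reserved slots per job in the most recent scheduling interval;
    # split("", slots) clears the tally at every "::::::::" separator
    awk -F: '/^::::/     { split("", slots) }
             /RESERVING/ { slots[$1] += $NF }
             END         { for (j in slots) print j, slots[j] }' \
        /opt/gridengine/default/common/schedule
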
>>>>>>>
>>>>>>>
>>>>>>> Right now, the cluster is using 190 slots out of 320 total. The 
>>>>>>> schedule log says that the 3 waiting jobs from user1 are the 
>>>>>>> only jobs making any kind of reservation, reserving a total of 
>>>>>>> 36 cores. These 3 jobs are effectively blocking 36 already-free 
>>>>>>> slots, because the RQS doesn't allow user1 to use more than 12 
>>>>>>> slots at once. This is not "nice", but I understand that the 
>>>>>>> scheduler has its limitations and cannot predict the future.
>>>>>>>
>>>>>>> Taking into account the running jobs plus the slots & memory 
>>>>>>> locked by the reserving jobs, there is a grand total of 226 
>>>>>>> slots locked, leaving 94 free slots.
>>>>>>>
>>>>>>> Here comes the problem: even though there are 94 free slots and 
>>>>>>> lots of spare memory, NONE of the 4300 waiting jobs is running. 
>>>>>>> There are nodes with 6 free slots and 59 GB of free RAM, but none 
>>>>>>> of the waiting jobs is scheduled. New jobs only start running 
>>>>>>> when one of the 190 slots occupied by running jobs is freed. 
>>>>>>> None of these other waiting jobs requests -R y, -pe or h_rt.
>>>>>>>
>>>>>>>
>>>>>>> Additionally, this is creating some odd behaviour. It seems 
>>>>>>> that, on each scheduler run, it tries to start jobs in those 
>>>>>>> "blocked slots", but fails for no apparent reason. Some of the 
>>>>>>> jobs even try to start twice, but almost none (generally none at 
>>>>>>> all) gets to run:
>>>>>>>
>>>>>>> # tail -2000 schedule | grep -A 1000 "::::::" | grep "Q:all" | 
>>>>>>> grep STARTING | sort
>>>>>>> 2734121:1:STARTING:1381144160:86460:Q:all.q at compute-0-11.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734122:1:STARTING:1381144160:86460:Q:all.q at compute-0-8.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734123:1:STARTING:1381144160:86460:Q:all.q at compute-0-0.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734124:1:STARTING:1381144160:86460:Q:all.q at compute-0-7.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734125:1:STARTING:1381144160:86460:Q:all.q at compute-0-11.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734126:1:STARTING:1381144160:86460:Q:all.q at compute-1-13.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734127:1:STARTING:1381144160:86460:Q:all.q at compute-1-1.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734128:1:STARTING:1381144160:86460:Q:all.q at compute-1-10.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734129:1:STARTING:1381144160:86460:Q:all.q at compute-1-7.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734130:1:STARTING:1381144160:86460:Q:all.q at compute-1-12.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734131:1:STARTING:1381144160:86460:Q:all.q at compute-1-4.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734132:1:STARTING:1381144160:86460:Q:all.q at compute-1-1.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734133:1:STARTING:1381144160:86460:Q:all.q at compute-0-0.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734134:1:STARTING:1381144160:86460:Q:all.q at compute-1-5.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734135:1:STARTING:1381144160:86460:Q:all.q at compute-1-11.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734136:1:STARTING:1381144160:86460:Q:all.q at compute-1-7.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734137:1:STARTING:1381144160:86460:Q:all.q at compute-0-11.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734138:1:STARTING:1381144160:86460:Q:all.q at compute-1-12.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734139:1:STARTING:1381144160:86460:Q:all.q at compute-1-13.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734140:1:STARTING:1381144160:86460:Q:all.q at compute-1-1.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734141:1:STARTING:1381144160:86460:Q:all.q at compute-1-9.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734142:1:STARTING:1381144160:86460:Q:all.q at compute-0-3.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734143:1:STARTING:1381144160:86460:Q:all.q at compute-0-6.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734144:1:STARTING:1381144160:86460:Q:all.q at compute-1-10.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734145:1:STARTING:1381144160:86460:Q:all.q at compute-1-5.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734146:1:STARTING:1381144160:86460:Q:all.q at compute-1-3.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734147:1:STARTING:1381144160:86460:Q:all.q at compute-1-11.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734148:1:STARTING:1381144160:86460:Q:all.q at compute-0-4.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734149:1:STARTING:1381144160:86460:Q:all.q at compute-0-7.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734150:1:STARTING:1381144160:86460:Q:all.q at compute-1-7.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734151:1:STARTING:1381144160:86460:Q:all.q at compute-1-8.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734152:1:STARTING:1381144160:86460:Q:all.q at compute-1-2.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734153:1:STARTING:1381144160:86460:Q:all.q at compute-0-10.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734154:1:STARTING:1381144160:86460:Q:all.q at compute-0-6.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734155:1:STARTING:1381144160:86460:Q:all.q at compute-1-11.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734156:1:STARTING:1381144160:86460:Q:all.q at compute-0-4.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734157:1:STARTING:1381144160:86460:Q:all.q at compute-0-1.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734158:1:STARTING:1381144160:86460:Q:all.q at compute-1-8.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734159:1:STARTING:1381144160:86460:Q:all.q at compute-1-2.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734160:1:STARTING:1381144160:86460:Q:all.q at compute-0-10.local:slots:1.000000 
>>>>>>>
>>>>>>> 2734161:1:STARTING:1381144160:86460:Q:all.q at compute-1-6.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735158:1:STARTING:1381144160:86460:Q:all.q at compute-0-5.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735159:1:STARTING:1381144160:86460:Q:all.q at compute-0-2.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735160:1:STARTING:1381144160:86460:Q:all.q at compute-1-0.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735161:1:STARTING:1381144160:86460:Q:all.q at compute-0-2.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735162:1:STARTING:1381144160:86460:Q:all.q at compute-0-0.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735163:1:STARTING:1381144160:86460:Q:all.q at compute-1-10.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735164:1:STARTING:1381144160:86460:Q:all.q at compute-0-11.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735165:1:STARTING:1381144160:86460:Q:all.q at compute-0-0.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735166:1:STARTING:1381144160:86460:Q:all.q at compute-1-5.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735167:1:STARTING:1381144160:86460:Q:all.q at compute-1-8.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735168:1:STARTING:1381144160:86460:Q:all.q at compute-1-12.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735169:1:STARTING:1381144160:86460:Q:all.q at compute-1-0.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735170:1:STARTING:1381144160:86460:Q:all.q at compute-0-0.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735171:1:STARTING:1381144160:86460:Q:all.q at compute-1-5.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735172:1:STARTING:1381144160:86460:Q:all.q at compute-1-8.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735173:1:STARTING:1381144160:86460:Q:all.q at compute-1-12.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735174:1:STARTING:1381144160:86460:Q:all.q at compute-1-9.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735175:1:STARTING:1381144160:86460:Q:all.q at compute-0-2.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735176:1:STARTING:1381144160:86460:Q:all.q at compute-1-8.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735177:1:STARTING:1381144160:86460:Q:all.q at compute-0-10.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735178:1:STARTING:1381144160:86460:Q:all.q at compute-1-0.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735179:1:STARTING:1381144160:86460:Q:all.q at compute-0-0.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735180:1:STARTING:1381144160:86460:Q:all.q at compute-1-11.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735181:1:STARTING:1381144160:86460:Q:all.q at compute-0-11.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735182:1:STARTING:1381144160:86460:Q:all.q at compute-1-13.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735183:1:STARTING:1381144160:86460:Q:all.q at compute-1-0.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735184:1:STARTING:1381144160:86460:Q:all.q at compute-1-9.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735185:1:STARTING:1381144160:86460:Q:all.q at compute-1-10.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735186:1:STARTING:1381144160:86460:Q:all.q at compute-1-3.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735187:1:STARTING:1381144160:86460:Q:all.q at compute-0-7.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735188:1:STARTING:1381144160:86460:Q:all.q at compute-1-8.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735189:1:STARTING:1381144160:86460:Q:all.q at compute-1-2.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735190:1:STARTING:1381144160:86460:Q:all.q at compute-0-10.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735191:1:STARTING:1381144160:86460:Q:all.q at compute-0-8.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735192:1:STARTING:1381144160:86460:Q:all.q at compute-1-4.local:slots:1.000000 
>>>>>>>
>>>>>>> 2735193:1:STARTING:1381144160:86460:Q:all.q at compute-0-0.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743479:1:STARTING:1381144160:86460:Q:all.q at compute-0-2.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743480:1:STARTING:1381144160:86460:Q:all.q at compute-0-2.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743481:1:STARTING:1381144160:86460:Q:all.q at compute-0-2.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743482:1:STARTING:1381144160:86460:Q:all.q at compute-0-2.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743483:1:STARTING:1381144160:86460:Q:all.q at compute-1-0.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743484:1:STARTING:1381144160:86460:Q:all.q at compute-0-11.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743485:1:STARTING:1381144160:86460:Q:all.q at compute-1-0.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743486:1:STARTING:1381144160:86460:Q:all.q at compute-0-2.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743487:1:STARTING:1381144160:86460:Q:all.q at compute-1-5.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743488:1:STARTING:1381144160:86460:Q:all.q at compute-1-7.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743489:1:STARTING:1381144160:86460:Q:all.q at compute-1-0.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743490:1:STARTING:1381144160:86460:Q:all.q at compute-1-10.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743491:1:STARTING:1381144160:86460:Q:all.q at compute-0-2.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743492:1:STARTING:1381144160:86460:Q:all.q at compute-0-7.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743493:1:STARTING:1381144160:86460:Q:all.q at compute-1-7.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743494:1:STARTING:1381144160:86460:Q:all.q at compute-0-11.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743495:1:STARTING:1381144160:86460:Q:all.q at compute-1-13.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743496:1:STARTING:1381144160:86460:Q:all.q at compute-0-8.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743497:1:STARTING:1381144160:86460:Q:all.q at compute-1-9.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743498:1:STARTING:1381144160:86460:Q:all.q at compute-0-2.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743499:1:STARTING:1381144160:86460:Q:all.q at compute-1-10.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743500:1:STARTING:1381144160:86460:Q:all.q at compute-0-7.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743501:1:STARTING:1381144160:86460:Q:all.q at compute-1-7.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743502:1:STARTING:1381144160:86460:Q:all.q at compute-1-2.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743503:1:STARTING:1381144160:86460:Q:all.q at compute-1-13.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743504:1:STARTING:1381144160:86460:Q:all.q at compute-1-0.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743505:1:STARTING:1381144160:86460:Q:all.q at compute-1-10.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743506:1:STARTING:1381144160:86460:Q:all.q at compute-1-5.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743507:1:STARTING:1381144160:86460:Q:all.q at compute-1-7.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743508:1:STARTING:1381144160:86460:Q:all.q at compute-1-2.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743509:1:STARTING:1381144160:86460:Q:all.q at compute-1-12.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743510:1:STARTING:1381144160:86460:Q:all.q at compute-0-8.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743511:1:STARTING:1381144160:86460:Q:all.q at compute-1-4.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743512:1:STARTING:1381144160:86460:Q:all.q at compute-1-9.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743513:1:STARTING:1381144160:86460:Q:all.q at compute-1-5.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743514:1:STARTING:1381144160:86460:Q:all.q at compute-0-7.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743515:1:STARTING:1381144160:86460:Q:all.q at compute-1-8.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743516:1:STARTING:1381144160:86460:Q:all.q at compute-1-2.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743517:1:STARTING:1381144160:86460:Q:all.q at compute-0-10.local:slots:1.000000 
>>>>>>>
>>>>>>> 2743518:1:STARTING:1381144160:86460:Q:all.q at compute-0-8.local:slots:1.000000 
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Even though these jobs are listed here as "STARTING", they are 
>>>>>>> not running at all. They just produce a new "STARTING" message 
>>>>>>> on every scheduling interval.
>>>>>>>
>>>>>>> Why are the reservations blocking a third of the cluster? It 
>>>>>>> shouldn't be a backfilling issue; they are blocking three times 
>>>>>>> the number of slots actually reserved. And why can't the 
>>>>>>> "STARTING" jobs run?
>>>>>>>
>>>>>>> Txema
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 07/10/13 09:28, Christian Krause wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> We solved it by setting `h_rt` to FORCED in the complex list:
>>>>>>>>
>>>>>>>> #name   shortcut  type  relop  requestable  consumable  default  urgency
>>>>>>>> #-----------------------------------------------------------------------
>>>>>>>> h_rt    h_rt      TIME  <=     FORCED       YES         0:0:0    0
>>>>>>>>
>>>>>>>> And we have a JSV rejecting jobs that don't request it (because 
>>>>>>>> they would otherwise be pending indefinitely, unless you have a 
>>>>>>>> default duration or use qalter).
>>>>>>>>
>>>>>>>> You could also use a JSV to enforce that only jobs with larger 
>>>>>>>> resource requests (in your case, more than some number of 
>>>>>>>> slots) are able to request a reservation, e.g.:
>>>>>>>>
>>>>>>>>      # pseudo JSV code
>>>>>>>>      SLOT_RESERVATION_THRESHOLD=...
>>>>>>>>      if slots < SLOT_RESERVATION_THRESHOLD then
>>>>>>>>          "disable reservation / reject"
>>>>>>>>      else
>>>>>>>>          "enable reservation"
>>>>>>>>      fi
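
A concrete server-side JSV along those lines could look like the sketch 
below. It is only an illustration: it assumes the helper functions from 
the stock $SGE_ROOT/util/resources/jsv/jsv_include.sh, and the threshold 
of 4 slots is an arbitrary example value.

    #!/bin/sh
    # Only jobs requesting at least SLOT_RESERVATION_THRESHOLD slots
    # keep (or get) a reservation; smaller jobs have it switched off.

    SLOT_RESERVATION_THRESHOLD=4      # example value, tune to taste

    jsv_on_start()
    {
       return
    }

    jsv_on_verify()
    {
       pe_max=`jsv_get_param pe_max`
       if [ -z "$pe_max" ]; then
          pe_max=1                     # no PE requested -> serial job
       fi

       if [ "$pe_max" -lt "$SLOT_RESERVATION_THRESHOLD" ]; then
          jsv_set_param R n            # small job: no reservation
       else
          jsv_set_param R y            # large job: reserve
       fi

       jsv_correct "Job was modified by the reservation-policy JSV"
       return
    }

    . ${SGE_ROOT}/util/resources/jsv/jsv_include.sh
    jsv_main
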
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Oct 04, 2013 at 04:25:29PM +0200, Txema Heredia wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I have a 27-node cluster. Currently all 320 slots are filled, 
>>>>>>>>> all by jobs requesting a single slot.
>>>>>>>>>
>>>>>>>>> At the top of my waiting queue there are 28 different jobs 
>>>>>>>>> requesting 3 to 12 cores using two different parallel 
>>>>>>>>> environments. All of these jobs request -R y. They are being 
>>>>>>>>> ignored and overrun by the myriad of 1-slot jobs behind them 
>>>>>>>>> in the waiting queue.
>>>>>>>>>
>>>>>>>>> I have enabled the scheduler logging. During the last 4 hours 
>>>>>>>>> it has logged 724 new jobs starting, across all 27 nodes. Not 
>>>>>>>>> a single job on the system requests -l h_rt, yet single-core 
>>>>>>>>> jobs keep being scheduled and all the parallel jobs are starving.
>>>>>>>>>
>>>>>>>>> As far as I understand, backfilling is killing my reservations 
>>>>>>>>> even though no job requests any kind of runtime; yet if I set 
>>>>>>>>> "default_duration" to INFINITY, all the RESERVING log messages 
>>>>>>>>> disappear.
>>>>>>>>>
>>>>>>>>> Additionally, for some odd reason, I only see RESERVING 
>>>>>>>>> messages for jobs requesting a fixed number of slots 
>>>>>>>>> (-pe whatever N). Jobs requesting a slot range 
>>>>>>>>> (-pe threaded 4-10) seem to reserve nothing.
>>>>>>>>>
>>>>>>>>> My scheduler configuration is as follows:
>>>>>>>>>
>>>>>>>>> # qconf -ssconf
>>>>>>>>> algorithm                         default
>>>>>>>>> schedule_interval                 0:0:5
>>>>>>>>> maxujobs                          0
>>>>>>>>> queue_sort_method                 load
>>>>>>>>> job_load_adjustments              np_load_avg=0.50
>>>>>>>>> load_adjustment_decay_time        0:7:30
>>>>>>>>> load_formula                      np_load_avg
>>>>>>>>> schedd_job_info                   true
>>>>>>>>> flush_submit_sec                  0
>>>>>>>>> flush_finish_sec                  0
>>>>>>>>> params                            MONITOR=1
>>>>>>>>> reprioritize_interval             0:0:0
>>>>>>>>> halftime                          168
>>>>>>>>> usage_weight_list                 cpu=0.187000,mem=0.116000,io=0.697000
>>>>>>>>> compensation_factor               5.000000
>>>>>>>>> weight_user                       0.250000
>>>>>>>>> weight_project                    0.250000
>>>>>>>>> weight_department                 0.250000
>>>>>>>>> weight_job                        0.250000
>>>>>>>>> weight_tickets_functional         1000000000
>>>>>>>>> weight_tickets_share              1000000000
>>>>>>>>> share_override_tickets            TRUE
>>>>>>>>> share_functional_shares           TRUE
>>>>>>>>> max_functional_jobs_to_schedule   200
>>>>>>>>> report_pjob_tickets               TRUE
>>>>>>>>> max_pending_tasks_per_job         50
>>>>>>>>> halflife_decay_list               none
>>>>>>>>> policy_hierarchy                  OSF
>>>>>>>>> weight_ticket                     0.010000
>>>>>>>>> weight_waiting_time               0.000000
>>>>>>>>> weight_deadline                   3600000.000000
>>>>>>>>> weight_urgency                    0.100000
>>>>>>>>> weight_priority                   1.000000
>>>>>>>>> max_reservation                   50
>>>>>>>>> default_duration                  24:00:00
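
In case it is useful, the two reservation-related parameters can also 
be changed non-interactively; a minimal sketch (the temporary file name 
is arbitrary):

    # dump the scheduler configuration, adjust max_reservation and/or
    # default_duration in the file, then load it back
    qconf -ssconf > /tmp/sched.conf
    # ... edit /tmp/sched.conf ...
    qconf -Msconf /tmp/sched.conf
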
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have also tested it with params PROFILE=1 and default_duration 
>>>>>>>>> INFINITY. But when I set that, not a single reservation is 
>>>>>>>>> logged in /opt/gridengine/default/common/schedule and new jobs 
>>>>>>>>> keep starting.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> What am I missing? Is it possible to kill the backfilling? Are my
>>>>>>>>> reservations really working?
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>>
>>>>>>>>> Txema
>



