[gridengine users] Some strange interaction of PE and RQS

Reuti reuti at staff.uni-marburg.de
Fri Apr 20 19:34:07 UTC 2018


Hi,

On 20.04.2018 at 21:04, Ilya M wrote:

> Hello,
> 
> I set up a test queue to test new prolog/epilog scripts, and I am seeing some strange behavior when I submit a PE job to this queue: the job is not scheduled at all, or only after a very long time. I tried several PEs with allocation rules of '1', '2', and '4', all to no avail. Submitting a job without a PE makes it run immediately. I am using SGE 2.6u5.
> 
> Checking why it is not running:
> $ qalter -w v 7301747
> ...
> Job 7301747 cannot run because it exceeds limit "ilya/////" in rule "limit_slots_for_users/1"
> Job 7301747 cannot run in PE "pe_1" because it only offers 0 slots

This error message is often misleading, although there is still a real reason that prevents the job from being scheduled.

> verification: no suitable queues
> 
> $ qconf -sp pe_1
> pe_name            pe_1
> slots              9999999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    startmpi.sh $pe_hostfile
> stop_proc_args     stopmpi.sh $pe_hostfile
> allocation_rule    1
> control_slaves     TRUE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary FALSE
> 
> $ qconf -srqs limit_slots_for_users
> {
>    name         limit_slots_for_users
>    description  "limit the number of simultaneous slots any user can use"
>    enabled      TRUE
>    limit        users {*} to slots=800
> }
> 
> And finally, 
> $ qstat
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
> -----------------------------------------------------------------------------------------------------------------
> 7301584 0.60051 sleep      ilya        qw    04/20/2018 18:29:26                                    4        
> 7301747 0.50051 sleep      ilya        qw    04/20/2018 18:36:23                                    1        
> 
> So I am not running anything at the moment. If I submit a job with the same PE to a production queue, it will get scheduled.
> 
> A job that I left hanging last night, finally got scheduled after 7-8 hours.
> 
> The test queue is as follows:
> qconf -sq test_gpu.q
> qname                 test_gpu.q
> hostlist              @gpu

How many hosts are in @gpu? An allocation_rule of 1 means exactly one slot per machine. The rule is not applied repeatedly until the node is full (this is different from Torque, where it can be assigned several times per host).
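
To make the arithmetic concrete, here is a minimal sketch. pe_fits is a hypothetical helper, not an SGE command, and it models only two things: with an integer allocation_rule the requested slot count must divide evenly by the rule, and every participating host contributes exactly that many slots. It ignores queue slot limits, load thresholds and resource quotas. The host count is an input you would have to look up yourself, e.g. from the output of qconf -shgrp @gpu.

```sh
# Hypothetical helper (not part of SGE): check whether a request can be
# satisfied under a fixed integer allocation_rule.
# usage: pe_fits <requested_slots> <allocation_rule> <usable_hosts>
pe_fits() {
    requested=$1
    rule=$2
    hosts=$3

    # Each host contributes exactly $rule slots, so the request must be
    # a multiple of the rule or the job can never be scheduled.
    if [ $((requested % rule)) -ne 0 ]; then
        echo "no: $requested is not a multiple of $rule"
        return 1
    fi

    # Number of distinct hosts the job has to span.
    needed=$((requested / rule))
    if [ "$needed" -gt "$hosts" ]; then
        echo "no: needs $needed hosts, only $hosts available"
        return 1
    fi

    echo "yes: uses $needed host(s)"
}
```

For example, assuming only two usable hosts, pe_fits 4 1 2 reports that four hosts are needed, while pe_fits 4 2 2 fits on two hosts.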


> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               make pe_1 pe_2 pe_3 pe_4 pe_slots
> rerun                 TRUE
> slots                 4
> tmpdir                /data
> shell                 /bin/sh
> prolog                sgegrid@prolog.sh
> epilog                sgegrid@epilog.sh
> shell_start_mode      unix_behavior
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      custom_kill -p $job_pid -j $job_id

I don't know about your custom_kill procedure, but it should kill -$job_pid, i.e. the process group and not only a single process.
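
As a minimal sketch of what I mean (kill_group is a hypothetical name and this is not your custom_kill script, whose contents I have not seen): a negative PID passed to kill addresses the whole process group. This assumes the job's process group ID equals $job_pid, i.e. the job script is the group leader.

```sh
# Hypothetical stand-in for a terminate_method script body.
# usage: kill_group <job_pid>
kill_group() {
    pid=$1
    # A negative PID targets the whole process group instead of a single
    # process. "--" stops option parsing so "-<pid>" is not mistaken for
    # an option or a signal name.
    kill -s TERM -- "-$pid"
}
```

In the queue configuration, such a script would receive $job_pid as its argument, in the same way your current terminate_method line passes it via -p.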

- Reuti


