[gridengine users] Some strange interaction of PE and RQS

Ilya M 4ilya.m+grid at gmail.com
Fri Apr 20 20:55:28 UTC 2018


Hi Reuti,

There are dozens of hosts in @gpu. In my test submissions, however, I am
using only one host that I specify with '-l hostname='. I disabled all
other queues on this host to make sure nothing else but my test jobs are
running there.

BTW, after several hours, my PE 1 job went through. My submissions to
the regular queue worked fine.


Update: As I was writing this response, I tried one change in the queue
configuration: I created a new host group with only one node in it and
changed my test queue to only run on that hostgroup. I submitted a couple
of PE jobs with allocation rules '1', '2', '4', and did not request a
specific hostname this time. The jobs started running immediately, and the
old jobs that had been waiting also went through.
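
For anyone who wants to repeat the host group change described above, here is a rough sketch of how it could be scripted. The group name @testgpu and the host node001 are placeholders that do not come from this thread; only test_gpu.q does. It assumes the usual two-field host group file that 'qconf -shgrp' prints, and it only prints the file when qconf is not installed.

```shell
#!/bin/sh
# Placeholder names: @testgpu and node001 are NOT from the thread.
GROUP=${GROUP:-@testgpu}
NODE=${NODE:-node001}
QUEUE=${QUEUE:-test_gpu.q}

write_hgrp_file() {
    # $1 = output file; same two fields that "qconf -shgrp" prints
    printf 'group_name %s\nhostlist %s\n' "$GROUP" "$NODE" > "$1"
}

tmp=$(mktemp)
write_hgrp_file "$tmp"

if command -v qconf >/dev/null 2>&1; then
    qconf -Ahgrp "$tmp"                            # add the host group from the file
    qconf -rattr queue hostlist "$GROUP" "$QUEUE"  # replace the queue's hostlist
else
    cat "$tmp"                                     # no SGE here: just show the file
fi
rm -f "$tmp"
```

Using -rattr rather than -aattr replaces the whole hostlist, so the queue ends up bound to the single-node group only.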

After discovering that, I tested the normal production queue, combining '-l
hostname=' and '-pe'. These jobs did not run, and 'qalter -w v' reported
'cannot run because it exceeds limit "ilya/////" in rule "limit_slots_for_users/1"'.

So in my cluster, the combination of RQS, PE, and '-l hostname=' seems to
make jobs unschedulable. I wonder if anyone else can reproduce this
behavior, to see whether this is an SGE bug or some problem in my
configuration.
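
A possible way to reproduce this is to submit the same trivial job with and without a PE, and with and without a pinned host, then compare what 'qalter -w v' says for each. The sketch below is only an illustration: node001 is a placeholder host name, pe_1 is the PE from this thread, and by default the script just prints the commands instead of submitting anything.

```shell
#!/bin/sh
# Sketch: print (or run) the four submissions that separate the variables.
TEST_HOST=${TEST_HOST:-node001}   # hypothetical host name, not from the thread
DRY_RUN=${DRY_RUN:-1}             # default: only print the commands

submit_cmd() {
    # $1: "pe" or "nope"; $2: "host" or "nohost"
    cmd="qsub -terse -b y"
    if [ "$1" = "pe" ]; then cmd="$cmd -pe pe_1 1"; fi
    if [ "$2" = "host" ]; then cmd="$cmd -l hostname=$TEST_HOST"; fi
    echo "$cmd sleep 60"
}

for pe in nope pe; do
    for host in nohost host; do
        c=$(submit_cmd "$pe" "$host")
        if [ "$DRY_RUN" = 1 ]; then
            echo "$c"
        else
            # -terse makes qsub print only the job ID, which qalter can verify
            job=$($c) && qalter -w v "$job"
        fi
    done
done
```

If only the "pe" plus "host" case is reported as exceeding the RQS limit, that would match the behavior described here.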

Ilya.


On Fri, Apr 20, 2018 at 12:34 PM, Reuti <reuti at staff.uni-marburg.de> wrote:

> Hi,
>
> Am 20.04.2018 um 21:04 schrieb Ilya M:
>
> > Hello,
> >
> > I set up a test queue to test new prolog/epilog scripts, and I am
> > seeing some strange behavior when I submit a PE job to this queue: the
> > job either never gets scheduled or waits a very long time. I tried
> > several PEs with allocation rules of '1', '2', '4', all to no avail.
> > Submitting a job without a PE makes it run immediately. I am using
> > SGE 6.2u5.
> >
> > Checking why it is not running:
> > $ qalter -w v 7301747
> > ...
> > Job 7301747 cannot run because it exceeds limit "ilya/////" in rule "limit_slots_for_users/1"
> > Job 7301747 cannot run in PE "pe_1" because it only offers 0 slots
>
> This error message is often misleading, although there is a real reason
> preventing the scheduling.
>
> > verification: no suitable queues
> >
> > $ qconf -sp pe_1
> > pe_name            pe_1
> > slots              9999999
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    startmpi.sh $pe_hostfile
> > stop_proc_args     stopmpi.sh $pe_hostfile
> > allocation_rule    1
> > control_slaves     TRUE
> > job_is_first_task  TRUE
> > urgency_slots      min
> > accounting_summary FALSE
> >
> > $ qconf -srqs limit_slots_for_users
> > {
> >    name         limit_slots_for_users
> >    description  "limit the number of simultaneous slots any user can use"
> >    enabled      TRUE
> >    limit        users {*} to slots=800
> > }
> >
> > And finally,
> > $ qstat
> > job-ID  prior   name       user         state submit/start at     queue  slots ja-task-ID
> > -----------------------------------------------------------------------------------------
> > 7301584 0.60051 sleep      ilya         qw    04/20/2018 18:29:26        4
> > 7301747 0.50051 sleep      ilya         qw    04/20/2018 18:36:23        1
> >
> > So I am not running anything at the moment. If I submit a job with the
> > same PE to a production queue, it will get scheduled.
> >
> > A job that I left hanging last night finally got scheduled after 7-8
> > hours.
> >
> > The test queue is as follows:
> > qconf -sq test_gpu.q
> > qname                 test_gpu.q
> > hostlist              @gpu
>
> How many hosts are in @gpu? The allocation_rule 1 means exactly one slot
> per machine, not one slot handed out repeatedly until the node is full
> (this is different from Torque, where this can be assigned several times
> per host).
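
To make the point about integer allocation rules concrete: with a fixed rule, the number of hosts a job needs is the slot request divided by the rule. The helper below is a hypothetical illustration, not an SGE tool, and it assumes that a request which is not a whole multiple of the rule never starts.

```shell
#!/bin/sh
hosts_needed() {
    # $1 = slots requested with -pe, $2 = integer allocation_rule
    if [ $(( $1 % $2 )) -ne 0 ]; then
        echo "unschedulable"   # assumption: non-multiples never start
    else
        echo $(( $1 / $2 ))
    fi
}

hosts_needed 4 1   # rule 1: four slots need four different hosts
hosts_needed 4 4   # rule 4: four slots fit on a single host
```

With a single host pinned via '-l hostname=', only requests that come out as one host can fit, for example one slot in pe_1 or four slots in a PE with allocation_rule 4.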
>
>
> > seq_no                0
> > load_thresholds       np_load_avg=1.75
> > suspend_thresholds    NONE
> > nsuspend              1
> > suspend_interval      00:05:00
> > priority              0
> > min_cpu_interval      00:05:00
> > processors            UNDEFINED
> > qtype                 BATCH INTERACTIVE
> > ckpt_list             NONE
> > pe_list               make pe_1 pe_2 pe_3 pe_4 pe_slots
> > rerun                 TRUE
> > slots                 4
> > tmpdir                /data
> > shell                 /bin/sh
> > prolog                sgegrid@prolog.sh
> > epilog                sgegrid@epilog.sh
> > shell_start_mode      unix_behavior
> > starter_method        NONE
> > suspend_method        NONE
> > resume_method         NONE
> > terminate_method      custom_kill -p $job_pid -j $job_id
>
> I don't know about your custom_kill procedure, but it should kill
> -$job_pid, i.e. the process group and not only a single process.
>
> - Reuti
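
As an illustration of the process-group remark, a terminate script along the following lines would signal the whole group rather than a single process. This is only a sketch: the real custom_kill script is not shown in this thread, so the function name, the default signal, and the logging line are assumptions. Only the -p and -j options come from the terminate_method line above.

```shell
#!/bin/sh
# Hypothetical stand-in for custom_kill; the real script is not shown in the thread.
kill_group() {
    # $1 = process group leader PID (SGE's $job_pid), $2 = signal name (default TERM)
    pid=$1
    sig=${2:-TERM}
    # The leading "-" addresses the whole process group; "--" ends option parsing.
    kill -s "$sig" -- "-$pid"
}

# Parse the -p/-j options used in the terminate_method line.
job_pid=""
job_id=""
while getopts p:j: opt; do
    case $opt in
        p) job_pid=$OPTARG ;;
        j) job_id=$OPTARG ;;
        *) exit 2 ;;
    esac
done

if [ -n "$job_pid" ]; then
    echo "terminating job ${job_id:-unknown}, process group $job_pid" >&2
    kill_group "$job_pid" TERM
fi
```

Signalling "-$job_pid" reaches children that the job script started, which a plain "kill $job_pid" would leave running.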