[gridengine users] Fwd: Fwd: backfilling, s_rt

baf035 baf035 at gmail.com
Wed Nov 30 11:28:36 UTC 2011


On 30 November 2011 12:18, Reuti <reuti at staff.uni-marburg.de> wrote:

On 30.11.2011 at 12:12, baf035 wrote:
>
> > Yes, SIGUSR1 is globally blocked; a line "signal (SIGUSR1, SIG_IGN);"
> > was added inside a GE starter method.
> > What I find odd is that the execd messages file on the master node
> > records an exceeded hard wallclock time:
> >
> > "11/29/2011 14:06:33|  main|n15|W|job 112781.1 exceeded hard wallclock time - initiate terminate method"
> >
> > which contradicts the record in the messages file on the slave:
> >  "11/29/2011 14:06:27|  main|n1|W|job 112781.1 exceeded soft wallclock time - initiate soft notify method"
>
> There is a 5 second delay?
>
> Strange, I did not analyze it ..

>
> > Note that the request s_rt=<time> was given on the qsub command line.
>
> You specified h_rt and s_rt in qsub and both are different?
>
I have set only s_rt for the job. Neither s_rt nor h_rt is configured at the
queue level.

baf035

>
> -- Reuti
>
>
> > baf035
> >
> >
> > On 29 November 2011 18:26, Reuti <reuti at staff.uni-marburg.de> wrote:
> > On 29.11.2011 at 17:37, baf035 wrote:
> >
> > > Hello Reuti,
> > >
> > > thanks for the clarification. I probably misunderstood the man pages and
> > > used both parameters, s_rt and -notify, together before.
> > >
> > > Our goal is to configure accurately working backfilling. For s_rt we use
> > > the mean running time of finished jobs, based on their category.
> > > The queue configuration has a large value for the notify time.
> > >
> > > When s_rt expires, the job is signaled but not killed, thanks to the
> > > notify time. We get this entry in the execd messages file:
> > > "11/29/2011 11:11:28|  main|service65|W|job 112756.1 exceeded soft wallclock time - initiate soft notify method"
> > > The job finishes correctly.
> > > This holds for a single-core job or for a parallel job running
> > > inside only one node.
> > > Bigger parallel jobs, however, are killed directly after s_rt
> > > expires. It seems that GE treats s_rt as h_rt?!
> > > execd messages (master node):
> > > "11/29/2011 14:06:33|  main|n15|W|job 112781.1 exceeded hard wallclock time - initiate terminate method"
> >
> > Do you trap the warning signal in all ranks? If you have only one node,
> > threads will be used. But on the slave nodes, a new process is
> > created, which must ignore the signal.
> >
> > -- Reuti
> >
> >
> > > qmaster messages
> > > "11/29/2011 14:06:34|worker|sged3|E|master task of job 112781.1 failed - killing job
> > > 11/29/2011 14:06:35|worker|ged3|W|job 112781.1 failed on host n15 assumedly after job because: job 112781.1 died through signal KILL (9)"
> > >
> > > execd strace:
> > > "17649 open("/<path>/sge_spool//n15/messages", O_WRONLY|O_CREAT|O_APPEND, 0666) = 5
> > > 17649 write(5, "11/29/2011 14:06:33|  main|r8i1n"..., 107) = 107
> > > 17649 close(5)                          = 0
> > > 17649 kill(2152, SIGTSTP)               = 0
> > > 17649 futex(0x7f705cd06e04, 0x189 /* FUTEX_??? */, 4655823, {1322571994, 846698000}, ffffffff <unfinished ...>
> > > 17650 <... futex resumed> )             = -1 ETIMEDOUT (Connection timed out)
> > > 17650 futex(0x7f705cd068f0, FUTEX_WAKE_PRIVATE, 1) = 0
> > > 17650 futex(0x7f705cd06954, 0x189 /* FUTEX_??? */, 4332213, {1322571994, 891528000}, ffffffff <unfinished ...>
> > > 17653 <... futex resumed> )             = -1 ETIMEDOUT (Connection timed out)
> > > 17653 futex(0x7f705cd07100, FUTEX_WAKE_PRIVATE, 1) = 0
> > > 17653 futex(0x7f705cd07164, 0x189 /* FUTEX_??? */, 4440135, {1322571995, 703424000}, ffffffff <unfinished ...>
> > > 17649 <... futex resumed> )             = ? ERESTART_RESTARTBLOCK (To be restarted)
> > > 17649 --- SIGCHLD (Child exited) @ 0 (0) ---
> > > 17649 rt_sigreturn(0x11)                = -1 EINTR (Interrupted system call)"
> > >
> > > Do you have an idea what could be wrong?
> > >
> > > Thanks
> > > Best regards
> > >
> > > baf035
> > >
> > >
> > >
> > >
> > > On 22 November 2011 16:16, Reuti <reuti at staff.uni-marburg.de> wrote:
> > > Hi,
> > >
> > > On 22.11.2011 at 12:58, baf035 wrote:
> > >
> > > >
> > > > Hello all.
> > > >
> > > > 1) When the s_rt time is reached, a job is signaled with SIGUSR1 by
> > > > default. One of our applications dies from this signal.
> > > > In the global configuration I set the execd_params parameter
> > > > NOTIFY_KILL=none or NOTIFY_KILL=SIGCONT,
> > > > but without effect. Applications are still signaled with SIGUSR1. How
> > > > can I change this behavior?
> > >
> > > these are two different things.
> > >
> > > a) when you define s_rt, SIGUSR1 will be sent at the specified time -
> > > there is no way to avoid this. If you don't want it, don't define s_rt.
> > >
> > > b) the notification by -notify will be sent the specified time before
> > > the final kill/suspend. These notifications can be redefined by the
> > > commands you used. If you don't want it, don't use -notify at submission
> > > time.
> > >
> > > NB: The signal will be sent to the complete process group, i.e. the job
> > > script and the called binary. If you want an action to happen on
> > > SIGUSR1, you need a properly set "trap" for the shell and/or proper signal
> > > handling in the binary. The default action for SIGUSR1 is to terminate:
> > > http://kernel.org/doc/man-pages/online/pages/man7/signal.7.html So you
> > > just face the default behavior.
> > >
> > >
> > > > 2) Is there a way to reduce the warning messages "job <JID>.1 should have
> > > > finished since <num> s" in the qmaster messages file
> > > > when the s_rt time is exceeded and the notify time is running?
> > >
> > > Not that I'm aware of. But in my experience, using s_rt and -notify at
> > > the same time doesn't work nicely together. I suggest using only
> > > one of them at a time. Do you have a two-stage warning in mind that you
> > > want to implement? Using both, the signal chain would be: 1) s_rt (SIGUSR1)
> > > warning, 2) -notify defined warning before kill, 3) kill
> > >
> > > -- Reuti
> > >
> > >
> > > > Our production GE is SoGE 8.0.0a.
> > > >
> > > > Thanks for help. Best regards
> > > >
> > > > baf035
> > > >
> > > > _______________________________________________
> > > > users mailing list
> > > > users at gridengine.org
> > > > https://gridengine.org/mailman/listinfo/users
> > >
> > >
> >
> >
>
>
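
Reuti's note in the thread, that the job script needs a properly set "trap"
so that SIGUSR1 does not hit its default terminate action, could be sketched
roughly as below. This is only an illustration under stated assumptions, not
a tested Grid Engine setup: "sleep 1" is a hypothetical stand-in for the real
application, and the "kill -USR1" line merely simulates the s_rt warning for
a single child, whereas Grid Engine signals the whole process group.

```shell
#!/bin/sh
# Ignore SIGUSR1 in the job script. An empty trap action sets the signal
# to "ignore", and an ignored signal stays ignored in processes started
# from this shell afterwards.
trap '' USR1

# Hypothetical stand-in for the real application. It inherits the
# "ignore" disposition for SIGUSR1 from the job script.
sleep 1 &
child=$!

# Simulate the s_rt warning (assumption: in a real job the execd sends
# SIGUSR1 to the whole process group, not just to this one child).
kill -USR1 "$child"

# Collect the exit status of the application.
wait "$child"
status=$?
echo "application exit status: $status"
```

If the signal is ignored as intended, the stand-in application should run to
completion and report exit status 0. A binary that installs its own SIGUSR1
handler, or that is started on a slave node outside this script, is not
covered by this sketch and would need its own handling.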