[gridengine users] Fwd: backfilling, s_rt

Reuti reuti at staff.uni-marburg.de
Wed Nov 30 11:18:09 UTC 2011


On 30.11.2011 at 12:12, baf035 wrote:

> Yes, SIGUSR1 is globally blocked; the line "signal (SIGUSR1, SIG_IGN);" was added inside a GE starter method.
> What I find odd is that the execd messages file on the master node records an exceeded hard wallclock time:
>  
> "11/29/2011 14:06:33|  main|n15|W|job 112781.1 exceeded hard wallclock time - initiate terminate method"
> 
> which contradicts the record in the messages file of the slave:
>  "11/29/2011 14:06:27|  main|n1|W|job 112781.1 exceeded soft wallclock time - initiate soft notify method"

There is a 5 second delay?


> Note that the request s_rt=<time> was set on the qsub command.

Did you specify both h_rt and s_rt in qsub, with different values?

-- Reuti


> baf035
> 
> 
> On 29 November 2011 at 18:26, Reuti <reuti at staff.uni-marburg.de> wrote:
> On 29.11.2011 at 17:37, baf035 wrote:
> 
> > Hello Reuti,
> >
> > thanks for the clarification. I probably misunderstood the man pages and previously used both parameters, s_rt and -notify, together.
> >
> > Our goal is to configure backfilling that works precisely. For s_rt we use the mean running time of finished jobs, based on their category.
> > The queue configuration has a large value for the notify time.
> >
> > When s_rt expires, the job is signaled but not killed, thanks to the notify time. The execd messages file reports:
> > "11/29/2011 11:11:28|  main|service65|W|job 112756.1 exceeded soft wallclock time - initiate soft notify method"
> > The job finishes correctly.
> > This holds for a single-core job or for a parallel job running inside only one node.
> > Bigger parallel jobs, however, are killed directly after s_rt expires. It seems that GE treats s_rt as h_rt?!
> > execd messages (master node):
> > "11/29/2011 14:06:33|  main|n15|W|job 112781.1 exceeded hard wallclock time - initiate terminate method"
> 
> Do you trap the warning signal in all ranks? If you have only one node, threads will be used. But on the slave nodes a new process is created, which must ignore the signal.
> 
> -- Reuti
> 
> 
> > qmaster messages
> > "11/29/2011 14:06:34|worker|sged3|E|master task of job 112781.1 failed - killing job
> > 11/29/2011 14:06:35|worker|ged3|W|job 112781.1 failed on host n15 assumedly after job because: job 112781.1 died through signal KILL (9)"
> >
> > execd strace:
> > "17649 open("/<path>/sge_spool//n15/messages", O_WRONLY|O_CREAT|O_APPEND, 0666) = 5
> > 17649 write(5, "11/29/2011 14:06:33|  main|r8i1n"..., 107) = 107
> > 17649 close(5)                          = 0
> > 17649 kill(2152, SIGTSTP)               = 0
> > 17649 futex(0x7f705cd06e04, 0x189 /* FUTEX_??? */, 4655823, {1322571994, 846698000}, ffffffff <unfinished ...>
> > 17650 <... futex resumed> )             = -1 ETIMEDOUT (Connection timed out)
> > 17650 futex(0x7f705cd068f0, FUTEX_WAKE_PRIVATE, 1) = 0
> > 17650 futex(0x7f705cd06954, 0x189 /* FUTEX_??? */, 4332213, {1322571994, 891528000}, ffffffff <unfinished ...>
> > 17653 <... futex resumed> )             = -1 ETIMEDOUT (Connection timed out)
> > 17653 futex(0x7f705cd07100, FUTEX_WAKE_PRIVATE, 1) = 0
> > 17653 futex(0x7f705cd07164, 0x189 /* FUTEX_??? */, 4440135, {1322571995, 703424000}, ffffffff <unfinished ...>
> > 17649 <... futex resumed> )             = ? ERESTART_RESTARTBLOCK (To be restarted)
> > 17649 --- SIGCHLD (Child exited) @ 0 (0) ---
> > 17649 rt_sigreturn(0x11)                = -1 EINTR (Interrupted system call)"
> >
> > Do you have an idea what could be wrong?
> >
> > Thanks
> > Best regards
> >
> > baf035
> >
> >
> >
> >
> > On 22 November 2011 at 16:16, Reuti <reuti at staff.uni-marburg.de> wrote:
> > Hi,
> >
> > On 22.11.2011 at 12:58, baf035 wrote:
> >
> > >
> > > Hello all.
> > >
> > > 1) When the s_rt time is reached, a job is signaled with SIGUSR1 by default. One of our applications dies from this signal.
> > > In the global configuration I set the execd_params parameter NOTIFY_KILL=none or NOTIFY_KILL=SIGCONT,
> > > but without effect. Applications are still signaled with SIGUSR1. How can I change this behavior?
> >
> > these are two different things.
> >
> > a) when you define s_rt, SIGUSR1 will be sent at the specified time - there is no way to avoid this. If you don't want it, don't define s_rt.
> >
> > b) the notification by -notify will be sent the specified time before the final kill/suspend. These notifications can be redefined by the parameters you used. If you don't want it, don't use -notify at submission time.
> >
> > NB: The signal will be sent to the complete process group, i.e. the job script and the called binary. If you want an action to happen on SIGUSR1, you need a properly set "trap" in the shell and/or proper signal handling in the binary. The default action for SIGUSR1 is to terminate: http://kernel.org/doc/man-pages/online/pages/man7/signal.7.html So you are just seeing the default behavior.
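
For the binary side, a minimal C sketch of "proper signal handling" might look as follows. It replaces the default SIGUSR1 action (terminate) with a handler that only sets a flag, which the program's main loop can poll in order to checkpoint or finish early. All function and variable names are hypothetical; what the application does once the flag is set is up to it.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler; the main loop polls it to wind down cleanly. */
static volatile sig_atomic_t soft_limit_hit = 0;

static void on_sigusr1(int signo)
{
    (void)signo;
    soft_limit_hit = 1; /* only async-signal-safe work belongs here */
}

/* Hypothetical helper: replace the default SIGUSR1 action with the
   handler above. Returns 0 on success, -1 on failure. */
int install_soft_limit_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART; /* restart interrupted system calls */
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Hypothetical helper: nonzero once the warning signal has arrived. */
int soft_limit_reached(void)
{
    return soft_limit_hit != 0;
}
```

A compute loop would call install_soft_limit_handler() once at startup and check soft_limit_reached() between work units.
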
> >
> >
> > > 2) Is there a way to reduce the warning messages "job <JID>.1 should have finished since <num> s" in the qmaster messages file
> > > when the s_rt time is exceeded and the notify time is still running?
> >
> > Not that I'm aware of. But in my experience, using s_rt and -notify at the same time doesn't work well. I suggest using only one of them at a time. Do you have a two-stage warning in mind that you want to implement? Using both, the signal chain would be: 1) s_rt (SIGUSR1) warning, 2) -notify defined warning before kill, 3) kill
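
If a two-stage warning is really wanted, the chain above could be tracked inside the application roughly as sketched here in C. This assumes the defaults described in qsub(1), where -notify sends SIGUSR2 ahead of SIGKILL, and that NOTIFY_KILL has not been changed; the names are hypothetical. Note that SIGUSR1 alone cannot tell an s_rt warning from a -notify warning ahead of a suspend.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* 0 = no warning yet, 1 = SIGUSR1 seen, 2 = SIGUSR2 seen. */
static volatile sig_atomic_t warning_stage = 0;

static void on_warning(int signo)
{
    if (signo == SIGUSR2)
        warning_stage = 2;              /* kill is imminent */
    else if (signo == SIGUSR1 && warning_stage < 1)
        warning_stage = 1;              /* first, softer warning */
}

/* Hypothetical helper: route SIGUSR1 and SIGUSR2 to one handler.
   Returns 0 on success, -1 on failure. */
int install_warning_handlers(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_warning;
    sigemptyset(&sa.sa_mask);
    sigaddset(&sa.sa_mask, SIGUSR1);    /* block both while handling */
    sigaddset(&sa.sa_mask, SIGUSR2);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;
    return sigaction(SIGUSR2, &sa, NULL);
}

/* Hypothetical helper: report the highest warning stage seen so far. */
int current_warning_stage(void)
{
    return (int)warning_stage;
}
```

The stage never moves backwards, so a late SIGUSR1 after SIGUSR2 does not hide the fact that the kill is pending.
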
> >
> > -- Reuti
> >
> >
> > > Our production GE is SoGE 8.0.0a.
> > >
> > > Thanks for help. Best regards
> > >
> > > baf035
> > >
> >
> >
> 
> 




