[gridengine users] Job finishes correctly but master is not notified

Paul Paul pot94352 at clerk.com
Thu Apr 5 13:38:18 UTC 2018


William,

Thanks for your reply.

In the 'messages' file of the exec host, there is nothing relevant (the last entry is from 2 weeks ago).
In the 'messages' file of the master, there are the usual lines:
04/05/2018 06:42:58|worker|master_host|W|user forced the deletion of job 1376090
04/05/2018 06:43:20|worker|master_host|E|execd at exec_host reports running job (1376090.1/master) in queue "queue at exec_host" that was not supposed to be there - killing
04/05/2018 06:43:59|worker|master_host|E|execd at exec_host reports running job (1376090.1/master) in queue "queue at exec_host" that was not supposed to be there - killing
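
To see how often this happens and on which exec hosts, the master 'messages' file can be scanned for these error lines. What follows is only a minimal sketch in Python, assuming the pipe-separated layout shown above (date time|thread|host|level|message). The function names are made up for illustration, and the pattern accepts both "execd at host" as it appears in this archived mail and "execd@host":

```python
import re
from collections import Counter

# Assumption: each line looks like "date time|thread|host|level|message",
# as in the three lines quoted above. The list archive rewrites "@" as " at ",
# so the pattern accepts either form.
STALE_JOB = re.compile(
    r"execd(?:@| at )(?P<host>\S+) reports running job "
    r"\((?P<job>\d+)\.(?P<task>\d+)/"
)


def parse_line(line):
    """Split one 'messages' line into its five fields, or return None."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, thread, host, level, text = parts
    return {"timestamp": timestamp, "thread": thread,
            "host": host, "level": level, "text": text}


def stale_job_reports(lines):
    """Count 'not supposed to be there' errors per (job id, exec host)."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry["level"] != "E":
            continue
        if "not supposed to be there" not in entry["text"]:
            continue
        match = STALE_JOB.search(entry["text"])
        if match:
            counts[(match.group("job"), match.group("host"))] += 1
    return counts


# The three lines from the master 'messages' file quoted above.
sample = [
    "04/05/2018 06:42:58|worker|master_host|W|user forced the deletion of job 1376090",
    '04/05/2018 06:43:20|worker|master_host|E|execd at exec_host reports running job (1376090.1/master) in queue "queue at exec_host" that was not supposed to be there - killing',
    '04/05/2018 06:43:59|worker|master_host|E|execd at exec_host reports running job (1376090.1/master) in queue "queue at exec_host" that was not supposed to be there - killing',
]

for (job, host), n in stale_job_reports(sample).items():
    print(job, host, n)
```

On a real system the same function could be fed the open 'messages' file of the master instead of the sample list; the location of that file is site-specific.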

Regarding 'gdi_timeout' and 'gdi_retries', we will try changing them to see whether things improve.
We have already noticed problems when submitting jobs with 'qsub' (when NFS is heavily loaded), such as:
"Unable to run job: failed receiving gdi request response for mid=1 (got syncron message receive timeout error)."
so it might help with that too.

Paul.

> Sent: Thursday, April 05, 2018 at 8:20 AM
> From: "William Hay" <w.hay at ucl.ac.uk>
> To: "Paul Paul" <pot94352 at clerk.com>
> Cc: users at gridengine.org
> Subject: Re: [gridengine users] Job finishes correctly but master is not notified
>
> On Thu, Apr 05, 2018 at 09:46:23AM +0200, Paul Paul wrote:
> > Hello,
> > 
> > We're using SGE 8.1.9 and, at random, some jobs finish successfully (our job logs confirm this) but the master is not notified.
> > On the compute node, all the folders related to such a job are still there and correctly populated:
> > 
> > trace file:
> > ...
> > 04/04/2018 21:50:13 [300:38328]: now running with uid=300, euid=300
> > 04/04/2018 21:50:13 [300:38328]: execvlp(/bin/ksh, "-ksh" "/gridware/sge/gridname/spool/server/job_scripts/1376090")
> > 04/04/2018 21:50:23 [300:38327]: wait3 returned 38328 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
> > 04/04/2018 21:50:23 [300:38327]: job exited with exit status 0
> > 04/04/2018 21:50:23 [300:38327]: reaped "job" with pid 38328
> > 04/04/2018 21:50:23 [300:38327]: job exited not due to signal
> > 04/04/2018 21:50:23 [300:38327]: job exited with status 0
> > 04/04/2018 21:50:23 [300:38327]: now sending signal KILL to pid -38328
> > 04/04/2018 21:50:23 [300:38327]: pdc_kill_addgrpid: 20075 9
> > 04/04/2018 21:50:23 [300:38327]: writing usage file to "usage"
> > 04/04/2018 21:50:23 [300:38327]: no epilog script to start
> > 
> > exit_status:
> > 0
> > 
> > error:
> > (empty)
> > 
> > but the process no longer appears in the 'ps' output.
> > 
> > On the master, 'qstat -j 1376090' still reports the job, so to get rid of it we run 'qdel -f 1376090'.
> > 
> > This happens 3 or 4 times a day (we submit more than 100k jobs per day), on different exec hosts.
> > 
> > Do you know what could be the cause of this behavior?
> Is there anything in the messages log?
> 
> Alternatively, this might just be the network being less than 100% reliable.  Tweaking gdi_timeout and gdi_retries
> might help.
> 
> William
> 


