[gridengine users] Job finishes correctly but master is not notified

Paul Paul pot94352 at clerk.com
Thu Apr 5 07:46:23 UTC 2018


Hello,

We're using SGE 8.1.9 and, at random, some jobs finish successfully (our job logs confirm this) but the master is never notified.
On the compute node, all the directories related to such a job are still there and correctly populated:

trace file:
...
04/04/2018 21:50:13 [300:38328]: now running with uid=300, euid=300
04/04/2018 21:50:13 [300:38328]: execvlp(/bin/ksh, "-ksh" "/gridware/sge/gridname/spool/server/job_scripts/1376090")
04/04/2018 21:50:23 [300:38327]: wait3 returned 38328 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
04/04/2018 21:50:23 [300:38327]: job exited with exit status 0
04/04/2018 21:50:23 [300:38327]: reaped "job" with pid 38328
04/04/2018 21:50:23 [300:38327]: job exited not due to signal
04/04/2018 21:50:23 [300:38327]: job exited with status 0
04/04/2018 21:50:23 [300:38327]: now sending signal KILL to pid -38328
04/04/2018 21:50:23 [300:38327]: pdc_kill_addgrpid: 20075 9
04/04/2018 21:50:23 [300:38327]: writing usage file to "usage"
04/04/2018 21:50:23 [300:38327]: no epilog script to start

exit_status:
0

error:
(empty)

The job's process, however, no longer appears in the 'ps' output on that node.
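
For anyone wanting to look for the same signature on their own exec hosts, here is a minimal sketch that reads a shepherd trace file and reports the exit status it recorded. It assumes only the line format visible in the excerpt above (date, time, [uid:pid], message) and the "job exited with exit status N" message. The function name recorded_exit_status and the sample list are made up for illustration and are not part of SGE.

```python
import re
from typing import Iterable, Optional

# Assumed line layout, taken from the trace excerpt in this post:
#   04/04/2018 21:50:23 [300:38327]: job exited with exit status 0
TRACE_LINE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[(?P<uid>\d+):(?P<pid>\d+)\]: (?P<msg>.*)$"
)
EXIT_MSG = re.compile(r"^job exited with exit status (?P<status>-?\d+)$")


def recorded_exit_status(lines: Iterable[str]) -> Optional[int]:
    """Return the exit status logged in the trace, or None if none was logged.

    Hypothetical helper: it only pattern-matches the text of the trace and
    does not talk to any SGE daemon.
    """
    status = None
    for line in lines:
        parsed = TRACE_LINE.match(line.strip())
        if parsed is None:
            continue  # skip "..." and anything not in the assumed layout
        exit_match = EXIT_MSG.match(parsed.group("msg"))
        if exit_match:
            status = int(exit_match.group("status"))
    return status


# Lines copied from the trace excerpt above.
sample = [
    "...",
    "04/04/2018 21:50:13 [300:38328]: now running with uid=300, euid=300",
    "04/04/2018 21:50:23 [300:38327]: wait3 returned 38328 "
    "(status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)",
    "04/04/2018 21:50:23 [300:38327]: job exited with exit status 0",
    "04/04/2018 21:50:23 [300:38327]: no epilog script to start",
]

if __name__ == "__main__":
    print(recorded_exit_status(sample))
```

A job whose trace yields a status here, but which 'qstat -j' still lists on the master, matches the situation described in this message.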

On the master, 'qstat -j 1376090' still returns the job, so to get rid of it we run 'qdel -f 1376090'.

This happens 3 or 4 times a day (we submit more than 100k jobs per day), on different exec hosts.

Do you know what could be the cause of this behavior?

Thanks,

Paul.


