[gridengine users] suspended jobs continue to run

Joseph Farran jfarran at uci.edu
Thu Jan 22 21:29:43 UTC 2015


A little late, but I am running 8.1.7 and suspend worked only part of the time.

I had to write my own suspend script to make it work, especially with 
MATLAB jobs, which try to trap signals.
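
For reference, a suspend script of this kind might look like the sketch
below. It is not the script mentioned above, only a minimal illustration.
It assumes bash and the procps `pgrep` command, and the helper name
`signal_tree` is hypothetical. It sends SIGSTOP, which a process cannot
trap, to the job's top-level process and to every descendant.

```shell
#!/bin/bash
# Illustrative suspend helper (an assumption, not the poster's script).
# signal_tree SIGNAL PID: send SIGNAL to PID and, recursively, to all of
# its descendants as reported by pgrep -P.
signal_tree() {
    local sig=$1 pid=$2 child
    # Signal the parent first so it cannot spawn new children while
    # the tree is being walked.
    kill -s "$sig" "$pid" 2>/dev/null
    for child in $(pgrep -P "$pid"); do
        signal_tree "$sig" "$child"
    done
}

# When invoked with a PID argument, stop that process tree.
if [ $# -ge 1 ]; then
    signal_tree STOP "$1"
fi
```

Assuming the script is saved under a made-up path such as
/opt/sge/local/suspend.sh, it could be wired in by setting
"suspend_method /opt/sge/local/suspend.sh $job_pid" in the queue
configuration (qconf -mq <queue>). A matching resume_method would send
CONT in the same way.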

Joseph

On 12/19/2014 04:54 AM, Bergman at merctech.com wrote:
>
> On December 19, 2014 6:19:58 AM EST, Reuti <reuti at staff.uni-marburg.de> wrote:
> => Am 18.12.2014 um 22:21 schrieb bergman at merctech.com:
> => >
> => > We've got a job that was suspended via:
> => >
> => > 	qmod -sj $jobid
> => >
> => > that's continuing to run.  The job consists of a BASH script, which in
> => > turn submits other jobs in a loop, sleeping for 30 seconds after each loop.
> => >
> => > When I examine the job status on the node where it is executing via:
> => > 	ps -e f | grep $JOBID
> => >
> => > I see that the process is sleeping (state "S"), which is not unexpected,
> => > given the 'sleep 30' in the loop, but not suspended (state "T"):
> => >
> => > 	30559 ?        SNs    0:02  |   \_ /bin/bash /var/tmp/gridengine/8.1.6/default/spool/node-5-2/job_scripts/2367998
> =>
> => Maybe it was introduced in this edition, as in 6.2u5 it's working for
>
> I can't believe I left that out... we're running SoGE 8.1.6.
>
> => me. Do you have a chance to test any other version on another machine
> => with your application in question?
>
> Nope.
>
> Mark
>
> =>
> => -- Reuti
> =>
> =>
> => > Indeed, the job is not suspended, as it keeps performing the action
> => > inside the loop.
> => >
> => > The problem can be consistently reproduced with a trivial job, such as:
> => >
> => > ------------------------
> => > #! /bin/bash
> => > i=0
> => > while [ $i -le 100 ]
> => > do
> => > 	date
> => > 	i=$((i + 1))
> => > 	sleep 30
> => > done
> => > ------------------------
> => >
> => > Submitting that job to SGE, then executing 'qmod -sj $jobid' after it
> => > starts does not suspend the running job. The 'qstat' command does show
> => > the job as being in the 's' (suspended) state.
> => >
> => > We're not using any custom 'suspend_method' or changing the default
> => > signals sent by SGE.
> => >
> => > Jobs that are suspended (due to subordinated queues) by SGE have never
> => > shown this behavior.
> => >
> => > Any suggestions about how to proceed with troubleshooting?
> => >
> => > Thanks,
> => >
> => > Mark
> => >
> => >
> => > _______________________________________________
> => > users mailing list
> => > users at gridengine.org
> => > https://gridengine.org/mailman/listinfo/users



