[gridengine users] release job dependency hold upon starting of dependent job instead on completing job

Reuti reuti at staff.uni-marburg.de
Thu May 2 18:35:15 UTC 2013


Hi,

On 02.05.2013 at 19:31, Happy Monk wrote:

> Thanks for the quick reply Reuti.
> 
> How can I restrict prolog to only certain jobs ? Here is the LSF recipe that we are trying to implement in SGE
> 
> 
> #!/bin/bash
> 
> bsub < preproc.sh
> 
> echo $LSB_JOBID

`bsub` can change the value of an environment variable in the current shell process - interesting. Hence it behaves more like a sourced script than a child process started with its own environment.


> bsub -w 'done($LSB_JOBID)' < hostjob.sh
> 
> echo $LSB_JOBID
> 
> bsub -w 'started($LSB_JOBID)' < computejob.sh
> 
> echo $LSB_JOBID 

On the one hand, one could add an additional environment variable to this new job, in which the real condition for each "-hold_jid" is stated. But this way the original main job would have to parse the output of all jobs to check for the existence of this variable.

Maybe a shorter way would be to add the next job id to the context of the main job. In SGE the context of a job is metadata unrelated to SGE's handling and also unrelated to the job's environment; it's like a comment. I mean:

#!/bin/sh
PREP_JOB=$(qsub -terse preproc.sh)
MAIN_JOB=$(qsub -terse -hold_jid $PREP_JOB hostjob.sh)
NEXT_JOB=$(qsub -terse -hold_jid $MAIN_JOB computejob.sh)
qalter -ac NEXT_JOB=$NEXT_JOB $MAIN_JOB

Then the prolog has to scan the output of `qstat -j $JOB_ID`, i.e. for its own job number, to see whether there is an entry like:

context:                    NEXT_JOB=1234

and if yes, use `qalter` to apply the removal only to this job id.
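
A rough sketch of what such a prolog could look like. The helper name `extract_next_job` is made up for illustration, and I assume that several context entries show up comma-separated on the one "context:" line; this is untested, so check it against your installation first. It uses the job id 0 trick to clear the complete -hold_jid list of the follow-up job.

```shell
#!/bin/sh
# Sketch of a queue prolog; extract_next_job is a hypothetical helper name.

# Read `qstat -j` output on stdin and print the value of NEXT_JOB from the
# "context:" line, or nothing if there is no such entry.
# Assumption: multiple context entries are comma-separated on that line.
extract_next_job() {
    sed -n 's/^context:[[:space:]]*//p' | tr ',' '\n' |
        sed -n 's/^NEXT_JOB=//p' | head -n 1
}

# Only act when running as a prolog (JOB_ID is set) and qstat is available.
if [ -n "${JOB_ID:-}" ] && command -v qstat >/dev/null 2>&1; then
    next=$(qstat -j "$JOB_ID" | extract_next_job)
    case "$next" in
        ""|*[!0-9]*)
            # No entry, or not a plain job id: leave all holds alone.
            ;;
        *)
            # Job id 0 counts as already completed, so this clears the
            # complete -hold_jid list of the follow-up job only.
            qalter -hold_jid 0 "$next" ||
                echo "prolog: could not release hold of job $next" >&2
            ;;
    esac
fi
```

Jobs without a NEXT_JOB entry in their context pass through the prolog untouched, which is what restricts the effect to certain jobs.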

NB: In principle there is a race condition with this setup: if the prolog of the main job runs before `qalter` has added the follow-up job to the context, it will find no entry and not release the hold. But this would mean that `preproc.sh` has almost no runtime and the scheduled `hostjob.sh` starts more or less instantly. If this could happen, the script needs to be adjusted:

#!/bin/sh
PREP_JOB=$(qsub -terse preproc.sh)
MAIN_JOB=$(qsub -terse -hold_jid $PREP_JOB -ac NEXT_JOB=PENDING hostjob.sh)
NEXT_JOB=$(qsub -terse -hold_jid $MAIN_JOB computejob.sh)
qalter -sc NEXT_JOB=$NEXT_JOB $MAIN_JOB

Then the prolog could wait or raise an error if it sees NEXT_JOB=PENDING instead of a job id there.
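
The waiting variant could be sketched along these lines. The function names, the 10 attempts and the 3 second pause are arbitrary choices of mine, and the exit status 100 (which, as far as I know, puts the job into error state) should be checked against the man pages of your version. Untested.

```shell
#!/bin/sh
# Sketch only; next_job_state and wait_for_next_job are hypothetical names.

# Classify a NEXT_JOB context value.
next_job_state() {
    case "$1" in
        "")       echo none ;;     # no NEXT_JOB entry at all
        PENDING)  echo pending ;;  # placeholder not yet replaced
        *[!0-9]*) echo invalid ;;  # something that is not a plain job id
        *)        echo ready ;;    # a numeric job id
    esac
}

# Poll the job's own context until the placeholder has been replaced.
# $1 = maximum number of attempts, $2 = seconds to sleep between attempts.
wait_for_next_job() {
    attempts=$1
    while [ "$attempts" -gt 0 ]; do
        value=$(qstat -j "$JOB_ID" | sed -n 's/^context:[[:space:]]*//p' |
            tr ',' '\n' | sed -n 's/^NEXT_JOB=//p' | head -n 1)
        if [ "$(next_job_state "$value")" != pending ]; then
            echo "$value"
            return 0
        fi
        attempts=$((attempts - 1))
        sleep "$2"
    done
    return 1
}

# Only act when running as a prolog (JOB_ID is set) and qstat is available.
if [ -n "${JOB_ID:-}" ] && command -v qstat >/dev/null 2>&1; then
    if next=$(wait_for_next_job 10 3); then
        if [ "$(next_job_state "$next")" = ready ]; then
            qalter -hold_jid 0 "$next" ||
                echo "prolog: could not release hold of job $next" >&2
        fi
    else
        echo "prolog: NEXT_JOB still PENDING for job $JOB_ID" >&2
        exit 100  # assumption: 100 puts the job into error state
    fi
fi
```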

-- Reuti


> On Thu, May 2, 2013 at 9:00 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
> Hi,
> 
> On 02.05.2013 at 17:10, Happy Monk wrote:
> 
> > Is there any way to release hold of a job immediately after the dependent job started, usually this hold is released after execution of the dependent job.
> >
> > This function is available in LSF but checking whether its also available in SGE or not.
> 
> Not directly. You could use a queue prolog to remove the actual starting job from all jobs which depend on this one. This makes it necessary, that all exechosts are also submission hosts.
> 
> To remove a complete -hold_jid list, you can give the job id 0 there to `qalter`. As this job id will never be a real job, it always satisfies the condition as being completed already.
> 
> -- Reuti
> 




