[gridengine users] Jobs sometimes run in wrong directory
dowobeha at gmail.com
Sun Jan 29 22:34:33 UTC 2012
Thanks. Excellent idea. I will give it a try when I get in to work on Monday.
On Fri, Jan 27, 2012 at 3:42 PM, Rayson Ho <rayson at scalablelogic.com> wrote:
> On Fri, Jan 27, 2012 at 2:49 PM, Lane Schwartz <dowobeha at gmail.com> wrote:
>> I have encountered a problem where sometimes (but not always) my jobs
>> ignore the -cwd or -wd flags and run in my home directory instead of
>> the specified working directory. I can run the same job multiple times
>> launching from the same directory, and sometimes the job correctly
>> runs from the current directory, and sometimes it runs from my home
>> directory.
> I ran over 100 test jobs and all of them ran in the directory specified by
> -cwd or -wd. How easy is it to reproduce the issue? Is the home
> directory on NFS or some other kind of network or cluster storage?
> If Grid Engine cannot change the directory to the one specified by
> -cwd/-wd, then it will simply put the job into the "Eqw" state.
> Since this is an intermittent issue, we will need to:
> 1) Run a few jobs that do the following:
> - check whether the current working directory is the correct one by
> comparing the output of `pwd` against the hard-coded value of the
> supposedly correct directory (so obviously you will need to decide the
> location before you submit the jobs, since the correct value is hard
> coded into the job script).
> - if the value is not correct, email you to report the issue, and
> then sleep (I mean... the job, not you!)
> - and if the value is correct, just exit with 0 and don't sleep
> (no point in wasting the job slots).
> 2) For any jobs that do not run in the "correct" directory, run:
> - qstat -j <job id>
> The "sge_o_workdir" field should show which directory SGE thinks
> the job is supposed to run in.
> - go into the $SGE_ROOT/$SGE_CELL/spool/<execution
> host>/active_jobs/<job id.1> directory.
> Tar up the contents of the directory and send it to me, together with
> the qstat -j output.
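The check job described in step 1 could be sketched roughly as follows. This is only a minimal sketch, not a tested recipe: EXPECTED_DIR, NOTIFY_ADDR and HOLD_SECONDS are hypothetical placeholders to fill in before submitting, and it assumes a working `mail` command on the execution hosts.

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd

# Hypothetical placeholders: set these before submitting.
EXPECTED_DIR="/path/to/expected/dir"   # directory the job should run in
NOTIFY_ADDR="you@example.com"          # where to send the alert
HOLD_SECONDS=3600                      # how long a failing job stays alive

# Succeeds (status 0) when the current directory equals the expected one.
in_expected_dir() {
    [ "$(pwd)" = "$1" ]
}

# Build the one-line report that is mailed on a mismatch.
mismatch_report() {
    echo "job ${JOB_ID:-unknown} on ${HOSTNAME:-unknown}: expected $1, got $(pwd)"
}

main() {
    if in_expected_dir "$EXPECTED_DIR"; then
        exit 0      # correct directory: free the slot immediately
    fi
    # Wrong directory: notify, then sleep so the job's spool directory
    # is still there to inspect. Assumes `mail` works on this host.
    mismatch_report "$EXPECTED_DIR" | mail -s "SGE job in wrong directory" "$NOTIFY_ADDR"
    sleep "$HOLD_SECONDS"
}

# Grid Engine sets JOB_ID inside a running job, so main only runs there.
if [ -n "${JOB_ID:-}" ]; then
    main
fi
```

Submitting several copies from the same directory should leave only the misbehaving ones sleeping in the queue, ready for the qstat -j and spool directory checks in step 2.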
>> Interestingly, though, it always outputs the stderr and stdout log
>> files into the correct folder (specified by -o and -e) which is in the
>> current directory.
>> To help debug the problem, I made a sample script that simply calls
>> pwd. The script is below:
>> # Tell SGE to use bash instead of the SGE default shell
>> #$ -S /bin/bash
>> # Tell SGE to keep all current environment variables
>> #$ -V
>> # Tell SGE to run job from current working directory
>> #$ -cwd
>> # Tell SGE which queue to use
>> #$ -q all.q
>> # Tell SGE the name of this job
>> #$ -N fr-en.mert
>> # Tell SGE where to log this job
>> #$ -o log/mert/fr-en/
>> #$ -e log/mert/fr-en/
>> # Tell SGE how much memory this job needs
>> #$ -l mem_free=8G
>> echo "pwd=`pwd`"
>> echo "PWD=$PWD"
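A slightly more talkative variant of the two echo lines might make it easier to see which hosts misbehave. This is just a sketch: the function name describe_workdir is made up for illustration, and it assumes the usual job environment, where Grid Engine sets JOB_ID and SGE_O_WORKDIR (the submission directory).

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd

# Print where the job actually started next to what Grid Engine recorded.
# JOB_ID and SGE_O_WORKDIR come from the job environment; HOSTNAME is
# normally set by bash. Each falls back to "unset" if it is missing.
describe_workdir() {
    echo "job_id=${JOB_ID:-unset}"
    echo "host=${HOSTNAME:-unset}"
    echo "pwd=$(pwd)"
    echo "PWD=${PWD:-unset}"
    echo "HOME=${HOME:-unset}"
    echo "sge_o_workdir=${SGE_O_WORKDIR:-unset}"
}

describe_workdir
```

If pwd matches HOME rather than sge_o_workdir in the job's stdout log, the host line shows which execution host the job landed on.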
>> I'm running OGS/GE2011.11. I also have another setup with SGE/ge6.2u6
>> - on that older setup this problem does not occur.
>> Has anyone ever seen this type of problem? If I change the script to
>> use -wd /path/to/current/dir instead of -cwd I get exactly the same
>> inconsistent behavior. Likewise, it doesn't appear to matter whether I
>> pass the flag at the command line or within the script, as above.
>> Are there any grid engine or scheduler log files that I could examine
>> which might be helpful in tracking down this behavior?
>> Lane Schwartz