[gridengine users] GE2011.11 and ge6.2u5
Rayson Ho
rayrayson at gmail.com
Fri Jun 15 20:41:07 UTC 2012
On Fri, Jun 15, 2012 at 2:43 PM, Michael Coffman
<michael.coffman at avagotech.com> wrote:
> 06/14/2012 08:56:49| main|cs431|E|shepherd of job 9990340.1 exited with
> exit status = 11
Hmm, then most likely the qmaster log won't tell you anything either,
so we need the shepherd "trace" file (in the active_jobs directory)
to find out what's happening.
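To make the trace file lookup concrete, here is a minimal Python sketch. It assumes the usual execd spool layout of <execd_spool_dir>/<host>/active_jobs/<job_id>.<task_id>/trace, and that KEEP_ACTIVE is set so the directory is still there after the job fails. The helper names trace_path and tail are hypothetical, not part of Grid Engine.

```python
from pathlib import Path


def trace_path(spool_dir, host, job_id, task_id=1):
    """Build the assumed location of a job's shepherd trace file.

    Assumed layout: <spool_dir>/<host>/active_jobs/<job_id>.<task_id>/trace
    """
    return Path(spool_dir) / host / "active_jobs" / f"{job_id}.{task_id}" / "trace"


def tail(path, n=20):
    """Return the last n lines of a text file, replacing undecodable bytes."""
    with open(path, errors="replace") as fh:
        return fh.read().splitlines()[-n:]
```

For the job in the log line above, that would be something like tail(trace_path(spool_dir, "cs431", 9990340)), with spool_dir set to whatever execd_spool_dir is on that host.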
Also, do you know whether the job has any "%s" parameters passed to it?
(We have received reports of this before: a highly random error that
can happen depending on how the shepherd was built and the OS it is
running on.)
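One quick way to check a job's arguments for this is sketched below in Python. It only flags arguments that contain a printf-style conversion such as "%s" or "%d"; it does not confirm that this is what made the shepherd exit. The function names are hypothetical, and the pattern is a rough approximation of printf syntax rather than a full parser.

```python
import re

# Rough match for printf-style conversions such as %s, %d or %-5.2f.
_CONVERSION = re.compile(r"%[-+ #0]*\d*(?:\.\d+)?[hlLqjzt]*[diouxXeEfgGcspn]")


def has_format_conversion(arg):
    """Return True if arg contains a printf-style conversion.

    Escaped percent signs ("%%") are dropped first so they are not counted.
    """
    return bool(_CONVERSION.search(arg.replace("%%", "")))


def risky_args(argv):
    """Return the arguments that contain a printf-style conversion."""
    return [a for a in argv if has_format_conversion(a)]
```

Running risky_args over the job's command line gives the arguments worth a closer look.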
Rayson
>
>>
>> Rayson
>>
>>
>>
>>
>> >
>> > ==============================================================
>> > qname all.q
>> > hostname cs431.ftc.avagotech.net
>> > group fidlib
>> > owner bgp
>> > project NONE
>> > department priority
>> > jobname qsubcmd.21231
>> > jobnumber 17593
>> > taskid undefined
>> > account sge
>> > priority 0
>> > qsub_time Wed Dec 31 17:00:00 1969
>> > start_time -/-
>> > end_time -/-
>> > granted_pe NONE
>> > slots 0
>> > failed 11 : before job
>> > exit_status 0
>> > ru_wallclock 0
>> > ru_utime 0.000
>> > ru_stime 0.000
>> > ru_maxrss 0
>> > ru_ixrss 0
>> > ru_ismrss 0
>> > ru_idrss 0
>> > ru_isrss 0
>> > ru_minflt 0
>> > ru_majflt 0
>> > ru_nswap 0
>> > ru_inblock 0
>> > ru_oublock 0
>> > ru_msgsnd 0
>> > ru_msgrcv 0
>> > ru_nsignals 0
>> > ru_nvcsw 0
>> > ru_nivcsw 0
>> > cpu 0.000
>> > mem 0.000
>> > io 0.000
>> > iow 0.000
>> > maxvmem 0.000
>> > arid undefined
>> >
>> >
>> >
>> > On Fri, Jun 15, 2012 at 11:27 AM, Michael Coffman
>> > <michael.coffman at avagotech.com> wrote:
>> >>
>> >> On Fri, Jun 15, 2012 at 11:11 AM, Rayson Ho <rayrayson at gmail.com>
>> >> wrote:
>> >>>
>> >>> Can you set "execd_params" to KEEP_ACTIVE for this host?? (See the
>> >>> manpage at this URL:
>> >>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_conf.html )
>> >>>
>> >>> Request the job to run in this queue/host again, and see why the
>> >>> shepherd can't open the job_pid.
>> >>>
>> >>> (And remember to unset the execd_params or else you will fill up your
>> >>> local spool dir eventually with job information.)
>> >>>
>> >>
>> >> I can't do this on my production grid. And I don't know how to replicate
>> >> the problem currently. I will set things up on a test setup and try and
>> >> reproduce the issue with KEEP_ACTIVE turned on.
>> >>
>> >> Is it possible to set the KEEP_ACTIVE per host? I only see this in the
>> >> qconf -sconf
>> >>
>> >>>
>> >>> Rayson
>> >>>
>> >>>
>> >>>
>> >>> On Fri, Jun 15, 2012 at 12:58 PM, Michael Coffman
>> >>> <michael.coffman at avagotech.com> wrote:
>> >>> > On Fri, Jun 15, 2012 at 10:11 AM, Rayson Ho <rayrayson at gmail.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> On Fri, Jun 15, 2012 at 12:01 PM, Michael Coffman
>> >>> >> <michael.coffman at avagotech.com> wrote:
>> >>> >> > From the qmaster messages file:
>> >>> >> > 06/14/2012 21:29:39|worker|gemaster|W|job 3885.1 failed on host
>> >>> >> > cs428.ftc.avagotech.net general before job because: 06/14/2012
>> >>> >> > 21:29:37
>> >>> >> > [20339:8436]: can't open file job_pid: Permission denied
>> >>> >> >
>> >>> >> > I checked a job_pid file on a currently running job on the system that
>> >>> >> > had the above errors. Permissions down the entire tree seem fine, and
>> >>> >> > here is the job_pid file:
>> >>> >> >
>> >>> >> > -rw-r--r-- 1 grid grid 6 Jun 14 17:40 job_pid
>> >>> >>
>> >>> >> Is your execd spool dir on NFS or local??
>> >>> >>
>> >>> > Local.
>> >>> >
>> >>> >>
>> >>> >> Also, does it happen to all nodes or just a node or queue?
>> >>> >>
>> >>> >
>> >>> > Happened on 2 different nodes. Not all jobs caused this.
>> >>> >
>> >>> >>
>> >>> >> Rayson
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> >
>> >>> >> > Any clues? Is the path perhaps hard coded into sge_shepherd for this
>> >>> >> > file?
>> >>> >> >
>> >>> >> > Thanks.
>> >>> >> > --
>> >>> >> > -MichaelC
>> >>> >> >
>> >>> >> > _______________________________________________
>> >>> >> > users mailing list
>> >>> >> > users at gridengine.org
>> >>> >> > https://gridengine.org/mailman/listinfo/users
>> >>> >> >