[gridengine users] qrsh session failed to execute prolog script?

Derrick Lin klin938 at gmail.com
Thu Jan 10 23:30:54 UTC 2019


Hi Reuti

Thanks for the input. But how does this help with troubleshooting the prolog
script?

I will also troubleshoot the prolog script line by line to see which
line is causing the problem.
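
For what it is worth, here is a minimal sketch of how the line-by-line check
could be done, assuming the prolog is a bash script. Running it under
"bash -x" prints every command before it executes, so the last "+ ..." line
in the log is the command that was running when the script stalled. The
helper name run_traced and the stand-in script are made up for illustration;
they are not part of Grid Engine.

```shell
#!/bin/bash
# run_traced SCRIPT LOG: hypothetical helper, not part of Grid Engine.
run_traced() {
    local script=$1 log=$2
    # -x makes bash print each command (prefixed with "+ ") to stderr
    # before running it; stdout and stderr both go to the log file.
    bash -x "$script" >"$log" 2>&1
}

# Demo with a stand-in script instead of the real prolog.
demo=$(mktemp)
log=$(mktemp)
printf 'echo "step 1"\necho "step 2"\n' >"$demo"
run_traced "$demo" "$log"
cat "$log"
rm -f "$demo" "$log"
```

For the real prolog the same effect could be had by pointing run_traced at
the prolog path, or by adding "set -x" near the top of the script and
sending stderr to a per-job file.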

Cheers,
Derrick

On Thu, Jan 10, 2019 at 7:42 PM Reuti <reuti at staff.uni-marburg.de> wrote:

> Hi,
>
> On 09.01.2019 at 23:35, Derrick Lin wrote:
>
> > Hi Reuti,
> >
> > I have to say I am still not familiar with the "-i" option in qsub, even
> > after reading the man page. What does it do?
>
> The file will be fed as stdin to the job script. Hence:
>
> $ qsub -i myfile foo.sh
>
> is like:
>
> $ foo.sh < myfile
>
> but in batch.
>
> -- Reuti
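
To illustrate the stdin point, here is a small sketch in plain bash (it does
not involve Grid Engine itself; the function name read_one_line and the temp
file are illustrative only). A "read" returns at once when stdin is a file or
/dev/null, whereas on an interactive terminal it would block until input
arrives, which is the kind of difference suspected between qsub and qrsh.

```shell
#!/bin/bash
# Stand-in for a prolog step that reads from stdin (hypothetical name).
read_one_line() {
    local line
    if IFS= read -r line; then
        echo "got: $line"
    else
        echo "no input (EOF)"
    fi
}

myfile=$(mktemp)
printf 'hello\n' >"$myfile"

# Like "foo.sh < myfile": the first line of the file is consumed.
read_one_line <"$myfile"      # prints: got: hello

# Empty stdin: read hits EOF immediately and the script carries on.
read_one_line </dev/null      # prints: no input (EOF)

rm -f "$myfile"
```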
>
>
> > There is no useful or interesting output in the qmaster messages log or
> > the exec node messages log. The only information I could find is from the
> > job's trace file:
> >
> > [root@zeta-4-12 381.1]# ls
> > config  environment  error  exit_status  pe_hostfile  pid  trace
> > [root@zeta-4-12 381.1]# cat trace
> > 01/10/2019 09:12:07 [997:307578]: shepherd called with uid = 0, euid = 997
> > 01/10/2019 09:12:07 [997:307578]: qlogin_daemon = builtin
> > 01/10/2019 09:12:07 [997:307578]: starting up 8.1.9
> > 01/10/2019 09:12:07 [997:307578]: setpgid(307578, 307578) returned 0
> > 01/10/2019 09:12:07 [997:307578]: do_core_binding: "binding" parameter not found in config file
> > 01/10/2019 09:12:07 [997:307578]: calling fork_pty()
> > 01/10/2019 09:12:07 [997:307578]: parent: forked "prolog" with pid 307579
> > 01/10/2019 09:12:07 [997:307578]: using signal delivery delay of 120 seconds
> > 01/10/2019 09:12:07 [997:307578]: parent: prolog-pid: 307579
> > 01/10/2019 09:12:07 [997:307579]: child: starting son(prolog, root@/opt/gridengine/default/common/prolog_exec.sh, 0, 10000);
> > 01/10/2019 09:12:07 [997:307579]: pid=307579 pgrp=307579 sid=307579 old pgrp=307579 getlogin()=<no login set>
> > 01/10/2019 09:12:07 [997:307579]: reading passwd information for user 'root'
> > 01/10/2019 09:12:07 [997:307579]: setting limits
> > 01/10/2019 09:12:07 [997:307579]: setting environment
> > 01/10/2019 09:12:07 [997:307579]: Initializing error file
> > 01/10/2019 09:12:07 [997:307579]: switching to intermediate/target user
> > 01/10/2019 09:12:07 [997:307579]: setting additional gid=0
> > 01/10/2019 09:12:07 [6782:307579]: closing all filedescriptors
> > 01/10/2019 09:12:07 [6782:307579]: further messages are in "error" and "trace"
> > 01/10/2019 09:12:07 [997:307578]: Poll received POLLHUP (Hang up). Unregister the FD.
> > 01/10/2019 09:12:07 [6782:307579]: using "/bin/bash" as shell of user "root"
> > 01/10/2019 09:12:07 [0:307579]: now running with uid=0, euid=0
> > 01/10/2019 09:12:07 [0:307579]: execvlp(/opt/gridengine/default/common/prolog_exec.sh, "/opt/gridengine/default/common/prolog_exec.sh")
> > ### The process just got stuck at the line above
> >
> > Here is the trace file for a qsub/batch job. Apparently the prolog
> > script got executed and the process proceeded:
> >
> > [root@zeta-4-12 383.1]# ls
> > addgrpid  config  environment  error  exit_status  job_pid  pe_hostfile  pid  trace
> > [root@zeta-4-12 383.1]# cat trace
> > 01/10/2019 09:20:22 [997:315329]: shepherd called with uid = 0, euid = 997
> > 01/10/2019 09:20:22 [997:315329]: starting up 8.1.9
> > 01/10/2019 09:20:22 [997:315329]: setpgid(315329, 315329) returned 0
> > 01/10/2019 09:20:22 [997:315329]: do_core_binding: "binding" parameter not found in config file
> > 01/10/2019 09:20:22 [997:315329]: parent: forked "prolog" with pid 315330
> > 01/10/2019 09:20:22 [997:315329]: using signal delivery delay of 120 seconds
> > 01/10/2019 09:20:22 [997:315329]: parent: prolog-pid: 315330
> > 01/10/2019 09:20:22 [997:315330]: child: starting son(prolog, root@/opt/gridengine/default/common/prolog_exec.sh, 0, 10000);
> > 01/10/2019 09:20:22 [997:315330]: pid=315330 pgrp=315330 sid=315330 old pgrp=315329 getlogin()=<no login set>
> > 01/10/2019 09:20:22 [997:315330]: reading passwd information for user 'root'
> > 01/10/2019 09:20:22 [997:315330]: setting limits
> > 01/10/2019 09:20:22 [997:315330]: setting environment
> > 01/10/2019 09:20:22 [997:315330]: Initializing error file
> > 01/10/2019 09:20:22 [997:315330]: switching to intermediate/target user
> > 01/10/2019 09:20:22 [997:315330]: setting additional gid=0
> > 01/10/2019 09:20:22 [6782:315330]: closing all filedescriptors
> > 01/10/2019 09:20:22 [6782:315330]: further messages are in "error" and "trace"
> > 01/10/2019 09:20:22 [6782:315330]: using "/bin/bash" as shell of user "root"
> > 01/10/2019 09:20:22 [6782:315330]: using stdout as stderr
> > 01/10/2019 09:20:22 [0:315330]: now running with uid=0, euid=0
> > 01/10/2019 09:20:22 [0:315330]: execvlp(/opt/gridengine/default/common/prolog_exec.sh, "/opt/gridengine/default/common/prolog_exec.sh")
> > 01/10/2019 09:20:22 [997:315329]: wait3 returned 315330 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
> > 01/10/2019 09:20:22 [997:315329]: prolog exited with exit status 0
> > 01/10/2019 09:20:22 [997:315329]: reaped "prolog" with pid 315330
> > 01/10/2019 09:20:22 [997:315329]: prolog exited not due to signal
> > 01/10/2019 09:20:22 [997:315329]: prolog exited with status 0
> > 01/10/2019 09:20:22 [997:315329]: parent: forked "job" with pid 315345
> > 01/10/2019 09:20:22 [997:315329]: parent: job-pid: 315345
> > 01/10/2019 09:20:22 [997:315345]: child: starting son(job, sleep, 0, 4096);
> > 01/10/2019 09:20:22 [997:315345]: pid=315345 pgrp=315345 sid=315345 old pgrp=315329 getlogin()=<no login set>
> > 01/10/2019 09:20:22 [997:315345]: reading passwd information for user 'derlin'
> > 01/10/2019 09:20:22 [997:315345]: setosjobid: uid = 0, euid = 997
> > 01/10/2019 09:20:22 [997:315345]: setting limits
> > 01/10/2019 09:20:22 [997:315345]: RLIMIT_CPU setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> > 01/10/2019 09:20:22 [997:315345]: RLIMIT_FSIZE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> > 01/10/2019 09:20:22 [997:315345]: RLIMIT_DATA setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> > 01/10/2019 09:20:22 [997:315345]: RLIMIT_STACK setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> > 01/10/2019 09:20:22 [997:315345]: RLIMIT_CORE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> > 01/10/2019 09:20:22 [997:315345]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 8257536000 hard 8257536000) resulting: (soft 8257536000 hard 8257536000)
> > 01/10/2019 09:20:22 [997:315345]: RLIMIT_RSS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> > 01/10/2019 09:20:22 [997:315345]: setting environment
> > 01/10/2019 09:20:22 [997:315345]: Initializing error file
> > 01/10/2019 09:20:22 [997:315345]: switching to intermediate/target user
> > 01/10/2019 09:20:22 [997:315345]: setting additional gid=20011
> > 01/10/2019 09:20:22 [6782:315345]: closing all filedescriptors
> > 01/10/2019 09:20:22 [6782:315345]: further messages are in "error" and "trace"
> > 01/10/2019 09:20:22 [6782:315345]: using stdout as stderr
> > 01/10/2019 09:20:22 [6782:315345]: now running with uid=6782, euid=6782
> > 01/10/2019 09:20:22 [6782:315345]: execvlp(/bin/csh, "-csh" "-c" "sleep 10m ")
> >
> > I will attach my prolog script in the next post.
> >
> > Cheers
> > Derrick
> >
> >
> > On Wed, Jan 9, 2019 at 7:36 PM Reuti <reuti at staff.uni-marburg.de> wrote:
> > Hi,
> >
> > > On 09.01.2019 at 01:14, Derrick Lin <klin938 at gmail.com> wrote:
> > >
> > > Hi guys,
> > >
> > > I just brought up a new SGE cluster, but somehow the qrsh session
> > > does not work:
> > >
> > > tester@login-gpu:~$ qrsh
> > > ^Cerror: error while waiting for builtin IJS connection: "got select timeout"
> > >
> > > After I hit Enter, the session just hangs forever instead of bringing
> > > me to a compute node. I have to press Ctrl+C to terminate it, and it
> > > then gives the above error.
> > >
> > > I noticed that SGE did send my qrsh request to a compute node, as I
> > > could tell from qstat:
> > >
> > >
> > > ---------------------------------------------------------------------------------
> > > short.q@zeta-4-15.local        BIP   0/1/80         0.01     lx-amd64
> > >      15 0.55500 QRLOGIN    tester       r    01/09/2019 10:47:13     1
> > >
> > > We have a prolog script configured globally. The script deals with
> > > local disk quota and keeps all output in a log file for each job. So I
> > > went to that compute node to check, and found that a log file had been
> > > created but was empty.
> > >
> > > So my thinking so far is that my qrsh is stuck because the prolog
> > > script is not fully executed.
> >
> > Is there any statement in the prolog that could wait for stdin? In a
> > batch job there is just no stdin, hence it continues. This could be
> > tested with "-i" on a batch job.
> >
> > -- Reuti
> >
> >
> > > qsub jobs are working fine.
> > >
> > > Any ideas would be appreciated.
> > >
> > > Cheers,
> > > Derrick