[gridengine users] qrsh session failed to execute prolog script?

Derrick Lin klin938 at gmail.com
Wed Jan 9 22:48:52 UTC 2019


I compared the two trace files closely and found one line that appears in the
qrsh job's trace but not in the qsub job's:

01/10/2019 09:12:07 [997:307578]: Poll received POLLHUP (Hang up).
Unregister the FD.

Does this mean anything important?
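
As far as I understand, POLLHUP from poll() only means that the far end of one
of the file descriptors the shepherd is watching (the pty or a pipe) was
closed, so on its own it may be harmless. What I would really like to know is
what the prolog's own descriptors look like under qrsh versus qsub. To check
that I am thinking of adding a few debug lines at the top of prolog_exec.sh;
this is only a rough sketch, and the log path under /tmp and the use of
$JOB_ID are my own assumptions:

# temporary debug block for the top of prolog_exec.sh; assumes /tmp is
# writable on the exec node and that JOB_ID is set for prologs; remove
# after testing
{
    echo "=== prolog debug, job ${JOB_ID:-unknown}, pid $$ ==="
    date
    echo "stdin  -> $(readlink /proc/$$/fd/0 2>/dev/null)"
    echo "stdout -> $(readlink /proc/$$/fd/1 2>/dev/null)"
    echo "stderr -> $(readlink /proc/$$/fd/2 2>/dev/null)"
    tty || echo "no controlling tty"
} >> /tmp/prolog_debug.log 2>&1

Comparing that output for one qrsh job and one qsub job should show whether
the prolog only inherits a pty (and therefore a stdin something could block
on) in the qrsh case.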

We also have another SGE cluster running an older version of SGE (based on
CentOS 6), where the same prolog script runs fine for qrsh. Its qrsh trace
file, shown below, does not contain that POLLHUP line either:

[luffy at omega-6-16 ~]$ cat /opt/gridengine/default/spool/omega-6-16/active_jobs/1430156.1/trace
01/10/2019 09:40:09 [400:58636]: shepherd called with uid = 0, euid = 400
01/10/2019 09:40:09 [400:58636]: rlogin_daemon = builtin
01/10/2019 09:40:09 [400:58636]: starting up 2011.11p1
01/10/2019 09:40:09 [400:58636]: setpgid(58636, 58636) returned 0
01/10/2019 09:40:09 [400:58636]: do_core_binding: "binding" parameter not
found in config file
01/10/2019 09:40:09 [400:58636]: parent: forked "prolog" with pid 58637
01/10/2019 09:40:09 [400:58636]: using signal delivery delay of 120 seconds
01/10/2019 09:40:09 [400:58636]: parent: prolog-pid: 58637
01/10/2019 09:40:09 [400:58637]: child: starting son(prolog,
root@/opt/gridengine/default/common/prolog_exec.sh,
0);
01/10/2019 09:40:09 [400:58637]: pid=58637 pgrp=58637 sid=58637 old
pgrp=58636 getlogin()=<no login set>
01/10/2019 09:40:09 [400:58637]: reading passwd information for user 'root'
01/10/2019 09:40:09 [400:58637]: setting limits
01/10/2019 09:40:09 [400:58637]: setting environment
01/10/2019 09:40:09 [400:58637]: Initializing error file
01/10/2019 09:40:09 [400:58637]: switching to intermediate/target user
01/10/2019 09:40:09 [500:58637]: closing all filedescriptors
01/10/2019 09:40:09 [500:58637]: further messages are in "error" and "trace"
01/10/2019 09:40:09 [500:58637]: using "/bin/bash" as shell of user "root"
01/10/2019 09:40:09 [0:58637]: now running with uid=0, euid=0
01/10/2019 09:40:09 [0:58637]:
execvp(/opt/gridengine/default/common/prolog_exec.sh,
"/opt/gridengine/default/common/prolog_exec.sh")
01/10/2019 09:40:09 [400:58636]: wait3 returned 58637 (status: 0;
WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
01/10/2019 09:40:09 [400:58636]: prolog exited with exit status 0
01/10/2019 09:40:09 [400:58636]: reaped "prolog" with pid 58637
01/10/2019 09:40:09 [400:58636]: prolog exited not due to signal
01/10/2019 09:40:09 [400:58636]: prolog exited with status 0
01/10/2019 09:40:09 [400:58636]: pipe to child uses fds 4 and 5
01/10/2019 09:40:09 [400:58636]: calling fork_pty()
01/10/2019 09:40:09 [400:58636]: parent: forked "job" with pid 58672
01/10/2019 09:40:09 [400:58636]: parent: job-pid: 58672
01/10/2019 09:40:09 [400:58636]: parent: closing childs end of the pipe
01/10/2019 09:40:09 [400:58636]: csp = 0
01/10/2019 09:40:09 [400:58636]: parent: starting parent loop with
remote_host = dice01.local, remote_port = 44009, job_owner = luffy,
fd_pty_master = 6, fd_pipe_in = -1, fd_pipe_out = -1, fd_pipe_err = -1,
fd_pipe_to_child = 5
### more lines omitted
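
Regarding Reuti's stdin suggestion quoted below: if something in the prolog
can block reading from its standard input, a cheap way to rule that out would
be to detach stdin right at the top of the script. A minimal sketch, assuming
nothing in prolog_exec.sh legitimately needs stdin:

# near the top of prolog_exec.sh: detach stdin so no later command can sit
# waiting on input (for example a pty handed to the prolog by qrsh)
exec </dev/null

# or, less drastically, redirect only the commands under suspicion, e.g.
# some_quota_command ... </dev/null    (hypothetical command name)

If qrsh starts working with that change, the hang is almost certainly a
command in the prolog waiting for input, which would also fit Reuti's point
that a plain batch job has no stdin and therefore runs straight through.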

On Thu, Jan 10, 2019 at 9:35 AM Derrick Lin <klin938 at gmail.com> wrote:

> Hi Reuti,
>
> I have to say I am still not familiar with the "-i" option of qsub even
> after reading the man page. What does it do?
>
> There is no useful/interesting output in the qmaster messages or the exec
> node messages log. The only information I could find is from the job's
> trace file:
>
> [root at zeta-4-12 381.1]# ls
> config  environment  error  exit_status  pe_hostfile  pid  trace
> [root at zeta-4-12 381.1]# cat trace
> 01/10/2019 09:12:07 [997:307578]: shepherd called with uid = 0, euid = 997
> 01/10/2019 09:12:07 [997:307578]: qlogin_daemon = builtin
> 01/10/2019 09:12:07 [997:307578]: starting up 8.1.9
> 01/10/2019 09:12:07 [997:307578]: setpgid(307578, 307578) returned 0
> 01/10/2019 09:12:07 [997:307578]: do_core_binding: "binding" parameter not
> found in config file
> 01/10/2019 09:12:07 [997:307578]: calling fork_pty()
> 01/10/2019 09:12:07 [997:307578]: parent: forked "prolog" with pid 307579
> 01/10/2019 09:12:07 [997:307578]: using signal delivery delay of 120
> seconds
> 01/10/2019 09:12:07 [997:307578]: parent: prolog-pid: 307579
> 01/10/2019 09:12:07 [997:307579]: child: starting son(prolog, root@/opt/gridengine/default/common/prolog_exec.sh,
> 0, 10000);
> 01/10/2019 09:12:07 [997:307579]: pid=307579 pgrp=307579 sid=307579 old
> pgrp=307579 getlogin()=<no login set>
> 01/10/2019 09:12:07 [997:307579]: reading passwd information for user
> 'root'
> 01/10/2019 09:12:07 [997:307579]: setting limits
> 01/10/2019 09:12:07 [997:307579]: setting environment
> 01/10/2019 09:12:07 [997:307579]: Initializing error file
> 01/10/2019 09:12:07 [997:307579]: switching to intermediate/target user
> 01/10/2019 09:12:07 [997:307579]: setting additional gid=0
> 01/10/2019 09:12:07 [6782:307579]: closing all filedescriptors
> 01/10/2019 09:12:07 [6782:307579]: further messages are in "error" and
> "trace"
> 01/10/2019 09:12:07 [997:307578]: Poll received POLLHUP (Hang up).
> Unregister the FD.
> 01/10/2019 09:12:07 [6782:307579]: using "/bin/bash" as shell of user
> "root"
> 01/10/2019 09:12:07 [0:307579]: now running with uid=0, euid=0
> 01/10/2019 09:12:07 [0:307579]:
> execvlp(/opt/gridengine/default/common/prolog_exec.sh,
> "/opt/gridengine/default/common/prolog_exec.sh")
> ### The process just gets stuck at the line above
>
> Here is the trace file for a qsub/batch job; there the prolog script
> apparently got executed and the process proceeded:
>
> [root at zeta-4-12 383.1]# ls
> addgrpid  config  environment  error  exit_status  job_pid  pe_hostfile
> pid  trace
> [root at zeta-4-12 383.1]# cat trace
> 01/10/2019 09:20:22 [997:315329]: shepherd called with uid = 0, euid = 997
> 01/10/2019 09:20:22 [997:315329]: starting up 8.1.9
> 01/10/2019 09:20:22 [997:315329]: setpgid(315329, 315329) returned 0
> 01/10/2019 09:20:22 [997:315329]: do_core_binding: "binding" parameter not
> found in config file
> 01/10/2019 09:20:22 [997:315329]: parent: forked "prolog" with pid 315330
> 01/10/2019 09:20:22 [997:315329]: using signal delivery delay of 120
> seconds
> 01/10/2019 09:20:22 [997:315329]: parent: prolog-pid: 315330
> 01/10/2019 09:20:22 [997:315330]: child: starting son(prolog, root@/opt/gridengine/default/common/prolog_exec.sh,
> 0, 10000);
> 01/10/2019 09:20:22 [997:315330]: pid=315330 pgrp=315330 sid=315330 old
> pgrp=315329 getlogin()=<no login set>
> 01/10/2019 09:20:22 [997:315330]: reading passwd information for user
> 'root'
> 01/10/2019 09:20:22 [997:315330]: setting limits
> 01/10/2019 09:20:22 [997:315330]: setting environment
> 01/10/2019 09:20:22 [997:315330]: Initializing error file
> 01/10/2019 09:20:22 [997:315330]: switching to intermediate/target user
> 01/10/2019 09:20:22 [997:315330]: setting additional gid=0
> 01/10/2019 09:20:22 [6782:315330]: closing all filedescriptors
> 01/10/2019 09:20:22 [6782:315330]: further messages are in "error" and
> "trace"
> 01/10/2019 09:20:22 [6782:315330]: using "/bin/bash" as shell of user
> "root"
> 01/10/2019 09:20:22 [6782:315330]: using stdout as stderr
> 01/10/2019 09:20:22 [0:315330]: now running with uid=0, euid=0
> 01/10/2019 09:20:22 [0:315330]:
> execvlp(/opt/gridengine/default/common/prolog_exec.sh,
> "/opt/gridengine/default/common/prolog_exec.sh")
> 01/10/2019 09:20:22 [997:315329]: wait3 returned 315330 (status: 0;
> WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
> 01/10/2019 09:20:22 [997:315329]: prolog exited with exit status 0
> 01/10/2019 09:20:22 [997:315329]: reaped "prolog" with pid 315330
> 01/10/2019 09:20:22 [997:315329]: prolog exited not due to signal
> 01/10/2019 09:20:22 [997:315329]: prolog exited with status 0
> 01/10/2019 09:20:22 [997:315329]: parent: forked "job" with pid 315345
> 01/10/2019 09:20:22 [997:315329]: parent: job-pid: 315345
> 01/10/2019 09:20:22 [997:315345]: child: starting son(job, sleep, 0, 4096);
> 01/10/2019 09:20:22 [997:315345]: pid=315345 pgrp=315345 sid=315345 old
> pgrp=315329 getlogin()=<no login set>
> 01/10/2019 09:20:22 [997:315345]: reading passwd information for user
> 'derlin'
> 01/10/2019 09:20:22 [997:315345]: setosjobid: uid = 0, euid = 997
> 01/10/2019 09:20:22 [997:315345]: setting limits
> 01/10/2019 09:20:22 [997:315345]: RLIMIT_CPU setting: (soft INFINITY hard
> INFINITY) resulting: (soft INFINITY hard INFINITY)
> 01/10/2019 09:20:22 [997:315345]: RLIMIT_FSIZE setting: (soft INFINITY
> hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> 01/10/2019 09:20:22 [997:315345]: RLIMIT_DATA setting: (soft INFINITY hard
> INFINITY) resulting: (soft INFINITY hard INFINITY)
> 01/10/2019 09:20:22 [997:315345]: RLIMIT_STACK setting: (soft INFINITY
> hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> 01/10/2019 09:20:22 [997:315345]: RLIMIT_CORE setting: (soft INFINITY hard
> INFINITY) resulting: (soft INFINITY hard INFINITY)
> 01/10/2019 09:20:22 [997:315345]: RLIMIT_VMEM/RLIMIT_AS setting: (soft
> 8257536000 hard 8257536000) resulting: (soft 8257536000 hard 8257536000)
> 01/10/2019 09:20:22 [997:315345]: RLIMIT_RSS setting: (soft INFINITY hard
> INFINITY) resulting: (soft INFINITY hard INFINITY)
> 01/10/2019 09:20:22 [997:315345]: setting environment
> 01/10/2019 09:20:22 [997:315345]: Initializing error file
> 01/10/2019 09:20:22 [997:315345]: switching to intermediate/target user
> 01/10/2019 09:20:22 [997:315345]: setting additional gid=20011
> 01/10/2019 09:20:22 [6782:315345]: closing all filedescriptors
> 01/10/2019 09:20:22 [6782:315345]: further messages are in "error" and
> "trace"
> 01/10/2019 09:20:22 [6782:315345]: using stdout as stderr
> 01/10/2019 09:20:22 [6782:315345]: now running with uid=6782, euid=6782
> 01/10/2019 09:20:22 [6782:315345]: execvlp(/bin/csh, "-csh" "-c" "sleep
> 10m ")
>
> I will attach my prolog script in the next post.
>
> Cheers
> Derrick
>
>
> On Wed, Jan 9, 2019 at 7:36 PM Reuti <reuti at staff.uni-marburg.de> wrote:
>
>> Hi,
>>
>> > On 09.01.2019 at 01:14, Derrick Lin <klin938 at gmail.com> wrote:
>> >
>> > Hi guys,
>> >
>> > I just brought up a new SGE cluster, but somehow the qrsh session does
>> not work:
>> >
>> > tester at login-gpu:~$ qrsh
>> > ^Cerror: error while waiting for builtin IJS connection: "got select
>> timeout"
>> >
>> > After I hit enter, the session just got stuck there forever instead of
>> bringing me to a compute node. I had to press Ctrl+C to terminate it, and it
>> gave the above error.
>> >
>> > I noticed that SGE did send my qrsh request to a compute node, as I
>> could tell from qstat:
>> >
>> >
>> ---------------------------------------------------------------------------------
>> > short.q at zeta-4-15.local        BIP   0/1/80         0.01     lx-amd64
>> >      15 0.55500 QRLOGIN    tester       r    01/09/2019 10:47:13     1
>> >
>> > We have a prolog script configured globally; the script deals with local
>> disk quota and keeps all its output in a log file for each job. So I went to
>> that compute node and checked, and found that a log file was created but it
>> was empty.
>> >
>> > So my thinking so far is that my qrsh got stuck because the prolog script
>> was not fully executed.
>>
>> Is there any statement in the prolog which could wait for stdin? In a batch
>> job there is just no stdin, hence it continues. This could be tested by
>> adding "-i" to a batch job.
>>
>> -- Reuti
>>
>>
>> > qsub jobs are working fine.
>> >
>> > Any idea will be appreciated
>> >
>> > Cheers,
>> > Derrick
>> > _______________________________________________
>> > users mailing list
>> > users at gridengine.org
>> > https://gridengine.org/mailman/listinfo/users
>>
>>