[gridengine users] Issue with epilog script on SGE 8.1.8 - epilog returning bad exit_status and blocks queues

Yuri Burmachenko yuribu at mellanox.com
Wed Mar 11 12:43:19 UTC 2015


I modified the epilog script to contain a single command:

echo $HOSTNAME > /local/tmp/sge_vars

It still fails with exit code 1.
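
One way to narrow this down is to record who runs the command, what it prints on stderr, and what it returns. The shepherd trace in this thread shows the epilog being started with uid=1373 (the job owner), so a write to /local/tmp/sge_vars happens with that user's permissions. The following is only a minimal sketch: run_logged, EPILOG_DEBUG_LOG and the /tmp log path are hypothetical names chosen for illustration, not part of SGE.

```shell
#!/bin/bash
# run_logged LOGFILE CMD [ARGS...]
# Hypothetical debug helper: appends the caller's uid, the command,
# its stderr and its exit code to LOGFILE, then returns that exit code.
run_logged() {
    local log=$1
    shift
    local rc=0
    {
        echo "--- $(date) uid=$(id -u) cwd=$PWD"
        echo "cmd: $*"
    } >> "$log"
    # Capture stderr so a "Permission denied" or "No such file" is not lost.
    "$@" 2>> "$log" || rc=$?
    echo "rc=$rc" >> "$log"
    return "$rc"
}

# Assumed log location; it must be writable by the user the epilog runs as.
LOG=${EPILOG_DEBUG_LOG:-/tmp/epilog_debug.$(id -u).log}

# The single command from the epilog, wrapped so that a failure is recorded.
run_logged "$LOG" sh -c 'echo "$HOSTNAME" > /local/tmp/sge_vars' \
    || echo "epilog command failed, see $LOG" >&2
```

After the next job finishes, the log file should show the uid and any error message behind the non-zero exit code.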

Any ideas why this strange behavior occurs?
Thanks.


From: Yuri Burmachenko
Sent: Monday, March 09, 2015 3:55 PM
To: 'users at gridengine.org'
Cc: Dmitry Leibovich
Subject: RE: Issue with epilog script on SGE 8.1.8 - epilog returning bad exit_status and blocks queues

Example of SGE mail with epilog exit status 1:


Job 3473 caused action: none

User        = yurig

Queue       = all.q at mtlx346.yok.mtl.com

Start Time  = 03/09/2015 11:19:48

End Time    = 03/09/2015 15:16:40

failed in epilog: 03/09/2015 15:16:40 [1771:10086]: exit_status of epilog = 1

Shepherd trace:

03/09/2015 11:19:48 [1771:10086]: shepherd called with uid = 0, euid = 1771

03/09/2015 11:19:48 [1771:10086]: starting up 8.1.8

03/09/2015 11:19:48 [1771:10086]: setpgid(10086, 10086) returned 0

03/09/2015 11:19:48 [1771:10087]: child: starting son(prolog, /home/sgeadmin/bin/prolog.sh, 0, 10000);

03/09/2015 11:19:48 [1771:10087]: pid=10087 pgrp=10087 sid=10087 old pgrp=10086 getlogin()=<no login set>

03/09/2015 11:19:48 [1771:10087]: reading passwd information for user 'yurig'

03/09/2015 11:19:48 [1771:10087]: setting limits

03/09/2015 11:19:48 [1771:10087]: setting environment

03/09/2015 11:19:48 [1771:10087]: Initializing error file

03/09/2015 11:19:48 [1771:10087]: switching to intermediate/target user

03/09/2015 11:19:48 [1771:10087]: setting additional gid=0

03/09/2015 11:19:48 [1771:10086]: parent: forked "prolog" with pid 10087

03/09/2015 11:19:48 [1771:10086]: using signal delivery delay of 120 seconds

03/09/2015 11:19:48 [1771:10086]: parent: prolog-pid: 10087

03/09/2015 11:19:48 [1373:10087]: closing all filedescriptors

03/09/2015 11:19:48 [1373:10087]: further messages are in "error" and "trace"

03/09/2015 11:19:48 [1373:10087]: using "/usr/bin/tcsh" as shell of user "yurig"

03/09/2015 11:19:48 [1373:10087]: now running with uid=1373, euid=1373

03/09/2015 11:19:48 [1373:10087]: execvlp(/home/sgeadmin/bin/prolog.sh, "/home/sgeadmin/bin/prolog.sh")

03/09/2015 11:19:48 [1771:10086]: wait3 returned 10087 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)

03/09/2015 11:19:48 [1771:10086]: prolog exited with exit status 0

03/09/2015 11:19:48 [1771:10086]: reaped "prolog" with pid 10087

03/09/2015 11:19:48 [1771:10086]: prolog exited not due to signal

03/09/2015 11:19:48 [1771:10086]: prolog exited with status 0

03/09/2015 11:19:48 [1771:10086]: /bin/true

03/09/2015 11:19:48 [1771:10086]: /bin/true

03/09/2015 11:19:48 [1771:10086]: parent: forked "pe_start" with pid 10105

03/09/2015 11:19:48 [1771:10086]: using signal delivery delay of 120 seconds

03/09/2015 11:19:48 [1771:10086]: parent: pe_start-pid: 10105

03/09/2015 11:19:48 [1771:10105]: child: starting son(pe_start, /bin/true, 0, 10000);

03/09/2015 11:19:48 [1771:10105]: pid=10105 pgrp=10105 sid=10105 old pgrp=10086 getlogin()=<no login set>

03/09/2015 11:19:48 [1771:10105]: reading passwd information for user 'yurig'

03/09/2015 11:19:48 [1771:10105]: setting limits

03/09/2015 11:19:48 [1771:10105]: setting environment

03/09/2015 11:19:48 [1771:10105]: Initializing error file

03/09/2015 11:19:48 [1771:10105]: switching to intermediate/target user

03/09/2015 11:19:48 [1771:10105]: setting additional gid=0

03/09/2015 11:19:48 [1373:10105]: closing all filedescriptors

03/09/2015 11:19:48 [1373:10105]: further messages are in "error" and "trace"

03/09/2015 11:19:48 [1373:10105]: using "/usr/bin/tcsh" as shell of user "yurig"

03/09/2015 11:19:48 [1373:10105]: now running with uid=1373, euid=1373

03/09/2015 11:19:48 [1373:10105]: execvlp(/bin/true, "/bin/true")

03/09/2015 11:19:48 [1771:10086]: wait3 returned 10105 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)

03/09/2015 11:19:48 [1771:10086]: pe_start exited with exit status 0

03/09/2015 11:19:48 [1771:10086]: reaped "pe_start" with pid 10105

03/09/2015 11:19:48 [1771:10086]: pe_start exited not due to signal

03/09/2015 11:19:48 [1771:10086]: pe_start exited with status 0

03/09/2015 11:19:48 [1771:10086]: parent: forked "job" with pid 10106

03/09/2015 11:19:48 [1771:10106]: child: starting son(job, /local/sge_spool/mtlx346/job_scripts/3473, 0, 4096);

03/09/2015 11:19:48 [1771:10086]: parent: job-pid: 10106

03/09/2015 11:19:48 [1771:10106]: pid=10106 pgrp=10106 sid=10106 old pgrp=10086 getlogin()=<no login set>

03/09/2015 11:19:48 [1771:10106]: reading passwd information for user 'yurig'

03/09/2015 11:19:48 [1771:10106]: setosjobid: uid = 0, euid = 1771

03/09/2015 11:19:48 [1771:10106]: setting limits

03/09/2015 11:19:48 [1771:10106]: RLIMIT_CPU setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)

03/09/2015 11:19:48 [1771:10106]: RLIMIT_FSIZE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)

03/09/2015 11:19:48 [1771:10106]: RLIMIT_DATA setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)

03/09/2015 11:19:48 [1771:10106]: RLIMIT_STACK setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)

03/09/2015 11:19:48 [1771:10106]: RLIMIT_CORE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)

03/09/2015 11:19:48 [1771:10106]: RLIMIT_VMEM/RLIMIT_AS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)

03/09/2015 11:19:48 [1771:10106]: RLIMIT_RSS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)

03/09/2015 11:19:48 [1771:10106]: setting environment

03/09/2015 11:19:48 [1771:10106]: Initializing error file

03/09/2015 11:19:48 [1771:10106]: switching to intermediate/target user

03/09/2015 11:19:48 [1771:10106]: setting additional gid=5976

03/09/2015 11:19:48 [1373:10106]: closing all filedescriptors

03/09/2015 11:19:48 [1373:10106]: further messages are in "error" and "trace"

03/09/2015 11:19:48 [1373:10106]: now running with uid=1373, euid=1373

03/09/2015 11:19:48 [1373:10106]: execvlp(/bin/csh, "-csh" "/local/sge_spool/mtlx346/job_scripts/3473")

03/09/2015 15:16:40 [1771:10086]: wait3 returned 10106 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)

03/09/2015 15:16:40 [1771:10086]: job exited with exit status 0

03/09/2015 15:16:40 [1771:10086]: reaped "job" with pid 10106

03/09/2015 15:16:40 [1771:10086]: job exited not due to signal

03/09/2015 15:16:40 [1771:10086]: job exited with status 0

03/09/2015 15:16:40 [1771:10086]: now sending signal KILL to pid -10106

03/09/2015 15:16:40 [1771:10086]: pdc_kill_addgrpid: 5976 9

03/09/2015 15:16:40 [0:10086]: killing pid 11285/10

03/09/2015 15:16:40 [0:10086]: killing pid 27126/10

03/09/2015 15:16:40 [1771:10086]: writing usage file to "usage"

03/09/2015 15:16:40 [1771:10086]: /bin/true

03/09/2015 15:16:40 [1771:10086]: /bin/true

03/09/2015 15:16:40 [1771:5923]: child: starting son(pe_stop, /bin/true, 0, 10000);

03/09/2015 15:16:40 [1771:5923]: pid=5923 pgrp=5923 sid=5923 old pgrp=10086 getlogin()=<no login set>

03/09/2015 15:16:40 [1771:5923]: reading passwd information for user 'yurig'

03/09/2015 15:16:40 [1771:5923]: setting limits

03/09/2015 15:16:40 [1771:5923]: setting environment

03/09/2015 15:16:40 [1771:10086]: parent: forked "pe_stop" with pid 5923

03/09/2015 15:16:40 [1771:10086]: using signal delivery delay of 120 seconds

03/09/2015 15:16:40 [1771:10086]: parent: pe_stop-pid: 5923

03/09/2015 15:16:40 [1771:5923]: Initializing error file

03/09/2015 15:16:40 [1771:5923]: switching to intermediate/target user

03/09/2015 15:16:40 [1771:5923]: setting additional gid=0

03/09/2015 15:16:40 [1373:5923]: closing all filedescriptors

03/09/2015 15:16:40 [1373:5923]: further messages are in "error" and "trace"

03/09/2015 15:16:40 [1373:5923]: using "/usr/bin/tcsh" as shell of user "yurig"

03/09/2015 15:16:40 [1373:5923]: now running with uid=1373, euid=1373

03/09/2015 15:16:40 [1373:5923]: execvlp(/bin/true, "/bin/true")

03/09/2015 15:16:40 [1771:10086]: wait3 returned 5923 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)

03/09/2015 15:16:40 [1771:10086]: pe_stop exited with exit status 0

03/09/2015 15:16:40 [1771:10086]: reaped "pe_stop" with pid 5923

03/09/2015 15:16:40 [1771:10086]: pe_stop exited not due to signal

03/09/2015 15:16:40 [1771:10086]: pe_stop exited with status 0

03/09/2015 15:16:40 [1771:5924]: child: starting son(epilog, /home/sgeadmin/bin/epilogsosge.sh, 0, 10000);

03/09/2015 15:16:40 [1771:5924]: pid=5924 pgrp=5924 sid=5924 old pgrp=10086 getlogin()=<no login set>

03/09/2015 15:16:40 [1771:5924]: reading passwd information for user 'yurig'

03/09/2015 15:16:40 [1771:5924]: setting limits

03/09/2015 15:16:40 [1771:10086]: parent: forked "epilog" with pid 5924

03/09/2015 15:16:40 [1771:5924]: setting environment

03/09/2015 15:16:40 [1771:10086]: using signal delivery delay of 120 seconds

03/09/2015 15:16:40 [1771:10086]: parent: epilog-pid: 5924

03/09/2015 15:16:40 [1771:5924]: Initializing error file

03/09/2015 15:16:40 [1771:5924]: switching to intermediate/target user

03/09/2015 15:16:40 [1771:5924]: setting additional gid=0

03/09/2015 15:16:40 [1373:5924]: closing all filedescriptors

03/09/2015 15:16:40 [1373:5924]: further messages are in "error" and "trace"

03/09/2015 15:16:40 [1373:5924]: using "/usr/bin/tcsh" as shell of user "yurig"

03/09/2015 15:16:40 [1373:5924]: now running with uid=1373, euid=1373

03/09/2015 15:16:40 [1373:5924]: execvlp(/home/sgeadmin/bin/epilogsosge.sh, "/home/sgeadmin/bin/epilogsosge.sh")

03/09/2015 15:16:40 [1771:10086]: wait3 returned 5924 (status: 256; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 1)

03/09/2015 15:16:40 [1771:10086]: epilog exited with exit status 1

03/09/2015 15:16:40 [1771:10086]: reaped "epilog" with pid 5924

03/09/2015 15:16:40 [1771:10086]: epilog exited not due to signal

03/09/2015 15:16:40 [1771:10086]: epilog exited with status 1

03/09/2015 15:16:40 [1771:10086]: exit_status of epilog = 1
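
For reference, the raw value in "wait3 returned 5924 (status: 256; ...)" and the reported exit status 1 are the same information. A small sketch of the decoding, assuming the usual Linux wait-status layout (low 7 bits = terminating signal, next 8 bits = exit code); decode_wait_status is a hypothetical helper name, and stopped/continued states are deliberately ignored:

```shell
#!/bin/bash
# Decode a raw wait status roughly the way WIFSIGNALED/WEXITSTATUS do
# on Linux. Simplified: stopped and continued states are not handled.
decode_wait_status() {
    local status=$1
    local sig=$(( status & 0x7f ))          # low 7 bits: signal number, 0 if none
    local code=$(( (status >> 8) & 0xff ))  # next 8 bits: exit code
    if [ "$sig" -eq 0 ]; then
        echo "exited code=$code"
    else
        echo "signaled sig=$sig"
    fi
}

decode_wait_status 256   # the epilog line in the trace: exited code=1
decode_wait_status 0     # prolog, pe_start, job, pe_stop: exited code=0
```

So status 256 means the epilog process itself called exit with code 1; it was not killed by a signal.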



Shepherd error:

03/09/2015 15:16:40 [1771:10086]: exit_status of epilog = 1



Shepherd pe_hostfile:

mtlx346.yok.mtl.com 4 all.q at mtlx346.yok.mtl.com UNDEFINED


From: Yuri Burmachenko
Sent: Monday, March 09, 2015 2:18 PM
To: 'users at gridengine.org'
Cc: Dmitry Leibovich
Subject: Issue with epilog script on SGE 8.1.8 - epilog returning bad exit_status and blocks queues

Hello to the distinguished forum members,

I hope you can assist me.
We are in the process of piloting SGE 8.1.8.

All submitted jobs fail in the epilog script with exit_status 1 or 2, which causes the SGE queues to be put into error state.
As a debugging measure we modified the epilog script so that it just echoes various environment variables; it still exits with exit_status 1.
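
The link between the epilog exit code and the queue error state can be written as a small lookup. This is a sketch based on the prolog/epilog exit code description in the sge_conf(5) man page as it is usually documented (0 = success, 99 = reschedule the job, 100 = put the job in error state, anything else = put the queue in error state). The function name is hypothetical, and the mapping should be verified against the man page of the installed version.

```shell
#!/bin/bash
# describe_epilog_exit CODE
# Hypothetical helper: prints the consequence that sge_conf(5) is assumed
# to document for a given prolog/epilog exit code.
describe_epilog_exit() {
    case "$1" in
        0)   echo "success" ;;
        99)  echo "job is rescheduled" ;;
        100) echo "job is put in error state" ;;
        *)   echo "queue is put in error state" ;;
    esac
}

describe_epilog_exit 1
describe_epilog_exit 2
```

Under that assumption, both exit codes seen here (1 and 2) fall into the catch-all branch, which matches the queues going into error state.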

NOTE: We use both prolog and epilog scripts, and both are written in bash. We don't have any issues with the prolog.

Any tips on how to resolve this will be greatly appreciated.
Thank You.


Yuri Burmachenko | Sr. Engineer | IT | Mellanox Technologies Ltd.
Work: +972 74 7236386 | Cell +972 54 7542188 |Fax: +972 4 959 3245
Follow us on Twitter<http://twitter.com/mellanoxtech> and Facebook<http://www.facebook.com/pages/Mellanox-Technologies/223164879116>
