[gridengine users] can't open file job_pid: Permission denied

Schmidt, Burkhard bs at cpfs.mpg.de
Wed Sep 28 13:41:31 UTC 2011


Hello,

I'm running SGE 6.2u5 on an Xserve cluster running Mac OS X Server v10.6 Snow Leopard with Open Directory network accounts. All users belong to the same default group staff.

I have a strange problem: some users are unable to run qlogin or submit a job. On the qmaster, I see

09/28/2011 15:05:08|worker|xserve01|W|job 295287.1 failed on host xserve12.cpfs.mpg.de general before job because: 09/28/2011 15:05:08 [1319:18564]: can't open file job_pid: Permission denied
09/28/2011 15:05:08|worker|xserve01|E|queue late06.q marked QERROR as result of job 295287's failure at host xserve12.cpfs.mpg.de

and in the corresponding error message sent by mail, I see entries like those attached at the end of this message.

This happens to some users only. The only common property of the failing accounts that I can see at the moment is that they were created after the upgrade of the OD master from v10.5 Leopard to v10.6 Snow Leopard.
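
One check that might narrow this down is to compare how a working and a failing account resolve on the execution host. The following is only a minimal sketch, assuming Python 3 is available there; the function names are made up for illustration, and the two user names are passed on the command line. It says nothing about the actual cause; it only lists the fields that differ.

```python
import os
import pwd
import sys

def account_record(user):
    """Collect the attributes the OS resolves for a user name."""
    pw = pwd.getpwnam(user)  # raises KeyError if the name does not resolve
    return {
        "uid": pw.pw_uid,
        "gid": pw.pw_gid,
        "home": pw.pw_dir,
        "shell": pw.pw_shell,
        "groups": sorted(os.getgrouplist(user, pw.pw_gid)),
    }

def diff_records(a, b, ignore=("uid", "home")):
    """Return {field: (a_value, b_value)} for fields that differ.

    uid and home are ignored by default because they always differ
    between two distinct accounts.
    """
    return {k: (a[k], b.get(k)) for k in a if k not in ignore and a[k] != b.get(k)}

if __name__ == "__main__":
    # Usage: script.py <working_user> <failing_user>
    if len(sys.argv) == 3:
        working, failing = sys.argv[1], sys.argv[2]
        diff = diff_records(account_record(working), account_record(failing))
        for field, (w, f) in diff.items():
            print(f"{field}: {working}={w!r} {failing}={f!r}")
```

If the primary group or the group list differs between the two accounts, that would be a natural place to look next, since all users are expected to be in staff.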

I'd be thankful for any hints on where to look for the origin of this problem.

Best regards, Burkhard.

Job 295287 caused action: Queue "late06.q@xserve12.cpfs.mpg.de" set to ERROR
User        = bschmidt4
Queue       = late06.q@xserve12.cpfs.mpg.de
Start Time  = <unknown>
End Time    = <unknown>
failed before job:09/28/2011 15:05:08 [1319:18564]: can't open file job_pid: Permission denied
Shepherd trace:
09/28/2011 15:05:06 [501:18562]: shepherd called with uid = 0, euid = 501
09/28/2011 15:05:06 [501:18562]: qlogin_daemon = builtin
09/28/2011 15:05:06 [501:18562]: starting up 6.2u5
09/28/2011 15:05:06 [501:18562]: setpgid(18562, 18562) returned 0
09/28/2011 15:05:06 [501:18562]: no prolog script to start
09/28/2011 15:05:06 [501:18562]: pipe to child uses fds 4 and 5
09/28/2011 15:05:06 [501:18562]: calling fork_pty()
09/28/2011 15:05:06 [501:18562]: parent: forked "job" with pid 18564
09/28/2011 15:05:06 [501:18562]: parent: job-pid: 18564
09/28/2011 15:05:06 [501:18562]: parent: closing childs end of the pipe
09/28/2011 15:05:06 [501:18562]: csp = 0
09/28/2011 15:05:06 [501:18562]: parent: starting parent loop with remote_host =xserve01.cpfs.mpg.de, remote_port = 62902, job_owner = bschmidt4, fd_pty_master = 6, fd_pipe_in = -1, fd_pipe_out = -1, fd_pipe_err = -1, fd_pipe_to_child = 5
09/28/2011 15:05:06 [501:18562]: parent: opening connection to qrsh/qlogin client
09/28/2011 15:05:06 [501:18564]: child: closing parents end of the pipe
09/28/2011 15:05:06 [501:18564]: child: trying to read from parent through the pipe
09/28/2011 15:05:06 [501:18562]: parent: sending REGISTER_CTRL_MSG to qrsh/qlogin client
09/28/2011 15:05:06 [501:18562]: parent: creating pty_to_commlib thread
09/28/2011 15:05:06 [501:18562]: parent: creating commlib_to_pty thread
09/28/2011 15:05:06 [501:18562]: parent: created both worker threads, now waiting for jobs end
09/28/2011 15:05:06 [501:18562]: commlib_to_pty: received window size message, changing window size
09/28/2011 15:05:06 [501:18562]: commlib_to_pty: received settings message
09/28/2011 15:05:06 [501:18562]: commlib_to_pty: writing to child 11 bytes: noshell = 0
09/28/2011 15:05:06 [501:18564]: child: parent sent us 'noshell = 0'
09/28/2011 15:05:06 [501:18564]: child: starting son(job, QLOGIN, 0);
09/28/2011 15:05:06 [501:18564]: processing qlogin job
09/28/2011 15:05:06 [501:18564]: pid=18564 pgrp=18564 sid=18564 old pgrp=18564 getlogin()=_atsserver
09/28/2011 15:05:06 [501:18564]: reading passwd information for user 'bschmidt4'
09/28/2011 15:05:06 [501:18564]: setosjobid: uid = 0, euid = 501
09/28/2011 15:05:06 [501:18564]: setting limits
09/28/2011 15:05:06 [501:18564]: RLIMIT_CPU setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
09/28/2011 15:05:06 [501:18564]: RLIMIT_FSIZE setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
09/28/2011 15:05:06 [501:18564]: RLIMIT_DATA setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
09/28/2011 15:05:06 [501:18564]: RLIMIT_STACK setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 67104768 hard 67104768)
09/28/2011 15:05:06 [501:18564]: RLIMIT_CORE setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
09/28/2011 15:05:06 [501:18564]: RLIMIT_RSS setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
09/28/2011 15:05:06 [501:18564]: RLIMIT_RSS setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
09/28/2011 15:05:06 [501:18564]: setting environment
09/28/2011 15:05:06 [501:18564]: Initializing error file
09/28/2011 15:05:06 [501:18564]: switching to intermediate/target user
09/28/2011 15:05:06 [1319:18564]: closing all filedescriptors
09/28/2011 15:05:06 [1319:18564]: further messages are in "error" and "trace"
09/28/2011 15:05:08 [1319:18564]: now running with uid=1319, euid=1319
09/28/2011 15:05:08 [1319:18564]: execle(, -(null), NULL, env)
09/28/2011 15:05:08 [1319:18564]: parent: forked "job" with pid 0
09/28/2011 15:05:08 [1319:18564]: can't open file job_pid: Permission denied
09/28/2011 15:05:08 [501:18562]: pty_to_commlib: our child seems to have exited -> exiting
09/28/2011 15:05:08 [501:18562]: wait3 returned 18564 (status: 2816; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 11)
09/28/2011 15:05:08 [501:18562]: job exited with exit status 11
09/28/2011 15:05:08 [501:18562]: parent: wait_my_child returned exit_status = 2816
09/28/2011 15:05:08 [501:18562]: parent:            rusage.ru_stime.tv_sec  = 0
09/28/2011 15:05:08 [501:18562]: parent:            rusage.ru_stime.tv_usec = 2910
09/28/2011 15:05:08 [501:18562]: parent:            rusage.ru_utime.tv_sec  = 0
09/28/2011 15:05:08 [501:18562]: parent:            rusage.ru_utime.tv_usec = 1344
09/28/2011 15:05:08 [501:18562]: parent: received event 1000, g_raised_event = 2
09/28/2011 15:05:08 [501:18562]: parent: shutting down pty_to_commlib thread
09/28/2011 15:05:08 [501:18562]: parent: shutting down commlib_to_pty thread
09/28/2011 15:05:08 [501:18562]: parent: thread_cleanup_lib()
09/28/2011 15:05:08 [501:18562]: parent: leaving main loop. From here on, only the main thread is running.
09/28/2011 15:05:08 [501:18562]: reaped "job" with pid 18564
09/28/2011 15:05:08 [501:18562]: job exited not due to signal
09/28/2011 15:05:08 [501:18562]: job exited with status 11
09/28/2011 15:05:08 [501:18562]: now sending signal KILL to pid -18564
09/28/2011 15:05:08 [501:18562]: no tasker to notify
09/28/2011 15:05:08 [501:18562]: failed starting job
09/28/2011 15:05:08 [501:18562]: no epilog script to start
09/28/2011 15:05:08 [501:18562]: writing exit status to qrsh: 0
09/28/2011 15:05:08 [501:18562]: sending UNREGISTER_CTRL_MSG with exit_status = "0"
09/28/2011 15:05:08 [501:18562]: sending to host: xserve01.cpfs.mpg.de
09/28/2011 15:05:08 [501:18562]: waiting for UNREGISTER_RESPONSE_CTRL_MSG
09/28/2011 15:05:08 [501:18562]: Received UNREGISTER_RESPONSE_CTRL_MSG
09/28/2011 15:05:08 [501:18562]: parent: cl_com_ignore_timeouts
09/28/2011 15:05:08 [501:18562]: parent: leaving closinge_parent_loop()

Shepherd error:
09/28/2011 15:05:08 [1319:18564]: can't open file job_pid: Permission denied

Shepherd pe_hostfile:
xserve12.cpfs.mpg.de 1 late06.q@xserve12.cpfs.mpg.de UNDEFINED
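
For reading a trace like the one above, the following is a rough sketch of a helper that picks out where the logged user id changes and which id hit the error. It assumes each line has the form "date time [id:pid]: message", as seen in this trace; the first bracketed number appears to be the effective uid (it reads 501 where the shepherd reports euid = 501). The function names are made up for illustration and are not part of SGE.

```python
import re

# Shepherd trace lines look like: "MM/DD/YYYY HH:MM:SS [uid:pid]: message"
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \[(\d+):(\d+)\]: (.*)$"
)

def parse_trace(text):
    """Return (timestamp, uid, pid, message) tuples for lines that match."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            ts, uid, pid, msg = m.groups()
            entries.append((ts, int(uid), int(pid), msg))
    return entries

def uid_switches(entries):
    """Per pid, report each point where the logged uid changes."""
    last_uid = {}
    switches = []
    for ts, uid, pid, msg in entries:
        prev = last_uid.get(pid)
        if prev is not None and prev != uid:
            switches.append((pid, prev, uid, msg))
        last_uid[pid] = uid
    return switches

def decode_wait_status(status):
    """Exit code of a normally exited child: the high byte of the status."""
    return (status >> 8) & 0xFF

# A few lines copied from the trace above.
SAMPLE = """\
09/28/2011 15:05:06 [501:18564]: switching to intermediate/target user
09/28/2011 15:05:06 [1319:18564]: closing all filedescriptors
09/28/2011 15:05:08 [1319:18564]: can't open file job_pid: Permission denied
09/28/2011 15:05:08 [501:18562]: wait3 returned 18564 (status: 2816; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 11)
"""

if __name__ == "__main__":
    entries = parse_trace(SAMPLE)
    for pid, old, new, msg in uid_switches(entries):
        print(f"pid {pid}: uid {old} -> {new} at: {msg}")
    for ts, uid, pid, msg in entries:
        if "Permission denied" in msg:
            print(f"uid {uid} (pid {pid}) failed: {msg}")
    print("exit code:", decode_wait_status(2816))
```

On the sample lines this shows pid 18564 changing from 501 to 1319 just before the failing open of job_pid, and the wait status 2816 decodes to 11 (2816 = 11 * 256), which matches the WEXITSTATUS in the trace.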