[gridengine users] can't open file job_pid: Permission denied

Reuti reuti at staff.uni-marburg.de
Thu Sep 29 18:50:04 UTC 2011


Hi,

On 28.09.2011 at 15:41, Schmidt, Burkhard wrote:

> I'm running SGE 6.2u5 on an Xserve cluster running Mac OS X Server  
> v10.6 Snow Leopard with Open Directory network accounts. All users  
> belong to the same default group staff.

Is the complete cluster running OS X, or only the master node, or only  
the slaves?

There were issues in the past caused by an account having too many  
additional groups, but I'm not sure whether that applies here, as the  
error message was different.

http://gridengine.org/pipermail/users/2011-March/000447.html

Nevertheless: can you check the group count of the users in question?
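
One way to check this is sketched below. It is only a rough helper, not something shipped with SGE: it assumes the POSIX `id -G` utility is available on the exec host and that `SC_NGROUPS_MAX` is a known `sysconf` name there. The function names are made up for illustration. Note that `id -G` counts the primary group as well, while NGROUPS_MAX is usually described as a limit on supplementary groups, so the two numbers are only printed side by side, not judged.

```python
import os
import subprocess


def parse_group_ids(id_output):
    """Turn the whitespace-separated output of `id -G` into a list of GIDs."""
    return [int(field) for field in id_output.split()]


def group_ids(user):
    """Return all group IDs (primary plus supplementary) that `id -G` reports."""
    result = subprocess.run(
        ["id", "-G", user], capture_output=True, text=True, check=True
    )
    return parse_group_ids(result.stdout)


def report(user):
    """Show the user's group count next to the system's NGROUPS_MAX value."""
    count = len(group_ids(user))
    # Assumption: the platform exposes SC_NGROUPS_MAX through sysconf.
    limit = os.sysconf("SC_NGROUPS_MAX")
    return f"{user}: {count} groups (NGROUPS_MAX = {limit})"
```

Running `print(report("bschmidt4"))` on xserve12 for a failing account and again for a working one should show whether the group counts differ noticeably.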

-- Reuti


> I have a weird problem with some users who are unable to run qlogin  
> or submit a job. On the qmaster, I see
>
> 09/28/2011 15:05:08|worker|xserve01|W|job 295287.1 failed on host  
> xserve12.cpfs.mpg.de general before job because: 09/28/2011 15:05:08  
> [1319:18564]: can't open file job_pid: Permission denied
> 09/28/2011 15:05:08|worker|xserve01|E|queue late06.q marked QERROR  
> as result of job 295287's failure at host xserve12.cpfs.mpg.de
>
> and in the corresponding error message sent by mail, I see entries  
> like those attached at the end of this message.
>
> This happens to some users only. The only common property of the  
> failing accounts I can see at the moment is that these have been  
> created after the upgrade of the OD master from v10.5 Leopard to  
> v10.6 Snow Leopard.
>
> I'd be thankful for any hints where to search for the origin of this  
> problem.
>
> Best regards, Burkhard.
>
> Job 295287 caused action: Queue "late06.q at xserve12.cpfs.mpg.de" set  
> to ERROR
> User        = bschmidt4
> Queue       = late06.q at xserve12.cpfs.mpg.de
> Start Time  = <unknown>
> End Time    = <unknown>
> failed before job:09/28/2011 15:05:08 [1319:18564]: can't open file  
> job_pid: Permission denied
> Shepherd trace:
> 09/28/2011 15:05:06 [501:18562]: shepherd called with uid = 0, euid  
> = 501
> 09/28/2011 15:05:06 [501:18562]: qlogin_daemon = builtin
> 09/28/2011 15:05:06 [501:18562]: starting up 6.2u5
> 09/28/2011 15:05:06 [501:18562]: setpgid(18562, 18562) returned 0
> 09/28/2011 15:05:06 [501:18562]: no prolog script to start
> 09/28/2011 15:05:06 [501:18562]: pipe to child uses fds 4 and 5
> 09/28/2011 15:05:06 [501:18562]: calling fork_pty()
> 09/28/2011 15:05:06 [501:18562]: parent: forked "job" with pid 18564
> 09/28/2011 15:05:06 [501:18562]: parent: job-pid: 18564
> 09/28/2011 15:05:06 [501:18562]: parent: closing childs end of the  
> pipe
> 09/28/2011 15:05:06 [501:18562]: csp = 0
> 09/28/2011 15:05:06 [501:18562]: parent: starting parent loop with  
> remote_host =xserve01.cpfs.mpg.de, remote_port = 62902, job_owner =  
> bschmidt4, fd_pty_master = 6, fd_pipe_in = -1, fd_pipe_out = -1,  
> fd_pipe_err = -1, fd_pipe_to_child = 5
> 09/28/2011 15:05:06 [501:18562]: parent: opening connection to qrsh/ 
> qlogin client
> 09/28/2011 15:05:06 [501:18564]: child: closing parents end of the  
> pipe
> 09/28/2011 15:05:06 [501:18564]: child: trying to read from parent  
> through the pipe
> 09/28/2011 15:05:06 [501:18562]: parent: sending REGISTER_CTRL_MSG  
> to qrsh/qlogin client
> 09/28/2011 15:05:06 [501:18562]: parent: creating pty_to_commlib  
> thread
> 09/28/2011 15:05:06 [501:18562]: parent: creating commlib_to_pty  
> thread
> 09/28/2011 15:05:06 [501:18562]: parent: created both worker  
> threads, now waiting for jobs end
> 09/28/2011 15:05:06 [501:18562]: commlib_to_pty: received window  
> size message, changing window size
> 09/28/2011 15:05:06 [501:18562]: commlib_to_pty: received settings  
> message
> 09/28/2011 15:05:06 [501:18562]: commlib_to_pty: writing to child 11  
> bytes: noshell = 0
> 09/28/2011 15:05:06 [501:18564]: child: parent sent us 'noshell = 0'
> 09/28/2011 15:05:06 [501:18564]: child: starting son(job, QLOGIN, 0);
> 09/28/2011 15:05:06 [501:18564]: processing qlogin job
> 09/28/2011 15:05:06 [501:18564]: pid=18564 pgrp=18564 sid=18564 old  
> pgrp=18564 getlogin()=_atsserver
> 09/28/2011 15:05:06 [501:18564]: reading passwd information for user  
> 'bschmidt4'
> 09/28/2011 15:05:06 [501:18564]: setosjobid: uid = 0, euid = 501
> 09/28/2011 15:05:06 [501:18564]: setting limits
> 09/28/2011 15:05:06 [501:18564]: RLIMIT_CPU setting: (soft  
> 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard  
> 0INFINITY)
> 09/28/2011 15:05:06 [501:18564]: RLIMIT_FSIZE setting: (soft  
> 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard  
> 0INFINITY)
> 09/28/2011 15:05:06 [501:18564]: RLIMIT_DATA setting: (soft  
> 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard  
> 0INFINITY)
> 09/28/2011 15:05:06 [501:18564]: RLIMIT_STACK setting: (soft  
> 0INFINITY hard 0INFINITY) resulting: (soft 67104768 hard 67104768)
> 09/28/2011 15:05:06 [501:18564]: RLIMIT_CORE setting: (soft  
> 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard  
> 0INFINITY)
> 09/28/2011 15:05:06 [501:18564]: RLIMIT_RSS setting: (soft  
> 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard  
> 0INFINITY)
> 09/28/2011 15:05:06 [501:18564]: RLIMIT_RSS setting: (soft  
> 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard  
> 0INFINITY)
> 09/28/2011 15:05:06 [501:18564]: setting environment
> 09/28/2011 15:05:06 [501:18564]: Initializing error file
> 09/28/2011 15:05:06 [501:18564]: switching to intermediate/target user
> 09/28/2011 15:05:06 [1319:18564]: closing all filedescriptors
> 09/28/2011 15:05:06 [1319:18564]: further messages are in "error"  
> and "trace"
> 09/28/2011 15:05:08 [1319:18564]: now running with uid=1319, euid=1319
> 09/28/2011 15:05:08 [1319:18564]: execle(, -(null), NULL, env)
> 09/28/2011 15:05:08 [1319:18564]: parent: forked "job" with pid 0
> 09/28/2011 15:05:08 [1319:18564]: can't open file job_pid:  
> Permission denied
> 09/28/2011 15:05:08 [501:18562]: pty_to_commlib: our child seems to  
> have exited -> exiting
> 09/28/2011 15:05:08 [501:18562]: wait3 returned 18564 (status: 2816;  
> WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 11)
> 09/28/2011 15:05:08 [501:18562]: job exited with exit status 11
> 09/28/2011 15:05:08 [501:18562]: parent: wait_my_child returned  
> exit_status = 2816
> 09/28/2011 15:05:08 [501:18562]: parent:             
> rusage.ru_stime.tv_sec  = 0
> 09/28/2011 15:05:08 [501:18562]: parent:             
> rusage.ru_stime.tv_usec = 2910
> 09/28/2011 15:05:08 [501:18562]: parent:             
> rusage.ru_utime.tv_sec  = 0
> 09/28/2011 15:05:08 [501:18562]: parent:             
> rusage.ru_utime.tv_usec = 1344
> 09/28/2011 15:05:08 [501:18562]: parent: received event 1000,  
> g_raised_event = 2
> 09/28/2011 15:05:08 [501:18562]: parent: shutting down  
> pty_to_commlib thread
> 09/28/2011 15:05:08 [501:18562]: parent: shutting down  
> commlib_to_pty thread
> 09/28/2011 15:05:08 [501:18562]: parent: thread_cleanup_lib()
> 09/28/2011 15:05:08 [501:18562]: parent: leaving main loop. From  
> here on, only the main thread is running.
> 09/28/2011 15:05:08 [501:18562]: reaped "job" with pid 18564
> 09/28/2011 15:05:08 [501:18562]: job exited not due to signal
> 09/28/2011 15:05:08 [501:18562]: job exited with status 11
> 09/28/2011 15:05:08 [501:18562]: now sending signal KILL to pid -18564
> 09/28/2011 15:05:08 [501:18562]: no tasker to notify
> 09/28/2011 15:05:08 [501:18562]: failed starting job
> 09/28/2011 15:05:08 [501:18562]: no epilog script to start
> 09/28/2011 15:05:08 [501:18562]: writing exit status to qrsh: 0
> 09/28/2011 15:05:08 [501:18562]: sending UNREGISTER_CTRL_MSG with  
> exit_status = "0"
> 09/28/2011 15:05:08 [501:18562]: sending to host: xserve01.cpfs.mpg.de
> 09/28/2011 15:05:08 [501:18562]: waiting for  
> UNREGISTER_RESPONSE_CTRL_MSG
> 09/28/2011 15:05:08 [501:18562]: Received UNREGISTER_RESPONSE_CTRL_MSG
> 09/28/2011 15:05:08 [501:18562]: parent: cl_com_ignore_timeouts
> 09/28/2011 15:05:08 [501:18562]: parent: leaving  
> closinge_parent_loop()
>
> Shepherd error:
> 09/28/2011 15:05:08 [1319:18564]: can't open file job_pid:  
> Permission denied
>
> Shepherd pe_hostfile:
> xserve12.cpfs.mpg.de 1 late06.q at xserve12.cpfs.mpg.de UNDEFINED