[gridengine users] grid jobs not visible with qstat output

Reuti reuti at staff.uni-marburg.de
Wed May 13 11:41:30 UTC 2015


> On 13.05.2015 at 13:36, <sudha.penmetsa at wipro.com> wrote:
> 
> Hi Reuti,
> 
> qconf -sconf currently shows the following setting:
> execd_params                 enable_windomacc=true
> 
> Can you please confirm whether we can add the new parameter as shown below, or should it be defined differently?
> 
> execd_params                 enable_windomacc=true ENABLE_ADDGRP_KILL=TRUE

It's correct. - Reuti
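
A minimal sketch for double-checking the setting after the configuration has been edited. It assumes `qconf` is on the PATH and that all execd_params entries sit on a single line separated by spaces or commas. The helper name has_execd_param is made up for this example, and the case-insensitive match is a convenience of the sketch, not a statement about how SGE parses the value.

```shell
# Hypothetical helper: reads `qconf -sconf`-style text on stdin and succeeds
# if the execd_params line contains the given parameter name.
has_execd_param() {
    # $1: parameter name, e.g. ENABLE_ADDGRP_KILL
    awk -v want="$1" '
        $1 == "execd_params" {
            gsub(/,/, " ")                    # treat commas like spaces
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")            # kv[1] is the name, kv[2] the value
                if (toupper(kv[1]) == toupper(want)) found = 1
            }
        }
        END { exit(found ? 0 : 1) }
    '
}

# Usage on a live cluster (assumption: qconf is available):
#   qconf -sconf | has_execd_param ENABLE_ADDGRP_KILL && echo "parameter is set"
```

The helper only checks that the name is present; it does not look at the value after the equals sign.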


> 
> Regards,
> Sudha
> 
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Wednesday, May 13, 2015 4:17 PM
> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
> Cc: users at gridengine.org
> Subject: Re: [gridengine users] grid jobs not visible with qstat output
> 
> 
>> On 13.05.2015 at 12:35, <sudha.penmetsa at wipro.com> wrote:
>> 
>> Hi Reuti,
>> 
>> I tested again, and now the process is killed after the job is deleted with qdel job_id. The test results are below.
>> 
>> After the job started, the process was running on the execution host:
>> 
>> qstat -j 8150628
>> =================================================
>> job_number:                 8150628
>> exec_file:                  job_scripts/8150628
>> submission_time:            Wed May 13 13:00:08 2015
>> owner:                      spenmets
>> uid:                        78566
>> group:                      newgrp1
>> gid:                        1018
>> 
>> =================================================
>> [spenmets at node2 homes/users/spenmets]$ps -au spenmets
>> PID TTY          TIME CMD
>> 10837 pts/12   00:00:00 qrsh_starter
>> 10911 pts/12   00:00:00 xterm
> 
> As long as the process stays attached to the `qrsh_starter`, it will be killed too, because SGE kills the complete process group. The problem arises when a process jumps out of the process tree and can only be detected by its additional group ID. In that case "execd_params ENABLE_ADDGRP_KILL=TRUE" must also be set in SGE's configuration so that this mechanism can take effect.
> 
> -- Reuti
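
To look for processes that have left the process tree but still carry a job's additional group ID, something like the following could be used. It is only a sketch: the function name pids_with_group is hypothetical, and it assumes a Linux /proc layout where each status file has a "Groups:" line.

```shell
# Hypothetical helper: print the PIDs whose "Groups:" line in
# <procroot>/<pid>/status contains the given GID.
pids_with_group() {
    gid="$1"
    procroot="${2:-/proc}"        # overridable so the helper can be tried on a copy
    for status in "$procroot"/[0-9]*/status; do
        [ -r "$status" ] || continue
        if awk -v gid="$gid" '
            $1 == "Groups:" { for (i = 2; i <= NF; i++) if ($i == gid) hit = 1 }
            END { exit(hit ? 0 : 1) }
        ' "$status" 2>/dev/null; then
            pid="${status%/status}"   # strip the trailing /status
            echo "${pid##*/}"         # keep only the PID directory name
        fi
    done
}

# Usage (assumption: the addgrpid file of the job holds a single numeric GID):
#   pids_with_group "$(cat /opt/sge/default/spool/active_jobs/8143543.1/addgrpid)"
```

Any PID it prints for a job that qstat no longer knows about is a candidate for a leftover process.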
> 
> 
>> =================================================
>> 
>> [spenmets at node2 proc/10837]$cat status
>> Name:   qrsh_starter
>> Gid:    1018    1018    1018    1018
>> Utrace: 0
>> FDSize: 64
>> Groups: 1000 1018 1025 1030 27000 27001 27007 27010 27014 27017 27025
>> ================================================
>> 
>> gridnode @ /xxxxx/xxxxx/xxxxx : qdel 8150628
>> registered the job 8150628 for deletion
>> gridnode @ /xxxxx/xxxxx/xxxxx : qstat -j 8150628
>> Following jobs do not exist:
>> 8150628
>> 
>> ===============================================
>> 
>> [spenmets at node2 homes/users/spenmets]$ps 10837
>> PID TTY      STAT   TIME COMMAND
>> [spenmets at node2 homes/users/spenmets]$cd /proc/10837
>> -bash: cd: /proc/10837: No such file or directory
>> 
>> Does this mean the problem is not related to the tight integration of SSH into SGE?
>> 
>> Regards,
>> Sudha
>> 
>> -----Original Message-----
>> From: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>> Sent: Wednesday, May 13, 2015 1:15 PM
>> To: 'Reuti'
>> Cc: users at gridengine.org
>> Subject: RE: [gridengine users] grid jobs not visible with qstat output
>> 
>> Hi Reuti,
>> 
>> The value in /opt/sge/default/spool/active_jobs/8143543.1/addgrpid
>> does not appear in /proc/.
>> 
>> But the child processes of the job are present in /proc/.
>> 
>> Can you please suggest a solution?
>> 
>> Regards,
>> Sudha
>> 
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Tuesday, May 12, 2015 8:53 PM
>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>> Cc: prod.feng at gmail.com; users at gridengine.org
>> Subject: Re: [gridengine users] grid jobs not visible with qstat output
>> 
>> 
>>> On 12.05.2015 at 17:03, <sudha.penmetsa at wipro.com> wrote:
>>> 
>>> Hi Reuti,
>>> 
>>> The page you suggested
>>> (https://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html) says
>>> the following:
>>> 
>>> "To  have a tight integration of SSH into SGE, the started sshd needs an additional group ID to be attached."
>>> 
>>> We checked the configuration on our side, and the addgrpid file is generated:
>>> 
>>> /opt/sge/default/spool/active_jobs/8143543.1 : ls addgrpid
>> 
>> Yes, but it is not attached to all processes. Processes running under a tight integration need it attached, which shows up in /proc like this:
>> 
>> reuti at node:/proc/24989> cat status
>> ...
>> Groups: 20082 24000 25000
>> 
>> Here 20082 is the additional group ID.
>> 
>> -- Reuti
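
One rough way to spot which entry is the additional one is to subtract the user's regular groups from the "Groups:" line. This is a sketch under assumptions: the helper name extra_groups is invented here, and the regular groups are passed in as a space-separated list, for instance from `id -G`.

```shell
# Hypothetical helper: print GIDs from the "Groups:" line of a /proc status
# file that are not in the space-separated list of the user's regular GIDs.
extra_groups() {
    status_file="$1"
    known="$2"
    awk -v known="$known" '
        BEGIN {
            n = split(known, k, " ")
            for (i = 1; i <= n; i++) seen[k[i]] = 1   # remember regular GIDs
        }
        $1 == "Groups:" {
            for (i = 2; i <= NF; i++) if (!($i in seen)) print $i
        }
    ' "$status_file"
}

# Usage (PID and user name taken from the example above):
#   extra_groups /proc/24989/status "$(id -G reuti)"
```

If the user's group membership changed after the process started, the result can contain more than the job's additional group ID, so treat it as a hint only.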
>> 
>> 
>>> 
>>> Regards,
>>> Sudha
>>> 
>>> -----Original Message-----
>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>> Sent: Monday, May 11, 2015 2:08 AM
>>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>>> Cc: prod.feng at gmail.com; users at gridengine.org
>>> Subject: Re: [gridengine users] grid jobs not visible with qstat output
>>> 
>>> 
>>> On 10.05.2015 at 19:30, <sudha.penmetsa at wipro.com> wrote:
>>> 
>>>> Hi Reuti,
>>>> 
>>>> The startup mechanism is as below
>>>> 
>>>> qlogin_daemon                /usr/sbin/sshd -i
>>>> qlogin_command               /gridapl1/HWEE_ge6/new/qssh
>>> 
>>> Then it's most likely that the `ssh` is not tightly integrated into SGE. Please have a look at:
>>> 
>>> https://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html
>>> 
>>> section "SSH TIGHT INTEGRATION".
>>> 
>>> -- Reuti
>>> 
>>> 
>>>> Regards,
>>>> Sudha
>>>> 
>>>> -----Original Message-----
>>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>>> Sent: Friday, May 08, 2015 10:50 PM
>>>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>>>> Cc: prod.feng at gmail.com; users at gridengine.org
>>>> Subject: Re: [gridengine users] grid jobs not visible with qstat output
>>>> 
>>>> 
>>>>> On 08.05.2015 at 16:57, sudha.penmetsa at wipro.com wrote:
>>>>> 
>>>>> Hi Zhang,
>>>>> 
>>>>> Please find the output below:
>>>>> 
>>>>> 32682 61457200 27020 karppa 32682 /applic36/grid/HWEE_ge6/utilbin/lx24-amd64/qrsh_starter /gridapl1/HWEE_ge6/default/spo
>>>>> 32734 61457200 27020 karppa 32734  \_ /bin/ksh ./run_it_file.vcs
>>>>> 33043 61457200 27020 karppa 32734      \_ /bin/ksh ./vcs.start.dh.no_gui
>>>>> 33059 61457200 27020 karppa 32734          \_ ./vcs/tb_bin/hdl_top_rtldhsim/simv -licqueue -cm line+cond+fsm+branch+tgl+
>>>>> 38048 61457200 27020 karppa 32734              \_ [target.bin] <defunct>
>>>>> 5049 61457200 27020 karppa 5049 /applic36/grid/HWEE_ge6/utilbin/lx24-amd64/qrsh_starter /gridapl1/HWEE_ge6/default/spoo
>>>>> 5101 61457200 27020 karppa 5101  \_ /bin/ksh ./run_it_file.vcs
>>>>> 5408 61457200 27020 karppa 5101      \_ /bin/ksh ./vcs.start.dh.no_gui
>>>>> 5424 61457200 27020 karppa 5101          \_ ./vcs/tb_bin/hdl_top_rtldhsim/simv -licqueue -cm line+cond+fsm+branch+tgl+a
>>>>> 9089 61457200 27020 karppa 5101              \_ [target.bin] <defunct>
>>>> 
>>>> The problem seems to be that the `qrsh_starter` is no longer bound to the "sge_shepherd". Was this output taken after the job had ended? What does it look like while SGE still knows about the job? And what is the startup mechanism:
>>>> 
>>>> $ qconf -sconf
>>>> ...
>>>> qlogin_command               builtin
>>>> qlogin_daemon                builtin
>>>> rlogin_command               builtin
>>>> rlogin_daemon                builtin
>>>> rsh_command                  builtin
>>>> rsh_daemon                   builtin
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> Regards,
>>>>> Sudha
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Feng Zhang [mailto:prod.feng at gmail.com]
>>>>> Sent: Friday, May 08, 2015 7:35 PM
>>>>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>>>>> Subject: Re: [gridengine users] grid jobs not visible with qstat output
>>>>> 
>>>>> Sudha,
>>>>> 
>>>>> Can you run "ps -e f -o pid,ppid,command", which can show more details?
>>>>> 
>>>>> On Fri, May 8, 2015 at 4:09 AM,  <sudha.penmetsa at wipro.com> wrote:
>>>>>> Hi Reuti,
>>>>>> 
>>>>>> The processes are not bound to sge_shepherd anymore.
>>>>>> 
>>>>>> Below are the qrsh_starter processes that are still running:
>>>>>> 
>>>>>> 5049 ?        00:00:00 qrsh_starter
>>>>>> 5101 ?        00:00:00 run_it_file.vcs
>>>>>> 5408 ?        00:00:00 vcs.start.dh.no
>>>>>> 5424 ?        8-20:57:02 simv
>>>>>> 9089 ?        00:00:00 target.bin <defunct>
>>>>>> 16868 ?        00:00:00 sshd
>>>>>> 16913 pts/9    00:00:00 bash
>>>>>> 17371 pts/9    00:00:00 ps
>>>>>> 32682 ?        00:00:00 qrsh_starter
>>>>>> 32734 ?        00:00:00 run_it_file.vcs
>>>>>> 33043 ?        00:00:00 vcs.start.dh.no
>>>>>> 33059 ?        8-21:19:03 simv
>>>>>> 38048 ?        00:00:00 target.bin <defunct>
>>>>>> 
>>>>>> Regards,
>>>>>> Sudha
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>>>>> Sent: Thursday, May 07, 2015 9:52 PM
>>>>>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>>>>>> Cc: rangam at gmail.com; users at gridengine.org
>>>>>> Subject: Re: [gridengine users] grid jobs not visible with qstat output
>>>>>> 
>>>>>> Are the processes still bound to the sge_shepherd, or did they jump out of the process tree? By what method were they started by qrsh_starter: "builtin" or by defining `ssh`?
>>>>>> 
>>>>>> -- Reuti
>>>>>> 
>>>>>> 
>>>>>>> On 07.05.2015 at 18:00, <sudha.penmetsa at wipro.com> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> No, the slots are no longer in use.
>>>>>>> 
>>>>>>> According to qstat I have no jobs on the host. However, processes of mine that were launched by qrsh_starter are still running on that host and together consume 200% CPU as well as licenses. They have been running for over a week without my being aware of them. I had assumed the processes were killed when the job was deleted with qdel.
>>>>>>> 
>>>>>>> What could be the reason for this?
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Sudha
>>>>>>> 
>>>>>>> From: Srirangam Addepalli [mailto:rangam at gmail.com]
>>>>>>> Sent: Wednesday, May 06, 2015 7:52 PM
>>>>>>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>>>>>>> Subject: Re: [gridengine users] grid jobs not visible with qstat output
>>>>>>> 
>>>>>>> That would be strange. Do the slots on the host show as being used?
>>>>>>> 
>>>>>>> qhost -j -h hostname should list the jobs that Grid Engine is aware of, unless qrsh somehow spawned a process that is not bound to sge_execd. On the client/execution host, what do you have in the active_jobs and jobs directories? It is more likely that the qrsh session terminated but left resident processes behind.
>>>>>>> 
>>>>>>> Rangam
>>>>>>> 
>>>>>>> On Wed, May 6, 2015 at 9:05 AM, <sudha.penmetsa at wipro.com> wrote:
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I noticed two grid jobs that have been running on a machine for over a week without my being aware of them. Both were launched with qrsh, but they are not visible in qstat, so for some reason they are no longer included in the grid's bookkeeping. This wastes grid resources on ghost jobs; for example, both of my jobs seem to consume 100% CPU on the host.
>>>>>>> 
>>>>>>> Can anyone please explain this?
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Sudha
>>>>>>> 
>>>>>>> The information contained in this electronic message and any
>>>>>>> attachments to this message are intended for the exclusive use of
>>>>>>> the addressee(s) and may contain proprietary, confidential or
>>>>>>> privileged information. If you are not the intended recipient,
>>>>>>> you should not disseminate, distribute or copy this e-mail.
>>>>>>> Please notify the sender immediately and destroy all copies of
>>>>>>> this message and any attachments. WARNING: Computer viruses can
>>>>>>> be transmitted via email. The recipient should check this email
>>>>>>> and any attachments for the presence of viruses. The company
>>>>>>> accepts no liability for any damage caused by any virus
>>>>>>> transmitted by this email. www.wipro.com
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> users at gridengine.org
>>>>>>> https://gridengine.org/mailman/listinfo/users
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Best,
>>>>> 
>>>>> Feng



