[gridengine users] qlogin with ssh

Reuti reuti at staff.uni-marburg.de
Wed Dec 4 21:29:56 UTC 2013


On 04.12.2013 at 21:59, Wiegers, Bert wrote:

> According to the man-page of queue_conf
> the kill -9 command should have been sent by default (we tried this first).
> This killscript below was an attempt to fix the problem.
> Neither works.

Then it might be promising to set up a tight SSH integration:

http://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html

section "SSH TIGHT INTEGRATION". I wonder why I forgot to mention there that it needs "execd_params ENABLE_ADDGRP_KILL=TRUE" in SGE's configuration.
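
A small sketch of how one might check whether that parameter is already set. `has_addgrp_kill` is a made-up helper name, and the assumption is that `qconf -sconf global` prints the `execd_params` entry on a single line:

```shell
#!/bin/sh
# has_addgrp_kill is a hypothetical helper, not part of SGE.
# It reads configuration text (as printed by "qconf -sconf") on stdin
# and succeeds if the execd_params line contains ENABLE_ADDGRP_KILL=TRUE.
has_addgrp_kill() {
    grep -i '^execd_params' | grep -qi 'ENABLE_ADDGRP_KILL=TRUE'
}

# Usage on a cluster (not executed here):
#   qconf -sconf global | has_addgrp_kill || echo "ENABLE_ADDGRP_KILL not set"
```

If the check fails, the entry can be added to the global configuration with `qconf -mconf global`.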

-- Reuti


> Bert
> 
> 
> 
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Wednesday, December 04, 2013 6:28 PM
>> To: Wiegers, Bert
>> Cc: users at gridengine.org
>> Subject: Re: [gridengine users] qlogin with ssh
>> 
>> On 04.12.2013 at 17:47, Wiegers, Bert wrote:
>> 
>>> our setup is
>>> 
>>> sge_conf:
>>> qlogin_command               /export/opt/SGE-8.1.6/utilbin/lx-amd64/qlogin_wrapper.sh
>>> 
>>> cat /export/opt/SGE-8.1.6/utilbin/lx-amd64/qlogin_wrapper.sh
>>> #!/bin/sh
>>> HOST=$1
>>> PORT=$2
>>> /usr/bin/ssh -Y -p $PORT $HOST
>>> 
>>> 
>>> queue_conf:
>>> terminate_method      /export/opt/SGE-8.1.6/scripts/case_terminate_method.sh \
>>>                     $job_pid $job_owner
>> 
>> What was the motivation to have a custom method?
>> 
>> The default is to send a kill to the complete process group, i.e. something like
>> 
>> kill -9 -- -$1
>> 
>> in your setup.
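
To illustrate the process-group kill mentioned above: a negative PID makes `kill` address every process in that group. This is only a sketch, not SGE's actual implementation; `kill_job_group` is a hypothetical helper name, and it uses the `-s KILL` spelling, which sends the same signal as `-9`:

```shell
#!/bin/sh
# kill_job_group is a hypothetical helper, not something SGE ships.
# $1 is the process group ID; in a terminate_method this would be $job_pid,
# assuming the job's first process is also its process group leader.
kill_job_group() {
    # A negative PID addresses the whole process group.
    # "--" ends option parsing so that "-12345" is not read as an option.
    kill -s KILL -- "-$1"
}

# Usage (not executed here):
#   kill_job_group "$job_pid"
```

This only reaches processes that are still in that process group; anything that moved to a different group or session is not signalled.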
>> 
>> 
>>> cat /export/opt/SGE-8.1.6/scripts/case_terminate_method.sh
>>> #!/bin/bash
>>> 
>>> if [ $# -ne 2 ] ; then
>>> echo "Usage:" $0 job_pid job_owner
>>> exit 1
>>> fi
>>> 
>>> job_pid=$1
>>> job_owner=$2
>>> 
>>> # try and kill the session group - the group leader is the shell
>>> # executing the job script
>>> pkill -s $job_pid
>>> if [ $? -ne 0 ] ; then
>>>       kill $job_pid
>> 
>> AFAICS the sid can be different from the pid or pgrp. And even when they are the same: it's the
>> sid of the sshd, not the shell.
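
One way to see the three IDs side by side is `ps` with explicit output columns. A sketch, assuming a procps-style `ps` on Linux that knows the `pgid` and `sid` columns; `show_ids` is a hypothetical helper name:

```shell
#!/bin/sh
# show_ids is a hypothetical helper: prints PID, process group ID,
# session ID and command name of one process, without a header line.
show_ids() {
    ps -o pid= -o pgid= -o sid= -o comm= -p "$1"
}

# Example: inspect the current shell.
show_ids "$$"
```

Running it against the PIDs of the sshd processes and the login shell from the process tree would show whether they share a session.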
>> 
>> -- Reuti
>> 
>> 
>>> fi
>>> 
>>> # cleanup grace period
>>> sleep 10
>>> pkill -9 -s $job_pid
>>> if [ $? -ne 0 ] ; then
>>>       kill -9 $job_pid
>>> fi
>>> 
>>> 
>>> 
>>> Bert
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>>> Sent: Wednesday, December 04, 2013 5:33 PM
>>>> To: Wiegers, Bert
>>>> Cc: users at gridengine.org
>>>> Subject: Re: [gridengine users] qlogin with ssh
>>>> 
>>>> On 04.12.2013 at 17:19, Wiegers, Bert wrote:
>>>> 
>>>>> Hi *,
>>>>> 
>>>>> we are using a qlogin wrapper script, as mentioned below.
>>>>> It looks like this setup prevents SGE from reaching the terminate_method.
>>>> 
>>>> You defined a custom "terminate_method"? Can you please post it?
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> Bert
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: users-bounces at gridengine.org [mailto:users-bounces at gridengine.org] On Behalf Of Wiegers, Bert
>>>>>> Sent: Tuesday, December 03, 2013 9:01 AM
>>>>>> To: users at gridengine.org
>>>>>> Subject: Re: [gridengine users] qlogin with ssh
>>>>>> 
>>>>>> Hi Reuti,
>>>>>> 
>>>>>> The process tree looks like this:
>>>>>> root     20939  0.0  0.0 1242552 5892 ?        Sl   Nov14  18:57 /export/opt/SGE-8.1.6/bin/lx-amd64/sge_execd
>>>>>> root     33874 99.7  0.0  34164  2828 ?        R    08:47   0:22  \_ sge_shepherd-18003 -bg
>>>>>> root     33882  0.0  0.0  98156  3836 pts/1    Ss+  08:47   0:00      \_ sshd: xxxxxx [priv]
>>>>>> xxxxxx 33884  0.0  0.0  98156  2044 pts/1    S+   08:47   0:00          \_ sshd: xxxxxx at pts/2
>>>>>> xxxxxx 33885  1.1  0.0  14556  3260 pts/2    SNs  08:47   0:00              \_ -tcsh
>>>>>> It stays the same as long as I am logged on to the node.
>>>>>> 
>>>>>> The job is still listed in qstat.
>>>>>> 
>>>>>> In the messages of the scheduler I find these hints:
>>>>>> 12/03/2013 08:52:31|schedu|service0|W|job 18003.1 should have finished since 90s
>>>>>> 
>>>>>> When I log out afterwards, I see in the messages:
>>>>>> 12/03/2013 08:58:42|worker|service0|I|removing trigger to terminate job 18003.1
>>>>>> 12/03/2013 08:58:42|worker|service0|W|job 18003.1 failed on host XY qmaster enforced h_rt, h_cpu, or h_vmem limit because: <unknown reason>
>>>>>> 
>>>>>> Bert
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>>>>>> Sent: Monday, December 02, 2013 6:43 PM
>>>>>>> To: Wiegers, Bert
>>>>>>> Cc: users at gridengine.org
>>>>>>> Subject: Re: [gridengine users] qlogin with ssh
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> On 02.12.2013 at 18:28, Wiegers, Bert wrote:
>>>>>>> 
>>>>>>>> we are running SGE 8.1.6.
>>>>>>>> We have configured some interactive queues and use qlogin with the
>>>>>>>> wrapper script (... /usr/bin/ssh -Y -p $PORT $HOST).
>>>>>>>> In our setup the user is forced to set the h_rt limit.
>>>>>>>> Unfortunately, qlogin does not care if the walltime is exceeded.
>>>>>>>> The shepherd seems to be unable to kill the qlogin sessions while the
>>>>>>>> user is still connected to the node.
>>>>>>>> Does anyone have a solution or a workaround for this?
>>>>>>> 
>>>>>>> Is the `sshd` a child of the `shepherd`, i.e. something like:
>>>>>>> 
>>>>>>> $ ps -e f
>>>>>>> ...
>>>>>>> 6656 ?        Sl    56:23 /usr/sge/bin/lx24-x86/sge_execd
>>>>>>> 9391 ?        S      0:00  \_ sge_shepherd-10502 -bg
>>>>>>> 9392 ?        Ss     0:00      \_ sshd: reuti [priv]
>>>>>>> 9398 ?        S      0:00          \_ sshd: reuti at pts/2
>>>>>>> 9405 pts/2    Ss     0:00              \_ -bash
>>>>>>> 
>>>>>>> What does the process tree look like after "h_rt" expired - did the job vanish from `qstat` too?
>>>>>>> 
>>>>>>> -- Reuti
>>>>>> 
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> users at gridengine.org
>>>>>> https://gridengine.org/mailman/listinfo/users




