[gridengine users] Several problems with server JSV s

Txema Heredia txema.llistes at gmail.com
Wed Jun 18 17:03:53 UTC 2014


Woah man! That's impressive!

I had to make a couple of adjustments to the script: first checking
whether $JOB_ID is set (to exclude qsub jobs immediately), and adding an
extra if to the PID check, because non-root qlogins need one more "step"
up the process tree to reach the shepherd PID.

For the record, this is my final script:

#cat /etc/profile.d/qlogin_timelimit_message.sh
#!/bin/bash

if [ ! -n "$JOB_ID" ]; then
         GO="false";
         MYPARENT=`ps -p $$ -o ppid --no-header`
#       echo $MYPARENT
         MYPARENT=`ps -p $MYPARENT -o ppid --no-header`
#       echo $MYPARENT
         MYSTARTUP=`ps -p $MYPARENT -o command --no-header`
#       echo $MYSTARTUP

         if [ "${MYSTARTUP:0:13}" = "sge_shepherd-" ]; then
                 GO="true";
         else
                 MYPARENT=`ps -p $MYPARENT -o ppid --no-header`
#               echo $MYPARENT
                 MYSTARTUP=`ps -p $MYPARENT -o command --no-header`
#               echo $MYSTARTUP
                 if [ "${MYSTARTUP:0:13}" = "sge_shepherd-" ]; then
                         GO="true";
                 fi
         fi



#        if [ "${MYSTARTUP:0:13}" = "sge_shepherd-" ]; then
         if [ "$GO" = "true" ]; then
#               echo "Running inside SGE";
                 MYJOBID=${MYSTARTUP:13}
                 MYJOBID=${MYJOBID% -bg}
#               echo "Job $MYJOBID"

                 if [ -n "$MYJOBID" ]; then
                 . /opt/gridengine/default/common/settings.sh
                 TIMELIMIT=`qstat -j $MYJOBID | sed -n -e "/^context/s/^context: *//p" | tr "," "\n" | sed -n -e "s/^QLOGIN_TIMELIMIT=//p"`
#               echo $TIMELIMIT
                         if [ -n "$TIMELIMIT" ]; then

echo -e "\n\n"
echo -e "\t\x1b\x5b1;31;49m#################################################\x1b\x5b0;39;49m"
echo -e "\t\x1b\x5b1;31;49m#\t\t* W A R N I N G *\t\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m"
echo -e "\t\x1b\x5b1;31;49m#################################################\x1b\x5b0;39;49m"
echo -e "\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m\t\t\t\t\t\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m"
##print ("\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m       | |       |       |       |       \x1b\x5b1;31;49m#\x1b\x5b0;39;49m"
echo -e "\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m The qlogin job you submitted did not request\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m"
echo -e "\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m any time duration.\t\t\t\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m"
echo -e "\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m\t\t\t\t\t\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m"
echo -e "\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m This qlogin session has been assigned a\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m"
echo -e "\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m duration of\x1b\x5b1;32;49m ${TIMELIMIT}\x1b\x5b0;39;49m. After this time expires,\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m"
#echo -e "\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m duration of \x1b\x5b1;32;49m2 hours\x1b\x5b0;39;49m. After this time expires,\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m"
echo -e "\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m the qlogin session will close.\t\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m"
echo -e "\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m\t\t\t\t\t\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m"
echo -e "\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m If you want to submit a qlogin session with\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m"
echo -e "\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m longer duration, please add to your resource\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m"
echo -e "\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m request a time petition by adding the\t\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m"
echo -e "\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m following parameter to your qlogin command:\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m"
echo -e "\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m \t-l h_rt=hh:mm:ss\t\t\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m"
echo -e "\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m as in\t\t\t\t\t\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m"
echo -e "\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m \t-l h_rt=02:30:00 (2 hours 30 minutes)\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m"
echo -e "\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m\t\t\t\t\t\t\x1b\x5b1;31;49m#\x1b\x5b0;39;49m"
echo -e "\t\x1b\x5b1;31;49m#################################################\x1b\x5b0;39;49m"
echo -e "\n"

                         fi
                 fi
         fi
fi
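
The two hard-coded parent lookups above could also be written as a loop. This is only a sketch, assuming bash and the same procps-style ps options used in the script. The function names (shepherd_job_id, find_shepherd_job_id) and the default depth of 4 are made up for illustration and are not part of SGE:

```shell
#!/bin/bash

# Extract the job id from a shepherd command line such as
# "sge_shepherd-4550191 -bg". Prints nothing and returns 1 if the
# command does not start with the shepherd prefix.
shepherd_job_id() {
    local cmd=$1
    [ "${cmd:0:13}" = "sge_shepherd-" ] || return 1
    cmd=${cmd:13}
    echo "${cmd% -bg}"
}

# Walk up the process tree from the current shell, at most max_depth
# levels (hypothetical default: 4), and print the job id of the first
# shepherd found among the ancestors.
find_shepherd_job_id() {
    local pid=$$ max_depth=${1:-4} depth cmd
    for ((depth = 0; depth < max_depth; depth++)); do
        pid=$(ps -p "$pid" -o ppid --no-header | tr -d ' ')
        # Stop at init or when the parent cannot be determined.
        [ -n "$pid" ] && [ "$pid" -gt 1 ] || return 1
        cmd=$(ps -p "$pid" -o command --no-header)
        shepherd_job_id "$cmd" && return 0
    done
    return 1
}
```

In the profile script, MYJOBID=$(find_shepherd_job_id) would then stand in for the GO flag and the nested if blocks, and root and non-root qlogins would be handled by the same code path.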



And this is the message it produces (in full-blown color):

$ qlogin -l h_vmem=500M
Your job 4550191 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 4550191 has been successfully scheduled.
Establishing /opt/gridengine/bin/rocks-qlogin.sh session to host compute-1-2.local ...
Warning: Permanently added '[compute-1-2.local]:53921' (RSA) to the list of known hosts.
Last login: Wed Jun 18 18:55:04 2014 from floquet.local
Rocks Compute Node
Rocks 6.0 (Mamba)
Profile built 15:58 25-Sep-2013

Kickstarted 16:05 25-Sep-2013



     #################################################
     #        * W A R N I N G *        #
     #################################################
     #                        #
     # The qlogin job you submitted did not request    #
     # any time duration.                #
     #                        #
     # This qlogin session has been assigned a    #
     # duration of 2 hours. After this time expires,    #
     # the qlogin session will close.        #
     #                        #
     # If you want to submit a qlogin session with    #
     # longer duration, please add to your resource    #
     # request a time petition by adding the        #
     # following parameter to your qlogin command:    #
     #     -l h_rt=hh:mm:ss            #
     # as in                        #
     #     -l h_rt=02:30:00 (2 hours 30 minutes)    #
     #                        #
     #################################################


[theredia at compute-1-2 ~]$
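
If the right-hand border of the box should line up regardless of the terminal's tab stops, one option is to pad each row with printf instead of tabs. A minimal sketch, assuming bash; box_line and the RED/RESET variables are hypothetical names, with the same escape sequences as in the script above:

```shell
#!/bin/bash

RED=$'\x1b[1;31;49m'
RESET=$'\x1b[0;39;49m'

# Print one box row: a red "#", the text left-aligned and padded to a
# fixed inner width, then a closing red "#". The padding is applied to
# the plain text argument only, so the border colors do not skew it.
box_line() {
    local width=$1 text=$2
    printf '\t%s#%s %-*s%s#%s\n' "$RED" "$RESET" "$width" "$text" "$RED" "$RESET"
}
```

For example, box_line 46 "any time duration." prints that row padded to 46 columns. Text that itself contains color codes (such as the green time limit) would need its padding worked out from the uncolored string.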




Thank you very much,

Txema






On 18/06/14 17:37, Reuti wrote:
> On 18.06.2014 at 17:01, Txema Heredia wrote:
>
>> On 17/06/14 17:23, Reuti wrote:
>>> On 17.06.2014 at 15:46, Txema Heredia wrote:
>>>
>>>> Basically, the JSV checks the CLIENT parameter. If it is equal to "qlogin", it checks whether any h_rt was requested and, if not, sets a limit of 2h 10min, while showing a colorful warning message so that users know there is a time limit.
>>>> It also sets all.q as the hard queue if none (or *) was requested, and sets the core binding policy.
>>>>
>>>> This works wonders when I run it as a client jsv by using "qlogin -jsv /opt/gridengine/default/common/jsv.pl". The message appears, the limit is set, and the job runs fine.
>>>>
>>>> But once I set this as a server JSV (via qconf -mconf global), the time limit no longer applies.
>>>>
>>>> As far as I have been able to find, the following behaviours differ between running it as a client JSV and as a server JSV:
>>>>
>>>> - The 'CLIENT' parameter changes from 'qlogin' to 'qmaster'. This skips every "if" in my JSV, so it stops checking for time limits. Can I trust this? Why is 'qmaster' appearing? Now both qsubs and qlogins show the same CLIENT value. How can I distinguish them?
>>> I found the same:
>>>
>>> http://gridengine.org/pipermail/users/2012-September/004808.html
>>>
>>> According to William's post, you can check for the QRSH_PORT variable.
>>>
>> Thank you as usual, Reuti.
>>
>> That QRSH_PORT environment variable makes it possible to differentiate between qlogin and qsub commands. But I am still having some problems.
>>
>>>> - The jsv_show_params() command shows nothing, neither on stdout nor in /opt/gridengine/default/spool/qmaster/messages. This makes debugging really cumbersome.
>>> It works for me, in both Bash and Perl.
>>>
>>> 06/17/2014 17:13:30|worker|pc15370|I|got param: A='sge'
>>> 06/17/2014 17:13:30|worker|pc15370|I|got param: GROUP='users'
>>> 06/17/2014 17:13:30|worker|pc15370|I|got param: N='test.sh'
>>> 06/17/2014 17:13:30|worker|pc15370|I|got param: CMDNAME='test.sh'
>>> 06/17/2014 17:13:30|worker|pc15370|I|got param: CMDARGS='0'
>>> 06/17/2014 17:13:30|worker|pc15370|I|got param: JOB_ID='11553'
>>> 06/17/2014 17:13:30|worker|pc15370|I|got param: M='reuti at pc15370'
>>> 06/17/2014 17:13:30|worker|pc15370|I|got param: CLIENT='qmaster'
>>> 06/17/2014 17:13:30|worker|pc15370|I|got param: VERSION='1.0'
>>> 06/17/2014 17:13:30|worker|pc15370|I|got param: USER='reuti'
>>> 06/17/2014 17:13:30|worker|pc15370|I|got param: CONTEXT='server'
>> My bad. I had my loglevel set to log_warning instead of log_info. Now I can see these messages.
>>
>>>> - No message can be sent to the user, be it info, warning or error. The user won't know that I have set a time limit on their session.
>>> Yep, a message is displayed only for "jsv_reject_wait", even though one can be specified for "jsv_correct" and "jsv_accept" too.
>>>
>>> -- Reuti
>>>
>> Trying to overcome this, I thought of making my JSV add an environment variable "IS_QLOGIN=true" whenever it detects the QRSH_PORT. Then, a prolog script in the execution host would check that environment variable and print, if needed, the timelimit message to the user.
>> BUT the prolog script cannot print anything to the standard output (because it is run before the actual session begins?).
> Yep, its stdout is not connected to the terminal yet.
>
>
>> So, I thought about modifying the .bashrc file (or any other of the several scripts under /etc/profile.d/), and make it read that "IS_QLOGIN=true" environment variable. But, again, the fates are against me, and qlogin commands cannot use the "-v" parameter. Even if I use the jsv_add_env() command in the JSV script, that environment variable is passed to the prolog script, but is nowhere to be found once the "real" qlogin session starts.
>>
>> I could also ignore both the JSV and the prolog scripts, go directly to the .bashrc script and check there for the presence of (or lack thereof) variables like JOB_ID or JOB_NAME. That would allow me to distinguish between qlogin and qsub sessions (qlogins and ssh environments are identical). But then, again, I won't have access to anything able to tell the script if there is a timelimit set.
>>
>> The only solution to all this mess that comes to my mind would be to make /etc/motd writable by all users, have the prolog script modify it with the timelimit message, and then have some sort of contraption in /etc/profile.d/ that resets the motd back to its previous non-qlogin version.
>>
>> Does anyone have a better (or less-prone-to-failure) idea?
> In the JSV you can also add some job context and fill it with a proper message. This could then be output:
>
> $ qrsh -ac "MESSAGE=Time limit of 12 hrs set."
> Running inside SGE
> Job 11562
> Time limit of 12 hrs set.
>
> The necessary profile for bash (depending on builtin/ssh/rsh the parent must be looked up more than once):
>
> MYPARENT=`ps -p $$ -o ppid --no-header`
> #MYPARENT=`ps -p $MYPARENT -o ppid --no-header`
> #MYPARENT=`ps -p $MYPARENT -o ppid --no-header`
> MYSTARTUP=`ps -p $MYPARENT -o command --no-header`
>
> if [ "${MYSTARTUP:0:13}" = "sge_shepherd-" ]; then
>     echo "Running inside SGE"
>     MYJOBID=${MYSTARTUP:13}
>     MYJOBID=${MYJOBID% -bg}
>     echo "Job $MYJOBID"
>
>     if [ -n "$MYJOBID" ]; then
>        . /usr/sge/default/common/settings.sh
>         qstat -j $MYJOBID | sed -n -e "/^context/s/^context: *//p" | tr "," "\n" | sed -n -e "s/^MESSAGE=//p"
>     fi
> fi
>
> HTH - Reuti
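
For reference, the context lookup in the quoted snippet can be wrapped in a small function so that it can be tried without a running job. This is a sketch only: the function name context_value is hypothetical, and it assumes that qstat -j prints the job context as a single comma-separated "context:" line, which is what the sed expressions in the quoted snippet rely on:

```shell
#!/bin/bash

# Read "qstat -j"-style output on stdin and print the value stored under
# the given job context key (e.g. MESSAGE or QLOGIN_TIMELIMIT).
# Prints nothing if the key is not present.
context_value() {
    local key=$1
    sed -n -e '/^context/s/^context: *//p' | tr ',' '\n' | sed -n -e "s/^${key}=//p"
}
```

Usage inside the profile script would look like TIMELIMIT=$(qstat -j "$MYJOBID" | context_value QLOGIN_TIMELIMIT). Because the entries are split on commas, a context value that itself contains a comma would be cut short.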



