[gridengine users] complex error

William Hay w.hay at ucl.ac.uk
Wed Jul 26 08:17:39 UTC 2017


On Tue, Jul 25, 2017 at 12:57:47AM +0000, John_Tai wrote:
>    I have configured virtual_free as a requestable resource:
>
>    virtual_free        mem        MEMORY      <=    YES         JOB        0        0
>
>    And it's been working great for months.
> 
>     
> 
>    However today all of a sudden I got this error in messages:
>
>    07/25/2017 08:45:41|worker|ibm068|E|host load value "virtual_free" exceeded: capacity is 95945748480.262146, job 5983416 requests additional 268000000000.000000
>    07/25/2017 08:45:41|worker|ibm068|E|cannot start job 5983416.1, as resources have changed during a scheduling run
>    07/25/2017 08:45:41|worker|ibm068|W|Skipping remaining 7 orders
>
>    And any job would not get scheduled at all, they'd be in waiting state
>    "qw", no matter how many resources it's requesting:

Are they all failing to start on the same host?  If so, it might be worth
disabling the queues on that host so the scheduler looks for somewhere else
to put the jobs.  Also have a look at the host to see whether something is
eating virtual memory there.
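
To see how far apart the two figures in the error message are, the line can
be pulled apart with a few lines of Python.  This is only a sketch: the
regular expression is written against the single log line quoted above, the
function names are made up for illustration, and the unit table assumes the
multipliers described in sge_types(1) (lowercase suffixes decimal, uppercase
binary).

```python
import re

# Pattern for the qmaster "host load value ... exceeded" message quoted above.
# Assumption: other SGE versions may word this message differently.
LOAD_EXCEEDED = re.compile(
    r'host load value "(?P<resource>[^"]+)" exceeded: '
    r'capacity is (?P<capacity>[\d.]+), '
    r'job (?P<job>\d+) requests additional (?P<request>[\d.]+)'
)

# Memory multipliers as described in sge_types(1):
# lowercase is decimal, uppercase is binary.
MULTIPLIERS = {
    "k": 1000, "K": 1024,
    "m": 1000**2, "M": 1024**2,
    "g": 1000**3, "G": 1024**3,
}


def parse_memory(value):
    """Convert an SGE memory string such as '2000m' to bytes."""
    if value and value[-1] in MULTIPLIERS:
        return float(value[:-1]) * MULTIPLIERS[value[-1]]
    return float(value)


def parse_load_exceeded(line):
    """Return resource, job id, capacity and request in bytes, or None."""
    match = LOAD_EXCEEDED.search(line)
    if match is None:
        return None
    return {
        "resource": match["resource"],
        "job": int(match["job"]),
        "capacity": float(match["capacity"]),
        "request": float(match["request"]),
    }


line = ('07/25/2017 08:45:41|worker|ibm068|E|host load value "virtual_free" '
        'exceeded: capacity is 95945748480.262146, job 5983416 requests '
        'additional 268000000000.000000')
info = parse_load_exceeded(line)
per_job = parse_memory("2000m")  # hard resource_list shown by qstat -j

print(f"capacity: {info['capacity'] / 1024**3:.2f} GiB")
print(f"request:  {info['request'] / 1024**3:.2f} GiB")
print(f"request / per-job value: {info['request'] / per_job:.0f}")
```

The requested figure works out to 134 times the 2000m in the job's hard
resource_list.  Nothing in the messages quoted here says where that factor
comes from, so treat the ratio as a pointer for further digging rather than
a diagnosis.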

William

> 
>     
> 
>    # qstat -j 5983416
>    ==============================================================
>    job_number:                 5983416
>    exec_file:                  job_scripts/5983416
>    submission_time:            Tue Jul 25 08:18:46 2017
>    owner:                      jumbo
>    uid:                        986
>    group:                      memory
>    gid:                        41
>    sge_o_home:                 /home/jumbo
>    sge_o_log_name:             jumbo
>    sge_o_path:                 /home/eda/cadence/IC616.500.3_20131102/tools/bin:/home/eda/cadence/IC616.500.3_20131102/tools/dfII/bin:/home/eda/cadence/IC616.500.3_20131102/tools/plot/bin:/home/eda/cadence/Spectre161ISR2/tools/bin:/home/sge/sge6.2u6/bin/lx24-amd64:/bin:/usr/bin:/usr/local/bin:.:/home/sge/bin:/home/DI/TOOLS/bin:.:/home/IPproj/IOproject/quan/Flatten
>    sge_o_shell:                /bin/csh
>    sge_o_workdir:              /home/memorytemp/jumbo/180G_RK/S018DP/design_review
>    sge_o_host:                 ibm041
>    account:                    sge
>    cwd:                        /home/memorytemp/jumbo/180G_RK/S018DP/design_review
>    merge:                      y
>    hard resource_list:         virtual_free=2000m
>    mail_list:                  jumbo at ibm041
>    notify:                     FALSE
>    job_name:                   run.pl
>    jobshare:                   0
>    hard_queue_list:            256g.q
> 
>    env_list:                  
>    REMOTEHOST=dsls11,MANPATH=/home/sge/sge6.2u6/man:/opt/SUNWspro/man:/usr/man:/usr/openwin/man:/usr/dt/man:/
>    usr/local/man:/usr/local/mysql/man:/usr/local/samba/man,VNCDESKTOP=ibm041:344
>    (jumbo),HOSTNAME=ibm041,HOST=ibm041,SHELL=/bin/csh,TERM=
>    xterm,GROUP=memory,USER=jumbo,LD_LIBRARY_PATH=/usr/lib:/usr/openwin/lib:/usr/dt/lib:/usr/ccs/lib:/usr/local/lib:/usr/local/mysql/lib,L
>    S_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.
>    exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip
>    =00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*
>    .xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:,HOSTTYPE=x86_64-linux,MAIL=/var/spool/mail/jumbo,PATH=/home/eda/cadence/IC616.500.3_20
>    131102/tools/bin:/home/eda/cadence/IC616.500.3_20131102/tools/dfII/bin:/home/eda/cadence/IC616.500.3_20131102/tools/plot/bin:/home/eda
>    /cadence/Spectre161ISR2/tools/bin:/home/sge/sge6.2u6/bin/lx24-amd64:/bin:/usr/bin:/usr/local/bin:.:/home/sge/bin:/home/DI/TOOLS/bin:.:
>    /home/IPproj/IOproject/quan/Flatten,INPUTRC=/etc/inputrc,PWD=/home/memorytemp/jumbo/180G_RK/S018DP/design_review,EDITOR=xterm
>    -e vi,LA
>    NG=en_US.UTF-8,SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass,SHLVL=6,HOME=/home/jumbo,OSTYPE=linux,VENDOR=unknown,MACHTYPE=x86_64
>    ,LOGNAME=jumbo,LESSOPEN=|/usr/bin/lesspipe.sh
>    %s,DISPLAY=:344.0,G_BROKEN_FILENAMES=1,_=/usr/bin/gnome-session,GTK_RC_FILES=/etc/gtk/gt
>    krc:/home/jumbo/.gtkrc-1.2-gnome2,SESSION_MANAGER=local/ibm041:/tmp/.ICE-unix/17118,GNOME_KEYRING_SOCKET=/tmp/keyring-FJMO4E/socket,GN
>    OME_DESKTOP_SESSION_ID=Default,DESKTOP_STARTUP_ID=NONE,COLORTERM=gnome-terminal,WINDOWID=38263354,SGE_ROOT=/home/sge/sge6.2u6,SGE_CELL
>    =cell1,SGE_CLUSTER_NAME=p5098,IC61=/home/eda/cadence/IC616.500.3_20131102,MMSIMHOME=/home/eda/cadence/Spectre161ISR2,LM_LICENSE_FILE=5
>    280 at ibm041:5280 at ibm001:5280 at ibm002:5280 at ibm003:5260 at cadlic:5280 at cadlic:5280 at dsw3:5280 at dsw7:5280 at ibm004:5280 at ibm005:5280 at ibm006:5280 at 10
>    .224.172.252
> 
>    script_file:                ./run.pl
> 
>    scheduling info:            queue instance "gui.q@dsbm05" dropped because it is overloaded: mem_used=269814435839.737854 (no load adjustment) >= 200g
>                                queue instance "192g.q@dsbm10" dropped because it is temporarily not available
>                                queue instance "gui.q@dsbm10" dropped because it is temporarily not available
>                                queue instance "gui.q@dsbm10" dropped because it is temporarily not available
> 
>     
> 
>     
> 
>    And clearly there are available resources:
> 
>     
> 
>     
> 
>     
> 
>    # qstat -F mem
>    queuename                      qtype resv/used/tot. load_avg arch         states
>    ---------------------------------------------------------------------------------
>    gmig.q@ibm044                  BIP   0/0/2          1.27     lx24-amd64
>            hc:virtual_free=24.000G
>    ---------------------------------------------------------------------------------
>    gui.q@dsbm04                   BIP   0/59/70        10.01    lx24-amd64
>            hc:virtual_free=256.000G
>    ---------------------------------------------------------------------------------
>    gui.q@dsbm05                   BIP   0/56/70        7.14     lx24-amd64   a
>            hc:virtual_free=90.705G
>    ---------------------------------------------------------------------------------
>    gui.q@dsbm08                   BIP   0/11/45        9.96     lx24-amd64
>            hc:virtual_free=192.000G
>    ---------------------------------------------------------------------------------
>    gui.q@dsbm09                   BIP   0/7/45         9.84     lx24-amd64
>            hc:virtual_free=192.000G
>    ---------------------------------------------------------------------------------
>    gui.q@dsbm10                   BIP   0/2/45         0.82     lx24-amd64   o
>            hc:virtual_free=192.000G
>    ---------------------------------------------------------------------------------
>    gui.q@dsbm11                   BIP   0/41/45        3.13     lx24-amd64
>            hc:virtual_free=192.000G
>    ---------------------------------------------------------------------------------
>    lc.q@ibm071                    BIP   0/0/50         0.21     lx24-amd64
>            hc:virtual_free=48.000G
>    ---------------------------------------------------------------------------------
>    lc.q@ibm072                    BIP   0/0/50         0.00     lx24-amd64
>            hc:virtual_free=48.000G
>    ---------------------------------------------------------------------------------
>    lc.q@ibm073                    BIP   0/0/50         24.09    lx24-amd64
>            hc:virtual_free=48.000G
>    ---------------------------------------------------------------------------------
>    lc.q@ibm074                    BIP   0/5/50         0.05     lx24-amd64
>            hc:virtual_free=48.000G
>    ---------------------------------------------------------------------------------
>    lc.q@ibm075                    BIP   0/0/50         24.43    lx24-amd64
>            hc:virtual_free=48.000G
> 
>     
> 
>     
> 
>    Not sure what happened there. I had to disable this complex, so now jobs
>    are being scheduled again. I wonder if there was one job that was
>    submitted improperly that caused this?
