[gridengine users] Finished jobs still appear as running in queue

Nicolás Serrano Martínez-Santos nserrano at dsic.upv.es
Wed Nov 27 09:24:23 UTC 2013


Excerpts from Reuti's message of 2013-11-26 19:37:34 +0100:
> 
> But the process is also gone from the node, and not in some uninterruptible kernel sleep?
> 

It is gone.
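
For reference, a check of this kind can be sketched as follows. This is only a minimal example, assuming a Linux node with the procps `ps`; the helper name `list_dstate` is hypothetical. It keeps the lines whose state column starts with D (uninterruptible sleep).

```shell
#!/bin/bash
# Hypothetical helper: print processes whose state starts with "D"
# (uninterruptible sleep). Expects "PID STAT USER COMMAND" lines on stdin.
list_dstate() {
    awk '$2 ~ /^D/ { print }'
}

# Usage on the execution node (the trailing "=" suppresses the ps header):
#   ps -eo pid=,stat=,user=,comm= | list_dstate
```

No output means that no process on the node is currently in that state.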

> 
> What's in the script: /scripts/sgeepilog.sh - anything that could hang?
> 

Please find it attached. However, the wait does not return -1 only in the
epilog; it sometimes does so in the main script as well.

> Are you using -notify and s_rt at the same time? At least for the CPU time I spot 36000 as s_cpu which I suggest to remove. It has no direct effect as you have a h_cpu in addition anyway. Having -notify and a soft warning at the same time could result in a warning for the warning and the job is never killed but warned every 90 seconds or so. Maybe something similar is happening when you have s_cpu and s_rt being triggered almost at the same time.
> 

We are not using those two options. This is what the typical qstat output for a job looks like:

==============================================================
job_number:                 294730
exec_file:                  job_scripts/294730
submission_time:            Tue Nov 19 17:30:06 2013
owner:                      adgipas
uid:                        3155
group:                      20040059
gid:                        3091
sge_o_home:                 /h/adgipas
sge_o_log_name:             adgipas
sge_o_path:                 /home/adgipas/proves_cart/share_tlk/tLtask-train/scripts:/h/adgipas/proves_cart/bin:/home/apps/ompi/1.6.4/gnu/bin:/bin:/usr/bin:/home/apps/oge/bin/linux-x64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/h/adgipas/bin:/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/adgipas/proves_cart
sge_o_host:                 mainnode
account:                    sge
cwd:                        /h/adgipas/proves_cart
reserve:                    y
merge:                      y
hard resource_list:         h_cpu=72000,h_rt=72000,h_vmem=5120M
mail_list:                  nserrano at dsic.upv.es
notify:                     FALSE
job_name:                   cart_700.standard.triphoneme.train-em-MIX01-ITER1-estimate
jobshare:                   0
shell_list:                 NONE:/bin/bash
env_list:                   PATH=/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts:/h/adgipas/proves_cart/bin:/home/apps/ompi/1.6.4/gnu/bin:/bin:/usr/bin:/home/apps/oge/bin/linux-x64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/h/adgipas/bin:/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts
script_file:                STDIN
jid_predecessor_list (req):  cart_700.standard.triphoneme.train-init
jid_successor_list:          294731
job-array tasks:            1-500:1
usage  334:                 cpu=10:08:02, mem=182410.00000 GBs, io=0.00000, vmem=5.000G, maxvmem=5.000G
scheduling info:            queue instance "gpus at hpcg1.cc.upv.es" dropped because it is disabled
                            queue instance "gpus at hpcg2.cc.upv.es" dropped because it is disabled

-----------------
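
The same check can be done mechanically on output like the above. The following is only a sketch: the helper name `check_soft_limits` is hypothetical, and it assumes the `qstat -j` field names shown in the listing (`notify:` and `hard resource_list:`, plus a `soft resource_list:` line if one is present).

```shell
#!/bin/bash
# Hypothetical helper: reads "qstat -j <jobid>" output on stdin and reports
# whether -notify or any soft limit (s_rt, s_cpu, ...) is set for the job.
check_soft_limits() {
    awk '
        /^notify:/ { if ($2 == "TRUE") notify = 1 }
        /^(hard|soft) resource_list:/ {
            # A soft limit shows up as s_<name>=<value> in the request list.
            if ($0 ~ /(^|[ ,])s_[a-z]+=/) soft = 1
        }
        END {
            print "notify=" (notify ? "yes" : "no") " soft_limits=" (soft ? "yes" : "no")
        }
    '
}

# Usage:
#   qstat -j 294730 | check_soft_limits
```

For the job listed above, which has only h_cpu, h_rt and h_vmem and notify set to FALSE, this reports that neither is in use.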

Another peculiarity of the cluster is that all jobs are submitted with -R y. Could that also cause a problem? I read about this in one of your mails

http://gridengine.org/pipermail/users/2012-October/005077.html

but I don't think it is related to this problem.
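
To list which jobs the scheduler considers overdue, the qmaster messages file can be filtered for the warning quoted further down. This is a minimal sketch: the helper name `list_overdue_jobs` is hypothetical, and the pattern assumes the exact message format of that quoted line.

```shell
#!/bin/bash
# Hypothetical helper: reads a qmaster messages file on stdin and prints
# "<job.task> <seconds overdue>" for every "should have finished" warning.
list_overdue_jobs() {
    sed -n 's/.*|job \([0-9][0-9.]*\) should have finished since \([0-9][0-9]*\)s.*/\1 \2/p'
}

# Usage (the path is a placeholder for the real qmaster spool directory):
#   list_overdue_jobs < /path/to/qmaster_spool/messages
```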

> -- Reuti
> 
> > until the process is deleted with "-f".
> > 
> > In the <qmaster spool>/messages there are references to these jobs as:
> > 
> > 11/25/2013 10:11:41|schedu|mainnode|W|job 312363.9 should have finished since 10483s
> > 
> > Do you have any hint of what the problem can be?
> > 
> > Thanks in advance,
> > 
> > -- 
> > NiCo
> > <trace>

-- 
NiCo


