[gridengine users] Finished jobs still appear as running in queue

Nicolás Serrano Martínez-Santos nserrano at dsic.upv.es
Wed Nov 27 09:24:23 UTC 2013

Excerpts from Reuti's message of 2013-11-26 19:37:34 +0100:
> But the process is also gone from the node, and not in some uninterruptible kernel sleep?

It is gone.

> What's in the script: /scripts/sgeepilog.sh - anything what could hang?

Please find it attached. However, the wait returning -1 does not happen only in
the epilog; it sometimes happens in the main script as well.

> Are you using -notify and s_rt at the same time? At least for the CPU time I spot 36000 as s_cpu which I suggest to remove. It has no direct effect as you have a h_cpu in addition anyway. Having -notify and a soft warning at the same time could result in a warning for the warning and the job is never killed but warned every 90 seconds or so. Maybe something similar is happening when you have s_cpu and s_rt being triggered almost at the same time.

We are not using those two options. This is what the typical qstat output for a job looks like:

job_number:                 294730
exec_file:                  job_scripts/294730
submission_time:            Tue Nov 19 17:30:06 2013
owner:                      adgipas
uid:                        3155
group:                      20040059
gid:                        3091
sge_o_home:                 /h/adgipas
sge_o_log_name:             adgipas
sge_o_path:                 /home/adgipas/proves_cart/share_tlk/tLtask-train/scripts:/h/adgipas/proves_cart/bin:/home/apps/ompi/1.6.4/gnu/bin:/bin:/usr/bin:/home/apps/oge/bin/linux-x64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/h/adgipas/bin:/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/adgipas/proves_cart
sge_o_host:                 mainnode
account:                    sge
cwd:                        /h/adgipas/proves_cart
reserve:                    y
merge:                      y
hard resource_list:         h_cpu=72000,h_rt=72000,h_vmem=5120M
mail_list:                  nserrano at dsic.upv.es
notify:                     FALSE
job_name:                   cart_700.standard.triphoneme.train-em-MIX01-ITER1-estimate
jobshare:                   0
shell_list:                 NONE:/bin/bash
env_list:                   PATH=/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts:/h/adgipas/proves_cart/bin:/home/apps/ompi/1.6.4/gnu/bin:/bin:/usr/bin:/home/apps/oge/bin/linux-x64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/h/adgipas/bin:/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts
script_file:                STDIN
jid_predecessor_list (req):  cart_700.standard.triphoneme.train-init
jid_successor_list:          294731
job-array tasks:            1-500:1
usage  334:                 cpu=10:08:02, mem=182410.00000 GBs, io=0.00000, vmem=5.000G, maxvmem=5.000G
scheduling info:            queue instance "gpus@hpcg1.cc.upv.es" dropped because it is disabled
                            queue instance "gpus@hpcg2.cc.upv.es" dropped because it is disabled
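
To double-check the answer about -notify and the soft limits for many jobs at once, the same fields can be read out of saved qstat -j output. This is only a minimal sketch: it assumes the text has the "key: value" layout shown above and that soft limits show up in the resource list with names starting with "s_" (s_cpu, s_rt, ...). The function name limits_and_notify is made up for illustration.

```python
def limits_and_notify(qstat_text):
    """Return (soft_limits, notify) parsed from `qstat -j` style text.

    soft_limits maps names such as 's_cpu' to their requested value;
    notify is True/False, or None if no 'notify:' line was found.
    """
    soft = {}
    notify = None
    for line in qstat_text.splitlines():
        # Split on the first colon only, so values containing ':' stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        value = value.strip()
        if key == "notify":
            notify = (value.upper() == "TRUE")
        elif key in ("hard resource_list", "soft resource_list"):
            for item in value.split(","):
                name, _, val = item.partition("=")
                if name.strip().startswith("s_"):
                    soft[name.strip()] = val.strip()
    return soft, notify


# The two relevant lines from the job shown above.
sample = """\
hard resource_list:         h_cpu=72000,h_rt=72000,h_vmem=5120M
notify:                     FALSE
"""
soft, notify = limits_and_notify(sample)
print(soft, notify)
```

For the job above this reports no soft limits and notify disabled, which matches what is stated in the reply.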


Another peculiarity of the cluster is that all jobs are submitted with -R y. Could that also cause a problem? I read in one of your mails


but I don't think it is related to this problem.
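
To see which tasks the scheduler considers overdue, the "should have finished" warnings in the qmaster messages file (one is quoted further down) can be summarised per task. A rough sketch, assuming every warning has exactly the format of that quoted line; the name overdue_tasks is only illustrative.

```python
import re

# Matches the scheduler warning format quoted in this thread;
# lines in any other format are ignored.
OVERDUE = re.compile(r"\|W\|job (\d+)\.(\d+) should have finished since (\d+)s")


def overdue_tasks(lines):
    """Map (job_id, task_id) to the largest overdue time in seconds seen."""
    worst = {}
    for line in lines:
        m = OVERDUE.search(line)
        if m:
            key = (int(m.group(1)), int(m.group(2)))
            worst[key] = max(worst.get(key, 0), int(m.group(3)))
    return worst


example = [
    "11/25/2013 10:11:41|schedu|mainnode|W|job 312363.9 should have finished since 10483s",
]
print(overdue_tasks(example))  # {(312363, 9): 10483}
```

In practice the lines would come from reading the messages file in the qmaster spool directory instead of the hard-coded example list.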

> -- Reuti
> > until the process is deleted with "-f".
> > 
> > In the <qmaster spool>/messages there are references to this jobs as:
> > 
> > 11/25/2013 10:11:41|schedu|mainnode|W|job 312363.9 should have finished since 10483s
> > 
> > Do you have any hint of what can be problem?
> > 
> > Thanks in advance,
> > 
> > -- 
> > NiCo
