[gridengine users] [SGE-discuss] spool, no information, loss of jobs

Reuti reuti at Staff.Uni-Marburg.DE
Fri Jun 17 10:09:12 UTC 2011


On 17.06.2011 at 10:18, baf035 wrote:

> Yes, we are using classic spooling.
> The spool directory is on an NFSv3 filesystem that is mounted widely across the HPC environment.
> This configuration has been in use for several years; we never saw this kind of trouble in SGE versions up to 6.2u5 or in SoGE rel. 3710.

Okay, that looks fine then. I have no further ideas.

-- Reuti


> The structure of the files created under the qmaster/jobs directory:
> 
> single job (1 slot):
>  qstat -u \* -s a | grep 187471
>  187471 8.50000 METODIK dzcjsjo      r     06/15/2011 10:16:19 all.q@node05np01     1
>  root@sged8:/jms/spool/i001/sge_spool/qmaster# ls jobs/00/0018/7471
> jobs/00/0018/7471
> root@sged8:/jms/spool/i001/sge_spool/qmaster# file jobs/00/0018/7471
> jobs/00/0018/7471: data
> 
> parallel job, waiting or on hold:
> root@sged8:/jms/spool/i001/sge_spool/qmaster# qstat -u \* -s a |  grep 191340
>  191340 5.62661 adc_1PQ dzcar18      qw    06/17/2011 09:35:48                                   48        
> root at sged8:/jms/spool/i001/sge_spool/qmaster# ls -laR jobs/00/0019/1340/
> jobs/00/0019/1340/:
> total 28
> drwxr-xr-x   2 sgeadm tkvyp    19 2011-06-17 09:35 .
> drwxr-xr-x 124 sgeadm tkvyp 16384 2011-06-17 09:58 ..
> -rw-r--r--   1 sgeadm tkvyp  4708 2011-06-17 09:35 common
> 
> parallel job running:
> 191256 7.32617 ABCD_2PB d471676      r     06/17/2011 09:35:47 all.q@node04n120    48
> 
> root@sged8:/jms/spool/i001/sge_spool/qmaster# ls -laR jobs/00/0019/1256
> jobs/00/0019/1256:
> total 28
> drwxr-xr-x   3 sgeadm tkvyp    32 2011-06-17 09:35 .
> drwxr-xr-x 136 sgeadm tkvyp 16384 2011-06-17 09:53 ..
> drwxr-xr-x   3 sgeadm tkvyp    14 2011-06-17 09:35 1-4096
> -rw-r--r--   1 sgeadm tkvyp  4820 2011-06-17 09:35 common
> 
> jobs/00/0019/1256/1-4096:
> total 0
> drwxr-xr-x 3 sgeadm tkvyp 14 2011-06-17 09:35 .
> drwxr-xr-x 3 sgeadm tkvyp 32 2011-06-17 09:35 ..
> drwxr-xr-x 2 sgeadm tkvyp 66 2011-06-17 09:36 1
> 
> jobs/00/0019/1256/1-4096/1:
> total 16
> drwxr-xr-x 2 sgeadm tkvyp  66 2011-06-17 09:36 .
> drwxr-xr-x 3 sgeadm tkvyp  14 2011-06-17 09:35 .. 
> -rw-r--r-- 1 sgeadm tkvyp 735 2011-06-17 09:36 1.r2i2n8
> -rw-r--r-- 1 sgeadm tkvyp 736 2011-06-17 09:36 1.r2i3n12
> -rw-r--r-- 1 sgeadm tkvyp 736 2011-06-17 09:36 1.r4i3n13
> -rw-r--r-- 1 sgeadm tkvyp 892 2011-06-17 09:35 common
> 
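
For reference, the listings above suggest that classic spooling places each job under a path obtained by zero-padding the job ID to ten digits and splitting it into groups of 2, 4 and 4 digits (187471 becomes jobs/00/0018/7471). The following is only a minimal Python sketch of that mapping, inferred from the examples in this thread and not taken from the SGE sources; the function name job_spool_path is hypothetical.

```python
def job_spool_path(job_id: int) -> str:
    """Return the relative spool path a job ID appears to map to.

    Assumption (inferred from the listings in this thread only):
    the ID is zero-padded to 10 digits and split 2/4/4.
    """
    padded = f"{job_id:010d}"  # e.g. 187471 -> "0000187471"
    return f"jobs/{padded[:2]}/{padded[2:6]}/{padded[6:]}"


if __name__ == "__main__":
    # Job IDs that appear in this thread
    for jid in (187471, 191340, 191256, 50993):
        print(jid, job_spool_path(jid))
```

With such a helper one could check, for every job reported by qstat, whether the expected entry exists under <SPOOL_DIR>/qmaster; whether that entry is a plain file or a directory containing "common" evidently varies, as the listings show.
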
> The data shown above are from a production instance; they are missing in the test instance based on SoGE rel. 3910, even though jobs are scheduled correctly.
> 
> baf035
> 
> 2011/6/16 Reuti <reuti at staff.uni-marburg.de>
> On 16.06.2011 at 15:03, baf035 wrote:
> 
> > We are using SoGE rel. 3910 for tests.
> > Submitted jobs are dispatched correctly, but no information is stored in the spool directory <SPOOL_DIR>/qmaster/jobs.
> 
> You are using classic spooling?
> 
> 
> > The qmaster messages file reports a missing file/directory at the time the job ends:
> > ----------------
> > 06/16/2011 10:06:30|schedu|sged2|E|can't find parallel task 50993.1 task past_usage for update in function pe_task_update_master_list_usage
> > 06/16/2011 10:06:30|schedu|sged2|E|callback function for event "3941466. EVENT JOB 50993.1 task past_usage USAGE" failed
> > 06/16/2011 10:07:10|worker|sged2|E|unlink(jobs/00/0005/0993/common) failed: No such file or directory
> > 06/16/2011 10:07:10|worker|sged2|E|can not remove file job spool file: jobs/00/0005/0993/common
> 
> The "common" is strange here. What I saw in the past was just a plain file like 0993 containing binary information of the job.
> 
> 
> > 06/16/2011 10:07:10|worker|sged2|E|can not remove file job spool directory: jobs/00/0005/0993
> > ---------------
> > qacct -j 50993 | grep end_time | uniq
> > end_time     Thu Jun 16 10:05:52 2011
> > --------------
> >
> >
> > A migration of the qmaster leads to a total loss of job information: no jobs appear in qstat after the migration.
> >
> > We have also encountered a case where the files in <SPOOL_DIR>/qmaster/jobs were created correctly but disappeared during
> > the migration, without any entry in the messages file.
> 
> And it's in a shared space?
> 
> -- Reuti
> 
> 
> > Please confirm this behavior; thanks in advance for a fix.
> >
> > baf035