[gridengine users] [SGE-discuss] spool, no information, loss of jobs

Reuti reuti at staff.uni-marburg.de
Thu Jun 16 13:52:28 UTC 2011


On 16.06.2011 at 15:03, baf035 wrote:

> We are using SoGE rel. 3910 for tests.
> Submitted jobs are dispatched correctly, but no information is stored in the spool directory <SPOOL_DIR>/qmaster/jobs.

You are using classic spooling?
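
One way to check this, sketched under assumptions: the spooling method is recorded in the cell's bootstrap file ($SGE_ROOT/$SGE_CELL/common/bootstrap) as a `spooling_method` line. The helper names below (`read_bootstrap`, `spooling_method`) are made up for illustration and are not part of Grid Engine; the default cell name "default" is also an assumption about your installation.

```python
import os


def read_bootstrap(text: str) -> dict:
    """Parse 'key   value' lines of a bootstrap file into a dict.

    Blank lines and lines starting with '#' are skipped.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        settings[parts[0]] = parts[1] if len(parts) > 1 else ""
    return settings


def spooling_method(sge_root: str, sge_cell: str = "default"):
    """Return the spooling_method entry of a cell, or None if it is absent.

    The path layout <sge_root>/<sge_cell>/common/bootstrap is assumed.
    """
    path = os.path.join(sge_root, sge_cell, "common", "bootstrap")
    with open(path) as fh:
        return read_bootstrap(fh.read()).get("spooling_method")
```

Calling `spooling_method(os.environ["SGE_ROOT"])` on the qmaster host should then give either `classic` or `berkeleydb`.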


> The qmaster messages file reports a missing file/directory at the time the job ends:
> ----------------
> 06/16/2011 10:06:30|schedu|sged2|E|can't find parallel task 50993.1 task past_usage for update in function pe_task_update_master_list_usage
> 06/16/2011 10:06:30|schedu|sged2|E|callback function for event "3941466. EVENT JOB 50993.1 task past_usage USAGE" failed
> 06/16/2011 10:07:10|worker|sged2|E|unlink(jobs/00/0005/0993/common) failed: No such file or directory
> 06/16/2011 10:07:10|worker|sged2|E|can not remove file job spool file: jobs/00/0005/0993/common

The "common" is strange here. What I saw in the past was just a plain file like 0993 containing binary information of the job.
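
For orientation, the path in the log matches job 50993 if the job ID is zero-padded to ten digits and split 2/4/4. A minimal sketch of that mapping, inferred only from the log lines above (the function name `job_spool_path` is hypothetical, and the padding rule is an assumption, not taken from the SGE sources):

```python
def job_spool_path(job_id: int) -> str:
    """Map a job ID to its spool path relative to <SPOOL_DIR>/qmaster.

    Assumes a ten-digit, zero-padded ID split into 2/4/4 digit groups,
    as suggested by the "jobs/00/0005/0993" entries in the log.
    """
    if job_id < 0 or job_id > 9_999_999_999:
        raise ValueError("job ID does not fit into ten digits")
    digits = f"{job_id:010d}"  # 50993 becomes "0000050993"
    return "/".join(["jobs", digits[:2], digits[2:6], digits[6:]])


print(job_spool_path(50993))  # jobs/00/0005/0993
```

This makes it easier to look for the spool entry of a running job by hand while it is still in qstat.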


> 06/16/2011 10:07:10|worker|sged2|E|can not remove file job spool directory: jobs/00/0005/0993
> ---------------
> qacct -j 50993 | grep end_time | uniq
> end_time     Thu Jun 16 10:05:52 2011
> --------------
> 
> 
> A migration of the qmaster leads to a total loss of job information: no jobs are listed in qstat after the migration.
>
> We have also encountered a case where the files in <SPOOL_DIR>/qmaster/jobs were created correctly but disappeared
> during the migration, without any entry in the messages file.

And it's in a shared space?

-- Reuti


> Please confirm this behavior. Thanks in advance for a fix.
> 
> baf035
> _______________________________________________
> SGE-discuss mailing list
> SGE-discuss at liv.ac.uk
> https://arc.liv.ac.uk/mailman/listinfo/sge-discuss




