[gridengine users] Another MATLAB + SGE Question

Prentice Bisbal prentice at ias.edu
Thu Feb 9 21:20:26 UTC 2012


An epilog script would delete the files, but not every user would want
those files automatically deleted. The properties of the parallel job
(results, etc.) are stored in those files. As long as those files exist,
the MATLAB user can retrieve the results to work with them. Once they
call destroy, the results are gone, so when to delete them should really
be left up to the MATLAB user, and may vary from job to job.
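
If a site did want an epilog to handle this, one way to keep the decision with the user is to make deletion opt-in. The following is only a rough sketch of that idea, not something I have run on a cluster. MATLAB_JOB_DIR and MATLAB_JOB_CLEANUP are hypothetical variable names that neither MATLAB nor SGE sets, so the submitter would have to pass them to the job (for example with qsub -v). It also assumes the job directory is visible from the execution host.

```shell
#!/bin/sh
# Sketch of an opt-in cleanup step for an SGE epilog script.
# MATLAB_JOB_DIR and MATLAB_JOB_CLEANUP are hypothetical names chosen
# for illustration; nothing in MATLAB or SGE defines them.

cleanup_job_dir() {
    job_dir=$1
    opt_in=$2

    # Leave everything in place unless the user explicitly asked for deletion.
    [ "$opt_in" = "yes" ] || return 0

    # Refuse an empty path or the root directory.
    case "$job_dir" in
        ""|"/") return 1 ;;
    esac

    # Nothing to do if the directory is already gone.
    [ -d "$job_dir" ] || return 0

    rm -rf -- "$job_dir"
}

# In an epilog this would be called as:
# cleanup_job_dir "$MATLAB_JOB_DIR" "$MATLAB_JOB_CLEANUP"
```

With the default (no opt-in), the files stay put and the MATLAB user can still retrieve the results and call destroy whenever they choose.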

Prentice


On 02/08/2012 05:47 PM, "Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D." wrote:
> http://gridscheduler.sourceforge.net/howto/filestaging/index.html
> Maybe one can use an epilog script to delete these files.
> regards
>
>
>
> On 2/8/2012 4:11 PM, Prentice Bisbal wrote:
>> Murph,
>>
>> I did more testing this afternoon. I'm not sure I'd call this problem a
>> race condition. More like a communications error, that behaves like a
>> race condition.
>>
>> Here's my latest understanding of what's going on (again posted to the list
>> for others who might search the archives):
>>
>> When MATLAB calls qsub, it uses the -o switch to redirect STDOUT to a
>> file in the Job directory named <JobID>.mpiexec.log. For Job1, this
>> would be
>>
>> Job1/Job1.mpiexec.log
>>
>> As soon as the job itself completes, MATLAB thinks it's done completely,
>> and if you called the destroy function, it will delete the job
>> directory. Now in reality, after the job itself completes, SGE still has
>> some housekeeping to do, like collecting the STDOUT from the job and
>> writing it to Job1/Job1.mpiexec.log. So that's exactly what SGE tries to
>> do, but it finds that file is no longer there, and panics, reporting an
>> error with the job.
>>
>> To fix this, I added a pause before the destroy, like this:
>>
>> pjob = createParallelJob(sched);
>> set(pjob, 'MinimumNumberOfWorkers', 1);
>> set(pjob, 'MaximumNumberOfWorkers', 16);
>> createTask(pjob, @sum, 1, {[1 2 3]});
>> submit(pjob);
>> waitForState(pjob, 'finished');
>> results = getAllOutputArguments(pjob);
>> disp(results);
>> pause(16);
>> destroy(pjob);
>>
>> This works, but keeps the job running a bit longer. I initially put the
>> pause right after waitForState(), but then moved it since I decided that
>> users copying my code would want to get their results ASAP, and wouldn't
>> care if there was a pause before calling destroy. From testing, I've
>> found that the pause should be about 1 second per process on my cluster.
>> So, for a job with 4 workers, it should pause for 4 seconds. Since I
>> have a 16 worker license, I set my pause duration to 16 seconds.
>>
>> I say this is a communications error and not a race condition because if
>> getJobStateFcn.m did its job correctly, waitForState would block until
>> the job no longer shows up in the output of 'qstat -xml', but it's
>> clearly unblocking before then.
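
A fixed pause could in principle be replaced by waiting until SGE itself no longer lists the job. The sketch below is one hedged way to do that from the shell. It assumes that 'qstat -j <jobid>' exits non-zero once the job has left the queue, which should be checked on the local installation. The function name and the QSTAT override variable are made up for illustration so the loop can be tried without SGE.

```shell
#!/bin/sh
# Sketch: block until SGE no longer knows about a job, with a timeout.
# Assumption: `qstat -j <jobid>` exits non-zero once the job is gone.
# QSTAT is a hypothetical override used only to exercise the loop.

wait_for_sge_job_gone() {
    job_id=$1
    timeout=${2:-60}    # maximum number of seconds to wait
    interval=${3:-1}    # seconds between polls
    waited=0

    # Keep polling while qstat still reports the job.
    while "${QSTAT:-qstat}" -j "$job_id" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1    # job still listed after the timeout
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

Something like this could be invoked from MATLAB with system() just before destroy(pjob), provided the SGE job ID is available to the script. How to obtain that ID from the MATLAB job object is left open here.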
>>
>> Prentice
>>
>> On 02/08/2012 11:31 AM, Murphy, Brian (E IT F PR ORL 2) wrote:
>>
>>
>>> Prentice,
>>>
>>> No worries.  I will post back here when (if) we have a solution.
>>>
>>> --murph
>>>
>>> -----Original Message-----
>>> From: users-bounces at gridengine.org
>>> [mailto:users-bounces at gridengine.org] On Behalf Of Prentice Bisbal
>>> Sent: Wednesday, February 08, 2012 11:27 AM
>>> To: users at gridengine.org
>>> Subject: Re: [gridengine users] Another MATLAB + SGE Question
>>>
>>> Murph,
>>>
>>> Thanks for letting me know I'm not the only one!
>>>
>>> Prentice
>>>
>>>
>>> On 02/08/2012 11:19 AM, Murphy, Brian (E IT F PR ORL 2) wrote:
>>>> Prentice,
>>>>
>>>> From what we have determined, this is a race condition.  We have
>>>> experienced the same problem for months.  Some of my users are
>>>> currently working with Mathworks to find a solution.
>>>>
>>>> --murph
>>>>
>>>> Siemens Energy, Inc
>>>> Orlando, FL
>>>>
>>>> -----Original Message-----
>>>> From: users-bounces at gridengine.org
>>>> [mailto:users-bounces at gridengine.org] On Behalf Of Prentice Bisbal
>>>> Sent: Wednesday, February 08, 2012 10:37 AM
>>>> To: users at gridengine.org Users
>>>> Subject: [gridengine users] Another MATLAB + SGE Question
>>>>
>>>> So I finally have MATLAB set up and working fine with SGE. I can submit
>>>> parallel and distributed jobs from MATLAB to SGE, and then SGE does its
>>>> thing.
>>>> I have one remaining problem, and I thought I'd ask here first before
>>>> talking to Mathworks, since I have more confidence in you guys. I'm
>>>> afraid this might be very MATLAB specific, though.
>>>>
>>>> When I submit a parallel job, the job executes and appears to complete
>>>> without any issues (MATLAB prints the results), but then the running job
>>>> goes from status 'r' to 'Eqw'. Using 'qstat -j <jobid> -explain E' shows
>>>> this error:
>>>>
>>>> error reason    1:          02/08/2012 10:28:58 [103808:6171]: error:
>>>> can't open output file "/work/prentice/matlab/Job1/Job1.mp
>>>>
>>>> My working directory is /work/prentice/matlab, and the Job1 directory is
>>>> a temporary directory used by MATLAB while the job is running.
>>>>
>>>> My MATLAB submit script calls destroy() to kill the job after it
>>>> gets the results, and this destroy function deletes the Job1 subdir
>>>> while SGE is still running, leading to this result. If I don't call
>>>> destroy(), MATLAB leaves a bunch of files and dirs in my working
>>>> directory related to Job1, Job2, etc.
>>>>
>>>> Has anyone else seen this with MATLAB+SGE? I suspect I might need to
>>>> modify the destroyJobFcn.m file to not kill the jobs while they're still
>>>> running.
>>>>
>>> _______________________________________________
>>> users mailing list
>>> users at gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
>>>
>

