[gridengine users] Another MATLAB + SGE Question
Murphy, Brian (E IT F PR ORL 2)
brian.murphy at siemens.com
Thu Feb 9 18:30:19 UTC 2012
Prentice,
Thank you. I see what you are saying. I will have the users test this.
--murph
-----Original Message-----
From: users-bounces at gridengine.org [mailto:users-bounces at gridengine.org] On Behalf Of Prentice Bisbal
Sent: Wednesday, February 08, 2012 4:11 PM
To: users at gridengine.org
Subject: Re: [gridengine users] Another MATLAB + SGE Question
Murph,
I did more testing this afternoon. I'm not sure I'd call this problem a
race condition. It's more like a communications error that behaves like a
race condition.
Here's my latest understanding of what's going on (again, posted to the list
for others who might search the archives):
When MATLAB calls qsub, it uses the -o switch to redirect STDOUT to a
file in the Job directory named <JobID>.mpiexec.log. For Job1, this
would be
Job1/Job1.mpiexec.log
As soon as the job itself completes, MATLAB thinks it's done completely,
and if you called the destroy function, it will delete the job
directory. Now in reality, after the job itself completes, SGE still has
some housekeeping to do, like collecting the STDOUT from the job and
writing it to Job1/Job1.mpiexec.log. So that's exactly what SGE tries to
do, but it finds that file is no longer there, and panics, reporting an
error with the job.
To fix this, I added a pause before the destroy, like this:
pjob = createParallelJob(sched);
set(pjob, 'MinimumNumberOfWorkers', 1);
set(pjob, 'MaximumNumberOfWorkers', 16);
createTask(pjob, @sum, 1, {[1 2 3]});
submit(pjob);
waitForState(pjob, 'finished');
results = getAllOutputArguments(pjob);
disp(results);
pause(16);    % give SGE time to write Job1/Job1.mpiexec.log (about 1 s per worker)
destroy(pjob);
This works, but keeps the job running a bit longer. I initially put the
pause right after waitForState(), but then moved it since I decided that
users copying my code would want to get their results ASAP, and wouldn't
care if there was a pause before calling destroy. From testing, I've
found that the pause should be about 1 second per process on my cluster.
So, for a job with 4 workers, it should pause for 4 seconds. Since I
have a 16 worker license, I set my pause duration to 16 seconds.
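A minimal sketch of that rule of thumb, for anyone adapting the script. The
figure of 1 second per worker comes from testing on one cluster and is only an
assumption anywhere else. The names secondsPerWorker, numWorkers and
pauseSeconds are illustrative, not part of the MATLAB API.

```matlab
% Derive the pause from the worker count instead of hard-coding it.
secondsPerWorker = 1;    % assumption: measured on one cluster, re-measure on yours
numWorkers = 16;         % same value as MaximumNumberOfWorkers in the script
pauseSeconds = secondsPerWorker * numWorkers;
fprintf('Pausing %d seconds before destroy\n', pauseSeconds);
pause(pauseSeconds);     % give SGE time to write its output file
% destroy(pjob);         % then clean up the job directory as in the script
```

With numWorkers set to 4, the same lines give a 4 second pause.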
I say this is a communications error and not a race condition because, if
getJobStateFcn.m did its job correctly, waitForState would block until
the job no longer shows up in the output of 'qstat -xml', but it's
clearly unblocking before then.
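One alternative to a fixed pause, sketched here under assumptions and not
tested against a real cluster, is to poll qstat directly and call destroy
only once SGE no longer reports the job. waitForSgeJobGone is a hypothetical
helper, not part of MATLAB or SGE. The caller has to supply the SGE job ID
as a string, and the sketch assumes that 'qstat -j <id>' exits with a
non-zero status once the job is unknown to SGE.

```matlab
function gone = waitForSgeJobGone(sgeJobId, timeoutSeconds, pollSeconds)
% waitForSgeJobGone  Hypothetical helper: wait until SGE forgets a job.
%   sgeJobId        SGE job ID as a string (caller must supply it)
%   timeoutSeconds  give up after this many seconds
%   pollSeconds     delay between qstat calls
% Returns true if qstat stopped reporting the job, false on timeout.
% Assumption: 'qstat -j <id>' returns a non-zero status for an unknown job.
% Caveat: a missing qstat binary also gives a non-zero status.
gone = false;
startTime = tic;
while toc(startTime) < timeoutSeconds
    % Capture the output so it is not echoed to the command window.
    [status, ~] = system(sprintf('qstat -j %s', sgeJobId));
    if status ~= 0
        gone = true;    % SGE no longer knows the job
        return;
    end
    pause(pollSeconds);
end
end
```

In the script above, a call such as waitForSgeJobGone(id, 60, 1) would take
the place of pause(16), where id is the job's SGE ID. A job stuck in 'Eqw'
stays visible to qstat, so the timeout matters.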
Prentice
On 02/08/2012 11:31 AM, Murphy, Brian (E IT F PR ORL 2) wrote:
> Prentice,
>
> No worries. I will post back here when (if) we have a solution.
>
> --murph
>
> -----Original Message-----
> From: users-bounces at gridengine.org [mailto:users-bounces at gridengine.org] On Behalf Of Prentice Bisbal
> Sent: Wednesday, February 08, 2012 11:27 AM
> To: users at gridengine.org
> Subject: Re: [gridengine users] Another MATLAB + SGE Question
>
> Murph,
>
> Thanks for letting me know I'm not the only one!
>
> Prentice
>
>
> On 02/08/2012 11:19 AM, Murphy, Brian (E IT F PR ORL 2) wrote:
>> Prentice,
>>
>> From what we have determined, this is a race condition. We have experienced the same problem for months. Some of my users are currently working with Mathworks to find a solution.
>>
>> --murph
>>
>> Siemens Energy, Inc
>> Orlando, FL
>>
>> -----Original Message-----
>> From: users-bounces at gridengine.org [mailto:users-bounces at gridengine.org] On Behalf Of Prentice Bisbal
>> Sent: Wednesday, February 08, 2012 10:37 AM
>> To: users at gridengine.org Users
>> Subject: [gridengine users] Another MATLAB + SGE Question
>>
>> So I finally have MATLAB set up and working fine with SGE. I can submit
>> parallel and distributed jobs from MATLAB to SGE, and then SGE does its
>> thing.
>> I have one remaining problem, and I thought I'd ask here first before
>> talking to Mathworks, since I have more confidence in you guys. I'm
>> afraid this might be very MATLAB specific, though.
>>
>> When I submit a parallel job, the job executes, and appears to complete
>> without any issues (MATLAB prints the results), but then the running job
>> goes from status 'r' to 'Eqw'. Using 'qstat -j <jobid> -explain E' shows
>> this error:
>>
>> error reason 1: 02/08/2012 10:28:58 [103808:6171]: error:
>> can't open output file "/work/prentice/matlab/Job1/Job1.mp
>>
>> My working directory is /work/prentice/matlab, and the Job1 directory is
>> a temporary directory used by MATLAB while the job is running.
>>
>> My MATLAB submit script calls destroy() to kill the job after it
>> gets the results, and this destroy function deletes the Job1 subdir
>> while SGE is still running, leading to this result. If I don't call
>> destroy(), MATLAB leaves a bunch of files and dirs in my working
>> directory related to Job1, Job2, etc.
>>
>> Has anyone else seen this with MATLAB+SGE? I suspect I might need to
>> modify the destroyJobFcn.m file to not kill the jobs while they're still
>> running.
>>
> _______________________________________________
> users mailing list
> users at gridengine.org
> https://gridengine.org/mailman/listinfo/users