[gridengine users] Hadoop Integration HOWTO (was: Hadoop Integration - how's it going)

Rayson Ho rayson at scalablelogic.com
Mon Jun 4 17:17:53 UTC 2012


Hi Prakashan & Ron,

I thought about this issue while I was writing & testing the HOWTO...
but I didn't spend much more time on it as I needed to work on
something else, and it requires an upcoming C API binding for HDFS
from Ralph. Plus... I didn't want to pre-announce too many upcoming
new features. :-)

With the architecture of Prakashan's On-demand Hadoop Cluster, we can
take advantage of Ralph's C HDFS API, and we can then easily write a
scheduler plugin that queries HDFS block information. This scheduler
plugin then affects scheduling decision such that Open Grid
Scheduler/Grid Engine can send jobs to the data, which IMO is the core
idea behind Hadoop - scheduling jobs & tasks to the data.

Note that we will also need to productionize the "Parallel Environment
Queue Sort (PQS) Scheduler API", which was released as a technology
preview in GE 2011.11:

http://gridscheduler.sourceforge.net/Releases/ReleaseNotesGE2011.11.pdf
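Until the C HDFS API binding is available, one way to get at the same
block information from a script (a rough sketch only, not the scheduler
plugin itself; the fsck output format shown below is an assumption and
varies by Hadoop version) is to parse the output of
"hadoop fsck <path> -files -blocks -locations":

```shell
#!/bin/sh
# block_hosts.sh - list the datanodes holding the blocks of an HDFS file.
# Hypothetical sketch: it parses lines such as
#   0. blk_123_1001 len=67108864 repl=3 [10.0.0.1:50010, 10.0.0.2:50010]
# as printed by "hadoop fsck <path> -files -blocks -locations" in
# Hadoop 1.x-era releases. Adapt the parsing to your version's output.

block_hosts() {
    # Reads fsck output on stdin, prints one datanode host per line.
    grep -o '\[[^]]*\]' |          # keep the [host:port, host:port] lists
        tr -d '[]' |               # strip the brackets
        tr ',' '\n' |              # one host:port per line
        sed 's/^ *//; s/:.*$//' |  # drop leading spaces and the port
        sort -u                    # unique hosts
}

# Example usage (needs a live cluster, so it is commented out):
# hadoop fsck /user/data/input.txt -files -blocks -locations | block_hosts
```

A scheduler plugin could then rank queue instances by whether their host
appears in this list, so tasks land where their blocks live.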

Rayson



On Mon, Jun 4, 2012 at 12:55 PM, Prakashan Korambath <ppk at ats.ucla.edu> wrote:
> Hi Ron,
>
> I don't have anything planned beyond what I released right now.  The idea
> is to leave what Hadoop does best to Hadoop, and what SGE (or any
> scheduler) does best to the scheduler.  I believe somebody from SDSC also
> released a similar strategy for PBS/Torque.  I worked only on SGE because I
> mostly use SGE.
>
> Prakashan
>
>
>
> On 06/04/2012 09:45 AM, Ron Chen wrote:
>>
>> Hi Prakashan,
>>
>>
>> I am trying to understand your integration, and it looks similar to Ravi
>> Chandra Nallan's Hadoop Integration.
>>
>> One of the improvements in Daniel Templeton's Hadoop Integration is that
>> he models HDFS data as resources, and thus can schedule jobs to the data.
>> Is scheduling jobs to data a planned feature of your "On-Demand Hadoop
>> Cluster" integration?
>>
>> For those who don't know Ravi Chandra Nallan: he was with Sun
>> Microsystems when he developed the integration. Last I checked, he was
>> with Oracle.
>>
>>  -Ron
>>
>>
>>
>>
>> ----- Original Message -----
>> From: Rayson Ho<rayson at scalablelogic.com>
>> To: Prakashan Korambath<ppk at ats.ucla.edu>
>> Cc: "users at gridengine.org"<users at gridengine.org>
>> Sent: Friday, June 1, 2012 3:04 PM
>> Subject: Re: [gridengine users] Hadoop Integration HOWTO (was: Hadoop
>> Integration - how's it going)
>>
>> Thanks again Prakashan for the contribution!
>>
>> Rayson
>>
>>
>>
>> On Fri, Jun 1, 2012 at 1:25 PM, Prakashan Korambath<ppk at ats.ucla.edu>
>>  wrote:
>>>
>>> Thank you Rayson!  I appreciate you taking the time to upload the tar
>>> files and write the HOWTO.
>>>
>>> Regards,
>>>
>>> Prakashan
>>>
>>>
>>>
>>> On 06/01/2012 10:19 AM, Rayson Ho wrote:
>>>>
>>>>
>>>> I've reviewed the integration, and wrote a short Grid Engine Hadoop
>>>> HOWTO:
>>>>
>>>> http://gridscheduler.sourceforge.net/howto/GridEngineHadoop.html
>>>>
>>>> The difference between the two methods (the original SGE 6.2u5
>>>> integration vs. Prakashan's) is that with Prakashan's approach, Grid
>>>> Engine is used for resource allocation while the Hadoop job
>>>> scheduler/JobTracker handles all the MapReduce operations. A Hadoop
>>>> cluster is created on demand with Prakashan's approach, whereas in the
>>>> original SGE 6.2u5 method Grid Engine replaces the Hadoop job scheduler.
>>>>
>>>> As standard Grid Engine PEs are used in this new approach, one can
>>>> call "qrsh -inherit" and use Grid Engine's mechanism to start Hadoop
>>>> services on remote nodes, and thus get the full job control, job
>>>> accounting, and cleanup-at-termination benefits of any other tightly
>>>> integrated PE job!
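A minimal sketch of what that tight-integration startup could look like
(the daemon commands, HADOOP_HOME path, and PE_HOSTFILE handling are
assumptions, not Prakashan's actual scripts; LAUNCHER can be overridden,
e.g. with "echo", for a dry run without a cluster):

```shell
#!/bin/sh
# start_hadoop_pe.sh - start Hadoop daemons on the nodes that Grid Engine
# granted to this PE job, via the tight-integration remote starter.
# Hypothetical sketch: adapt paths and daemon commands to your install.

LAUNCHER=${LAUNCHER:-"qrsh -inherit"}   # tight-integration remote starter
HADOOP_HOME=${HADOOP_HOME:-/opt/hadoop}

start_daemons() {
    # $PE_HOSTFILE lines look like: "hostname slots queue processor"
    master=""
    while read host slots rest; do
        if [ -z "$master" ]; then
            master=$host
            # The first granted host runs the JobTracker.
            $LAUNCHER "$master" "$HADOOP_HOME/bin/hadoop" jobtracker &
        fi
        # Every granted host runs a TaskTracker.
        $LAUNCHER "$host" "$HADOOP_HOME/bin/hadoop" tasktracker &
    done < "$PE_HOSTFILE"
    wait   # keep the daemons under the shepherd for job control/accounting
}
```

Because the daemons stay children of "qrsh -inherit", qdel and the usual
accounting records work exactly as for any other tightly integrated job.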
>>>>
>>>> Rayson
>>>>
>>>>
>>>>
>>>> On Tue, May 29, 2012 at 10:36 AM, Prakashan Korambath<ppk at ats.ucla.edu>
>>>>  wrote:
>>>>>
>>>>>
>>>>> I put my scripts in a tar file and sent them to Rayson yesterday so
>>>>> that he can put them in a common place for download.
>>>>>
>>>>> Prakashan
>>>>>
>>>>>
>>>>>
>>>>> On 05/29/2012 07:18 AM, Jesse Becker wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, May 28, 2012 at 12:00:24PM -0400, Prakashan
>>>>>> Korambath wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This is how we run Hadoop using Grid Engine (or, for that matter,
>>>>>>> any scheduler with appropriate alterations):
>>>>>>>
>>>>>>> http://www.ats.ucla.edu/clusters/hoffman2/hadoop/default.htm
>>>>>>>
>>>>>>> Basically, either run a prolog or call a script from inside the
>>>>>>> submission command file itself to parse the PE_HOSTFILE and create
>>>>>>> the Hadoop *.site.xml, masters, and slaves files at run time. This
>>>>>>> methodology is suitable for any scheduler, as it is not dependent
>>>>>>> on any particular one. If there is interest, I can post the
>>>>>>> prologue script. Thanks.
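A stripped-down illustration of that prolog idea (a sketch under
assumptions, not the actual UCLA script: it only writes the masters and
slaves files, and the conf directory argument is hypothetical; the real
prolog would also generate the *-site.xml files):

```shell
#!/bin/sh
# make_hadoop_conf.sh - build Hadoop "masters" and "slaves" files from
# the PE_HOSTFILE that Grid Engine hands a parallel job. Sketch only.

make_conf() {
    conf_dir=$1
    mkdir -p "$conf_dir"
    # PE_HOSTFILE format: "hostname slots queue processor", one line per node.
    awk '{print $1}' "$PE_HOSTFILE" > "$conf_dir/slaves"
    # Use the first granted node as the master.
    head -n 1 "$conf_dir/slaves" > "$conf_dir/masters"
}

# Example (inside a prolog or the job script itself):
# make_conf "$TMPDIR/hadoop-conf"
```

Because only PE_HOSTFILE is scheduler-specific, porting this to PBS/Torque
mostly means swapping in $PBS_NODEFILE.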
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Please do.
>>>>>>
>>>>>
>>>
>>
>> _______________________________________________
>> users mailing list
>> users at gridengine.org
>> https://gridengine.org/mailman/listinfo/users
>>
>



