[gridengine users] Hadoop Integration HOWTO (was: Hadoop Integration - how's it going)

Ron Chen ron_chen_123 at yahoo.com
Tue Jun 5 06:10:01 UTC 2012


Ralph, when you have your C HDFS API ready, can you please let us know? And do you know when Apache Hadoop is going to release version 2.0?


 -Ron



----- Original Message -----
From: Rayson Ho <rayson at scalablelogic.com>
To: Prakashan Korambath <ppk at ats.ucla.edu>
Cc: Ralph Castain <rhc at open-mpi.org>; Ron Chen <ron_chen_123 at yahoo.com>; "users at gridengine.org" <users at gridengine.org>
Sent: Monday, June 4, 2012 1:53 PM
Subject: Re: [gridengine users] Hadoop Integration HOWTO (was: Hadoop Integration - how's it going)

Prakashan,

Ralph mentioned to me before that the C API bindings will be available
in Apache Hadoop 2.0, which adds Google protocol buffers as one of the
new features and thus supports non-Java HDFS bindings.

AFAIK, EMC MapR replaces HDFS with something that has more HA features
& performance. I don't know all the specific details but I do believe
that most of the API interfaces are going to be the same as or very
similar to the existing HDFS APIs.

Rayson



On Mon, Jun 4, 2012 at 1:24 PM, Prakashan Korambath <ppk at ats.ucla.edu> wrote:
> Hi Rayson,
>
> Let me know when you have the C API bindings from Ralph ready.  I can help
> you guys with testing them out.
>
> Prakashan
>
>
> On 06/04/2012 10:17 AM, Rayson Ho wrote:
>>
>> Hi Prakashan & Ron,
>>
>> I thought about this issue while I was writing & testing the HOWTO...
>>
>> but I didn't spend much more time on it as I needed to work on
>> something else, and it requires an upcoming C API binding for HDFS
>> from Ralph. Plus... I didn't want to pre-announce too many upcoming
>> new features. :-)
>>
>> With the architecture of Prakashan's On-demand Hadoop Cluster, we can
>> take advantage of Ralph's C HDFS API, and we can then easily write a
>> scheduler plugin that queries HDFS block information. This scheduler
>> plugin then affects scheduling decisions such that Open Grid
>> Scheduler/Grid Engine can send jobs to the data, which IMO is the core
>> idea behind Hadoop - scheduling jobs & tasks to the data.
>>
>>
>> Note that we will also need to productionize the "Parallel Environment
>> Queue Sort (PQS) Scheduler API", which was under technology preview in
>> GE 2011.11:
>>
>> http://gridscheduler.sourceforge.net/Releases/ReleaseNotesGE2011.11.pdf
>>
>> Rayson
>>
>>
>>
>> On Mon, Jun 4, 2012 at 12:55 PM, Prakashan Korambath<ppk at ats.ucla.edu>
>>  wrote:
>>>
>>> Hi Ron,
>>>
>>> I don't have anything planned beyond what I have released right now.  The
>>> idea is to leave what Hadoop does best to Hadoop, and what SGE (or any
>>> scheduler) does best to the scheduler.  I believe somebody from SDSC also
>>> released a similar strategy for PBS/Torque.  I worked only on SGE because
>>> I mostly use SGE.
>>>
>>> Prakashan
>>>
>>>
>>>
>>> On 06/04/2012 09:45 AM, Ron Chen wrote:
>>>>
>>>>
>>>> Hi Prakashan,
>>>>
>>>>
>>>> I am trying to understand your integration, and it looks similar to Ravi
>>>> Chandra Nallan's Hadoop Integration.
>>>>
>>>> One of the improvements in Daniel Templeton's Hadoop Integration is that
>>>> he models HDFS data as resources, and thus can schedule jobs to the
>>>> data. Is scheduling jobs to data a planned feature of your "On-Demand
>>>> Hadoop Cluster" integration?
>>>>
>>>> For those who don't know Ravi Chandra Nallan: he was with Sun
>>>> Microsystems when he developed the integration. Last I checked, he was
>>>> with Oracle.
>>>>
>>>>  -Ron
>>>>
>>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: Rayson Ho<rayson at scalablelogic.com>
>>>> To: Prakashan Korambath<ppk at ats.ucla.edu>
>>>> Cc: "users at gridengine.org"<users at gridengine.org>
>>>> Sent: Friday, June 1, 2012 3:04 PM
>>>> Subject: Re: [gridengine users] Hadoop Integration HOWTO (was: Hadoop
>>>> Integration - how's it going)
>>>>
>>>> Thanks again Prakashan for the contribution!
>>>>
>>>> Rayson
>>>>
>>>>
>>>>
>>>> On Fri, Jun 1, 2012 at 1:25 PM, Prakashan Korambath<ppk at ats.ucla.edu>
>>>>  wrote:
>>>>>
>>>>>
>>>>> Thank you Rayson!  I appreciate you taking the time to upload the tar
>>>>> files and write the HOWTO.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Prakashan
>>>>>
>>>>>
>>>>>
>>>>> On 06/01/2012 10:19 AM, Rayson Ho wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> I've reviewed the integration, and wrote a short Grid Engine Hadoop
>>>>>> HOWTO:
>>>>>>
>>>>>> http://gridscheduler.sourceforge.net/howto/GridEngineHadoop.html
>>>>>>
>>>>>> The difference between the two methods (the original SGE 6.2u5 one vs.
>>>>>> Prakashan's) is that with Prakashan's approach, Grid Engine is used
>>>>>> for resource allocation, and the Hadoop job scheduler/JobTracker
>>>>>> handles all the MapReduce operations. A Hadoop cluster is created on
>>>>>> demand with Prakashan's approach, whereas in the original SGE 6.2u5
>>>>>> method Grid Engine replaces the Hadoop job scheduler.
>>>>>>
>>>>>> As standard Grid Engine PEs are used in this new approach, one can
>>>>>> call "qrsh -inherit" and use Grid Engine's mechanism to start Hadoop
>>>>>> services on remote nodes, and thus get full job control, job
>>>>>> accounting, and cleanup-at-termination benefits like any other tight
>>>>>> PE job!
>>>>>>
>>>>>> Rayson
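[A rough sketch of the tight-PE startup pattern Rayson describes above. The daemon commands, host layout, and the dry-run wrapper are assumptions for illustration, not the actual integration scripts.]

```shell
#!/bin/sh
# Sketch: start Hadoop workers under the PE's control via "qrsh -inherit",
# so Grid Engine accounts for and can clean up every remote process.
# RUN=echo keeps this a dry run; a real prolog would drop the wrapper.
RUN=${RUN:-echo}

# SGE exports PE_HOSTFILE for parallel jobs; create demo data if absent.
PE_HOSTFILE=${PE_HOSTFILE:-pe_hostfile}
[ -f "$PE_HOSTFILE" ] || \
    printf 'node1 8 all.q@node1 UNDEFINED\nnode2 8 all.q@node2 UNDEFINED\n' \
    > "$PE_HOSTFILE"

# First allocated host acts as the Hadoop master (JobTracker);
# every allocated host additionally gets a TaskTracker.
MASTER=$(awk 'NR==1 {print $1}' "$PE_HOSTFILE")
$RUN qrsh -inherit "$MASTER" hadoop jobtracker

awk '{print $1}' "$PE_HOSTFILE" | while read -r host; do
    $RUN qrsh -inherit "$host" hadoop tasktracker
done
```

[With RUN left at its default, the script only prints the qrsh commands it would issue, which makes the fan-out pattern easy to inspect before wiring it into a PE start procedure.]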
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, May 29, 2012 at 10:36 AM, Prakashan
>>>>>> Korambath<ppk at ats.ucla.edu>
>>>>>>  wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I put my scripts in a tar file and sent it to Rayson yesterday so
>>>>>>> that he can put it in a common place for download.
>>>>>>>
>>>>>>> Prakashan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 05/29/2012 07:18 AM, Jesse Becker wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, May 28, 2012 at 12:00:24PM -0400, Prakashan
>>>>>>>> Korambath wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is how we run Hadoop using Grid Engine (or, for that matter,
>>>>>>>>> any scheduler with appropriate alterations):
>>>>>>>>>
>>>>>>>>> http://www.ats.ucla.edu/clusters/hoffman2/hadoop/default.htm
>>>>>>>>>
>>>>>>>>> Basically, either run a prolog or call a script inside the
>>>>>>>>> submission command file itself to parse the PE_HOSTFILE and create
>>>>>>>>> the Hadoop *-site.xml, masters, and slaves files at run time. This
>>>>>>>>> methodology is suitable for any scheduler as it does not depend on
>>>>>>>>> any one of them. If there is interest I can post the prologue
>>>>>>>>> script. Thanks.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Please do.
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users at gridengine.org
>>>> https://gridengine.org/mailman/listinfo/users
>>>>
>>>
>



