[gridengine users] Hadoop Integration HOWTO (was: Hadoop Integration - how's it going)

Prakashan Korambath ppk at ats.ucla.edu
Mon Jun 4 17:24:34 UTC 2012


Hi Rayson,

Let me know when you have the C API bindings from Ralph ready.  I
can help you guys with testing them out.

Prakashan


On 06/04/2012 10:17 AM, Rayson Ho wrote:
> Hi Prakashan & Ron,
>
> I thought about this issue while I was writing & testing the HOWTO...
> but I didn't spend much more time on it as I needed to work on
> something else, and it requires an upcoming C API binding for HDFS
> from Ralph. Plus... I didn't want to pre-announce too many upcoming
> new features. :-)
>
> With the architecture of Prakashan's On-demand Hadoop Cluster, we can
> take advantage of Ralph's C HDFS API, and we can then easily write a
> scheduler plugin that queries HDFS block information. This scheduler
> plugin then affects scheduling decisions so that Open Grid
> Scheduler/Grid Engine can send jobs to the data, which IMO is the core
> idea behind Hadoop - scheduling jobs & tasks to the data.
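[The C HDFS API mentioned here was not yet released at the time, but the block-location data such a plugin would consume can already be seen with "hadoop fsck <file> -files -blocks -locations". The following sketch parses a canned sample of that output to show the per-block host list a locality-aware scheduler plugin could sort queues by; the sample block ID and datanode IPs are made up:]

```shell
#!/bin/sh
# Illustrative sketch only: extract the datanodes holding an HDFS block
# from "hadoop fsck ... -files -blocks -locations"-style output. These
# are the hosts a scheduler would prefer for tasks reading this block.
# The sample line below is fabricated for demonstration.

FSCK_SAMPLE='0. blk_123 len=67108864 repl=3 [10.0.0.1:50010, 10.0.0.2:50010, 10.0.0.3:50010]'

# Pull out the bracketed datanode list, then split it into one host per line.
hosts=$(echo "$FSCK_SAMPLE" \
    | sed 's/.*\[\(.*\)\].*/\1/' \
    | tr -d ' ' | tr ',' '\n' \
    | cut -d: -f1)

echo "$hosts"
```

In a real plugin the same host list would come from the C HDFS API call rather than from parsing fsck text, and would then feed the queue-sort order.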
>
> Note that we will also need to productionize the "Parallel Environment
> Queue Sort (PQS) Scheduler API", which was under technology preview in
> GE 2011.11:
>
> http://gridscheduler.sourceforge.net/Releases/ReleaseNotesGE2011.11.pdf
>
> Rayson
>
>
>
> On Mon, Jun 4, 2012 at 12:55 PM, Prakashan Korambath<ppk at ats.ucla.edu>  wrote:
>> Hi Ron,
>>
>> I don't have anything planned beyond what I released right now.  The idea
>> is to leave what Hadoop does best to Hadoop, and what SGE or any scheduler
>> does best to the scheduler.  I believe somebody from SDSC also released a
>> similar strategy for PBS/Torque.  I worked only on SGE because I mostly
>> use SGE.
>>
>> Prakashan
>>
>>
>>
>> On 06/04/2012 09:45 AM, Ron Chen wrote:
>>>
>>> Hi Prakashan,
>>>
>>>
>>> I am trying to understand your integration, and it looks like Ravi Chandra
>>> Nallan's Hadoop Integration.
>>>
>>> One of the improvements in Daniel Templeton's Hadoop Integration is he
>>> models HDFS data as resources, and thus can schedule jobs to data. Is
>>> scheduling jobs to data a planned feature of your "On-Demand Hadoop Cluster"
>>> integration?
>>>
>>> For those who didn't know Ravi Chandra Nallan, he was with Sun
>>> Microsystems when he developed the integration. Last I checked, he was
>>> with Oracle.
>>>
>>>   -Ron
>>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: Rayson Ho<rayson at scalablelogic.com>
>>> To: Prakashan Korambath<ppk at ats.ucla.edu>
>>> Cc: "users at gridengine.org"<users at gridengine.org>
>>> Sent: Friday, June 1, 2012 3:04 PM
>>> Subject: Re: [gridengine users] Hadoop Integration HOWTO (was: Hadoop
>>> Integration - how's it going)
>>>
>>> Thanks again Prakashan for the contribution!
>>>
>>> Rayson
>>>
>>>
>>>
>>> On Fri, Jun 1, 2012 at 1:25 PM, Prakashan Korambath<ppk at ats.ucla.edu>
>>>   wrote:
>>>>
>>>> Thank you Rayson!  I appreciate you taking the time to upload the tar
>>>> files and write the HOWTO.
>>>>
>>>> Regards,
>>>>
>>>> Prakashan
>>>>
>>>>
>>>>
>>>> On 06/01/2012 10:19 AM, Rayson Ho wrote:
>>>>>
>>>>> I've reviewed the integration, and wrote a short Grid Engine Hadoop
>>>>> HOWTO:
>>>>>
>>>>> http://gridscheduler.sourceforge.net/howto/GridEngineHadoop.html
>>>>>
>>>>> The difference between the 2 methods (original SGE 6.2u5 vs
>>>>> Prakashan's) is that with Prakashan's approach, Grid Engine is used
>>>>> for resource allocation, and the Hadoop job scheduler/Job Tracker is
>>>>> used to handle all the MapReduce operations. A Hadoop cluster is
>>>>> created on demand with Prakashan's approach, but in the original SGE
>>>>> 6.2u5 method Grid Engine replaces the Hadoop job scheduler.
>>>>>
>>>>> As standard Grid Engine PEs are used in this new approach, one can
>>>>> call "qrsh -inherit" and use Grid Engine's method to start Hadoop
>>>>> services on remote nodes, and thus get full job control, job
>>>>> accounting, and cleanup-at-termination benefits like any other tight
>>>>> PE job!
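[A minimal sketch of what such a tight-PE start script might do - not Prakashan's actual scripts; the daemon name, HADOOP_HOME path, and fake PE_HOSTFILE contents are illustrative. The qrsh lines are printed as a dry run here so the sketch runs anywhere; a real PE script would execute them directly:]

```shell
#!/bin/sh
# Hypothetical tight-PE start sketch: launch a Hadoop daemon on each host
# allocated to the job via "qrsh -inherit", so Grid Engine tracks, accounts
# for, and cleans up the remote processes like any other tight-PE task.

# In a real PE, PE_HOSTFILE is set by Grid Engine; fake one for the demo.
PE_HOSTFILE=${PE_HOSTFILE:-/tmp/pe_hostfile.demo.$$}
[ -f "$PE_HOSTFILE" ] || printf 'node1 8\nnode2 8\n' > "$PE_HOSTFILE"

HADOOP_HOME=${HADOOP_HOME:-/opt/hadoop}   # illustrative path

# "qrsh -inherit <host> <cmd>" runs <cmd> inside the job's existing
# allocation on <host> instead of opening an untracked ssh session.
while read -r host _; do
    echo qrsh -inherit "$host" \
        "$HADOOP_HOME/bin/hadoop-daemon.sh" start tasktracker
done < "$PE_HOSTFILE"
```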
>>>>>
>>>>> Rayson
>>>>>
>>>>>
>>>>>
>>>>> On Tue, May 29, 2012 at 10:36 AM, Prakashan Korambath<ppk at ats.ucla.edu>
>>>>>   wrote:
>>>>>>
>>>>>> I put my scripts in a tar file and sent it to Rayson yesterday so
>>>>>> that he can put it in a common place to download.
>>>>>>
>>>>>> Prakashan
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 05/29/2012 07:18 AM, Jesse Becker wrote:
>>>>>>>
>>>>>>> On Mon, May 28, 2012 at 12:00:24PM -0400, Prakashan
>>>>>>> Korambath wrote:
>>>>>>>>
>>>>>>>> This is how we run Hadoop using Grid Engine (or, for that matter,
>>>>>>>> any scheduler with appropriate alterations):
>>>>>>>>
>>>>>>>> http://www.ats.ucla.edu/clusters/hoffman2/hadoop/default.htm
>>>>>>>>
>>>>>>>> Basically, either run a prolog or call a script inside the
>>>>>>>> submission command file itself to parse the PE_HOSTFILE and
>>>>>>>> create the Hadoop *-site.xml, masters and slaves files at run
>>>>>>>> time. This methodology is suitable for any scheduler as it is
>>>>>>>> not dependent on any one of them. If there is interest, I can
>>>>>>>> post the prolog script. Thanks.
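[A minimal sketch of the prolog approach described above - not the actual UCLA script; the hostnames, port, and config paths are illustrative, and only core-site.xml is generated here for brevity. In a real prolog, PE_HOSTFILE is set by Grid Engine; the sketch fakes one so it can run standalone:]

```shell
#!/bin/sh
# Hypothetical prolog sketch: build an on-demand Hadoop configuration
# from the Grid Engine PE_HOSTFILE at job start.

# Fake a PE_HOSTFILE (host, slots, queue, processors) if none is set.
PE_HOSTFILE=${PE_HOSTFILE:-/tmp/pe_hostfile.$$}
if [ ! -f "$PE_HOSTFILE" ]; then
    cat > "$PE_HOSTFILE" <<'EOF'
node1 8 all.q@node1 UNDEFINED
node2 8 all.q@node2 UNDEFINED
node3 8 all.q@node3 UNDEFINED
EOF
fi

HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/tmp/hadoop-conf.$$}
mkdir -p "$HADOOP_CONF_DIR"

# First host in PE_HOSTFILE becomes the Hadoop master (NameNode /
# JobTracker); every allocated host becomes a slave (DataNode/TaskTracker).
MASTER=$(head -1 "$PE_HOSTFILE" | awk '{print $1}')
echo "$MASTER" > "$HADOOP_CONF_DIR/masters"
awk '{print $1}' "$PE_HOSTFILE" > "$HADOOP_CONF_DIR/slaves"

# Point core-site.xml at the job's on-demand NameNode.
cat > "$HADOOP_CONF_DIR/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://$MASTER:9000</value>
  </property>
</configuration>
EOF

echo "master: $MASTER"
```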
>>>>>>>
>>>>>>> Please do.
>>>>>>>
>>>>>>
>>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users at gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
>>>
>>


More information about the users mailing list