19 December 2014

Learning goals

This section should teach participants:

  1. What Hadoop is
  2. Current R/Hadoop integrations
  3. When to use R with Hadoop (guidelines)
  4. How to use R with Hadoop (lab)

Brief introduction to Hadoop

Hadoop

Key ideas: enables distributed computing; open source; widely used

  • older, more mature "cloud computing" technology


The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage…

Source: Apache Software Foundation - What is Apache Hadoop?


Key features

Two of Hadoop's main features are particularly relevant to this talk:

Feature                        Problem it solves
Distributed storage            How do we easily store and access large datasets?
Distributed/batch computing    How do we quickly run analyses on large datasets?

Distributed vs. parallel computing

Distributed computing is analogous to parallel computing

Parallel - multiple processors on a single computer run the code

Distributed - multiple networked computers run the code
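
A minimal sketch of the parallel case, using base R's parallel package on one machine; a distributed system applies the same divide-the-work idea, except the workers are separate computers rather than local processes.

    # Toy parallel example: two local worker processes split up the lapply work
    library(parallel)

    cl <- makeCluster(2)                        # two workers on this computer
    res <- parLapply(cl, 1:4, function(i) i^2)  # each worker squares part of the input
    stopCluster(cl)
    unlist(res)                                 # 1 4 9 16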

Overview

1.  Hadoop links together servers to form a (storage + computing) cluster.

Overview

2.  Creates a distributed file system called HDFS, which splits large data files into smaller chunks that are stored on servers across the cluster.

Overview

3.  Uses the MapReduce programming model to implement
     distributed (i.e., parallel) computing.

Important interlude!

Size matters

  • Parallelization only happens when the data is spread across multiple servers
  • Files are typically split into 64 MB or 128 MB chunks
  • Small files fit within a single chunk, so they won't parallelize (see the sketch below)
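
Back-of-the-envelope arithmetic behind the last point: Hadoop runs roughly one Map task per chunk, so a file smaller than a single chunk gets one task and no parallelism. The file sizes below are made up for illustration.

    # How many chunks (and hence parallel Map tasks) does a file get?
    block_size <- 128 * 2^20                 # 128 MB chunk size, in bytes
    file_sizes <- c(5 * 2^20, 10 * 2^30)     # a 5 MB file and a 10 GB file
    ceiling(file_sizes / block_size)         # 1 chunk vs. 80 chunks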


MapReduce programs have 3 main stages (sketched in R after the list)

  1. Map: Apply a function to extract and group data
  2. Shuffle/sort: Sort the function's outputs
  3. Reduce: Compute summaries of the grouped outputs
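
The three stages can be imitated in plain R on a single machine. The toy word count below only illustrates the programming model; in a real Hadoop job each stage runs across many servers.

    # Toy word count, entirely in memory, mirroring the three MapReduce stages
    lines <- c("the cat sat", "the cat ran", "a dog ran")

    # 1. Map: emit a (word, 1) pair for every word in every line
    pairs <- unlist(lapply(strsplit(lines, " "), function(words)
      setNames(rep(1, length(words)), words)))

    # 2. Shuffle/sort: group the emitted pairs by their key (the word)
    grouped <- split(unname(pairs), names(pairs))

    # 3. Reduce: summarise each group, here by summing the counts
    sapply(grouped, sum)                     # a=1 cat=2 dog=1 ran=2 sat=1 the=2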

Overview (MapReduce)

3.a.  Users upload Map and Reduce analysis code to Hadoop.

Overview (MapReduce)

3.b.  Hadoop distributes the Map code to the servers with the data. These servers run local analyses that extract and group data.

Overview (MapReduce)

3.c.  Hadoop merges extracted data on one or more separate servers. These servers run the Reduce code that computes grouped data summaries.

Overview (MapReduce)

3.d.  Hadoop stores analytic results in its distributed file system, HDFS, on the server(s) that ran the Reduce code.

Overview (MapReduce)

3.e.  Analysts can retrieve these results for review or follow-on analysis.
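
As a preview of what steps 3.a-3.e can look like from R, here is a minimal sketch assuming the rmr2 package from the RHadoop project is installed; the "local" backend runs the same map and reduce code without a cluster, and exact arguments may differ by rmr2 version and cluster setup.

    library(rmr2)
    rmr.options(backend = "local")            # test locally; "hadoop" submits to a cluster

    # Put some sample group labels into (a local stand-in for) HDFS
    groups <- to.dfs(sample(1:10, 500, replace = TRUE))

    # mapreduce() ships the map and reduce code to where the data lives (3.a-3.d)
    out <- mapreduce(
      input  = groups,
      map    = function(k, v) keyval(v, 1),             # emit (group, 1) pairs
      reduce = function(k, vv) keyval(k, length(vv)))   # count records per group

    # Retrieve the stored results for review or follow-on analysis in R (3.e)
    result <- from.dfs(out)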

Hadoop - Natural strengths

  • Extract data
  • Group data
  • Compute group summaries

Hadoop - Natural "weaknesses"

"Everything else"

  • Iterative algorithms (each pass becomes a separate MapReduce job)
  • Multi-step workflows (intermediate results must be written to and read back from HDFS)