Iterative Reduce Programming Guide

jpatanooga edited this page Nov 1, 2012 · 10 revisions

Overview

  • Designed specifically for parallel iterative algorithms on Hadoop
  • Implemented directly on top of YARN
  • Intrinsic Parallelism
  • Lets the programmer focus on the problem rather than on distributed programming

Lifecycle of an IterativeReduce Application

Client
    Launches the YARN ApplicationMaster
Master
    Computes required resources
    Obtains resources from YARN
    Launches Workers
    Base class: ComputableMaster
Workers
    Computation on partial data (input split)
    Synchronizes with Master
    Base class: ComputableWorker

ComputableMaster Component

Overview
    We implement this interface (described in Appendix A) for our custom master process
    It is used in a way that is conceptually similar to the AllReduce primitive
    It always runs on the YARN ApplicationMaster
    At each superstep, the master receives a list of messages from the workers based on the work they did in the last iteration
Methods
    setup( Configuration c )
        Similar to “setup” in MapReduce; runs before any execution
        Used to get command-line configuration keys into the local process
        Gives the programmer a chance to set up any long-lived local data structures that need to be initialized after the config parameters are loaded
    complete( DataOutputStream out )
        Called after all workers are complete
        The system passes in an output stream for writing any work, models, etc. to HDFS
        Also gives the programmer a chance to clean up any held resources
    T compute( Collection<T> workerUpdates, Collection<T> masterUpdates )
        This is the method where the superstep action occurs
        It sees both the current superstep’s worker updates and the master updates generated by all previous supersteps
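
As a rough illustration of the three methods above, here is a minimal sketch of a master that averages the workers' values at each superstep. It does not use the real interface: MasterSketch is a hypothetical stand-in that swaps Hadoop's Configuration for a plain Map and drops the Updateable bound so that the example runs without Hadoop on the classpath. DoubleUpdate, AveragingMaster, and the "avg.initial" key are likewise invented for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for ComputableMaster: a Map replaces Hadoop's
// Configuration and the Updateable bound is dropped.
interface MasterSketch<T> {
    void setup(Map<String, String> conf);
    void complete(DataOutputStream out) throws IOException;
    T compute(Collection<T> workerUpdates, Collection<T> masterUpdates);
    T getResults();
}

// Hypothetical update type carrying a single double.
final class DoubleUpdate {
    final double value;
    DoubleUpdate(double value) { this.value = value; }
}

// Averages the workers' values at every superstep.
class AveragingMaster implements MasterSketch<DoubleUpdate> {
    private DoubleUpdate latest = new DoubleUpdate(0.0);

    @Override
    public void setup(Map<String, String> conf) {
        // "avg.initial" is an invented key, shown only to illustrate config access.
        latest = new DoubleUpdate(Double.parseDouble(conf.getOrDefault("avg.initial", "0.0")));
    }

    @Override
    public DoubleUpdate compute(Collection<DoubleUpdate> workerUpdates,
                                Collection<DoubleUpdate> masterUpdates) {
        // masterUpdates (results of earlier supersteps) is not consulted in this sketch.
        if (workerUpdates.isEmpty()) {
            return latest;
        }
        double sum = 0.0;
        for (DoubleUpdate u : workerUpdates) {
            sum += u.value;
        }
        latest = new DoubleUpdate(sum / workerUpdates.size());
        return latest;
    }

    @Override
    public void complete(DataOutputStream out) throws IOException {
        // Persist the final result to whatever stream the caller hands in.
        out.writeDouble(latest.value);
    }

    @Override
    public DoubleUpdate getResults() { return latest; }
}

class AveragingMasterDemo {
    public static void main(String[] args) throws IOException {
        AveragingMaster master = new AveragingMaster();
        master.setup(Map.of());
        List<DoubleUpdate> history = new ArrayList<>();
        // Two supersteps, each with updates from two simulated workers.
        history.add(master.compute(List.of(new DoubleUpdate(1.0), new DoubleUpdate(3.0)), history));
        history.add(master.compute(List.of(new DoubleUpdate(4.0), new DoubleUpdate(6.0)), history));
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        master.complete(new DataOutputStream(bytes));
        System.out.println(master.getResults().value + " written in " + bytes.size() + " bytes");
    }
}
```

In a real application the framework, not a main method, would call setup, compute, and complete, and the output stream would point at HDFS.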

ComputableWorker Component

Overview
    We implement this interface (described in Appendix B) for our custom worker process
    It runs in a YARN container until the end of execution
    It receives a record reader reference to process records from
    Periodically, it stops computing so that the master can combine results from all workers
Methods
    setup( Configuration c )
        Similar to “setup” in MapReduce; runs before any execution
        Used to get command-line configuration keys into the local process
        Gives the programmer a chance to set up any long-lived local data structures that need to be initialized after the config parameters are loaded
    T compute( )
        This is the method where the worker’s superstep action occurs
        It processes the next batch of records from the record reader and returns the resulting update, which is sent to the master
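
The batch-then-synchronize pattern described above could look roughly like the following sketch. It is not the real interface: WorkerSketch is a hypothetical, trimmed-down stand-in (a Map replaces Hadoop's Configuration, and an Iterator replaces the record reader), and CountUpdate, RecordCountWorker, and the "batch.size" key are invented for illustration. The worker simply counts the records in each batch.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Map;

// Hypothetical stand-in for ComputableWorker, reduced to the calls used here.
interface WorkerSketch<T> {
    void setup(Map<String, String> conf);
    T compute();       // process the next batch and return this worker's update
    void update(T t);  // receive the master's combined result
    T getResults();
}

// Hypothetical update type carrying a record count.
final class CountUpdate {
    final long count;
    CountUpdate(long count) { this.count = count; }
}

// Counts the records in each batch of its split.
class RecordCountWorker implements WorkerSketch<CountUpdate> {
    private int batchSize = 2;
    private Iterator<String> records;  // stands in for the record reader
    private CountUpdate lastLocal = new CountUpdate(0);
    private CountUpdate lastGlobal = new CountUpdate(0);

    void setRecords(Iterator<String> records) { this.records = records; }

    @Override
    public void setup(Map<String, String> conf) {
        // "batch.size" is an invented key, shown only to illustrate config access.
        batchSize = Integer.parseInt(conf.getOrDefault("batch.size", "2"));
    }

    @Override
    public CountUpdate compute() {
        // Consume at most one batch, then stop so the master can combine results.
        long seen = 0;
        while (seen < batchSize && records.hasNext()) {
            records.next();
            seen++;
        }
        lastLocal = new CountUpdate(seen);
        return lastLocal;
    }

    @Override
    public void update(CountUpdate t) { lastGlobal = t; }

    @Override
    public CountUpdate getResults() { return lastLocal; }

    CountUpdate lastGlobal() { return lastGlobal; }
}

class RecordCountWorkerDemo {
    public static void main(String[] args) {
        RecordCountWorker worker = new RecordCountWorker();
        worker.setup(Map.of("batch.size", "2"));
        worker.setRecords(Arrays.asList("a", "b", "c").iterator());
        long total = 0;
        for (int superstep = 0; superstep < 2; superstep++) {
            total += worker.compute().count;       // local work on one batch
            worker.update(new CountUpdate(total)); // pretend the master replied
        }
        System.out.println("records seen so far: " + worker.lastGlobal().count);
    }
}
```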

Framework Demo: Parallel Summation

We’ll walk through a simple application which we used as a Computable unit test during the development of IterativeReduce
    It is meant to run some data through IterativeReduce for testing purposes
This application sums a large set of numbers in a distributed fashion
    Each worker has a split of the total dataset
    After each batch within the split, the batch partial sum is sent to the master
    The master adds up all previous global sums and all of the worker sums to get the current global sum
CompoundAdditionMaster
    collects partial sums from workers, computes new global sum
CompoundAdditionWorker
    reads data from HDFS
    computes partial sum for batch
UpdateableInt
    Inherits from Updateable; this is the class that we use to send data back and forth between worker and master
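
The flow of the demo can be simulated in a single process, as in the sketch below. The classes borrow the names used above but are simplified re-implementations written for this page, not the project's code: there is no YARN, no HDFS, and no Updateable base type, and ParallelSummationDemo is a hypothetical driver. The sketch also assumes that "previous global sums" means the running total carried over from the last superstep; the project's master may combine masterUpdates differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in for the update type (the Updateable parent is omitted).
final class UpdateableInt {
    private final int value;
    UpdateableInt(int value) { this.value = value; }
    int get() { return value; }
}

// Worker side: sums one batch of its split per superstep.
class CompoundAdditionWorker {
    private final Iterator<Integer> split;
    private final int batchSize;

    CompoundAdditionWorker(List<Integer> split, int batchSize) {
        this.split = split.iterator();
        this.batchSize = batchSize;
    }

    boolean hasMoreRecords() { return split.hasNext(); }

    UpdateableInt compute() {
        int partial = 0;
        for (int i = 0; i < batchSize && split.hasNext(); i++) {
            partial += split.next();
        }
        return new UpdateableInt(partial);
    }
}

// Master side: running total plus this superstep's partial sums.
class CompoundAdditionMaster {
    private UpdateableInt global = new UpdateableInt(0);

    UpdateableInt compute(Collection<UpdateableInt> workerUpdates,
                          Collection<UpdateableInt> masterUpdates) {
        // Assumption: the running total lives in a field; masterUpdates is not consulted.
        int total = global.get();
        for (UpdateableInt u : workerUpdates) {
            total += u.get();
        }
        global = new UpdateableInt(total);
        return global;
    }

    UpdateableInt getResults() { return global; }
}

// Hypothetical single-process driver standing in for the framework.
class ParallelSummationDemo {
    static int run(List<List<Integer>> splits, int batchSize) {
        List<CompoundAdditionWorker> workers = new ArrayList<>();
        for (List<Integer> s : splits) {
            workers.add(new CompoundAdditionWorker(s, batchSize));
        }
        CompoundAdditionMaster master = new CompoundAdditionMaster();
        List<UpdateableInt> masterUpdates = new ArrayList<>();
        boolean more = true;
        while (more) {
            // One superstep: every worker handles one batch, then the master combines.
            List<UpdateableInt> workerUpdates = new ArrayList<>();
            more = false;
            for (CompoundAdditionWorker w : workers) {
                workerUpdates.add(w.compute());
                more |= w.hasMoreRecords();
            }
            masterUpdates.add(master.compute(workerUpdates, masterUpdates));
        }
        return master.getResults().get();
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = Arrays.asList(
                Arrays.asList(1, 2, 3), Arrays.asList(4, 5), Arrays.asList(6));
        System.out.println(run(splits, 2)); // prints 21
    }
}
```

With a batch size of 2, the three splits above finish in two supersteps: the first yields partial sums 3, 9, and 6 (global sum 18), and the second adds the remaining 3.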

Appendix A: Computable Master Interface

public interface ComputableMaster<T extends Updateable> {
 void setup(Configuration c);
 void complete(DataOutputStream out) throws IOException;
 T compute(Collection<T> workerUpdates, Collection<T> masterUpdates);
 T getResults();
}

Appendix B: Computable Worker Interface

public interface ComputableWorker<T extends Updateable> {
 void setup(Configuration c);
 T compute(List<T> records);
 T compute();
 void setRecordParser(RecordParser r);
 T getResults();
 void update(T t);
}

References and Links

Original Hadoop World 2012 Presentation
    http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html
Knitting Boar on GitHub
    https://github.com/jpatanooga/KnittingBoar