Iterative Reduce Programming Guide
jpatanooga edited this page Nov 1, 2012 · 10 revisions
- Designed specifically for parallel iterative algorithms on Hadoop
- Implemented directly on top of YARN
- Intrinsic Parallelism
- Lets the programmer focus on the problem rather than on distributed programming
- Client
- Launches the YARN ApplicationMaster
- Master
- Computes required resources
- Obtains resources from YARN
- Launches Workers
- Base class: ComputableMaster
- Workers
- Computation on partial data (input split)
- Synchronizes with Master
- Base class: ComputableWorker
- Overview
- We inherit from the ComputableMaster interface (described in Appendix A) for our custom master process
- Conceptually, it is used much like the AllReduce primitive
- It always runs on the YARN ApplicationMaster
- At each superstep, the master receives a list of messages from the workers describing the work they did in the last iteration
- Methods
-
setup( Configuration c )
- Similar to “setup” in MapReduce; runs before any execution.
- Used to load command-line configuration keys into the local process
- Gives the programmer a chance to set up any long-lived local data structures that must be initialized after the configuration parameters are loaded.
-
complete( DataOutputStream out )
- Called after all workers have completed
- The system passes in an output stream for writing any work, models, etc. to HDFS
- Also gives the programmer a chance to clean up any held resources
-
T compute( Collection<T> workerUpdates, Collection<T> masterUpdates )
- This is the method where the superstep action occurs
- Both the current superstep’s worker updates and the master updates generated by all previous supersteps are visible here
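The master methods above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the IterativeReduce implementation: `java.util.Properties` stands in for Hadoop's `Configuration` so the sketch runs without Hadoop on the classpath, a `ByteArrayOutputStream` stands in for the HDFS stream, plain `Integer` values stand in for an `Updateable`, and the names `MasterSketch`, `Master`, `SumMaster`, and the key `sketch.supersteps` are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Properties;

public class MasterSketch {

    // Simplified stand-in for ComputableMaster; Properties replaces Hadoop's Configuration.
    interface Master<T> {
        void setup(Properties c);
        T compute(Collection<T> workerUpdates, Collection<T> masterUpdates);
        void complete(DataOutputStream out) throws IOException;
    }

    static final class SumMaster implements Master<Integer> {
        private List<Integer> trace;   // long-lived structure, sized from configuration
        private int latest;            // most recent global result

        @Override
        public void setup(Properties c) {
            // "sketch.supersteps" is a made-up key used only to size the trace.
            int expected = Integer.parseInt(c.getProperty("sketch.supersteps", "8"));
            trace = new ArrayList<>(expected);
        }

        @Override
        public Integer compute(Collection<Integer> workerUpdates, Collection<Integer> masterUpdates) {
            int total = 0;
            for (int m : masterUpdates) total += m;   // results of earlier supersteps
            for (int w : workerUpdates) total += w;   // this superstep's worker messages
            latest = total;
            trace.add(total);
            return total;
        }

        @Override
        public void complete(DataOutputStream out) throws IOException {
            out.writeInt(latest);   // a real master would be handed a stream to HDFS
        }
    }

    // Drives two supersteps in-process and returns the bytes written by complete().
    static byte[] run() {
        SumMaster master = new SumMaster();
        master.setup(new Properties());
        List<Integer> masterUpdates = new ArrayList<>();
        masterUpdates.add(master.compute(Arrays.asList(3, 4), masterUpdates));   // 3 + 4 = 7
        masterUpdates.add(master.compute(Arrays.asList(1, 2), masterUpdates));   // 7 + 1 + 2 = 10
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            master.complete(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(ByteBuffer.wrap(run()).getInt());
    }
}
```

The driver in `run()` plays the role of the framework: it calls `setup` once, `compute` once per superstep with the accumulated master updates, and `complete` once at the end.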
- Overview
- We inherit from the ComputableWorker interface (described in Appendix B) for our custom worker process
- It runs in a YARN container until the end of execution
- Receives a record reader reference to process records from
- Periodically, it stops computing so the master node can combine results from all workers
- Methods
-
setup( Configuration c )
- Similar to “setup” in MapReduce; runs before any execution.
- Used to load command-line configuration keys into the local process
- Gives the programmer a chance to set up any long-lived local data structures that must be initialized after the configuration parameters are loaded.
-
T compute( )
- This is the method where the superstep action occurs
- Both the current superstep’s worker updates and the master updates generated by all previous supersteps are visible here
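The worker side can be sketched the same way. This is a hedged illustration rather than the library's code: an `Iterator<Integer>` stands in for the record reader, `java.util.Properties` stands in for Hadoop's `Configuration`, and `WorkerSketch`, `Worker`, `BatchSumWorker`, and the key `sketch.batch.size` are hypothetical names. The `update` method mirrors the one listed in the ComputableWorker interface in Appendix B; how the framework schedules it is assumed here.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Properties;

public class WorkerSketch {

    // Simplified stand-in for ComputableWorker; Properties replaces Hadoop's Configuration.
    interface Worker<T> {
        void setup(Properties c);
        T compute();
        void update(T masterResult);
    }

    static final class BatchSumWorker implements Worker<Integer> {
        private final Iterator<Integer> records;   // stands in for the record reader
        private int batchSize;
        private int lastGlobal;                    // latest result received from the master

        BatchSumWorker(Iterator<Integer> records) {
            this.records = records;
        }

        @Override
        public void setup(Properties c) {
            // "sketch.batch.size" is a made-up configuration key.
            batchSize = Integer.parseInt(c.getProperty("sketch.batch.size", "2"));
        }

        @Override
        public Integer compute() {
            // Consume at most one batch, then return so the master can combine results.
            int partial = 0;
            for (int i = 0; i < batchSize && records.hasNext(); i++) {
                partial += records.next();
            }
            return partial;
        }

        @Override
        public void update(Integer masterResult) {
            lastGlobal = masterResult;
        }

        int lastGlobal() {
            return lastGlobal;
        }
    }

    // Runs one worker over a five-record split and returns its three partial sums.
    static int[] run() {
        BatchSumWorker worker = new BatchSumWorker(Arrays.asList(1, 2, 3, 4, 5).iterator());
        worker.setup(new Properties());
        int first = worker.compute();    // 1 + 2
        worker.update(first);            // pretend the master echoed the partial sum back
        int second = worker.compute();   // 3 + 4
        int third = worker.compute();    // 5 (final, short batch)
        return new int[] { first, second, third };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run()));
    }
}
```

Each call to `compute()` covers one batch of the split, which corresponds to the point where the worker pauses and hands a partial result to the master.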
- We’ll walk through a simple application that we used as a Computable unit test during the development of IterativeReduce
- It is meant to run some data through IterativeReduce for testing purposes
- This application sums a large set of numbers in a distributed fashion
- Each worker has a split of the total dataset
- After each batch within the split, the batch partial sum is sent to the master
- The master adds up all previous global sums and then all of the worker sums to get the current global sum
- CompoundAdditionMaster
- collects partial sums from workers, computes new global sum
- CompoundAdditionWorker
- reads data from HDFS
- computes partial sum for batch
- UpdateableInt
- Inherits from Updateable; this is the class we use to send data back and forth between worker and master
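The arithmetic of this example can be traced with a small in-process simulation. The sketch below is an assumption-laden illustration, not the actual CompoundAdditionMaster or CompoundAdditionWorker code: plain `int` values replace UpdateableInt, in-memory arrays replace HDFS splits, a loop replaces YARN scheduling, and the class name `CompoundAdditionSketch`, the batch size, and the sample data are all made up. It follows the bullets literally: at each superstep the master adds every previous global sum to the workers' partial sums for the current batch.

```java
import java.util.ArrayList;
import java.util.List;

public class CompoundAdditionSketch {

    // Worker side: partial sum of one batch, elements [from, from + batchSize) of a split.
    static int batchSum(int[] split, int from, int batchSize) {
        int sum = 0;
        int end = Math.min(from + batchSize, split.length);
        for (int i = from; i < end; i++) {
            sum += split[i];
        }
        return sum;
    }

    // Returns the global sum the master produces after each superstep.
    static List<Integer> run(int[][] splits, int batchSize) {
        int longest = 0;
        for (int[] split : splits) {
            longest = Math.max(longest, split.length);
        }
        List<Integer> globalSums = new ArrayList<>();
        for (int from = 0; from < longest; from += batchSize) {
            int current = 0;
            for (int previous : globalSums) {
                current += previous;                         // all previous global sums
            }
            for (int[] split : splits) {
                current += batchSum(split, from, batchSize); // this superstep's worker sums
            }
            globalSums.add(current);
        }
        return globalSums;
    }

    public static void main(String[] args) {
        int[][] splits = { { 1, 2, 3, 4 }, { 5, 6, 7, 8 } };   // one split per worker
        System.out.println(run(splits, 2));
    }
}
```

With the two sample splits and a batch size of 2, the first superstep yields 3 + 11 = 14 and the second yields 14 + 7 + 15 = 36, so the result compounds rather than equaling the plain total of the inputs.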
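// Appendix A: interface implemented by the custom master process
public interface ComputableMaster<T extends Updateable> {
    // Runs before any execution; loads configuration into the local process.
    void setup(Configuration c);
    // Called after all workers are complete; writes results to the supplied stream.
    void complete(DataOutputStream out) throws IOException;
    // Superstep action: combines worker updates with earlier master updates.
    T compute(Collection<T> workerUpdates, Collection<T> masterUpdates);
    T getResults();
}

// Appendix B: interface implemented by the custom worker process
public interface ComputableWorker<T extends Updateable> {
    // Runs before any execution; loads configuration into the local process.
    void setup(Configuration c);
    T compute(List<T> records);
    // Superstep action on the worker's input split.
    T compute();
    void setRecordParser(RecordParser r);
    T getResults();
    void update(T t);
}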
- Original Hadoop World 2012 Presentation
- Knitting Boar on Github