Iterative Reduce Programming Guide
jpatanooga edited this page Nov 1, 2012 · 10 revisions
- Designed specifically for parallel iterative algorithms on Hadoop
- Implemented directly on top of YARN
- Intrinsic Parallelism
- Lets you focus on the problem rather than on distributed programming
- Client
  - Launches the YARN ApplicationMaster
- Master
  - Computes required resources
  - Obtains resources from YARN
  - Launches Workers
  - Interface: ComputableMaster
- Workers
  - Compute on partial data (an input split)
  - Synchronize with the Master
  - Interface: ComputableWorker
Overview
We implement the ComputableMaster interface (described in Appendix A) for our custom master process.
Conceptually, it is used much like the AllReduce primitive.
It always runs in the YARN ApplicationMaster.
At each superstep, the master receives a list of messages from the workers, based on the work they did in the last iteration.
Methods
setup( Configuration c )
- Similar to “setup” in MapReduce; runs before any execution.
- Used to get command-line configuration keys into the local process.
- Gives the programmer a chance to set up any long-lived local data structures that must be initialized after the config parameters are loaded.
complete( DataOutputStream out )
- Called after all workers are complete.
- The system passes in an output stream for writing any work, models, etc. to HDFS.
- Also gives the programmer a chance to clean up any held resources.
T compute( Collection<T> workerUpdates, Collection<T> masterUpdates )
- This is the method where the superstep action occurs.
- It sees both the current superstep’s worker updates and the master updates generated by all previous supersteps.
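The master methods above can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's implementation: `SummingMasterSketch` is a hypothetical class, plain `Integer` values stand in for the `Updateable` type, `setup` is omitted so that the example runs without Hadoop on the classpath, and a `ByteArrayOutputStream` stands in for the HDFS output stream that the framework would pass to `complete`. Only the shape of `compute` and `complete` follows Appendix A.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical master; Integer stands in for the Updateable message type.
final class SummingMasterSketch {
    private int lastGlobal;

    // One superstep: combine the master updates from all previous supersteps
    // with the worker updates from the current superstep.
    int compute(Collection<Integer> workerUpdates, Collection<Integer> masterUpdates) {
        int total = 0;
        for (int m : masterUpdates) total += m;
        for (int w : workerUpdates) total += w;
        lastGlobal = total;
        return total;
    }

    // Called once after all workers finish; writes the final value to the stream.
    void complete(DataOutputStream out) throws IOException {
        out.writeInt(lastGlobal);
    }

    // Drives two supersteps by hand; in a real job the framework does this.
    static int demo() {
        SummingMasterSketch master = new SummingMasterSketch();
        List<Integer> history = new ArrayList<>();
        history.add(master.compute(Arrays.asList(3, 7), history)); // 0 + 3 + 7 = 10
        history.add(master.compute(Arrays.asList(1, 2), history)); // 10 + 1 + 2 = 13
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            master.complete(out);
            out.flush();
            // Read the value back to show what complete() wrote.
            return new DataInputStream(new ByteArrayInputStream(sink.toByteArray())).readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `SummingMasterSketch.demo()` returns 13, the value written by `complete` after the second superstep.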
Overview
We implement the ComputableWorker interface (described in Appendix B) for our custom worker process.
It runs in a YARN container until the end of execution.
It receives a record reader reference from which to process records.
Every so often it stops computing so that the master node can combine results from all workers.
Methods
setup( Configuration c )
- Similar to “setup” in MapReduce; runs before any execution.
- Used to get command-line configuration keys into the local process.
- Gives the programmer a chance to set up any long-lived local data structures that must be initialized after the config parameters are loaded.
T compute( )
- This is the method where the worker’s share of the superstep occurs.
- It processes the next batch of records from its input split and returns a partial result, which is sent to the master.
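A worker's batch-at-a-time behaviour can be sketched like this. It is an illustration only: `BatchSumWorkerSketch`, its constructor, and the `batchSize` parameter are hypothetical, an in-memory `Iterator` stands in for the record reader, and `setup` is left out so the example runs without Hadoop. Each call to `compute()` plays the role of one superstep's local work.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical worker; an Iterator stands in for the record reader.
final class BatchSumWorkerSketch {
    private final Iterator<Integer> records;
    private final int batchSize;

    BatchSumWorkerSketch(List<Integer> split, int batchSize) {
        this.records = split.iterator();
        this.batchSize = batchSize;
    }

    boolean hasMoreRecords() {
        return records.hasNext();
    }

    // One superstep of local work: sum the next batch of records and
    // return the partial result that would be sent to the master.
    int compute() {
        int partial = 0;
        for (int i = 0; i < batchSize && records.hasNext(); i++) {
            partial += records.next();
        }
        return partial;
    }

    // Runs a five-record split through batches of two records.
    static int[] demo() {
        BatchSumWorkerSketch worker =
                new BatchSumWorkerSketch(Arrays.asList(1, 2, 3, 4, 5), 2);
        int[] partials = new int[3];
        for (int i = 0; worker.hasMoreRecords(); i++) {
            partials[i] = worker.compute();
        }
        return partials; // {3, 7, 5}
    }
}
```

Between calls to `compute()` the real worker would pause while the master combines the partial results from all workers.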
We’ll walk through a simple application that we used as a Computable unit test during the development of IterativeReduce.
It is meant to run some data through IterativeReduce for testing purposes.
This application sums a large set of numbers in a distributed fashion.
Each worker has a split of the total dataset.
After each batch within the split, the batch’s partial sum is sent to the master.
The master adds up all previous global sums and then all the worker sums to get the current global sum.
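The walk-through above can be simulated in a single process. This sketch follows the text literally and is not taken from the project source: `CompoundAdditionSketch` and its methods are hypothetical names, in-memory lists stand in for HDFS splits, and the workers advance in lockstep, one batch per superstep. It assumes that the master sees every earlier global sum at each superstep.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-process simulation of the summing example.
final class CompoundAdditionSketch {

    // Worker side: sum one batch of a split, starting at index `from`.
    static int batchSum(List<Integer> split, int from, int batchSize) {
        int sum = 0;
        int end = Math.min(from + batchSize, split.size());
        for (int i = from; i < end; i++) {
            sum += split.get(i);
        }
        return sum;
    }

    // Master side: all previous global sums plus the current worker sums.
    static int masterStep(List<Integer> previousGlobals, List<Integer> workerSums) {
        int total = 0;
        for (int g : previousGlobals) total += g;
        for (int w : workerSums) total += w;
        return total;
    }

    // Runs supersteps until every split is exhausted and returns the
    // global sum produced at each superstep.
    static List<Integer> run(List<List<Integer>> splits, int batchSize) {
        int longest = 0;
        for (List<Integer> split : splits) {
            longest = Math.max(longest, split.size());
        }
        List<Integer> globals = new ArrayList<>();
        for (int from = 0; from < longest; from += batchSize) {
            List<Integer> workerSums = new ArrayList<>();
            for (List<Integer> split : splits) {
                workerSums.add(batchSum(split, from, batchSize));
            }
            globals.add(masterStep(globals, workerSums));
        }
        return globals;
    }
}
```

With splits `[1, 2, 3, 4]` and `[10, 20, 30, 40]` and a batch size of 2, `run` returns `[33, 110]`. Because this sketch re-adds every earlier global sum, runs with more than two supersteps grow faster than the plain total of the inputs; passing only the latest global sum to `masterStep` would give a plain running total instead.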
- CompoundAdditionMaster
  - Collects partial sums from workers and computes the new global sum
- CompoundAdditionWorker
  - Reads data from HDFS
  - Computes the partial sum for each batch
- UpdateableInt
  - Inherits from Updateable; this is the class we use to send data back and forth between worker and master
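To illustrate what a message class such as UpdateableInt has to do, here is a minimal sketch of an integer payload that can be turned into bytes and back. The `Updateable` interface itself is not shown in this guide, so the class name `IntMessageSketch` and the method names `get`, `set`, `toBytes`, and `fromBytes` are assumptions made for illustration, not the library's API.

```java
import java.nio.ByteBuffer;

// Hypothetical message type; method names are assumptions, not the real API.
final class IntMessageSketch {
    private int value;

    int get() {
        return value;
    }

    void set(int v) {
        value = v;
    }

    // Serialize the value into a 4-byte buffer that is ready to be read.
    ByteBuffer toBytes() {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES);
        buf.putInt(value);
        buf.flip();
        return buf;
    }

    // Restore the value from a buffer produced by toBytes().
    void fromBytes(ByteBuffer buf) {
        value = buf.getInt();
    }

    // Simulates sending a value from worker to master and reading it back.
    static int roundTrip(int v) {
        IntMessageSketch sent = new IntMessageSketch();
        sent.set(v);
        IntMessageSketch received = new IntMessageSketch();
        received.fromBytes(sent.toBytes());
        return received.get();
    }
}
```

A round trip such as `IntMessageSketch.roundTrip(42)` gives back the same value, which is the property a worker-to-master message needs.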
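Appendix A: ComputableMaster
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.Collection;
    import org.apache.hadoop.conf.Configuration;

    public interface ComputableMaster<T extends Updateable> {
        void setup(Configuration c);                             // runs before any execution
        void complete(DataOutputStream out) throws IOException;  // called after all workers are complete
        T compute(Collection<T> workerUpdates, Collection<T> masterUpdates);  // one superstep
        T getResults();
    }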
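Appendix B: ComputableWorker
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;

    // Updateable and RecordParser are IterativeReduce types.
    public interface ComputableWorker<T extends Updateable> {
        void setup(Configuration c);   // runs before any execution
        T compute(List<T> records);
        T compute();
        void setRecordParser(RecordParser r);
        T getResults();
        void update(T t);
    }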
- Original Hadoop World 2012 Presentation
- Knitting Boar on GitHub