#Python Polymorphic MapReduce
Polymr is released under the BSD License.
##What the fuck?
In the field of big data analytics and data science, some datasets are huge while others are not so big (from a few MB up to a few GB). From raw data preparation to predictive modelling, dataset size and computing requirements are volatile. Hadoop is an incredible platform for processing large datasets, but overkill for MB-GB datasets.
##What is Polymr?
Polymr provides a MapReduce abstraction over three MapReduce engines:
- In-memory single-process MapReduce engine, for real debugging (breakpoints, stepping through code, ...) and simple map jobs (e.g. counting, mean, median, frequency distribution, ...).
- In-memory multi-processing MapReduce engine (out-of-core model), for mid-sized (multi-GB) datasets with heavier computational tasks, e.g. random forests, NLP, etc.
- Hadoop streaming interface, for large datasets.
You write the job once and run it on any of these engines to reach the best trade-off between performance and cost, as sketched below.
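Switching engines is just a keyword argument on the input provider's compute() call, exactly as in the full example at the end of this README. A minimal sketch (the job class, file path and token-count logic below are placeholders):

```python
from polymr.mapreduce import MapReduce, FileInputReader, MULTI_CORE, HADOOP

class TokenCount(MapReduce):
    # Placeholder job: count comma-separated tokens per line.
    def map(self, text):
        return [(word, 1) for word in text.split(',')]
    def reduce(self, key, values):
        return (key, sum(values))

data = FileInputReader("/var/tmp/mydata.csv")  # placeholder path

# The job definition never changes; only the engine argument does.
print data.compute(TokenCount())                     # in-memory, single process
print data.compute(TokenCount(), engine=MULTI_CORE)  # in-memory, multi-process
print data.compute(TokenCount(), engine=HADOOP)      # Hadoop streaming
```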
Polymr has the following input/output providers:
- FileInputReader, FileOutputWriter: text file reader and writer - the job reads data from, or writes data to, a file.
- MemInputReader, MemOutputWriter: in-memory array reader and writer - the job reads data from, or writes data to, memory, for aggregates or small data pipelines.
- HdfsInputReader, HdfsOutputWriter: HDFS file reader and writer - the job (Hadoop engine only) reads data from, or writes data to, HDFS.

You may mix providers within a single MapReduce job, for example as sketched below.
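How an output writer is attached to a job is not shown in the example at the end of this README, so the snippet below is only a sketch of mixing providers: the MemOutputWriter usage and the output= keyword on compute() are assumptions made for illustration, not documented API.

```python
from polymr.mapreduce import MapReduce, FileInputReader, MemOutputWriter

class TokenCount(MapReduce):
    # Same placeholder job as in the previous sketch.
    def map(self, text):
        return [(word, 1) for word in text.split(',')]
    def reduce(self, key, values):
        return (key, sum(values))

reader = FileInputReader("/var/tmp/mydata.csv")   # placeholder path
writer = MemOutputWriter()                        # collect results in memory

# Assumed keyword argument: where the reduced pairs are written.
reader.compute(TokenCount(), output=writer)
```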
Polymr also provides a profiler that identifies the most suitable MapReduce engine (in time and cost), based on the local DRAM limit and the mean duration of the map function, measured on a sample of the dataset (1000 lines).
##How is Polymr helpful in data science?
Polymr provides MapReduce classes for:
- Data profiling: similar to R's summary(), frequency analysis, automatic data typing, ...
- Feature extraction: typing, factors, datetime, text, images, ...
- Feature selection: random tree feature selection, CHAID, ...
- Model learning: random forest, SVM, ...
Polymr uses scikit-learn with a Divide & Recombine pattern (a.k.a. weak models) to perform machine learning with a high level of cross-validation on large datasets.
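To make the Divide & Recombine idea concrete, here is a minimal, engine-agnostic sketch in plain scikit-learn (it does not use Polymr's built-in learning classes, which are not shown in this README). The "divide" step fits one deliberately weak model per data chunk, which in Polymr naturally becomes the map phase; the "recombine" step merges the weak models' votes, which corresponds to the reduce phase. Function names and the chunk count are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def divide(X, y, n_chunks=10):
    """'Divide' step: fit one weak model per chunk of the data."""
    models = []
    for X_part, y_part in zip(np.array_split(X, n_chunks),
                              np.array_split(y, n_chunks)):
        weak = DecisionTreeClassifier(max_depth=5)   # deliberately weak learner
        weak.fit(X_part, y_part)
        models.append(weak)
    return models

def recombine(models, X):
    """'Recombine' step: majority vote of the per-chunk models (binary 0/1 labels)."""
    votes = np.vstack([m.predict(X) for m in models])
    return (votes.mean(axis=0) > 0.5).astype(int)
```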
#Installation
- Download the source from GitHub.
- Install Python 2.6 or above.
- Install scikit-learn.
- Optional dependency: cjson (pip install cjson); if it is not installed, the standard json module is used instead.
- For Hadoop: install Hadoop 1.2 as an edge node, set the HADOOP_HOME and POLYMR_HOME environment variables, and install scikit-learn on each TaskTracker node.

Polymr runs on Linux only (it relies on environment variables) and requires Python 2.6 or above.
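#Example
The following example defines a word-count style job, generates a sample CSV file, counts its lines with each engine, profiles the job, and finally runs it on each of the three engines.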
```python
from polymr.mapreduce import MapReduce, FileInputReader, MemOutputWriter, HdfsInputReader, \
    MULTI_CORE, SPARK, HADOOP
from uuid import uuid1


class MyMapRed(MapReduce):
    """Word-count style job: counts occurrences of each comma-separated token."""

    def map(self, text):
        # Emit (token, 1) for every comma-separated token of the input line.
        words = text.split(',')
        return [(str(word), 1) for word in words]

    def combine(self, key, values):
        # Local pre-aggregation before the shuffle.
        return (key, sum(values))

    def reduce(self, key, values):
        # Final aggregation of the counts for one token.
        return (key, sum(values))


def gen_input_file_sample(path, nloc=1000000):
    """Write a sample CSV file of nloc lines, each line being '0,1,...,99'."""
    print "Generating a sample file of %d lines (~%d Mbytes)" % (nloc, nloc * 29 / 100000)
    f = open(path, mode='w')
    for _ in range(nloc):
        # The trailing newline is required so that each record is one line.
        f.write(','.join(str(i) for i in range(100)) + '\n')
    f.close()


if __name__ == '__main__':
    sample_file_path = "/var/tmp/%s" % str(uuid1())
    nloc = 1000000
    gen_input_file_sample(sample_file_path, nloc)

    sample_input = FileInputReader(sample_file_path)

    # Count the lines with each engine.
    print sample_input.count()
    print sample_input.count(engine=MULTI_CORE)
    print sample_input.count(engine=HADOOP)

    mapred = MyMapRed()
    # Profile the job on a sample to pick the best engine.
    print mapred.profile(sample_input)

    # Run the same job on the three engines.
    print sample_input.compute(mapred)
    print sample_input.compute(mapred, engine=MULTI_CORE)
    print sample_input.compute(mapred, engine=HADOOP)
```