Madoop: Michigan Hadoop

Michigan Hadoop (madoop) is a lightweight MapReduce framework for education. Madoop implements the Hadoop Streaming interface. Madoop is implemented in Python and runs on a single machine.

For an in-depth explanation of how to write MapReduce programs in Python for Hadoop Streaming, see our Hadoop Streaming tutorial.

Quick start

Install Madoop.

$ pip install madoop

Create an example MapReduce program with input files.

$ madoop --example
$ tree example
example
├── input
│   ├── input01.txt
│   └── input02.txt
├── map.py
└── reduce.py

Run the example word count MapReduce program.

$ madoop \
  -input example/input \
  -output example/output \
  -mapper example/map.py \
  -reducer example/reduce.py

Concatenate and print the output.

$ cat example/output/part-*
Goodbye 1
Bye 1
Hadoop 2
World 2
Hello 2
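
The contents of example/map.py and example/reduce.py are not reproduced here. As a rough sketch of what a Hadoop Streaming style word count can look like in Python, assuming tab-separated key/value lines and reducer input that is already sorted by key, the two stages might be written as the functions below. The names `map_lines` and `reduce_lines` and the sample input are illustrative, not part of Madoop.

```python
"""Word count in the Hadoop Streaming style (illustrative sketch)."""
import itertools


def map_lines(lines):
    """Emit one "word<TAB>1" line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Sum the counts for each word; the input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    # groupby only merges adjacent items, which is why sorting matters.
    for word, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Hypothetical input; the generated example files may differ.
    text = ["Hello World", "Hello Hadoop"]
    for out in reduce_lines(sorted(map_lines(text))):
        print(out)
```

In an actual streaming job each stage would be its own executable script that reads lines from `sys.stdin` and prints its results to stdout.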

Comparison with Apache Hadoop and CLI

Madoop implements a subset of the Hadoop Streaming interface. You can simulate the Hadoop Streaming interface at the command line with cat and sort.

Here's how to run our example MapReduce program on Apache Hadoop.

$ hadoop \
    jar path/to/hadoop-streaming-X.Y.Z.jar \
    -input example/input \
    -output output \
    -mapper example/map.py \
    -reducer example/reduce.py
$ cat output/part-*

Here's how to run our example MapReduce program at the command line using cat and sort.

$ cat example/input/* | example/map.py | sort | example/reduce.py

| Madoop | Hadoop | `cat`/`sort` |
|---|---|---|
| Implement some Hadoop options | All Hadoop options | No Hadoop options |
| Multiple mappers and reducers | Multiple mappers and reducers | One mapper, one reducer |
| Single machine | Many machines | Single machine |
| `jar hadoop-streaming-X.Y.Z.jar` argument ignored | `jar hadoop-streaming-X.Y.Z.jar` argument required | No arguments |
| Lines within a group are sorted | Lines within a group are sorted | Lines within a group are sorted |
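
The difference between "one mapper, one reducer" and "multiple mappers and reducers" comes down to how mapper output is divided among reducers. Hadoop's default partitioner assigns each key to a reducer by hashing the key modulo the number of reducers. The sketch below imitates that idea with `zlib.crc32`. The function name `partition`, the parameter `num_reducers`, and the sample data are hypothetical, and this is not necessarily how Madoop divides its work.

```python
"""Partition mapper output among several reducers (illustrative sketch)."""
import zlib


def partition(map_output, num_reducers):
    """Assign each "key<TAB>value" line to a reducer by hashing its key."""
    parts = [[] for _ in range(num_reducers)]
    for line in map_output:
        key = line.split("\t", 1)[0]
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        parts[index].append(line)
    # Each reducer receives its lines sorted, so equal keys are adjacent.
    return [sorted(part) for part in parts]


if __name__ == "__main__":
    # Hypothetical mapper output.
    map_output = ["Hello\t1", "World\t1", "Hello\t1", "Hadoop\t1"]
    for i, part in enumerate(partition(map_output, num_reducers=2)):
        print(f"reducer {i}: {part}")
```

Because every line with a given key lands in the same partition, each reducer can sum its keys independently. With `num_reducers=1` this reduces to the single `sort` step of the command-line pipeline.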

Contributing

Contributions from the community are welcome! Check out the guide for contributing.

Acknowledgments

Michigan Hadoop is written by Andrew DeOrio.
