A naive Python implementation (no distributed computing) to mimic and understand the MapReduce paradigm.
MapReduce is a programming model for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce system is usually composed of three functions (or steps), illustrated by the sketch after this list:
- Map: The map function, also referred to as the map task, processes a single key/value input pair and produces a set of intermediate key/value pairs.
- Shuffle: The shuffle function transfers the intermediate data from the mappers to the reducers. It is a mandatory step, since the grouped output of the shuffle serves as the input for the reduce tasks.
- Reduce: The reduce function, also referred to as the reduce task, consists of taking all key/value pairs produced in the map phase that share the same intermediate key and producing zero, one, or more data items.
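A minimal single-process word-count sketch of the three steps (function names, type hints, and the sample documents are illustrative, not the exact code in this repository):

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def map_function(document: str) -> Iterator[Tuple[str, int]]:
    # Map: emit an intermediate (word, 1) pair for every word in the document.
    for word in document.lower().split():
        yield word, 1


def shuffle_function(pairs: Iterator[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Shuffle: group every intermediate value under its key, so each reduce
    # call sees all the values emitted for one word.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_function(key: str, values: List[int]) -> Tuple[str, int]:
    # Reduce: collapse the list of values for a key into a single count.
    return key, sum(values)


documents = ["the quick brown fox", "the lazy dog", "the quick dog"]
intermediate = (pair for doc in documents for pair in map_function(doc))
word_counts = dict(
    reduce_function(key, values)
    for key, values in shuffle_function(intermediate).items()
)
print(word_counts)
# {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}
```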
Use the Pipfile to install packages in the virtualenv:
pipenv install
pipenv install --dev
Run the MapReduce example:
pipenv run wordcount
Run unit and integration tests:
pipenv run test
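For illustration, a pytest-style test for a word-count implementation like the sketch above might look as follows (the word_count module and function names are assumptions for the example, not this repository's actual layout):

```python
# Hypothetical end-to-end test; module and function names are assumed.
# Run with: pipenv run test
from word_count import map_function, reduce_function, shuffle_function


def test_word_count_end_to_end():
    pairs = map_function("to be or not to be")
    grouped = shuffle_function(pairs)
    counts = dict(reduce_function(key, values) for key, values in grouped.items())
    assert counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```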
- Python | Programming language
- Pipenv | Dependency management
- Pytest | Testing
- pre-commit | Managing and maintaining hooks
- GitHub Actions | CI/CD
- clean-text | Data cleaning (see the sketch below)
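For example, clean-text can normalize raw documents before the map step; a minimal sketch (the exact cleaning options used in this project are an assumption):

```python
from cleantext import clean

raw_document = "The QUICK brown fox!!! Jumped, over the lazy dog."
# Lowercase the text and strip punctuation before mapping; these options
# are illustrative and may differ from the ones used in this repository.
normalized = clean(raw_document, lower=True, no_punct=True)
print(normalized)  # roughly: "the quick brown fox jumped over the lazy dog"
```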
- Made with ❤️ by @vittoriopolverino