The project is a data pipeline that monitors, stores, and visualizes the stream of recent changes from Wikipedia. The chosen approach combines a Python script, a MongoDB database, and an HTML page.
Python is a well-known, widely adopted language with a simple, easy-to-understand syntax. It is also widely used for Big Data work: it handles many common data problems and supports parallel computing.
MongoDB has a simple and efficient query syntax that handles documents directly, without many intermediate steps. It is also scalable, thanks to sharding, which can distribute data across multiple machines.
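As an illustration of the storage and query steps, here is a minimal sketch using pymongo. The database name `wikipedia`, the collection name `recent_changes`, the connection URI, and the helper names are assumptions made for this example, and the `wiki` field is assumed to be present in each stored event. It is not necessarily how the project script does it.

```python
def top_wikis_pipeline(limit=5):
    """Build an aggregation pipeline that counts stored changes per wiki."""
    return [
        {"$group": {"_id": "$wiki", "changes": {"$sum": 1}}},
        {"$sort": {"changes": -1}},
        {"$limit": limit},
    ]


def store_and_query(changes, uri="mongodb://localhost:27017"):
    """Insert a list of change documents, then return the most active wikis."""
    from pymongo import MongoClient  # third-party; imported only when needed

    client = MongoClient(uri)  # assumed local MongoDB instance
    collection = client["wikipedia"]["recent_changes"]  # assumed names
    if changes:  # insert_many rejects an empty list
        collection.insert_many(changes)
    return list(collection.aggregate(top_wikis_pipeline()))
```

Keeping the pipeline in its own function makes the query easy to inspect or reuse in MongoDB Compass without touching the insert logic.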
The visualization was developed as an HTML page with an image of a chart, because it was small and quick to build.
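The chart-in-a-page idea can be sketched roughly as follows, assuming the events are dicts with a `wiki` field. The function names, the image file name, and the choice of a bar chart are hypothetical and only serve the example.

```python
from collections import Counter
from html import escape


def count_by_wiki(changes, top=5):
    """Return the `top` most active wikis as (wiki, count) pairs."""
    counts = Counter(change.get("wiki", "unknown") for change in changes)
    return counts.most_common(top)


def save_chart(pairs, image_path="recent_changes.png"):
    """Draw a bar chart of (wiki, count) pairs and save it as an image."""
    import matplotlib  # third-party; imported only when a chart is drawn

    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    labels = [wiki for wiki, _ in pairs]
    values = [count for _, count in pairs]
    fig, ax = plt.subplots()
    ax.bar(labels, values)
    ax.set_xlabel("Wiki")
    ax.set_ylabel("Recent changes")
    fig.savefig(image_path)
    plt.close(fig)
    return image_path


def build_page(image_path, title="Recent Change Stream"):
    """Return a minimal HTML page that embeds the chart image."""
    safe_title = escape(title)
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{safe_title}</title></head>\n"
        f'<body><h1>{safe_title}</h1>'
        f'<img src="{escape(image_path)}" alt="{safe_title}"></body></html>\n'
    )
```

Writing the string returned by `build_page(save_chart(count_by_wiki(changes)))` to an `.html` file gives a static page that any browser can open.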
This architecture was also attractive because its components are easy to install and it reuses tools that were already installed.
Here is the API I've been using:
API | Description |
---|---|
EventStreams | Wikipedia Recent Events |
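For reference, reading EventStreams with sseclient could look roughly like the sketch below. The stream URL is the public Wikimedia `recentchange` endpoint and is an assumption here, as are the helper names; the project script may differ.

```python
import json

# Assumed endpoint: the public Wikimedia EventStreams recentchange stream.
STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"


def parse_change(raw_data):
    """Return the decoded event as a dict, or None for empty or invalid payloads."""
    if not raw_data:
        return None
    try:
        change = json.loads(raw_data)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    return change if isinstance(change, dict) else None


def stream_changes(url=STREAM_URL):
    """Yield recent-change events as dicts, skipping non-message events."""
    from sseclient import SSEClient  # third-party; imported only when streaming

    for event in SSEClient(url):
        if event.event != "message":
            continue
        change = parse_change(event.data)
        if change is not None:
            yield change
```

Iterating over `stream_changes()` then yields one dict per edit, which can be printed or passed on to the storage step.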
- Install all components listed in the file `requirements.txt`
- Run the script `get-store-show-recent-change.py`
- Open the HTML page `Recent Change Stream.html`
- Task 1 (documentation)
- Task/Script 2 (extract)
- Task/Script 3 (storage)
- Task 4 (visualization)
- Task 5 (query)
- Task 6 (devops)
- Python 3.7.3
- pip 19.1.1
- sseclient 0.0.24
- MongoDB Compass Community 4.0.10
- pymongo 3.8.0
- matplotlib 3.1.0
Thank you!