The Duo Scraper builds a JSON file with the political leaders of each country found at this API. The Scraper performs a double scraping task, hence the name "duo":
- data collection from API endpoints: the Scraper first queries a sequence of API endpoints to obtain a list of countries & basic info about their past political leaders.
- data collection from HTML endpoints: the Scraper then uses the Wikipedia URLs retrieved from the API to extract & sanitize the leaders' short bios from Wikipedia HTML pages.
The combined information is written to an output JSON file.
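The final step can be sketched as follows. This is a minimal illustration of writing the combined data to JSON, not the project's actual code: the dictionary structure, names, and output file name are assumptions.

```python
import json

# Hypothetical shape of the combined data: country codes mapped to lists of
# leaders, each enriched with a sanitized Wikipedia bio (placeholder values).
leaders_per_country = {
    "be": [
        {
            "first_name": "Jane",  # placeholder, not real API data
            "last_name": "Doe",
            "wikipedia_url": "https://en.wikipedia.org/wiki/Example",
            "bio": "Jane Doe was a fictional stateswoman.",
        }
    ]
}

# Write the combined information to an output JSON file.
# ensure_ascii=False keeps accented names readable in the file.
with open("leaders.json", "w", encoding="utf-8") as f:
    json.dump(leaders_per_country, f, ensure_ascii=False, indent=4)
```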
- create a new virtual environment by executing this command in your terminal:

```shell
python3 -m venv wikipedia_scraper_env
```

- activate the environment:

```shell
source wikipedia_scraper_env/bin/activate
```

- install the required dependencies:

```shell
pip install -r requirements.txt
```
To run the program, clone this repo to your local machine, navigate to its directory in your terminal, make sure you have installed the dependencies from `requirements.txt` as described above, then execute:

```shell
python3 main.py
```
This was my second solo project in the AI Bootcamp in Ghent, Belgium, 2024.
Its main goals were to practice:
- using virtual environments
- extracting data from APIs and from HTML
- using exception handling
- getting comfortable with JSON
- using OOP
- using regex to clean text data
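As an illustration of the regex cleanup, here is a small sketch of the kind of sanitizing a scraped Wikipedia paragraph might need. The patterns and sample text are my own assumptions, not the project's actual code:

```python
import re

def clean_bio(raw: str) -> str:
    """Sanitize a raw Wikipedia paragraph: drop footnote markers like [1]
    or [note 2], phonetic guides in /slashes/, and redundant whitespace."""
    text = re.sub(r"\[[^\]]*\]", "", raw)  # remove [1], [note 2], ...
    text = re.sub(r"/[^/]*/", "", text)    # remove /pronunciation/ guides
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip()

raw = "Jane Doe /dʒeɪn doʊ/ was a stateswoman.[1][note 2]  She served twice."
print(clean_bio(raw))  # → Jane Doe was a stateswoman. She served twice.
```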
This project was completed over the course of 3 days in February 2024.
My main challenges and learning opportunities during this project were:
- handling cookies and sessions when performing GET requests
- handling various tags when parsing HTML to get the required content
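The cookie handling can be sketched with the standard library. The project itself may use a different HTTP client, and the commented-out URL is a placeholder:

```python
import urllib.request
from http.cookiejar import CookieJar

# A CookieJar attached to an opener stores cookies set by the server and
# sends them back on subsequent GET requests, like a browser session would.
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

def fetch(url: str) -> str:
    """Perform a GET request; cookies persist in `jar` across calls."""
    with opener.open(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")

# fetch("https://example.com/cookie")  # hypothetical cookie-issuing endpoint
```

This matters for APIs that refuse requests without a valid session cookie: the first call populates the jar, and later calls reuse it automatically.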
I also created a separate branch for the project called feature/o11y
where I experiment with concurrency and observability (o11y) via Honeycomb.
Shoutout to 11011 for his advice and help with these experiments.
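A rough sketch of the kind of concurrency experiment on that branch, fetching several pages in parallel with a thread pool. The `fetch` function here is a stand-in, not the branch's real code:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url: str) -> str:
    # Stand-in for an HTTP GET; the real experiment would perform I/O here,
    # which is what makes threads worthwhile despite the GIL.
    return f"contents of {url}"

urls = [f"https://en.wikipedia.org/wiki/Leader_{i}" for i in range(5)]

# Threads overlap the network waits, so N pages take roughly one round-trip.
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch, urls))  # results keep the input order

print(len(pages))  # → 5
```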
All my code is currently heavily:
- docstringed
- commented
... and sometimes typed.
This is to help me learn and to make my sessions with our training coach more efficient.
Thanks for visiting my project page!
Connect with me on LinkedIn 🤍