Flow of execution of the system
This document describes the current high-level design, explaining the architecture and the sequence of steps that occur when running the project.
The parts covered are:
- Django website: The place for the public to see results; it also has other responsibilities, described below
- Scheduling: To initiate the scrapers
- MuckRock.com: The source of truth
- Scrapers: The objects that process the websites
- Django website: The Django website server runs on a dedicated server, probably on Heroku.
The responsibilities of the website are:
- Display results to the public
- Provide an open-source, read-only API
- Provide a way for the scrapers to POST new scrape data to the website's API; the scrapers hold an API token that allows them to post to the website (see the sketch after this list)
- Initiate the scheduler, explained below
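
As an illustration, here is a minimal sketch of how a scraper might post results to the website's API with its token. The endpoint path, payload fields, and the `Token` authorization scheme (typical of Django REST Framework) are assumptions, not the project's confirmed API.

```python
# Hypothetical sketch of a scraper posting results to the Django site's API.
# The endpoint URL, payload shape, and token scheme are assumptions.
import os

import requests

API_URL = "https://example-app.herokuapp.com/api/scrape-results/"  # assumed endpoint
API_TOKEN = os.environ["SCRAPER_API_TOKEN"]  # token issued to the scraper

def post_scrape_result(agency_id: int, report: dict) -> None:
    """Send one scrape report to the website, authenticated with the API token."""
    response = requests.post(
        API_URL,
        json={"agency": agency_id, "report": report},
        headers={"Authorization": f"Token {API_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()

if __name__ == "__main__":
    post_scrape_result(42, {"status": "ok", "lighthouse_performance": 0.87})
```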
- Scheduling: This is responsible for:
- fetching the data from MuckRock.com
- initiating the Scrapers at regular intervals (a minimal scheduling sketch follows this list)
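
How the interval is driven is not specified here, so the sketch below uses the `schedule` library as a stand-in; the project could equally use cron, Heroku Scheduler, or Celery beat, and the function names are placeholders.

```python
# Hypothetical scheduling loop. The 24-hour interval, the function bodies,
# and the use of the `schedule` library are assumptions.
import time

import schedule

def fetch_agencies_from_muckrock():
    """Pull the latest agency data from MuckRock (see the query sketch below)."""
    ...

def run_scrapers():
    """Kick off the scrapers against the fetched agency websites."""
    ...

def scheduled_job():
    fetch_agencies_from_muckrock()
    run_scrapers()

schedule.every(24).hours.do(scheduled_job)

while True:
    schedule.run_pending()
    time.sleep(60)
```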
- MuckRock.com: The source of truth, containing many agencies and their related information. It exposes an API for obtaining data, which the scheduling system queries periodically (a query sketch follows below).
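
A minimal sketch of paging through MuckRock's public agency endpoint. The endpoint follows MuckRock's API v1, but the exact fields in each record should be checked against live responses.

```python
# Sketch of fetching agency data from MuckRock's public API (API v1).
# Pagination uses the standard "results"/"next" fields; field names inside
# each agency record should be verified against the live API.
import requests

MUCKROCK_AGENCY_URL = "https://www.muckrock.com/api_v1/agency/"

def fetch_all_agencies():
    """Walk the paginated agency list and yield each agency record."""
    url = MUCKROCK_AGENCY_URL
    while url:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        data = response.json()
        yield from data["results"]
        url = data.get("next")  # None once the last page is reached

if __name__ == "__main__":
    for agency in fetch_all_agencies():
        print(agency.get("name"))
```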
- Scrapers: These are responsible for:
- using a variety of tools, such as Beautiful Soup and the Google Lighthouse API, to learn about a website and repeatedly obtain reports on it (see the scraping sketch below)
The scrapers are initiated by the scheduling system, and their results are posted to the API offered by the website.
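
To make the flow concrete, here is a minimal single-page scrape using Beautiful Soup. The facts collected and the report shape are illustrative only; the real scrapers also run other tools such as the Google Lighthouse API.

```python
# Minimal Beautiful Soup sketch: fetch one page and report simple facts.
# The fields collected here are illustrative, not the project's real schema.
import requests
from bs4 import BeautifulSoup

def scrape_website(url: str) -> dict:
    """Download a page and return a small report about it."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else None
    return {
        "url": url,
        "title": title,
        "link_count": len(soup.find_all("a")),
        "uses_https": url.startswith("https://"),
    }

if __name__ == "__main__":
    # In the real flow this report would then be POSTed to the website's API
    # (see the POST sketch earlier in this document).
    print(scrape_website("https://example.com"))
```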