Flow of execution of the system
This document describes the current high-level design, explaining the architecture and the sequence of steps that occur when running the project.
The parts covered are:
- Django website: The place for the public to see results; it also has other responsibilities, described below
- Scheduling: To initiate the scrapers
- MuckRock.com: The source of truth
- Scrapers: The objects that process the websites
- Django website: The Django website server runs on a dedicated server, probably on Heroku.
The responsibilities of the website are:
- Display results to the public
- Provide an open-source, read-only API
- Provide a way for the scrapers to POST new scrape data to the website's API; the scrapers hold an API token that allows them to post to the website (see the sketch after this list)
- Initiate the scheduler, explained below
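
As an illustration, here is a minimal sketch of how a scraper might post results to the website's API with its token. The endpoint path, payload fields, and the `Token` authorization scheme (typical of Django REST Framework) are assumptions, not the project's confirmed API.

```python
# Hypothetical sketch of a scraper posting results to the Django site's API.
# The endpoint URL, payload shape, and token scheme are assumptions.
import os

import requests

API_URL = "https://example-app.herokuapp.com/api/scrape-results/"  # assumed endpoint
API_TOKEN = os.environ["SCRAPER_API_TOKEN"]  # token issued to the scraper

def post_scrape_result(agency_id: int, report: dict) -> None:
    """Send one scrape report to the website, authenticated with the API token."""
    response = requests.post(
        API_URL,
        json={"agency": agency_id, "report": report},
        headers={"Authorization": f"Token {API_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()

if __name__ == "__main__":
    post_scrape_result(42, {"status": "ok", "lighthouse_performance": 0.87})
```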
- Scheduling: This is responsible for:
- fetching the data from MuckRock.com
- initiating the Scrapers at regular intervals (a minimal scheduling sketch follows this list)
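
How the interval is driven is not specified here, so the sketch below uses the `schedule` library as a stand-in; the project could equally use cron, Heroku Scheduler, or Celery beat, and the function names are placeholders.

```python
# Hypothetical scheduling loop. The 24-hour interval, the function bodies,
# and the use of the `schedule` library are assumptions.
import time

import schedule

def fetch_agencies_from_muckrock():
    """Pull the latest agency data from MuckRock (see the query sketch below)."""
    ...

def run_scrapers():
    """Kick off the scrapers against the fetched agency websites."""
    ...

def scheduled_job():
    fetch_agencies_from_muckrock()
    run_scrapers()

schedule.every(24).hours.do(scheduled_job)

while True:
    schedule.run_pending()
    time.sleep(60)
```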
- MuckRock.com: The source of truth, containing many agencies and their related information. It exposes an API for obtaining data, which the scheduling system queries periodically (a query sketch follows below).
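
A minimal sketch of paging through MuckRock's public agency endpoint. The endpoint follows MuckRock's API v1, but the exact fields in each record should be checked against live responses.

```python
# Sketch of fetching agency data from MuckRock's public API (API v1).
# Pagination uses the standard "results"/"next" fields; field names inside
# each agency record should be verified against the live API.
import requests

MUCKROCK_AGENCY_URL = "https://www.muckrock.com/api_v1/agency/"

def fetch_all_agencies():
    """Walk the paginated agency list and yield each agency record."""
    url = MUCKROCK_AGENCY_URL
    while url:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        data = response.json()
        yield from data["results"]
        url = data.get("next")  # None once the last page is reached

if __name__ == "__main__":
    for agency in fetch_all_agencies():
        print(agency.get("name"))
```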
- Scrapers: These are responsible for:
- using a variety of tools, such as Beautiful Soup and the Google Lighthouse API, to learn about a website and repeatedly obtain reports on it (see the scraping sketch below)
The scrapers are initiated by the scheduling system, and their results are posted to the API offered by the website.
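
To make the flow concrete, here is a minimal single-page scrape using Beautiful Soup. The facts collected and the report shape are illustrative only; the real scrapers also run other tools such as the Google Lighthouse API.

```python
# Minimal Beautiful Soup sketch: fetch one page and report simple facts.
# The fields collected here are illustrative, not the project's real schema.
import requests
from bs4 import BeautifulSoup

def scrape_website(url: str) -> dict:
    """Download a page and return a small report about it."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else None
    return {
        "url": url,
        "title": title,
        "link_count": len(soup.find_all("a")),
        "uses_https": url.startswith("https://"),
    }

if __name__ == "__main__":
    # In the real flow this report would then be POSTed to the website's API
    # (see the POST sketch earlier in this document).
    print(scrape_website("https://example.com"))
```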