
Flow of execution of the system


This document describes the current high-level design: the architecture of the system and the sequence of steps that occur when the project runs.

The parts covered are:

  1. Django website: The place for the public to see results, along with some other responsibilities

  2. Scheduling: To initiate the scrapers

  3. MuckRock.com: The source of truth

  4. Scrapers: The components that process the websites and produce reports

  1. Django website: The Django website server sits on a dedicated server, probably on Heroku.

The responsibilities of the website are:

  • Display results to the public
  • Provide an open-source, read-only API
  • Provide a way to POST new scrape data to the website's API from the scrapers; the scrapers hold an API token that authorizes them to post (see the sketch after this list)
  • Initiate the scheduler, explained below
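
As an illustration, a scraper might authenticate its POST roughly as below. This is a minimal sketch: the endpoint URL, payload shape, and header scheme are assumptions made for the example, not confirmed details of the project.

```python
# Sketch of a scraper POSTing new scrape data to the website's API.
# The URL, payload shape, and "Token" header scheme are assumptions.
import os

import requests

API_URL = "https://example-site.herokuapp.com/api/scrapes/"  # hypothetical endpoint
API_TOKEN = os.environ["SCRAPER_API_TOKEN"]  # token issued to the scrapers


def post_scrape_result(result: dict) -> None:
    """Send one scrape report to the website, authenticating with the API token."""
    response = requests.post(
        API_URL,
        json=result,
        headers={"Authorization": f"Token {API_TOKEN}"},
    )
    response.raise_for_status()
```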
  2. Scheduling: This is responsible for:
  • fetching the data from MuckRock.com
  • initiating the scrapers at regular intervals
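
A minimal sketch of that loop follows, assuming the third-party `schedule` library is used as the scheduler; the document does not name the actual scheduling mechanism, and the daily cadence and helper body below are assumptions for illustration.

```python
# Sketch of the scheduling loop; the `schedule` library and the daily
# cadence are assumptions, not confirmed details of the project.
import time

import schedule


def fetch_and_scrape():
    # 1. fetch agency data from MuckRock.com (see the sketch in the next section)
    # 2. initiate the scrapers with that data
    print("fetching MuckRock data and starting the scrapers...")


schedule.every().day.do(fetch_and_scrape)  # "regular intervals"; interval assumed

while True:
    schedule.run_pending()
    time.sleep(60)
```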
  3. MuckRock.com: The source of truth, containing many agencies and their related information. MuckRock offers an API to obtain this data, and the scheduling system queries it periodically.
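
The periodic query might look roughly like the sketch below. The `/api_v1/agency/` endpoint and the paginated response shape (`results` plus a `next` link) are assumptions based on common conventions for MuckRock's public API, not details taken from this document.

```python
# Sketch of paging through MuckRock's API for agency data.
# The endpoint path and response shape are assumptions.
import requests


def fetch_agencies():
    """Yield agency records from MuckRock, following pagination links."""
    url = "https://www.muckrock.com/api_v1/agency/"  # assumed endpoint
    while url:
        data = requests.get(url, timeout=30).json()
        yield from data["results"]
        url = data.get("next")  # None on the last page


for agency in fetch_agencies():
    print(agency["name"])  # assumed field name
```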

  4. Scrapers

Scrapers are responsible for:

  • utilizing a range of tools, such as Beautiful Soup and the Google Lighthouse API, to learn about a website and repeatedly obtain reports on it

The scrapers are initiated by the scheduling system, and their results are posted to the API offered by the website.
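
As an illustration, one scraper step with Beautiful Soup (one of the tools named above) might look like this sketch; what exactly goes into a report is an assumption made for the example.

```python
# Sketch of one scraper step: fetch a page with requests, inspect it
# with Beautiful Soup, and build a small report. The report fields are
# assumptions; the real scrapers also use other tools such as Lighthouse.
import requests
from bs4 import BeautifulSoup


def scrape_site(url: str) -> dict:
    """Fetch a page and extract a small report about it."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        "url": url,
        "title": soup.title.string if soup.title else None,
        "num_links": len(soup.find_all("a")),
        "uses_https": url.startswith("https://"),
    }


report = scrape_site("https://example.com")
# The report would then be POSTed to the website's API (see the sketch above).
```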
