In Li, a "source" is a Javascript class that provides lists of URLs that should be crawled, and methods to scrape the data from those pages.
This guide provides information on the criteria we use to determine whether a source should be added to the project, and offers details on how a source can be implemented.
An annotated sample source is in `docs/sample-sources/sample.js`. It will give you a rough idea of the shape of a source and how it is used. You can also look at the live sources in `src/shared/sources/`.
See [Source criteria](./source-criteria.md) to determine if a source should be included in this project.
As shown in the samples, a source has crawlers, which pull down data files (json, csv, html, pdf, tsv, raw), and scrapers, which scrape those files and return useful data. Sources can pull in data for anything: cities, counties, states, countries, or collections thereof. See the existing scrapers for ideas on how to deal with the different ways data can be presented.
Copy the template in `docs/sample-sources/template.md` to a new file in the correct country and region directory (e.g., `src/shared/sources/us/ca/mycounty-name.js`). That file contains some fields that you should fill in or delete, depending on the details of the source. Also see the comments in `docs/sample-sources/sample.js`, and below.
At the moment, we provide support for page, headless, csv, tsv, pdf, json, and raw. A central controller will execute the source to crawl the provided URLs and cache the data. You just need to supply the `url`, `type`, and `name` (if there are multiple URLs to crawl).
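As a rough sketch (the attribute name and the URLs below are illustrative assumptions; see `docs/sample-sources/sample.js` for the authoritative shape), a set of crawl entries for a source with two files might look like:

```javascript
// Hypothetical crawl entries for a source that fetches two CSV files.
// `name` distinguishes the cached files when there is more than one URL;
// the URLs are placeholders.
crawl: [
  { type: 'csv', name: 'cases', url: 'https://example.gov/covid/cases.csv' },
  { type: 'csv', name: 'deaths', url: 'https://example.gov/covid/deaths.csv' },
],
```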
Scrapers are functions associated with the `scrape` attribute on `scrapers` in the source. You may implement one or more scrapers if the source changes its formatting (see "What to do if a scraper breaks?"). Your scraper should return an object, an array of objects, or `null`.
The object may have the following attributes:
```javascript
result = {
  // [ISO 3166-1 alpha-3 country code](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) [required]
  country: 'iso1:xxx',
  // The state, province, or region (not required if defined on scraper object)
  state: 'iso2:xxx',
  // The county or parish (not required if defined on scraper object)
  county: 'xxx',
  // The city name (not required if defined on scraper object)
  city: 'xxx',
  // Total number of cases
  cases: 42,
  // Total number of deaths
  deaths: 42,
  // Total number hospitalized
  hospitalized: 42,
  // Total number discharged
  discharged: 42,
  // Total number recovered
  recovered: 42,
  // Total number tested
  tested: 42,
  // GeoJSON feature associated with the location (See [Features and population data](#features-and-population-data))
  feature: 'xxx',
  // Additional identifiers to aid with feature matching (See [Features and population data](#features-and-population-data))
  featureId: 'xxx',
  // The estimated population of the location (See [Features and population data](#features-and-population-data))
  population: 42,
  // Array of coordinates as `[longitude, latitude]` (See [Features and population data](#features-and-population-data))
  coordinates: 'xxx',
}
```
Returning an array of objects is useful for aggregate sources: sources that provide information for more than one geographical area. For example, Canada provides information for all provinces of the country. If the scraper returns an array, each object in the array will have the attributes specified in the source object appended, meaning you only need to specify the fields that change per location (`county`, `cases`, and `deaths`, for example).
`null` should be returned if no data is available. This could be the case if the source has not provided an update for today, or if we are fetching historical information for which we have no cached data.
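As an illustrative sketch only (the column names and data shape below are assumptions, not a real source), a scraper for an aggregate source that returns an array, or `null` when nothing is available, might look like:

```javascript
// Hypothetical scraper for an aggregate source. `data` is assumed to be
// the parsed CSV: an array of rows with `Province`, `Cases`, and `Deaths`
// columns. Return null when there is nothing to report.
scrape (data) {
  if (!data || data.length === 0)
    return null
  return data.map(row => ({
    state: row.Province,
    cases: parseInt(row.Cases, 10),
    deaths: parseInt(row.Deaths, 10)
  }))
}
```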
It's a tough challenge to write scrapers that will work when websites are inevitably updated. Here are some tips:
- If your source is an HTML table, validate its structure: check that table headers contain expected text, that columns exist, etc. For example, if you say `result.deaths` is the value stored in column 2, but the source has changed column 2 from "Deaths" to "Cases", your scrape will complete successfully, but the data won't be correct (see the sketch after this list).
- If data for a field is not present (e.g., no recovered information), do not put 0 for that field. Leave the field undefined so the scraper knows there is no information for that particular field.
- Write your scraper so it handles aggregate data with a single scraper entry (i.e., find a table, process the table).
- Try not to hardcode county or city names; instead, let the data on the page populate them.
- Make your scraper less brittle by avoiding generated class names (i.e., CSS modules).
- When targeting elements, don't assume the order will be the same (i.e., if there are multiple `.count` elements, don't assume the second one is deaths; verify it by parsing the label).
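A minimal sketch of that kind of header validation, assuming a cheerio-style `$` handle over the fetched page (the selectors and column layout are illustrative only):

```javascript
// Hypothetical header check for an HTML-table scraper. If the source
// reorders or renames its columns, the scraper throws instead of
// silently returning wrong numbers.
scrape ($) {
  const headings = $('table#cases th')
    .toArray()
    .map(th => $(th).text().trim())
  if (headings[1] !== 'Deaths')
    throw new Error(`Unexpected column heading: ${headings[1]}`)
  const deaths = parseInt($('table#cases td').eq(1).text(), 10)
  return { deaths }
}
```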
Source scrapers need to be able to operate correctly on old data, so updates to scrapers must be backwards compatible. If you know the date the site broke, you can have two (or more) implementations of a scraper in the same function, based on date. Most sources in `src/shared/sources` deal with such cases.
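One hedged way to express this, assuming the scrape function has access to the date being processed (how that date is obtained, and the column names below, are assumptions):

```javascript
// Hypothetical scraper with two implementations selected by date:
// before 2020-04-01 the source published a `Positive` column,
// afterwards it was renamed to `Cases`.
scrape (data, date) {
  if (date < '2020-04-01') {
    return { cases: parseInt(data[0].Positive, 10) }
  }
  return { cases: parseInt(data[0].Cases, 10) }
}
```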
We strive to provide a GeoJSON feature and population number for every location in our dataset. When adding a source for a country, we may already have this information and can populate it automatically. For smaller regional entities, this information may not be available and has to be added manually.
Features can be specified in several ways: through the `country`, `state`, and `county` fields; by matching the `longitude` and `latitude` to a particular feature; through the `featureId` field; or through the `feature` field.
While the first two methods work most of the time, sometimes you will have to rely on `featureId` to help the crawler make the correct guess (see the example after this list). `featureId` is an object that specifies one or more of the attributes below:

- `name`
- `adm1_code`
- `iso_a2`
- `iso_3166_2`
- `code_hasc`
- `postal`
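For instance (the specific code below is only an assumption of what a matching value might look like), a source could pin its location to a subdivision code:

```javascript
// Hypothetical featureId disambiguating the location by its
// ISO 3166-2 subdivision code.
featureId: { iso_3166_2: 'US-CA' },
```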
If we do not have any geographical information for the location you are trying to scrape, you can provide a GeoJSON feature directly in the `feature` attribute of the object returned by the scraper.
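A minimal sketch of such a feature (the geometry below is a made-up square, not real boundaries):

```javascript
// Hypothetical GeoJSON feature supplied directly by the scraper.
feature: {
  type: 'Feature',
  properties: { name: 'My County' },
  geometry: {
    type: 'Polygon',
    coordinates: [
      [ [ -122.0, 37.0 ], [ -121.0, 37.0 ], [ -121.0, 38.0 ], [ -122.0, 38.0 ], [ -122.0, 37.0 ] ]
    ]
  }
},
```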
If we have a feature for the location, we will calculate a `longitude` and `latitude` from it. You may also specify custom coordinates by providing a `[longitude, latitude]` value in the `coordinates` attribute.
Population can usually be guessed automatically, but if that is not the case, you can provide a population number by returning a value for the `population` field in the object returned by the scraper.
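For example (a hypothetical county where both overrides are supplied by hand):

```javascript
// Hypothetical scraper return value supplying population and
// coordinates alongside the case count.
return {
  county: 'My County',
  cases: 42,
  population: 220000,
  coordinates: [ -121.5, 37.5 ],
}
```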
You should test your source first by running `npm run test`. This will perform some basic tests to make sure nothing crashes and the source object is in the correct form.
We run scrapes periodically for every cached file, for every source. If you change a source, it will be exercised when you run `npm run test:integration`. See Testing for more information.