Merge pull request #345 from SmartCambridge/jw35-download-api
Add the 'Download API'
Showing 1,521 changed files with 3,496 additions and 126 deletions.
@@ -106,3 +106,5 @@ tfc_web/data/TNDS_NEW/*.zip
 /tfc_web/media
+nohup.out
+tfc_web/api/tests/data/download_api/
@@ -0,0 +1,182 @@
SmartCambridge Download API
===========================

The 'Download API' is a framework that provides efficient download
access to potentially large amounts of archived SmartCambridge data as
compressed CSV files. It complements the Django REST framework-based
'Program API' and works by pre-building zipped CSV files containing
extracts of the available data for various date ranges and serving them
directly from Nginx.

Components
==========

The framework consists of:

1. The Django configuration item `DOWNLOAD_FEEDS` that defines which
archives should be maintained and made available.

2. The Django management command `build_download_data`
(`tfc_web/api/management/commands/build_download_data.py`) which
creates, updates (and occasionally deletes) zipped CSV files containing
extracts of SmartCambridge data, based on the configuration in
`DOWNLOAD_FEEDS`. It is largely idempotent (like `make`), so each time it
is run it just updates the collection of archive files as needed. This
command should be run once per day from `tfc_web`'s crontab.

3. `/media/tfc/download_api/`, where the built archive files are stored.

4. Nginx configuration from `/etc/nginx/includes2/tfc_web.conf` (source
in `nginx/includes2/tfc_web.conf` in the tfc_prod repository) that makes
the files in `/media/tfc/download_api/` available under
`https://smartcambridge.org/api/download_files/`, but only to people who
have registered on the platform, authenticated to the Django app and
agreed to the platform T&Cs.

5. Other Django components in the 'api' app (`tfc_web/api/`, including
`nginx_auth_probe()` (sketched below), `download()` and
`download_schema()` in `views.py`, `templates/api/download.html` and
`templates/api/*-schema.html`) that present the page at
[https://smartcambridge.org/api/download/](https://smartcambridge.org/api/download/)
which provides an index to the available downloadable files.
`build_download_data` can produce archives with various date ranges,
but `views.py` and `templates/api/download.html` currently assume that
there are only annual, monthly and/or daily archives.
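For illustration, here is a minimal sketch of what the auth probe might
look like as a Django view, on the assumption that Nginx calls it via
something like the `auth_request` mechanism. This is not the actual
`views.py` code, and the `accepted_tcs` attribute is an invented name:

```python
# Hypothetical sketch of an auth probe view: Nginx sub-requests this URL
# before serving a download and only proceeds if the response is 2xx.
from django.http import HttpResponse


def nginx_auth_probe(request):
    # Assumed predicate: the user is signed in and has agreed to the
    # platform T&Cs ('accepted_tcs' is a hypothetical attribute).
    if request.user.is_authenticated and getattr(request.user, 'accepted_tcs', False):
        return HttpResponse(status=200)   # Nginx serves the file
    return HttpResponse(status=403)       # Nginx denies the request
```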
Configuration
=============

The Django configuration item `DOWNLOAD_FEEDS` contains a sequence of
dictionaries, each corresponding to a 'feed' of data to make available.
Each dictionary contains the following keys:

`name`: A short name or tag for this feed. Used in log records and URLs.

`title`: A human-readable title for this feed. Used as a title on the
index web page.

`desc`: A longer description of this feed. Displayed on the index web
page.

`archive_by_default`: Optional. If present and `True`, archives for this
feed are processed when `build_download_data` is run without an explicit
list of feeds.

`display`: Optional. If present and `True`, this feed is listed on the
index web page. Setting this to `False` won't prevent people from
accessing any archives that exist for this feed if they know or can
guess the URL; it just means they won't be listed. There's currently no
way to make data available based on who someone is.

`first_year`: The earliest year for which this feed contains data. Only
data dated between 1 January of this year and yesterday will be
processed.

`archives`: Optional. If present, must be a sequence of dictionaries,
each defining an archive that will be maintained.

`metadata`: Optional. If present, must be a dictionary defining how the
feed's metadata will be maintained.

Archive and metadata dictionaries contain the following keys:

`name`: A short name or tag for this archive. Used in log records.

`source_pattern`: A filesystem path relative to the Django configuration
item `DATA_PATH` (typically `/media/tfc`) that selects data files to be
processed for inclusion in an archive. The pattern can contain
filesystem glob wild-cards, which will be expanded. The pattern is
processed by `string.format()` with a parameter `date` that contains the
start date for the archive being processed. As a result `{date:%Y}`
will, for example, be replaced by the relevant year. An annual archive
might have a `source_pattern` like `cam_aq/data_bin/{date:%Y}/*/*/*.json`.
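A sketch of how such a pattern might be expanded, based on the
description above rather than on the actual implementation:

```python
# Expand a source_pattern for an annual archive starting 1 January 2019
import datetime
import glob
import os.path

DATA_PATH = '/media/tfc'  # typical value of the DATA_PATH setting
pattern = 'cam_aq/data_bin/{date:%Y}/*/*/*.json'

start = datetime.date(2019, 1, 1)
expanded = pattern.format(date=start)  # 'cam_aq/data_bin/2019/*/*/*.json'
files = glob.glob(os.path.join(DATA_PATH, expanded))
```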
`destination`: A filesystem path relative to the Django configuration
item `DEST_DIR` (defaulting to `{{ DATA_PATH }}/download_api/`) to which
the archive will be written. The pattern is processed by
`string.format()` as for `source_pattern`.

`extractor`: A dot-separated Python path identifying the 'extractor'
function responsible for extracting data for this archive from the
source files and writing it out as CSV (see below for details).
Extractors are normally stored in `tfc_web/api/extractors/*.py`.

`step`: The time step between successive archives, expressed as named
parameters for
[`dateutil.relativedelta.relativedelta()`](https://dateutil.readthedocs.io/en/stable/relativedelta.html).
For example `{'years': 1}` for an annual archive.

`start`: Optional. The first day for which an archive file should
exist, relative to today and expressed as named parameters for
[`dateutil.relativedelta.relativedelta()`](https://dateutil.readthedocs.io/en/stable/relativedelta.html).
So for example `{'year': 1960, 'month': 3, 'day': 5}` represents
1960-03-05, `{'years': -1}` represents today's date last year and
`{'day': 1}` represents the first day of the current month. Defaults to
1 January of the feed's `first_year`. Any existing archive files between
1 January of the feed's `first_year` and the value of `start` will be
deleted.
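These `relativedelta()` parameters can be tried out directly; a worked
illustration (not code from the command itself):

```python
import datetime
from dateutil.relativedelta import relativedelta

today = datetime.date.today()

today + relativedelta(**{'year': 1960, 'month': 3, 'day': 5})  # always 1960-03-05
today + relativedelta(**{'years': -1})                         # today's date last year
today + relativedelta(**{'day': 1})                            # first of the current month

# A 'step' of {'years': 1} advances an archive's start date by one year:
datetime.date(2019, 1, 1) + relativedelta(**{'years': 1})      # 2020-01-01
```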
`end`: Optional. The last day for which an archive file should exist,
expressed as above for `start`. Defaults to yesterday. Any existing
archive files between `end` and yesterday will be deleted.
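Putting the pieces together, a hypothetical `DOWNLOAD_FEEDS` entry might
look like this. All names, paths and values here are illustrative, not
taken from the real `settings.py`:

```python
DOWNLOAD_FEEDS = [
    {
        'name': 'aq',
        'title': 'Air quality',
        'desc': 'Readings from the SmartCambridge air quality sensors.',
        'archive_by_default': True,
        'display': True,
        'first_year': 2016,
        'archives': [
            {
                # One zip per year, covering 1 January to 31 December
                'name': 'annual',
                'source_pattern': 'cam_aq/data_bin/{date:%Y}/*/*/*.json',
                'destination': 'aq/aq-{date:%Y}.zip',
                'extractor': 'api.extractors.aq.aq_data_extractor',
                'step': {'years': 1},
            },
        ],
        'metadata': {
            'name': 'metadata',
            'source_pattern': 'sys/aq_config.json',  # invented path
            'destination': 'aq/aq-metadata.zip',
            'extractor': 'api.extractors.aq.aq_metadata_extractor',
        },
    },
]
```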
`build_download_data`
=====================

By default, `build_download_data` manages archives for all feeds with
`archive_by_default` set to `True`. Alternatively, a list of one or more
feeds to manage can be supplied on the command line.

In managing archives, `build_download_data` will create any that are
missing, update any for which there are source files with later
modification dates than the corresponding archive, and delete any for
which there is no data or which correspond to dates before the archive's
`start` or after its `end`. The command-line option `--force` will force
all existing archives to be updated irrespective of dates.
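For example (the feed name 'parking' is illustrative):

```
./manage.py build_download_data                    # all feeds with archive_by_default
./manage.py build_download_data parking            # just the 'parking' feed
./manage.py build_download_data --force parking    # rebuild even if up to date
```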
Extractor functions
===================

`build_download_data` uses 'extractor' functions to extract and format
data from each feed's data files. These can appear anywhere in the
Python include path, but typically live in a file named after the
corresponding feed in `tfc_web/api/extractors` - e.g.
`tfc_web/api/extractors/parking.py`. Most feeds need a pair of
extractors - one for the data itself and one for the feed metadata from
`/media/tfc/sys/` - but this can vary (the 'aq' feed has two data
extractors, for example, and the 'bus' feed has no metadata).

Extractor functions receive a list of names of feed data files to
process and a Python CSV writer object as parameters. Their return value
is ignored. They are expected to write a header row to the CSV writer
and then to extract information from each file, manipulate it as needed,
and write it to the CSV writer.
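In outline, an extractor has this shape (a minimal sketch with invented
field names; the real extractors appear later in this diff):

```python
import json


def example_extractor(files, writer):
    # Header row first, then one CSV row per source file
    fields = ('ts', 'value')  # illustrative column names
    writer.writerow(fields)
    for file in files:
        with open(file) as reader:
            data = json.load(reader)
            writer.writerow([data.get(f) for f in fields])
```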
See the `parking.py` extractor for a straightforward example.

Adding a new data source
========================

Making a new data feed downloadable unfortunately needs changes in
several places (blame the system designer):

1. Create 'extractor' functions for the data.

2. Edit `settings.py` and add a new element to `DOWNLOAD_FEEDS` to
represent the new feed. Set `display` to `False` until you are ready to
publish the data. You may also want to set `archive_by_default` to
`False` initially.

3. Run `./manage.py build_download_data <feed name>` by hand and confirm
that appropriate archives are created.

4. Optionally add a file in `tfc_web/api/templates/api/` called `<feed
name>-schema.html` containing a description of the data and its format
(column names, units, etc).

5. Set `display` to `True` and confirm that
[https://smartcambridge.org/api/download/](https://smartcambridge.org/api/download/)
displays the feed as expected.

6. Set `archive_by_default` to `True` to automatically maintain the
archives into the future.
@@ -0,0 +1,111 @@
'''
The set of functions used by the build_download_data command to extract data
and store it in CSV files.
'''

import json
import logging

import dateutil.parser

from .util import epoch_to_text

logger = logging.getLogger(__name__)


# Data extractors receive a list of file names and a CSV writer object.
# They are expected to write appropriate headers to the CSV writer and
# then extract relevant fields from each file, format them as necessary
# and write the result to the CSV writer.


def aq_header_extractor(files, writer):

    logger.debug('In aq_header_extractor')

    fields = (
        'station_id', 'sensor_type', 'ts', 'ts_text',
        'BatteryVoltage', 'COFinal', 'COOffset', 'COPrescaled', 'COScaled',
        'COSerialNumber', 'COSlope', 'COStatus', 'GasProtocol', 'GasStatus',
        'Humidity', 'Latitude', 'Longitude', 'Name', 'NO2Final', 'NO2Offset',
        'NO2Prescaled', 'NO2Scaled', 'NO2SerialNumber', 'NO2Slope', 'NO2Status',
        'NOFinal', 'NOOffset', 'NOPrescaled', 'NOScaled', 'NOSerialNumber',
        'NOSlope', 'NOStatus', 'O3Final', 'O3Offset', 'O3Prescaled', 'O3Scaled',
        'O3SerialNumber', 'O3Slope', 'O3Status', 'OtherInfo', 'P1', 'P2', 'P3',
        'ParticleNumber', 'ParticleProtocol', 'ParticleStatus', 'PM10Final',
        'PM10Offset', 'PM10Output', 'PM10PreScaled', 'PM10Slope', 'PM1Final',
        'PM1Offset', 'PM1Output', 'PM1PreScaled', 'PM1Slope', 'PM2_5Final',
        'PM2_5Offset', 'PM2_5OutPut', 'PM2_5PreScaled', 'PM2_5Slope',
        'PMTotalOffset', 'PMTotalPreScaled', 'PMTotalSlope', 'PodFeaturetype',
        'Pressure', 'SerialNo', 'SO2Final', 'SO2Offset', 'SO2Prescaled',
        'SO2Scaled', 'SO2SerialNumber', 'SO2Slope', 'SO2Status', 'T1', 'T2', 'T3',
        'Temperature', 'TSP'
    )

    writer.writerow(fields)

    for file in files:
        logger.debug('Processing %s', file)
        with open(file) as reader:
            data = json.load(reader)
            header = data['Header']
            # Capitalisation of 'StationID' is inconsistent in the source data
            try:
                station_id = str(header['StationID'])
            except KeyError:
                station_id = str(header['StationId'])
            if not station_id.startswith('S-'):
                station_id = 'S-' + station_id
            header['station_id'] = station_id
            header['sensor_type'] = data.get('SensorType')
            ts = dateutil.parser.parse(header['Timestamp']).timestamp()
            header['ts'] = ts
            header['ts_text'] = epoch_to_text(ts)
            writer.writerow([header.get(f) for f in fields])


def aq_data_extractor(files, writer):

    logger.debug('In aq_data_extractor')
    fields = ('station_id', 'sensor_type', 'ts', 'ts_text', 'reading')
    writer.writerow(fields)

    for file in files:
        logger.debug('Processing %s', file)
        with open(file) as reader:
            data = json.load(reader)
            # Capitalisation of 'StationID' is inconsistent in the source
            # data; resolve it once per file rather than once per reading
            try:
                station_id = str(data['Header']['StationID'])
            except KeyError:
                station_id = str(data['Header']['StationId'])
            if not station_id.startswith('S-'):
                station_id = 'S-' + station_id
            for ts, reading in data['Readings']:
                row = [
                    station_id,
                    data['SensorType'],
                    ts,
                    # Reading timestamps are in milliseconds, hence the /1000
                    epoch_to_text(ts/1000),
                    reading
                ]
                writer.writerow(row)


# Metadata extractors for each storage type. They receive a single filename
# in 'files' and a CSV writer object.


def aq_metadata_extractor(files, writer):

    logger.debug('In aq_metadata_extractor')
    fields = ('station_id', 'Name', 'Description', 'SensorTypes', 'Latitude', 'Longitude', 'date_from', 'date_to')
    writer.writerow(fields)

    assert len(files) == 1, 'Expecting exactly one file'
    logger.debug('Processing %s', files[0])
    with open(files[0]) as reader:
        for station in json.load(reader)['aq_list']:
            station['station_id'] = station['StationID']
            station['SensorTypes'] = '|'.join(station['SensorTypes'])
            writer.writerow([station.get(f) for f in fields])
@@ -0,0 +1,39 @@
'''
The set of functions used by the build_download_data command to extract data
and store it in CSV files.
'''

import json
import logging

from .util import epoch_to_text

logger = logging.getLogger(__name__)


# Data extractors receive a list of file names and a CSV writer object.
# They are expected to write appropriate headers to the CSV writer and
# then extract relevant fields from each file, format them as necessary
# and write the result to the CSV writer.


def bus_extractor(files, writer):

    logger.debug('In bus_extractor')
    fields = (
        'ts', 'ts_text', 'VehicleRef', 'LineRef', 'DirectionRef',
        'OperatorRef', 'OriginRef', 'OriginName', 'DestinationRef',
        'DestinationName', 'OriginAimedDepartureTime', 'Longitude',
        'Latitude', 'Bearing', 'Delay'
    )
    writer.writerow(fields)

    for file in files:
        logger.debug('Processing %s', file)
        with open(file) as reader:
            data = json.load(reader)
            for record in data['request_data']:
                record['ts'] = record['acp_ts']
                record['ts_text'] = epoch_to_text(record['acp_ts'])
                writer.writerow([record.get(f) for f in fields])
@@ -0,0 +1,50 @@
'''
The functions used by the build_download_data command to extract car
park data and store it in CSV files.
'''

import json
import logging

from .util import epoch_to_text

logger = logging.getLogger(__name__)


# Data extractors receive a list of file names and a CSV writer object.
# They are expected to write appropriate headers to the CSV writer and
# then extract relevant fields from each file, format them as necessary
# and write the result to the CSV writer.


def cam_park_rss_extractor(files, writer):

    logger.debug('In cam_park_rss_extractor')
    fields = ('parking_id', 'ts', 'ts_text', 'spaces_capacity', 'spaces_occupied', 'spaces_free')
    writer.writerow(fields)

    for file in files:
        logger.debug('Processing %s', file)
        with open(file) as reader:
            # Source files contain one JSON record per line
            for line in reader:
                data = json.loads(line)
                data['ts_text'] = epoch_to_text(data['ts'])
                writer.writerow([data.get(f) for f in fields])


# Metadata extractors for each storage type. They receive a single filename
# in 'files' and a CSV writer object.


def cam_park_rss_metadata_extractor(files, writer):

    logger.debug('In cam_park_rss_metadata_extractor')
    fields = ('parking_id', 'parking_name', 'parking_type', 'latitude', 'longitude')
    writer.writerow(fields)

    assert len(files) == 1, 'Expecting exactly one file'
    logger.debug('Processing %s', files[0])
    with open(files[0]) as reader:
        data = json.load(reader)
        for carpark in data['parking_list']:
            writer.writerow([carpark.get(f) for f in fields])