
Commit

Merge pull request #345 from SmartCambridge/jw35-download-api
Add the 'Download API'
abrahammartin authored Nov 30, 2019
2 parents 8a8c37f + 285cc5d commit 73e7b0b
Showing 1,521 changed files with 3,496 additions and 126 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -106,3 +106,5 @@ tfc_web/data/TNDS_NEW/*.zip
/tfc_web/media

nohup.out

tfc_web/api/tests/data/download_api/
182 changes: 182 additions & 0 deletions download_api.md
@@ -0,0 +1,182 @@
SmartCambridge Download API
===========================

The 'Download API' is a framework that provides efficient download
access to potentially large amounts of archived SmartCambridge data as
compressed CSV files. It complements the Django REST framework-based
'Program API' and works by pre-building zipped CSV files containing
extracts of the available data for various date ranges and serving them
directly from Nginx.

Components
==========

The framework consists of:

1. The Django configuration item `DOWNLOAD_FEEDS` that defines which
archives should be maintained and made available.

2. The Django management command `build_download_data`
(`tfc_web/api/management/commands/build_download_data.py`) which
creates, updates (and occasionally deletes) zipped CSV files containing
extracts of SmartCambridge data, based on the configuration in
`DOWNLOAD_FEEDS`. It is largely idempotent (like `make`) so each time it
is run it just updates the collection of archive files as needed. This
command should be run once per day from `tfc_web`'s crontab.

3. The directory `/media/tfc/download_api/` where the built archive files are stored.

4. Nginx configuration from `/etc/nginx/includes2/tfc_web.conf` (source
in `nginx/includes2/tfc_web.conf` in the tfc_prod repository) that makes
the files in `/media/tfc/download_api/` available under
`https://smartcambridge.org/api/download_files/`, but only to people who
have registered on the platform, authenticated to the Django app and
agreed to the platform T&Cs.

5. Other Django components in the 'api' app (`tfc_web/api/`, including
`nginx_auth_probe()`, `download()` and `download_schema()` in
`views.py`, `templates/api/download.html`,
`templates/api/*-schema.html`) that present the page at
[https://smartcambridge.org/api/download/](https://smartcambridge.org/api/download/)
which provides an index to available downloadable files.
`build_download_data` can produce archives with various date ranges
but `views.py` and `templates/api/download.html` currently assume that
there are only annual, monthly and/or daily archives.

Configuration
=============

The Django configuration item `DOWNLOAD_FEEDS` contains a sequence of
dictionaries, each corresponding to a 'feed' of data to make available.
Each dictionary contains the following keys:

`name`: A short name or tag for this feed. Used in log records and URLs.

`title`: A human-readable title for this feed. Used as a title on the
index web page.

`desc`: A longer description of this feed. Displayed on the index web
page.

`archive_by_default`: Optional. If present and `True`, archives for this
feed are processed when `build_download_data` is run without an explicit
list of feeds.

`display`: Optional. If present and `True`, this feed is listed on the
index web page. Setting this to `False` won't prevent people accessing
any archives that exist for this feed if they know or can guess the URL,
it just means they won't be listed. There is currently no way to
restrict access to data based on who a user is.

`first_year`: The earliest year for which this feed contains data. Only
data dated between 1 January in this year and yesterday will be
processed.

`archives`: Optional. If present, must be a sequence of dictionaries,
each defining an archive that will be maintained.

`metadata`: Optional. If present, must be a dictionary defining how the
feed's metadata will be maintained.

Archive and metadata dictionaries contain the following keys:

`name`: A short name or tag for this archive. Used in log records

`source_pattern`: A filesystem path relative to the Django configuration
item `DATA_PATH` (typically `/media/tfc`) that selects data files to be
processed for inclusion in an archive. The pattern can contain file
system glob wildcards, which will be expanded. The pattern is processed
by `string.format()` with a parameter `date` that contains the start
date for the archive being processed. As a result `{date:%Y}` will, for
example, be replaced by the relevant year. An annual archive might have
a `source_pattern` like `cam_aq/data_bin/{date:%Y}/*/*/*.json`.
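
For illustration, the expansion described above can be reproduced with
`str.format()` and the standard `glob` module. The base path and pattern
below are just the examples used in this document:

```python
import datetime
import glob
import os

# Example values only: DATA_PATH and the pattern come from the Django settings
data_path = '/media/tfc'
source_pattern = 'cam_aq/data_bin/{date:%Y}/*/*/*.json'

# Start date of the archive being built, e.g. 1 January 2018
date = datetime.date(2018, 1, 1)

# '{date:%Y}' expands to '2018', giving 'cam_aq/data_bin/2018/*/*/*.json'
expanded = source_pattern.format(date=date)

# The glob wildcards then select the individual source files
files = sorted(glob.glob(os.path.join(data_path, expanded)))
```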

`destination`: A file system path relative to the Django configuration
item `DEST_DIR` (defaulting to `{{ DATA_PATH }}/download_api/`) to which
the archive will be written. The pattern is processed by
`string.format()` as for `source_pattern`.

`extractor`: A dot-separated Python path identifying the 'extractor'
function responsible for extracting data for this archive from the
source files and loading it into CSV (see below for details). Extractors
are normally stored in `tfc_web/api/extractors/*.py`.

`step`: The time step between successive archives, expressed as named
parameters for
[`dateutil.relativedelta.relativedelta()`](https://dateutil.readthedocs.io/en/stable/relativedelta.html).
For example `{'years': 1}` for an annual archive.

`start`: Optional. The first day for which an archive file should
exist, relative to today and expressed as named parameters for
[`dateutil.relativedelta.relativedelta()`](https://dateutil.readthedocs.io/en/stable/relativedelta.html).
So, for example, `{'year': 1960, 'month': 3, 'day': 5}` represents
1960-03-05, `{'years': -1}` represents today's date last year and
`{'day': 1}` represents the first day of the current month. Defaults to 1
January of the feed's `first_year`. Any existing archive files between 1
January of the feed's `first_year` and the value of `start` will be
deleted.
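
For reference, `relativedelta()` treats plural keyword arguments
(`years`, `months`, `days`) as offsets from the date it is applied to,
and singular ones (`year`, `month`, `day`) as absolute replacements,
which is what makes the examples above behave as described:

```python
import datetime
from dateutil.relativedelta import relativedelta

today = datetime.date.today()

# Plural arguments are relative offsets: today's date last year
print(today + relativedelta(years=-1))

# Singular arguments are absolute replacements: first day of the current month
print(today + relativedelta(day=1))

# A full set of singular arguments pins a fixed date: 1960-03-05
print(today + relativedelta(year=1960, month=3, day=5))
```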

`end`: Optional. The last day for which an archive file should exist,
expressed as above for `start`. Defaults to yesterday. Any existing
archive files between `end` and yesterday will be deleted.
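
Putting these keys together, a `DOWNLOAD_FEEDS` entry might look
something like the sketch below. The values shown (names, patterns,
destinations and extractor paths) are illustrative only and should be
checked against the real entries in `settings.py`:

```python
# Hypothetical example of a single feed definition in DOWNLOAD_FEEDS
DOWNLOAD_FEEDS = [
    {
        'name': 'parking',
        'title': 'Car park occupancy',
        'desc': 'Spaces occupied and free in Cambridge car parks.',
        'archive_by_default': True,
        'display': True,
        'first_year': 2017,
        'archives': [
            {
                'name': 'annual',
                'source_pattern': 'cam_park_rss/data_bin/{date:%Y}/*/*/*.json',
                'destination': 'parking/parking-{date:%Y}.zip',
                'extractor': 'api.extractors.parking.cam_park_rss_extractor',
                'step': {'years': 1},
            },
            {
                'name': 'monthly',
                'source_pattern': 'cam_park_rss/data_bin/{date:%Y}/{date:%m}/*/*.json',
                'destination': 'parking/parking-{date:%Y}-{date:%m}.zip',
                'extractor': 'api.extractors.parking.cam_park_rss_extractor',
                'step': {'months': 1},
                # First of the month, one year ago
                'start': {'years': -1, 'day': 1},
            },
        ],
        'metadata': {
            'name': 'metadata',
            'source_pattern': 'sys/parking_list.json',
            'destination': 'parking/parking-metadata.zip',
            'extractor': 'api.extractors.parking.cam_park_rss_metadata_extractor',
        },
    },
]
```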

`build_download_data`
=====================

By default, `build_download_data` manages archives for all feeds with
`archive_by_default` set to `True`. Alternatively a list of one or more
feeds to manage can be supplied on the command line.

In managing archives, `build_download_data` will create any that are
missing, update any for which there are source files with later
modification dates than the corresponding archive, and delete any for
which there is no data or which correspond to dates before the archive's
`start` or after its `end`. The command-line option `--force` will force
all existing archives to be updated irrespective of dates.
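
The effect is make-like: an archive is only rebuilt when it is missing
or out of date with respect to its source files. A much-simplified
sketch of that decision (not the actual implementation in
`build_download_data.py`) is:

```python
import os


def needs_rebuild(archive_path, source_files, force=False):
    '''Simplified sketch: decide whether one archive file needs (re)building.'''
    if not source_files:
        # No source data: the archive is deleted rather than built
        return False
    if force or not os.path.exists(archive_path):
        return True
    archive_mtime = os.path.getmtime(archive_path)
    # Rebuild if any source file is newer than the existing archive
    return any(os.path.getmtime(f) > archive_mtime for f in source_files)
```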

Extractor functions
===================

`build_download_data` uses 'extractor' functions to extract and format
data from each feed's data files. These can appear anywhere in the
Python import path but typically live in a file named after the
corresponding feed in `tfc_web/api/extractors` - e.g.
`tfc_web/api/extractors/parking.py`. Most feeds need a pair of
extractors - one for the data itself and one for the feed metadata from
`/media/tfc/sys/` - but this can vary (the 'aq' feed has two data
extractors, for example, and the 'bus' feed has no metadata).

Extractor functions receive a list of names of feed data files to
process and a Python CSV writer object as parameters. Their return value
is ignored. They are expected to write a header row to the CSV writer
and then to extract information from each file, manipulate it as needed,
and write it to the CSV writer.

See the `parking.py` extractor for a straightforward example.
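
As a further illustration, a minimal data extractor has roughly the
following shape. The feed, field names and JSON layout here are invented
for the example; a real extractor follows the structure of its own
feed's data files:

```python
import json
import logging

logger = logging.getLogger(__name__)


def example_extractor(files, writer):
    '''Hypothetical extractor: writes one CSV row per record in each JSON file.'''

    fields = ('sensor_id', 'ts', 'value')          # invented column names
    writer.writerow(fields)

    for file in files:
        logger.debug('Processing %s', file)
        with open(file) as reader:
            data = json.load(reader)
            for record in data['records']:         # invented JSON key
                writer.writerow([record.get(f) for f in fields])
```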

Adding a new data source
========================

Making a new data feed downloadable unfortunately needs changes in
several places (blame the system designer):

1. Create 'extractor' functions for the data.

2. Edit `settings.py` and add a new element to `DOWNLOAD_FEEDS` to
represent the new feed. Set `display` to `False` until you are ready to
publish the data. You may also want to set `archive_by_default` to
`False` initially.

3. Run `./manage.py build_download_data <feed name>` by hand and confirm
that appropriate archives are created.

4. Optionally add a file in `tfc_web/api/templates/api/` called `<feed
name>-schema.html` containing a description of the data and its format
(column names, units, etc).

5. Set `display` to `True` and confirm that
[https://smartcambridge.org/api/download/](https://smartcambridge.org/api/download/)
displays the feed as expected.

6. Set `archive_by_default` to `True` to automatically maintain the
archives into the future.
111 changes: 111 additions & 0 deletions tfc_web/api/extractors/aq.py
@@ -0,0 +1,111 @@

'''
The set of functions used by the build_download_data command to extract data
and store it in CSV files.
'''

import json
import logging

import dateutil.parser

from .util import epoch_to_text

logger = logging.getLogger(__name__)


# Data extractors receive a list of file names and a CSV writer object.
# They are expected to write appropriate headers to the CSV writer and
# then extract relevant fields from each file, format them as necessary
# and write the result to the CSV writer.


def aq_header_extractor(files, writer):

    logger.debug('In aq_header_extractor')

    fields = (
        'station_id', 'sensor_type', 'ts', 'ts_text',
        'BatteryVoltage', 'COFinal', 'COOffset', 'COPrescaled', 'COScaled',
        'COSerialNumber', 'COSlope', 'COStatus', 'GasProtocol', 'GasStatus',
        'Humidity', 'Latitude', 'Longitude', 'Name', 'NO2Final', 'NO2Offset',
        'NO2Prescaled', 'NO2Scaled', 'NO2SerialNumber', 'NO2Slope', 'NO2Status',
        'NOFinal', 'NOOffset', 'NOPrescaled', 'NOScaled', 'NOSerialNumber',
        'NOSlope', 'NOStatus', 'O3Final', 'O3Offset', 'O3Prescaled', 'O3Scaled',
        'O3SerialNumber', 'O3Slope', 'O3Status', 'OtherInfo', 'P1', 'P2', 'P3',
        'ParticleNumber', 'ParticleProtocol', 'ParticleStatus', 'PM10Final',
        'PM10Offset', 'PM10Output', 'PM10PreScaled', 'PM10Slope', 'PM1Final',
        'PM1Offset', 'PM1Output', 'PM1PreScaled', 'PM1Slope', 'PM2_5Final',
        'PM2_5Offset', 'PM2_5OutPut', 'PM2_5PreScaled', 'PM2_5Slope',
        'PMTotalOffset', 'PMTotalPreScaled', 'PMTotalSlope', 'PodFeaturetype',
        'Pressure', 'SerialNo', 'SO2Final', 'SO2Offset', 'SO2Prescaled',
        'SO2Scaled', 'SO2SerialNumber', 'SO2Slope', 'SO2Status', 'T1', 'T2', 'T3',
        'Temperature', 'TSP'
    )

    writer.writerow(fields)

    for file in files:
        logger.debug('Processing %s', file)
        with open(file) as reader:
            data = json.load(reader)
            header = data['Header']
            # Capitalisation of 'StationID' inconsistent
            try:
                station_id = str(header['StationID'])
            except KeyError:
                station_id = str(header['StationId'])
            if not station_id.startswith('S-'):
                station_id = 'S-' + station_id
            header['station_id'] = station_id
            header['sensor_type'] = data.get('SensorType')
            ts = dateutil.parser.parse(header['Timestamp']).timestamp()
            header['ts'] = ts
            header['ts_text'] = epoch_to_text(ts)
            writer.writerow([header.get(f) for f in fields])


def aq_data_extractor(files, writer):

    logger.debug('In aq_data_extractor')
    fields = ('station_id', 'sensor_type', 'ts', 'ts_text', 'reading')
    writer.writerow(fields)

    for file in files:
        logger.debug('Processing %s', file)
        with open(file) as reader:
            data = json.load(reader)
            for ts, reading in data['Readings']:
                # Capitalisation of 'StationID' inconsistent
                try:
                    station_id = str(data['Header']['StationID'])
                except KeyError:
                    station_id = str(data['Header']['StationId'])
                if not station_id.startswith('S-'):
                    station_id = 'S-' + station_id
                row = [
                    station_id,
                    data['SensorType'],
                    ts,
                    epoch_to_text(ts/1000),
                    reading
                ]
                writer.writerow(row)


# Metadata extractors for each storage type. They receive a single filename
# in 'files' and a CSV writer object.


def aq_metadata_extractor(files, writer):

    logger.debug('In aq_metadata_extractor')
    fields = ('station_id', 'Name', 'Description', 'SensorTypes', 'Latitude', 'Longitude', 'date_from', 'date_to')
    writer.writerow(fields)

    assert len(files) == 1, 'Expecting exactly one file'
    logger.debug('Processing %s', files[0])
    with open(files[0]) as reader:
        for station in json.load(reader)['aq_list']:
            station['station_id'] = station['StationID']
            station['SensorTypes'] = '|'.join(station['SensorTypes'])
            writer.writerow([station.get(f) for f in fields])
39 changes: 39 additions & 0 deletions tfc_web/api/extractors/bus.py
@@ -0,0 +1,39 @@

'''
The set of functions used by the build_download_data command to extract data
and store it in CSV files.
'''

import json
import logging

from .util import epoch_to_text

logger = logging.getLogger(__name__)


# Data extractors receive a list of file names and a CSV writer object.
# They are expected to write appropriate headers to the CSV writer and
# then extract relevant fields from each file, format them as necessary
# and write the result to the CSV writer.


def bus_extractor(files, writer):

    logger.debug('In bus_extractor')
    fields = (
        'ts', 'ts_text', 'VehicleRef', 'LineRef', 'DirectionRef',
        'OperatorRef', 'OriginRef', 'OriginName', 'DestinationRef',
        'DestinationName', 'OriginAimedDepartureTime', 'Longitude',
        'Latitude', 'Bearing', 'Delay'
    )
    writer.writerow(fields)

    for file in files:
        logger.debug('Processing %s', file)
        with open(file) as reader:
            data = json.load(reader)
            for record in data['request_data']:
                record['ts'] = record['acp_ts']
                record['ts_text'] = epoch_to_text(record['acp_ts'])
                writer.writerow([record.get(f) for f in fields])
50 changes: 50 additions & 0 deletions tfc_web/api/extractors/parking.py
@@ -0,0 +1,50 @@

'''
The functions used by the build_download_data command to extract car
park data and store it in CSV files.
'''

import json
import logging

from .util import epoch_to_text

logger = logging.getLogger(__name__)


# Data extractors receive a list of file names and a CSV writer object.
# They are expected to write appropriate headers to the CSV writer and
# then extract relevant fields from each file, format them as necessary
# and write the result to the CSV writer.


def cam_park_rss_extractor(files, writer):

    logger.debug('In cam_park_rss_extractor')
    fields = ('parking_id', 'ts', 'ts_text', 'spaces_capacity', 'spaces_occupied', 'spaces_free')
    writer.writerow(fields)

    for file in files:
        logger.debug('Processing %s', file)
        with open(file) as reader:
            for line in reader:
                data = json.loads(line)
                data['ts_text'] = epoch_to_text(data['ts'])
                writer.writerow([data.get(f) for f in fields])


# Metadata extractors for each storage type. They receive a single filename
# in 'files' and a CSV writer object.

def cam_park_rss_metadata_extractor(files, writer):

    logger.debug('In cam_park_rss_metadata_extractor')
    fields = ('parking_id', 'parking_name', 'parking_type', 'latitude', 'longitude')
    writer.writerow(fields)

    assert len(files) == 1, 'Expecting exactly one file'
    logger.debug('Processing %s', files[0])
    with open(files[0]) as reader:
        data = json.load(reader)
        for carpark in data['parking_list']:
            writer.writerow([carpark.get(f) for f in fields])