Data pipeline for REDSS/CSAP.
This pipeline fetches all project data from a Rapid Pro instance, and processes it to produce CSV files suitable for downstream analysis.
Before the pipeline can be run, the following tools must be installed:
- Docker
- Bash
Development requires the following additional tools:
- Python 3.6+
- pipenv
- git
Running the pipeline requires (1) creating a phone number <-> UUID table to support de-identification of respondents, (2) fetching all the relevant data from Rapid Pro, and (3) processing the raw data to produce the outputs required for coding and then for analysis.
To simplify the configuration and execution of these stages, this project includes a run_scripts directory, which contains shell scripts for driving each of the stages.
More detailed descriptions of the functions of each of those stages, and instructions for using the run scripts, are provided below.
First, create an empty phone number <-> UUID table by running the following command in the run_scripts directory:

$ ./1_create_uuid_table.sh <data-root>

where data-root is an absolute path to the directory in which all pipeline data should be stored.
The UUID table will be saved to a file in the directory <data-root>/UUIDs.
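For example, if all pipeline data is to be stored under the hypothetical path /home/pipeline-user/csap-data:

$ ./1_create_uuid_table.sh /home/pipeline-user/csap-data

The empty table would then be written to a file in /home/pipeline-user/csap-data/UUIDs.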
Next, fetch all the raw data required by the pipeline from Rapid Pro by running the following command in the run_scripts directory (an example invocation is given after the argument descriptions below):

$ ./2_fetch_raw_data.sh <user> <rapid-pro-root> <rapid-pro-server> <rapid-pro-token> <data-root>

where:
- user is the identifier of the person running the script, for use in the TracedData Metadata (e.g. an email address).
- rapid-pro-root is an absolute path to the directory to store a local clone of RapidProTools in. The RapidProTools project hosts the re-usable RapidPro data fetchers. The exact version required by this project is checked out automatically.
- rapid-pro-server is the root address of the RapidPro server to retrieve data from, e.g. https://app.rapidpro.io.
- rapid-pro-token is the access token for this instance of RapidPro. The access token may be found by logging into RapidPro's web interface, navigating to your organisation page (via the button in the top-right), then copying the hexadecimal string given after "Your API Token is ...".
- data-root is an absolute path to the directory in which all pipeline data should be stored. Raw data will be saved to TracedData JSON files in <data-root>/Raw Data.
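For example, assuming pipeline data lives under the hypothetical path /home/pipeline-user/csap-data and RapidProTools should be cloned to /home/pipeline-user/RapidProTools, a fetch might look like this (the user identifier and access token shown are placeholders):

$ ./2_fetch_raw_data.sh user@example.org /home/pipeline-user/RapidProTools \
      https://app.rapidpro.io <your-rapid-pro-token> /home/pipeline-user/csap-data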
Finally, process the raw data to produce outputs for The Interface, ICR, Coda, and messages/individuals/production CSVs for final analysis, by running the following command in the run_scripts directory (an example invocation is given after the argument descriptions below):

$ ./3_generate_outputs.sh [--drive-upload <drive-service-account-credentials-url> <drive-upload-dir>] <user> <data-root>

where:
- --drive-upload is an optional flag for uploading the messages, individuals, and production CSVs to Drive. If this flag is set, pass the arguments:
  - drive-service-account-credentials-url, a gs URL to the private credentials file of a Google Drive service account. This service account will be used to upload outputted data for analysis to a directory on Google Drive.
  - drive-upload-dir, the path to a directory in Google Drive to upload the messages, individuals, and production CSVs to. Before files can be uploaded to a directory, the directory must be shared with the service account's email address (which can be found in the client_email field of the service account's credentials file).
- user is the identifier of the person running the script, for use in the TracedData Metadata (e.g. an email address).
- data-root is an absolute path to the directory in which all pipeline data should be stored. Updated Coda files containing new data to be coded will be saved in <data-root>/Raw Data. All other output files will be saved in <data-root>/Outputs.
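For example, using the same hypothetical data root as above, a hypothetical credentials URL of gs://avf-credentials/drive-service-account.json, and a hypothetical Drive directory named "CSAP Analysis CSVs":

$ ./3_generate_outputs.sh --drive-upload gs://avf-credentials/drive-service-account.json \
      "CSAP Analysis CSVs" user@example.org /home/pipeline-user/csap-data

To skip the Drive upload, omit the --drive-upload flag and its two arguments. If a local copy of the service account's credentials file is available, the client_email field (the address to share the Drive directory with) can be read with, for example:

$ python3 -c "import json; print(json.load(open('drive-service-account.json'))['client_email'])"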
As well as uploading the messages, individuals, and production CSVs to Drive, this stage outputs the following to <data-root>/Outputs:
- Local copies of the messages, individuals, and production CSVs (csap_mes.csv, csap_ind.csv, csap_prod.csv)
- A serialized export of the list of TracedData objects representing all the data that was exported for analysis (traced_data.json)
- For each week of radio shows, a random sample of 200 messages that weren't classified as noise, for use in ICR (ICR/)
- Coda V2 messages files for each dataset (Coda Files/<dataset>.json). To upload these datasets to Coda, use the set.py or add.py tools in the Coda V2 repository.
To make coded data available to the pipeline, use the get.py script in CodaV2/data_tools to export coded messages to <data-root>/Coded Coda Files/<dataset>.json.
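Taken together, the stages above read and write data under the following sub-directories of <data-root> (a recap of the locations described in this document):
- <data-root>/UUIDs: the phone number <-> UUID table used for de-identification.
- <data-root>/Raw Data: TracedData JSON files fetched from Rapid Pro, plus updated Coda files containing new data to be coded.
- <data-root>/Coded Coda Files: coded messages exported from Coda using get.py.
- <data-root>/Outputs: the messages, individuals, and production CSVs, traced_data.json, ICR samples, and Coda V2 messages files.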
To run the main processing stage with statistical CPU profiling enabled, pass the argument --profile-cpu <profile-output-file> to run_scripts/3_generate_outputs.sh.
The output file is generated by the statistical profiler Pyflame, and is in a format suitable for visualisation using FlameGraph.
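For example, to profile a run of the output-generation stage and then render the result as an interactive SVG using flamegraph.pl from the FlameGraph project (the profile file name here is arbitrary, and the flag is assumed to be passed before the positional arguments, as with --drive-upload):

$ ./3_generate_outputs.sh --profile-cpu cpu_profile.txt <user> <data-root>
$ flamegraph.pl cpu_profile.txt > cpu_profile.svg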