# Adding new data to the graph

## Summary of the main steps

- Preliminary exploration of the new data (see below)
- Create the source and processing scripts (see below)
- Update db_schema and data_integration (see below)
- Create test/source_data/FOLDER for the new dataset
- Test that the source script runs OK locally
- Test the processing script steps (up until the import function)
- Commit to a dev-$USER branch

## Preliminary data exploration recommendations

This can be done locally in notebooks.

- Do data checks against the graph, e.g. in a Jupyter notebook
- Create and test the source script locally (it should work without the graph, unless graph access is required)
- Manage the source and processing scripts within the repo codebase (in workflow/scripts) and modify the ymls (in config/) to check that everything syncs

## Creating the scripts/ymls

- data_integration.yml
  - modify the version in config/, not test/config!
  - _files_: this has to match the name of the main output file from the source script
  - _script_: the name of the processing script
  - _source_: not used during the build, but should be descriptive, as it will be stored in the node/rel property
  - the node/relationship name (top line) will be used as the input parameter to run the processing script (see the first sketch after this list)
- db_schema.yml
  - modify the version in config/, not test/config!
  - specify which nodes to use (when adding a rel)
  - specify any other properties that the processing script retains in the dataframe to be imported into the graph (see the second sketch after this list)
- Source script
  - in workflow/scripts/source/
  - Save the raw data from the source in workflow/source_data/FOLDER (as a backup)
  - The script should download the latest version of the dataset from the source (if possible)
  - The script should do all filtering/mapping and output a file that will be read by the processing script
- Processing script
  - in workflow/scripts/processing/
  - Read in the main file created by the source script
  - The script may need to rename/subset columns to those defined in db_schema for this dataset (i.e. what the graph expects to import)
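To make the data_integration.yml fields concrete, here is a minimal sketch of what an entry might look like (the first sketch referenced above). The field names (_files_, _script_, _source_) come from the notes above, but the entry name, file names, and exact nesting are hypothetical, so check them against the existing entries in config/data_integration.yml.

```yaml
# Hypothetical entry in config/data_integration.yml -- all names below are
# invented for illustration; mirror the existing entries for the exact layout.
GENE_TO_PROTEIN:                      # node/rel name; passed as -n to the processing script
  files: gene_to_protein.csv          # must match the main output file of the source script
  script: gene_to_protein.py          # the processing script
  source: "ExampleDB v1.2 (2024-01)"  # descriptive only; stored as a node/rel property
```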
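And a similarly hedged sketch of a db_schema.yml entry for the same hypothetical relationship (the second sketch referenced above), covering the two points from the notes: which nodes the rel uses, and which retained dataframe properties get imported. Again, verify the layout against config/db_schema.yml before use.

```yaml
# Hypothetical entry in config/db_schema.yml -- the layout is a guess based on
# the notes above; confirm against existing entries before use.
GENE_TO_PROTEIN:
  # which nodes to use (when adding a rel)
  source: Gene
  target: Protein
  # other properties the processing script retains in the dataframe
  properties:
    - score
    - source
```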

## Building the graph on the server

There are 3 steps:

1. build the graph as is (optional; to eliminate any issues prior to adding new data)
2. add the data (using the scripts + ymls)
3. rebuild the graph with the new data

### Prep

- The graph build has to happen on jojo (or another server)
- Source your bashrc and run conda activate neo4j_build
- Create the folder workflow/source_data/FOLDER
- When running the source script, modify DATA_DIR in .env to point at the local source_data folder (see the sketch below)
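As a convenience, a minimal sketch of the prep commands, assuming a standard bash/conda setup; FOLDER is the placeholder used throughout this guide and the DATA_DIR value is only an example.

```sh
# Illustrative prep on the build server; adjust paths to your checkout.
source ~/.bashrc
conda activate neo4j_build
mkdir -p workflow/source_data/FOLDER

# In .env, point DATA_DIR at the local source_data folder, e.g.:
# DATA_DIR=workflow/source_data
```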

### Step 1

```sh
snakemake -r clean_all -j 1
snakemake -r all -j 4
```

### Step 2

Assuming the scripts and ymls are created and locally tested, run:

```sh
# run the source script
python -m workflow.scripts.source.SOURCE_SCRIPT

# run the processing script (-n is the rel name from data_integration.yml)
python -m workflow.scripts.processing.rels.PROCESSING_SCRIPT -n REL_NAME -d workflow/source_data/

# check the new data
snakemake -r check_new_data -j 10
```

### Step 3

```sh
snakemake -r all -j 4
```