A software suite that enables remote extraction, transformation and loading of data.
This repository is geared heavily towards drawing articles from PubMed, identifying scientific articles containing information about biological pathways, and loading the records into a data store.
- Python (version >=3.8, <3.10)
- Poetry (version >=1.5.0)
- Docker (version 20.10.14) and Docker Compose (version 2.5.1)
  - We use Docker to create a RethinkDB (v2.3.6) instance for loading data.
- Miniconda (Optional)
  - For creating virtual environments. Any other solution will work.
- Graphics Processing Unit (GPU) (Optional)
  - The pipeline classifier can be sped up by an order of magnitude when run on a system with a GPU. We have been using a system running Ubuntu 18.04.5 LTS with a 24-core Intel(R) Xeon(R) CPU E5-2687W and an NVIDIA GP102 [TITAN Xp] GPU.
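Once a RethinkDB instance is up (for example via the repository's Docker/Compose setup, or the official `rethinkdb:2.3.6` image with client port 28015 published), a quick way to confirm it is reachable is a short check with the official RethinkDB Python driver. This is only a hedged sketch: the driver is a separate `pip install rethinkdb` and is not necessarily a dependency of this project.

```python
# Hedged connectivity check: assumes `pip install rethinkdb` (the official
# Python driver) and a RethinkDB instance listening on the default client
# port 28015, e.g. one started from the repository's Docker setup.
from rethinkdb import RethinkDB

r = RethinkDB()
conn = r.connect(host="localhost", port=28015)
print(r.db_list().run(conn))   # a fresh instance typically shows ['rethinkdb', 'test']
conn.close()
```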
Create a conda environment, here named `pipeline`:
$ conda create --name pipeline python=3.8 --yes
$ conda activate pipeline
Clone the remote repository:
$ git clone https://github.com/jvwong/classifier-pipeline
$ cd classifier-pipeline
Install the dependencies:
$ poetry install
To start up the server:
$ uvicorn classifier_pipeline.main:app --port 8000 --reload
- uvicorn options
  - `--reload`: Enable auto-reload.
  - `--port INTEGER`: Bind socket to this port (default 8000).
And now, go to http://127.0.0.1:8000/redoc (swap out the port if necessary) to see the automatic documentation.
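The automatic `/redoc` page suggests the app is built with FastAPI, which by default also serves the raw schema at `/openapi.json`. A small, hedged way to check that the server is up and list its documented paths:

```python
# Hedged sketch: assumes the running app is a FastAPI application serving its
# OpenAPI schema at the default /openapi.json path. Uses only the standard library.
import json
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:8000/openapi.json") as resp:
    schema = json.load(resp)

print(schema.get("info", {}).get("title"))   # API title
print(sorted(schema.get("paths", {})))       # the endpoints documented at /redoc
```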
Launch a pipeline to process daily updates from PubMed and dump the RethinkDB database:
$ ./scripts/cron/install.sh
The `scripts` directory contains python files that chain functions in `classifier_pipeline` to:

- read in data from
  - csv, via stdin (`csv2dict_reader`)
  - daily PubMed updates (`updatefiles_extractor`)
- retrieve records/files from PubMed (`pubmed_transformer`)
- apply various filters on the individual records (`citation_pubtype_filter`, `citation_date_filter`)
- apply a deep-learning classifier to text fields (`classification_transformer`)
- load the formatted data into a RethinkDB instance (`db_loader`)
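The exact signatures live in the `classifier_pipeline` package itself; purely as an illustration of how these stages fit together (the function names come from this README, but the module layout, parameters, and calling conventions below are assumptions), a chained run over a csv piped in on stdin might look roughly like:

```python
# Illustrative sketch only: the stage names are taken from this README; how
# they are imported and parameterized here is guessed, not the package's API.
import sys

from classifier_pipeline import (  # assumed module layout
    csv2dict_reader,
    pubmed_transformer,
    citation_pubtype_filter,
    citation_date_filter,
    classification_transformer,
    db_loader,
)

records = csv2dict_reader(sys.stdin)                 # csv rows -> dicts
records = pubmed_transformer(records)                # fetch full PubMed records
records = citation_pubtype_filter(records)           # drop unwanted publication types
records = citation_date_filter(records, min_year=2020)        # hypothetical argument
records = classification_transformer(records, threshold=0.5)  # hypothetical argument
db_loader(records, table="documents")                # hypothetical argument
```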
- Pipelines are launched through bash scripts that retrieve PubMed article records in two ways:
  - `./scripts/cron/cron.sh`: retrieves all new content via the FTP file server
  - `./scripts/csv/pmids.sh`: retrieves records using the NCBI E-Utilities, given a set of PubMed IDs
- Variables
  - `DATA_DIR`: root directory where your data files exist
  - `DATA_FILE`: name of the csv file in your `DATA_DIR`
  - `ARG_IDCOLUMN`: the csv header column name containing either
    - a list of update files to extract (`dailyupdates.sh`)
    - a list of PubMed IDs to extract (`pmids.sh`)
  - `JOB_NAME`: the name of this pipeline job
  - `CONDA_ENV`: the conda environment name you declared in the first steps (here, `pipeline`)
  - `ARG_TYPE`:
    - use `fetch` for downloading individual PubMed IDs
    - use `download` to retrieve FTP update files
  - `ARG_MINYEAR`: articles published in years before this will be filtered out (optional)
  - `ARG_TABLE`: the name of the table to dump results into
  - `ARG_THRESHOLD`: the lowest probability at which an article is classified as 'positive' using pathway-abstract-classifier
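As a rough guide only (the bash scripts may pass these values internally rather than through the environment), the variables map onto the pipeline stages roughly as follows:

```python
# Hedged sketch of which variable plausibly feeds which stage; this is not
# necessarily the scripts' actual configuration mechanism.
import os
from pathlib import Path

data_path = Path(os.environ["DATA_DIR"]) / os.environ["DATA_FILE"]  # input csv
id_column = os.environ["ARG_IDCOLUMN"]              # column of PubMed IDs or update-file names
min_year = int(os.environ.get("ARG_MINYEAR", "0"))  # feeds the citation_date_filter step (optional)
threshold = float(os.environ["ARG_THRESHOLD"])      # feeds the classification_transformer step
table = os.environ["ARG_TABLE"]                     # feeds the db_loader step
```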
There is a convenience script that can be launched:
$ ./test.sh
This will run the tests in `./tests`, lint with flake8, and type check with mypy.