The aim of this application is to allow the reproducibility of the Open Targets Platform data release pipeline. The input files are copied to a local hard disk and, optionally, to a specific Google Storage bucket.
Currently, the application executes the steps listed below and finally generates the input resources for the ETL pipeline (https://github.com/opentargets/data_pipeline).
List of available steps:
- BaselineExpression
- Disease
- Drug
- Evidence
- Expression
- Go
- Homologues
- Interactions
- Literature
- Mousephenotypes
- OpenFDA
- Otar
- Pharmacogenomics
- PPPEvidence
- Reactome
- SO
- Target
- TargetEngine
Within this application you can simply download a file from FTP, HTTP or a Google Cloud bucket, but the same file can also be processed in order to generate a new resource. The final files are located under the output directory, while the files used for the computation are saved under stages.
Below are more details on how to execute the script.
Requires Google Cloud SDK and docker (if running locally)
- Set the values in the config or copy it to profiles/config.<profile_name>
- Note: you're unlikely to need to change anything other than the PIS Config section and maybe the image tag.
- Set the profile to the one you need, e.g.
make set_profile profile=dev
- Launch remotely with PIS arguments, e.g. (steps for the public release):
make launch_remote args='-exclude otar pppevidence'
- This deploys a VM on GCP, runs PIS with the given args, and uploads the PIS output, configs and logs to the configured GCS bucket.
- To follow the logs while the VM is up, run the following:
gcloud compute ssh --zone "<pisvm.zone>" "<pisvm.username>@<pisvm.name>" --project "open-targets-eu-dev" --command='journalctl -u google-startup-scripts.service -f'
- Or launch locally with PIS arguments, e.g. (steps for the public release):
make launch_local args='-exclude otar pppevidence'
- This runs PIS with the args given, uploads the PIS output, configs and logs to the configured GCS bucket.
- If you launched PIS remotely, you'll want to tear down the infrastructure after it's finished:
make clean_infrastructure
- If you want to tear down all infrastructure from previous runs:
make clean_all_infrastructure
- The session configs are stored locally in "sessions/<session_id>". A helper make target allows you to clear this:
make clean_session_metadata
or for all sessions:
make clean_all_sessions_metadata
- Python3.8
- Poetry
- Apache-Jena
- git
- jq
- Google Cloud SDK
Installation:
curl -sSL https://install.python-poetry.org | python3 -
Platform Input Support is available as a Docker image; a Dockerfile is provided for building the container image.
For instance:
mkdir /tmp/pis
sudo docker run \
  -v path_with_credentials:/usr/src/app/cred \
  -v /tmp/pis:/usr/src/app/output \
  pis_docker_image \
  -steps Evidence \
  -gkey /usr/src/app/cred/open-targets-gac.json \
  -gb ot-team/pis_output/docker
When providing a Google Cloud key file, please make sure that it is mounted within the container at this exact mount point:
/srv/credentials/open-targets-gac.json
This allows the service account to be activated when running the container image, so that the Google Cloud SDK can work with the destination Google Cloud Storage bucket.
To use an external config file, simply add the option -c followed by the path to the config file.
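For example, assuming a custom configuration has been saved as my_config.yaml (the file name here is just a placeholder):
poetry run python3 platform-input-support.py -c my_config.yaml -steps drug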
This is an example of how to run Singularity using the Docker image.
singularity exec \
-B /nfs/ftp/private/otftpuser/output_pis:/usr/src/app/output \
docker://quay.io/opentargets/platform-input-support:cm_singularity \
conda run --no-capture-output -n pis-py3.8 python3 /usr/src/app/platform-input-support.py -steps drug
In order to run PIS inside the current EBI infrastructure, the best practice is to use Singularity and LSF.
First of all, check the Google Cloud account rights for the proper project:
ls -la /homes/$USER/.config/gcloud/application_default_credentials.json
If the file above does not exist, run the following commands to generate it:
gcloud config set project open-targets-eu-dev
or
gcloud config set project open-targets-prod
gcloud auth application-default login
You can use singularity/ebi.sh to run PIS inside the EBI infrastructure:
./singularity/ebi.sh docker_tag_image step google_storage_path
where
- docker_tag_image: docker image tag (quay.io) (under review)
- step : Eg. drug
- google_storage_path: GCS bucket path [optional]
./singularity/ebi.sh 21.04 drug
./singularity/ebi.sh 21.04 drug ot-snapshots/21.04/input
cd ~
wget -O apache-jena.tar.gz https://www.mirrorservice.org/sites/ftp.apache.org/jena/binaries/apache-jena-4.2.0.tar.gz
tar xvf apache-jena.tar.gz --one-top-level=apache-jena --strip-components 1 -C /usr/share/
Add the apache-jena/bin directory to your PATH in .bashrc, e.g.
export PATH="$PATH:/your_path/apache-jena/bin"
source .bashrc
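To verify that the Jena command line tools are on the PATH (a quick sanity check, not required by PIS itself):
riot --version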
The config.yaml file contains several sections. Most of the sections are used by the steps to download, extract and manipulate the inputs and generate the proper output.
The config section can be used to specify where the utility programs riot and jq are installed on the system. If the shutil.which lookup fails during PIS execution, double check these configurations. The "java_vm" parameter sets up the JVM heap environment variable for the execution of the riot command.
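As an illustration, the config section might look roughly like the snippet below. Only java_vm is named above; the other key names, paths and heap size are assumptions, so check config.yaml in the repository for the real ones.
config:
  riot: /usr/share/apache-jena/bin/riot   # assumed key: location of the riot binary
  jq: /usr/bin/jq                         # assumed key: location of the jq binary
  java_vm: -Xmx8g                         # JVM heap setting used when riot is executed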
The majority of the configuration file specifies metadata for the execution of individual steps, e.g. ensembl or drug. These can be recognised as they all have a gs_output_dir key which indicates where the files will be saved in GCP. There is some inconsistency between keys in different steps, as they accommodate very different input types and required levels of configuration. See the step guides for the configuration requirements of specific steps.
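A hypothetical step entry might therefore look like the sketch below; apart from gs_output_dir, every key and value here is an illustrative assumption rather than the repository's actual configuration.
somestep:
  gs_output_dir: somestep              # common to all steps: destination folder in GCP
  resources:                           # assumed key: step-specific inputs vary widely
    - uri: https://example.org/data.tsv
      output_filename: data-{suffix}.tsv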
We only build the functionality we need, which introduces some limitations. At present, if a zip file is downloaded we only extract one file from the archive. To configure a zip file, create an entry in the config such as:
- uri: https://probeminer.icr.ac.uk/probeminer_datadump.zip
output_filename: probeminer-datadump-{suffix}.tsv.zip
unzip_file: true
resource: probeminer
The unzip_file flag tells RetrieveResource.py to treat the file as an archive. The uri field indicates from where to download the data. The archive will be saved under output_filename. The first element of the archive will be extracted under output_filename with the suffix '[gz|zip]' removed, so in this case, probeminer-datadump-{suffix}.tsv.
git clone https://github.com/opentargets/platform-input-support
cd platform-input-support
poetry install --no-root
poetry run python3 platform-input-support.py -h
cd your_path_application
poetry run python3 platform-input-support.py -h
usage: platform-input-support.py [-h] [-c CONFIG]
[-gkey GCP_CREDENTIALS]
[-gb GCP_BUCKET] [-o OUTPUT_DIR] [-t]
[-s SUFFIX] [-steps STEPS [STEPS ...]]
[-exclude EXCLUDE [EXCLUDE ...]]
[-l] [--log-level LOG_LEVEL]
[--log-config LOG_CONFIG]
...
optional arguments:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
path to config file (YAML) [env var: CONFIG] (default:
None)
-gkey GOOGLE_CREDENTIAL_KEY, --gcp_credentials GOOGLE_CREDENTIAL_KEY
The path where the JSON credential file is stored. [env
var: GCP_CREDENTIALS] (default: None)
-gb GCP_BUCKET, --gcp_bucket GCP_BUCKET
Copy the files from the output directory to a specific
google bucket (default: None)
-o OUTPUT_DIR, --output OUTPUT_DIR
By default, the files are generated in the root
directory [env var: OT_OUTPUT_DIR] (default: None)
-s SUFFIX, --suffix SUFFIX
The default suffix is yyyy-mm-dd [env var:
OT_SUFFIX_INPUT] (default: None)
-steps STEPS [STEPS ...]
Run a specific list of sections of the config file. Eg
annotations annotations_from_buckets (default: None)
-exclude EXCLUDE [EXCLUDE ...]
Exclude a specific list of sections of the config
file. Eg annotations annotations_from_buckets
(default: None)
-l, --list_steps List of steps callable (default: False)
--log-level LOG_LEVEL
set the log level [env var: LOG_LEVEL] (default: INFO)
If you want to run PIS in a Docker container follow these steps:
- Get the code
git clone https://github.com/opentargets/platform-input-support
- Create the container
docker build --tag <image tag> <path to Dockerfile>
- Start the container, mounting the cloned code as a volume (assuming you cloned the code into your home directory)
docker run -v ~/platform-input-support:/usr/src/app --rm -it --entrypoint bash <image tag>
This command will drop you into a bash shell inside the container, where you can execute the code.
- Execute the code
poetry run python3 platform-input-support.py -steps <step> --log-level=DEBUG
The directory "resources" contains the file logging.ini with a list of default value. If the logging.ini is not available or the user removes it than the code sets up a list of default parameters. In both case, the log output file is store under "log"
To copy the files to a specific Google Storage bucket, valid credentials must be used. The required parameter -gkey (--gcp_credentials) allows specifying the Google Storage JSON credential file. E.g.
python platform-input-support.py -gkey /path/open-targets-gac.json -gb bucket/object_path
or
python platform-input-support.py \
  --gcp_credentials /path/open-targets-gac.json \
  --gcp_bucket ot-snapshots/es5-sufentanil/tmp
or
python platform-input-support.py \
  --gcp_credentials /path/open-targets-gac.json \
  --gcp_bucket ot-snapshots/es5-sufentanil/tmp \
  -steps annotations evidence \
  -exclude drug
or
python platform-input-support.py \
  -gkey /path/open-targets-gac.json \
  -gb bucket/object_path -steps drug \
  --log-level DEBUG > log.txt
The zip files generated might be corrupted. The following command checks whether the files are correct:
sh check_corrupted_files.sh
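If you want to spot-check a few archives by hand, something along the following lines also works; this is a minimal sketch that assumes the files live under output/ and is not the repository script.
find output -name '*.gz' | while read -r f; do gzip -t "$f" || echo "corrupted: $f"; done
find output -name '*.zip' | while read -r f; do unzip -tq "$f" >/dev/null || echo "corrupted: $f"; done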
Create a linux VM server and run the following commands
sudo apt update
sudo apt install git
sudo apt-get install bzip2 wget
wget https://repo.anaconda.com/archive/Anaconda3-2020.07-Linux-x86_64.sh
bash Anaconda3-2020.07-Linux-x86_64.sh
source ~/.bashrc
mkdir gitRepo
cd gitRepo
git clone https://github.com/opentargets/platform-input-support.git
cd platform-input-support
curl -sSL https://install.python-poetry.org | python3 -
poetry install --no-root
poetry run python3 platform-input-support.py -l
Use nohup so that the process is not terminated when the session is closed, e.g.
nohup poetry run python3 platform-input-support.py \
  -gkey /path/open-targets-gac.json \
  -gb bucket/object_path -steps drug \
  --log-level DEBUG > log.txt &
The program is broken into steps such as drug, interactions, etc. Each step can be configured as necessary in the config file and run using command line arguments.
The EFO step is used to gather the raw data for the platform ETL.
EFO is the core ontology for Open Targets; its scope is to support the annotation, analysis and visualisation of the data handled by the platform.
This step downloads and manipulates the input files and generates the following outputs:
- ontology-efo-v3.xx.yy.jsonl : list of EFO terms
- ontology-mondo.jsonl : list of MONDO terms
- ontology-hpo.jsonl : list of HPO terms
- hpo-phenotypes-yyyy-mm-dd.jsonl : mapping between CrossReference databaseId
The Drug step is used to gather the raw data for the platform ETL.
ChEMBL have made an Elasticsearch instance available for querying. To keep data volumes and running times down, specify the index and the fields which are required in the config file.
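Schematically, the drug step configuration has roughly the shape sketched below; the key names, index and field names are illustrative assumptions, so refer to config.yaml for the real values.
drug:
  gs_output_dir: drug
  datasources:                      # assumed key: one entry per ChEMBL index to query
    - index: chembl_molecule        # Elasticsearch index name (assumption)
      fields:                       # only these fields are downloaded
        - molecule_chembl_id
        - pref_name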
This step is used to download raw JSON data from the Ensembl FTP server for specified species. These inputs are then processed with jq to extract the id and name fields which are required by the ETL. The downloaded JSON files approach 30GB in size, and we only extract ~10MB from them. It is worth considering whether we want to retain these files long-term.
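As an illustration of the kind of jq extraction involved (the exact filter and file names are defined in config.yaml, so treat this as an assumption):
jq -c '.genes[] | {id: .id, name: .name}' homo_sapiens.json > homo_sapiens-genes.jsonl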
This step is used to download data from the internal OT resources and the Protein Atlas information. The tissue translation map requires some manipulation due to its unusual JSON format. All the files generated are required by the ETL.
This step collects adverse events of interest from OpenFDA FAERS. It requires two input parameters via the configuration file:
- URL of the events blacklist.
- URL where to find the OpenFDA FAERS repository metadata in JSON format.
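Schematically, the step configuration therefore carries two URLs, along the lines of the sketch below; the key names are assumptions and the URLs are only examples, so check config.yaml for the real entries.
openfda:
  gs_output_dir: fda
  blacklisted_events_url: https://example.org/blacklisted_events.txt   # events blacklist
  fda_metadata_url: https://api.fda.gov/download.json                  # OpenFDA FAERS repository metadata (JSON)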
This step collects two kinds of information on OTAR Projects, used in the internal platform:
- Metadata information on the projects.
- Mapping information between the projects and diseases.
platform-input-support is the entrypoint to the program; it loads the config.yaml which specifies the available steps and any necessary configuration that goes with them. This configuration is represented internally as a dictionary. Platform Input Support configures a RetrieveResource object and calls the run method, triggering the selected steps. RetrieveResource will consult the steps selected and trigger a method for each selected step. Most steps will defer to a helper object in Modules to retrieve the selected resources.
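The Python sketch below is a schematic toy illustration of that control flow; it is not the repository's actual code, and everything except the names config.yaml, RetrieveResource, run and the idea of helper modules is hypothetical.
import yaml

# toy stand-in for the configuration normally loaded from config.yaml
CONFIG_YAML = """
drug:
  gs_output_dir: drug
"""

class RetrieveResource:
    def __init__(self, yaml_dict, steps):
        self.yaml = yaml_dict   # config.yaml represented internally as a dictionary
        self.steps = steps      # steps selected on the command line

    def run(self):
        # trigger one method per selected step; the real handlers defer to helpers in modules/
        for step in self.steps:
            getattr(self, f"get_{step}")(self.yaml.get(step, {}))

    def get_drug(self, step_config):
        print("running the drug step with", step_config)

if __name__ == "__main__":
    RetrieveResource(yaml.safe_load(CONFIG_YAML), steps=["drug"]).run()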
sudo apt-get install autoconf libtool