How to help

Deepak edited this page Sep 14, 2020 · 17 revisions

Download and use the code

git clone
cd kg-covid-19
python3 -m venv venv # optional
pip install -r requirements.txt
python install

To run the pipeline:

python download
python transform
python merge

Add any issues and questions here.

Write code to ingest data

The most urgent need is for code to ingest data from new sources.

Find a data source to ingest:

An issue tracker with a list of new data sources is here.

Look at the data file(s), and plan how you are going to write out data to nodes and edges:

You'll need to write out a nodes.tsv file describing each entity you are ingesting, and an edges.tsv describing the relationships between entities, as described here.

nodes.tsv should have at least these columns (you can add more columns if you like):

id name category

id should be a CURIE that uses one of these identifiers. They are enumerated here. For genes, a Uniprot ID is preferred, if available.

category should be a Biolink category in CURIE format, for example biolink:Gene
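As a rough illustration of the nodes.tsv layout described above, here is a minimal sketch using only the Python standard library. The helper names (`is_curie`, `write_nodes`) and the example rows are assumptions made for this illustration, not part of the kg-covid-19 code base, and the CURIE check is deliberately loose.

```python
import csv
import io

# Required columns, in the order given above.
NODE_HEADER = ["id", "name", "category"]


def is_curie(value):
    """Loose check that a value looks like prefix:local_id (hypothetical helper)."""
    prefix, sep, local_id = value.partition(":")
    return bool(prefix) and sep == ":" and bool(local_id)


def write_nodes(nodes, handle):
    """Write node dicts as tab-separated rows under the required header."""
    writer = csv.DictWriter(
        handle, fieldnames=NODE_HEADER, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for node in nodes:
        if not (is_curie(node["id"]) and is_curie(node["category"])):
            raise ValueError(f"id and category must be CURIEs: {node}")
        writer.writerow(node)


# Illustrative rows only; a real ingest would build these from the source file.
example_nodes = [
    {"id": "UniProtKB:P0DTC2", "name": "Spike glycoprotein", "category": "biolink:Protein"},
    {"id": "UniProtKB:Q9BYF1", "name": "ACE2", "category": "biolink:Protein"},
]

buffer = io.StringIO()
write_nodes(example_nodes, buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with `open(path, "w", newline="")` instead of the in-memory buffer.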

edges.tsv should have at least these columns:

subject edge_label object relation

subject and object should be ids that are present in the nodes.tsv file (again, as CURIEs that use one of these). edge_label should be a CURIE for the Biolink edge_label that describes the relationship. relation should be a CURIE for the term from the relation ontology.
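The edges.tsv requirements can be sketched the same way. The example below writes the four required columns and flags any edge whose subject or object is missing from the node ids. The function names and the example edge (including the `edge_label` and `relation` values) are illustrative assumptions; check the Biolink model and the relation ontology for the terms that fit your data.

```python
import csv
import io

# Required columns, in the order given above.
EDGE_HEADER = ["subject", "edge_label", "object", "relation"]


def dangling_edges(edges, node_ids):
    """Return edges whose subject or object is not among the known node ids."""
    return [
        edge
        for edge in edges
        if edge["subject"] not in node_ids or edge["object"] not in node_ids
    ]


def write_edges(edges, handle):
    """Write edge dicts as tab-separated rows under the required header."""
    writer = csv.DictWriter(
        handle, fieldnames=EDGE_HEADER, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(edges)


# Illustrative values only: the ids that would appear in nodes.tsv.
node_ids = {"UniProtKB:P0DTC2", "UniProtKB:Q9BYF1"}
example_edges = [
    {
        "subject": "UniProtKB:P0DTC2",
        "edge_label": "biolink:interacts_with",
        "object": "UniProtKB:Q9BYF1",
        "relation": "RO:0002434",  # placeholder; pick the term that fits your data
    },
]

problems = dangling_edges(example_edges, node_ids)
if problems:
    raise ValueError(f"edges refer to ids missing from nodes.tsv: {problems}")

buffer = io.StringIO()
write_edges(example_edges, buffer)
print(buffer.getvalue())
```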

Read how to make a PR, and fork the repo:

  • Read these instructions about how to make a pull request on GitHub. Fork the code and set up your development environment.

Add a block to download.yaml to download the data file(s) for the source:

  • In download.yaml, add a block of YAML containing the URL of each file you need to download for the source (and optionally a brief description), like so. Each item will be downloaded when the download command is executed:
# brief comment about this source, one or more blocks with a url: (and optionally a local_name:, to avoid name collisions)
  # first file
  local_name: some_data.txt
  # second file
  local_name: some_more_data.txt
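To see why `local_name:` matters, the sketch below checks a list of download items for file-name collisions. It assumes, as an illustration only, that a file without a `local_name` is saved under the last path component of its URL; the function names and the example.org URLs are hypothetical and are not taken from the project's downloader.

```python
import posixpath
from urllib.parse import urlparse


def raw_file_name(item):
    """Pick the saved file name: local_name if given, else the URL's last path part.

    Assumption for this sketch; the real downloader may behave differently.
    """
    if item.get("local_name"):
        return item["local_name"]
    return posixpath.basename(urlparse(item["url"]).path)


def find_collisions(items):
    """Return the file names that more than one download item would produce."""
    seen = set()
    collisions = []
    for item in items:
        name = raw_file_name(item)
        if name in seen and name not in collisions:
            collisions.append(name)
        seen.add(name)
    return collisions


# Hypothetical items: two different URLs that end in the same file name.
items = [
    {"url": "https://example.org/release_a/data.txt"},
    {"url": "https://example.org/release_b/data.txt"},
]
print(find_collisions(items))

# Giving the second item a local_name removes the clash.
items[1]["local_name"] = "data_release_b.txt"
print(find_collisions(items))
```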

Add code to ingest and transform data:

  • Add a new sub-directory in kg_covid_19/transform_utils with a unique name for your source. If the data come from a scientific paper, consider prepending the PubMed ID to the name of the source (e.g. pmid28355270_hcov229e_a549_cells)
  • In this sub-directory, write a class that ingests the file(s) you added above in the yaml, which will be in data/raw/[file name without path]. Your class should have a constructor and a run() function, which is called to perform the ingest. It should output data into data/transformed/[source name] for all nodes and edges, in tsv format, as described here.
  • Also add the following metadata in the comments of your script:
    • data source
    • files used
    • release version that you are ingesting
    • documentation on which fields are relevant and how they map to node and edge properties
  • In kg_covid_19/, add a key/value pair to DATA_SOURCES. The key should be the [source name] above, and the value should be the name of the class above. Also add an import statement for the class.
  • In merge.yaml, add a block for your new source, something like:
      type: tsv
         - data/transformed/[source_name]/nodes.tsv
         - data/transformed/[source_name]/edges.tsv
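The steps above can be sketched as a small transform class. Everything specific here is an assumption made for illustration: the class name, the `example_source` source name, the four-column input layout, and the CURIE prefix and Biolink terms. The project may also provide shared helpers that a real transform should use instead; this sketch relies only on the standard library.

```python
"""
Hypothetical example transform (all details are placeholders).

Data source: example interaction list
Files used: some_data.txt
Release version: not applicable for this sketch
Field mapping (tab-separated input, no header):
    column 1 -> subject node id (accession), column 2 -> subject node name
    column 3 -> object node id (accession),  column 4 -> object node name
    each row  -> one edge between the two nodes
"""
import csv
import os

NODE_HEADER = ["id", "name", "category"]
EDGE_HEADER = ["subject", "edge_label", "object", "relation"]


class ExampleSourceTransform:
    """Read data/raw/<file> and write nodes.tsv and edges.tsv for one source."""

    def __init__(self, input_dir="data/raw", output_dir="data/transformed",
                 source_name="example_source"):
        self.input_dir = input_dir
        self.output_dir = os.path.join(output_dir, source_name)

    def run(self, data_file="some_data.txt"):
        os.makedirs(self.output_dir, exist_ok=True)
        in_path = os.path.join(self.input_dir, data_file)
        nodes_path = os.path.join(self.output_dir, "nodes.tsv")
        edges_path = os.path.join(self.output_dir, "edges.tsv")
        seen = set()  # node ids already written, so each node appears once
        with open(in_path, newline="") as src, \
                open(nodes_path, "w", newline="") as nodes, \
                open(edges_path, "w", newline="") as edges:
            node_writer = csv.writer(nodes, delimiter="\t", lineterminator="\n")
            edge_writer = csv.writer(edges, delimiter="\t", lineterminator="\n")
            node_writer.writerow(NODE_HEADER)
            edge_writer.writerow(EDGE_HEADER)
            for row in csv.reader(src, delimiter="\t"):
                if len(row) < 4:
                    continue  # skip blank or malformed lines
                pairs = [(row[0], row[1]), (row[2], row[3])]
                ids = []
                for accession, name in pairs:
                    node_id = f"UniProtKB:{accession}"  # placeholder prefix
                    ids.append(node_id)
                    if node_id not in seen:
                        seen.add(node_id)
                        node_writer.writerow([node_id, name, "biolink:Protein"])
                # Placeholder edge_label and relation; choose terms that fit the data.
                edge_writer.writerow(
                    [ids[0], "biolink:interacts_with", ids[1], "RO:0002434"]
                )


# Registration, mirroring the DATA_SOURCES key/value pair described above.
DATA_SOURCES = {"example_source": ExampleSourceTransform}
```

Calling `DATA_SOURCES["example_source"]().run()` would then read data/raw/some_data.txt and write data/transformed/example_source/nodes.tsv and edges.tsv, which are the two paths the merge.yaml block points at.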

Submit your PR on GitHub, and link the GitHub issue for the data source you ingested

You may want to run pylint and mypy and fix any issues they report before submitting your PR.

Develop Jupyter notebooks to show how to use kg-covid-19

To be developed. Please contact Justin or anyone on the development team if you'd like to help!