This repository contains the code for the BioCypher adapter for the CollecTRI dataset. The adapter is a Python module that converts the CollecTRI dataset into the BioCypher format. It also serves as a tutorial for end-to-end knowledge graph construction using BioCypher.
The build pipeline consists of three steps:

- Download and cache the resource
- Run BioCypher to create the knowledge graph
- Deploy the knowledge graph and web frontend

- Create repository: using the template repository is the easiest way to get started. The template repository contains the basic structure of a BioCypher adapter. Clone the template repository and rename it and the adapter to your project's name. Also adjust the `pyproject.toml` file to reflect your project's name and version.
- Find the data: the CollecTRI dataset is available as a flat file at https://rescued.omnipathdb.org/CollecTRI.csv. With this link, we can set up a BioCypher `Resource` object to download and cache the data. We implement this and all other steps of the build pipeline in the `create_knowledge_graph.py` script; check there for the full code.
```python
from biocypher import BioCypher, Resource

bc = BioCypher()

# Register the CollecTRI flat file as a downloadable resource
collectri = Resource(
    name="collectri",
    url_s="https://rescued.omnipathdb.org/CollecTRI.csv",
    lifetime=0,  # CollecTRI is a static resource
)

# Download the resource (or fetch it from the cache) and return local path(s)
paths = bc.download(collectri)
```
- Adjust the adapter based on the contents of the dataset. This is the most labour-intensive step, as it involves systematising the dataset and mapping it to a suitable ontology, as well as designing the ETL (extract-transform-load) process in the adapter module. The CollecTRI dataset is comparatively simple, which makes it a good example. You can find a detailed description of the process below (adapter design and ontology mapping).
When building the adapter, it can be helpful to use the Pandas functionality of BioCypher to preview the KG components. Using the `add()` and `to_df()` methods, we can check whether the adapter is working as expected.
```python
# Stage the adapter's nodes and edges in the BioCypher instance
bc.add(adapter.get_nodes())
bc.add(adapter.get_edges())

# Convert the staged components to one Pandas dataframe per type
dfs = bc.to_df()
for name, df in dfs.items():
    print(name)
    print(df.head())
```
- Run BioCypher to create the knowledge graph. This step is straightforward, using the information provided by the mapping configuration and the process provided by the adapter created in the previous step. For compatibility with the Docker Compose workflow, we use the `write_nodes()` and `write_edges()` methods to generate CSV files for import into Neo4j, as well as the import call statement and a summary of the build process.
```python
bc.write_nodes(adapter.get_nodes())
bc.write_edges(adapter.get_edges())

# Write admin import statement
bc.write_import_call()

# Print summary
bc.summary()
```
- Run Docker Compose to deploy the knowledge graph. Running the standard `docker-compose.yaml` configuration will build the graph, import it into Neo4j, and deploy a Neo4j instance accessible at http://localhost:7474. The graph can then be browsed and queried.
```bash
docker compose up -d
```
You can also include the ChatGSE frontend in the deployment by running the `docker-compose-chatgse.yaml` configuration. This will additionally deploy a ChatGSE instance accessible at http://localhost:8501. In the Knowledge Graph tab, you can use natural language queries to generate Cypher queries and run them on the graph. To connect, you need to change the Neo4j host IP from `localhost` to `deploy`, which is the name of the Docker service running the Neo4j instance. You should be able to answer questions like "Which transcription factors activate TP53?", "Which genes are regulated by transcription factors starting with 'ZNF'?", or "Which are DNA-binding transcription factors?"
```bash
docker compose -f docker-compose-chatgse.yaml up -d
```
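You can also run such queries programmatically via the Neo4j Python driver. The sketch below is illustrative: it assumes BioCypher's default PascalCase labels (`TranscriptionFactor`, `Gene`, `TranscriptionalRegulation`) and the renamed `activation_or_inhibition` property described at the end of this document; adjust labels and credentials to your actual deployment.

```python
from neo4j import GraphDatabase

# Connect to the deployed Neo4j instance (default Bolt port);
# credentials are placeholders, adjust to your deployment
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "neo4j"))

# Hypothetical Cypher equivalent of "Which transcription factors activate TP53?"
# Label and property names are assumptions, not guaranteed by the source.
query = """
MATCH (tf:TranscriptionFactor)-[r:TranscriptionalRegulation]->(g:Gene {name: 'TP53'})
WHERE r.activation_or_inhibition = 'activation'
RETURN tf.name AS tf_name
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["tf_name"])

driver.close()
```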
To stop the deployment, run

```bash
docker compose down --volumes
```

or

```bash
docker compose -f docker-compose-chatgse.yaml down --volumes
```
Removing the volumes is necessary to ensure a clean deployment when running `docker compose up` again; otherwise, the graph will contain duplicate nodes and edges.
We can look at the downloaded dataset (using the path from the previous step) to get an idea of its contents:
```python
import pandas as pd

df = pd.read_csv(paths[0])
print(df.head())
#   source target  weight  ...  resources                                          PMID                                               sign.decision
# 0    MYC   TERT       1  ...  ExTRI;HTRI;TRRUST;TFactS;NTNU.Curated;Pavlidis...  10022128;10491298;10606235;10637317;10723141;1...  PMID
# 1   SPI1  BGLAP       1  ...  ExTRI                                              10022617                                           default activation
# 2    AP1    JUN       1  ...  ExTRI;TRRUST;NTNU.Curated                          10022869;10037172;10208431;10366004;11281649;1...  PMID
# 3  SMAD3    JUN       1  ...  ExTRI;TRRUST;TFactS;NTNU.Curated                   10022869;12374795                                  PMID
# 4  SMAD4    JUN       1  ...  ExTRI;TRRUST;TFactS;NTNU.Curated                   10022869;12374795                                  PMID

print(df.columns)
# Index(['source', 'target', 'weight', 'TF.category', 'resources', 'PMID',
#        'sign.decision'],
#       dtype='object')
```
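Before designing the adapter, a quick sanity check on the dataframe (a sketch using only the columns shown above) tells us how many distinct regulators and targets to expect:

```python
# Count distinct regulators (sources) and targets in the table
tfs = set(df["source"])
genes = set(df["target"])
print(f"{len(tfs)} TFs, {len(genes)} target genes, {len(df)} interactions")
```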
We can then use this knowledge to design the adapter, i.e., the ETL process. Briefly, the adapter extracts sources and targets, which are both genes, and establishes relationships that embody the regulons. These relationships are enriched with the curation information contained in the table.
We use Enums to define the types of nodes and edges and their properties. This helps in organising the process and also allows the use of auto-completion in downstream tasks. We have two node types, `gene` and `transcription factor`, and one relationship type, `transcriptional regulation`. We also define the properties of the nodes and edges, which are none for genes, `category` for transcription factors, and `weight`, `resources`, `references`, and `sign_decision` for the relationship. (Note that we rename some of the original attributes to make them more intuitive, e.g., `PMID` to `references`, or machine-compatible, e.g., `sign.decision` to `sign_decision`. This conversion is handled by the adapter and needs to be reflected in our schema configuration.)
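A minimal sketch of such Enums (the exact names and organisation in the repository may differ):

```python
from enum import Enum, auto

class NodeType(Enum):
    """Node types produced by the adapter."""
    GENE = auto()
    TRANSCRIPTION_FACTOR = auto()

class TranscriptionFactorField(Enum):
    """Properties of transcription factor nodes."""
    CATEGORY = "category"

class EdgeType(Enum):
    """Edge types produced by the adapter."""
    TRANSCRIPTIONAL_REGULATION = auto()

class EdgeField(Enum):
    """Properties of regulation edges, including the renamed columns."""
    WEIGHT = "weight"
    RESOURCES = "resources"
    REFERENCES = "references"        # renamed from 'PMID'
    SIGN_DECISION = "sign_decision"  # renamed from 'sign.decision'
```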
The adapter then uses the `pandas` library to read the dataset and extract the relevant information. We use the `_preprocess_data()` method to load the dataframe and extract unique genes and TFs. Since each row of the dataset is one relationship, we can simply iterate over the rows to create the relationships directly.
The last component the adapter needs is two public methods, `get_nodes()` and `get_edges()`, which return generators of nodes and edges, respectively. These methods are used in the build script (`create_knowledge_graph.py`) to create the knowledge graph. A condensed sketch of the whole adapter follows below.
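This sketch is illustrative rather than the repository's exact code; everything apart from `get_nodes()`, `get_edges()`, and `_preprocess_data()` (class name, identifiers, import path) is an assumption:

```python
import pandas as pd
# Import path may vary with the BioCypher version
from biocypher._create import BioCypherNode, BioCypherEdge

class CollecTRIAdapter:
    def __init__(self, path: str):
        self._path = path
        self._preprocess_data()

    def _preprocess_data(self):
        # Load the flat file; sources are TFs, targets are regulated genes
        self._df = pd.read_csv(self._path)
        self._tfs = set(self._df["source"])
        self._genes = set(self._df["target"])

    def get_nodes(self):
        # One node per unique TF and per unique target gene
        for tf in self._tfs:
            yield BioCypherNode(
                node_id=f"hgnc.symbol:{tf}",
                node_label="transcription factor",
                properties={"name": tf},
            )
        for gene in self._genes - self._tfs:
            yield BioCypherNode(
                node_id=f"hgnc.symbol:{gene}",
                node_label="gene",
                properties={"name": gene},
            )

    def get_edges(self):
        # Each row of the table is one regulatory relationship
        for _, row in self._df.iterrows():
            yield BioCypherEdge(
                source_id=f"hgnc.symbol:{row['source']}",
                target_id=f"hgnc.symbol:{row['target']}",
                relationship_label="transcriptional regulation",
                properties={
                    "weight": row["weight"],
                    "resources": row["resources"],
                    "references": row["PMID"],              # renamed
                    "sign_decision": row["sign.decision"],  # renamed
                },
            )
```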
In addition, we use the information to create an ontology mapping in the `schema_config.yaml` file, which reflects the ontological grounding of the data. Since CollecTRI deals with transcriptional regulation in a gene-gene context, we only need to define `gene` nodes and some regulatory interaction between them. For this simple case, we resort to the shallow default ontology, Biolink, which already contains Gene entities and regulatory relationships. This also means we do not need to specify the ontology in the `biocypher_config.yaml` file, as Biolink is the default.
We use the existing entity type `gene`, and we extend the existing `pairwise gene to gene interaction` relationship to `transcriptional regulation` using inheritance. For clarity, we also introduce a `transcription factor` entity type, which inherits from `gene`; this way, we can query for transcription factors specifically while retaining the ability to query for all genes.
```yaml
gene:
  represented_as: node
  preferred_id: hgnc.symbol
  properties:
    name: str

transcription factor:
  is_a: gene
  represented_as: node
  preferred_id: hgnc.symbol
  properties:
    name: str
    category: str

transcriptional regulation:
  is_a: pairwise gene to gene interaction
  represented_as: edge
  source: transcription factor
  target: gene
  properties:
    weight: float
    resources: str
    references: str
    sign_decision: str
```
Note that, since we pass `BioCypherNode` and `BioCypherEdge` objects to the BioCypher instance, which already include the correct labels of the ontology classes we map to (`gene`, `transcription factor`, and `transcriptional regulation`), we do not need to specify the `input_label` fields of each class.
We do, however, add some optional components to the schema configuration, mainly to make interaction with the LLM framework BioChatter easier. For instance, we provide explicit `properties` for each class, which are used to generate the `schema_info.yaml` file, an extended schema configuration for BioChatter integration. We also include the `name` property as a shortcut to the gene symbol without the added prefix (prefixing is usually good practice to ensure uniqueness of identifiers, in this case via `hgnc.symbol`). This way, we (and the LLM) can use the `name` property to refer to genes by their symbol, e.g., `MYC` instead of `hgnc.symbol:MYC`.
We also rename `weight` to `activation_or_inhibition`, since it is a binary attribute with only two values, `1` and `-1`, which we also convert into a string with the categories `activation` and `inhibition`. This makes the attribute more intuitive for human and LLM users.
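A one-line sketch of this conversion as it might appear in the adapter (the value mapping is from the source, the surrounding code is illustrative):

```python
# Map the numeric weight (1 activating, -1 inhibiting) to a readable category
activation_or_inhibition = "activation" if int(row["weight"]) == 1 else "inhibition"
```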