Protein embeddings improve phage-host interaction prediction

This work was accepted for publication in PLOS ONE.

The final version of our paper (as published in PLOS ONE) can be accessed via this link.
Presenting this paper, the lead author (Mark Edward M. Gonzales) won 2nd Prize at the 2023 Magsaysay Future Engineers/Technologists Award.
- This award is conferred by the National Academy of Science and Technology, the highest recognition and scientific advisory body of the Philippines, to recognize outstanding research outputs on engineering and technology at the collegiate level.
- The presentation can be viewed here (29:35–39:51), and the slides can be accessed via this link.

If you find our work useful, please consider citing:

@article{10.1371/journal.pone.0289030,
    doi = {10.1371/journal.pone.0289030},
    author = {Gonzales, Mark Edward M. AND Ureta, Jennifer C. AND Shrestha, Anish M. S.},
    journal = {PLOS ONE},
    publisher = {Public Library of Science},
    title = {Protein embeddings improve phage-host interaction prediction},
    year = {2023},
    month = {07},
    volume = {18},
    url = {https://doi.org/10.1371/journal.pone.0289030},
    pages = {1-22},
    number = {7}
}

Description

ABSTRACT: With the growing interest in using phages to combat antimicrobial resistance, computational methods for predicting phage-host interactions have been explored to help shortlist candidate phages. Most existing models consider entire proteomes and rely on manual feature engineering, which poses difficulty in selecting the most informative sequence properties to serve as input to the model. In this paper, we framed phage-host interaction prediction as a multiclass classification problem that takes as input the embeddings of a phage's receptor-binding proteins, which are known to be the key machinery for host recognition, and predicts the host genus. We explored different protein language models to automatically encode these protein sequences into dense embeddings without the need for additional alignment or structural information. We show that the use of embeddings of receptor-binding proteins presents improvements over handcrafted genomic and protein sequence features. The highest performance was obtained using the transformer-based protein language model ProtT5, resulting in a 3% to 4% increase in weighted F1 and recall scores across different prediction confidence thresholds, compared to using selected handcrafted sequence features.

AUTHOR SUMMARY: Antimicrobial resistance is among the major global health issues at present. As alternatives to the usual antibiotics, drug formulations based on phages (bacteria-infecting viruses) have received increased interest, as phages are known to attack only a narrow range of bacterial hosts and antagonize the target pathogen with minimal side effects. The screening of candidate phages has recently been facilitated through the use of machine learning models for inferring phage-host pairs. The performance of these models relies heavily on the transformation of raw biological sequences into a collection of numerical features. However, since a wide array of potentially informative features can be extracted from sequences, selecting the most relevant ones is challenging. Our approach eliminates the need for this manual feature engineering by employing protein language models to automatically generate numerical representations for specific subsets of tail proteins known as receptor-binding proteins. These proteins are responsible for a phage's initial contact with the host bacterium and are thus regarded as important determinants of host specificity. Our results show that this approach presents improvements over using handcrafted genomic and protein sequence features in predicting phage-host interaction.
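The abstract reports scores "across different prediction confidence thresholds." One plausible reading is sketched below: the classifier produces a probability for each host genus, and the top-ranked genus is accepted only if its probability reaches the threshold. This is a minimal illustration under stated assumptions, not the authors' implementation. The function name `predict_with_threshold`, the `"others"` fallback label, and the example probabilities are all hypothetical.

```python
from typing import Dict


def predict_with_threshold(
    genus_probs: Dict[str, float],
    threshold: float,
    fallback: str = "others",  # hypothetical label for low-confidence cases
) -> str:
    """Return the most probable host genus, or `fallback` if its
    probability is below `threshold`."""
    best_genus = max(genus_probs, key=genus_probs.get)
    if genus_probs[best_genus] >= threshold:
        return best_genus
    return fallback


# Made-up class probabilities for a single phage (illustration only).
probs = {"Escherichia": 0.62, "Klebsiella": 0.30, "Salmonella": 0.08}
print(predict_with_threshold(probs, threshold=0.5))
print(predict_with_threshold(probs, threshold=0.7))
```

Raising the threshold trades coverage for confidence: fewer phages receive a genus-level prediction, but those that do are backed by a higher class probability.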


Project Structure

The experiments folder contains the files and scripts for running our model and reproducing our results. Note that additional (large) files have to be downloaded (or generated) following the instructions in the Jupyter notebooks.

Directories

Directory	Description
`inphared`	Contains the list of phage-host pairs in TSV format. The GenBank and FASTA files with the genomic and protein sequences of the phages, the embeddings of the receptor-binding proteins, and the phage-host-features CSV files should also be saved in this folder
`preprocessing`	Contains text files related to the preprocessing of host information and the selection of annotated RBPs
`rbp_prediction`	Contains the JSON file of the trained XGBoost model proposed by Boeckaerts et al. (2022) for the computational prediction of receptor-binding proteins. Downloaded from this repository (under the MIT License)
`temp`	Contains intermediate output files during preprocessing and performance evaluation
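Since the `inphared` directory stores the phage-host pairs as a TSV file, a quick way to inspect it is with Python's standard `csv` module. The sketch below tallies how many rows list each host. The column names `Accession` and `Host` and the sample rows are assumptions made for illustration; check the header of the actual file before relying on them.

```python
import csv
import io
from collections import Counter


def count_hosts(tsv_handle, host_column="Host"):
    """Tally how many rows list each host in a tab-separated file.

    `host_column` is an assumed column name; adjust it to the real header.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return Counter(row[host_column] for row in reader)


# Tiny made-up sample standing in for the real TSV (illustration only).
sample = io.StringIO(
    "Accession\tHost\n"
    "PHAGE_A\tEscherichia\n"
    "PHAGE_B\tKlebsiella\n"
    "PHAGE_C\tEscherichia\n"
)
host_counts = count_hosts(sample)
print(host_counts.most_common())
```

To run this on a real file, open it with `open(path, newline="")` and pass the handle to `count_hosts`.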


Jupyter Notebooks

Each notebook provides detailed instructions related to the required and output files, including the download links and where to save them.

Notebook	Description	Required Files	Output Files
`1. Sequence Preprocessing.ipynb`	Preprocessing of host information and selection of annotated receptor-binding proteins	GenomesDB (partial; finish populating it by following the instructions in the notebook), GenBank file of phage genomes and/or proteomes	FASTA files of genomic and protein sequences
`2. Exploratory Data Analysis.ipynb`	Exploratory data analysis	Protein embeddings (Part 1 and Part 2), Phage-host-features CSV files	–
`3. RBP Computational Prediction.ipynb`	Computational prediction of receptor-binding proteins	Protein embeddings (Part 1 and Part 2)	Protein embeddings (Part 1 and Part 2)
`3.1. RBP FASTA Generation.ipynb`	Generation of the FASTA files containing the RBP protein sequences	Protein embeddings (Part 1 and Part 2)	FASTA files of genomic and protein sequences
`4. Protein Embedding Generation.ipynb`	Generation of protein embeddings	FASTA files of genomic and protein sequences	Protein embeddings (Part 1 and Part 2)
`5. Data Consolidation.ipynb`	Generation of phage-host-features CSV files	FASTA files of genomic and protein sequences, Protein embeddings (Part 1 and Part 2)	Phage-host-features CSV files
`6. Classifier Building & Evaluation.ipynb`	Construction of phage-host interaction model and performance evaluation	Phage-host-features CSV files	Trained models
`6.1. Additional Model Evaluation (Specificity + PR Curve).ipynb`	Addition of metrics for model evaluation	Phage-host-features CSV files	–
`7. Visualization.ipynb`	Plotting of t-SNE and UMAP projections	Phage-host-features CSV files	–
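Several notebooks in the table above exchange FASTA files. For a quick look at such a file without installing anything, a minimal parser can be sketched as follows. This is an illustrative helper, not code from the repository; the name `read_fasta` and the sample records are hypothetical, and the notebooks may well rely on `biopython` (listed among the dependencies) for the actual parsing.

```python
import io
from typing import Dict, Optional, TextIO


def read_fasta(handle: TextIO) -> Dict[str, str]:
    """Map each record ID (first word after '>') to its full sequence."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so append to the current record.
            records[current_id] += line
    return records


# Made-up records for illustration only.
sample = io.StringIO(">prot1 hypothetical tail fiber\nMKT\nAAL\n>prot2\nMSQ\n")
records = read_fasta(sample)
print({record_id: len(seq) for record_id, seq in records.items()})
```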


Python Scripts

Script	Description
`ClassificationUtil.py`	Contains the utility functions for the generation of the phage-host-features CSV files, construction of the phage-host interaction model, and performance evaluation
`ConstantsUtil.py`	Contains the constants used in the notebooks and scripts
`EDAUtil.py`	Contains the utility functions for exploratory data analysis
`RBPPredictionUtil.py`	Contains the utility functions for the computational prediction of receptor-binding proteins
`SequenceParsing.py`	Contains the utility functions for preprocessing host information and selecting annotated receptor-binding proteins
`boeckaerts.py`	Contains the utility functions written by Boeckaerts et al. (2021) for running their phage-host interaction prediction tool (with which we benchmarked our model). Downloaded from this repository (under the MIT License)


Folder Structure

Once you have cloned this repository and finished downloading (or generating) all the additional required files following the instructions in the Jupyter notebooks, your folder structure should be similar to the one below:

phage-host-prediction (root)
- datasets
  - inphared
    - inphared
      - GenomesDB (Download the partial copy, then finish populating it by following the instructions here)
        - AB002632
        - ...
- experiments
  - inphared
    - data (Download)
      - rbp.csv
      - rbp_embeddings_esm.csv
      - ...
    - embeddings (Download Part 1 and Part 2)
      - esm
      - esm1b
      - ...
    - fasta (Download)
      - complete
      - hypothetical
      - nucleotide
      - rbp
    - 16Sep2022_data_excluding_refseq.tsv
    - 16Sep2022_phages_downloaded_from_genbank.gb (Download)
  - models (Download)
    - boeckaerts.joblib
    - esm.joblib
    - ...
  - preprocessing
  - rbp_prediction
  - temp
  - 1. Sequence Preprocessing.ipynb
  - ...
  - ClassificationUtil.py
  - ...
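After downloading or generating the additional files, it can help to confirm that the main directories from the layout above are in place. The following is a minimal sketch, assuming it is run from the repository root; the helper name `missing_dirs` is hypothetical, and only a subset of the tree is checked.

```python
from pathlib import Path
from typing import List

# Subset of the layout shown above, relative to the repository root.
EXPECTED_DIRS = [
    "experiments/inphared/data",
    "experiments/inphared/embeddings",
    "experiments/inphared/fasta",
    "experiments/models",
    "experiments/preprocessing",
    "experiments/rbp_prediction",
    "experiments/temp",
]


def missing_dirs(root: Path, expected: List[str] = EXPECTED_DIRS) -> List[str]:
    """Return the expected directories that do not exist under `root`."""
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(Path("."))
    if missing:
        print("Missing directories:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected directories are present.")
```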


Environment & Dependencies

⚠️ UPDATE (06/12/2023): In May 2023, Google Colab switched its default runtime to Python 3.10. However, one of our project's dependencies, bio-embeddings (v0.2.3), seems to be incompatible with Python 3.10.

If the memory requirement of loading pretrained protein language models (4. Protein Embedding Generation.ipynb) is too heavy for your local machine, an alternative cloud-based service with GPU is Paperspace; you may try using either its PyTorch 1.12 runtime (which, as of writing, uses Python 3.9) or Python 3.9 runtime.

Operating System

One of our project's dependencies, bio-embeddings, was developed for Unix and Unix-like operating systems. If you are running this project on Windows, consider using Windows Subsystem for Linux (WSL) or a virtual machine.
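A small guard at the top of a script or notebook can flag the operating-system constraint early. This is only a sketch; the helper name `is_unix_like` is hypothetical and not part of the repository.

```python
import platform


def is_unix_like(system_name: str) -> bool:
    """Treat anything other than native Windows as Unix-like.

    Under WSL, platform.system() reports "Linux", so WSL counts as Unix-like.
    """
    return system_name != "Windows"


if not is_unix_like(platform.system()):
    print("Native Windows detected: consider using WSL or a virtual machine.")
```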

Dependencies

We recommend using Python 3.9 to run this project. Thanks to Dr. Paul K. Yu for sharing his environment configuration (environment.yaml).

The dependencies can be installed via Conda, an open-source package and environment management system. Run the following command to create a virtual environment with the dependencies installed:

conda env create -f environment.yaml

To activate this environment, run the following command:

conda activate phage-host-prediction
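Once the environment is active, you can confirm that the interpreter matches the recommended Python 3.9. The check below is a minimal sketch; `matches_recommended` is a hypothetical helper, not part of the repository.

```python
import sys
from typing import Tuple

RECOMMENDED = (3, 9)  # Python version recommended in this README


def matches_recommended(
    version: Tuple[int, ...],
    recommended: Tuple[int, int] = RECOMMENDED,
) -> bool:
    """Compare only the major and minor components of `version`."""
    return tuple(version[:2]) == recommended


current = tuple(sys.version_info[:2])
if matches_recommended(current):
    print(f"Python {current[0]}.{current[1]} matches the recommended version.")
else:
    print(
        f"Python {current[0]}.{current[1]} detected; "
        f"this project recommends {RECOMMENDED[0]}.{RECOMMENDED[1]}."
    )
```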

The complete list of Python libraries and modules used in this project (excluding those that are part of the Python Standard Library) is given below:

Library/Module	Description	License
`regex`	Provides additional functionality over the standard `re` module while maintaining backwards-compatibility	Apache License 2.0
`nltk`	Provides interfaces to corpora and lexical resources, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning	Apache License 2.0
`biopython`	Provides tools for computational molecular biology	Biopython License Agreement, BSD 3-Clause License
`ete3`	Provides functions for automated manipulation, analysis, and visualization of phylogenetic trees	GNU General Public License v3.0
`pandas`	Provides functions for data analysis and manipulation	BSD 3-Clause "New" or "Revised" License
`numpy`	Provides a multidimensional array object, various derived objects, and an assortment of routines for fast operations on arrays	BSD 3-Clause "New" or "Revised" License
`scipy`	Provides efficient numerical routines, such as those for numerical integration, interpolation, optimization, linear algebra, and statistics	BSD 3-Clause "New" or "Revised" License
`scikit-learn`	Provides efficient tools for predictive data analysis	BSD 3-Clause "New" or "Revised" License
`imbalanced-learn`	Provides tools when dealing with classification with imbalanced classes	MIT License
`pyyaml`	Supports standard YAML tags and provides Python-specific tags that allow representing arbitrary Python objects	MIT License
`xgboost`	Implements machine learning algorithms under the gradient boosting framework	Apache License 2.0
`joblib`	Provides tools for lightweight pipelining in Python	BSD 3-Clause "New" or "Revised" License
`numba`	Translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library	BSD 2-Clause "Simplified" License
`matplotlib`	Provides functions for creating static, animated, and interactive visualizations	Matplotlib License (BSD-Compatible)
`jsonnet`	Domain-specific language for JSON	Apache License 2.0
`cudatoolkit`	Parallel computing platform and programming model for general computing on GPUs	NVIDIA Software License
`bio-embeddings`	Provides an interface for the use of language model-based biological sequence representations for transfer-learning	MIT License
`umap-learn`	Implements uniform manifold approximation and projection, a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction	BSD 3-Clause "New" or "Revised" License

The descriptions are taken from their respective websites.


Authors

Mark Edward M. Gonzales
Ms. Jennifer C. Ureta
Dr. Anish M.S. Shrestha

This is a research project under the Bioinformatics Laboratory, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Philippines.

This research was partly funded by the Department of Science and Technology – Philippine Council for Health Research and Development (DOST-PCHRD) under the e-Asia JRP 2021 Alternative therapeutics to tackle AMR pathogens (ATTACK-AMR) program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
