learnMSA2: deep protein multiple alignments with large language and hidden Markov models

Learning and aligning large protein families with support of protein language models.

Our tool is under active development and feedback is very much appreciated.

Features

  • Aligns large numbers of protein sequences with accuracy above the current state of the art
  • Can use protein language models (--use_language_model) for significantly improved accuracy
  • Enables ultra-large alignments of millions of sequences
  • GPU acceleration and multi-GPU support
  • Scales linearly in the number of sequences (does not require a guide tree)
  • Memory efficient (depending on sequence length, aligning millions of sequences on a laptop is possible)
  • Can visualize the profile HMM or a sequence logo of the consensus motif

Current limitations

  • Requires many sequences to achieve high accuracy (typically at least 1000; a few hundred may still be enough)
  • Only for protein sequences
  • Becomes increasingly slow for long proteins (> 1000 residues)

Installation

Recommended ways to install learnMSA together with TensorFlow and GPU support:

Singularity/Docker

We provide a hassle-free Docker image with GPU and protein language model (pLM) support.

singularity build learnmsa.sif docker://felbecker/learnmsa:2.0.9
singularity run --nv learnmsa.sif learnMSA

Running the container with --nv is required for GPU support.
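
If you prefer to use Docker directly instead of Singularity, the same image should work with standard Docker options. A minimal sketch (the --gpus flag requires the NVIDIA Container Toolkit; the bind mount and file names are placeholders, and we assume learnMSA is on the container's PATH, as the Singularity example above suggests):

docker run --gpus all -v $(pwd):/data felbecker/learnmsa:2.0.9 learnMSA -i /data/INPUT_FILE -o /data/OUTPUT_FILE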

conda/mamba and pip

  1. Create a conda environment:
conda create -n learnMSA python=3.12
conda activate learnMSA
  2. Install learnMSA (CUDA toolkit included):
pip install learnMSA
  3. Additional installs for sequence weights (recommended!):
conda install -c bioconda mmseqs2

You may have to set up Bioconda channels.

  4. Additional installs for language model support (recommended!):
pip install torch==2.2.1 tf-keras==2.17.0
  5. (optional) Verify that TensorFlow and learnMSA are correctly installed:
python3 -c "import tensorflow as tf; print(tf.__version__, tf.config.list_physical_devices('GPU'))"
learnMSA -h
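
If you installed the optional PyTorch backend for language model support (step 4), you can run an analogous check for it. A minimal sketch using the standard PyTorch API:

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"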

Using learnMSA for alignment

Recommended way to align proteins:

learnMSA -i INPUT_FILE -o OUTPUT_FILE --use_language_model --sequence_weights

Without language model support (recommended for speed and for proteins with very high sequence similarity):

learnMSA -i INPUT_FILE -o OUTPUT_FILE --sequence_weights

Note: If you installed learnMSA via Docker/Singularity, you have to prefix the command with the container call, e.g. singularity run --nv learnmsa.sif learnMSA -i ...

We recommend always running learnMSA with the --sequence_weights flag to improve accuracy. This requires MMseqs2 to be installed (see above). Sequence weights (and thus the MMseqs2 requirement) are disabled by default.
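
To confirm that MMseqs2 is available on your PATH before starting a long run, a quick check with standard shell commands (not part of learnMSA itself):

command -v mmseqs && mmseqs version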

To output a PDF with a sequence logo alongside the MSA, use --logo. For a fun GIF that visualizes the training process, you can use --logo_gif (caution: this slows down training and should not be used for real alignments).
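
Putting the options together, a typical high-accuracy run that also writes a logo PDF might look like this (the file names are placeholders):

learnMSA -i family.fasta -o family_alignment.fasta --use_language_model --sequence_weights --logo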

Interactive notebook with visualization:

Run the notebooks learnMSA_demo.ipynb or learnMSA_with_language_model_demo.ipynb with Jupyter.
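
For example, assuming Jupyter is installed in the active environment and you are in a directory containing the notebooks (e.g. a clone of this repository):

pip install jupyter
jupyter notebook learnMSA_demo.ipynb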

Benchmark:

[Benchmark figure omitted; see the repository page for the image.]

Publications

Becker F, Stanke M. learnMSA2: deep protein multiple alignments with large language and hidden Markov models. Bioinformatics. 2024

Becker F, Stanke M. learnMSA: learning and aligning large protein families. GigaScience. 2022
