Releases: steineggerlab/foldseek
9-427df8a
At a glance: Foldseek release 9 features the fully benchmarked Foldseek-multimer search and structure-based sequence search using ProstT5. Both Foldseek-multimer and structure-based sequence search are also available in the Foldseek webserver.
Major Features
- Foldseek-multimer: Fully benchmarked and integrated into this release with the easy-multimersearch and multimer workflows (Thanks @Woosub-Kim). Check out our preprint explaining the algorithm. Read more on how to get started in our README.
- Search requires less memory: We optimized the memory consumption of Foldseek. It now requires significantly less memory (f629bbe)
- Structure-based sequence search: Predict protein 3Di directly from amino acid sequences without the need for existing protein structures. This is roughly 400-4000x faster than predicting full protein structures with ColabFold. This feature uses the ProstT5 protein language model and runs by default on CPU:
foldseek databases ProstT5 weights tmp
foldseek databases PDB pdb tmp
foldseek easy-search QUERY.fasta pdb result.m8 tmp --prostt5-model weights
Fast inference using GPU/CUDA is also supported. Compile from source with cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_CUDA=1 -DCUDAToolkit_ROOT=Path-To-Cuda-Toolkit and call with createdb/easy-search --prostt5-model weights --gpu 1. (Thanks @Victor-Mihaila).
Breaking changes
- Remove .cif/.pdb from filenames and remove _MODEL_ from identifiers in .lookup #261 (Thanks @ChaSooyoung)
- Removed --tar-include and --tar-exclude from createdb as they were unused (15c0516)
- Not-breaking: workflows using easy-complexsearch and complexsearch will continue to work. These are hidden modules mapping to easy-multimersearch and multimersearch internally. However, the internals have had major changes since the last release.
Other features
- convert2pdb can output separate PDB files (346c1dd)
- createdb learned to read a large number of input files from a .tsv file (e1394aa)
- Force input format with createdb --input-format (852434a)
- Compute exact TM-score with --exact-tmscore (493cefe)
- Added CATH50 database (6893dcc)
- Update HTML output (not fully supported for multimer yet; c7e4a37, 361c22a, 1bc8d2e; Thanks @gamcil)
- compressca learned new input and output modes (8e68e86, 5d2724d, 284bc81)
Bug Fixes
- Fix broken symlinks with databases PDB download (9ef6d18, fa6c530).
- Fix AFDB Proteome and SwissProt download check (fa6c530, Thanks @TigerWindWood)
- Fix AF3 mmCIF files crashing createdb
- Fix convert2pdb creating broken PDB files for large structures (b6dac8a)
- Remove ligand and alt res within chain #198 (Thanks @NatureGeorge)
- Skip residues without C-alpha #214 (75a50f7)
- structurerescorediagonal did not properly respect --tmscore-threshold (#205; 886021d)
- Fallback alignment to Smith-Waterman when block-aligner produces invalid alignments (54c271c)
Developers
8-ef4e960
At a glance: Added support for clustered, protein-complex searches (alpha version, feedback welcome) as well as improved HTML output.
Features
- Implement easy-complex-search to find similar complex structures in a database
- Implement a cluster search --cluster-search 1, which speeds up searches through redundant databases. It first searches only the representatives and then expands the final alignment to all cluster members. Two downloadable DBs support this search: PDB and the Alphafold/UniProt50.
- createclusearchdb allows building a searchable cluster database (b4d7ec5)
- convertalis HTML output updated to match search.foldseek.com output (96be67c)
- Introduced Alphafold/UniProt50-minimal and updated cluster file downloads for regular Alphafold/UniProt50 to support cluster searches (93ad1d4, 2e9da41, daad5ab)
- We added two modules scorecomplex and createcomplexreport to compute a TMscore between protein complexes as well as to summarize the findings (938b591, a6c75cb)
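The two-stage idea behind the cluster search described above (search representatives first, then expand to cluster members) can be illustrated with a minimal Python sketch. This is not Foldseek's implementation: the function name, the tuple layout, and the in-memory cluster map are all hypothetical, and the sketch simply copies the representative's score to each member, whereas a real expansion step would presumably realign every member.

```python
def expand_cluster_hits(rep_hits, clusters):
    """Expand hits against cluster representatives to all cluster members.

    rep_hits: list of (query, representative, score) tuples from the first pass.
    clusters: dict mapping a representative to the list of its members
              (the representative is expected to be part of that list).
    Both structures are hypothetical and only serve this illustration.
    """
    expanded = []
    for query, rep, score in rep_hits:
        # A representative without a cluster entry is treated as a singleton.
        for member in clusters.get(rep, [rep]):
            # Simplification: reuse the representative's score for each member.
            expanded.append((query, member, score))
    return expanded


clusters = {"repA": ["repA", "memA1", "memA2"], "repB": ["repB"]}
rep_hits = [("q1", "repA", 250), ("q1", "repB", 90)]
expanded = expand_cluster_hits(rep_hits, clusters)
```

The first pass only touches two representatives here, while the expanded result covers all four database entries, which is where the speed-up on redundant databases comes from.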
Bug fixes
- Foldseek correctly computes coverage again (c63725d). Coverage computation was broken since release 6 (29979fb).
- --alignment-type 0 (3Di-only) now correctly ignores amino-acid information (f0de872)
- createdb could miss some files when recursively looking within directories on some file systems (d1d1b86)
- convertalis --format-output can output qca or tca if only one of the two databases has C-alpha information (311845d)
- --lddt-thr and --tmscore-thr are ignored when --sort-by-structure-bits 0 is set (b1b4710)
Developers
- Much smaller precomputed index for --prefilter-mode 0 (exhaustive ungapped prefiltering) with --index-exclude 1 or --sort-by-structure-bits 0 (No C-alpha) with --index-exclude 2 or both with --index-exclude 3 (8f586c0)
- Enabled WebAssembly (WASM) compilation for Foldseek (408cfae; pending on Daniel-Liu-c0deb0t/block-aligner#26)
Others
- Thanks @amorehead and @KevinDuringWork for their pull request (#159, #170)
7-04e0ec8
At a glance: Downloadable pdb database can be searched with --cluster-search 1. Many createdb improvements and other bug fixes.
Features
- createdb properly warns and exits if no protein chain can be extracted (a146142, #134)
- createdb separates PDB/mmCIF MODEL records into different source/lookup entries (d488f4a)
- createdb filters out structures that are not proteins (d48d389)
- databases downloader supports cluster databases (ef768f4)
- pdb database creation script has been updated to produce a cluster database that can be searched with --cluster-search 1 (8eb36a2)
Bug fixes
- Fixed a bug with block-aligner where long protein sequences would error out (0627447) Thanks @Daniel-Liu-c0deb0t!
- Foldseek can be compiled without zlib, fixed an issue with zlib linking to gemmi (0832bef, 1a038db)
- Fixed Dockerfile to drop backports, as they are not needed with Debian bookworm (04e0ec8)
Others
- Made compressca an expert tool, hiding it from the default view to avoid confusion. (e4fe5be)
Developers
Foldseek 6-29e2557
At a glance: Introduced block-aligner for faster alignments, added ungapped prefilter mode, added cluster search support
Major Features
- Introduced block-aligner, a new banded-alignment algorithm that speeds up alignments by ~2x. Check out the block-aligner preprint. Thanks @Daniel-Liu-c0deb0t!
- Added ungapped prefilter mode (--prefilter-mode 1). This is similar to the HHblits prefilter that exhaustively aligns without gaps all queries and targets. This mode has much lower memory requirements and should scale better for single or few query searches. However, it scales worse with many queries.
- Added cluster search support, similar to the search introduced in ColabFold
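To make the ungapped prefilter idea above more concrete, here is a minimal Python sketch of exhaustive ungapped scoring between one query and one target. It is an illustration only, not Foldseek's code: the function name and the simple match/mismatch scores are assumptions, whereas the real prefilter works with substitution scores over the 3Di/AA alphabets and vectorized kernels.

```python
def best_ungapped_score(query, target, match=2, mismatch=-1):
    """Return the best-scoring ungapped local segment over all diagonals.

    The match/mismatch scores are placeholder values for illustration.
    """
    best = 0
    # A diagonal is fixed by the offset between target and query positions.
    for offset in range(-(len(query) - 1), len(target)):
        running = 0
        for i in range(len(query)):
            j = i + offset
            if j < 0 or j >= len(target):
                continue  # this query position has no partner on the diagonal
            running += match if query[i] == target[j] else mismatch
            running = max(running, 0)  # restart the segment once it drops below zero
            best = max(best, running)
    return best
```

Every query is compared against every target, so the work grows with the product of the two set sizes. This matches the note above that the mode suits single or few queries and scales worse with many.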
Features
- Improved README
- Added support for qtmscore and ttmscore in convertalis --format-output
- LDDT computation is now faster
Bug fixes
- --greedy-best-hit search mode is now correctly exposed. Thanks @Pooryamb!
- Removed ANISOU parsing of PDB
- Added missing Foldseek specific convertalis --format-output options to help text
Developers and Maintainers
Foldseek now requires Rust to compile. Please make sure Rust 1.68 or newer is installed, as we have observed issues with 1.64. You can pass -DIGNORE_RUST_VERSION=1 to CMake to ignore the check. Please ensure the Foldseek regression test in ./regression/run_regression.sh passes before shipping Foldseek packages. We also require at least CMake 3.15 now.
Foldseek 5-53465f0
At a glance: Compressed C-alpha coordinates, now enabled by default, greatly decrease the resource consumption of large databases. Otherwise, this release is mostly housekeeping.
Features
- Compressed C-alpha coordinates are now enabled by default
- Foldseek now deals correctly with modified amino acids and HETATMS
- Exhaustive search mode that skips prefiltering with --exhaustive-search 1
- TM-align speed up by replacing score_fun8_standard
Bug fixes
- Disable gap-specific profiles for structure alignments
- C-alpha coordinates were not correctly preloaded in the alignment stage
- Reciprocal best hit search now disables new scoring and compositional bias correction for consistent scores in both directions
- Fixed various bugs around compressed C-alpha coordinates
- Computed RMSD was wrong
- Load the DB in memory before aligning (structurealign performance issue)
- Alignment now uses the correct --comp-bias-corr-scale
- Fix crash with highly compositionally biased sequences
Foldseek 4-645b789
Release at a glance: better hit ranking, critical bug fix, structure clustering, smaller database size and updated AlphaFold Databases.
Features
- foldseek databases now offers the AlphaFoldDB v4 databases.
- We have improved hit ranking in Foldseek by multiplying the 3Di/AA bit-score by the geometric mean of alignment LDDT and TMscore, resulting in more accurate rankings.
- The --format-output prob parameter now returns the probability of homology.
- The --format-mode 5 flag generates PDB files with all Cα atoms superimposed based on the aligned coordinates onto the query structure.
- We have added a faster computation for LDDT, available with the --format-output lddt,lddtfull flag. The lddt flag outputs the average LDDT score for all Cα, while the lddtfull flag outputs a string of LDDT scores for each Cα.
- The --coord-store-mode 2 parameter allows lossless storage of C-alpha coordinates in compressed format.
- TMalign mode (--alignment-type 1) now uses the 3Di/AA as a prefilter to improve the precision and recall of TMalign, which also makes the TMalign mode much faster.
- We have added support for reading in Foldcomp databases (see foldcomp.foldseek.com).
- The database module now includes an option to download ESMAtlas30.
- We have added support for easy-cluster, a tool to cluster structural datasets using 3Di/AA alignment, LDDT, and TMscore.
- We have added support for profile searches as well as iterative searches using the --num-iterations flag.
- TMalign results can now be sorted by qTM, tTM, min(qTM, tTM), max(qTM, tTM), and avg(qTM, tTM) using the --sort flag.
- New module compressca: converts an uncompressed Cα database to compressed format.
- New module convert2pdb: converts a Foldseek structure database to a multi-model PDB file.
- We added our PDB100 update pipeline to util/update_webserver_pdb
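The five TMalign orderings listed above (qTM, tTM, min, max, avg) differ only in how the two normalized TM-scores of a hit are combined into a sort key. A minimal Python sketch, assuming hits are available as (target, qTM, tTM) tuples; the mode names and the function are hypothetical and do not correspond to the values accepted on the command line:

```python
# Hypothetical mode names; each maps (qTM, tTM) to a single sort key.
SORT_KEYS = {
    "qTM": lambda q, t: q,
    "tTM": lambda q, t: t,
    "min": lambda q, t: min(q, t),
    "max": lambda q, t: max(q, t),
    "avg": lambda q, t: (q + t) / 2.0,
}


def sort_hits(hits, mode="avg"):
    """Sort (target, qTM, tTM) tuples from best to worst under the given mode."""
    key = SORT_KEYS[mode]
    return sorted(hits, key=lambda h: key(h[1], h[2]), reverse=True)


hits = [("a", 0.9, 0.3), ("b", 0.6, 0.7), ("c", 0.5, 0.5)]
by_avg = [h[0] for h in sort_hits(hits, "avg")]
by_min = [h[0] for h in sort_hits(hits, "min")]
```

Hit "a" covers the query well but the target poorly, so it ranks first by qTM yet last by min(qTM, tTM), which shows why the choice of ordering matters when the two chains differ in length.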
Breaking Change
- 3Di/AA score reported by Foldseek is now bit-score * sqrt(alignment LDDT * alignment TMscore)
- Default sort of TMalign is now average avg(qTM,tTM).
- We do not provide the "Alphafold/UniProt-NO-CA" database anymore, Cα databases are now always required.
- AlphaFoldDB Swiss-Prot and Proteome file names have changed. Downloads for these will stop working on Foldseek versions before this one. More generally, the Cα database format has changed and is incompatible with older Foldseek versions, so none of the v4 databases will work with previous versions.
- The default E-value is now 10.
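The rescaled 3Di/AA score listed among the changes above is the bit-score multiplied by the geometric mean of the alignment LDDT and TMscore. A minimal Python sketch of that arithmetic, with a hypothetical function name and made-up input values:

```python
import math


def rescaled_score(bit_score, lddt, tm_score):
    """bit-score * sqrt(alignment LDDT * alignment TMscore).

    lddt and tm_score are expected in [0, 1], so the result never exceeds
    the raw bit-score.
    """
    return bit_score * math.sqrt(lddt * tm_score)


# Made-up values: two hits with the same bit-score but different structural agreement.
good = rescaled_score(100, 0.9, 0.9)
poor = rescaled_score(100, 0.4, 0.4)
```

Two hits with identical bit-scores are thus separated by how well the aligned residues agree structurally, which is the intended effect on ranking.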
Bug fixes
- We have fixed an issue that resulted in the loss of high-scoring diagonals during the prefilter step.
- The visualization has been fixed for cases where the alignment length is exactly 80.
- We have fixed issues with tar inputs.
Foldseek 3-915ef7d
Features
- Added databases downloads for the AlphaFold Uniprot Protein Structure Database. You can choose between Alphafold/UniProt, Alphafold/UniProt-NO-CA and Alphafold/UniProt50:
  - Alphafold/UniProt: Contains all 214 million entries from the AlphaFold UniProt database, including C-alpha. This database is ~700GB large to download and ~950GB after extraction.
  - Alphafold/UniProt-NO-CA: Excludes C-alphas and is much smaller (~70GB download, ~170GB extracted). However, TM-align based alignments do not work (search --alignment-type 1, tmalign, and convertalis --format-output alntmscore,u,t).
  - Alphafold/UniProt50: Alphafold/UniProt clustered with MMseqs2 to 50% sequence identity and 80% bidirectional coverage (~190GB download). We offer this database in the web server at https://search.foldseek.com.
- Added databases TSV output
- createdb supports downloading structures from Google Cloud Storage. Not enabled by default, see user guide on how to compile Foldseek with GCS support
- PDB offered through databases will be updated regularly. Thanks to @jaylee2000
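The "80% bidirectional coverage" criterion mentioned for Alphafold/UniProt50 means that the alignment has to cover at least 80% of both sequences. A minimal Python sketch of such a check, assuming 1-based inclusive alignment coordinates; the function names are hypothetical and the exact coverage definition used during clustering may differ in detail:

```python
def coverage(start, end, length):
    """Fraction of a sequence covered by an alignment span (1-based, inclusive)."""
    return (end - start + 1) / length


def passes_bidirectional_coverage(q_start, q_end, q_len,
                                  t_start, t_end, t_len, min_cov=0.8):
    """True only if both query and target are covered to at least min_cov."""
    return (coverage(q_start, q_end, q_len) >= min_cov
            and coverage(t_start, t_end, t_len) >= min_cov)


# A 90-residue alignment covers 90% of a 100-residue query ...
same_size = passes_bidirectional_coverage(1, 90, 100, 1, 90, 100)
# ... but only 60% of a 150-residue target, so this pair is rejected.
fragment = passes_bidirectional_coverage(1, 90, 100, 1, 90, 150)
```

Requiring coverage in both directions keeps a short fragment from being clustered with a much longer protein that merely contains it.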
Known issues
- prefilter against large databases such as the AlphaFold Uniprot Protein Structure Database is executed with 6-mers (-k 6). This is less efficient than 7-mers. We will optimize 7-mer parameters in a future release and re-enable automatic k-mer size choice
Bug fixes
- Fixed PDB download
Foldseek 2-8bd520
Features
- implemented reciprocal-best-structure-hit search (rbh and easy-rbh) similar to Monzon et al. preprint
- C-alpha only structures are supported as input (backbone is completed using pulchra)
- convertalis can output a HTML based result viz (--format-mode 3)
Example: foldseek easy-search example/d1asha_ example/ aln.html tmp --format-mode 3
- add support to read structures from tar and tar.gz in createdb, easy-search and easy-rbh.
Example: foldseek easy-rbh UP000005640_9606_HUMAN_v2.tar UP000001940_6239_CAEEL_v2.tar rbh tmp --tar-include '.*pdb'
- convertalis can output C-alpha, TMscore, TM rotation matrices (--format-output qca,tca,alntmscore,u,t respectively)
foldseek easy-search example/ example/ aln tmp --format-output query,target,alntmscore,u,t
cat aln
d2gdma_ d2gdma_ 1.000E+00 1.000,-0.000,0.000,0.000,1.000,0.000,-0.000,-0.000,1.000 -0.000,-0.000,0.000
d2gdma_ d1q1fa_ 7.971E-01 0.299,-0.746,-0.595,0.952,0.192,0.237,-0.062,-0.638,0.768 94.039,-63.738,34.804
d2gdma_ d1cqxa1 6.794E-01 0.694,-0.662,0.283,0.570,0.746,0.345,-0.439,-0.078,0.895 7.534,-93.168,-12.301
- introduce --alt-ali to compute additional sub-optimal alignments for a query-target pair #12
- added Foldseek docker image (supports linux/amd64 and linux/arm64)
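The u and t columns shown in the convertalis output above hold nine rotation values and three translation values per hit. The following Python sketch parses those two fields and applies the transform to a coordinate. It rests on two assumptions that the release notes do not state: that the nine values of u are the 3x3 matrix in row-major order, and that the transform is applied as x' = u * x + t. Which of the two structures is moved onto the other should be checked against the Foldseek documentation. The function names are hypothetical.

```python
def parse_transform(u_field, t_field):
    """Parse the comma-separated u (9 values) and t (3 values) output fields."""
    u = [float(x) for x in u_field.split(",")]
    t = [float(x) for x in t_field.split(",")]
    if len(u) != 9 or len(t) != 3:
        raise ValueError("expected 9 rotation and 3 translation values")
    # Assumption: the nine values are the 3x3 matrix in row-major order.
    rotation = [u[0:3], u[3:6], u[6:9]]
    return rotation, t


def apply_transform(rotation, translation, point):
    """Return rotation * point + translation for a single 3D coordinate."""
    return [
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]


# Fields copied from the self-hit (d2gdma_ against itself) in the output above.
rotation, translation = parse_transform(
    "1.000,-0.000,0.000,0.000,1.000,0.000,-0.000,-0.000,1.000",
    "-0.000,-0.000,0.000",
)
moved = apply_transform(rotation, translation, [1.5, -2.0, 3.0])
```

For the self-hit the matrix is the identity and the translation is zero, so the coordinate comes back unchanged, which is a quick sanity check for any parser of these fields.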
Bug fixes
Foldseek Release 1-3c64211
First release of Foldseek
Foldseek enables fast and sensitive comparisons of large structure sets. It reaches sensitivities similar to state-of-the-art structural aligners while being at least 20,000 times faster.
Publications
Webserver
Search your protein structures against the AlphaFoldDB and PDB in seconds using our Foldseek webserver: