Paper Summary: Foldseek

Foldseek: Fast Structure Search

Fast and accurate protein structure search with Foldseek. van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, Söding J, Steinegger M. Nat Biotechnol. 2024 Feb;42(2):243-246.

if we include models from AlphaFold and ESMFold, hundreds of millions of structures are now available, making the need for fast structure comparison ever more urgent
searching for structure similarity is typically much slower than searching for sequence similarity, yet it can reveal relationships at greater evolutionary distances, where sequence similarity may be undetectable
“linearizing” a structure by describing it as a sequence of characters in a structural alphabet allows using speed optimizations developed for amino acid sequence searches (...it's actually much more complicated than that, but that's the main idea...)
others have had similar ideas of using structural alphabets for fast comparison, so it may be other details of this work that provide the secret sauce, not least the group's dominance in fast sequence search (MMseqs2)
Foldseek has a similar sensitivity to popular existing methods of structure comparison but is 4-5 orders of magnitude faster

What Makes Searching for Similar Structures Slow?

sequence comparison methods use fast prefilters to eliminate many candidates, whereas structure comparison methods generally have not done so
most structure comparisons rely heavily on superposition rather than transformation-independent descriptors
structure similarity is non-local, in that changing the alignment of part of the structure also changes the alignment for the rest of the structure; optimizing a superposition typically uses iteration (minimization along a gradient) or stochastic methods (Monte Carlo sampling)

The Foldseek Alphabet Emphasizes 3° Interactions

the 20 states of the 3D interaction (3Di) alphabet describe the relationship of a residue i to its closest residue j, where "closest" is measured between virtual centers; the position of a virtual center is defined by angles and distances relative to N, Cα, and Cβ, and was optimized for search sensitivity and to emphasize 3° interactions
the 20 states were learned based on 10 features: 7 angles (cosines) between vectors relating pairs of α-carbons (i-1,i,i+1,j-1,j,j+1), the i-j distance, and 2 features based on i-j sequence separation
other structural alphabets have been more focused on backbone conformations (φ,ψ etc.) → more dependency between consecutive letters, less evenly distributed state frequencies (more false positives), and lower information density in conserved cores relative to less-conserved loops
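
A minimal sketch of the descriptor idea, assuming plain Cα coordinates stand in for the optimized virtual centers. The function names, the feature names, and the particular angles chosen are hypothetical illustrations; the paper's exact seven angles and its two sequence-separation encodings are not reproduced here.

```python
import numpy as np

def nearest_partner(centers: np.ndarray) -> np.ndarray:
    """For each residue i, return the index j of the closest other residue."""
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a residue cannot be its own partner
    return dist.argmin(axis=1)

def _unit(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def pair_features(ca: np.ndarray, i: int, j: int) -> dict:
    """Illustrative subset of descriptors for interior residues i and j.

    Assumption: only three of the angle cosines are shown, and the raw
    sequence separation replaces the paper's two separation features.
    """
    u_i_in, u_i_out = _unit(ca[i] - ca[i - 1]), _unit(ca[i + 1] - ca[i])
    u_j_in, u_j_out = _unit(ca[j] - ca[j - 1]), _unit(ca[j + 1] - ca[j])
    return {
        "cos_bend_i": float(u_i_in @ u_i_out),   # backbone direction change at i
        "cos_bend_j": float(u_j_in @ u_j_out),   # backbone direction change at j
        "cos_i_vs_j": float(u_i_out @ u_j_out),  # relative orientation of the two strands
        "distance": float(np.linalg.norm(ca[j] - ca[i])),
        "seq_separation": j - i,
    }
```

Because every feature is a cosine, a distance, or an index difference, the descriptor does not change when the whole structure is rotated or translated, which is what allows comparison without superposition.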

Target Matching

MMseqs2 prefilter: find matches with two 3Di-similar k-mers (6- or 7-mers depending on database size) on the same diagonal → keep if ungapped alignment score ≥ 15 bits
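
The double-match-on-a-diagonal criterion can be sketched as follows. This is a simplification under stated assumptions: exact k-mer identity stands in for the similarity-scored k-mer matching that MMseqs2 performs, overlapping k-mers count as separate matches, and the ungapped-alignment score threshold is omitted. The function name is hypothetical.

```python
from collections import defaultdict

def double_kmer_diagonals(query: str, target: str, k: int = 6) -> set:
    """Return diagonals (i - j) carrying at least two k-mer matches.

    Assumption: exact matches only; the real prefilter also accepts
    k-mers that are similar under the 3Di substitution matrix.
    """
    # Index every k-mer start position in the target.
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)

    # Collect query start positions of matches, grouped by diagonal.
    hits = defaultdict(set)
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            hits[i - j].add(i)

    return {diag for diag, starts in hits.items() if len(starts) >= 2}
```

For example, `double_kmer_diagonals("ABCDEFGH", "XXABCDEFGH")` returns `{-2}`, since three 6-mer matches share the diagonal where the query is shifted by two positions. Requiring two matches on one diagonal discards most chance single hits before any alignment is computed.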
prefilter-passing candidates are aligned by either:
- Smith-Waterman local alignment with a combined 3Di and amino acid substitution score (default, "Foldseek")
- global alignment with an optimized (~1.7x faster) version of TM-align ("Foldseek-TM")
the SW bit score is corrected for compositional bias, and the bit score obtained by aligning the reversed query against the target is subtracted to suppress high-scoring false positives
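
A toy sketch of the default alignment mode with the reverse-score correction. Everything numeric here is an assumption for illustration: match/mismatch values replace the real 3Di and amino acid substitution matrices, the gap penalty is linear rather than affine, raw scores are not converted to bits, and the compositional-bias correction is left out. Reversing the query (rather than the target) is also assumed.

```python
def sw_score(q3di: str, qaa: str, t3di: str, taa: str,
             gap: int = 2, w3di: int = 2, waa: int = 1, mismatch: int = -1) -> int:
    """Best Smith-Waterman local score using a combined 3Di + amino acid score."""
    prev = [0] * (len(t3di) + 1)
    best = 0
    for i in range(1, len(q3di) + 1):
        cur = [0]
        for j in range(1, len(t3di) + 1):
            # Combined substitution score: 3Di term plus amino acid term.
            s = (w3di if q3di[i - 1] == t3di[j - 1] else mismatch) \
                + (waa if qaa[i - 1] == taa[j - 1] else mismatch)
            cell = max(0, prev[j - 1] + s, prev[j] - gap, cur[j - 1] - gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best

def reverse_corrected_score(q3di: str, qaa: str, t3di: str, taa: str) -> int:
    """Forward score minus the score of the reversed query against the target."""
    forward = sw_score(q3di, qaa, t3di, taa)
    reverse = sw_score(q3di[::-1], qaa[::-1], t3di, taa)
    return forward - reverse
```

A reversed sequence keeps the same composition but cannot be a genuine homolog, so whatever it scores is treated as background. A low-complexity pair such as `reverse_corrected_score("AAAA", "KKKK", "AAAA", "KKKK")` therefore drops to 0.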
hits are ranked by "structural bit score," the SW bit score multiplied by the geometric mean of the local distance difference test (LDDT) and alignment template modeling score (TM-score)
- LDDT is based on intramolecular Cα-Cα distances within 15 Å of a query residue versus those of its aligned counterpart in the target; the final score is the average fraction of conserved distances over the four tolerance thresholds 0.5, 1, 2, and 4 Å (details...)
- TM-score is a length-independent measure of global structural similarity between 0 and 1, computed from the distances between aligned residue pairs after superposition (details...)
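
The ranking score and its two ingredients can be sketched as below, assuming the aligned Cα coordinates are already paired row by row. This is not Foldseek's implementation: the LDDT here pools all residue pairs instead of averaging per residue, the TM-score is evaluated for one given superposition rather than searched over superpositions, and the short-chain fallback for d0 is an assumption. Function names are hypothetical.

```python
import math
import numpy as np

def lddt(query_ca, target_ca, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Fraction of query Cα-Cα distances (< cutoff) preserved in the target,
    averaged over the four tolerance thresholds."""
    q, t = np.asarray(query_ca, float), np.asarray(target_ca, float)
    dq = np.linalg.norm(q[:, None] - q[None, :], axis=-1)
    dt = np.linalg.norm(t[:, None] - t[None, :], axis=-1)
    mask = (dq < cutoff) & ~np.eye(len(q), dtype=bool)  # skip self-distances
    if not mask.any():
        return 0.0
    delta = np.abs(dq - dt)[mask]
    return float(np.mean([(delta < tol).mean() for tol in thresholds]))

def tm_score(query_ca, target_ca, norm_length):
    """TM-score of aligned, already superimposed coordinates."""
    q, t = np.asarray(query_ca, float), np.asarray(target_ca, float)
    # Standard length-dependent scale d0; 0.5 for short chains is an assumption.
    d0 = 1.24 * (norm_length - 15) ** (1.0 / 3.0) - 1.8 if norm_length > 21 else 0.5
    d = np.linalg.norm(q - t, axis=1)
    return float(np.sum(1.0 / (1.0 + (d / d0) ** 2)) / norm_length)

def structural_bit_score(bits, lddt_score, tm):
    """SW bit score scaled by the geometric mean of LDDT and TM-score."""
    return bits * math.sqrt(lddt_score * tm)
```

The geometric mean falls toward zero if either measure is poor, so a hit needs both local distance agreement (LDDT) and global agreement (TM-score) to keep most of its bit score.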
Foldseek E values are calculated using an extreme-value distribution with parameters predicted by a neural network based on 3Di composition and query length; Foldseek-TM reports TM-score normalized by query length
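
A hedged sketch of turning a score into an E value with a Gumbel-type extreme-value distribution. The location `mu` and scale `lam` are supplied by the caller here; in Foldseek they come from the neural network, which is not modeled. The function name and the multiplication by database size are assumptions based on common practice in sequence search tools.

```python
import math

def evalue(score: float, mu: float, lam: float, db_size: int) -> float:
    """Expected number of chance hits scoring >= score among db_size targets.

    P(S >= s) = 1 - exp(-exp(-lam * (s - mu))) for a Gumbel distribution.
    """
    # expm1 keeps precision when the tail probability is very small.
    p_value = -math.expm1(-math.exp(-lam * (score - mu)))
    return db_size * p_value
```

Because the parameters depend on the query's 3Di composition and length, the same raw score can map to different E values for different queries.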

Learning the 20 States of the 3Di Alphabet

a Vector-Quantized Variational Autoencoder (VQ-VAE) was used to learn 20 states maximally conserved in evolution (based on structurally aligned residues and their 10-dimensional vectors)
the encoder network embeds the 10-D vector of a residue into a 2D continuous latent space; the embedding is discretized to its nearest centroid (each centroid represents one 3Di state); the decoder network predicts the probability distribution of the aligned residue's descriptor
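
The discretization step alone can be sketched as a nearest-centroid lookup. The encoder and decoder networks are not shown, and the state-to-letter mapping below is a hypothetical placeholder, since the summary does not give the real 3Di letter assignment.

```python
import numpy as np

# Hypothetical mapping from state index to letter (assumption for illustration).
LETTERS = "ACDEFGHIKLMNPQRSTVWY"

def quantize(latents: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Assign each 2D embedding to the index of its nearest centroid."""
    dist = np.linalg.norm(latents[:, None, :] - centroids[None, :, :], axis=-1)
    return dist.argmin(axis=1)

def to_3di_string(states) -> str:
    """Render a sequence of state indices as a string of letters."""
    return "".join(LETTERS[s] for s in states)
```

With 20 centroids in the latent plane, every residue of a structure maps to one of 20 letters, giving a string of the same length as the amino acid sequence.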
a BLOSUM-like substitution matrix for 3Di sequences was calculated from pairs of structurally aligned residues
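
A minimal sketch of a BLOSUM-style log-odds calculation from aligned state pairs. The half-bit scaling (`scale=2.0`), the pseudocount, and the function name are assumptions chosen for illustration; the paper's exact procedure may differ.

```python
import math
from collections import Counter

def log_odds_matrix(pairs, alphabet, scale=2.0, pseudocount=1.0):
    """Rounded scores scale * log2(p_ab / (p_a * p_b)) from aligned pairs."""
    counts = Counter()
    for a, b in pairs:
        # Count each pair in both orders so the matrix is symmetric.
        counts[(a, b)] += 0.5
        counts[(b, a)] += 0.5

    total = sum(counts.values()) + pseudocount * len(alphabet) ** 2
    joint = {(a, b): (counts[(a, b)] + pseudocount) / total
             for a in alphabet for b in alphabet}
    background = {a: sum(joint[(a, b)] for b in alphabet) for a in alphabet}

    return {(a, b): round(scale * math.log2(joint[(a, b)] / (background[a] * background[b])))
            for a in alphabet for b in alphabet}
```

States that are aligned to each other more often than their background frequencies predict get positive scores, and rarely aligned states get negative ones, which is the form local alignment scoring expects.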

Foldseek Sensitivity and Speed vs. Other Methods

Foldseek and Foldseek-TM vs. six structure-alignment tools, Geometricus alignment-free structure search, and MMseqs2 sequence search:

a: sensitivity (fraction of TP superfamily members found before first FP) on SCOPe40 benchmark; Dali and TM-align are slightly more sensitive than Foldseek
b: precision-recall curve; Foldseek-TM improves over TM-align in that the prefilter suppresses high-scoring FPs
c: sensitivity vs. speed on SCOPe40 (11,211 domains)
d: coverage (fraction of residues aligned) for multidomain structures from AlphaFold DB (AFDB); times for all-by-all comparisons of 100 structures
e: alignment quality for top five matches per query in the multidomain (AFDB) and single-domain (HOMSTRAD) benchmarks. Sensitivity = TP residues in alignment / query length; precision = TP residues / alignment length.
f: alignment quality of Foldseek vs. Dali for HOMSTRAD, where F₁ is the harmonic mean of sensitivity and precision
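
The metrics named in panels a, e, and f can be written out as follows. This is a sketch of the definitions as stated above, with hypothetical function names; the benchmark's rules for which hits count as TP or FP, or are ignored, are not modeled.

```python
def sensitivity_to_first_fp(ranked_is_tp, n_true: int) -> float:
    """Fraction of true positives ranked ahead of the first false positive.

    ranked_is_tp: booleans for the hits in ranked order (True = TP).
    n_true: total number of true positives that could be found.
    """
    found = 0
    for is_tp in ranked_is_tp:
        if not is_tp:
            break
        found += 1
    return found / n_true if n_true else 0.0

def alignment_quality(tp_residues: int, query_length: int, alignment_length: int):
    """Return (sensitivity, precision, F1) for one alignment."""
    sensitivity = tp_residues / query_length
    precision = tp_residues / alignment_length
    denom = sensitivity + precision
    f1 = 2 * sensitivity * precision / denom if denom else 0.0
    return sensitivity, precision, f1
```

For instance, an alignment of 50 columns with 40 correctly aligned residues on a 100-residue query has sensitivity 0.4 and precision 0.8; the harmonic mean penalizes the imbalance more than an arithmetic mean would.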

Are False Positives Really True Positives?

AlphaFold DB search "false positives" (Foldseek bits per column >1.0 but TM-score <0.5) actually identify homologous domains in structures that do not superimpose well globally
these were handpicked (but presumably not cherry-picked) outliers from the graph of bits per column vs. TM-score
the PAE plots show the reliability of AlphaFold-predicted structures is high for the individual domains but not the global conformations (i.e. the linkers between the domains are probably disordered anyway)
