Paper Summary: AlphaFold + Ligands, Etc.

AlphaFold + Ligands, Etc.

AlphaFill: enriching AlphaFold models with ligands and cofactors. Hekkelman ML, de Vries I, Joosten RP, Perrakis A. Nat Methods. 2023 Feb;20(2):205-213. [website] [github]

AlphaFold-latest includes modeling nucleic acids, small molecules, and covalent modifications along with proteins
[Oct 2023 press release] [drug discovery angle] [results PDF]

Like AlphaFill, several other methods and databases build upon AlphaFold models, e.g.:

SymProFold (AlphaFold-Multimer + symmetry → viral capsids, bacterial S-layers)
PrePPI [website] (AlphaFold models split into domains → complexes predicted by homology)
AF2Complex
AF-Cluster [github] (AlphaFold + sequence clusters → multiple conformational states)

AlphaFill

The AlphaFill database contains AlphaFold models with ligands and ions added in based on homologous experimental structures of complexes.

Procedure: for each structure in the AlphaFold database,

run a BLAST search for homologs in PDB-REDO*; structures with >25% identity over an alignment of ≥85 residues are considered sufficiently homologous
the most common ligands and cofactors (2694 residue types) are considered as candidates for transplantation
the model and the crystal structure are superimposed:
- globally to give overall Cα RMSD
- locally using backbone atoms within 6 Å of the ligand
the ligand is transplanted into the model unless the same compound was already placed similarly (centroid within 3.5 Å) based on some other crystal structure
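
The homology cutoff and the duplicate check in the procedure above can be sketched as follows. This is only an illustration, not the AlphaFill code: the dictionary keys, the function names, and the choice to process hits from highest to lowest identity are assumptions, and the ligand coordinates are taken to be already superimposed onto the model.

```python
import math

MIN_IDENTITY = 0.25   # ">25% identity" cutoff from the procedure
MIN_ALIGNED = 85      # minimum alignment length in residues
DEDUP_RADIUS = 3.5    # centroid distance (Å) that marks a duplicate placement


def is_homolog(identity, aligned_length):
    """Return True if a BLAST hit passes both homology cutoffs."""
    return identity > MIN_IDENTITY and aligned_length >= MIN_ALIGNED


def centroid(coords):
    """Mean position of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def select_transplants(candidates):
    """Keep a ligand unless the same compound was already placed nearby.

    Each candidate is a dict with hypothetical keys: compound, coords
    (ligand atoms after superposition), identity (fraction), aligned_length.
    """
    placed = []
    # Assumption: hits are visited from most to least similar.
    ordered = sorted(candidates, key=lambda c: c["identity"], reverse=True)
    for cand in ordered:
        if not is_homolog(cand["identity"], cand["aligned_length"]):
            continue
        center = centroid(cand["coords"])
        duplicate = any(
            p["compound"] == cand["compound"]
            and math.dist(p["center"], center) <= DEDUP_RADIUS
            for p in placed
        )
        if not duplicate:
            placed.append({"compound": cand["compound"], "center": center})
    return placed
```

With two heme candidates whose centroids lie 1.4 Å apart, only the one from the more similar hit would be kept, while a zinc ion elsewhere in the model would still be transplanted.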

12,029,789 transplants were performed on 995,411 AlphaFold models and results are available together with associated validation metrics in the database.

*PDB-REDO is a database of updated and optimised versions of crystallographic structures in the PDB. It also provides a server to improve uploaded coordinates, given crystallographic data (an MTZ file). PDB-REDO is developed by the same group as AlphaFill; I don't know how much the structures therein differ from the original PDB.

AlphaFill Metrics

Local Environment Validation (LEV) = all-atom RMSD for ligand + protein within 6 Å of ligand
- can only be calculated for sequence-identical pairs, but correlates well with local backbone-only RMSD (panel b), which can be calculated for all pairs
- not really validation IMO, more like a confidence metric
Transplant Clash Score (TCS) = RMS VDW overlap over (polyatomic) ligand-protein atom pairs within 4 Å; YASARA minimization reduces TCS unless it's already very low (panel f)
Confidence levels for local RMSD, with thresholds derived from its third quartile and interquartile range: high < 0.92 Å (65% of transplants) → medium → low > 3.1 Å (10%)
Similarly for TCS: high < 0.64 Å (81%) → medium → low > 1.27 Å (~0%)
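
A rough sketch of how these metrics and confidence bins could be computed is given below. The thresholds are the ones listed above; everything else is an assumption made for illustration: the function names, the atom representation, and in particular the per-pair overlap formula (sum of van der Waals radii minus the distance, floored at zero), which the summary above does not spell out.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally ordered, already superimposed coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def clash_score(ligand_atoms, protein_atoms, cutoff=4.0):
    """RMS van der Waals overlap over ligand-protein atom pairs within `cutoff` Å.

    Each atom is ((x, y, z), vdw_radius). Assumed overlap for a pair:
    max(0, r_ligand + r_protein - distance).
    """
    overlaps = []
    for pos_l, r_l in ligand_atoms:
        for pos_p, r_p in protein_atoms:
            d = math.dist(pos_l, pos_p)
            if d <= cutoff:
                overlaps.append(max(0.0, r_l + r_p - d))
    if not overlaps:
        return 0.0
    return math.sqrt(sum(o * o for o in overlaps) / len(overlaps))


def confidence(value, high_below, low_above):
    """Bin a metric (lower is better) into high / medium / low confidence."""
    if value < high_below:
        return "high"
    if value > low_above:
        return "low"
    return "medium"


def local_rmsd_confidence(value):
    return confidence(value, 0.92, 3.1)    # Å, thresholds listed above


def tcs_confidence(value):
    return confidence(value, 0.64, 1.27)   # Å, thresholds listed above
```

The same `rmsd` helper would serve for LEV (all atoms of ligand plus nearby protein) and for the backbone-only local RMSD; only the atom selection differs.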

AlphaFill Website

Website user interface shown at left. Another example:
- A0A024RA31 human aquaporin-1
- different % identity levels
- high RMSD and/or TCS highlighted
- click row to zoom in on ligand, put cursor in 3D view to see residue info
- all ligands shown initially; hide all, then show one at a time; can minimize if high TCS
List of compounds
Does not include polymeric ligands or posttranslational modifications
3D-Beacons API implemented for integration in other websites (already used in PDBe-Knowledge Base)

AlphaFold-Latest

Previous iterations:

AlphaFold 2 – single-chain protein structure prediction
AlphaFold-Multimer – multichain protein complexes
AlphaFold 2.3 – larger complexes

AlphaFold-Latest

They don't subscribe to modesty in describing their own results!

...from the press release:

"It unlocks new understanding and significantly improves accuracy in multiple key biomolecule classes, including ligands (small molecules), proteins, nucleic acids (DNA and RNA), and those containing post-translational modifications (PTMs)."

...from the “results-only progress report” PDF:

far greater accuracy on protein-ligand interactions than state-of-the-art docking tools
much higher accuracy on protein-nucleic acid interactions than specialist predictors like RoseTTAFold2NA
significantly higher antibody-antigen prediction accuracy than AlphaFold-Multimer

Further, "During preparation of this manuscript... RoseTTAFold All-Atom was released... This system is not available for baselining at the time of writing, but the RoseTTAFold All-Atom paper indicates their accuracy is below specialist predictors in almost all categories."

AlphaFold-Latest Input and Output

Input: sequences of biopolymer chains, SMILES strings of ligands, sequence locations of any covalently bonded ligands
A “token” is generated for each standard biopolymer residue and each heavy atom of a ligand or nonstandard biopolymer residue; systems of up to 5120 tokens were evaluated
Output: coordinates of the specified structure, with confidence levels per token, per token pair, per chain, per interface, and overall, apparently using some new measure. Confusingly, they don't mention pLDDT (predicted LDDT) or PAE (predicted aligned error), although they do use measured (not predicted) LDDT (Local Distance Difference Test) and pocket-aligned RMSD to validate against “ground truth” structures
- 15-Å inclusion radius used for protein LDDT, 30-Å for nucleic acid LDDT
- pocket-aligned RMSD is ligand heavy-atom RMSD after aligning protein alpha-carbons (or all heavy atoms of nucleic acids) within 10 Å of the ligand in the experimental structure
- DockQ, a protein-protein docking score, also appears in some of their results graphs; it combines the fraction of native contacts with interface and ligand RMSD terms, each scaled by a fixed distance parameter

Evaluation Sets

(All they said about the training set was that it included PDB entries with release dates up to 2021-09-30, except as noted below. The same cutoff date was used for templates.)

Recent PDB: started with the 10,192 entries released between 2022-05-01 and 2023-01-12 (later than any in the training set), expanded to biological assembly 1, filtered to non-NMR with resolution <4.5 Å and <5120 tokens → 8856 complexes. The following were predicted but omitted from scoring:
- ligands deemed to be crystallization aids (detergents, glycerol, ...)
- interfaces and modifications of short peptides (<16 residues)
- DNA/RNA hybrids, peptide nucleic acids, D-polypeptides
- 11 ligand-protein or ion-protein interfaces with RMSD calculation errors
- 8 structures with too many chains to determine pairing in reasonable time
- 3 structures with chains <4 residues
- 1 structure with bad metadata
Low homology: same as Recent PDB, except omitting chains with >40% sequence ID to any in the training set and interfaces where both components have high similarity to components within the same structure in the training set (protein chain >40% sequence ID, ligand Tanimoto >0.5)
PoseBusters benchmark: 428 ligand-bound entries (predicted with a separately trained AlphaFold-latest model that has an earlier training-set cutoff date, 2019-09-30)
Evaluation data were also clustered based on 40% sequence identity
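
The selection rules above can be illustrated with a short sketch. The cutoffs come from the text; the entry dictionary keys, the function names, the inclusive date window, and the use of on-bit sets as ligand fingerprints are all assumptions made for the example, not details from the report.

```python
from datetime import date

MAX_TOKENS = 5120
MAX_RESOLUTION = 4.5                                  # Å
RELEASE_WINDOW = (date(2022, 5, 1), date(2023, 1, 12))


def count_tokens(standard_residues, ligand_heavy_atoms):
    """One token per standard residue plus one per heavy atom of each
    ligand or nonstandard residue (given as a list of atom counts)."""
    return standard_residues + sum(ligand_heavy_atoms)


def in_recent_pdb_set(entry):
    """Apply the Recent PDB filters to an entry dict with hypothetical keys."""
    start, end = RELEASE_WINDOW
    tokens = count_tokens(entry["standard_residues"], entry["ligand_heavy_atoms"])
    return (
        start <= entry["release_date"] <= end   # assumed inclusive window
        and entry["method"] != "NMR"
        and entry["resolution"] < MAX_RESOLUTION
        and tokens < MAX_TOKENS
    )


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def keep_in_low_homology_set(protein_identity, ligand_similarity):
    """Drop a protein-ligand interface only when both partners resemble
    training-set components (>40% sequence ID and Tanimoto >0.5)."""
    return not (protein_identity > 0.40 and ligand_similarity > 0.5)
```

For instance, a 300-residue protein with a 43-atom cofactor and one ion counts as 344 tokens, far below the 5120-token limit.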

Examples with Nucleic Acids, Antibodies, Ligands, Glycosylation

Performance Summary

Their AutoDock Vina test reproduced its reported accuracy on the PoseBusters test set
RoseTTAFold2NA, another machine-learning method, was run with the same MSAs as AlphaFold-latest; due to its limitations, structures with ≥1000 residues or containing both DNA and RNA were excluded from that comparison