AlphaFill
The AlphaFill database
contains AlphaFold models with ligands and ions added based on
homologous experimental structures of complexes.
Procedure: for each structure in the AlphaFold database,
- run a BLAST search for homologs in
PDB-REDO*;
structures with > 25 % identity over an alignment of ≥ 85 residues
are considered sufficiently homologous
- the most common ligands and cofactors (2694 residue types) are
considered as candidates for transplantation
- the model and the crystal structure are superimposed:
  - globally to give overall Cα RMSD
  - locally using backbone atoms within 6 Å of the ligand
- the ligand is transplanted into the model unless the same compound
was already placed similarly (centroid within 3.5 Å) based on
some other crystal structure
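A minimal Python sketch of two steps in the procedure above: the homology cutoff and the "already placed" check. The names (`Hit`, `is_homologous`, `transplant`) are made up for illustration and are not AlphaFill's code; identity is assumed to be a fraction, and "within 3.5 Å" is assumed to be inclusive.

```python
import math
from dataclasses import dataclass

MIN_IDENTITY = 0.25       # "> 25 % identity"
MIN_ALIGNED = 85          # "alignment of >= 85 residues"
MAX_CENTROID_DIST = 3.5   # Å; assumed inclusive


@dataclass
class Hit:
    """Hypothetical record for one BLAST hit against PDB-REDO."""
    identity: float        # fractional sequence identity (0..1)
    aligned_length: int    # number of aligned residues


def is_homologous(hit):
    """Apply the identity and alignment-length cutoffs from the notes."""
    return hit.identity > MIN_IDENTITY and hit.aligned_length >= MIN_ALIGNED


def centroid(coords):
    """Unweighted mean position of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def transplant(candidates):
    """Keep a ligand unless the same compound already sits nearby.

    candidates: list of (compound_id, coords) in the model frame,
    in the order the source crystal structures are processed.
    Returns a list of (compound_id, centroid) for accepted ligands.
    """
    placed = []
    for compound_id, coords in candidates:
        c = centroid(coords)
        duplicate = any(
            pid == compound_id and math.dist(c, pc) <= MAX_CENTROID_DIST
            for pid, pc in placed
        )
        if not duplicate:
            placed.append((compound_id, c))
    return placed
```

Here the first copy of a compound wins; the order in which crystal structures are tried (for example, by identity) is not covered by this sketch.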
In total, 12,029,789 transplants were performed on 995,411 AlphaFold models;
the results, together with associated validation metrics, are available in
the database.
*PDB-REDO is
a database of updated and optimised versions of crystallographic structures
in the PDB. It also provides a server to improve uploaded coordinates,
given crystallographic data (an MTZ file).
PDB-REDO is developed by the same group as AlphaFill;
I don't know how much the structures therein differ from the original PDB.
AlphaFill Metrics
- Local Environment Validation (LEV) = all-atom RMSD for ligand +
protein within 6 Å of ligand
  - can only be calculated for sequence-identical pairs, but
    correlates well with local backbone-only RMSD (panel b),
    which can be calculated for all pairs
  - not really validation IMO, more like a confidence metric
- Transplant Clash Score (TCS) = root-mean-square van der Waals overlap
over (polyatomic) ligand-protein atom pairs within 4 Å;
YASARA minimization reduces TCS unless it's already very low (panel f)
- Confidence bins for local RMSD, with thresholds based on its third
quartile and interquartile range:
high < 0.92 Å (65% of transplants) → medium → low > 3.1 Å (10%)
- Similarly for TCS: high < 0.64 Å (81%) → medium → low > 1.27 Å (~0%)
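A rough Python sketch of the TCS definition and the confidence bins listed above. Two assumptions here are mine, not from the AlphaFill paper: the overlap for a pair is taken as the sum of the two VDW radii minus the distance (zero if negative), and the mean is taken over all pairs within 4 Å, not only the overlapping ones. Function names are illustrative.

```python
import math

PAIR_CUTOFF = 4.0  # Å; only ligand-protein atom pairs this close are counted


def transplant_clash_score(pairs):
    """RMS van der Waals overlap over close ligand-protein atom pairs.

    pairs: iterable of (distance, vdw_radius_a, vdw_radius_b), all in Å.
    Assumption: normalise by the number of pairs within the cutoff.
    """
    close = [p for p in pairs if p[0] <= PAIR_CUTOFF]
    if not close:
        return 0.0
    squared = sum(max(0.0, ra + rb - d) ** 2 for d, ra, rb in close)
    return math.sqrt(squared / len(close))


def confidence(value, high_below, low_above):
    """Three-level bin: high below one threshold, low above the other."""
    if value < high_below:
        return "high"
    if value > low_above:
        return "low"
    return "medium"


def local_rmsd_confidence(rmsd):
    return confidence(rmsd, 0.92, 3.1)   # thresholds in Å, from the notes


def tcs_confidence(tcs):
    return confidence(tcs, 0.64, 1.27)   # thresholds in Å, from the notes
```

For example, two carbon-like atoms (radius 1.7 Å each) at 3.0 Å overlap by 0.4 Å, while the same pair at 3.8 Å contributes zero overlap but still counts in the denominator under the assumption above.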
AlphaFill Website
- Website user interface shown at left. Another example:
  - A0A024RA31 human aquaporin-1
  - different % identity levels
  - high RMSD and/or TCS highlighted
  - click row to zoom in on ligand, put cursor in 3D view to see residue info
  - all ligands shown initially; hide all, then show one at a time;
    can minimize if high TCS
- List of compounds
  - Does not include polymeric ligands or posttranslational modifications
- 3D-Beacons API implemented for integration in other websites
(already used in PDBe-Knowledge Base)
AlphaFold-Latest
AlphaFold-latest adds modeling of nucleic acids, small molecules, and
covalent modifications alongside proteins
[Oct 2023 press release]
[drug discovery angle]
[results PDF]
Previous iterations:
- AlphaFold 2 – single-chain protein structure prediction
- AlphaFold-Multimer – multichain protein complexes
- AlphaFold 2.3 – larger complexes
AlphaFold-Latest
They don't subscribe to modesty in describing their own results!
...from the press release:
- "It unlocks new understanding and significantly improves accuracy in
multiple key biomolecule classes, including ligands (small molecules), proteins,
nucleic acids (DNA and RNA), and those containing post-translational
modifications (PTMs)."
...from the “results-only progress report” PDF:
- far greater accuracy on protein-ligand interactions than state-of-the-art
docking tools
- much higher accuracy on protein-nucleic acid interactions than specialist
predictors like
RoseTTAFold2NA
- significantly higher antibody-antigen prediction accuracy than
AlphaFold-Multimer
- Further,
"During preparation of this manuscript... RoseTTAFold All-Atom was released...
This system is not available for baselining at the time of writing, but the
RoseTTAFold All-Atom paper indicates their accuracy is
below specialist predictors in almost all categories."
AlphaFold-Latest Input and Output
- Input: sequences of biopolymer chains, SMILES strings of ligands,
sequence locations of any covalently bonded ligands
- A “token” is generated for each standard biopolymer residue
and each heavy atom of a ligand or nonstandard biopolymer residue; systems
of up to 5120 tokens were evaluated
- Output: coordinates of the specified structure, with confidence levels
per token, per token pair, per chain, per interface, and overall...
apparently some new measure. It's confusing that they don't mention pLDDT
(predicted LDDT) or PAE (predicted aligned error), although they do use
measured (not predicted) LDDT (Local Distance Difference Test) and
pocket-aligned RMSD for validation against “ground truth” structures
  - 15-Å inclusion radius used for protein LDDT,
    30-Å for nucleic acid LDDT
  - pocket-aligned RMSD is ligand heavy-atom RMSD after aligning
    protein alpha-carbons (or all heavy atoms of nucleic acids)
    within 10 Å of the ligand in the experimental structure
- DockQ is another protein-protein docking score
that appears in some of their results graphs; it combines the fraction of
native contacts with interface RMSD and ligand RMSD terms, each RMSD term
scaled by its own distance parameter
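To make the role of the LDDT inclusion radius concrete, here is a simplified Python sketch. It treats each residue as a single point (say, Cα) and uses the usual 0.5/1/2/4 Å tolerance set; real LDDT works on all atoms, skips same-residue pairs, and is reported per residue, and the report's exact variant may differ. The function name and signature are illustrative.

```python
import math
from itertools import combinations

THRESHOLDS = (0.5, 1.0, 2.0, 4.0)  # Å; standard LDDT tolerances


def lddt(reference, predicted, inclusion_radius=15.0):
    """Simplified global LDDT over one point per residue.

    reference, predicted: equal-length lists of (x, y, z) tuples.
    Only pairs closer than inclusion_radius in the reference count
    (15 Å for proteins, 30 Å for nucleic acids in the notes above).
    Returns the fraction of (pair, threshold) checks that pass.
    """
    if len(reference) != len(predicted):
        raise ValueError("structures must have the same number of points")
    preserved = 0
    total = 0
    for i, j in combinations(range(len(reference)), 2):
        d_ref = math.dist(reference[i], reference[j])
        if d_ref >= inclusion_radius:
            continue  # pair is outside the local neighbourhood
        d_pred = math.dist(predicted[i], predicted[j])
        diff = abs(d_ref - d_pred)
        preserved += sum(diff < t for t in THRESHOLDS)
        total += len(THRESHOLDS)
    return preserved / total if total else 0.0
```

Because only distances are compared, no superposition is needed, which is the main contrast with the pocket-aligned RMSD described above.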
Evaluation Sets
(All they said about the training set was that it included PDB entries with
release dates up to 2021-09-30, except as noted below.
The same cutoff date was used for templates.)
- Recent PDB: started with the 10,192 entries released between
2022-05-01 and 2023-01-12 (later than any in the training set),
expanded to biological assembly 1, filtered to non-NMR with resolution
< 4.5 Å and < 5120 tokens → 8856 complexes.
The following were predicted but omitted from scoring:
  - ligands deemed to be crystallization aids (detergents, glycerol, ...)
  - interfaces and modifications of short peptides (< 16 residues)
  - DNA/RNA hybrids, peptide nucleic acids, D-polypeptides
  - 11 ligand-protein or ion-protein interfaces with RMSD calculation errors
  - 8 structures with too many chains to determine pairing in reasonable time
  - 3 structures with chains < 4 residues
  - 1 structure with bad metadata
- Low homology: same as Recent PDB, except omitting chains with
>40% sequence ID to any in the training set and interfaces where both
components have high similarity to components within the same structure
in the training set (protein chain >40% sequence ID, ligand Tanimoto >0.5)
- PoseBusters benchmark: 428 ligand-bound entries (evaluated with
a separately trained AlphaFold-latest model with an earlier
training-set cutoff date, 2019-09-30)
- Evaluation data were also clustered based on 40% sequence identity
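The Recent PDB and low-homology filters above can be sketched in Python as follows. The `Entry` fields, the inclusive date range, and the NMR test on the method string are assumptions for illustration; the report does not give its implementation. Tanimoto similarity is shown on sets of fingerprint bits, with the fingerprint type left unspecified.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

START, END = date(2022, 5, 1), date(2023, 1, 12)  # assumed inclusive
MAX_RESOLUTION = 4.5   # Å
MAX_TOKENS = 5120


@dataclass
class Entry:
    """Hypothetical summary of one PDB entry (biological assembly 1)."""
    release_date: date
    method: str                   # e.g. "X-RAY DIFFRACTION", "SOLUTION NMR"
    resolution: Optional[float]   # Å; None when not reported
    standard_residues: int        # one token each
    per_atom_tokens: int          # heavy atoms of ligands / nonstandard residues


def token_count(entry):
    """One token per standard residue plus one per ligand heavy atom."""
    return entry.standard_residues + entry.per_atom_tokens


def in_recent_pdb(entry):
    """Date, method, resolution, and size filters from the notes."""
    return (
        START <= entry.release_date <= END
        and "NMR" not in entry.method.upper()
        and entry.resolution is not None
        and entry.resolution < MAX_RESOLUTION
        and token_count(entry) < MAX_TOKENS
    )


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 0.0


def keep_interface(protein_identity, ligand_tanimoto):
    """Low-homology rule: drop only if BOTH components look familiar."""
    return not (protein_identity > 0.40 and ligand_tanimoto > 0.5)
```

Note that under this rule a familiar protein with a novel ligand (or the reverse) still counts as a low-homology interface.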
Examples with Nucleic Acids, Antibodies, Ligands, Glycosylation
Performance Summary
- their AutoDock Vina test reproduced its reported accuracy on
the PoseBusters test set
- RoseTTAFold2NA, another machine-learning method, was run with
the same MSAs as AlphaFold-latest; due to its limitations, the comparison
excluded structures with ≥ 1000 residues or containing both DNA and RNA
Therapeutically Relevant Structures
Low-Homology Dataset
Covalently Modified Structures
Confidence vs. Accuracy (Polymers)
“AlphaFold-latest predicted error – a model-internal
error prediction that can be thought of as an inverse confidence”
Confidence vs. Accuracy (Ligands)