Tool: AlphaFold
AlphaFold is an artificial intelligence method for predicting protein structures
that has been highly successful in recent tests.
The method is described in:
Highly accurate protein structure prediction with AlphaFold.
Jumper J, Evans R, Pritzel A, et al.
Nature. 2021 Aug;596(7873):583-589.
Protein complex prediction with AlphaFold-Multimer.
Evans R, O'Neill M, Pritzel A, et al. bioRxiv 2021.
The ChimeraX AlphaFold tool:
- finds and retrieves existing models from the AlphaFold Database:
AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models.
Varadi M, Anyango S, Deshpande M, et al.
Nucleic Acids Res. 2022 Jan 7;50(D1):D439-D444.
The database contains models for sequences (as single chains, not complexes)
in UniProt:
- Version 1 (Jul 2021, used by ChimeraX 1.3): ~360,000 sequences, reference proteomes of 21 species including Homo sapiens
- Version 2 (Dec 2021 and Jan 2022 releases combined, default in ChimeraX 1.4): ~1 million sequences, v1 + most of SwissProt + sequences relevant to neglected tropical disease or antimicrobial resistance
- Version 3 (Jul 2022): >200 million sequences
- Version 4 (Nov 2022; default in ChimeraX 1.5 and later): bugfix of version 3, updating the coordinates of ~4% of the entries
- runs new AlphaFold predictions on Google Colab using ColabFold, an open-source, optimized version of AlphaFold 2:
ColabFold: making protein folding accessible to all.
Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M.
Nat Methods. 2022 Jun;19(6):679-682.
AlphaFold-predicted structures vary in confidence levels
(see coloring) and should be interpreted with caution.
The related tool AlphaFold Error Plot
plots the predicted errors in interactions between different parts
of an AlphaFold structure.
The AlphaFold tool is also implemented as the
alphafold command.
Several ChimeraX
presentations and
videos show modeling with AlphaFold and related analyses.
See also:
ESMFold,
Blast Protein,
Modeller Comparative,
Model Loops,
Rotamers
AlphaFold Dialog
AlphaFold Coloring Dialog
AlphaFold Error Plot
AlphaFold Dialog
The AlphaFold tool can be opened from the Structure Prediction
section of the Tools menu and manipulated like other panels
(more...).
The Sequence can be specified by
UniProt
name or accession number, pasted in as plain text, or chosen
from the menu of currently open protein structure chains.
Fetch
Search
Predict
Options
Fetch gets the most sequence-similar model available from the
AlphaFold Database
for each specified chain.
Specifying a whole model specifies all of its protein chains.
For each chain, a model is obtained for the exact UniProt entry
if available, otherwise the single top hit identified by
K-mer search of the
AlphaFold Database
(details...).
The corresponding command is
alphafold match.
If the sequence was specified by structure chain, then:
- the chain ID of the predicted structure is made the same as the
corresponding chain of the existing model
- the predicted structure is superimposed onto the existing chain using
matchmaker,
and the following are reported in a table
in the Log:
- Chain – chain ID
- UniProt Name and UniProt Id (accession number)
- RMSD – Cα root-mean-square deviation between the predicted and experimental structures, over all residues of the latter
- Length – number of residues in the predicted structure
- Seen – number of residues with atomic coordinates in the experimental structure
- % Id – percent identity in the sequence alignment generated by matchmaker for superposition: the number of positions with identical residues divided by the length of the shorter sequence
- by default,
the predicted structure is trimmed to the same residue range as the
existing chain (details...)
- the following attributes
are assigned to the residues of the predicted structure:
- c_alpha_distance
– Cα distance between corresponding positions of the predicted
and existing chains after their superposition with matchmaker (described above)
- missing_structure
– positions missing from the coordinates of the existing chain
- same_sequence
– positions with the same residue type as the existing chain
These attributes
can be used for coloring and other purposes.
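The % Id value described above (identical positions divided by the length of the shorter sequence) can be sketched in a few lines of Python. This is only an illustration of the stated formula, not ChimeraX code: the function name percent_identity, the use of "-" as the gap character, and the example sequences are assumptions made for the sketch.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Identical aligned positions divided by the length of the shorter sequence, as a percentage.

    Both arguments are rows of a pairwise alignment (equal length, gaps as `gap`).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Count columns where both rows hold the same residue; gap columns never count.
    identical = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap)
    # Ungapped lengths of the two sequences.
    len_a = len(aligned_a) - aligned_a.count(gap)
    len_b = len(aligned_b) - aligned_b.count(gap)
    shorter = min(len_a, len_b)
    if shorter == 0:
        raise ValueError("sequences must not be empty")
    return 100.0 * identical / shorter


if __name__ == "__main__":
    # Hypothetical alignment: 6 identical columns, shorter sequence has 7 residues.
    print(round(percent_identity("MKTAYIAK", "MKT-YLAK"), 1))
```

Dividing by the shorter sequence means that a short experimental chain matched against a long predicted structure can still report a high % Id.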
The fetched models are stored locally in
~/Downloads/ChimeraX/AlphaFold/,
where ~ indicates a user's home directory.
If a file specified for opening is not found in this local cache,
the file will be fetched and cached.
Search
uses a BLAST web service
hosted by the UCSF RBVI to search the
AlphaFold Database
using default parameters: BLOSUM62
amino acid similarity matrix for scoring the hits,
similarity score cutoff e-value 1e-3,
returning a maximum of 100 unique sequences.
However, different values of these parameters can be specified
using the corresponding command,
alphafold search.
Search differs from Fetch in that it uses BLAST instead
of fast (but low-sensitivity)
K-mer searching,
accepts only a single chain or sequence as input,
and returns a list of hits for the user to inspect,
rather than fetching the single top hit per chain automatically.
When results are returned, the hits are listed in a
Blast Protein window.
Double-clicking a hit uses
alphafold fetch
to retrieve the model,
or multiple chosen hits can be retrieved at once by using the results panel
context menu
or Load Structures button
(details...).
Predict runs a calculation on
Google Colab using ColabFold,
an open-source, optimized version of
AlphaFold 2.
The corresponding command is
alphafold predict.
Users should cite:
ColabFold: making protein folding accessible to all.
Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M.
Nat Methods. 2022 Jun;19(6):679-682.
For predicting a complex (multimer), the sequences of all chains in the complex
must be given. The same sequence must be given multiple times if it occurs in
multiple copies in the complex. The sequences can be specified
either collectively as a model number chosen from the menu of currently
open models (e.g. when that model contains multiple chains), or
individually within a comma-separated list of
UniProt
identifiers or pasted-in amino acid sequences.
Prediction may only be feasible for smaller complexes
(details...).
A warning will appear saying that this Colab notebook is from GitHub
(was not authored by Google), with a button to click to run anyway.
Users will need to have a Google account and to sign into it via
a browser. Once that is done, the sign-in may be remembered depending
on the user's browser settings; it is not kept in the
ChimeraX preferences.
See the example video for an explanation of the images/plots
from ColabFold that appear in the Colab window
and where to find downloaded files.
The Options button shows/hides additional options:
- Results directory
(default ~/Downloads/ChimeraX/AlphaFold/prediction_[N])
– the pathname (name and location) of a folder or directory
in which to store prediction results. Clicking Browse
brings up a file browser window for choosing it interactively.
The directory does not need to exist already, as it will be
created by running the prediction.
As shown in the default, the pathname can include [N] to indicate
substitution with the smallest positive integer that makes a new directory.
If the specified pathname does not include [N]
but a directory of that name and location already exists
and contains a results.zip file, _[N] will be appended automatically
to avoid overwriting the existing directory.
- Use PDB templates when predicting structures (default off)
– AlphaFold can use up to four structures as templates;
when this option is on, ColabFold will search the PDB
sequences for similarity to the target and report in the Colab log
which entries (if any) are used as templates
- Energy-minimize predicted structures (default off)
– leaving this off allows for faster job completion and/or
avoidance of failures that may occur during minimization
- Trim fetched structure to the aligned structure sequence
(default on)
– whether to trim a fetched structure to the same residue range
as the structure to which it is aligned
(details...)
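The [N] substitution described for the Results directory option can be sketched as follows. This is a minimal illustration of the stated rule, not the ChimeraX implementation; the function name resolve_results_dir is hypothetical.

```python
import os
import tempfile


def resolve_results_dir(pathname: str) -> str:
    """Return the directory a prediction would use, following the [N] rule described above."""
    path = os.path.expanduser(pathname)
    if "[N]" not in path:
        # Without [N], a suffix is added only when an earlier run left results.zip behind.
        if not os.path.isfile(os.path.join(path, "results.zip")):
            return path
        path += "_[N]"
    # Substitute the smallest positive integer that gives a path not yet in use.
    n = 1
    while os.path.exists(path.replace("[N]", str(n))):
        n += 1
    return path.replace("[N]", str(n))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        os.mkdir(os.path.join(tmp, "prediction_1"))
        # prediction_1 already exists, so the next free name is chosen.
        print(resolve_results_dir(os.path.join(tmp, "prediction_[N]")))
```

The sketch only computes the name; as noted above, the directory itself is created when the prediction runs.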
The model for a sequence that was specified by structure chain will
be superimposed on that chain and assigned structure-comparison attributes
for further analysis
(details...).
Please note the following caveats of running a prediction:
- Results may be lost if the local computer goes to sleep.
Google intends Colab to be for interactive use. Even if the Colab job completes,
the results may fail to download to the local computer if it has gone to
sleep. It is recommended to turn off the option to enter sleep mode
(meant to conserve power after some amount of idle time)
before running a prediction.
- The process includes
installing various software packages on a virtual machine,
searching sequence databases, generating a multiple sequence alignment,
predicting atomic coordinates, and optionally,
energy-minimizing the best structure. In addition,
predicting a multimer (complex) structure may take longer than predicting
the structure of a monomer with the same total number of residues.
The free version of Colab limits jobs to 12 hours
and may terminate them at shorter times at Google's
discretion (see the FAQ).
Those who want to run longer and/or more frequent calculations
may need to sign up for one of the paid
Colab plans.
- Each chain must contain at least 16 residues.
Shorter sequences are not accepted because they cannot be used to generate
a reliable multiple sequence alignment.
- Total sequence length cannot be very large.
AlphaFold runs out of graphics memory for long sequences (~1200 amino acids
on old Google Colab GPUs with 16 GB memory).
Multimer predictions face the same limit on the total number of residues,
so only smaller complexes can be predicted.
As mentioned above, paid Colab plans provide more computational resources
than the free plan. Structures with up to 3000 amino acids can be predicted
using an Nvidia A100 GPU on Google Colab, costing about $1.50 for a
2000-residue prediction (May 2023); this
video explains how.
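The two sequence-length caveats above can be checked before submitting a job. The sketch below is an assumption-laden illustration: the function name check_prediction_input is hypothetical, and the 1200-residue total is the approximate figure quoted above for 16 GB GPUs, not a hard cutoff enforced by any tool.

```python
MIN_CHAIN_LENGTH = 16    # minimum residues per chain, as stated above
APPROX_MAX_TOTAL = 1200  # rough total-length limit quoted for 16 GB GPUs (approximate)


def check_prediction_input(sequences, max_total=APPROX_MAX_TOTAL):
    """Return a list of problems found; an empty list means neither caveat was triggered."""
    problems = []
    for i, seq in enumerate(sequences, start=1):
        if len(seq) < MIN_CHAIN_LENGTH:
            problems.append(f"chain {i}: {len(seq)} residues, fewer than {MIN_CHAIN_LENGTH}")
    # For multimers the limit applies to the summed length of all chains.
    total = sum(len(seq) for seq in sequences)
    if total > max_total:
        problems.append(f"total length {total} exceeds ~{max_total} residues")
    return problems


if __name__ == "__main__":
    # Hypothetical two-chain complex: the second chain is too short.
    for problem in check_prediction_input(["A" * 20, "A" * 10]):
        print(problem)
```

On a paid plan with a larger GPU, max_total could be raised, for example to the 3000-residue figure mentioned above.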
Coloring
shows the AlphaFold Coloring dialog
for applying different color schemes to the predicted structures,
as well as hiding, showing, and
selecting their residues based on
attribute value.
Error plot draws the AlphaFold Error Plot,
in which color gradations show (for each pairwise combination of residues)
the expected error in position of one residue when the
true and predicted structures are aligned based on the other residue.
See also: batch predictions
AlphaFold Coloring Dialog
Clicking the Coloring button on the main
AlphaFold tool shows the
AlphaFold Coloring dialog for applying different color schemes
to the predicted structures, as well as hiding, showing, and
selecting their residues based on
attribute value.
When first opened, AlphaFold-predicted structures are automatically colored by
the pLDDT
confidence measure in the B-factor field:
- 100 to 90 – high accuracy expected
- 90 to 70 – backbone expected to be modeled well
- 70 to 50 – low confidence, caution
- 50 to 0 – should not be interpreted, may be disordered
In other words, this is the coloring applied by the command:
color bfactor palette alphafold
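For scripts that need the same interpretation outside the graphics window, the confidence ranges above can be written as a small lookup. This is a sketch, not part of ChimeraX: the function name plddt_category is hypothetical, and assigning the shared boundary values (90, 70, 50) to the higher range is an assumption, since the ranges as listed overlap at their endpoints.

```python
def plddt_category(plddt: float) -> str:
    """Map a pLDDT value (0-100) to the interpretation listed above.

    Boundary values are assigned to the higher-confidence range (an assumption).
    """
    if not 0.0 <= plddt <= 100.0:
        raise ValueError("pLDDT must be between 0 and 100")
    if plddt >= 90:
        return "high accuracy expected"
    if plddt >= 70:
        return "backbone expected to be modeled well"
    if plddt >= 50:
        return "low confidence, caution"
    return "should not be interpreted, may be disordered"


if __name__ == "__main__":
    # Hypothetical per-residue confidence values.
    for value in (96.3, 74.0, 55.5, 31.2):
        print(value, plddt_category(value))
```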
The Color Key graphical interface
or a command can be used to draw a corresponding color key, for example:
key red:low orange: yellow: cornflowerblue: blue:high [other-key-options]
In the AlphaFold Coloring dialog,
the Residues to act on are specified by using the menus to choose an
AlphaFold-predicted model and one of the following:
- all
– all residues
- confidence below [N]
– based on the bfactor
atom attribute
(the confidence value is read from the
B-factor field of the PDB file)
- C-alpha distance greater than [d]
– based on the c_alpha_distance
residue attribute of AlphaFold models
fetched by existing structure chain;
Cα distance between corresponding positions of the predicted and
existing chains after their automatic superposition
- missing structure
– based on the missing_structure
residue attribute of AlphaFold models
fetched by existing structure chain;
positions missing from the coordinates of the existing chain
- different sequence
– based on the same_sequence
residue attribute of AlphaFold models
fetched by existing structure chain;
positions with different residue types than the existing chain
- confidence above [N]
– based on the bfactor
atom attribute
(the confidence value is read from the
B-factor field of the PDB file)
- C-alpha distance less than [d]
– based on the c_alpha_distance
residue attribute of AlphaFold models
fetched by existing structure chain;
Cα distance between corresponding positions of the predicted and
existing chains after their automatic superposition
- paired structure
– based on the missing_structure
residue attribute of AlphaFold models
fetched by existing structure chain;
positions present in the coordinates of the existing structure chain
- same sequence
– based on the same_sequence
residue attribute of AlphaFold models
fetched by existing structure chain;
positions with identical residue types as the existing chain
Buttons act on the designated residues:
- Color buttons:
- Custom for choosing a color interactively
using the system color picker
- a series of square buttons for specific colors
- Hide – hide the specified residues
- Show – show the specified residues
- Select – select the specified residues
The AlphaFold Coloring dialog does not color continuously
along a gradient to show the attribute values.
For coloring along a gradient, see
Render by Attribute and/or the commands
color bfactor
(for the confidence value, which is read from the
B-factor field of the PDB file) and
color byattribute
(for other numerical attributes).
AlphaFold Error Plot
Besides the per-residue
pLDDT
confidence measure, AlphaFold gives
for each pair of structural entities (X,Y) the expected position error
at entity X if the predicted and true structures were aligned on Y.
Structural entities include standard biopolymer residues as well as the
individual atoms of other types of residues: ligands, ions, glycans,
and post-translationally modified residues.
Only AlphaFold 3 (not earlier versions) generates
predictions that include these other types of residues.
The “predicted aligned error” or PAE values
can be shown with AlphaFold Error Plot, which can be
opened from the Structure Prediction
section of the Tools menu and manipulated like other panels
(more...).
See also:
the AlphaFold Error Estimates example and
video,
alphafold contacts
Choosing the corresponding AlphaFold structure from the menu of open
atomic models associates it with the plot. This association allows
coloring the structure as described below, and for selections
on the plot to highlight the corresponding parts of the structure.
The PAE values can be either:
- fetched from the AlphaFold Database
by giving the UniProt name or accession number of an entry in that database
– or –
- read from a json or pkl file previously output by AlphaFold
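As an illustration of reading PAE values from a json file, the sketch below parses a small in-memory example. The layout is an assumption: it follows the form commonly seen in AlphaFold Database downloads (a one-element list holding a record with a "predicted_aligned_error" matrix), and files written by other AlphaFold versions may differ. The function name load_pae_matrix and the example values are hypothetical.

```python
import json


def load_pae_matrix(text: str):
    """Parse PAE JSON text and return the square matrix as a list of rows."""
    data = json.loads(text)
    # Assumed layout: the record may be wrapped in a one-element list.
    record = data[0] if isinstance(data, list) else data
    matrix = record["predicted_aligned_error"]
    size = len(matrix)
    if any(len(row) != size for row in matrix):
        raise ValueError("PAE matrix must be square")
    return matrix


# Hypothetical three-residue example in the assumed layout.
EXAMPLE = (
    '[{"predicted_aligned_error": [[0, 4, 20], [5, 0, 18], [22, 17, 0]],'
    ' "max_predicted_aligned_error": 31.75}]'
)

if __name__ == "__main__":
    pae = load_pae_matrix(EXAMPLE)
    # The matrix need not be symmetric: the two orderings of a pair can differ.
    print(pae[0][2], pae[2][0])
```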
The PAE plot can also be shown by clicking the Error plot button
on the AlphaFold dialog or by using
the command alphafold pae,
the command
alphafold fetch
or alphafold match
with the option pae true, or the
open command.
When the mouse cursor is over the plot, the residue pair and PAE value at
its current position are reported in the bottom right corner of the window.
Clicking Color PAE Domains clusters the entities into
coherent domains (sets with relatively low PAE values)
and uses randomly chosen colors to distinguish these domains in the structure
(details...).
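To give a feel for what grouping entities into sets with relatively low PAE values means, here is a deliberately simplified sketch. It is not the clustering method used by ChimeraX: it swaps in plain thresholding followed by connected components (union-find). The function name pae_domains and the 5 Å cutoff are assumptions chosen for illustration.

```python
def pae_domains(pae, cutoff=5.0):
    """Group indices whose mutual PAE is at or below `cutoff`, using connected components.

    `pae` is a square matrix (list of rows). Returns a sorted list of index lists.
    """
    n = len(pae)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the set representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # Link a pair only when the error is low in both directions.
            if max(pae[i][j], pae[j][i]) <= cutoff:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # Hypothetical four-residue matrix: low error within pairs (0,1) and (2,3), high across.
    example = [
        [0, 2, 20, 21],
        [3, 0, 19, 22],
        [20, 18, 0, 1],
        [21, 23, 2, 0],
    ]
    print(pae_domains(example))
```

Connected components will merge two domains joined by even a single low-PAE pair, which is one reason real tools tend to use more robust graph clustering.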
Clicking Color pLDDT returns the structure to
the default confidence coloring.
The plot's context menu includes:
- Dragging box colors structure (initial default checked on)
– whether dragging a box on the plot highlights the corresponding parts
of the 3D structure with bright colors and makes everything else gray;
if this option is unchecked, highlighting will be done with
selection instead of coloring
- Color plot from structure
– color the plot to match the 3D structure where the pair of entities
represented by an X,Y point have the same ribbon color; show the rest of the
plot in shades of gray
- Color plot rainbow
– use the pae palette (default) to color the plot
- Color plot green
– use the paegreen palette to color the plot
- Show chain divider lines (initial default checked on)
– for multimer predictions, draw lines on the plot demarcating the
end of one chain and the start of another; the lines may obscure a few
chain-terminal residues in the plot, and can be hidden if this is problematic.
For predictions that include nonstandard residues and/or covalent
modifications, divider lines also segregate the entire set of such
entities from the biopolymer chain(s).
- Save image – save the plot as a PNG file
The Color Key graphical interface
or a command can be used to draw (in the main graphics window)
a color key for the PAE plot. For example, to make a color key that
matches the pae or paegreen scheme, respectively:
key pae :0 : : :15 : : :30 showTool true
key paegreen :0 : : :15 : : :30 showTool true
A title for the color key (e.g., “Predicted Aligned
Error (Å)”) would need to be created separately with
2dlabels.
UCSF Resource for Biocomputing, Visualization, and Informatics /
May 2024