low-dimensional embedding for the analysis of
molecular dynamics (MD) trajectories
each snapshot is described as a residue interaction network (RIN)
residue closeness centrality → N-dimensional feature vector for each
time point
EncoderMap embedding of these feature vectors represents the structure
at each time point as a single dot in a 2D plot
this captures important aspects of structure and its transitions
at multiple scales without requiring system-specific
expert knowledge of which measurables to use
extracting meaning from MD trajectories can be challenging:
large-scale global motions, medium-range changes, and local fluctuations may
all be relevant to function
expert knowledge of the specific system is typically needed to identify
which measurables to track or to use for dimensionality reduction
(“featurization”)
global descriptors include radius of gyration and fraction of native contacts
otherwise, specific distances or angles may be used
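As a concrete illustration of the two global descriptors named above, here is a minimal NumPy sketch. The function names, the hard distance cutoff for native contacts, and the input layout are assumptions made for illustration; other definitions of the fraction of native contacts (for example, smoothed ones) are also in use.

```python
import numpy as np

def radius_of_gyration(coords, masses=None):
    """Mass-weighted radius of gyration of one snapshot.

    coords: (n_atoms, 3) positions; masses: optional (n_atoms,) weights.
    Equal masses are assumed when masses is omitted.
    """
    coords = np.asarray(coords, dtype=float)
    if masses is None:
        masses = np.ones(len(coords))
    masses = np.asarray(masses, dtype=float)
    center = np.average(coords, axis=0, weights=masses)
    sq_dist = np.sum((coords - center) ** 2, axis=1)
    return float(np.sqrt(np.average(sq_dist, weights=masses)))

def fraction_native_contacts(coords, native_pairs, cutoff):
    """Fraction of native pairs (i, j) whose distance is <= cutoff.

    This is a simple hard-cutoff variant, chosen here for brevity.
    """
    coords = np.asarray(coords, dtype=float)
    i, j = np.asarray(native_pairs).T
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    return float(np.mean(dist <= cutoff))
```

Both functions reduce a whole snapshot to a single number, which is why they describe global shape well but cannot tell apart conformations that differ only in medium-range or local contacts.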
reduced-dimensionality views should be as interpretable as possible
RINs are mesoscopic in that they capture structural changes on
multiple scales ranging from local to global
prior MD analyses with RINs often used averages or only a few snapshots,
thus losing much of the temporal information and obscuring conformational
heterogeneity
Implementation: Embedding RIN Time Series
an edge is drawn between any two residues that are not adjacent in
sequence and have any pair of atoms within 6 Å of each other
residue closeness centrality (reciprocal mean length of
the shortest paths from that node to all other nodes)
→ N-dimensional feature vector for each time point
(“closeness fingerprint”)
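The two steps above (contact-based edges, then closeness centrality per residue) can be sketched for a single snapshot as follows. This is a minimal illustration, not the authors' code: the function names and the input layout are hypothetical, and the scaling used for residues that cannot reach every other residue (the Wasserman-Faust convention) is an assumption not taken from the source.

```python
from collections import deque
import numpy as np

def build_rin(residue_coords, cutoff=6.0, min_seq_sep=2):
    """Adjacency lists of a residue interaction network for one snapshot.

    residue_coords: list of (n_atoms_i, 3) arrays in sequence order (Å).
    Residues i and j are linked when they are at least min_seq_sep apart
    in sequence and any atom pair lies within cutoff.
    """
    coords = [np.asarray(c, dtype=float) for c in residue_coords]
    n = len(coords)
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + min_seq_sep, n):
            diff = coords[i][:, None, :] - coords[j][None, :, :]
            if np.sqrt((diff ** 2).sum(axis=-1)).min() <= cutoff:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors

def closeness_fingerprint(neighbors):
    """Closeness centrality of every node via breadth-first search.

    Returns a length-N vector: the reciprocal mean shortest-path length
    to the reachable nodes, scaled by the reachable fraction (assumed
    convention for disconnected graphs); isolated nodes get 0.
    """
    n = len(neighbors)
    fingerprint = np.zeros(n)
    for source in range(n):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        reached = len(dist) - 1
        if reached > 0:
            total = sum(dist.values())
            fingerprint[source] = (reached / total) * (reached / (n - 1))
    return fingerprint
```

Applying `closeness_fingerprint(build_rin(...))` to every frame and stacking the results gives a (frames × N) matrix, the input to the embedding. The double loop over residue pairs is kept for readability; a trajectory with a million frames would need a faster neighbor search.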
EncoderMap couples a neural network autoencoder with a
multidimensional-scaling-like embedding based on pairwise distances
previous EncoderMap work used dihedral angles as input and highlighted
reverse decoding (regenerating conformations):
Lemke T et al., J Chem Inf Model (2019)
here, EncoderMap uses Euclidean distances between feature vectors, but with
a sigmoid cost function to de-emphasize the smallest and largest distances
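To make the sigmoid-weighted distance cost concrete, here is a minimal NumPy sketch. The functional form is the sketch-map sigmoid as it is commonly written; treat it, the function names, and all parameter values as assumptions, not as the settings used in the work summarized here.

```python
import numpy as np

def sketchmap_sigmoid(r, sigma, a, b):
    """Map distances to [0, 1); equals 0.5 at r = sigma.

    a and b control how strongly short and long distances are flattened.
    """
    r = np.asarray(r, dtype=float)
    return 1.0 - (1.0 + (2.0 ** (a / b) - 1.0) * (r / sigma) ** a) ** (-b / a)

def pairwise_distances(x):
    """Euclidean distance matrix between the rows of x."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def distance_cost(high, low, high_params, low_params):
    """Mean squared mismatch of sigmoid-transformed pairwise distances.

    high: (m, N) feature vectors; low: (m, 2) map coordinates.
    high_params / low_params: (sigma, a, b) tuples for each space.
    """
    sig_high = sketchmap_sigmoid(pairwise_distances(high), *high_params)
    sig_low = sketchmap_sigmoid(pairwise_distances(low), *low_params)
    iu = np.triu_indices(len(sig_high), k=1)  # count each pair once
    return float(np.mean((sig_high[iu] - sig_low[iu]) ** 2))

# Usage with placeholder values: three 4-dimensional fingerprints and
# a candidate 2D placement of the same three frames.
high = [[0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.5], [0.9, 0.8, 0.7, 0.6]]
low = [[0.0, 0.0], [0.1, 0.0], [2.0, 1.0]]
cost = distance_cost(high, low, (0.5, 6, 2), (1.0, 2, 2))
```

Because the sigmoid saturates near 0 for very short distances and near 1 for very long ones, mismatches in those ranges contribute little to the cost; the embedding is driven mainly by distances around sigma.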
Small Protein Example: Trp-Cage (20 Residues)
very long simulation (208 µs, over a million frames)
by D.E. Shaw Research
above: native structure (PDB ID 2JOF), its RIN with force-directed layout,
a plot of the simulation
right: EncoderMap embedding colored by various properties,
where panels e–f mostly emphasize folding of the N-terminal helix
(residues 2–8)
folding pathways in panel e: pathway A is hydrophobic collapse first
(“nucleation-condensation”), pathway B is
formation of the N-terminal helix first (“diffusion-collision”)
Supp Data: Trp-Cage PCA and UMAP
according to the authors, compared to EncoderMap,
PCA emphasizes more global structure and UMAP more local structure
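For reference, the PCA baseline can be reproduced on any matrix of closeness fingerprints with a few lines of NumPy. This is a generic sketch of PCA via singular value decomposition, with a hypothetical function name; it is not the authors' analysis script, and UMAP is left out because it needs a third-party package.

```python
import numpy as np

def pca_embed(features, n_components=2):
    """Project feature vectors onto their leading principal components.

    features: (frames, N) array, e.g. one closeness fingerprint per row.
    Returns the (frames, n_components) projection and the fraction of
    variance explained by each retained component.
    """
    x = np.asarray(features, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    projection = centered @ vt[:n_components].T
    explained = (s ** 2) / np.sum(s ** 2)
    return projection, explained[:n_components]
```

PCA is a linear projection that preserves the directions of largest variance, which is consistent with it giving the more global picture in this comparison.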
Multidomain Example: FAT10 (165 Residues)
FAT10 has two ubiquitin-like domains with a flexible linker and tails
50 simulations, each 50 ns and started in the open conformation,
commonly collapsing to a closed conformation
left: the closed conformation and its RIN, a plot of one simulation
the black rectangle on the plot indicates the C-terminal tail
interacting with the rest of the protein
Multidomain Example (Cont.)
above: EncoderMap embedding colored by various properties
FAT10 has multiple families of closed conformations
the authors suggest that different closed conformations offer
different interaction sites for FAT10's many binding partners
the map resolves functionally different conformations (the various lobes)
of a complex system over time yet retains an interpretable global organization
(open vs. closed)
Supp Data: FAT10 PCA and UMAP
Supp Data: Alternative Featurizations
the contact map is the RIN adjacency matrix (equivalently, its edge list);
the degree centrality of a node is how many edges it has
these local representations give mostly circular embeddings in which
distinguishing the different high-contact conformations is difficult
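The two alternative featurizations can be written down directly from a RIN adjacency matrix. The sketch below uses hypothetical function names and assumes a symmetric 0/1 matrix with a zero diagonal.

```python
import numpy as np

def contact_map_features(adjacency):
    """Flatten the upper triangle of an N x N adjacency matrix.

    Gives N(N-1)/2 binary features, one per residue pair.
    """
    a = np.asarray(adjacency)
    iu = np.triu_indices(a.shape[0], k=1)
    return a[iu].astype(float)

def degree_fingerprint(adjacency):
    """Number of edges per residue: an N-dimensional feature vector."""
    a = np.asarray(adjacency)
    return a.sum(axis=1).astype(float)
```

Both features depend only on a residue's direct contacts. Closeness centrality, by contrast, depends on shortest paths through the whole network, so a single new contact can change the values of many residues at once.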
Discussion
the protein graph can be defined at different granularities
and/or with different criteria
RINs are suitable for protein systems that undergo strong changes
in residue interactions (whereas finer local changes are not distinguished)
the closeness fingerprint extracts a compact, N-dimensional feature set
from the full N × N residue–residue contact map
graph data structures are non-Euclidean, which presents a challenge to
machine learning
EncoderMap relies on the deterministic graph representation provided by
RINs, namely a fixed size for a given protein, with known node identity
and ordering
it does not involve a training or message-passing process
its results retain high interpretability
EncoderMap offers a nonlinear, multidimensional-scaling-like behavior with
the ability to handle large amounts of data, allowing a focus on mesoscopic
(intermediate-scale) structure as represented by closeness fingerprints
to summarize, EncoderMap dimensionality reduction of the closeness
fingerprint gives a robust and interpretable picture of protein conformational
ensembles without requiring system-specific expert knowledge or any prior
training of the method