Paper Summary: Embedding Residue Interaction Networks

Visualizing the residue interaction landscape of proteins by temporal network embedding. Franke L, Peter C. J Chem Theory Comput. 2023 May 23;19(10):2985-2995.

low-dimensional embedding for the analysis of molecular dynamics (MD) trajectories
each snapshot is described as a residue interaction network (RIN)
residue closeness centrality → N-dimensional feature vector for each time point
EncoderMap embedding of these feature vectors represents the structure at each time point as a single dot in a 2D plot
this captures important aspects of structure and its transitions at multiple scales without requiring system-specific expert knowledge of which measurables to use

Background

extracting meaning from MD trajectories can be challenging: large-scale global motions, medium-range changes, and local fluctuations may all be relevant to function
expert knowledge of the specific system is typically needed to identify which measurables to track or to use for dimensionality reduction (“featurization”)
global descriptors include radius of gyration and fraction of native contacts
otherwise, specific distances or angles may be used
reduced-dimensionality views should be as interpretable as possible
RINs are mesoscopic in that they capture structural changes on multiple scales ranging from local to global
prior MD analyses with RINs often used averages or only a few snapshots, thus losing much of the temporal information and obscuring conformational heterogeneity
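For contrast with the RIN approach, the two global descriptors named above can be sketched in a few lines. This is a minimal illustration, not code from the paper: the function names, the hard 8 Å cutoff for a "formed" native contact, and the equal-mass default are all assumptions made for the example.

```python
import numpy as np

def radius_of_gyration(coords, masses=None):
    """Mass-weighted radius of gyration of one frame; coords has shape (n_atoms, 3)."""
    coords = np.asarray(coords, dtype=float)
    if masses is None:
        # Assumption for this sketch: equal masses when none are given.
        masses = np.ones(len(coords))
    masses = np.asarray(masses, dtype=float)
    center = np.average(coords, axis=0, weights=masses)
    sq_dist = np.sum((coords - center) ** 2, axis=1)
    return float(np.sqrt(np.average(sq_dist, weights=masses)))

def fraction_native_contacts(coords, native_pairs, cutoff=8.0):
    """Fraction of a predefined list of native pairs that lie within `cutoff`.

    native_pairs is a sequence of (i, j) index pairs; the hard cutoff is an
    illustrative choice (smooth switching functions are also common).
    """
    coords = np.asarray(coords, dtype=float)
    i, j = np.asarray(native_pairs).T
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    return float(np.mean(dist <= cutoff))
```

Both functions reduce a frame to a single number, and the native-contact list itself has to be chosen per system, which is the kind of expert input the paper aims to avoid.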

Implementation: Embedding RIN Time Series

an edge is drawn between any two non-sequence-adjacent residues that have any pair of atoms within 6 Å of each other
residue closeness centrality (reciprocal mean length of the shortest paths from that node to all other nodes) → N-dimensional feature vector for each time point (“closeness fingerprint”)
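The two steps above (build the RIN for a frame, then take the closeness centrality of each residue) can be sketched as follows. This is a plain NumPy illustration under stated assumptions, not the authors' code: `rin_adjacency` and `closeness_fingerprint` are hypothetical names, the graph is unweighted, and residues that cannot reach any other residue are given a closeness of 0 (for partly disconnected graphs the mean is taken over reachable residues only).

```python
from collections import deque
import numpy as np

def rin_adjacency(residue_coords, cutoff=6.0, min_seq_sep=2):
    """Boolean N x N adjacency matrix of the RIN for one frame.

    residue_coords: list of N arrays, each (n_atoms_in_residue, 3), in Angstrom.
    min_seq_sep=2 skips sequence-adjacent residues, as described in the text.
    """
    coords = [np.asarray(c, dtype=float) for c in residue_coords]
    n = len(coords)
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + min_seq_sep, n):
            diff = coords[i][:, None, :] - coords[j][None, :, :]
            # Edge if the closest pair of atoms is within the cutoff.
            if np.sqrt((diff ** 2).sum(axis=-1)).min() <= cutoff:
                adj[i, j] = adj[j, i] = True
    return adj

def closeness_fingerprint(adj):
    """Closeness centrality of every node via breadth-first search."""
    adj = np.asarray(adj, dtype=bool)
    n = len(adj)
    neighbors = [np.flatnonzero(row) for row in adj]
    fingerprint = np.zeros(n)
    for source in range(n):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                v = int(v)
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        if total > 0:
            # Reciprocal of the mean shortest-path length to reachable nodes.
            fingerprint[source] = (len(dist) - 1) / total
    return fingerprint
```

Applying both functions to every frame and stacking the results gives an array of shape (n_frames, N), which is the kind of input the embedding step expects.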
EncoderMap couples a neural network autoencoder with a multidimensional-scaling-like embedding based on pairwise distances
- previous work used dihedrals, highlighted reverse decoding (regenerating conformations): Lemke T et al., J Chem Inf Model (2019)
- code: EncoderMap github
uses Euclidean distance between feature vectors, but with sigmoid cost function to deemphasize the smallest and largest distances
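The effect of the sigmoid can be illustrated with a short sketch. The functional form below is the sketch-map-style sigmoid, which is assumed here to match what EncoderMap uses. The parameter values are placeholders, and `distance_cost` is a hypothetical standalone function, not the EncoderMap API.

```python
import numpy as np

def sigmoid(r, sigma, a, b):
    """Sketch-map-style sigmoid: 0 at r=0, 0.5 at r=sigma, approaches 1 for large r."""
    r = np.asarray(r, dtype=float)
    return 1.0 - (1.0 + (2.0 ** (a / b) - 1.0) * (r / sigma) ** a) ** (-b / a)

def pairwise_distances(x):
    """Euclidean distance matrix between the rows of x."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def distance_cost(high, low,
                  high_params=(1.0, 6.0, 6.0),   # (sigma, a, b), placeholder values
                  low_params=(1.0, 2.0, 6.0)):   # (sigma, a, b), placeholder values
    """Mean squared mismatch between sigmoid-transformed pairwise distances."""
    sig_high = sigmoid(pairwise_distances(high), *high_params)
    sig_low = sigmoid(pairwise_distances(low), *low_params)
    return float(np.mean((sig_high - sig_low) ** 2))
```

Because the sigmoid is flat near 0 and near 1, very small and very large distances contribute little to the cost, so the embedding concentrates on reproducing distances around `sigma`.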

Small Protein Example: Trp-Cage (20 Residues)

very long simulation (208 µs, over a million frames) by D.E. Shaw Research
above: native structure 2jof, its RIN with force-directed layout, a plot of the simulation
right: EncoderMap embedding colored by various properties, where e-f mostly emphasize folding of the N-terminal helix (residues 2-8)
folding pathways in e: A is hydrophobic collapse first (“nucleation-condensation”), B is formation of N-terminal helix first (“diffusion-collision”)

Supp Data: Trp-Cage PCA and UMAP

Authors say that compared to EncoderMap, PCA is more global and UMAP is more local.

Multidomain Example: FAT10 (165 Residues)

FAT10 has two ubiquitin-like domains with a flexible linker and tails
50 simulations, each 50 ns and started in the open conformation, commonly collapsing to a closed conformation
left: the closed conformation and its RIN, a plot of one simulation
the black rectangle on the plot indicates the C-terminal tail interacting with the rest of the protein

Multidomain Example (Cont.)

above: EncoderMap embedding colored by various properties
FAT10 has multiple families of closed conformations
the authors say (hypothesize?) that different closed conformations offer different interaction sites for FAT10's many binding partners
the map resolves functionally different conformations (the various lobes) of a complex system over time yet retains an interpretable global organization (open vs. closed)

Supp Data: FAT10 PCA and UMAP

Supp Data: Alternative Featurizations

the contact map is the RIN edge list (equivalently, its N × N adjacency matrix); the degree centrality of a node is the number of edges it has
these local representations give mostly circular embeddings in which distinguishing the different high-contact conformations is difficult
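To make the two alternative featurizations concrete, here is a minimal sketch that derives both from a boolean RIN adjacency matrix. The function names are hypothetical, and flattening the upper triangle is just one reasonable way to turn a symmetric contact map into a vector.

```python
import numpy as np

def degree_fingerprint(adj):
    """Degree centrality per residue: the number of edges at each node (length N)."""
    adj = np.asarray(adj, dtype=bool)
    return adj.sum(axis=1)

def contact_map_features(adj):
    """Upper triangle of the symmetric contact map as a 0/1 vector of length N(N-1)/2."""
    adj = np.asarray(adj, dtype=bool)
    rows, cols = np.triu_indices(len(adj), k=1)
    return adj[rows, cols].astype(int)
```

Both features only record which direct contacts a residue has, whereas closeness centrality depends on shortest paths through the whole network, which is why it also reflects changes far from a given residue.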

Discussion

the protein graph can be defined at different granularities and/or with different criteria
RINs are suitable for protein systems that undergo strong changes in residue interactions (whereas finer local changes are not distinguished)
the closeness fingerprint extracts a compact, N-dimensional feature set from the full (N × N-dimensional) residue–residue contact map
graph data structures are non-Euclidean, which presents a challenge to machine learning
EncoderMap relies on the deterministic graph representation provided by RINs, namely a fixed size for a given protein, with known node identity and ordering
- it does not involve a training or message-passing process
- its results retain high interpretability
EncoderMap offers a nonlinear, multidimensional-scaling-like behavior with the ability to handle large amounts of data, allowing a focus on mesoscopic (intermediate-scale) structure as represented by closeness fingerprints
to summarize, EncoderMap dimensionality reduction of the closeness fingerprint gives a robust and interpretable picture of protein conformational ensembles without requiring system-specific expert knowledge or any prior training of the method
- code: EncoderMap github