Embedding Residue Interaction Networks

Visualizing the residue interaction landscape of proteins by temporal network embedding. Franke L, Peter C. J Chem Theory Comput. 2023 May 23;19(10):2985-2995.

  • low-dimensional embedding for the analysis of molecular dynamics (MD) trajectories

  • each snapshot is described as a residue interaction network (RIN)

  • residue closeness centrality → N-dimensional feature vector for each time point

  • EncoderMap embedding of these feature vectors represents the structure at each time point as a single dot in a 2D plot

  • this captures important aspects of structure and its transitions at multiple scales without requiring system-specific expert knowledge of which measurables to use



Background


  • extracting meaning from MD trajectories can be challenging: large-scale global motions, medium-range changes, and local fluctuations may all be relevant to function

  • expert knowledge of the specific system is typically needed to identify which measurables to track or to use for dimensionality reduction (“featurization”)

  • global descriptors include radius of gyration and fraction of native contacts

  • otherwise, specific distances or angles may be used

  • reduced-dimensionality views should be as interpretable as possible

  • RINs are mesoscopic in that they capture structural changes on multiple scales ranging from local to global

  • prior MD analyses with RINs often used averages or only a few snapshots, thus losing much of the temporal information and obscuring conformational heterogeneity

Implementation: Embedding RIN Time Series

  1. an edge is drawn between any two non-sequence-adjacent residues that have any pair of atoms at most 6 Å apart

  2. residue closeness centrality (the reciprocal of the mean shortest-path length from that node to all other nodes) → N-dimensional feature vector for each time point (“closeness fingerprint”)

  3. EncoderMap couples a neural network autoencoder with a multidimensional-scaling-like embedding based on pairwise distances


  4. the embedding uses the Euclidean distance between feature vectors, but with a sigmoid cost function that de-emphasizes the smallest and largest distances
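
Steps 1 and 2 above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the input layout (a list of per-residue atom coordinates), the function names, and the handling of disconnected nodes (mean over reachable nodes only, isolated nodes scored as 0) are all assumptions made here.

```python
from collections import deque
from math import dist

CUTOFF = 6.0  # Å; edge threshold from step 1


def build_rin(residues, cutoff=CUTOFF):
    """Return one adjacency set per residue for a single snapshot.

    `residues` is a list, in sequence order, of lists of (x, y, z) atom
    coordinates in Å. This input layout is an assumption for illustration.
    """
    n = len(residues)
    adjacency = [set() for _ in range(n)]
    for i in range(n):
        # Start at i + 2 so that sequence-adjacent pairs (i, i + 1) are skipped.
        for j in range(i + 2, n):
            if any(dist(a, b) <= cutoff for a in residues[i] for b in residues[j]):
                adjacency[i].add(j)
                adjacency[j].add(i)
    return adjacency


def closeness_fingerprint(adjacency):
    """Return the closeness centrality of every node (step 2)."""
    fingerprint = []
    for source in range(len(adjacency)):
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        hops = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in hops:
                    hops[neighbor] = hops[node] + 1
                    queue.append(neighbor)
        total = sum(hops.values())
        # Assumption: average over reachable nodes only; an isolated node gets 0.
        fingerprint.append((len(hops) - 1) / total if total else 0.0)
    return fingerprint


# Toy snapshot: five single-atom "residues" arranged as a hairpin.
snapshot = [
    [(0.0, 0.0, 0.0)],
    [(3.8, 0.0, 0.0)],
    [(7.6, 0.0, 0.0)],
    [(7.6, 4.0, 0.0)],
    [(3.8, 4.0, 0.0)],
]
rin = build_rin(snapshot)
fingerprint = closeness_fingerprint(rin)
print(fingerprint)
```

In this toy hairpin the last residue contacts three others and therefore has the highest closeness. Applying the same two functions to every frame of a trajectory would give one N-dimensional fingerprint per time point, which is the input to the embedding in steps 3 and 4.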

Small Protein Example: Trp-Cage (20 Residues)

  • very long simulation (208 µs, over a million frames) by D.E. Shaw Research

  • above: native structure (PDB 2JOF), its RIN with force-directed layout, a plot of the simulation

  • right: EncoderMap embedding colored by various properties, where e-f mostly emphasize folding of the N-terminal helix (residues 2-8)

  • folding pathways in e: A is hydrophobic collapse first (“nucleation-condensation”), B is formation of N-terminal helix first (“diffusion-collision”)

Supp Data: Trp-Cage PCA and UMAP

The authors say that, compared to EncoderMap, PCA emphasizes more global structure and UMAP emphasizes more local structure.

Multidomain Example: FAT10 (165 Residues)


  • FAT10 has two ubiquitin-like domains with a flexible linker and tails

  • 50 simulations, each 50 ns and started in the open conformation, commonly collapsing to a closed conformation

  • left: the closed conformation and its RIN, a plot of one simulation

  • the black rectangle on the plot indicates the C-terminal tail interacting with the rest of the protein

Multidomain Example (Cont.)

  • above: EncoderMap embedding colored by various properties

  • FAT10 has multiple families of closed conformations

  • the authors say (hypothesize?) that different closed conformations offer different interaction sites for FAT10's many binding partners

  • the map resolves functionally different conformations (the various lobes) of a complex system over time yet retains an interpretable global organization (open vs. closed)

Supp Data: FAT10 PCA and UMAP

Supp Data: Alternative Featurizations

  • the contact map is the RIN edge list; the degree centrality of a node is the number of edges it has

  • these local representations give mostly circular embeddings in which distinguishing the different high-contact conformations is difficult
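
A toy comparison may help show why a purely local measure such as degree can miss differences that closeness picks up. The two 7-node graphs below are hypothetical, invented for illustration (they are not from the paper), and the function names are assumptions. Every node has the same degree in both graphs, yet the closeness fingerprints differ.

```python
from collections import deque


def adjacency_from_edges(n, edges):
    """Build adjacency sets for n nodes from an undirected edge list."""
    adjacency = [set() for _ in range(n)]
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)
    return adjacency


def degree_fingerprint(adjacency):
    """Degree centrality: the number of edges at each node (local)."""
    return [len(neighbors) for neighbors in adjacency]


def closeness_fingerprint(adjacency):
    """Closeness centrality of each node; assumes a connected graph."""
    n = len(adjacency)
    fingerprint = []
    for source in range(n):
        hops = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in hops:
                    hops[neighbor] = hops[node] + 1
                    queue.append(neighbor)
        fingerprint.append((n - 1) / sum(hops.values()))
    return fingerprint


# Hypothetical graph A: the two degree-3 hubs (nodes 0 and 3) are in direct contact.
edges_a = [(0, 3), (0, 5), (0, 4), (2, 5), (1, 3), (3, 6)]
# Hypothetical graph B: the same hubs are connected only through node 5.
edges_b = [(0, 5), (3, 5), (0, 2), (0, 4), (1, 3), (3, 6)]

graph_a = adjacency_from_edges(7, edges_a)
graph_b = adjacency_from_edges(7, edges_b)

degree_a, degree_b = degree_fingerprint(graph_a), degree_fingerprint(graph_b)
closeness_a, closeness_b = closeness_fingerprint(graph_a), closeness_fingerprint(graph_b)

print(degree_a == degree_b)        # identical degree fingerprints
print(closeness_a == closeness_b)  # different closeness fingerprints
```

The contact map itself would still tell these two graphs apart, since their edge lists differ; the example only illustrates the information lost when each node is summarized by its direct neighbors alone.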

Discussion


  • the protein graph can be defined at different granularities and/or with different criteria

  • RINs are suitable for protein systems that undergo strong changes in residue interactions (whereas finer local changes are not distinguished)

  • the closeness fingerprint extracts a compact, N-dimensional feature set from the full (N × N-dimensional) residue–residue contact map

  • graph data structures are non-Euclidean, which presents a challenge to machine learning

  • EncoderMap relies on the deterministic graph representation provided by RINs, namely a fixed size for a given protein, with known node identity and ordering
    • this graph representation does not itself involve a training or message-passing process
    • its results retain high interpretability

  • EncoderMap offers a nonlinear, multidimensional-scaling-like behavior with the ability to handle large amounts of data, allowing a focus on mesoscopic (intermediate-scale) structure as represented by closeness fingerprints

  • to summarize, EncoderMap dimensionality reduction of the closeness fingerprint gives a robust and interpretable picture of protein conformational ensembles without requiring system-specific expert knowledge or any prior training of the method
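
The sigmoid-weighted comparison of pairwise distances behind the multidimensional-scaling-like behavior can be sketched as follows. This is a minimal sketch under stated assumptions: it uses the sketch-map form of the sigmoid, the function names are hypothetical, and the parameter values (sigma, a, b) are placeholders rather than settings from the paper. The autoencoder part of the cost is omitted.

```python
from math import dist


def sigmoid(r, sigma, a, b):
    """Sketch-map-style sigmoid: 0 at r = 0, 0.5 at r = sigma, approaching 1 for large r."""
    return 1.0 - (1.0 + (2.0 ** (a / b) - 1.0) * (r / sigma) ** a) ** (-b / a)


def distance_cost(high_points, low_points, high_params, low_params):
    """Mean squared mismatch between sigmoid-transformed pairwise distances.

    `high_points` are feature vectors (e.g. closeness fingerprints) and
    `low_points` are their 2D map positions; both are lists of coordinate tuples.
    """
    total, pairs = 0.0, 0
    for i in range(len(high_points)):
        for j in range(i + 1, len(high_points)):
            high = sigmoid(dist(high_points[i], high_points[j]), *high_params)
            low = sigmoid(dist(low_points[i], low_points[j]), *low_params)
            total += (high - low) ** 2
            pairs += 1
    return total / pairs


# Placeholder parameters (sigma, a, b); not the values used in the paper.
params = (1.0, 6.0, 6.0)

# Doubling a very small or a very large distance barely changes the sigmoid,
# while a change near sigma is strongly resolved.
small_change = sigmoid(0.2, *params) - sigmoid(0.1, *params)
middle_change = sigmoid(1.2, *params) - sigmoid(0.8, *params)
large_change = sigmoid(20.0, *params) - sigmoid(10.0, *params)
print(small_change, middle_change, large_change)
```

Because the transformed distances saturate at both ends, the cost is dominated by pairs whose separation is near sigma, which is how the smallest and largest distances are de-emphasized.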