Navigation Bar

Using sequence similarity networks for visualization of relationships across diverse protein superfamilies

Atkinson HJ, Morris JH, Ferrin TE, Babbitt PC. Using sequence similarity networks for visualization of relationships across diverse protein superfamilies. PLoS One 2009; 4(2): e4345.
The dramatic increase in heterogeneous types of biological data -- in particular, the abundance of new protein sequences -- requires fast and user-friendly methods for organizing this information in a way that enables functional inference. The most widely used strategy to link sequence or structure to function, homology-based function prediction, relies on the fundamental assumption that sequence or structural similarity implies functional similarity. New tools that extend this approach are still urgently needed to associate sequence data with biological information in ways that accommodate the real complexity of the problem, while being accessible to experimental as well as computational biologists. To address this, we have examined the application of sequence similarity networks for visualizing functional trends across protein superfamilies from the context of sequence similarity. Using three large groups of homologous proteins of varying types of structural and functional diversity -- GPCRs and kinases from humans, and the crotonase superfamily of enzymes -- we show that overlaying networks with orthogonal information is a powerful approach for observing functional themes and revealing outliers. In comparison to other primary methods, networks provide both a good representation of group-wise sequence similarity relationships and a strong visual and quantitative correlation with phylogenetic trees, while enabling analysis and visualization of much larger sets of sequences than trees or multiple sequence alignments can easily accommodate. We also define important limitations and caveats in the application of these networks. As a broadly accessible and effective tool for the exploration of protein superfamilies, sequence similarity networks show great potential for generating testable hypotheses about protein structure-function relationships.

Available files:

  1. Supplementary figures referenced in the text
  2. Supplementary tables referenced in the text
  3. IDs and accessions for sequences labeled in figures
  4. Movie demonstrating how network topology changes with threshold
  5. Datafiles generated in the analysis: sequence files, HMMs, trees, sequence similarity networks, etc
    1. GPCRs (including Amine-binding and Class A: Rhodopsin-like datasets)
    2. Kinases (including STE/WNK dataset)
    3. Crontonase/Enoyl-CoA Hydratase Superfamily datasets

1. Supplementary figures referenced in the text. All files are in Portable Document Format (PDF).

File Description
SF1_FD_layout_comparison.pdf
Supplementary Fig. 1
Network distances are similar between the Organic and Cytoscape force-directed layout weighted by E-value
SF2_blastclust_comparison.pdf
Supplementary Fig. 2
Comparison of network layout and clustering with BLASTCLUST
SF3_missing_data_thumbnails.pdf
Supplementary Fig. 3
Graphic showing how network topology is affected by missing data. The correlation is high between the topology of the Class A GPCR network and networks with 20% of the sequences removed at random.
SF4_kinase_NJtreeAndNet.pdf
Supplementary Fig. 4
Comparison of trees and networks: STE and WNK kinases
SF5_both_trees.pdf
Supplementary Fig. 5
The archaeal bECH ECH domain is more similar to the sECH domain than the non-archaeal bECH ECH domain
SF6_blastAsymmetry.pdf
Supplementary Fig. 6
Asymmetery in BLAST E-values: How large is the difference between the E-values calculated between sequence pair A,B when A is used as query, or B is used as query?
SF7_three_plotsGPCR_1e-2.pdf
Supplementary Fig. 7
Example percent identity and length of alignment quartile plots

2. Supplementary tables referenced in the text. All files are in Portable Document Format (PDF).

File Description
ST1-3_allSuppTables.pdf Supplementary Table 1
Summary of network statistics: Correlation between organic laid-out network distances and the mathematically ideal BLAST E-value distances
Supplementary Table 2
Comparison of mathematically ideal and displayed pairwise network distances between 51 human STE and WNK kinases
Supplementary Table 3
Comparison of mathematically ideal and displayed pairwise distances between networks of the crotonase superfamily, using either full-length sequences or just the crotonase domain

3. IDs and accessions for sequences labeled in figures. File is a text file.

File Description
labeled_protein_ids.txt Each protein sequence specifically labled with text in a figure in Atkinson et al. is associated with a public database ID in this file.

4. Movie demonstrating how network topology changes with threshold. File is in Quicktime format (mov).

File Description
hCaspases_ids_ed.mov [13M]

This 57 second movie roughly corrresponds to the toy network in Fig. 1 (see right). Note that colors in the figure at right and in the movie do not correspond.

Using 12 human caspase sequences as an example of a tiny protein family, edges (pairwise relationships) are added to the display, one-by-one, in order of significance as determined by pairwise BLAST alignments. This is equivalent to continuously progressing from a very stringent E-value threshold to a permissive threshold. (Nodes are not visible until they participate in an edge that is better than the current threshold.)

As each edge is displayed, information about the underlying BLAST alignment is printed: number of identical residues / length of alignment (% identity).

The resulting network is laid out using a force-directed layout: nodes (sequences) are modeled as small masses that repel one another, and edges (alignments) are modeled as springs that pull nodes together. As a group of nodes becomes more interconnected, it is pulled together into a cluster. (This layout algorithm is not the Cytoscape Organic Layout used in the published figures.)

Near the beginning of the movie, when a number of high- significance edges are displayed, the warm-toned (red, orange, yellow) nodes group together. A bit later, the cool-toned nodes (blue, green) group together. Warm-toned nodes correspond to inflammation-related caspases, and cool-toned nodes correspond to apoptosis- related caspases. Inflammation-related caspases tend to be more similar to one another than the apoptosis-related group, which is why they are more tightly grouped at a more significant threshold.

By the end of the movie, the threshold is very permissive, so all of the nodes are nearly completely connected, and subgroup clustering is no longer evident.

This movie was created using Processing.


5. Datafiles generated in the analysis: sequence files, HMMs, sequence similarity networks

  1. GPCRs (including Amine-binding and Class A: Rhodopsin-like datasets)
  2. Kinases (including STE/WNK dataset)
  3. Crontonase/Enoyl-CoA Hydratase Superfamily datasets

1. GPCRs
File Description
HMM:
clan_GPCR_A.pfam.afa.hmm
GPCR_A TM domain HMM
HMM:
clan_FOCS_TMdom.pfam.afa.hmm
FOCS ("Class B GPCRs") TM domain HMM
sequences:
nava99nr_GPCRdom.fa
766 99% non-redundant human GPCR domain sequences
sequences:
classA_rho-like_dom.fa
605 99% non-redundant human GPCR domain sequences (Class A: Rhodopsin-like)
sequences:
17randomHumanSeqs_289res.fa
17 "non-GPCR" human protein sequences
sequences:
ClassA_Amine_dom_99nr.fa
42 99% non-redundant human GPCR domain sequences (amine-binding)
sequence alignment:
ClassA_Amine_dom_99nr.afa
alignment of the above 42 amine-binding GPCR domains
tree:
amine_nj.nolabels.tree
Figure 2A: amine-binding GPCR tree Neighbor-Joining tree describing the interrelationships of 42 amine-binding human GPCR domains. [based on the above alignment]
network statistics:
three_plots_nava1.22.08_99nr.fa.pdf
Full length GPCRs: quartile plots showing how pairwise BLAST alignments change with E-value
network statistics:
three_plots_nava99nr_GPCRdom.fa.pdf
Domain-only GPCRs: quartile plots showing how pairwise BLAST alignments change with E-value
network:
ClassA_Amine_dom_99nr.fa_1e-33.xgmml [473K]
Fig. 2B: amine-binding GPCR network
Sequence similarity network including the same 42 GPCR domain sequences as in (A). This network was thresholded at a BLAST E-value of 1e-33: only edges associated with E-values more significant than 1e-33 are included in the network. This network contains 324 edges; the worst edges displayed correspond to a median of 30% identity over an alignment length of 280 amino acids.
AttributeDescription
IDSwissProt accession
nameSwissProt accession
DEDefinition line from SwissProt record ("full name")
speciesspecies
sequencefull length sequence
seq_lengthfull length sequence length
struc_ctnumber of structures in the PDB associated with this sequence in SwissProt
Level00 - Level04functional annotation at different levels of the NaVa DB functional hierarchy
bestModelname of GPCR HMM that best matched sequence
log_hmm_eval-log10(E-value) for alignment to HMM (bestModel)
dom_seqGPCR domain sequence
domain_lenlength of GPCR domain
seq_stindex in full len seq for beginning of GPCR domain
seq_end" end of GPCR domain
network:
classA_rho_pctrl.fa_1e-11.xgmml [83M]
Fig. 4A: Class A: Rhodopsin-like GPCR network
605 human Class A: Rhodopsin-like GPCR domains. This sequence set includes the 42 amine-binding sequences from Table 2 and Fig. 2. This network was thresholded at a BLAST E-value of 1e-11; the worst edges displayed correspond to a median of 24% identity over an alignment length of 210 amino acids.
*** see amine-binding GPCR network above for xgmml file sequence attributes
network:
nava99nr_GPCRdom_pctrl.fa_0.01.xgmml [65M]
Fig. 4B: All GPCR superfamilies
766 human GPCR domains. This sequence set includes all of the 605 Class A sequences from (A). This network was thresholded at an E-value of 1e-2, and the worst edges displayed correspond to a median of 22% identity over an alignment length of 120 amino acids.
*** see amine-binding GPCR network above for xgmml file sequence attributes
*** Because this network contains many more edges than the other networks from this paper, this xgmml format file does not contain information about node color or positioning, as most of the others do -- the exported file becomes inconveniently large. In order to duplicate the look of Fig. 4B:
  1. Import the xgmml file into Cytoscape using File: Import network (this should take minute or more
  2. Maximize the Cytoscape window and the window containing the network view (titled "nava99nr_GPCRdom_pctrl.fa_0.01")
  3. Select Layout: yFiles: Organic (this should take another minute or two)
  4. Define a "Visual Style" that colors nodes according to the value of attribute 'Level01'

2. Kinases
File Description
sequences:
hKD_nps_99nr.fa
513 99% non-redundant human kinase domain sequences
sequences:
hKD_STE_WNK.fa
51 STE and WNK kinase domain sequences (from above sequence set)
sequence alignment:
hKD_STE_WNK.afa
alignment of the above human STE/WNK kinase domains
tree:
hKD_STE_WNK.fa.tree
Supplementary Fig. 4A: STE and WNK kinase tree Neighbor-Joining tree describing the interrelationships of 51 human STE/WNK kinase domains. [based on the above alignment]
network statistics:
three_plots_hKD_nps_99nr.fa.pdf
513 human kinase domains: quartile plots showing how pairwise BLAST alignments change with E-value
network:
hKD_STE_WNK.fa_1e-27.xgmml [970K]
Supplementary Fig. 4B: STE and WNK kinase network
Sequence similarity network including the same 51 STE/WNK kinase domains as in S4A. This network was thresholded at an E-value of 1e-27 and contains 821 edges; the worst edges displayed correspond to a median of 30% identity over an alignment length of 270 amino acids.
AttributeDescription
IDKinBase (http://kinase.com/kinbase) identifier
L01KinBase "Group"
L02KinBase "Family"
L03KinBase "Subfamily"
L02:L03Concatenated L01:L02 annotations
sequences aligned to HMM database:
hKD_nps_99nr.fa.kinase_hmms.hmmpfam
The HMMER hmmpfam command was used to identify which PFAM model (Pkinase or Pkinase_Tyr) best fit each of the 513 kinase domains, and to estimate the identity of the catalytic Lys. This is the output of that command. (text file)
network:
hKD_nps_99nr.fa_1e-25.xgmml [19M]
Fig. 3: Sequence similarity network of the kinase superfamily
Network of 513 human kinase domains: The network is thresholded at a BLAST E-value of 1e-25. The worst edges displayed correspond to a median of 29% identity over alignments of 260 residues
AttributeDescription
IDKinBase (http://kinase.com/kinbase) identifier
L01KinBase "Group"
L02KinBase "Family"
L03KinBase "Subfamily"
bestGrpwhich PFAM model fit best? (Pkinase or Pkinase_Tyr)
hmm_eval-log10(E-value) for alignment to HMM (bestGrp)
AvKIdentity of the residue in this sequence aligning to the "VAIK" catalytic Lys position in the "bestGrp" HMM model
tAvK"thresholded" AvK: Identity of the residue in this sequence aligning to the "VAIK" catalytic Lys position in the "bestGrp" HMM model, if the sequence matched the model with an E-value better than 1e-50.

3. Crontonase/Enoyl-CoA Hydratase Superfamily
File Description
sequences:
crotPub01.16.08_99nr.fa
1,170 99% non-redundant crotonase/Enoyl-CoA hydratase superfamily sequences (full-length)
sequences:
ECH_crotPub01.16.08_99nr.fa
410 full length sequences annotated as members of the SFLD enoyl-CoA hydratase family (subset of the crotonase superfamily)
sequence alignment:
crotDomain3HBCH_clpped.afa
Structure-based sequence alignment of crotonase domain from crotonase superfamily, with a couple of 3-hydroxyisobutyryl-CoA hydrolase sequences added in
HMM:
crotDomain.hmm
Crotonase domain HMM based on the above sequence alignment
sequences:
crotDomain.fa
1,170 99% non-redundant crotonase/Enoyl-CoA hydratase superfamily domain sequences isolated from full-length seqs as defined by crotDomain.hmm
network statistics:
three_plots_crotPub01.16.08_99nr.fa.pdf
1,170 full-length crotonase superfamily sequences: quartile plots showing how pairwise BLAST alignments change with E-value
network statistics:
three_plots_crotDomain.fa.pdf
1,170 domain-only crotonase superfamily sequences: quartile plots showing how pairwise BLAST alignments change with E-value
network:
crotPub01.16.08_99nr.fa_1e-30.xgmml [36M]
Fig. 5A,B: Crotonase superfamily: full-length network
1,170 sequences from the crotonase superfamily, involving full-length sequences, thresholded at an E-value of 1e-30. The worst edges displayed correspond to a median of 33% identity over alignments of 250 residues.
*** Because this network contains many more edges than the most of other networks from this paper, this xgmml format file does not contain information about node color or positioning, as most of the others do -- the exported file becomes inconveniently large. In order to duplicate the look of Fig. 5A:
  1. Import the xgmml file into Cytoscape using File: Import network (this should take a minute or more)
  2. Maximize the Cytoscape window and the window containing the network view (titled "crotPub01.16.08_99nr.fa_1e-30")
  3. Select Layout: yFiles: Organic (this should take another minute or two)
  4. Define a "Visual Style" that colors nodes according to the value of attribute 'family_name'
AttributeDescription
IDNCBI protein gi (general identifier)
family_name SFLD family annotation
sequencefull-length sequence
seq_lenfull-length sequence length
speciesspecies
 taxonomic attributes: note that standard hierarchy is:
superkingdom, kingdom, phylum, class, order, family, genus, species
kingToSuperkingIf species is associated with a taxonomic kingdom, this is the attribute value. Otherwise, superkingdom value is used
phylumToClassIf species is associated with a taxonomic phylum, this is the attribute value. Otherwise, class value is used
BactPhylumIf species is in the superkingdom Bacteria, the species Phylum is used as attribute. Otherwise, it is listed as 'NotBacteria'
network:
crotDomain.fa_1e-29.xgmml [48M]
Fig. 5C: Crotonase superfamily: domain-only network
1,170 sequences from the crotonase superfamily, involving just the crotonase domain, thresholded at 1e-29. The worst edges displayed correspond to a median of 38% identity over alignments of 180 residues.
*** Includes all of the attributes in the full-length crotonase superfamily network, and has some additional attributes:
AttributeDescription
domain_seqsequence of just the crotonase domain
domain_lenlength of crotonase domain
log_hmm_eval-log10(E-value) for this sequence aligned to crotDomain.hmm
seq_stindex in full len seq for beginning of crotonase domain
seq_end" end of crotonase domain
network:
ECH_crotPub01.16.08_99nr.fa_1e-50.xgmml [18M]
Fig. 6: Enoyl-CoA hydratase family: full-length network
The 410-sequence network is thresholded at a BLAST E-value of 1x10-50; the worst edges displayed correspond to a median of 40% identity over alignments of 260 amino acids.
*** Includes the same attributes as the full-length crotonase superfamily network above
sequence alignment:
tree.ids.afa
MUSCLE alignment of 15 representative full-length sequences from the ECH family and superfamily
tree:
tree.ids.nex.con.tree
Figure S5C: Bayesian phylogeny, full length ECH sequences Bayesian consensus tree consisting of 15 representative sequences, based on the alignment above
sequence alignment:
tree.ids.crotDom.afa
MUSCLE alignment of 15 representative ECH-domain-only sequences from the ECH family and superfamily
tree:
tree.ids.crotDom.nex.con.tree
Figure S5D: Bayesian phylogeny, ECH domain sequences Bayesian consensus tree consisting of 15 representative sequences, based on the alignment above


Laboratory Overview | Research | Outreach & Training | Available Resources | Visitors Center | Search