Atkinson HJ, Morris JH, Ferrin TE, Babbitt PC. Using sequence similarity networks for visualization of relationships across diverse protein superfamilies. PLoS One 2009; 4(2): e4345.
| The dramatic increase in heterogeneous types of biological data -- in particular, the abundance of new protein sequences -- requires fast and user-friendly methods for organizing this information in a way that enables functional inference. The most widely used strategy to link sequence or structure to function, homology-based function prediction, relies on the fundamental assumption that sequence or structural similarity implies functional similarity. New tools that extend this approach are still urgently needed to associate sequence data with biological information in ways that accommodate the real complexity of the problem, while being accessible to experimental as well as computational biologists. To address this, we have examined the application of sequence similarity networks for visualizing functional trends across protein superfamilies from the context of sequence similarity. Using three large groups of homologous proteins of varying types of structural and functional diversity -- GPCRs and kinases from humans, and the crotonase superfamily of enzymes -- we show that overlaying networks with orthogonal information is a powerful approach for observing functional themes and revealing outliers. In comparison to other primary methods, networks provide both a good representation of group-wise sequence similarity relationships and a strong visual and quantitative correlation with phylogenetic trees, while enabling analysis and visualization of much larger sets of sequences than trees or multiple sequence alignments can easily accommodate. We also define important limitations and caveats in the application of these networks. As a broadly accessible and effective tool for the exploration of protein superfamilies, sequence similarity networks show great potential for generating testable hypotheses about protein structure-function relationships. |
1. Supplementary figures referenced in the text. All files are in Portable Document Format (PDF).
| File | Description |
SF1_FD_layout_comparison.pdf
|
Supplementary Fig. 1 Network distances are similar between the Organic and Cytoscape force-directed layout weighted by E-value |
SF2_blastclust_comparison.pdf
|
Supplementary Fig. 2 Comparison of network layout and clustering with BLASTCLUST |
|
SF3_missing_data_thumbnails.pdf
|
Supplementary Fig. 3 Graphic showing how network topology is affected by missing data. The correlation is high between the topology of the Class A GPCR network and networks with 20% of the sequences removed at random. |
SF4_kinase_NJtreeAndNet.pdf
|
Supplementary Fig. 4 Comparison of trees and networks: STE and WNK kinases |
SF5_both_trees.pdf
|
Supplementary Fig. 5 The archaeal bECH ECH domain is more similar to the sECH domain than the non-archaeal bECH ECH domain |
SF6_blastAsymmetry.pdf
|
Supplementary Fig. 6 Asymmetery in BLAST E-values: How large is the difference between the E-values calculated between sequence pair A,B when A is used as query, or B is used as query? |
SF7_three_plotsGPCR_1e-2.pdf
|
Supplementary Fig. 7 Example percent identity and length of alignment quartile plots |
2. Supplementary tables referenced in the text. All files are in Portable Document Format (PDF).
| File | Description |
| ST1-3_allSuppTables.pdf | Supplementary Table 1 Summary of network statistics: Correlation between organic laid-out network distances and the mathematically ideal BLAST E-value distances Supplementary Table 2 Comparison of mathematically ideal and displayed pairwise network distances between 51 human STE and WNK kinases Supplementary Table 3 Comparison of mathematically ideal and displayed pairwise distances between networks of the crotonase superfamily, using either full-length sequences or just the crotonase domain |
3. IDs and accessions for sequences labeled in figures. File is a text file.
| File | Description |
| labeled_protein_ids.txt | Each protein sequence specifically labled with text in a figure in Atkinson et al. is associated with a public database ID in this file. |
4. Movie demonstrating how network topology changes with threshold.
File is in Quicktime format (mov).
| File | Description |
hCaspases_ids_ed.mov [13M]
|
This 57 second movie roughly corrresponds to the toy network in Fig. 1 (see right). Note that colors in the figure at right and in the movie do not correspond. Using 12 human caspase sequences as an example of a tiny protein family, edges (pairwise relationships) are added to the display, one-by-one, in order of significance as determined by pairwise BLAST alignments. This is equivalent to continuously progressing from a very stringent E-value threshold to a permissive threshold. (Nodes are not visible until they participate in an edge that is better than the current threshold.) As each edge is displayed, information about the underlying BLAST alignment is printed: number of identical residues / length of alignment (% identity). The resulting network is laid out using a force-directed layout: nodes (sequences) are modeled as small masses that repel one another, and edges (alignments) are modeled as springs that pull nodes together. As a group of nodes becomes more interconnected, it is pulled together into a cluster. (This layout algorithm is not the Cytoscape Organic Layout used in the published figures.) Near the beginning of the movie, when a number of high- significance edges are displayed, the warm-toned (red, orange, yellow) nodes group together. A bit later, the cool-toned nodes (blue, green) group together. Warm-toned nodes correspond to inflammation-related caspases, and cool-toned nodes correspond to apoptosis- related caspases. Inflammation-related caspases tend to be more similar to one another than the apoptosis-related group, which is why they are more tightly grouped at a more significant threshold. By the end of the movie, the threshold is very permissive, so all of the nodes are nearly completely connected, and subgroup clustering is no longer evident. This movie was created using Processing. |
5. Datafiles generated in the analysis: sequence files, HMMs, sequence similarity networks
1. GPCRs
| File | Description | ||||||||||||||||||||||||||||||
| HMM: clan_GPCR_A.pfam.afa.hmm |
GPCR_A TM domain HMM | ||||||||||||||||||||||||||||||
| HMM: clan_FOCS_TMdom.pfam.afa.hmm |
FOCS ("Class B GPCRs") TM domain HMM | ||||||||||||||||||||||||||||||
| sequences: nava99nr_GPCRdom.fa |
766 99% non-redundant human GPCR domain sequences | ||||||||||||||||||||||||||||||
| sequences: classA_rho-like_dom.fa |
605 99% non-redundant human GPCR domain sequences (Class A: Rhodopsin-like) | ||||||||||||||||||||||||||||||
| sequences: 17randomHumanSeqs_289res.fa |
17 "non-GPCR" human protein sequences | ||||||||||||||||||||||||||||||
| sequences: ClassA_Amine_dom_99nr.fa |
42 99% non-redundant human GPCR domain sequences (amine-binding) | ||||||||||||||||||||||||||||||
| sequence alignment: ClassA_Amine_dom_99nr.afa |
alignment of the above 42 amine-binding GPCR domains | ||||||||||||||||||||||||||||||
| tree: amine_nj.nolabels.tree
|
Figure 2A: amine-binding GPCR tree Neighbor-Joining tree describing the interrelationships of 42 amine-binding human GPCR domains. [based on the above alignment] | ||||||||||||||||||||||||||||||
| network statistics: three_plots_nava1.22.08_99nr.fa.pdf |
Full length GPCRs: quartile plots showing how pairwise BLAST alignments change with E-value | ||||||||||||||||||||||||||||||
| network statistics: three_plots_nava99nr_GPCRdom.fa.pdf |
Domain-only GPCRs: quartile plots showing how pairwise BLAST alignments change with E-value | ||||||||||||||||||||||||||||||
| network: ClassA_Amine_dom_99nr.fa_1e-33.xgmml [473K]
|
Fig. 2B: amine-binding GPCR network Sequence similarity network including the same 42 GPCR domain sequences as in (A). This network was thresholded at a BLAST E-value of 1e-33: only edges associated with E-values more significant than 1e-33 are included in the network. This network contains 324 edges; the worst edges displayed correspond to a median of 30% identity over an alignment length of 280 amino acids.
|
||||||||||||||||||||||||||||||
| network: classA_rho_pctrl.fa_1e-11.xgmml [83M]
|
Fig. 4A: Class A: Rhodopsin-like GPCR network 605 human Class A: Rhodopsin-like GPCR domains. This sequence set includes the 42 amine-binding sequences from Table 2 and Fig. 2. This network was thresholded at a BLAST E-value of 1e-11; the worst edges displayed correspond to a median of 24% identity over an alignment length of 210 amino acids. *** see amine-binding GPCR network above for xgmml file sequence attributes |
||||||||||||||||||||||||||||||
| network: nava99nr_GPCRdom_pctrl.fa_0.01.xgmml [65M]
|
Fig. 4B: All GPCR superfamilies 766 human GPCR domains. This sequence set includes all of the 605 Class A sequences from (A). This network was thresholded at an E-value of 1e-2, and the worst edges displayed correspond to a median of 22% identity over an alignment length of 120 amino acids. *** see amine-binding GPCR network above for xgmml file sequence attributes *** Because this network contains many more edges than the other networks from this paper, this xgmml format file does not contain information about node color or positioning, as most of the others do -- the exported file becomes inconveniently large. In order to duplicate the look of Fig. 4B:
|
2. Kinases
| File | Description | ||||||||||||||||||
| sequences: hKD_nps_99nr.fa |
513 99% non-redundant human kinase domain sequences | ||||||||||||||||||
| sequences: hKD_STE_WNK.fa |
51 STE and WNK kinase domain sequences (from above sequence set) | ||||||||||||||||||
| sequence alignment: hKD_STE_WNK.afa |
alignment of the above human STE/WNK kinase domains | ||||||||||||||||||
| tree: hKD_STE_WNK.fa.tree
|
Supplementary Fig. 4A: STE and WNK kinase tree Neighbor-Joining tree describing the interrelationships of 51 human STE/WNK kinase domains. [based on the above alignment] | ||||||||||||||||||
| network statistics: three_plots_hKD_nps_99nr.fa.pdf |
513 human kinase domains: quartile plots showing how pairwise BLAST alignments change with E-value | ||||||||||||||||||
| network: hKD_STE_WNK.fa_1e-27.xgmml [970K]
|
Supplementary Fig. 4B: STE and WNK kinase network
Sequence similarity network including the same 51 STE/WNK kinase domains as in S4A. This network was thresholded at an E-value of 1e-27 and contains 821 edges; the worst edges displayed correspond to a median of 30% identity over an alignment length of 270 amino acids.
|
||||||||||||||||||
| sequences aligned to HMM database: hKD_nps_99nr.fa.kinase_hmms.hmmpfam |
The HMMER hmmpfam command was used to identify which PFAM model (Pkinase or Pkinase_Tyr) best fit each of the 513 kinase domains, and to estimate the identity of the catalytic Lys. This is the output of that command. (text file) | ||||||||||||||||||
| network: hKD_nps_99nr.fa_1e-25.xgmml [19M]
|
Fig. 3: Sequence similarity network of the kinase
superfamily Network of 513 human kinase domains: The network is thresholded at a BLAST E-value of 1e-25. The worst edges displayed correspond to a median of 29% identity over alignments of 260 residues
|
3. Crontonase/Enoyl-CoA Hydratase Superfamily
| File | Description | ||||||||||||||||||||
| sequences: crotPub01.16.08_99nr.fa |
1,170 99% non-redundant crotonase/Enoyl-CoA hydratase superfamily sequences (full-length) | ||||||||||||||||||||
| sequences: ECH_crotPub01.16.08_99nr.fa |
410 full length sequences annotated as members of the SFLD enoyl-CoA hydratase family (subset of the crotonase superfamily) | ||||||||||||||||||||
| sequence alignment: crotDomain3HBCH_clpped.afa |
Structure-based sequence alignment of crotonase domain from crotonase superfamily, with a couple of 3-hydroxyisobutyryl-CoA hydrolase sequences added in | ||||||||||||||||||||
| HMM: crotDomain.hmm |
Crotonase domain HMM based on the above sequence alignment | ||||||||||||||||||||
| sequences: crotDomain.fa |
1,170 99% non-redundant crotonase/Enoyl-CoA hydratase superfamily domain sequences isolated from full-length seqs as defined by crotDomain.hmm | ||||||||||||||||||||
| network statistics: three_plots_crotPub01.16.08_99nr.fa.pdf |
1,170 full-length crotonase superfamily sequences: quartile plots showing how pairwise BLAST alignments change with E-value | ||||||||||||||||||||
| network statistics: three_plots_crotDomain.fa.pdf |
1,170 domain-only crotonase superfamily sequences: quartile plots showing how pairwise BLAST alignments change with E-value | ||||||||||||||||||||
| network: crotPub01.16.08_99nr.fa_1e-30.xgmml [36M]
|
Fig. 5A,B: Crotonase superfamily: full-length
network
1,170 sequences from the crotonase superfamily, involving full-length sequences, thresholded at an E-value of 1e-30. The worst edges displayed correspond to a median of 33% identity over alignments of 250 residues. *** Because this network contains many more edges than the most of other networks from this paper, this xgmml format file does not contain information about node color or positioning, as most of the others do -- the exported file becomes inconveniently large. In order to duplicate the look of Fig. 5A:
|
||||||||||||||||||||
| network: crotDomain.fa_1e-29.xgmml [48M]
|
Fig. 5C: Crotonase superfamily: domain-only
network
1,170 sequences from the crotonase superfamily, involving just the crotonase domain, thresholded at 1e-29. The worst edges displayed correspond to a median of 38% identity over alignments of 180 residues. *** Includes all of the attributes in the full-length crotonase superfamily network, and has some additional attributes:
|
||||||||||||||||||||
| network: ECH_crotPub01.16.08_99nr.fa_1e-50.xgmml [18M]
|
Fig. 6: Enoyl-CoA hydratase family: full-length
network
The 410-sequence network is thresholded at a BLAST E-value of 1x10-50; the worst edges displayed correspond to a median of 40% identity over alignments of 260 amino acids. *** Includes the same attributes as the full-length crotonase superfamily network above |
||||||||||||||||||||
| sequence alignment: tree.ids.afa |
MUSCLE alignment of 15 representative full-length sequences from the ECH family and superfamily | ||||||||||||||||||||
| tree: tree.ids.nex.con.tree
|
Figure S5C: Bayesian phylogeny, full length ECH sequences Bayesian consensus tree consisting of 15 representative sequences, based on the alignment above | ||||||||||||||||||||
| sequence alignment: tree.ids.crotDom.afa |
MUSCLE alignment of 15 representative ECH-domain-only sequences from the ECH family and superfamily | ||||||||||||||||||||
| tree: tree.ids.crotDom.nex.con.tree
|
Figure S5D: Bayesian phylogeny, ECH domain sequences Bayesian consensus tree consisting of 15 representative sequences, based on the alignment above |
Laboratory Overview | Research | Outreach & Training | Available Resources | Visitors Center | Search