

UCSF RBVI Cytoscape Plugins

clusterMaker: Creating and Visualizing Cytoscape Clusters

Figure 1. clusterMaker in action. In this screenshot, the expression data in the sampleData file galFiltered.cys has been clustered using the hierarchical method and displayed as a heatmap with associated dendrogram. The groups created by clustering are shown on the network.

UCSF clusterMaker is a Cytoscape plugin that unifies different clustering techniques and displays into a single interface. Current clustering algorithms include Hierarchical and k-Means for clustering expression or genetic data, and MCL and FORCE for clustering similarity networks to look for protein families (and putative functional similarities). Hierarchical and k-Means clusters may be displayed as hierarchical groups of nodes or as heat maps. MCL and FORCE both create collapsible "meta nodes" to allow interactive exploration of the putative family associations within the Cytoscape network, and MCL results may also be shown as a separate network containing only the intra-cluster edges. Another plugin, clusterExplorer, may be used after MCL or FORCE clustering to explore cluster statistics and relationships. clusterMaker requires version 2.6.1 or newer of Cytoscape and is available from the Cytoscape plugin manager under the Analysis category.

Instructions

Installation

clusterMaker is available through the Cytoscape plugin manager or by downloading the source directly from the Cytoscape svn repository (see Cytoscape Subversion Server information, or browse the csplugins/ucsf/scooter/clusterMaker sources). To download clusterMaker using the plugin manager, you must be running Cytoscape 2.6.1 or newer. clusterMaker is available in the Analysis group of plugins. To install it, bring up the Manage Plugins dialog (Plugins→Manage Plugins) and select Analysis under Available for Install. Select clusterMaker and click the Install button.

Figure 2. clusterMaker plugin menu.

Starting clusterMaker

Once clusterMaker is installed, it adds a new Cluster menu hierarchy under the Plugins main menu. Each of the supported clustering algorithms appears as a separate menu item underneath the Cluster menu. To cluster your data, select Plugins→Cluster→algorithm, where algorithm is the clustering algorithm you wish to use (see Figure 2). This will bring up the settings dialog for the selected algorithm (see below).

The Cluster menu also contains a subset of the visualization options, including showing a heat map of the data (without clustering), and options appropriate for displaying Hierarchical or k-Means clusters if either of those methods has been performed on the current network. Because information about clusters is saved in Cytoscape attributes, the Eisen TreeView and Eisen KnnView options will be available in a session that was saved after clustering.

Hierarchical Clustering

Figure 3. clusterMaker Hierarchical cluster dialog.

Hierarchical clustering builds a dendrogram (binary tree) such that more similar nodes are likely to connect more closely into the tree. Hierarchical clustering is useful for organizing the data to get a sense of the pairwise relationships between data values and between clusters. The clusterMaker hierarchical clustering dialog is shown in Figure 3. There are several options for tuning hierarchical clustering:
Linkage:
In agglomerative clustering techniques such as hierarchical clustering, at each step in the algorithm, the two closest groups are chosen to be merged. In hierarchical clustering, this is how the dendrogram (tree) is constructed. The measure of "closeness" is called the linkage between the two groups. Four linkage types are available:
  • pairwise average-linkage: the mean distance between all pairs of elements in the two groups
  • pairwise single-linkage: the smallest distance between all pairs of elements in the two groups
  • pairwise maximum-linkage: the largest distance between all pairs of elements in the two groups
  • pairwise centroid-linkage: the distance between the centroids of the two groups
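The four linkage types can be illustrated with a minimal Python sketch. It is not clusterMaker's implementation; the function names and the use of Euclidean distance between rows are assumptions made only for this example.

```python
import math
from itertools import product

def euclidean(a, b):
    """Euclidean distance between two equal-length rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(group):
    """Column-wise mean of the rows in a group."""
    n = len(group)
    return [sum(col) / n for col in zip(*group)]

def linkage_distance(group_a, group_b, method="average", dist=euclidean):
    """Distance between two groups of rows under the named linkage (hypothetical helper)."""
    if method == "centroid":
        return dist(centroid(group_a), centroid(group_b))
    # The other three linkages are summaries of all cross-group pair distances.
    pair_dists = [dist(a, b) for a, b in product(group_a, group_b)]
    if method == "average":
        return sum(pair_dists) / len(pair_dists)
    if method == "single":
        return min(pair_dists)
    if method == "maximum":
        return max(pair_dists)
    raise ValueError(f"unknown linkage: {method}")

group_a = [[0.0, 0.0], [0.0, 2.0]]
group_b = [[3.0, 0.0], [3.0, 2.0]]
for method in ("average", "single", "maximum", "centroid"):
    print(method, linkage_distance(group_a, group_b, method))
```

At each agglomerative step, the pair of groups with the smallest linkage distance would be merged, which is how the dendrogram grows from the leaves upward.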

Distance Matrix:
There are several ways to calculate the distance matrix that is used to build the cluster. In clusterMaker these distances represent the distances between two rows (usually representing nodes) in the matrix. clusterMaker currently supports eight different metrics:
  • Euclidean distance: the ordinary Euclidean distance between two rows, calculated as the square root of the sum of the squares of the differences between the values.
  • City-block distance: the sum of the absolute value of the differences between the values in the two rows.
  • Pearson correlation: the Pearson product-moment coefficient of the values in the two rows being compared. This value is calculated by dividing the covariance of the two rows by the product of their standard deviations.
  • Pearson correlation, absolute value: similar to the value above, but using the absolute value of the correlation between the two rows.
  • Uncentered correlation: the standard Pearson correlation includes terms to center the sum of squares around zero. This metric makes no attempt to center the sum of squares.
  • Uncentered correlation, absolute value: similar to the value above, but using the absolute value of the uncentered correlation between the two rows.
  • Spearman's rank correlation: Spearman's rank correlation (ρ) is a non-parametric measure of the correlation between the two rows. This metric is useful in that it makes no assumptions about the frequency distribution of the values in the rows, but it is relatively expensive (i.e., time-consuming) to calculate.
  • Kendall's tau: Kendall tau rank correlation coefficient (τ) between the two rows. As with Spearman's rank correlation, this metric is non-parametric and computationally much more expensive than the parametric statistics.
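As a rough illustration of the metrics listed above, the following Python sketch computes each quantity for two rows. The function names are hypothetical, the Kendall variant shown is the simple tau-a, and the way a correlation is converted into a distance is not described here, so the sketch stops at the correlation values themselves. The absolute-value variants are simply abs() applied to the corresponding correlation.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def city_block(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    """Covariance divided by the product of the standard deviations."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def uncentered(x, y):
    """Like Pearson, but without subtracting the row means."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def ranks(xs):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Tau-a: (concordant - discordant) pairs over all pairs; O(n^2) comparisons."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)
```

The nested loop in kendall_tau and the sort in ranks show why the two rank-based metrics cost more than the parametric ones, which need only a single pass over each row.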

Array sources:
This area contains the list of all numeric node and edge attributes that can be used for hierarchical clustering. At least one edge attribute or one or more node attributes must be selected to perform the clustering. If an edge attribute is selected, the resulting matrix will be symmetric across the diagonal with nodes on both columns and rows. If multiple node attributes are selected, the attributes will define columns and the nodes will be the rows.

Only use selected nodes/edges for cluster:
Under certain circumstances, it may be desirable to cluster only a subset of the nodes in the network. Checking this box limits all of the clustering calculations and results to the currently selected nodes or edges.

Cluster attributes as well as nodes:
If this box is checked, the clustering algorithm will be run twice, first with the rows in the matrix representing the nodes and the columns representing the attributes. The resulting dendrogram provides a hierarchical clustering of the nodes given the values of the attributes. In the second pass, the matrix is transposed and the rows represent the attribute values. This provides a dendrogram clustering the attributes. Both the node-based and the attribute-based dendrograms can be viewed, although Cytoscape groups are only formed for the node-based clusters.

Ignore nodes with no data:
A common use of clusterMaker is to map expression data onto a pathway or protein-protein interaction network. Often the expression data will not cover all of the nodes in the network, so the resulting dendrogram will have a number of "holes" that might make interpretation more difficult. If this box is checked, only nodes that have values for at least one of the attributes will be included in the matrix.

Create groups from clusters:
If this box is checked, hierarchical Cytoscape groups will be created from the clusters. Hierarchical groups can be very useful for exploring the clustering in the context of the network. However, if you intend to perform multiple runs to try different parameters, be aware that repeatedly removing and recreating the groups can be very slow.
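The matrix-related options above (Array sources, Cluster attributes as well as nodes, and Ignore nodes with no data) can be pictured with a short Python sketch. The data layout, the function names, and the use of None for missing values are assumptions for illustration only, not clusterMaker internals.

```python
def edge_attribute_matrix(nodes, edges):
    """Symmetric node-by-node matrix from (source, target, value) triples.

    Cells with no edge stay None (treated here as missing data).
    """
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[None] * len(nodes) for _ in nodes]
    for source, target, value in edges:
        i, j = index[source], index[target]
        matrix[i][j] = value
        matrix[j][i] = value  # mirror across the diagonal
    return matrix

def node_attribute_matrix(nodes, attributes, values):
    """Rows are nodes, columns are attributes; values maps (node, attribute) to a number."""
    return [[values.get((node, attr)) for attr in attributes] for node in nodes]

def drop_rows_without_data(nodes, matrix):
    """Keep only nodes that have a value for at least one attribute."""
    kept = [(n, row) for n, row in zip(nodes, matrix) if any(v is not None for v in row)]
    return [n for n, _ in kept], [row for _, row in kept]

def transpose(matrix):
    """Swap rows and columns, as in the second pass that clusters the attributes."""
    return [list(col) for col in zip(*matrix)]

nodes = ["A", "B", "C"]
attributes = ["t1", "t2"]
values = {("A", "t1"): 1.0, ("A", "t2"): 2.0, ("B", "t1"): 0.5}

full = node_attribute_matrix(nodes, attributes, values)
kept_nodes, kept_rows = drop_rows_without_data(nodes, full)
print(kept_nodes, kept_rows, transpose(kept_rows))
```

Node C has no values at all, so it is the kind of row that would otherwise leave a "hole" in the dendrogram.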

Figure 4. clusterMaker K-Means cluster dialog.

K-Means Clustering

K-Means clustering is a partitioning algorithm that divides the data into k non-overlapping clusters, where k is an input parameter. The clusterMaker k-Means implementation is simple and fast and can be used for moderately large data sets. The clusterMaker k-Means clustering dialog is shown in Figure 4. The options for k-Means clustering are the same as for hierarchical clustering except the Linkage option has been replaced with the Number of clusters option (the value for k). The user can also specify the Number of iterations of the k-Means algorithm. One of the challenges in k-Means clustering is that the number of clusters must be chosen in advance. A simple rule of thumb for choosing the number of clusters is to take the square root of ½ of the number of nodes. Beginning with clusterMaker version 1.6, this value is provided as the default value for the number of clusters. If Only use selected nodes/edges for cluster is checked, the number of nodes used is the number of selected nodes; otherwise, it is the total number of nodes. This number is not recalculated automatically when the checkbox state changes, so you will need to change the checkbox and close/reopen the dialog to see the updated value.
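The rule of thumb and the basic k-Means loop can be sketched in Python as follows. This is a generic textbook version, not the clusterMaker code: the rounding in default_k, the random choice of starting centers, and the reading of "iterations" as assignment/update passes are all assumptions of this example.

```python
import math
import random

def default_k(node_count):
    """Rule of thumb: square root of half the node count (rounding is an assumption)."""
    return max(1, round(math.sqrt(node_count / 2)))

def sq_dist(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(rows, k, iterations=10, seed=0):
    """Return (assignment, centers) after a fixed number of passes."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    assignment = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each row joins its nearest center.
        assignment = [
            min(range(k), key=lambda c: sq_dist(row, centers[c])) for row in rows
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [row for row, a in zip(rows, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

rows = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignment, centers = k_means(rows, k=2)
print(default_k(200), assignment)
```

For a network of 200 nodes the rule of thumb gives k = 10; selecting only 50 of those nodes would give k = 5.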

MCL

Figure 5. clusterMaker MCL cluster dialog.

The Markov Cluster algorithm (MCL) is a fast divisive clustering algorithm for graphs based on simulation of the flow in the graph. MCL has been applied to complex biological networks such as protein-protein similarity networks. As with all of the clustering algorithms, the first step is to create a matrix of the values to be clustered. For MCL, these values must be stored in edge attributes. Once the matrix is created, the MCL algorithm is applied for some number of iterations. There are two basic steps in each iteration of MCL. The first is the expansion phase, in which the matrix is expanded by ordinary matrix-matrix multiplication of the matrix by itself. The next is the inflation phase, in which each non-zero value in the matrix is raised to a power and the result is then rescaled by a diagonal scaling (normalization). Any values below a certain threshold are dropped from the matrix after the normalization (scaling) step in each iteration. This process models the spreading out of flow during expansion, allowing it to become more homogeneous, then contracting the flow during inflation, where it becomes thicker in regions of higher current and thinner in regions of lower current.
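The expansion and inflation steps can be illustrated with a small pure-Python sketch. Expansion is taken here as multiplying the flow matrix by itself. Several details are assumptions of this example rather than documented clusterMaker behavior: the added self-loops, the residual being the largest change in any cell between iterations, and the way clusters are read from the final matrix.

```python
def normalize_columns(m):
    """Diagonal scaling: make every non-empty column sum to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)] for i in range(n)]

def expand(m):
    """Expansion: multiply the matrix by itself."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def inflate(m, power, threshold):
    """Inflation: raise entries to a power, rescale columns, then drop tiny values."""
    raised = [[v ** power for v in row] for row in m]
    scaled = normalize_columns(raised)
    return [[v if v >= threshold else 0.0 for v in row] for row in scaled]

def mcl(weights, inflation=2.0, threshold=1e-10, max_iterations=16, max_residual=1e-6):
    """Return clusters as frozensets of node indices (hypothetical helper)."""
    n = len(weights)
    # Adding self-loops is a common MCL convention and an assumption here.
    m = normalize_columns(
        [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    )
    for _ in range(max_iterations):
        nxt = inflate(expand(m), inflation, threshold)
        # Residual (assumed definition): largest change in any cell.
        residual = max(abs(a - b) for ra, rb in zip(nxt, m) for a, b in zip(ra, rb))
        m = nxt
        if residual < max_residual:
            break
    # Each row that still carries flow names the members of one cluster.
    clusters = []
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > 0)
        if members and members not in clusters:
            clusters.append(members)
    return clusters

# Two disconnected triangles: nodes 0-2 and nodes 3-5.
weights = [[0.0] * 6 for _ in range(6)]
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    weights[a][b] = weights[b][a] = 1.0
print(mcl(weights))
```

The keyword arguments mirror the dialog parameters described below: the inflation power, the pruning threshold, the maximum number of iterations, and the maximum residual.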

The MCL dialog is shown in Figure 5. There are four parameters and three options, plus the selection of the edge attribute to use to build the matrix. Each parameter is discussed below:

Figure 6. The network created by clusterMaker's MCL clustering algorithm for MS/TAP protein-protein interaction data from yeast.

Density Parameter
The Density Parameter is also known as the inflation parameter. This is the power used to inflate the matrix. Reasonable values range from ~1.8 to about 2.5. A good starting point for most networks is 2.0.
Weak EdgeWeight Pruning Threshold
After each inflation pass, very small edge weights are dropped out. This should be a very small number: 1x10^-10 or so. The algorithm is more sensitive to tuning this parameter than it is to tuning the inflation parameter. Changes in this parameter also significantly impact the performance.
Number of iterations
This is the maximum number of iterations to execute the algorithm.
The maximum residual value
After each iteration, the residuals are calculated, and if they are less than this value, the algorithm is terminated. Set this value very small to ensure that you get sufficient iterations.
Take the -LOG of Edge Weights in Network
This is probably self-explanatory, but if your edge weights are e-values of some sort (e.g., BLAST values), then you probably want to take the -LOG of the weights before creating the matrix.
Create a new network with independent clusters
Checking this option will result in the creation of a new Cytoscape network that shows only the intra-cluster edges. Figure 6 shows an example.
Cluster only selected nodes
On occasion, it might be desirable to cluster only the selected nodes (and their connected edges). This box provides that capability.
Array sources:
This area contains the list of all numeric edge attributes that can be used for MCL clustering. At least one edge attribute must be selected to perform the clustering.
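Two of the options above, Take the -LOG of Edge Weights in Network and Create a new network with independent clusters, amount to simple transformations of the edge list. The Python sketch below shows one way to picture them. The log base (10), the floor used to avoid taking the log of zero, and the function names are assumptions for this example.

```python
import math

def neg_log_weights(edges, floor=1e-300):
    """Replace each weight w with -log10(w), clamping w to a small positive floor."""
    return [(s, t, -math.log10(max(w, floor))) for s, t, w in edges]

def intra_cluster_edges(edges, cluster_of):
    """Keep only edges whose two endpoints belong to the same cluster."""
    return [
        (s, t, w)
        for s, t, w in edges
        if s in cluster_of and t in cluster_of and cluster_of[s] == cluster_of[t]
    ]

# Hypothetical e-value edges and cluster assignments.
edges = [("p1", "p2", 1e-50), ("p2", "p3", 1e-5), ("p3", "p4", 1e-40)]
cluster_of = {"p1": 0, "p2": 0, "p3": 1, "p4": 1}

print(neg_log_weights(edges))
print(intra_cluster_edges(edges, cluster_of))
```

After the transform, a smaller e-value (a stronger match) becomes a larger edge weight, which is the direction a flow-based algorithm expects.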

Visualizing Results

Besides the MCL option to Create a new network with independent clusters, clusterMaker provides three types of heat map display: HeatMapView (unclustered), Eisen TreeView, and Eisen KnnView. The user interfaces for the heat map displays are very similar. The most complicated is the Eisen TreeView, which will be discussed in detail first. The Eisen KnnView and HeatMapView (unclustered) will then be discussed briefly, with an emphasis on how they differ from the Eisen TreeView.

Figure 7. clusterMaker's Eisen TreeView. The larger image shows the results of hierarchically clustering the nodes and five node attributes (expression data from a heat shock experiment). The inset shows the results of hierarchical clustering using an edge attribute. The resulting heat map is symmetrical across the diagonal, and the dendrograms at the left and top are the same.

Eisen TreeView

Hierarchical clustering results are usually displayed with the Eisen TreeView (see Figure 7). This view provides a color-coded "Heat Map" of the data values and the dendrogram from clustering. The Eisen TreeView can be created by clicking on the Visualize Clusters button in the Hierarchical cluster dialog (Figure 3) or by selecting Plugins→Cluster→Eisen TreeView from the Cytoscape menu bar. Note that both of these methods will be "grayed out" unless hierarchical clustering has been performed on the current network. The information necessary to create the TreeView is retained across sessions (stored in network attributes), so these options should be available when you reload a session that had been saved after hierarchical clustering.

The basic TreeView window has four main vertical windows: Node Dendrogram, Global HeatMap, Zoom HeatMap, and the Node List. These windows may be resized to emphasize different portions of the TreeView. Each of the windows is discussed in detail below. Note that selection of a row in TreeView will select the corresponding node in the current network view in Cytoscape (if that node exists). The reverse is also true -- selection in Cytoscape will select the corresponding nodes in TreeView. This is an important feature of clusterMaker: multiple views (current network, multiple heat maps if present) respond simultaneously to a selection in any one view.

Node Dendrogram
The leftmost pane displays the node dendrogram for the heat map. At the top of the pane is a Status window that changes depending on the location of the pointer. With the pointer over the node dendrogram, the Status window will display the ID and correlation for the currently selected branch of the dendrogram (if any). If the cursor is over the Global HeatMap window the Status window displays the number of genes (nodes) and arrays (attributes) selected and the range of the selections. Finally, if the cursor is over the Zoom HeatMap window, the Status window displays the node and attribute name as well as the value of the spot under the pointer.
Mouse and keyboard actions in node dendrogram pane
Action | Target | Result
click | Dendrogram branch | Select that branch of the dendrogram and all children
up arrow | | If there is a currently selected branch, select its parent and all subsequent children
down arrow | | If there is a currently selected branch, move to the top branch and deselect the bottom branch
left arrow | | If there is a currently selected branch, move to the top branch and deselect the bottom branch
right arrow | | If there is a currently selected branch, move to the bottom branch and deselect the top branch
Global HeatMap
The Global HeatMap is the next pane over, divided into two parts. The upper part contains the dendrogram for the attributes (if they were clustered) and the lower part contains the entire heat map in a scrolling window. Horizontal and vertical scroll bars will be provided as needed. Selection of branches in the dendrogram in the upper window is similar to that in the node dendrogram (see above). Selections in the Global HeatMap pane are shown with a thin yellow outline. The area corresponding to the Zoom HeatMap view is shown with a thin blue outline.
Mouse and keyboard actions in global heatmap pane
Action | Target | Result
click | Heat map | Select that row of the heat map
shift-click | Heat map | Select that cell of the heat map
drag | Heat map | Select the rows encompassed by the dragged-out region
shift-drag | Heat map | Select the region encompassed by the dragged-out area
up arrow | | If there is a current selection, move that selection up one row
down arrow | | If there is a current selection, move that selection down one row
left arrow | | If there is a current selection, move that selection left one column
right arrow | | If there is a current selection, move that selection right one column
control-up arrow | | If there is a current selection, expand that selection by two rows (one on the top and one on the bottom)
control-down arrow | | If there is a current selection, contract that selection by two rows (one on the top and one on the bottom)
control-left arrow | | If there is a current selection, expand that selection by two columns (one on the left and one on the right)
control-right arrow | | If there is a current selection, contract that selection by two columns (one on the left and one on the right)
Zoom HeatMap
The Zoom HeatMap view shows the nodes and attributes selected in the Global HeatMap window. It has three sections: the top section lists the names of the attributes that correspond to the columns in the heat map, the next section down contains the dendrogram for the columns (if one was calculated), and the bottom section contains the heat map itself. There are no mouse or keyboard actions in the top or bottom windows, but if a dendrogram is present, it will respond to mouse and keyboard actions in the same way as the Global HeatMap dendrogram.
Node List
Finally, the right-most pane lists each node shown in the Zoom HeatMap pane. The list is sized to correspond exactly to the rows in the Zoom HeatMap pane and scrolls along with it so that the names stay aligned with the rows.
In addition to the various windows, each heat map dialog provides a series of buttons:

Figure 8. Pixel Settings Dialog.

Settings...
The Settings... button brings up the Pixel Settings dialog, which allows users to customize the dimensions of heat map cells in the Global and Zoom panes. The dimensions can be specified as pixel values (Fixed scale) for X (width) and Y (height), or set to fill the available space automatically (Fill).

Users can specify which color scheme should be used: a red-green (RedGreen) continuum or the default yellow-blue (YellowBlue). Color schemes may also be customized by setting the Positive, Zero, Negative, and Missing values. Once these values have been assigned, they can be saved as presets (Make Preset). The Load... and Save... buttons are used to load and save color sets, respectively.

The Pixel Settings dialog also provides a Contrast slider to adjust the contrast of the colors. This is useful to emphasize more subtle differences in heat map values. Finally, LogScale rather than a linear mapping of values to colors can be used, and the center point set to improve the display of single-tailed data.
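One way to picture how the Positive, Zero, Negative, and Missing colors and the Contrast setting interact is the linear mapping sketched below in Python. This is an illustration under stated assumptions, not the plugin's code: contrast is treated as the magnitude at which a color saturates, the RGB values chosen for the yellow-blue scheme are guesses, and log scaling is left out.

```python
def blend(start, end, t):
    """Linear interpolation between two RGB tuples, with t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(start, end))

def value_to_color(
    value,
    contrast=3.0,
    positive=(255, 255, 0),   # yellow (assumed)
    zero=(0, 0, 0),           # black (assumed)
    negative=(0, 0, 255),     # blue (assumed)
    missing=(128, 128, 128),  # gray (assumed)
):
    """Map a data value to an RGB color; None means missing data."""
    if value is None:
        return missing
    # Values at or beyond +/- contrast are drawn at full saturation.
    t = min(abs(value) / contrast, 1.0)
    return blend(zero, positive if value >= 0 else negative, t)

for v in (None, 0.0, 0.5, 3.0, -10.0):
    print(v, value_to_color(v))
```

Under this reading, lowering the contrast value makes small values reach the saturated color sooner, which is what emphasizes subtle differences in the heat map.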

Save Data...
In order to facilitate data exchange and analysis by other software, the Save Data... button will export the current data in Cluster format, including the .cdt, .gtr, and .atr files, as appropriate.

Figure 9. Export Graphics Dialog.

Export Graphics...
The Export Graphics... button brings up the Export Graphics Dialog (see Figure 9), which provides an interface to export the heat map to a variety of different graphics formats:
Graphics Formats Supported by clusterMaker
Format | Type | Quality
png | Bitmap | highest bitmap quality
jpg | Bitmap | reasonable bitmap quality, but aberrations visible at high scales
bmp | Bitmap | very good bitmap quality
pdf | Vector | excellent quality
svg | Vector | excellent quality, but not widely supported
eps | Vector | excellent quality, but will need to be processed by a separate program

Figure 10. Example export of a portion of a TreeView heat map showing both the Node and Attribute dendrograms.

Generally, vector formats yield a higher-quality appearance, as they can be scaled. Particularly for use in a graphics package such as Adobe Illustrator or Adobe Photoshop, vector formats are much preferred. For inclusion in a web page or presentation, png is a reasonable choice if you are not planning on doing any significant zooming and cropping (see Figure 10).

Options for what is included in the output depend on the type of display. For TreeView heat maps that are symmetric (i.e., created using an edge attribute), the Left Node Tree and the Top Node Tree may be included in the output, and you will almost always want to include the Heat Map itself. For TreeView heat maps with both nodes and attributes clustered, you will be able to include the Node Tree, Attribute Tree, and the Heat Map. If the attributes were not clustered, the Attribute Tree will not be available.

If only part of the heat map is desired, you can choose to save just the selected portion (Selection Only). Note that to include dendrograms in the output, you will need to select a full subtree.

Flip Tree Nodes
The Flip Tree Nodes button will flip the order of the trees in the top dendrogram, if it exists. At this time, there is no corresponding way to flip the left dendrogram.

Map Colors Onto Network...
The Map Colors Onto Network... button provides a method for mapping the colors from the heat map back onto Cytoscape nodes (and edges for symmetric heat maps). If a single column (attribute) is selected, a new VizMap will be created and the colors corresponding to that attribute will be assigned to the nodes in the network view. If multiple columns are selected, the Map Colors to Network dialog (shown in Figure 11 below) will be displayed. From this dialog, you will be able to select a single attribute and create the VizMap for that attribute, or select multiple attributes to create a VizMap for each attribute and animate through them. An Animation Speed slider allows the user to select the speed of the animation. The initial pass will take slightly longer as the VizMap for each attribute needs to be created, but after that, the animation speed should correspond closely to the slider.

NOTE: At this time, there is no way to save the animation as a movie, although this is a much-requested feature and will be implemented in the future.

Figure 11. The Map Colors to Network Dialog

Close
The Close button closes the dialog.

Figure 12. The Eisen KnnView dialog showing the results of visualizing a k-Means cluster with k=30.

Eisen KnnView

The results of k-Means clustering can be shown with the Eisen KnnView (Figure 12). This is much the same as the Eisen TreeView discussed above, except that the dendrogram areas are empty and the clusters are separated by a blank space the width of one cell. All of the features discussed as part of the Eisen TreeView are available in the Eisen KnnView except the Flip Tree Nodes button, and the Node Tree and Attribute Tree options are not available in the Export Graphics dialog.
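The row layout of the KnnView can be sketched as follows: rows are grouped by cluster, and a spacer is placed between consecutive clusters. The function name and the use of None as the spacer are assumptions of this Python example.

```python
def knn_row_layout(assignment):
    """Order row indices by cluster, with None marking a one-cell gap between clusters."""
    layout = []
    for cluster in sorted(set(assignment)):
        if layout:
            layout.append(None)  # blank spacer row between clusters
        layout.extend(i for i, a in enumerate(assignment) if a == cluster)
    return layout

# Cluster index for each of four rows (hypothetical k-Means result).
print(knn_row_layout([1, 0, 1, 0]))
```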

HeatMapView (unclustered)

Any attribute or group of attributes may be shown as a heat map using the clusterMaker HeatMapView (unclustered). The dialog is identical to the Eisen KnnView except that since there are no clusters, there are no blank spaces separating the clusters.

Acknowledgements

The Hierarchical and k-Means implementations of clusterMaker are based on the Cluster 3.0 C implementation (from Michiel de Hoon while at the Laboratory of DNA Information Analysis at the University of Tokyo), which was based on the original Cluster program written by Michael Eisen. The heatmap/dendrogram visualization is based on Java TreeView implemented by Alok Saldanha while at Stanford University. The MCL cluster algorithm was written based on the original thesis by Stijn van Dongen, with reference to the Java implementation by Gregor Heinrich (see http://www.arbylon.net/projects/knowceans-mcl/doc/).

References

  1. M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein: Cluster analysis and display of genome-wide expression patterns. PNAS, 95(25):14863-8 (1998). [PMID:9843981]
  2. A. J. Saldanha: Java Treeview--extensible visualization of microarray data. Bioinformatics, 20(17):3246-8 (2004). [PMID:15180930]
  3. M. J. L. de Hoon, S. Imoto, J. Nolan, and S. Miyano: Open Source Clustering Software. Bioinformatics, 20(9):1453-1454 (2004). [PMID:14871861]
  4. A. J. Enright, S. van Dongen, and C. A. Ouzounis: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research, 30(7):1575-1584 (2002). [PMID:11917018]
  5. T. Wittkop, J. Baumbach, F. P. Lobo, and S. Rahmann: Large scale clustering of protein sequences with FORCE - A layout based heuristic for weighted cluster editing. BMC Bioinformatics, 8:396 (2007). [PMID:17941985]
  6. S. van Dongen: Graph clustering by flow simulation [PhD dissertation]. Utrecht (The Netherlands): University of Utrecht. 169 p. (2000).

