Overview

A central focus of cancer genetics is the study of mutations that are causally implicated in tumorigenesis. These mutations not only provide insights into cancer biology but also present anti-cancer therapeutic targets and diagnostic markers. Several major cancer genome projects are currently underway to profile genetic changes in large collections of tumor samples. However, a majority of the changes identified in such screens are non-functional passenger mutations. Even with current technologies, it remains a challenge to distinguish causal cancer mutations from passenger mutations and other harmless genetic alterations.

We have developed a novel method that uses known cancer-associated variants to build a general description of a cancer mutation. This method is described in detail in our Cancer Research paper. Three different algorithms are used to generate measurements describing cancer-associated variants. These measurements are then used to construct a random forest classifier. The classifier takes these three measurements for a particular variant as input and provides a prediction as to whether the mutation is likely to be cancer-associated or not. The diagram below outlines our method.

 

Measuring the characteristics of a cancer mutation: the SIFT, Pfam-based LogR.E-value and Gene Ontology Similarity Score metrics

The SIFT and Pfam-based LogR.E-value algorithms predict which changes are likely to affect protein function. SIFT uses sequence homology among closely related proteins to measure this effect, and low SIFT scores (< 0.05) are predictive of intolerant changes. Only variants whose SIFT alignment has a median information content below 3.25 are used. The LogR.E-value algorithm determines the difference between a wildtype and a variant protein by measuring their fit to a Pfam model. The LogR.E-value score is derived from the E-value provided by the HMMER2 software, and scores greater than 0.5 are predicted to alter protein function. The underlying scoring systems behind LogR.E-value and SIFT are different, and it has been shown that these algorithms produce distinct metrics that can be used to independently analyze variant data. As shown in the diagram below, these methods predict that cancer variants are more likely to be deleterious than common SNPs or variants involved in complex disease.

 

The last element of the classifier is the Gene Ontology Similarity Score (GOSS). Using a standard set of Gene Ontology (GO) annotations, a log-odds score was calculated for each GO term to measure the difference in how frequently that term annotates two groups of genes: those known to be involved in cancer, and those not known to be involved in cancer but represented by a RefSeq mRNA. For example, genes annotated with the term ion transport are under-represented in the cancer dataset and this term has a negative log-odds score of -2.30, while genes annotated with the term cell cycle are over-represented in the cancer dataset and this term has a positive log-odds score of 2.06. For a gene of interest annotated with a set of GO terms, one can sum the log-odds scores of its GO terms. This cumulative score reveals inherent differences between genes containing common variants and genes containing cancer variants. See diagram below for more details.
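The summation described above can be sketched in a few lines of Python. This is only an illustration: the two per-term scores are the ones quoted in the text, while the function names, the treatment of unknown terms as contributing zero, and the exact log-odds formula (natural log of a plain frequency ratio, no pseudocounts) are assumptions, since the text does not specify them.

```python
import math

# Per-term log-odds scores. These two values are quoted in the text; a real
# table would cover every GO term in the annotation set.
TERM_LOG_ODDS = {
    "ion transport": -2.30,
    "cell cycle": 2.06,
}

def term_log_odds(cancer_with_term, cancer_total, other_with_term, other_total):
    """Log-odds of a GO term's frequency in cancer genes vs. other genes.

    Assumption: natural log of a simple frequency ratio. The exact formula
    used by CanPredict is not given in the text.
    """
    cancer_freq = cancer_with_term / cancer_total
    other_freq = other_with_term / other_total
    return math.log(cancer_freq / other_freq)

def goss(go_terms, table=TERM_LOG_ODDS):
    """Sum the log-odds scores of a gene's GO terms.

    Assumption: terms missing from the table contribute 0.
    """
    return sum(table.get(term, 0.0) for term in go_terms)

# A hypothetical gene annotated with both example terms.
score = goss(["cell cycle", "ion transport"])
print(round(score, 2))
```

A gene whose terms are mostly over-represented among cancer genes accumulates a positive total, which is what lets the score separate the two groups.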

 

Prediction using a Random Forest Classifier

A random forest (RF) classifier uses multiple decision trees (a forest of them) to build a predictive model that can be used to classify novel data. In the decision tree method, given a set of data points from known classes and a number of descriptive features for each data point, a decision tree is built by recursively splitting the data points on the basis of these descriptive features. Each splitting decision is optimized to best segregate the data into groups by class. The set of splitting decisions can be visualized as a tree and used to predict the class of a new data point given its descriptive features. The random forest method improves on this procedure by building a large number of decision trees, each from a random subset of the training data. The resulting forest of trees can then be used to 'vote' on the most likely class of new data points. This use of random subsets of the training data avoids the problem of overfitting, in which the classifier follows the peculiarities of the training data too closely to accurately classify new datasets. A very good description of this method can be found on Leo Breiman's website at UC Berkeley.
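To make the voting idea concrete, here is a deliberately simplified sketch in plain Python. It is not the randomForest package and not the CanPredict classifier: each "tree" is a one-split decision stump, each stump is trained on a bootstrap sample (sampling with replacement), the random feature selection used by real random forests is omitted, and the toy feature values and all function names are made up for illustration.

```python
import random
from collections import Counter

def train_stump(points, labels):
    """Find the single (feature, threshold) split with the fewest errors."""
    best = None
    for f in range(len(points[0])):
        for threshold in sorted({p[f] for p in points}):
            left = [y for p, y in zip(points, labels) if p[f] <= threshold]
            right = [y for p, y in zip(points, labels) if p[f] > threshold]
            if not left or not right:
                continue
            left_class = Counter(left).most_common(1)[0][0]
            right_class = Counter(right).most_common(1)[0][0]
            errors = (sum(y != left_class for y in left)
                      + sum(y != right_class for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_class, right_class)
    if best is None:
        # No usable split (all sampled points identical): predict the majority.
        majority = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]

def predict_stump(stump, point):
    f, threshold, left_class, right_class = stump
    return left_class if point[f] <= threshold else right_class

def train_forest(points, labels, n_trees=25, seed=0):
    """Train each stump on a random sample (with replacement) of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(points)) for _ in points]
        forest.append(train_stump([points[i] for i in idx],
                                  [labels[i] for i in idx]))
    return forest

def predict_forest(forest, point):
    """Each stump votes; the most common class wins."""
    votes = Counter(predict_stump(stump, point) for stump in forest)
    return votes.most_common(1)[0][0]

# Toy data: two made-up features per variant.
points = [(0.01, 2.0), (0.02, 1.5), (0.03, 1.1),
          (0.40, 0.1), (0.60, -0.2), (0.90, 0.0)]
labels = ["cancer"] * 3 + ["non-cancer"] * 3

forest = train_forest(points, labels)
print(predict_forest(forest, (0.005, 2.5)))
print(predict_forest(forest, (0.95, -0.5)))
```

Because every stump sees a slightly different sample, individual stumps can disagree; the majority vote smooths over those differences.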

The RF classifier was built using the package randomForest 4.5-16 for the R statistical environment to classify variants as cancer or non-cancer. The classifier was trained on 200 cancer mutations and 800 non-cancer mutations using the SIFT score, LogR.E-value score, and GO log-odds score for each variant. The classifier was built using the parameters ntree = 1000 and mtry = 2. The mutation/SNP ratio was empirically determined based on the numbers of somatic mutations and background polymorphisms.

Below are data from a 10-fold cross validation experiment, shown to illustrate the performance of the CanPredict random forest classifier. We generated 10 unique sets of training data by randomly selecting cancer and non-cancer changes from our larger set of training data. These training data were used to generate 10 independent classifiers. For each experiment, the mutations not used for training were then classified. These distinct classifiers have very similar characteristics in their ability to distinguish known cancer-causing mutations. The values shown in the table below are averaged over the 10 independent experiments.
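The evaluation loop described above (randomly select training data, classify whatever was not selected, average over the runs) might look roughly like the following Python sketch. Everything here is illustrative: the function names, the toy SIFT-like scores, and the stand-in threshold rule passed as `fit` are assumptions, and accuracy is used only as an example metric.

```python
import random

def split_once(cancer, non_cancer, n_cancer, n_non_cancer, rng):
    """Randomly pick a training set; everything not picked is held out."""
    train_c = set(rng.sample(range(len(cancer)), n_cancer))
    train_n = set(rng.sample(range(len(non_cancer)), n_non_cancer))
    train = ([(cancer[i], "cancer") for i in sorted(train_c)]
             + [(non_cancer[i], "non-cancer") for i in sorted(train_n)])
    held_out = ([(x, "cancer") for i, x in enumerate(cancer)
                 if i not in train_c]
                + [(x, "non-cancer") for i, x in enumerate(non_cancer)
                   if i not in train_n])
    return train, held_out

def evaluate(cancer, non_cancer, n_cancer, n_non_cancer, fit,
             n_runs=10, seed=0):
    """Average held-out accuracy over n_runs independent random splits.

    `fit` takes a list of (features, label) pairs and returns a function
    that maps features to a predicted label.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_runs):
        train, held_out = split_once(cancer, non_cancer,
                                     n_cancer, n_non_cancer, rng)
        classify = fit(train)
        correct = sum(classify(x) == label for x, label in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)

def fit_threshold_rule(train):
    """Stand-in for a real classifier: ignores the training data."""
    return lambda score: "cancer" if score < 0.05 else "non-cancer"

# Toy SIFT-like scores, made up for illustration.
cancer_scores = [0.00, 0.01, 0.02, 0.03, 0.01, 0.04]
non_cancer_scores = [0.20, 0.50, 0.35, 0.80, 0.10, 0.65, 0.90, 0.45]

mean_accuracy = evaluate(cancer_scores, non_cancer_scores,
                         n_cancer=3, n_non_cancer=4,
                         fit=fit_threshold_rule)
print(mean_accuracy)
```

In practice `fit` would train a random forest on the three scores for each variant, and the training-set sizes would follow the 200 cancer and 800 non-cancer mutations stated above.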

 

Explanation of Scores

SIFT Score: The SIFT algorithm "takes a query sequence and uses multiple alignment information to predict tolerated and deleterious substitutions for every position of the query sequence. SIFT is a multistep procedure that, given a protein sequence, (1) searches for similar sequences, (2) chooses closely related sequences that may share similar function, (3) obtains the multiple alignment of these chosen sequences, and (4) calculates normalized probabilities for all possible substitutions at each position from the alignment." (Ng and Henikoff). This normalized probability is reported by the SIFT algorithm as the SIFT Score. Substitutions with a SIFT Score (normalized probability) of less than 0.05 are predicted to be deleterious by SIFT.

SIFT Median Information Content: The median sequence information is determined by the SIFT algorithm. This is done by first calculating the information at each position in the alignment (available for download) and then obtaining the median over all positions (for further details see Ng and Henikoff). Only SIFT scores from alignments with a median sequence information content below 3.25 were used in this analysis. Higher values would likely indicate an alignment built from closely related species, and therefore an unreliable SIFT Score for a particular change.
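The two SIFT cutoffs can be combined into a small helper, sketched here in Python. The cutoff values (0.05 and 3.25) come from the text; the function name, the returned labels, and the choice to return None for a filtered-out variant are illustrative assumptions.

```python
SIFT_DELETERIOUS_CUTOFF = 0.05   # scores below this are predicted deleterious
MAX_MEDIAN_INFO = 3.25           # alignments at or above this are not used

def sift_call(sift_score, median_info):
    """Return a SIFT prediction, or None if the alignment is filtered out."""
    if median_info >= MAX_MEDIAN_INFO:
        # Alignment likely built from closely related sequences; score unreliable.
        return None
    if sift_score < SIFT_DELETERIOUS_CUTOFF:
        return "deleterious"
    return "tolerated"

print(sift_call(0.01, 2.9))
print(sift_call(0.30, 2.9))
print(sift_call(0.01, 3.6))
```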

Pfam Domain Affected: This indicates the name of the domain altered by a particular mutation. In those cases where more than one domain is altered by a mutation, the change providing the largest (most deleterious) LogR.E-value score is used.

Wildtype Expect/Mutant Expect: The LogR.E-value score is derived from data provided by the HMMER 2.3.2 software. Each search using the HMMER software generates an expectation value (E-value). The E-value reflects the number of sequences expected to have a score equal to or better than the current sequence. Sequences generating an E-value > 1 are not evaluated by CanPredict, as such scores could indicate spurious hits to the Pfam database. The Wildtype Expect is the E-value obtained when searching the Pfam database using the wildtype version of the provided protein. The Mutant Expect is the E-value obtained when searching the Pfam database using the mutated version of the provided protein. These two values are used to calculate the LogR.E-value score.

LogR.E-value score: This score is described by Clifford et al., and is determined by the following calculation: $\text{LogR.E-value} = \log_{10}(E_{\text{variant}} / E_{\text{wildtype}})$, where $E_{\text{variant}}$ and $E_{\text{wildtype}}$ are the E-values of the variant and wildtype proteins. As stated in Clifford et al., LogR.E-value scores > 0.5 are likely to indicate deleterious changes.
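As a worked illustration, the calculation and the rules described in the preceding entries can be sketched in Python. The formula, the 0.5 cutoff, the E-value > 1 filter, and the "largest score wins" rule for multiple domains come from the text. The function names, the input layout, the example domain names and E-values, and the decision to apply the E-value filter to the wildtype hit are assumptions made for this sketch. It also assumes all E-values are strictly positive.

```python
import math

MAX_E_VALUE = 1.0          # hits with a larger E-value are not evaluated
DELETERIOUS_CUTOFF = 0.5   # scores above this are likely deleterious

def logr_e_value(wildtype_e, mutant_e):
    """log10 of the ratio of the variant E-value to the wildtype E-value."""
    return math.log10(mutant_e / wildtype_e)

def score_variant(domain_hits):
    """Pick the domain with the largest (most deleterious) LogR.E-value.

    domain_hits maps a domain name to (wildtype_e, mutant_e).
    Returns (domain, score), or None if no hit passes the E-value filter.
    Assumption: the filter is applied to the wildtype E-value.
    """
    scored = [
        (logr_e_value(wt, mut), domain)
        for domain, (wt, mut) in domain_hits.items()
        if wt <= MAX_E_VALUE
    ]
    if not scored:
        return None
    score, domain = max(scored)
    return domain, score

def is_deleterious(score):
    return score > DELETERIOUS_CUTOFF

# Illustrative domain names and made-up E-values.
hits = {
    "Pkinase": (1e-40, 1e-35),   # fit to the model gets much worse
    "SH2": (1e-10, 1e-10),       # unchanged
    "weak_hit": (5.0, 50.0),     # wildtype E-value > 1, skipped
}
domain, score = score_variant(hits)
print(domain, round(score, 2), is_deleterious(score))
```

A positive score means the variant fits the Pfam model worse than the wildtype does; a score of 0 means the substitution did not change the fit.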

GOSS score: The Gene Ontology Similarity Score (GOSS) provides an indication of how similar a particular gene is to known cancer genes and was first described by Kaminker et al. Using a standard set of Gene Ontology (GO) annotations, a log-odds score was calculated for each GO term to measure the difference in how frequently that term annotates two groups of genes: those known to be involved in cancer, and those not known to be involved in cancer but represented by a RefSeq mRNA. For a gene of interest annotated with a set of GO terms, one can sum the log-odds scores of its GO terms. This cumulative score reveals inherent differences between genes containing common variants and genes containing cancer variants. While there are no absolute GOSS cutoffs separating cancer genes from non-cancer genes, cancer genes tend to have higher scores than non-cancer genes; the training data reveal an average GOSS score of 14.04 for the cancer genes and 3.23 for the non-cancer genes.

Training Data

Files containing training data as well as other supplementary data from our initial Cancer Research paper are freely available here.

Common variants were downloaded from dbSNP resources at NCBI, and overall minor allele frequencies were determined from the file SNPAlleleFreq.bcp, also from the NCBI FTP site.

Cancer-associated variants were collected from the COSMIC FTP site and include only variants in those genes most likely to be involved in oncogenesis.

Referencing CanPredict

This site is completely free for all users. If you use CanPredict, please reference our paper describing the CanPredict algorithm as well as this web application. Please note that licensing will become an issue if you intend to include part of this application in a commercial software package.