Overview

GeneHub-GEPIS is a bioinformatics tool for inferring gene expression patterns in normal and cancer tissues based on human and mouse EST sequence abundance. Its GeneHub component defines gene boundaries based on mRNA transcript sequences from major databases and establishes extensive cross-references among commonly used gene identifiers. Using this underlying gene definition, the GEPIS component maps EST sequences to genomic sequences and thereby associates ESTs with genes reliably. The cumulative relative abundance of ESTs in various tissues is used to infer gene expression levels in both normal and cancer tissues of mouse and human.


  • What is the difference between GeneHub-GEPIS and the previous version of GEPIS?

    The GEPIS application was originally developed in 2004 to compute gene expression levels across many different normal and cancerous tissues using EST data. The application has been widely accessed and is an extremely valuable tool for expression analysis. It was described by Zhang et al. (Bioinformatics, 2004). Despite its success, the previous version of GEPIS has several limitations. First, its algorithm relies on the BLAST program to assign EST sequences to a given input mRNA sequence. We subsequently found that BLAST frequently links erroneous ESTs to input sequences because of high-percentage regional matches. As a result, a given EST could be matched to multiple genes, leading to false expression data. Second, there were insufficient data for reliable analysis of mouse genes. Third, the design of the algorithm did not allow easy access to the expression data through URL linking. Finally, detailed information about ESTs and their source libraries could not be easily obtained for further investigation.

    In the newly designed GeneHub-GEPIS algorithm, we used a completely different mechanism to map ESTs to genes. The new mechanism not only overcomes the limitations we previously encountered but also enables us to add many useful features that make this application more powerful and accessible. We first mapped ESTs to their precise genomic locations using the GMAP algorithm. The genomic coordinates (NCBI Human build 36, Mouse build 35) were compared with the gene structure information (intron/exon boundaries) of the GeneHub genes. An EST was considered to be the product of a gene if the two entities mapped to the same locus and shared at least one exon (with a minimum match length of 30 bp). This greatly reduced incorrect matches between ESTs and genes that merely share sequence similarity. The addition of the GeneHub component enables text search and ortholog matching and allows easy access to the expression data through URL linking. In addition, detailed information about ESTs and their source libraries is saved and made available via URL linking. Furthermore, the new system is extensible. To illustrate this important feature, we were able to quickly add a function that derives expression profiles for microRNAs. Moreover, to take advantage of the vast volume of mouse EST data, we derived a collection of mouse genes and their EST-based expression results. Since it can be very useful to cross-reference human and mouse expression data, we added a feature for navigating easily between the human and mouse expression results. At the same time, we retained some of the useful features of GEPIS, such as the Regional GEPIS Atlas and statistical analysis, so that GeneHub-GEPIS can completely replace the GEPIS application. In summary, GeneHub-GEPIS is a new application that builds on a novel integrated gene database and delivers reliable gene expression results for diverse input values.

     
  • How widely is GeneHub-GEPIS used?

    GeneHub-GEPIS was released only in early December 2006, so we do not yet have access log data. However, the previous GEPIS application has been widely accessed since its release in 2004. The access log can be found at http://www.cgl.ucsf.edu/reports/webstats/genentech/. We have provided expression results for over 30,000 successful requests, and the user base spans many countries. Numerous users have contacted the authors with compliments, bug reports and suggestions for new features. The enthusiastic support of our user base and their desire for new features have been the major driving force for building the GeneHub-GEPIS application, and we would like to take this opportunity to thank our users for their suggestions, bug reports, and encouragement. With the release of GeneHub-GEPIS, we anticipate a significant increase in the number of accesses to our application. We are working with some of the major gene databases to provide convenient URL links from their sites directly to the GeneHub-GEPIS expression results, which will dramatically increase the visibility and use of this application. Fortunately, our collaborators at UCSF, Dr. Tom Ferrin in particular, have made their system robust enough to handle the increased web traffic.

  • Does this site require login or license?

    No. This site is completely free for all users. We encourage you to use our site, and we would appreciate your citing the site and the paper (manuscript in preparation) describing this application. Licensing only becomes an issue if you intend to include part of this application in a commercial software package.

  • How to contact us?

    We need your feedback to improve the site, and we would appreciate your help in testing the application. We would like to hear anything you have to say about this site: bug reports, suggestions, questions, features you like or dislike, and so on. We can also provide all of the code and related data if you would like to install GeneHub-GEPIS at your own site. Many people have contributed to this application. Yan Zhang, in Zemin Zhang’s group at Genentech, is the main programmer. Please feel free to contact either one by email: Yan (yz5@gene.com) or Zemin (zemin@gene.com). Our mailing address is:

    Department of Bioinformatics, M.S. 93, Genentech Inc., 1 DNA Way, South San Francisco, CA 94080, USA

GeneHub Component

  • What is GeneHub?

    GeneHub is an integrated gene database that provides gene information and cross references among several commonly used databases. It defines gene structures based on mRNA transcript sequences from major databases and overlays additional gene and protein identifiers to the GeneHub genes. It provides gene annotation and basic searching capabilities. It also provides the foundation for GeneHub-GEPIS expression analysis.

  • What are the data sources used by GeneHub?

    Table: Number of records in each of the database sources used by GeneHub.

    Database Type  Database Name                  Human    Mouse
    mRNA           RefSeq *                      29,521   45,309
                   Ensembl Known *               29,387   27,617
                   Proteome *                    19,343   20,622
                   FANTOM Known *                     -   54,854
                   GenBank                      138,360  110,359
                   Ensembl Novel                  4,482    8,854
    Protein        UniProt                       69,842   59,391
                   PDB                           16,735    3,595
    Gene           Entrez Gene                   36,013   51,065
    Microarray     Affy Human U133P              54,675        -
                   Affy Human U133A              22,283        -
                   Affy Human U133B              22,645        -
                   Affy Human U95A               12,479        -
                   Affy Human U95B               12,620        -
                   Affy Human U95C               12,646        -
                   Affy Human U95D               12,644        -
                   Affy Human U95E               12,639        -
                   Affy Human U35KA               8,934        -
                   Affy Human U35KB               8,924        -
                   Affy Human U35KC               8,928        -
                   Affy Human U35KD               8,928        -
                   Affy Human FL68                6,633        -
                   Affy Mouse Genome 430 2.0          -   45,101
                   Affy Mouse 430A                    -   22,626
                   Affy Mouse 430B                    -   22,511
                   Affy Murine 74A                    -   12,488
                   Affy Murine 74B                    -   12,477
                   Affy Murine 74C                    -   11,934
                   Affy Murine 11KA                   -    6,584
                   Affy Murine 11KB                   -    6,002
                   Agilent Human 1A              17,086        -
                   Agilent Human 1Av2            20,173        -
                   Agilent Human 1B              19,673        -
                   Agilent Whole Human Genome    41,000        -
                   Agilent Mouse V2                   -   20,868
                   Agilent Whole Mouse Genome         -   41,174
    Combined       GeneHub                       31,999   34,794

    (* core transcript sets used for deriving GeneHub genes)

  • How are GeneHub genes derived?

    The genomic structures of human protein-coding genes were first defined using several reliable sources of transcripts. This collection of high-quality transcripts, called the core gene set, contains 29,521 mRNA sequences from RefSeq, 29,387 from the Known gene set of Ensembl, and 19,343 from Proteome. Likewise, the core gene set for mouse consists of 45,309 mRNA sequences from RefSeq, 27,617 from the Known gene set of Ensembl, 20,622 from Proteome, and 54,854 from FANTOM. Each core gene set sequence was aligned to the respective genome (human NCBI Build 36 and mouse NCBI Build 35) using the GMAP algorithm (Wu and Watanabe, 2005), and only the genomic match with the best percent identity and percent coverage was kept. A minimum of 90% coverage of the entire transcript was required.

    The genomic coordinates of each matching exon were recorded for gene comparison. When two transcripts mapped to overlapping genomic regions, we evaluated their exon structures to determine whether they should be clustered into one gene. First, the transcripts must have the same orientation. Second, their exon sequences must overlap. Third, the two transcripts must share at least one exact exon boundary or splice site. Transcripts that failed any of these criteria were assigned to separate genes. Each group (or cluster) of transcripts was considered a GeneHub gene. Using this approach, we defined 31,999 non-redundant GeneHub genes for human and 34,794 for mouse.

    After the initial human and mouse gene maps were built, additional sequences were mapped to the GeneHub genes. For this purpose, we used sequences from the mRNA collection of GenBank and from the Ensembl Novel collection. As above, the exon coordinates of the transcript sequences were computed with the GMAP algorithm and then compared against those of the GeneHub genes derived above. Again, for a transcript to be linked to a GeneHub gene, at least one of its splice junctions had to match perfectly with one in the GeneHub gene.
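
    The three clustering criteria (same orientation, overlapping exons, at least one shared exact exon boundary) can be illustrated with a minimal Python sketch. The `Transcript` class, the function names, and the choice of single-linkage grouping are assumptions made for illustration; they are not taken from the GeneHub code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    name: str
    chrom: str
    strand: str   # "+" or "-"
    exons: tuple  # (start, end) pairs, inclusive genomic coordinates


def exons_overlap(a, b):
    """True if any exon of a overlaps any exon of b."""
    return any(s1 <= e2 and s2 <= e1
               for s1, e1 in a.exons for s2, e2 in b.exons)


def boundaries(t):
    """Exon starts and ends, tagged so a start never matches an end."""
    return ({("start", s) for s, _ in t.exons}
            | {("end", e) for _, e in t.exons})


def same_gene(a, b):
    """Apply the three criteria to one pair of transcripts."""
    return (a.chrom == b.chrom
            and a.strand == b.strand                  # 1. same orientation
            and exons_overlap(a, b)                   # 2. exons overlap
            and bool(boundaries(a) & boundaries(b)))  # 3. shared exact boundary


def cluster(transcripts):
    """Group transcripts by single linkage (an assumption) using union-find."""
    parent = list(range(len(transcripts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(transcripts)):
        for j in range(i + 1, len(transcripts)):
            if same_gene(transcripts[i], transcripts[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i, t in enumerate(transcripts):
        groups.setdefault(find(i), []).append(t.name)
    return sorted(sorted(g) for g in groups.values())
```

    With this sketch, two same-strand transcripts that share a splice junction fall into one cluster, while an opposite-strand transcript with identical exons, or an overlapping transcript with no shared boundary, stays separate.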

  • How are microarray probes linked to genes?

    For Affymetrix human (U95 set, U133 set, U133P) and mouse (U74 set, 430 set) expression arrays, we used the target sequences obtained from Affymetrix to link to known genes. For Agilent oligo-based expression microarrays (H1A, H1Av2, H1B, Whole Human Genome arrays), we directly used the 60-mer oligonucleotide sequences for gene linking. Using the GMAP program, we first determined whether the array probe sequences overlapped the exon sequences collected above. Next, for sequences that did not overlap any exon, we examined whether they were located in the vicinity of any GeneHub gene. It has been found that the expression of a given gene can usually be represented by microarray probe sequences in the vicinity of the gene (data not shown). Therefore, we assigned such a probe sequence to the closest gene in the same orientation if the probe was located within 2 kb of the 3' end or 1 kb of the 5' end of the gene.
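
    The vicinity rule for probes that overlap no exon can be sketched as follows. This is an illustration only: the `Gene` class and function names are hypothetical, and treating a probe that lies inside a gene's span (for example, in an intron) as distance 0 is an assumption the text does not state.

```python
from dataclasses import dataclass
from typing import Optional

FLANK_3P = 2000  # allowed distance beyond the 3' end, in bp
FLANK_5P = 1000  # allowed distance beyond the 5' end, in bp


@dataclass(frozen=True)
class Gene:
    name: str
    strand: str  # "+" or "-"
    start: int   # leftmost genomic coordinate
    end: int     # rightmost genomic coordinate


def flank_distance(gene, probe_start, probe_end):
    """Gap between probe and gene if within the allowed flank, else None."""
    if probe_end < gene.start:
        # Probe is left of the gene: 5' side on "+", 3' side on "-".
        gap = gene.start - probe_end
        limit = FLANK_5P if gene.strand == "+" else FLANK_3P
    elif probe_start > gene.end:
        # Probe is right of the gene: 3' side on "+", 5' side on "-".
        gap = probe_start - gene.end
        limit = FLANK_3P if gene.strand == "+" else FLANK_5P
    else:
        return 0  # assumption: a probe inside the gene span counts as distance 0
    return gap if gap <= limit else None


def assign_probe(genes, probe_strand, probe_start, probe_end) -> Optional[str]:
    """Name of the closest same-strand gene within the flank limits, or None."""
    best = None
    for g in genes:
        if g.strand != probe_strand:
            continue
        d = flank_distance(g, probe_start, probe_end)
        if d is not None and (best is None or d < best[0]):
            best = (d, g.name)
    return best[1] if best else None
```

    For example, a plus-strand probe 1.5 kb downstream of a plus-strand gene is assigned to it, whereas the same gap upstream of the gene exceeds the 1 kb limit for the 5' end and yields no assignment.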

  • How are protein sequences linked?

    Human and mouse protein sequences from UniProt (69,842 human and 59,391 mouse protein sequences) and PDB (16,739 human and 3,595 mouse) were compared with the gene DNA sequences using the BLASTx program. For a protein record to be linked to a GeneHub gene, it was required to have >98% identity over a region at least 35 amino acids long.

  • How are gene annotations collected?

    Once we had built the GeneHub gene collection and associated it with gene and protein identifiers from various databases, it became straightforward to accumulate and integrate gene annotation from diverse databases. Useful information such as gene description, accession, name, and synonyms was extracted from each member of a GeneHub gene and linked to the GeneHub gene itself. This information was stored in a common database field for text searching.

  • How are ortholog links built?

    To build orthologous links between human and mouse GeneHub genes, we downloaded the hmlg_ftp.txt file from HomoloGene Release 50.1. We used the orthologous Entrez Gene pairs of human and mouse if they were established by reciprocal best match between three or more organisms, or by reciprocal best match or sequence similarity with a match score greater than 70%. There were 15,868 such pairs. Based on the Entrez Gene components of GeneHub genes, we linked each human GeneHub gene to the appropriate mouse GeneHub gene counterpart.

  • How many human and mouse genes are collected?

    There are 31,999 non-redundant human genes and 34,794 mouse genes.

GEPIS Component

  • What is the basis for EST-based expression prediction?

    EST clone frequency is in principle proportional to the corresponding gene's expression level (Adams et al., 1993), and ESTs have been used successfully for studying expression signatures and for differential expression analysis (Audic and Claverie, 1997; Ewing et al., 1999; Hishiki et al., 2000; Scheurle et al., 2000; Schmitt et al., 1999). The application of ESTs to the study of the human cancer transcriptome has been summarized in reports of large-scale efforts by the Cancer Genome Anatomy Project (CGAP) and the Human Cancer Genome Project (HCGP) (Brentani et al., 2003). We have also found previously that EST-based digital expression levels, measured in digital expression units (DEU), correlated with mRNA transcript levels determined by quantitative RT-PCR analyses (Zhang et al., Bioinformatics, 2004).

  • How do EST-based expression results compare with microarray-based results?

    Although the concept of using ESTs for expression analysis has been known for over a decade, the full value of EST-based expression has been somewhat overshadowed by the emerging microarray technology. While microarrays can be an extremely powerful method, EST data can be a strong alternative or even an advantageous method in some cases. Depending on the nature of the study, ESTs potentially offer several advantages. First, the extensive gene coverage by ESTs (Brentani et al., 2003) allows expression analysis of almost all genes, and gene representation is not affected by the sequence selection and design process required for microarray chips. Second, all the raw EST and tissue source data are already publicly available, so much of the laborious laboratory work can be bypassed when studying expression in many diverse tissues. Third, since the expression level is represented by mRNA abundance over total transcripts and is independent of probe selection and hybridization intensity, EST-based analysis can be a more quantitative and direct measurement of gene expression than microarray-based results. In fact, based on the EST data, we were able to perform large-scale screening for therapeutic tumor antigen targets, namely cell surface proteins with high and specific expression in tumor cells, and many of these targets have been experimentally validated by a variety of methods including tissue microarray, in situ hybridization and quantitative reverse transcription PCR (RT-PCR).

  • What are the EST sources?

    The human and mouse EST collections consist of all sequences in the EST division of GenBank (http://www.ncbi.nlm.nih.gov/dbEST/). Quality information contained in the GenBank file was used to trim low-quality bases from sequences. As of July 11, 2006, there were 7,810,159 human EST sequences in 8,598 libraries and 4,637,070 mouse EST sequences in 1,208 libraries. Based on tissue and histology data for library information from the National Cancer Institute (NCI) and CGAP (http://cgap.nci.nih.gov/Tissues/, ftp://ftp1.nci.nih.gov/pub/CGAP/Hs_LibData.dat and ftp://ftp1.nci.nih.gov/pub/CGAP/Mm_LibData.dat, dated July 11, 2006), each EST was assigned a tissue source value and a disease value. For human, there are 4,426 cancer libraries and 2,387 normal libraries. For mouse, there are 51 cancer libraries and 741 normal libraries.

  • What are the ESTs available for analysis?

    Since the classification and selection of EST libraries were expected to be critical for reliable expression analysis, we established a number of quality control steps to identify usable libraries and to group them into the appropriate tissue categories, as described in Zhang et al., 2004. These data cleansing steps led to a lookup table that links ESTs to GEPIS usable libraries (we keep only normal and cancer libraries) and to tissue types. For human, there are 4,175,880 ESTs in all usable libraries: 1,912,573 in 1,995 normal libraries and 2,263,307 in 3,812 cancer libraries. For mouse, there are 1,869,003 ESTs in total: 1,461,621 in 285 normal libraries and 407,382 in 37 cancer libraries. For the previous GEPIS build, there were 3,081,582 ESTs in all usable human libraries: 1,144,417 in 1,878 normal libraries and 1,937,165 in 3,878 tumor libraries. The current human build therefore represents a 35% increase in total usable EST counts.

    Distribution of ESTs in usable libraries:

                                 Human                      Mouse
    Organ               Total   Normal    Tumor    Total   Normal    Tumor
    ADIPOSE_TISSUE      15533    13834     1699       14       14        0
    ADRENAL             36922    19061    17861        -        -        -
    BONE                52633    12668    39965    30319    30319        0
    BONE_MARROW         48113    20311    27802   113285   113285        0
    BRAIN              705566   515939   189627   229874   229874        0
    BREAST             195527    59033   136494   249503    10733   238770
    CARTILAGE           36300    16211    20089        -        -        -
    CERVIX              39427     2576    36851        -        -        -
    COLON              221020    38203   182817    50336    49555      781
    EAR                     -        -        -    32199    32199        0
    ESOPHAGUS           17717       87    17630        -        -        -
    EYE                104400    52863    51537    39643    39643        0
    GENITOURINARY       43705     9586    34119    11696     9955     1741
    GI_TRACT            43267    28590    14677    31945    31945        0
    HEAD_N_NECK        183837    45646   138191    31869    31869        0
    HEART               21198    21198        0    24734    24734        0
    KIDNEY             128658    45782    82876   120192   117954     2238
    LIVER              196342   108706    87636    83206    73368     9838
    LUNG               219169    74876   144293    91011    23263    67748
    LYMPH_NODE          46659    25462    21197     1578     1578        0
    LYMPHORETICULAR     66580    19355    47225     9345     3390     5955
    MUSCLE              65922    34215    31707    25555    25555        0
    NERVOUS             34602    34602        0   125927   123715     2212
    OVARY              116612     8902   107710    17901    17901        0
    PANCREAS           211295    94668   116627    61115    61115        0
    PARATHYROID            53        0       53        -        -        -
    PERIPHERAL_NERV     21290    20092     1198    14637    14637        0
    PINEAL_GLAND         8534     8534        0     2987     2987        0
    PITUITARY_GLAND     18649    16360     2289     8047     8047        0
    PLACENTA           219794   176385    43409    27639    27639        0
    PROSTATE           144727    60159    84568    24768    24768        0
    RETINA              23182    23182        0    37368    37368        0
    SALIVARY_GLAND      13568     2647    10921    20853    19880      973
    SKIN               182179    44078   138101    75179    41525    33654
    SOFT_TISSUE           225      225        0        -        -        -
    SPLEEN              49015    49015        0    73328    54477    18851
    STOMACH            107774    32681    75093    14481    14481        0
    SYNOVIUM             2542      278     2264    17859    17859        0
    TESTIS             228165    95411   132754   121236   118272     2964
    THYMUS               3099     2958      141    43108    21451    21657
    THYROID             78482    18650    59832     4546     4546        0
    UTERUS             174205    10151   164054     1119     1119        0
    VASCULAR            49393    49393        0      601      601        0
    Total:            4175880  1912573  2263307  1869003  1461621   407382

    * Tissues represented by fewer than 1,000 total ESTs are not analyzed (in this table: PARATHYROID and SOFT_TISSUE for human; ADIPOSE_TISSUE and VASCULAR for mouse).

  • How are ESTs mapped to genes?

    An EST was considered to be a transcript of a gene if the two mapped to the same locus and shared at least one exon (minimum match length of >30 bp). This eliminated incorrect matches between ESTs and genes that merely share sequence similarity. Multiple EST reads from the same clone were reduced to a single read when clone information was available.
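
    A minimal Python sketch of these two steps is given below. The function names and data layout are hypothetical, and the sketch treats exactly 30 bp of exon overlap as sufficient, which is an assumption because the text gives the threshold both as "30 bp" and as ">30 bp".

```python
MIN_MATCH_BP = 30  # assumption: an overlap of exactly 30 bp is accepted


def overlap_length(a, b):
    """Number of bases shared by two inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def est_shares_exon(est_chrom, est_exons, gene_chrom, gene_exons):
    """True if the EST and gene lie on the same chromosome and at least one
    pair of aligned exon blocks overlaps by MIN_MATCH_BP bases or more."""
    if est_chrom != gene_chrom:
        return False
    return any(overlap_length(e, g) >= MIN_MATCH_BP
               for e in est_exons for g in gene_exons)


def collapse_reads_by_clone(reads):
    """Keep one read per clone.

    reads: iterable of (est_id, clone_id), with clone_id None when the
    clone is unknown. Reads without clone information are all kept.
    """
    seen = set()
    kept = []
    for est_id, clone_id in reads:
        if clone_id is not None:
            if clone_id in seen:
                continue
            seen.add(clone_id)
        kept.append(est_id)
    return kept
```

    Here a 5' read and a 3' read from the same clone count once, so a single cDNA clone cannot inflate the hit count for a gene.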

  • How are the expression ratios determined?

    GeneHub-GEPIS calculates gene expression levels and tissue distribution based on EST data and library information. First, the sequence alignment program GMAP or BLAST was used to find EST sequences that matched a human gene of interest. Multiple EST reads from the same clone were reduced to a single read. The library identifier of each EST clone was then used to determine its tissue source from the library-tissue lookup table. Multiple libraries of the same tissue type were aggregated, both in the number of gene-matched EST clones and in the total number of EST clones. This allowed a digital expression unit (DEU) value to be derived for each tissue category, calculated as the total number of matching EST clones divided by the sum of library sizes and multiplied by 1,000,000. Even though a typical EST library contains only a few thousand clone sequences, aggregating multiple libraries from the same tissue type forms a composite data source that is more information-rich and better represents rare genes. The DEU calculation thus approximates EST representation in a pool of one million library sequences. DEU values were calculated in turn for each tissue type to profile expression levels across all tissues.

    The Z-test was applied to determine whether DEU in sample type A was statistically higher than DEU in sample type B. Comparisons could be made between normal and cancer samples from the same tissue, or between two different types of tissues. For a given gene, the common relative abundance $p$ was computed by taking the sum of clone counts for the gene over all libraries and dividing by the total number of clones over all libraries. The relative abundances of the gene in type A libraries ($p_A$) and in type B libraries ($p_B$) were also calculated. The test statistic Z-score was calculated as follows:

    $$Z = \frac{p_A - p_B}{\sqrt{p\,(1 - p)\left(\frac{1}{N_A} + \frac{1}{N_B}\right)}}$$

    where $N_A$ and $N_B$ are the total numbers of EST clones derived from A and B. The Z-score can then be referred to the normal distribution to yield a P-value.
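
    As a worked illustration, the DEU calculation and the Z-test can be sketched in Python. The function names are hypothetical. The pooled two-proportion form of the Z statistic and the one-sided P-value are assumptions; they are chosen to be consistent with the definitions of p, pA, pB, NA and NB given above and with the stated aim of testing whether A is higher than B.

```python
import math


def deu(matched_clones, total_clones, scale=1_000_000):
    """Digital expression units: matched clones per `scale` total clones."""
    return matched_clones / total_clones * scale


def z_test(hits_a, n_a, hits_b, n_b):
    """Pooled two-proportion Z-test of whether abundance in A exceeds B.

    Returns (z, p_value), where p_value is the one-sided upper-tail
    probability under the standard normal distribution.
    """
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    p = (hits_a + hits_b) / (n_a + n_b)  # common relative abundance
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    # Assumption: with no variance (e.g. no hits at all), report z = 0.
    z = (p_a - p_b) / se if se > 0 else 0.0
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value
```

    For instance, 30 matching clones among 100,000 tumor ESTs against 10 among 100,000 normal ESTs gives DEU values of about 300 and 100, a Z-score a little above 3, and a one-sided P-value below 0.001.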

  • How does the microRNA expression work?

    GeneHub-GEPIS can report miRNA expression based on EST data. This was motivated by the observations that miRNA precursor sequences can be found among ESTs (Smalheiser, 2003; Zhang et al., 2005) and that pri-miRNA expression levels correlate with mature miRNA expression levels (Thomson et al., 2006). Given this relationship, EST data can be used to approximate miRNA expression levels in various tissues. This was computed in three steps: 1) we collected the genomic locations of the miRNA stem-loop sequences from version 9.0 of miRBase; 2) we obtained all EST sequences that overlapped the miRNA stem-loops; and 3) we computed the GEPIS results as described. These data can be explored from the Text Search interface by miRNA name or by miRBase accession number.

Features

The addition of the GeneHub component enables text search and ortholog matching and allows easy access to the expression data through URL linking. In addition, detailed information about ESTs and their source libraries is saved and made available via URL linking.

How to Interpret Results?

  • What do the column headers mean?

    Header             Description
    Organs             Site of expression
    #Hits              Number of matched EST clones, listed in descending order
    Total              Number of all ESTs in each category
    Normal             Number of matched ESTs in normal libraries
    NorTotal           Number of all ESTs in normal libraries
    Tumor              Number of matched ESTs in tumor libraries
    TumTotal           Number of all ESTs in tumor libraries
    Nor-DEL            Relative expression level in normal libraries, in hits per 100,000 total ESTs
    Tum-DEL            Relative expression level in tumor libraries, in hits per 100,000 total ESTs
    Tum/Nor            Ratio of relative expression levels between tumor and normal libraries
    T/N ztest          Z-test P value showing the statistical significance of upregulation in tumor
    Tiss. Spec. ztest  Z-test P value showing the statistical significance of selective expression in that normal tissue
     
  • Statistics involved:

    The Z-test was applied to determine whether DEL in sample type A was statistically higher than DEL in sample type B. This comparison could be made between normal and cancer samples from the same tissue, or between two different types of tissues.

    • T/N z-test:

      The comparison is made between normal and cancer samples from the same tissue. The P value is a function of four parameters: the number of occurrences in normal tissue, the normal library sample size, the number of occurrences in tumors, and the tumor library sample size.

    • Tiss. Spec. z-test:

      For each tissue, we compared its normal DEL with that of each of the other tissues. The highest P value is chosen to represent the statistical significance of the gene's differential expression in normal tissues.
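
      The "highest P value" rule can be sketched in Python as follows. The function names and the dictionary layout are hypothetical, and the pooled two-proportion Z-test with a one-sided P value is an assumed form of the test, included here only so that the example is self-contained.

```python
import math


def upper_tail_p(hits_a, n_a, hits_b, n_b):
    """One-sided P value that abundance in A exceeds abundance in B
    (pooled two-proportion Z-test; an assumed form of the test)."""
    p = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    z = (hits_a / n_a - hits_b / n_b) / se if se > 0 else 0.0
    return 0.5 * math.erfc(z / math.sqrt(2))


def tissue_specificity_p(normal_counts, tissue):
    """Largest P value over comparisons of `tissue` with every other tissue.

    normal_counts: dict mapping tissue name -> (matched ESTs, total ESTs)
    in normal libraries.
    """
    hits_a, n_a = normal_counts[tissue]
    return max(upper_tail_p(hits_a, n_a, hits_b, n_b)
               for other, (hits_b, n_b) in normal_counts.items()
               if other != tissue)
```

      Taking the maximum is conservative: a tissue is reported as significantly selective only if it is significantly higher than every other tissue, including its closest competitor.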

Implementation Details

  • How is text search implemented?

    The system tries to match the query string to a unique GeneHub gene index. Once a best match is found, the GeneHub gene index is used to retrieve the pre-computed GEPIS result. Gene attributes and synonyms are stored in two tables, GENE and GENE_SYNONYMS, respectively. Cross-references between GeneHub gene indexes and database records are saved in the DBXREF table. The DBXREF and GENE_SYNONYMS tables are consulted in turn to find an exact match to the query string. If the first round finds no exact match, a begin-search is performed automatically.

    There are limitations in MySQL text search:

    • It does not support function-based indexes.
    • Hyphenated words are treated as two words.
    • MySQL comes with a default stop word list, and numbers in the query are ignored by default.

    To overcome these limitations and make text search case-insensitive and consistent (e.g. IL-8, il 8 and IL8 should all return the same result), we added two columns, SEARCH_TEXT and XREF_ID_SEARCH, to the GENE_SYNONYMS and DBXREF tables, respectively. All space, hyphen and underscore characters in the gene synonyms are removed, and the words are converted to upper case before being saved in the SEARCH_TEXT column. XREF_ID_SEARCH stores the upper-case version of database accessions.
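
    The normalization can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes that the same transformation is applied to the user's query string before it is compared with the stored column values.

```python
import re

# Space, hyphen and underscore are the separators named in the text.
_SEPARATORS = re.compile(r"[ \-_]")


def to_search_text(synonym):
    """Value for the SEARCH_TEXT column: separators removed, upper-cased."""
    return _SEPARATORS.sub("", synonym).upper()


def to_xref_id_search(accession):
    """Value for the XREF_ID_SEARCH column: upper-cased accession."""
    return accession.upper()
```

    Under this scheme "IL-8", "il 8" and "IL8" all normalize to "IL8", so a plain equality comparison on an ordinary indexed column finds the same record for each spelling.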

     

  • How is ortholog search implemented?

    Both types of cross-reference (homolog or ortholog) between GeneHub gene indexes and database records are saved in the DBXREF table.

    • Accession-based search

      The gene index of the target organism is looked up using the GENE and DBXREF tables, regardless of the species of the input record.

    • DNA sequence-based ortholog search

      The input sequence is first BLASTed against the sequence database of the same organism. If a hit is found, the program consults the DBXREF table to find the hit's ortholog. Otherwise, the input sequence is BLASTed directly against the target organism's EST sequence database.