Overview
GeneHub-GEPIS is a bioinformatics tool for inferring gene expression patterns in normal
and cancer tissues based on human and mouse EST sequence abundance. Its GeneHub component is designed to
define gene boundaries based on mRNA transcript sequences from major databases and to establish extensive
cross references for commonly used gene identifiers. Based on the underlying gene definition,
the GEPIS component maps EST sequences to genomic sequences and thereby associates ESTs with genes reliably.
The cumulative relative abundance of ESTs in various tissues is used to infer gene expression levels in both normal
and cancer tissues of mouse and human.
- What is the difference between GeneHub-GEPIS and the previous version of GEPIS?
The GEPIS application was originally developed in 2004 to compute gene expression levels across many different normal and cancerous tissues using EST data. The application was described by Zhang et al. (Bioinformatics, 2004); it has been widely accessed and is a valuable tool for expression analysis. Despite its success, the previous version of GEPIS has several limitations. First, its algorithm relies on the BLAST program to assign EST sequences to a given input mRNA sequence. We subsequently found that BLAST frequently links erroneous ESTs to input sequences because of high-percentage regional matches. As a result, a given EST could be matched to multiple genes, leading to false expression data. Second, there were insufficient data for reliable analysis of mouse genes. Third, the design of the algorithm did not allow easy access to the expression data through URL linking. Finally, detailed information about ESTs and their source libraries could not be easily obtained for further investigation.
In the newly designed GeneHub-GEPIS algorithm, we used a completely different mechanism to map ESTs to genes. The new mechanism not only overcomes the limitations we previously encountered but also enables us to add many useful features that make the application more powerful and accessible. We first mapped ESTs to their precise genomic locations using the GMAP algorithm. The genomic coordinates (NCBI Human build 36, Mouse build 35) were compared with the gene structure information (intron/exon boundaries) of the GeneHub genes. An EST was considered to be the product of a gene if the two entities mapped to the same locus and shared at least one exon (with a minimum match length of 30 bp). This greatly reduced incorrect matches between ESTs and genes with sequence similarity. The addition of the GeneHub component enables text search and ortholog matching and allows easy access to the expression data through URL linking. In addition, detailed information about ESTs and their source libraries is saved and made available via URL linking. Furthermore, the new system is extensible. To illustrate this, we were able to quickly add a function that derives expression profiles for microRNAs. Moreover, to take advantage of the large volume of mouse EST data, we derived a collection of mouse genes and their EST-based expression results. Since it can be very useful to cross reference human and mouse expression data, we added a feature for navigating easily between human and mouse expression results. At the same time, we retained some of the useful features of GEPIS, such as the Regional GEPIS Atlas and statistical analysis, so that GeneHub-GEPIS can completely replace the GEPIS application. In summary, GeneHub-GEPIS is a new application that builds on a novel integrated gene database and delivers reliable gene expression results for diverse input values.
- How widely is GeneHub-GEPIS used?
GeneHub-GEPIS was released only in early December 2006, so we do not have access log data yet. However, the previous GEPIS application has been widely accessed since its release in 2004. The access log can be found at http://www.cgl.ucsf.edu/reports/webstats/genentech/. We have provided expression results for over 30,000 successful requests, and our user base spans many different countries. Numerous users have contacted the authors with compliments, bug reports, and suggestions for new features. The enthusiastic support from our user base and their desire for new features have been the major driving force for building the GeneHub-GEPIS application. We would also like to take this opportunity to thank our users for their suggestions, bug reports, and encouragement.
With the release of GeneHub-GEPIS, we anticipate a significant increase in the number of accesses to our application. We are working with some of the major gene databases to provide convenient URL links from their sites directly to the GeneHub-GEPIS expression results, which should increase the visibility and use of this application. Our collaborators at UCSF, Dr. Tom Ferrin in particular, have made their system robust enough to handle the increased web traffic.
- Does this site require login or license?
No. This site is completely free for all users. We encourage you to use our site, and we would appreciate your citing our site and the paper (manuscript in preparation) describing this application. Licensing only becomes an issue if you intend to include part of this application in your commercial software package.
- How to contact us?
We need your feedback to improve the site. We would also appreciate your effort to test the application. We would like to hear anything you would like to say about this site: bug report, suggestions, questions, features you like and don’t like, etc. We can also provide all of the code and related data if you would like to implement GeneHub-GEPIS at your own site.
Many people have contributed to this application. Yan Zhang in Zemin Zhang’s group at Genentech is the main programmer. Please feel free to contact either one by email: Yan (yz5@gene.com) or Zemin (zemin@gene.com). Our mailing address is:
Department of Bioinformatics, M.S. 93, Genentech Inc., 1 DNA Way, South San Francisco, CA 94080, USA
GeneHub Component
- What is GeneHub?
GeneHub is an integrated gene database that provides gene information and cross references among several commonly used databases. It defines gene structures based on mRNA transcript sequences from major databases and overlays additional gene and protein identifiers to the GeneHub genes. It provides gene annotation and basic searching capabilities. It also provides the foundation for GeneHub-GEPIS expression analysis.
- What are the data sources used by GeneHub?
Table: Number of records in each of the database sources used by GeneHub.
| Database Type | Database Name | Human | Mouse |
| mRNA | RefSeq * | 29,521 | 45,309 |
| | Ensembl Known * | 29,387 | 27,617 |
| | Proteome * | 19,343 | 20,622 |
| | FANTOM Known * | - | 54,854 |
| | GenBank | 138,360 | 110,359 |
| | Ensembl Novel | 4,482 | 8,854 |
| Protein | UniProt | 69,842 | 59,391 |
| | PDB | 16,735 | 3,595 |
| Gene | Entrez Gene | 36,013 | 51,065 |
| Microarray | Affy Human U133P | 54,675 | - |
| | Affy Human U133A | 22,283 | - |
| | Affy Human U133B | 22,645 | - |
| | Affy Human U95A | 12,479 | - |
| | Affy Human U95B | 12,620 | - |
| | Affy Human U95C | 12,646 | - |
| | Affy Human U95D | 12,644 | - |
| | Affy Human U95E | 12,639 | - |
| | Affy Human U35KA | 8,934 | - |
| | Affy Human U35KB | 8,924 | - |
| | Affy Human U35KC | 8,928 | - |
| | Affy Human U35KD | 8,928 | - |
| | Affy Human FL68 | 6,633 | - |
| | Affy Mouse Genome 430 2.0 | - | 45,101 |
| | Affy Mouse 430A | - | 22,626 |
| | Affy Mouse 430B | - | 22,511 |
| | Affy Murine 74A | - | 12,488 |
| | Affy Murine 74B | - | 12,477 |
| | Affy Murine 74C | - | 11,934 |
| | Affy Murine 11KA | - | 6,584 |
| | Affy Murine 11KB | - | 6,002 |
| | Agilent Human 1A | 17,086 | - |
| | Agilent Human 1Av2 | 20,173 | - |
| | Agilent Human 1B | 19,673 | - |
| | Agilent Whole Human Genome | 41,000 | - |
| | Agilent Mouse V2 | - | 20,868 |
| | Agilent Whole Mouse Genome | - | 41,174 |
| Combined | GeneHub | 31,999 | 34,794 |
(* core transcript sets used for deriving GeneHub genes)
- How are GeneHub genes derived?
The genomic structures of human protein-coding genes were first defined using several reliable sources of transcripts. The collection of such high-quality transcripts, also called the core gene set, contains 29,521 mRNA sequences from RefSeq, 29,387 from the Known gene set of Ensembl genes, and 19,343 from Proteome. Likewise, the core gene set for mouse consists of 45,309 mRNA sequences from RefSeq, 27,617 from the Known gene set of Ensembl genes, 20,622 from Proteome, and 54,854 from FANTOM. Each of the core gene set sequences was aligned to the respective genome (human NCBI Build 36 and mouse NCBI Build 35) using the GMAP algorithm (Wu, 2005), and only the genomic match with the best percent identity and percent coverage was chosen. A minimum of 90% coverage of the entire transcript was required.
The genomic coordinates for each of the matching exons were recorded for gene comparison.
When two transcripts were mapped to overlapping genomic regions, we evaluated their exon structures to determine whether they should be clustered into one gene. First, the two transcripts had to have the same orientation. Second, their exon sequences had to overlap. Third, the two transcripts had to share at least one exact exon boundary or splice site. Transcripts that failed any one of these criteria were assigned to separate genes. Each group (or cluster) of transcripts was considered a GeneHub gene. Using this approach, we defined 31,999 non-redundant GeneHub genes for human and 34,794 for mouse.
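The three clustering criteria can be sketched in a few lines of Python. This is only an illustration, not the GeneHub code: the `Transcript` record, the inclusive coordinate convention, and the single-linkage grouping (a transcript joins a cluster if it matches any member) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Transcript:
    """Hypothetical record for one aligned transcript."""
    name: str
    chrom: str
    strand: str                          # "+" or "-"
    exons: Tuple[Tuple[int, int], ...]   # (start, end), inclusive coordinates


def exons_overlap(a: Transcript, b: Transcript) -> bool:
    """True if any exon of a overlaps any exon of b."""
    return any(s1 <= e2 and s2 <= e1
               for s1, e1 in a.exons for s2, e2 in b.exons)


def share_boundary(a: Transcript, b: Transcript) -> bool:
    """True if the transcripts share an exact exon start or exon end."""
    starts = {s for s, _ in a.exons} & {s for s, _ in b.exons}
    ends = {e for _, e in a.exons} & {e for _, e in b.exons}
    return bool(starts or ends)


def same_gene(a: Transcript, b: Transcript) -> bool:
    """Apply the three criteria: orientation, exon overlap, shared boundary."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and exons_overlap(a, b) and share_boundary(a, b))


def cluster_transcripts(transcripts: List[Transcript]) -> List[List[str]]:
    """Group transcripts with union-find (single linkage is an assumption)."""
    parent = list(range(len(transcripts)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(transcripts)):
        for j in range(i + 1, len(transcripts)):
            if same_gene(transcripts[i], transcripts[j]):
                parent[find(i)] = find(j)

    clusters = {}
    for i, t in enumerate(transcripts):
        clusters.setdefault(find(i), []).append(t.name)
    return sorted(sorted(names) for names in clusters.values())
```

The pairwise comparison is quadratic and would need an interval index for genome-scale input; it is kept simple here to make the criteria easy to read.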
After the initial human and mouse gene maps were built, additional sequences were mapped to the GeneHub genes. For this purpose, we used sequences from the mRNA collection of GenBank and the Ensembl Novel collection. As above, the exonic coordinates of the transcript sequences were computed by the GMAP algorithm and then compared against those of the GeneHub genes derived above. Again, for a transcript to be linked to a GeneHub gene, at least one of its splice junctions was required to match perfectly with one of the GeneHub gene's.
- How are microarray probes linked to genes?
For Affymetrix human (U95 set, U133 set, U133p) and mouse (U74 set, 430 set) expression arrays, we used the target sequences obtained from Affymetrix to link to known genes. For Agilent oligo-based expression microarrays (H1A, H1Av2, H1B, Whole Human Genome Arrays), we directly used the 60-mer oligonucleotide sequences for gene linking. Using the GMAP program, we determined whether the array probe sequences overlapped the exon sequences collected above. Next, for sequences that did not overlap any exon, we examined whether they were located in the vicinity of any GeneHub gene. We have found that the expression of a given gene can usually be represented by microarray probe sequences in the vicinity of the gene (data not shown). Therefore, we assigned a probe sequence to the closest gene in the same orientation if the probe sequence was located within 2 kb of the 3' end or 1 kb of the 5' end of the gene.
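The vicinity rule for probes that miss every exon can be sketched as follows. The `Gene` record, the function names, and the choice to measure distance from the nearest probe edge are assumptions for this example; the 2 kb and 1 kb windows come from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; start < end in genomic coordinates."""
    name: str
    strand: str   # "+" or "-"
    start: int
    end: int


FLANK_3P = 2000   # allowed distance beyond the 3' end (from the text)
FLANK_5P = 1000   # allowed distance beyond the 5' end (from the text)


def flank_distance(gene: Gene, p_start: int, p_end: int) -> Optional[int]:
    """Distance from probe to gene if the probe lies in an allowed flank."""
    if p_start > gene.end:
        # Probe is to the right: 3' side on "+", 5' side on "-".
        dist = p_start - gene.end
        limit = FLANK_3P if gene.strand == "+" else FLANK_5P
    elif p_end < gene.start:
        # Probe is to the left: 5' side on "+", 3' side on "-".
        dist = gene.start - p_end
        limit = FLANK_5P if gene.strand == "+" else FLANK_3P
    else:
        # Probe overlaps the gene span; the exon-overlap step handles this.
        return None
    return dist if dist <= limit else None


def assign_probe(genes: List[Gene], p_start: int, p_end: int,
                 p_strand: str) -> Optional[str]:
    """Return the closest same-orientation gene within the flank windows."""
    best = None
    for gene in genes:
        if gene.strand != p_strand:
            continue
        dist = flank_distance(gene, p_start, p_end)
        if dist is not None and (best is None or dist < best[0]):
            best = (dist, gene.name)
    return best[1] if best else None
```

A probe 1 kb downstream of a plus-strand gene is assigned to it, while a probe 1.4 kb upstream of the same gene is not, because the 5' window is narrower.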
- How are protein sequences linked?
Human or mouse protein sequences from UniProt (69,842 human and 59,391 mouse protein sequences) and PDB (16,739 human and 3,595 mouse) were compared with the gene DNA sequences using the BLASTx program. For a protein record to be linked to a GeneHub gene, it was required to have >98% identity over a region at least 35 amino acids long.
- How are gene annotations collected?
Once we built the GeneHub gene collection associated with gene/protein identifiers from various databases, it became straightforward to accumulate and integrate gene annotation information from diverse databases. Useful information such as gene description, accession, name, and synonyms was extracted from each member of a GeneHub gene and linked to the GeneHub gene itself. This information was stored in a common database field for text searching.
- How are ortholog links built?
To build orthologous links between human and mouse GeneHub genes, we downloaded the hmlg_ftp.txt file from HomoloGene Release 50.1. We used the orthologous Entrez Gene pairs of human and mouse if they were established by reciprocal best match between three or more organisms, or by reciprocal best match or sequence similarity with a match score greater than 70%. There were 15,868 such pairs. Based on the Entrez Gene components of GeneHub genes, we linked each human GeneHub gene to its mouse GeneHub gene counterpart.
- How many human and mouse genes are collected?
There are 31,999 non-redundant human genes and 34,794 mouse genes.
GEPIS Component
- What is the basis for EST-based expression prediction?
EST clone frequency is in principle proportional to the corresponding gene's expression level (Adams et al., 1993), and ESTs have been used successfully for studying expression signatures and for differential expression analysis (Audic and Claverie, 1997; Ewing et al., 1999; Hishiki et al., 2000; Scheurle et al., 2000; Schmitt et al., 1999). The application of ESTs to the study of the human cancer transcriptome, including the large-scale efforts of the Cancer Genome Anatomy Project (CGAP) and the Human Cancer Genome Project (HCGP), has been summarized by Brentani et al. (2003). We have also found previously that EST-based digital expression unit (DEU) values correlated with mRNA transcript levels determined by quantitative RT-PCR analyses (Zhang et al., Bioinformatics, 2004).
- How do EST-based expression results compare with microarray-based results?
Although the concept of using ESTs for expression analysis
has been known for over a decade, the full value of EST based
expression has been somewhat overshadowed by the
emerging microarray technology. While microarrays can be an
extremely powerful method, EST data can be a strong alternative
or even an advantageous method in some cases. Depending
on the nature of specific studies, ESTs potentially offer several
advantages. First, the extensive gene coverage by ESTs
(Brentani et al., 2003) allows expression analysis of almost all genes, and gene representation is not affected by the sequence
picking and designing process required by microarray chips.
Furthermore, all the raw EST and tissue source data are
already publicly available; therefore, much of the laborious
laboratory work can be bypassed when studying the expression
in many diverse tissues. Moreover, since the expression
level is represented by mRNA abundance over total transcripts
and is independent of probe selection and hybridization intensity,
EST-based analysis can be a more quantitative and direct
measurement of gene expression than microarray-based results.
In fact, based on the EST data, we were able to perform
large-scale screening for therapeutic tumor antigen targets,
namely cell surface proteins with high and specific expression
in tumor cells, and many of these targets have been
experimentally validated by a variety of methods including tissue
microarray, in situ hybridization and quantitative reverse
transcription–PCR (RT–PCR).
- What are the EST sources?
The human and mouse EST collections consist of all sequences in the EST division of GenBank (http://www.ncbi.nlm.nih.gov/dbEST/). Quality information contained in the GenBank file was used to trim low-quality bases from sequences. As of July 11, 2006, there were 7810159 human EST sequences in 8598 libraries and 4637070 mouse EST sequences in 1208 libraries. Based on tissue and histology data for library information from the National Cancer Institute (NCI) and CGAP (http://cgap.nci.nih.gov/Tissues/, ftp://ftp1.nci.nih.gov/pub/CGAP/Hs_LibData.dat and ftp://ftp1.nci.nih.gov/pub/CGAP/Mm_LibData.dat, dated July 11, 2006), each EST was assigned a tissue source value and a disease value. There are 4426 cancer libraries and 2387 normal libraries for human, and 51 cancer libraries and 741 normal libraries for mouse.
- What are the ESTs available for analysis?
Since the classification and selection of EST libraries were expected to be critical for reliable expression analysis, we established a number of quality control steps to identify usable libraries and to group them into the appropriate tissue categories, as described in Zhang Y. et al., 2004. These data cleansing steps led to a lookup table that links ESTs to GEPIS usable libraries (we only keep normal and cancer libraries) and to tissue types. For human, there are a total of 4175880 EST counts in all usable libraries: 1912573 in 1995 normal libraries and 2263307 in 3812 cancer libraries. For mouse, there are a total of 1869003 EST counts: 1461621 in 285 normal libraries and 407382 in 37 cancer libraries. In the previous GEPIS build, there were a total of 3081582 EST counts in all usable human libraries: 1144417 in 1878 normal libraries and 1937165 in 3878 tumor libraries. This is a 35% increase in total usable human EST counts.
Distribution of ESTs in usable libraries:
| Organ | Human Total | Human Normal | Human Tumor | Mouse Total | Mouse Normal | Mouse Tumor |
| ADIPOSE_TISSUE | 15533 | 13834 | 1699 | 14 | 14 | 0 |
| ADRENAL | 36922 | 19061 | 17861 | | | |
| BONE | 52633 | 12668 | 39965 | 30319 | 30319 | 0 |
| BONE_MARROW | 48113 | 20311 | 27802 | 113285 | 113285 | 0 |
| BRAIN | 705566 | 515939 | 189627 | 229874 | 229874 | 0 |
| BREAST | 195527 | 59033 | 136494 | 249503 | 10733 | 238770 |
| CARTILAGE | 36300 | 16211 | 20089 | | | |
| CERVIX | 39427 | 2576 | 36851 | | | |
| COLON | 221020 | 38203 | 182817 | 50336 | 49555 | 781 |
| EAR | | | | 32199 | 32199 | 0 |
| ESOPHAGUS | 17717 | 87 | 17630 | | | |
| EYE | 104400 | 52863 | 51537 | 39643 | 39643 | 0 |
| GENITOURINARY | 43705 | 9586 | 34119 | 11696 | 9955 | 1741 |
| GI_TRACT | 43267 | 28590 | 14677 | 31945 | 31945 | 0 |
| HEAD_N_NECK | 183837 | 45646 | 138191 | 31869 | 31869 | 0 |
| HEART | 21198 | 21198 | 0 | 24734 | 24734 | 0 |
| KIDNEY | 128658 | 45782 | 82876 | 120192 | 117954 | 2238 |
| LIVER | 196342 | 108706 | 87636 | 83206 | 73368 | 9838 |
| LUNG | 219169 | 74876 | 144293 | 91011 | 23263 | 67748 |
| LYMPH_NODE | 46659 | 25462 | 21197 | 1578 | 1578 | 0 |
| LYMPHORETICULAR | 66580 | 19355 | 47225 | 9345 | 3390 | 5955 |
| MUSCLE | 65922 | 34215 | 31707 | 25555 | 25555 | 0 |
| NERVOUS | 34602 | 34602 | 0 | 125927 | 123715 | 2212 |
| OVARY | 116612 | 8902 | 107710 | 17901 | 17901 | 0 |
| PANCREAS | 211295 | 94668 | 116627 | 61115 | 61115 | 0 |
| PARATHYROID | 53 | 0 | 53 | | | |
| PERIPHERAL_NERV | 21290 | 20092 | 1198 | 14637 | 14637 | 0 |
| PINEAL_GLAND | 8534 | 8534 | 0 | 2987 | 2987 | 0 |
| PITUITARY_GLAND | 18649 | 16360 | 2289 | 8047 | 8047 | 0 |
| PLACENTA | 219794 | 176385 | 43409 | 27639 | 27639 | 0 |
| PROSTATE | 144727 | 60159 | 84568 | 24768 | 24768 | 0 |
| RETINA | 23182 | 23182 | 0 | 37368 | 37368 | 0 |
| SALIVARY_GLAND | 13568 | 2647 | 10921 | 20853 | 19880 | 973 |
| SKIN | 182179 | 44078 | 138101 | 75179 | 41525 | 33654 |
| SOFT_TISSUE | 225 | 225 | 0 | | | |
| SPLEEN | 49015 | 49015 | 0 | 73328 | 54477 | 18851 |
| STOMACH | 107774 | 32681 | 75093 | 14481 | 14481 | 0 |
| SYNOVIUM | 2542 | 278 | 2264 | 17859 | 17859 | 0 |
| TESTIS | 228165 | 95411 | 132754 | 121236 | 118272 | 2964 |
| THYMUS | 3099 | 2958 | 141 | 43108 | 21451 | 21657 |
| THYROID | 78482 | 18650 | 59832 | 4546 | 4546 | 0 |
| UTERUS | 174205 | 10151 | 164054 | 1119 | 1119 | 0 |
| VASCULAR | 49393 | 49393 | 0 | 601 | 601 | 0 |
| Total: | 4175880 | 1912573 | 2263307 | 1869003 | 1461621 | 407382 |
* Tissues represented by fewer than 1,000 total ESTs are italicized and are not analyzed.
- How are ESTs mapped to genes?
An EST was considered to be a transcript of a gene if the two mapped to the same locus and shared at least one exon (minimum match length of >30 bp). This eliminated wrong matches between ESTs and genes with sequence similarity. Multiple EST reads from the same clone were reduced to a single read if clone information was available.
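A minimal sketch of the shared-exon test and the per-clone collapsing is given below. The function names, the half-open coordinate convention, and the tuple layout of the read records are assumptions for this example; the 30 bp threshold is taken from the text and applied here as "at least 30 bp".

```python
from typing import Iterable, List, Optional, Sequence, Tuple

Interval = Tuple[int, int]   # (start, end), half-open; an assumed convention

MIN_MATCH_BP = 30            # threshold from the text, applied as >= 30


def overlap_len(a: Interval, b: Interval) -> int:
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def est_matches_gene(est_blocks: Sequence[Interval],
                     gene_exons: Sequence[Interval],
                     min_match: int = MIN_MATCH_BP) -> bool:
    """True if any aligned EST block shares >= min_match bases with an exon."""
    return any(overlap_len(block, exon) >= min_match
               for block in est_blocks for exon in gene_exons)


def collapse_by_clone(reads: Iterable[Tuple[str, Optional[str]]]) -> List[str]:
    """Keep one read per clone; reads without clone information are all kept.

    Each read is a hypothetical (est_id, clone_id) pair.
    """
    seen = set()
    kept = []
    for est_id, clone_id in reads:
        if clone_id is not None:
            if clone_id in seen:
                continue
            seen.add(clone_id)
        kept.append(est_id)
    return kept
```

The sketch assumes the EST and the gene have already been placed on the same locus and strand by the alignment step.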
- How are the expression ratios determined?
GeneHub-GEPIS calculates gene expression level and tissue distribution based on EST data and library information. First, the sequence alignment program GMAP or BLAST was used to find EST sequences that matched a human gene of interest. Multiple EST reads from the same clone were reduced to a single read. The library identifier of each EST clone was then used to look up its tissue source in the library-tissue lookup table. Multiple libraries of the same tissue type were aggregated, both in the number of gene-matched EST clones and in the total number of EST clones. This allowed a digital expression unit (DEU) value to be derived for each tissue category, calculated as the total number of matching EST clones divided by the sum of library sizes and multiplied by 1,000,000.
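The DEU calculation described above can be written out as a short sketch. The record layout (tissue, matched clones, library size) and the function name are assumptions for illustration; only the formula itself comes from the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical per-library record: (tissue, matched_clones, library_size)
LibraryRecord = Tuple[str, int, int]


def deu_by_tissue(libraries: Iterable[LibraryRecord]) -> Dict[str, float]:
    """Aggregate libraries per tissue and compute DEU for each tissue.

    DEU = matching EST clones / total EST clones * 1,000,000
    """
    matched = defaultdict(int)
    total = defaultdict(int)
    for tissue, n_matched, size in libraries:
        matched[tissue] += n_matched
        total[tissue] += size
    # Multiply before dividing to keep round values exact in floating point.
    return {tissue: matched[tissue] * 1_000_000 / total[tissue]
            for tissue in total if total[tissue] > 0}
```

For example, two brain libraries with 3 of 5,000 and 7 of 15,000 matching clones aggregate to 10 of 20,000 clones, which gives a DEU of 500.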
Even though a typical EST library contained only a few thousand clone sequences, multiple libraries from the same tissue type were aggregated to form a composite data source that was more information-rich and better represented rare genes. The DEU calculation thus approximates EST representation in libraries with a pool of one million sequences. DEU values were calculated iteratively for each tissue type to profile expression levels across all tissues. The Z-test was applied to determine whether the DEU in sample type A was statistically higher than the DEU in sample type B. Comparisons could be made between normal and cancer samples from the same tissue, or between two different types of tissues. For a given gene, the common relative abundance $p$ was computed over all libraries by taking the sum of clone counts for the gene over all libraries and dividing by the total number of clones over all libraries. The relative abundance for the gene in type A ($p_A$) and in type B ($p_B$) libraries was also calculated. The test statistic Z-score was calculated as follows: $Z = \frac{p_A - p_B}{\sqrt{p(1-p)\left(\frac{1}{N_A} + \frac{1}{N_B}\right)}}$, where $N_A$ and $N_B$ are the total numbers of EST clones derived from A and B. The Z-score can then be referred to the normal distribution to yield a P-value.
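As an illustration of the test, the sketch below uses the standard pooled two-proportion Z statistic, which is an assumption about the exact form used by GeneHub-GEPIS. It also assumes that the common abundance is pooled over the A and B libraries only and that the P-value is one-sided (A higher than B). The function name and arguments are hypothetical.

```python
import math
from typing import Tuple


def z_test(count_a: int, n_a: int, count_b: int, n_b: int) -> Tuple[float, float]:
    """Pooled two-proportion Z-test for 'abundance in A is higher than in B'.

    count_a, count_b: gene-matched EST clones in sample types A and B
    n_a, n_b:         total EST clones in sample types A and B
    Returns (Z-score, one-sided P-value).
    """
    p_a = count_a / n_a
    p_b = count_b / n_b
    p = (count_a + count_b) / (n_a + n_b)      # common relative abundance
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 0.5                         # no clones observed at all
    z = (p_a - p_b) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value
```

Equal abundances give a Z-score of 0 and a P-value of 0.5; swapping A and B flips the sign of Z.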
- How does the microRNA expression work?
GeneHub-GEPIS can report miRNA expression based on EST data. This was motivated by the observation that miRNA precursor sequences can be found among ESTs (Smalheiser, 2003; Zhang et al., 2005) and that pri-miRNA expression levels correlate with mature miRNA expression levels (Thomson et al., 2006). Given this relationship, we could use EST data to approximate miRNA expression levels in various tissues. This was computed using the following steps: 1) we collected the genomic locations of the miRNA stem-loop sequences from version 9.0 of miRBase; 2) we then obtained all EST sequences that had any overlap with the miRNA stem-loops; and 3) we computed the GEPIS results as described above. These data can be explored from the Text Search interface by miRNA name or by miRBase accession number.
Features
The addition of the GeneHub component enables text search and ortholog matching and allows easy access to the expression data through URL linking. In addition, detailed information about ESTs and their source libraries is saved and made available via URL linking.
- How to make a URL link to expression result?
The URL link should be constructed as:
http://www.cgl.ucsf.edu/cgi-bin/genentech/genehub-gepis/web_search.pl?intype=1&xrefid=accession&species=human_or_mouse
with optional chromosome selection:
http://www.cgl.ucsf.edu/cgi-bin/genentech/genehub-gepis/web_search.pl?intype=1&xrefid=accession&species=human_or_mouse&chr=chromosome
Chromosome Name: 1-22, X, Y, MT (Mitochondria).
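A link of this form can be assembled programmatically. The helper below is a sketch: the function name is hypothetical, and it assumes that the `species` parameter takes the literal values `human` or `mouse`, as the pattern above suggests. The accession used in the usage note is only a placeholder.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = ("http://www.cgl.ucsf.edu/cgi-bin/genentech/"
            "genehub-gepis/web_search.pl")


def gepis_url(accession: str, species: str,
              chromosome: Optional[str] = None) -> str:
    """Build a GeneHub-GEPIS result URL from the documented parameters."""
    if species not in ("human", "mouse"):   # assumed allowed values
        raise ValueError("species must be 'human' or 'mouse'")
    params = {"intype": 1, "xrefid": accession, "species": species}
    if chromosome is not None:
        params["chr"] = chromosome           # optional chromosome selection
    return f"{BASE_URL}?{urlencode(params)}"
```

Calling `gepis_url("NM_000546", "human", chromosome="17")` yields the base URL followed by `?intype=1&xrefid=NM_000546&species=human&chr=17`.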
- Text Search (case-insensitive)
- Search By Accession
We support accessions of all the following sources:
UniProt, GenBank, RefSeq,
EntrezGene, Ensembl Transcript, Affymetrix or Agilent Probe, miRNA names.
- Search By Gene Symbol
All the space, hyphen and underscore characters in the synonyms are automatically removed before
search to make search result consistent.
- What if multiple hits are found?
Multiple matches are displayed along with their genomic coordinates.
The user must choose one to display its GEPIS result.
- What if no exact match is found?
A begin-search (matching entries that begin with the query string) is performed automatically if there is no exact match.
- Sequence Search
Users can either upload a single-sequence FASTA file to the web server (via a browse button) or paste
the sequence data into a standard text box.
- Homolog Search: sequence origin is the same as target organism
The input sequence is BLASTed against the sequence database or EST database of the same organism.
- Ortholog Search: sequence origin is different from target organism
- miRNA Search
Simply type a miRBase miRNA accession, e.g. MI0000112, or its identifier, e.g. hsa-mir-105-2, into the text search box and click the "Submit" button.
- Result Download
Click the link to export the data into an Excel spreadsheet.
- List EST hits by Libraries
Click to open the link that displays the number of EST hits in each library for the given gene. The libraries are grouped by their tissue type. To save space, only libraries with matched ESTs are displayed on the web. To view all the EST libraries click .
- Draw Regional Atlas
Regional GEPIS Atlas is a composite of visual transcriptome maps depicting the expression level of all neighboring genes in selected tissues.
Select the organs, and specify the range/size of genomic region you want to draw the atlas, then click ,
which opens a new browser window to display a set of charts, one for each organ you have selected.
- Display Tissue Expression Profile
Tissue Expression Profile is a bar-chart depicting the Nor-DEL (green) and Tum-DEL (red) value of each organ.
Click , which opens a new browser window to display Tissue Expression Profiling chart
How to Interpret Results?
Implementation Details
- How is text search implemented?
The system tries to match the query string to a unique GeneHub gene index. Once a best match is found, the GeneHub gene index is used to retrieve the pre-computed GEPIS result. The gene attributes and synonyms are stored in two tables, GENE and GENE_SYNONYMS, respectively. The cross-references between GeneHub gene indexes and database records are saved in the DBXREF table. The DBXREF and GENE_SYNONYMS tables are consulted in turn to find an exact match to the query string. A begin-search (matching entries that begin with the query string) is performed automatically if the first round finds no exact match.
MySQL text search has several limitations:
- It does not support function-based indexes.
- Hyphenated words are treated as two words in MySQL.
- MySQL comes with a default stop-word list, and numbers in the query are ignored by default.
To overcome these limitations and make text search case-insensitive and consistent (e.g. IL-8, il 8 and IL8 should all return the same result), we added the columns SEARCH_TEXT and XREF_ID_SEARCH to the GENE_SYNONYMS and DBXREF tables, respectively. All space, hyphen and underscore characters in the gene synonyms are removed, and the words are converted to upper case before being saved into the SEARCH_TEXT column. XREF_ID_SEARCH stores the upper-case version of database accessions.
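The normalization and the two-round lookup can be sketched in Python. This is only an illustration: the in-memory list of (search text, gene index) pairs stands in for the SEARCH_TEXT column, the function names are hypothetical, and the begin-search is interpreted here as a prefix match.

```python
import re
from typing import List, Sequence, Tuple


def normalize_synonym(text: str) -> str:
    """Remove spaces, hyphens and underscores, then upper-case the text."""
    return re.sub(r"[ \-_]", "", text).upper()


def find_genes(query: str, index: Sequence[Tuple[str, str]]) -> List[str]:
    """Look up gene indexes: exact match first, then a prefix match.

    index holds (normalized search text, gene index) pairs, a stand-in
    for the normalized column described above.
    """
    key = normalize_synonym(query)
    exact = sorted({gene for text, gene in index if text == key})
    if exact:
        return exact
    return sorted({gene for text, gene in index if text.startswith(key)})
```

With this normalization, `IL-8`, `il 8` and `IL8` all reduce to `IL8`, so they return the same gene index.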
- How is ortholog search implemented?
Both types of cross-reference (homolog or ortholog) between GeneHub gene indexes and database records are saved in DBXREF table.
- Accession based search
It simply searches for the gene index of the target organism using the GENE and DBXREF tables, regardless of the species of the input record.
- DNA Sequence Based Ortholog Search
The input sequence is first BLASTed against the sequence database of the same organism. If a hit is found, the program consults the DBXREF table to find the hit's ortholog. Otherwise, the input sequence is BLASTed directly against the target organism's EST sequence database.