Stephen G. Young¹, Dean Sheppard², Patricia C. Babbitt³, Conrad C. Huang³, Susan Jean Johns³, Michiko Kawamoto³, Doug Stryke³, John "Scooter" Morris³, and Thomas E. Ferrin³
¹ University of California, Los Angeles
² University of California, San Francisco
³ Resource for Biocomputing, Visualization, and Informatics, University of California, San Francisco
This project is funded under the auspices of the National Heart, Lung, and Blood Institute (NHLBI), which launched the Programs for Genomic Applications (PGAs) on September 30, 2000. The program is a major initiative to advance functional genomics research pertaining to cardiopulmonary development and disease. BayGenomics (Stryke et al., 2003) is one of nine research programs that were established as a result of the PGA initiative. The major goal of BayGenomics is to identify and characterize new genes that are important for cardiopulmonary disease. This is accomplished through the use of “gene trap” vectors to inactivate thousands of genes in mouse embryonic stem (ES) cells for the purpose of generating knockout mice.
To date, BayGenomics has used custom gene-trapping vectors to inactivate thousands of genes in mouse ES cells and has made those cells freely available to the research community. Our initial goal was to freeze down 2,500 ES cell lines with well-characterized insertional mutations per year. Thus far, we have trapped and frozen more than 8,000 ES cell lines, many corresponding to unknown genes. During the past five years, well over 100 insertional mutations with high-quality genomic sequences have been transmitted through the germline. This project went through a successful competitive renewal in August 2004.
BayGenomics cell lines yield exceptionally useful knockout mice. Our ES cell lines yield germline-transmitting chimeric mice with high efficiency. During the past few years, Dr. Young's laboratory has microinjected 35 BayGenomics ES cell lines into blastocysts, and all but two yielded germline-transmitting chimeras. Importantly, all of our ES cell lines contain a histological marker gene, which is essential for getting the most out of a knockout. The insertion of our trapping vector leads to the production of a β-galactosidase (β-gal) fusion protein, which makes it possible to define the temporal and spatial expression patterns of the trapped gene.
BayGenomics has distributed many ES cell clones to outside investigators, for the purpose of generating knockout mice. During the first year of funding, when we were just getting under way, BayGenomics sent 75 ES cell lines to outside investigators. With each passing year, the requests have increased substantially, and we have now responded to more than 930 requests for ES cell lines. The "increased business" is probably due in part to the effectiveness of the BayGenomics web site and the fact that we have been speaking more frequently about BayGenomics at national meetings.
Central to the BayGenomics project is an active informatics component, developed in collaboration with the Resource for Biocomputing, Visualization, and Informatics (RBVI). This informatics effort initially focused on development of a sophisticated relational database, using MySQL, and an accompanying web site to provide public access to cell lines and the sequence tags used to identify the associated genes and insertion sites. The resulting software infrastructure consists of three major components: 1) several automated data-processing protocols for transforming the raw cell line sequence data; 2) a relational database for the storage, querying, and retrieval of cell line sequence tags and their associated annotation information; and 3) many "middleware" programs for transforming user input made via HTML forms into database queries, and then transforming the query results into HTML for presentation in a web browser. The Python programming language (www.python.org) has been used to implement these programs, and has proven both efficient and effective for this purpose, providing for the rapid development of new features and functionality on the BayGenomics web site. Because the underlying data-processing infrastructure is entirely automated, it has proved easily scalable as BayGenomics has grown. Many of these programs have also been used or adapted for use in the International Gene Trap Consortium (IGTC) database and web site that has recently been implemented by the RBVI as part of this project (see below). The database schema initially developed for BayGenomics provided the model for the schema currently used for the Sanger Institute Gene Trap Resource (SIGTR, http://www.sanger.ac.uk/PostGenomics/genetrap) and for the new IGTC resource. Our in-house vector sequence identification and removal programs are superior to those used at dbGSS (http://www.ncbi.nlm.nih.gov/dbGSS/) and by some other IGTC resource centers.
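The "middleware" role described above (form input in, database query in the middle, HTML out) can be illustrated with a minimal sketch. This is not the BayGenomics code: the table name, column names, and cell line values are invented placeholders, and Python's built-in sqlite3 module stands in for MySQL so that the example is self-contained.

```python
import sqlite3
from html import escape


def build_query(form):
    """Turn submitted form fields into a parameterized SQL query.

    `form` is a plain dict standing in for parsed HTML form input.
    The table and column names are hypothetical.
    """
    clauses, params = [], []
    if form.get("gene_symbol"):
        clauses.append("gene_symbol = ?")
        params.append(form["gene_symbol"])
    if form.get("chromosome"):
        clauses.append("chromosome = ?")
        params.append(form["chromosome"])
    sql = "SELECT cell_line, gene_symbol, chromosome FROM cell_lines"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY cell_line"
    return sql, params


def rows_to_html(rows):
    """Render query results as an HTML table, escaping every value."""
    header = "<tr><th>Cell line</th><th>Gene symbol</th><th>Chromosome</th></tr>"
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(value))}</td>" for value in row) + "</tr>"
        for row in rows
    )
    return f"<table>{header}{body}</table>"


def demo():
    """Run one form submission end to end against an in-memory database."""
    conn = sqlite3.connect(":memory:")  # stand-in for the MySQL server
    conn.execute(
        "CREATE TABLE cell_lines (cell_line TEXT, gene_symbol TEXT, chromosome TEXT)"
    )
    conn.executemany(
        "INSERT INTO cell_lines VALUES (?, ?, ?)",
        [("XX001", "GeneA", "1"), ("XX002", "GeneB", "2"), ("XX003", "GeneA", "1")],
    )
    sql, params = build_query({"gene_symbol": "GeneA"})
    return rows_to_html(conn.execute(sql, params).fetchall())
```

Passing user input as query parameters rather than concatenating it into the SQL string, and escaping values on output, keeps form input from being interpreted as SQL or as markup.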
BayGenomics Gene Identification Protocol
Since the mouse genome was not available in a form that enabled accurate localization of the majority of mouse genes until relatively recently, we developed an automated protocol to identify the genes that have been knocked out using transcript-based identification. Our AutoIdent program uses BLAST (Altschul et al., 1990) to query each sequence tag against the NCBI non-redundant (NR) sequence database and parses the output to collect names, E values, and a small set of high-scoring alignments. Specific criteria are used to provide high-confidence identification of the gene associated with each knockout sequence tag. Failing high-quality identification, nearest homologs in any species are identified. In both cases, the evidence by which the identification was made is provided on each cell line annotation page. Only ~10% of the >10,000 sequence tags currently available at BayGenomics cannot be identified using AutoIdent, usually because of lesions in the sequence tag data or experiment, or more rarely, because no associated transcript exists in the NR database. Redundancy is resolved by generating synonym lists and prioritizing which will be reported, with curated mRNA (NM) sequences always reported if available. Because the BayGenomics identification and annotation pipelines are entirely automated, we can frequently update information about each gene in the database, taking advantage of new sequences that appear continually in the public databases.
BayGenomics Web Site
A web site was developed for the BayGenomics project with four entry points providing access to the user: browse, search, download and BLAST. A user can browse the information according to the identification categories and sort by gene name, the standard mouse gene symbol (obtained by automated scripts querying data at the Mouse Genome Informatics [MGI] resource), cell line name, chromosome number, or for data unique to BayGenomics such as in situ hybridization images. Options were implemented to allow a user to obtain only the data created or updated since a user-specified date. Other software developed to enhance usability includes, for example, options to allow a user access to in situ hybridization images either by low resolution browsing of "thumbnail" versions or by viewing high-resolution images along with their annotations.
The Search page is accessed using a similar interface to that provided for the Browse page. Other programs enhance the usefulness of the Search page. For example, if a user attempts to search using a GenBank Identifier (GI) that does not match that stored in the database, the corresponding sequence is automatically obtained from GenBank and BLASTed against our identified gene sequences to retrieve and present the correct gene record. Browse and search access methods return data in a tabular format, providing the names of cell lines and their associated genes, along with other information including mouse gene symbols (JAX symbols from MGI), chromosome numbers, and centimorgan positions.
FASTA-formatted sequences for all cell lines (or cell lines added since a user-specified date) can be downloaded from the BayGenomics database using a very simple interface available from the Data Access page. An Advanced Data Download option is also available. For this, a web form with check boxes is provided to allow users to obtain in tab-delimited format any or all data in our database. The Advanced Data Download feature has proven to be a popular means for scientists to obtain access to bulk data formatted for their specific needs and ready for import into a local database or Excel spreadsheet. Finally, users can query our database using one or more sequences of interest by running BLAST and retrieving all cell lines or genes with matching sequence.
Both cell line and gene annotation pages are provided on the BayGenomics Web site, allowing a user to view our data in either a cell line- or gene-centric view. These pages are generated dynamically from the BayGenomics database so that the most up-to-date information is always presented. The combination of MySQL and Python as the underlying engine for delivering web content on the BayGenomics site has proven to be efficient, and hence browsing and data queries have remained responsive even as the use of the site has grown significantly over time to ~10,000 page requests per week. Our current infrastructure is able to support BayGenomics as well as the IGTC with no identifiable resource constraints.
The cell line annotation pages include, in addition to the annotation information listed above, information about how the annotation was made, the vector that was used to generate the knockout, and other information such as EST matches and update date. The respective identifiers for the identified mouse gene are provided as links to annotation pages at MGI, GenBank, dbEST, dbGSS, and Entrez Gene. The genomic localization of each cell line is provided both as Ensembl and UCSC browser IDs that link a user directly to the genome browsers with BayGenomics cell lines displayed as an annotation track. Links to the original trace files are provided along with instructions for obtaining software for viewing trace files. We have also created software that allows users to compare any user-provided trace file to an existing BayGenomics trace file. This feature has proven useful for verification when ES cell lines are thawed and re-sequenced prior to injection. Finally, availability and instructions for ordering the ES cells from the Mutant Mouse Regional Resource Center (MMRRC, www.mmrrc.org) at UC Davis and a link to the ordering form for each cell line are provided. Cell line posting and annotation histories are also available from each cell line annotation page.
The gene annotation page provides a gene-centric view of BayGenomics data. For each gene for which a knockout cell line is available, the gene name and gene symbol are provided along with identifiers linking a user to information about that gene at GenBank, Entrez Gene, and MGI. When available, synonyms for the gene name are provided, along with chromosome and centimorgan positions. Every BayGenomics cell line matching that gene is listed, linked to its cell line annotation page. Alignment images are provided on gene annotation pages to provide user-friendly graphical representations of cell lines available for each gene. As shown in Figure 1, these images allow a user to quickly see how many knockout cell lines are available at BayGenomics and where they fall relative to the reference gene (NM identifier at GenBank). Alignment to the coding strand, on which exons have been delineated, is also presented. These images are generated and updated automatically from the data in our database.
Figure 1: Image showing alignment of two BayGenomics cell line sequence tags RRJ050 and STA057 to the gene NM_020567. Ticks represent intervals of 200 bases. The coding region of this gene is shown on the top line in magenta with the horizontal black line denoting the length of the gene in the alignment. The coding strand (CDS) representation is shown in blue with the predicted exon boundaries delineated by the offsets. The cyan regions show genomic sequence outside of the CDS, such as 5' and 3' untranslated regions. The sequence tags are shown in red. The vertical black lines on the sequence tags denote mismatches in the alignment between the gene and the sequence tag sequences.
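The Advanced Data Download option described in this section, in which a user ticks check boxes and receives only the chosen columns in tab-delimited form, could be handled along the following lines. This is a minimal sketch under stated assumptions: the field names and the sample record are hypothetical and are not taken from the BayGenomics schema.

```python
import csv
import io

# Hypothetical set of downloadable columns; the real form offers
# whatever the database actually stores.
AVAILABLE_FIELDS = ["cell_line", "gene_name", "gene_symbol", "chromosome", "centimorgan"]


def export_tab_delimited(records, selected_fields):
    """Return the selected columns of `records` as tab-delimited text.

    `records` is a list of dicts keyed by field name; `selected_fields`
    mirrors the check boxes the user ticked, in display order.
    """
    unknown = [name for name in selected_fields if name not in AVAILABLE_FIELDS]
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=selected_fields,
        delimiter="\t",
        extrasaction="ignore",  # silently drop columns the user did not select
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Validating the requested field names against a fixed list before building the output keeps arbitrary form input from reaching the database layer, and the header row makes the file ready for import into a spreadsheet or a local database.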
International Gene Trap Consortium (IGTC) Informatics (http://www.genetrap.org/)
The IGTC was originally established as a subgroup of the International Mouse Mutagenesis Consortium to 1) create an international resource of embryonic stem cells with gene trap insertions, 2) disseminate this information through a common web site and 3) facilitate incorporation of gene trap information into the primary web browsers (Skarnes et al., 2004). In April 2005, Drs. Babbitt and Ferrin organized an IGTC informatics workshop in San Francisco to initiate additional interactions among informatics scientists from IGTC organizations and make plans for a new IGTC database and web site to provide enhanced access to the scientific community. Participants included six gene-trapping centers from the United States, Canada, the United Kingdom, Germany, and Japan. In addition, representatives from Mouse Genome Informatics (MGI), National Center for Biotechnology Information (NCBI), and the Univ. of California at Santa Cruz (UCSC) and Ensembl genome browsers attended and agreed to collaborate with us, providing useful advice regarding integration of IGTC data into their resources. A plan for a new IGTC database and Web site was presented by the BayGenomics informatics group and unanimously adopted by the participants. Using the SIGTR, BayGenomics, and initial IGTC informatics efforts created at the Sanger Center (www.igtc.org.uk) and in Canada (www.igtc.ca) as starting points, we implemented a new database and web site, which was released publicly in August 2005 (Nord et al., 2006). The IGTC site now includes approximately 45,000 well-characterized ES cell lines representing nearly 40% of known mouse genes. Our efforts to build this new IGTC informatics resource are described below.
Two primary goals of the new IGTC informatics effort are to generate a standardized informatics pipeline for data management and annotation of gene trap cell lines and to provide in one place easy access and sophisticated tools for analysis of resource data.
To ensure a common format and public availability of gene trap sequences, IGTC data is provided by the individual experimental centers to the NCBI dbGSS database (www.ncbi.nlm.nih.gov/dbGSS/). Using automated scripts, we download the data weekly into the IGTC MySQL database and apply the IGTC annotation pipeline, which then automatically populates our local IGTC database with the annotation information presented on the IGTC Web site. Interfaces are provided to enable BLAST searching, text-based search and browse capabilities, and to allow viewing of trapped genes in the context of biological pathways. For the IGTC, gene trap mutations are characterized by sequencing, followed by in silico identification and annotation of the gene that has been trapped. Using automated scripts, information is obtained from the genome browsers and many other informatics sites to annotate the knockout loci. Although this approach has been acceptable, a number of data dependencies are "hard coded" into our processing scripts. In the future, we see a significant advantage of using the FlexPipe data processing pipeline to create a more modular system that has fewer built-in dependencies and is thus easier to modify and maintain. This is important as gene annotation methods are still evolving and the BayGenomics and IGTC gene identification protocols will, in all likelihood, need to be updated in the future to keep pace with the latest methodologies.
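The contrast drawn above between "hard coded" dependencies and a more modular pipeline can be illustrated with a small sketch in which each processing step is an independent function and the pipeline is simply an ordered list of such steps. This does not represent the FlexPipe API or the actual IGTC scripts; the stage names, record fields, and sequences are assumptions made for the example.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Stage = Callable[[Record], Record]


def trim_vector(record: Record) -> Record:
    """Hypothetical stage: strip a known vector prefix from the sequence tag."""
    sequence = str(record["sequence"])
    vector = str(record.get("vector_prefix", ""))
    if vector and sequence.startswith(vector):
        sequence = sequence[len(vector):]
    return {**record, "sequence": sequence}


def make_length_check(min_length: int) -> Stage:
    """Build a stage that flags tags too short to annotate reliably."""
    def check(record: Record) -> Record:
        usable = len(str(record["sequence"])) >= min_length
        return {**record, "usable": usable}
    return check


def run_pipeline(record: Record, stages: List[Stage]) -> Record:
    """Apply each stage in order; stages know nothing about one another."""
    for stage in stages:
        record = stage(record)
    return record


# Replacing or reordering a step means editing this list, not the steps.
result = run_pipeline(
    {"sequence": "AAACGTACGT", "vector_prefix": "AAA"},
    [trim_vector, make_length_check(5)],
)
```

Because every stage takes a record and returns a new one, an updated gene identification method can be dropped in as another function without touching the stages around it.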
IGTC Gene Identification Protocol
The identification protocol we developed for the IGTC improves upon that used at BayGenomics and at Sanger in that it uses both genomic localization and transcript-based identification as independent approaches to confirm the identification of a gene knockout. Figure 2 shows the identification and annotation pipeline, which was designed to be both robust and flexible. Although localization using MapTag (http://www.sanger.ac.uk/PostGenomics/genetrap/maptag.shtml) and transcript identification using AutoIdent are the only annotation protocols shown in Figure 2, the pipeline was designed so that other methods in addition to or in place of these can easily be incorporated.
Currently, the pipeline compares results from the two identification protocols using overlap between the direct genomic localization of a cell line and the genomic localization of its associated transcripts to reconcile and confirm the identification. We have found this step to be critically important for high-quality localization of gene insertion sites since the major genome browsers do not always agree on the exact location of a mouse knockout or even the gene associated with it. For example, in a recent study our group performed to compare the performance of the genome localization algorithms MegaBLAST (Zhang et al., 2000), SSAHA (Ning et al., 2001), and BLAT (Kent, 2002), used at the NCBI, Ensembl (MapTag protocol), and UCSC browsers, respectively, only about 7,000 of our test set of 30,000 IGTC sequence tags could be associated with approximately 3,000 genes assigned exactly the same coordinates at all three of the genome browsers (Harper et al., manuscript submitted).
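The reconciliation step described above amounts to an interval-overlap test between the direct genomic localization of a sequence tag and the genomic span of each candidate transcript. The following is a minimal sketch of that idea only: the class, function names, and coordinates are hypothetical, and the actual pipeline may apply additional criteria (strand, minimum overlap, genome build) that are not described here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Interval:
    """A genomic span with inclusive start and end coordinates."""
    chromosome: str
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True when both spans lie on the same chromosome and share a base."""
    return a.chromosome == b.chromosome and a.start <= b.end and b.start <= a.end


def confirmed_transcripts(tag: Interval, transcripts: Dict[str, Interval]) -> List[str]:
    """Return the transcript ids whose genomic span overlaps the tag.

    An empty list means the two identification routes disagree and the
    cell line's identification is left unconfirmed.
    """
    return [tid for tid, span in transcripts.items() if overlaps(tag, span)]


# Made-up coordinates for illustration.
tag_location = Interval("chr1", 100, 200)
candidates = {
    "T1": Interval("chr1", 150, 500),  # overlaps the tag
    "T2": Interval("chr2", 150, 500),  # same coordinates, wrong chromosome
    "T3": Interval("chr1", 201, 300),  # starts one base past the tag
}
matches = confirmed_transcripts(tag_location, candidates)
```

Here only `T1` is confirmed. Treating the two routes as independent evidence means a disagreement is surfaced rather than silently resolved in favor of one method.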
For the IGTC resource, confirmed genomic coordinates are used as the primary identification data and these map coordinates are then used to query the Ensembl and NCBI databases and MGI to obtain features and other annotation information for the identified locus. Results from cell line localization are updated with each Ensembl build and AutoIdent regularly BLASTs all the sequences in the database against the GenBank NR database to update accession numbers and related information. All information in the database is available for download in a tab-delimited format.
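Each time AutoIdent re-runs BLAST it must again decide which hit to report for a sequence tag. The source states only that specific criteria are applied and that curated mRNA (NM) sequences are always reported when available, so the sketch below is an assumption-laden illustration: the E-value cutoff, the tie-breaking rule, and the accession values are invented and are not AutoIdent's actual criteria.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Hit:
    """One parsed BLAST hit: database accession and its E value."""
    accession: str
    evalue: float


def pick_reported_hit(hits: List[Hit], max_evalue: float = 1e-10) -> Optional[Hit]:
    """Choose the hit to report for one sequence tag.

    Assumed rules (not taken from the source): discard hits above
    `max_evalue`, prefer curated "NM_" accessions, then take the
    lowest E value. Returns None when nothing passes the cutoff.
    """
    confident = [hit for hit in hits if hit.evalue <= max_evalue]
    if not confident:
        return None
    return min(
        confident,
        key=lambda hit: (not hit.accession.startswith("NM_"), hit.evalue),
    )


# Placeholder accessions, for illustration only.
example_hits = [
    Hit("XM_000001", 1e-50),
    Hit("NM_000001", 1e-30),
    Hit("AK000001", 1e-3),
]
reported = pick_reported_hit(example_hits)
```

In this example the curated `NM_` record is reported even though another hit has a lower E value, which mirrors the stated preference for curated mRNA sequences.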
IGTC Web Site
The database was structured to optimize information access via web queries and the IGTC web site was designed to provide user-friendly access to all of the data in a variety of ways. Although the site is organized around the gene and cell line annotation pages, other interfaces are provided for viewing data as well. The most important of these are browser-based views enabled through mapping IGTC cell lines to genomic sequence at the NCBI, Ensembl, and UCSC browsers. Currently, these data are provided at these sites either as standard map tracks or as user-configured tracks. Finally, we are implementing protocols to enable linking to IGTC gene trap data from other major mouse gene and mouse strain resources, including MGI.
The RBVI Resource Center has been absolutely instrumental in developing the informatics technology used by the BayGenomics and IGTC projects. The co-PIs of BayGenomics are M.D. researchers/molecular biologists, with little expertise in information technology. Drs. Babbitt and Ferrin, and the rest of the RBVI team, have spearheaded the informatics component of BayGenomics and created a first-class international resource for use by gene trap researchers worldwide. The fact that our BayGenomics web site averages ~10,000 page requests per week is a testament to the impact this project is having in the scientific community.
References:
- D. Stryke, M. Kawamoto, C.C. Huang, S.J. Johns, L.A. King, C.A. Harper, E.C. Meng, R.E. Lee, L. L'Italien, P.T. Chuang, S.G. Young, W.C. Skarnes, P.C. Babbitt, and T.E. Ferrin, "BayGenomics: A Resource for Gene-Trapped Mouse Embryonic Stem Cells," Nucleic Acids Res., 31(1):278-281, 2003.
- S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, "Basic local alignment search tool," J. Mol. Biol., 215(3):403-10, 1990.
- W.C. Skarnes, H. von Melchner, W. Wurst, G. Hicks, A.S. Nord, T. Cox, S.G. Young, P. Ruiz, P. Soriano, M. Tessier-Lavigne et al., "A public gene trap resource for mouse functional genomics," Nat. Genet., 36(6):543-4, 2004.
- A.S. Nord, P.J. Chang, B.R. Conklin, T. Cox, C.A. Harper, G. Hicks, C.C. Huang, M. Kawamoto, S. Liu, E.C. Meng, J.H. Morris, W.C. Skarnes, P. Soriano, W.L. Stanford, D. Stryke, W. Wurst, K.I. Yamamura, S.G. Young, P.C. Babbitt, and T.E. Ferrin, "The International Gene Trap Consortium Website: a Collaborative Functional Genomics Resource Representing All Publicly Available Gene Trap Cell Lines in Mouse," Nucleic Acids Res., 34(database issue):D642-D648, 2006.
- Z. Zhang, S. Schwartz, L. Wagner, and W. Miller, "A greedy algorithm for aligning DNA sequences," J. Comput. Biol., 7(1-2):203-14, 2000.
- Z. Ning, A.J. Cox, and J.C. Mullikin, "SSAHA: a fast search method for large DNA databases," Genome Res., 11(10):1725-9, 2001.
- W.J. Kent, "BLAT - the BLAST-like alignment tool," Genome Res., 12(4):656-64, 2002.
- C.A. Harper, C.C. Huang, D. Stryke, M. Kawamoto, T.E. Ferrin, and P.C. Babbitt, "Comparison of Methods for Genomic Localization of Gene Tag Sequences," BMC Genomics, submitted.