TY - JOUR
T1 - Targeted discovery of novel human exons by comparative genomics
AU - Siepel, Adam
AU - Diekhans, Mark
AU - Brejová, Broňa
AU - Langton, Laura
AU - Stevens, Michael
AU - Comstock, Charles L.G.
AU - Davis, Colleen
AU - Ewing, Brent
AU - Oommen, Shelly
AU - Lau, Christopher
AU - Yu, Hung Chun
AU - Li, Jianfeng
AU - Roe, Bruce A.
AU - Green, Phil
AU - Gerhard, Daniela S.
AU - Temple, Gary
AU - Haussler, David
AU - Brent, Michael R.
PY - 2007/12
Y1 - 2007/12
AB - A complete and accurate set of human protein-coding gene annotations is perhaps the single most important resource for genomic research after the human genome sequence itself, yet the major gene catalogs remain incomplete and imperfect. Here we describe a genome-wide effort, carried out as part of the Mammalian Gene Collection (MGC) project, to identify human genes not yet in the gene catalogs. Our approach was to produce gene predictions by algorithms that rely on comparative sequence data but do not require direct cDNA evidence, then to test predicted novel genes by RT-PCR. We have identified 734 novel gene fragments (NGFs) containing 2188 exons with, at most, weak prior cDNA support. These NGFs correspond to an estimated 563 distinct genes, of which >160 are completely absent from the major gene catalogs, while hundreds of others represent significant extensions of known genes. The NGFs appear to be predominantly protein-coding genes rather than noncoding RNAs, unlike novel transcribed sequences identified by technologies such as tiling arrays and CAGE. They tend to be expressed at low levels and in a tissue-specific manner, and they are enriched for roles in motor activity, cell adhesion, connective tissue, and central nervous system development. Our results demonstrate that many important genes and gene fragments have been missed by traditional approaches to gene discovery but can be identified by their evolutionary signatures using comparative sequence data. However, they suggest that hundreds, not thousands, of protein-coding genes are completely missing from the current gene catalogs.
UR - http://www.scopus.com/inward/record.url?scp=38849145798&partnerID=8YFLogxK
U2 - 10.1101/gr.7128207
DO - 10.1101/gr.7128207
M3 - Article
C2 - 17989246
AN - SCOPUS:38849145798
SN - 1088-9051
VL - 17
SP - 1763
EP - 1773
JO - Genome Research
JF - Genome Research
IS - 12
ER -