TY - JOUR
T1 - Text-mining assisted regulatory annotation
AU - Aerts, Stein
AU - Haeussler, Maximilian
AU - van Vooren, Steven
AU - Griffith, Obi L.
AU - Hulpiau, Paco
AU - Jones, Steven J.M.
AU - Montgomery, Stephen B.
AU - Bergman, Casey M.
N1 - Funding Information:
We thank Jonathan Wren for help running his Markov sequence extraction method as well as all of the participants of the RegCreative Jamboree for many fruitful discussions before, during and after the Jamboree. We are especially grateful to Martin Krallinger, Lynette Hirschman, Alfonso Valencia and Ewan Birney for encouraging links between the regulatory informatics and text-mining communities. SA is a Postdoctoral Research Fellow of the FWO-Vlaanderen; MH is supported by a Marie Curie Early Stage Research Training Fellowship (MEST-CT-2004-504854) and the Plurigenes STREP project (LSHG-CT-2005-018673); OLG is supported by the Canadian Institutes of Health Research and the Michael Smith Foundation for Health Research; SBM is supported by the European Molecular Biology Organization and the Natural Sciences and Engineering Research Council of Canada. We also thank ENFIN, the BioSapiens Network, the Research Foundation - Flanders (FWO-Vlaanderen), Genome Canada and Genome British Columbia for financial support of the RegCreative Jamboree. This work is conducted as part of the NESCent cis-regulatory evolution working group supported by the NSF National Evolutionary Synthesis Center (NSF #EF-0423641).
PY - 2008/2/13
Y1 - 2008/2/13
N2 - Background: Decoding transcriptional regulatory networks and the genomic cis-regulatory logic implemented in their control nodes is a fundamental challenge in genome biology. High-throughput computational and experimental analyses of regulatory networks and sequences rely heavily on positive control data from prior small-scale experiments, but the vast majority of previously discovered regulatory data remains locked in the biomedical literature. Results: We develop text-mining strategies to identify relevant publications and extract sequence information to assist the regulatory annotation process. Using a vector space model to identify Medline abstracts from papers likely to have high cis-regulatory content, we demonstrate that document relevance ranking can assist the curation of transcriptional regulatory networks and estimate that, minimally, 30,000 papers harbor unannotated cis-regulatory data. In addition, we show that DNA sequences can be extracted from primary text with high cis-regulatory content and mapped to genome sequences as a means of identifying the location, organism and target gene information that is critical to the cis-regulatory annotation process. Conclusion: Our results demonstrate that text-mining technologies can be successfully integrated with genome annotation systems, thereby increasing the availability of annotated cis-regulatory data needed to catalyze advances in the field of gene regulation.
AB - Background: Decoding transcriptional regulatory networks and the genomic cis-regulatory logic implemented in their control nodes is a fundamental challenge in genome biology. High-throughput computational and experimental analyses of regulatory networks and sequences rely heavily on positive control data from prior small-scale experiments, but the vast majority of previously discovered regulatory data remains locked in the biomedical literature. Results: We develop text-mining strategies to identify relevant publications and extract sequence information to assist the regulatory annotation process. Using a vector space model to identify Medline abstracts from papers likely to have high cis-regulatory content, we demonstrate that document relevance ranking can assist the curation of transcriptional regulatory networks and estimate that, minimally, 30,000 papers harbor unannotated cis-regulatory data. In addition, we show that DNA sequences can be extracted from primary text with high cis-regulatory content and mapped to genome sequences as a means of identifying the location, organism and target gene information that is critical to the cis-regulatory annotation process. Conclusion: Our results demonstrate that text-mining technologies can be successfully integrated with genome annotation systems, thereby increasing the availability of annotated cis-regulatory data needed to catalyze advances in the field of gene regulation.
UR - http://www.scopus.com/inward/record.url?scp=43549101608&partnerID=8YFLogxK
U2 - 10.1186/gb-2008-9-2-r31
DO - 10.1186/gb-2008-9-2-r31
M3 - Article
C2 - 18271954
AN - SCOPUS:43549101608
SN - 1474-7596
VL - 9
JO - Genome Biology
JF - Genome Biology
IS - 2
M1 - R31
ER -