TY - JOUR
T1 - Combining phylogenetic data with co-regulated genes to identify regulatory motifs
AU - Wang, Ting
AU - Stormo, Gary D.
N1 - Funding Information:
We would like to thank P. Cliften and M. Johnston for providing sequences of yeast genomes, and K. Tan and D. GuhaThakurta for providing other sequences. We thank B. Cohen and S. Eddy for insightful discussions and comments that improved the manuscript. We also thank Chip Lawrence for providing the Gibbs Motif Sampler and Jeremy Buhler for providing the Projection Genomics Toolkit. An anonymous reviewer is thanked for bringing several relevant publications to our attention. This work is supported by HG00249 from NIH. TW is partially supported by an NIH training grant in genomic science, 2T32HG00045.
PY - 2003/12/12
Y1 - 2003/12/12
N2 - Motivation: Discovery of regulatory motifs in unaligned DNA sequences remains a fundamental problem in computational biology. Two categories of algorithms have been developed to identify common motifs from a set of DNA sequences. The first can be called a 'multiple genes, single species' approach. It proposes that a degenerate motif is embedded in some or all of the otherwise unrelated input sequences and tries to describe a consensus motif and identify its occurrences. It is often used for co-regulated genes identified through experimental approaches. The second approach can be called 'single gene, multiple species'. It requires orthologous input sequences and tries to identify unusually well conserved regions by phylogenetic footprinting. Both approaches perform well, but each has some limitations. It is tempting to combine the knowledge of co-regulation among different genes and conservation among orthologous genes to improve our ability to identify motifs. Results: Based on the Consensus algorithm previously established by our group, we introduce a new algorithm called PhyloCon (Phylogenetic Consensus) that takes into account both conservation among orthologous genes and co-regulation of genes within a species. This algorithm first aligns conserved regions of orthologous sequences into multiple sequence alignments, or profiles, then compares profiles representing non-orthologous sequences. Motifs emerge as common regions in these profiles. Here we present a novel statistic to compare profiles of DNA sequences and a greedy approach to search for common subprofiles. We demonstrate that PhyloCon performs well on both synthetic and biological data.
UR - http://www.scopus.com/inward/record.url?scp=0344906814&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btg329
DO - 10.1093/bioinformatics/btg329
M3 - Article
C2 - 14668220
AN - SCOPUS:0344906814
SN - 1367-4803
VL - 19
SP - 2369
EP - 2380
JO - Bioinformatics
JF - Bioinformatics
IS - 18
ER -