Statistical genetics of complex traits

Genome-wide association studies, while highly successful in identifying susceptibility loci, explain only a small fraction of the heritability of most complex diseases. Part of the problem may lie in the statistical methodology. Traditional methods test the association between single genetic markers and traits, and may miss many modest-effect loci. I am working on statistical methods to test the association between biological pathways and diseases. This new type of testing has the potential to increase power by addressing the locus heterogeneity problem, i.e. mutations in two different genes in the same pathway may lead to the same disease, and by incorporating genetic interactions among genes. I am also developing new strategies to integrate genome-wide association data on expression traits and diseases. These so-called "molecular phenotypes" serve as natural bridges between genotypes and complex traits. The statistical challenge is to distinguish causality from association.
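To make the pathway idea concrete, here is a minimal sketch of one textbook way to pool evidence across the genes in a pathway: Fisher's combined probability test. It is only an illustration, not the method under development here. The function name and the example p-values are hypothetical, and the test assumes independent per-gene p-values, which linkage disequilibrium and shared biology can violate in real data.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the survival function has a closed form, so no external library is needed:
        P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")

    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, len(pvalues)):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical per-gene p-values for one pathway: none is striking alone,
# but together they give stronger evidence than any single gene.
pathway_pvalues = [0.04, 0.08, 0.03, 0.11]
combined = fisher_combined_pvalue(pathway_pvalues)
```

In this toy example several modest signals in the same pathway add up, which is the intuition behind testing pathways rather than single markers.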

Evolution of novel regulatory circuits

How novel functions evolve is a fundamental problem in evolutionary biology. While the evolution of individual proteins has been extensively studied, evolution at the system level is largely unexplored. In most cases, it is not obvious how partial systems could be functional. The evolution of gene regulatory systems, which is believed to play a major role in the evolution of novel developmental forms and behavior, is particularly interesting. Many regulatory proteins are highly pleiotropic, which would appear to hinder evolution. We are studying the evolution of a regulatory circuit controlling a metabolic pathway in yeasts. Cross-species analysis suggests that this circuit emerged only in the common ancestor of some yeast species. We observed substantial changes at the levels of cis-regulatory sequences, transcription factors and signaling proteins as this circuit evolved. By exploring the history of these changes, we have a rare opportunity to investigate the general issue of how novel gene regulatory systems evolve.

Quantitative modeling of regulatory sequences

Regulatory DNA sequences drive gene expression patterns by integrating information about the environment in the form of the activities of transcription factors. The rules by which regulatory sequences read this type of information, however, are unclear. I developed quantitative models based on physicochemical principles that directly map regulatory sequences to the expression profiles they generate. These models incorporate mechanistic features that attempt to capture how activating and repressing factors work together. By evaluating the importance of these features in the fruit fly segmentation system, we were able to gain insights into the quantitative regulatory rules, including the way repressors prevent transcriptional activation and the role of cooperative interactions. A simpler model was also applied to ChIP-seq data of transcription factors important for embryonic stem cells, and was shown to be significantly more predictive of DNA binding affinities than existing methods.
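The flavor of such a sequence-to-expression mapping can be sketched with a toy equilibrium calculation. The sketch below is an assumption-laden illustration, not the model described above: it treats binding sites as independent and non-overlapping, ignores cooperativity, and uses hypothetical names (Site, expression_probability, basal_weight) and made-up parameter values. Each bound factor multiplies the statistical weight of the polymerase-bound state by an interaction term, greater than 1 for an activator and less than 1 for a repressor.

```python
from dataclasses import dataclass


@dataclass
class Site:
    affinity: float       # association constant K of the site (1 / concentration units)
    concentration: float  # concentration of the factor that binds this site
    interaction: float    # > 1 for an activator, < 1 for a repressor


def expression_probability(sites, basal_weight=0.01):
    """Probability that polymerase is bound, under a simple equilibrium model.

    With independent sites, the partition function factorizes:
        Z_off = prod(1 + q_i)
        Z_on  = basal_weight * prod(1 + q_i * w_i)
    where q_i = K_i * c_i and w_i is the interaction term of site i.
    """
    z_off = 1.0
    z_on = basal_weight
    for site in sites:
        q = site.affinity * site.concentration
        z_off *= 1.0 + q
        z_on *= 1.0 + q * site.interaction
    return z_on / (z_on + z_off)


# Hypothetical enhancer: two activator sites, then the same enhancer
# with an additional strong repressor site.
activators = [Site(1.0, 5.0, 20.0), Site(0.5, 5.0, 20.0)]
repressor = Site(2.0, 5.0, 0.05)

basal = expression_probability([])
activated = expression_probability(activators)
repressed = expression_probability(activators + [repressor])
```

Under these assumptions the activators raise the predicted expression above the basal level and the repressor pulls it back down, which is the qualitative behavior a mechanistic model needs to reproduce before its features can be compared quantitatively.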

Prediction of regulatory sequences through comparative genomics

Computational prediction of regulatory sequences may rely on sequence content: whether a sequence contains binding sites that match the specificity of transcription factors (TFs). Cross-species genome comparison can further improve prediction because true functional sites tend to be conserved during evolution. We developed computational methods to implement this idea. These methods are built on stochastic models that describe both the sequence content and the evolution of regulatory sequences. In particular, these models capture binding site gain and loss events during evolution. This feature allows our methods to predict partially conserved binding sites, which are often important in multi-species comparisons or comparisons of relatively divergent pairs. Our pairwise comparison method is also capable of performing alignment tailored to regulatory sequences and of statistically summing over all alignments between sequences.
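The "sequence content" part of this idea is commonly expressed as a position weight matrix scan. The following is a minimal sketch of that standard technique only; it does not include the evolutionary model or the alignment step described above. The motif counts, the uniform background of 0.25, the pseudocount, and the function names are all illustrative assumptions.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into log2 odds scores against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.get(base, 0) for base in BASES) + 4 * pseudocount
        matrix.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in BASES
        })
    return matrix


def scan(sequence, matrix):
    """Score every window of the sequence; returns a list of (start, score) pairs."""
    width = len(matrix)
    return [
        (start, sum(matrix[offset][sequence[start + offset]] for offset in range(width)))
        for start in range(len(sequence) - width + 1)
    ]


# Hypothetical 3-bp motif "TGA" observed in 10 known sites.
motif_counts = [{"T": 10}, {"G": 10}, {"A": 10}]
pwm = log_odds_matrix(motif_counts)
hits = scan("CCTGACC", pwm)
best_start, best_score = max(hits, key=lambda hit: hit[1])
```

A comparative method would go further and ask whether a high-scoring window is also present, or was gained or lost, in the orthologous sequence of another species.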

Evolution of cis-regulatory sequences

Understanding the conservation and change of regulatory sequences is critical to our knowledge of the unity as well as the diversity of animal development and phenotypes. We tested key hypotheses of cis-regulatory evolution using sequence data from more than 50 developmental enhancers across 12 Drosophila species. We made several interesting findings: for example, there are substantial epistatic interactions among different positions of a transcription factor binding site; loss of functional binding sites roughly follows a molecular clock; and the evolutionary fate of a binding site often depends on its sequence context.

Biomedical literature mining

Text mining aims to extract information automatically from the vast biological literature. In my RA work with BeeSpace, I was involved in several projects that developed novel text mining methods and practical systems. In one project, we developed the BeeSpace question/answering (BSQA) system that performs integrated text mining for insect biology. BSQA recognizes a number of entities and relations, from gene interactions to insect behavior, in Medline documents. For any text query, BSQA is able to automatically identify important concepts associated with the query, arranged in different categories. By utilizing the extracted relations, BSQA is also able to answer many biologically motivated questions, from simple ones, such as where a gene is expressed, to more complex ones involving multiple types of relations. In another project, I proposed a new statistical method that mines biological literature to find important concepts characterizing sets of genes.

Genome rearrangement and prediction of functional gene groups

During evolution, the order and relative proximity of genes in genomes are generally not well conserved because of rapid genome rearrangement. On the other hand, functionally related genes may be constrained to remain close to each other by natural selection. Thus, identifying these so-called conserved gene clusters is one way of finding functional gene groups, and can be used to reveal the forces underlying the evolution of genome organization. However, substantial genome rearrangements pose unique computational challenges. I developed a combinatorial algorithm to detect these gene clusters in pairwise genome comparison, allowing genes in the clusters to appear in arbitrary order. Later, I helped my colleague, Xu Ling, to improve the efficiency of the algorithm and to extend the analysis to a large number of genomes. By combining the algorithmic approach and our newly developed statistical method, we analyzed more than one hundred bacterial genomes and predicted many novel functional gene groups.
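The notion of a cluster whose genes may appear in arbitrary order can be illustrated with a naive search for "common intervals": gene sets that are contiguous in both genomes regardless of internal order. The sketch below is a simple quadratic enumeration written for illustration, not the combinatorial algorithm mentioned above. It assumes each gene identifier occurs at most once per genome and does not allow gaps inside a cluster; the function name and the toy genomes are hypothetical.

```python
def common_intervals(genome_a, genome_b, min_size=2):
    """Find gene sets that are contiguous in both genomes, in any internal order.

    For every contiguous window of genome_a, track the smallest and largest
    positions its genes occupy in genome_b. The window is a common interval
    exactly when those positions span as many slots as the window has genes.
    """
    position_in_b = {gene: index for index, gene in enumerate(genome_b)}
    clusters = []
    for start in range(len(genome_a)):
        lowest = highest = None
        for end in range(start, len(genome_a)):
            position = position_in_b.get(genome_a[end])
            if position is None:
                break  # gene absent from genome_b: no longer window can qualify
            lowest = position if lowest is None else min(lowest, position)
            highest = position if highest is None else max(highest, position)
            size = end - start + 1
            if size >= min_size and highest - lowest + 1 == size:
                clusters.append(frozenset(genome_a[start:end + 1]))
    return clusters


# Toy genomes as ordered lists of gene identifiers.
genome_a = ["a", "b", "c", "d", "e"]
genome_b = ["c", "a", "b", "e", "d"]
clusters = common_intervals(genome_a, genome_b)
```

Here the genes a, b and c form a cluster even though their order differs between the two genomes, while b and c alone do not, because a sits between them in the second genome. Real bacterial comparisons also need tolerance for gaps and a statistical test for chance clustering, which this sketch leaves out.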