Keynote Speakers
- Xihong Lin, Harvard University
- Kathryn Roeder, Carnegie Mellon University
- John Storey, Princeton University
Invited Speakers
- Mengjie Chen, University of Chicago
- Raphael Gottardo, Fred Hutchinson Cancer Research Center
- Ming Hu, Cleveland Clinic
- Hongkai Ji, Johns Hopkins University
- Sunduz Keles, University of Wisconsin, Madison
- Shannon Ellis, Johns Hopkins University
- Hongzhe Li, University of Pennsylvania
- Shili Lin, Ohio State University
- Qiongshi Lu, Yale University and University of Wisconsin, Madison
- Mary Sara McPeek, University of Chicago
- Peter Mueller, University of Texas at Austin
- Dan Nettleton, Iowa State University
- Shyamal Peddada, NIH
- Ronglai Shen, Memorial Sloan Kettering Cancer Center
- Matthew Stephens, University of Chicago
- Wei Sun, The University of North Carolina at Chapel Hill
- Nancy Zhang, University of Pennsylvania
- David Zhao, University of Illinois at Urbana–Champaign
Program (Tentative)
Type | Speaker | Title | Abstract |
Jun 5 | |||
Keynote | Kathryn Roeder | Statistical Challenges Modeling Transcriptional Patterns in the Brain | The transcriptional pattern of the developing brain is of great interest in humans. Recent developments have made it possible to measure transcription of single cells rather than bulk tissue, and to obtain a dense temporal sampling of prenatal and postnatal periods measured on fine anatomical divisions of the brain. We discuss various statistical challenges arising in the analysis of these data. Technological advances have enabled the measurement of RNA levels for individual cells. Compared to traditional bulk RNA-seq data, single cell sequencing yields valuable insights about gene expression profiles for different cell types, which is potentially critical for understanding many complex human diseases. However, developing quantitative tools for such data remains challenging because of technical noise. We propose a unified statistical framework for both single cell and bulk RNA-seq data, formulated as a hierarchical model. Co-expression networks reveal gene communities and provide insight into the nature of genes involved in risk for genetic disorders. While it is well documented that gene expression varies dramatically over developmental periods in the brain, the associated changes in gene communities over time remain poorly understood. Recently a rich source of data from rhesus monkey brains has become available pertaining to this question. Once the data are divided by cell type and developmental period, however, sample sizes are very small, making inference quite challenging. We develop a global community detection method that combines information across a series of networks, longitudinally, to strengthen the inference for each time period. Our method is derived from evolutionary spectral clustering and degree correction methods. |
Session: Epigenetics | |||
Invited | Mary Sara McPeek | Two-Way Mixed-Effects Methods for Joint Association Analysis Using Both Host and Pathogen Genomes | Many common infectious diseases are affected by specific pairings of hosts and pathogens and therefore by both of their genomes. Integrating a pair of genomes into disease mapping will provide an exquisitely detailed view of the genetic landscape of complex diseases. We have developed a new association method, ATOMM, that maps a trait of interest to a pair of genomes simultaneously by taking advantage of the whole genome sequence data available for both host and pathogen organisms. ATOMM utilizes a two-way mixed-effect model to test for gene association and gene-gene interaction while accounting for sample structure including interactions between the genetic backgrounds of the two organisms. We demonstrate the applicability of ATOMM to a joint association study of quantitative disease resistance (QDR) in the Arabidopsis thaliana–Xanthomonas arboricola pathosystem. Our method uncovers a clear host-strain specificity in QDR and provides a powerful approach to identifying genetic variants on both genomes that contribute to the phenotypic variation. This is joint work with Miaoyan Wang, Fabrice Roux, Joy Bergelson, and others. |
Invited | Dan Nettleton | Statistical Challenges in Analysis of Complex Phenotypes Derived from Sequential Images | Researchers at Iowa State University are using networked cameras distributed throughout fields to collect time-lapse image sequences for thousands of growing maize plants. From these images, we extract multiple phenotypic traits as functions of time. Accompanying these multivariate functional phenotypes are time-indexed measurements of the environment experienced by the plants as they grow, as well as high-dimensional genotype information for each plant. Together, these data provide an unprecedented resource for understanding how genotype and environment work together to shape maize phenotypes. This talk will discuss statistical methods that can be used to address questions of scientific interest and highlight some of the interesting challenges that arise for the analysis of such data. |
Invited | Sunduz Keles | Data-Driven Regularization and Priors in GWAS and Mediation Analysis | One of the contemporary challenges in understanding the results from genome-wide association studies (GWAS) is elucidating the potential roles of non-coding SNPs. Large consortia projects have generated ample genomic and epigenomic data that are valuable for this task. We develop a number of statistical approaches for systematically incorporating such data-driven prior information into the analysis of GWAS. Our approaches leverage penalized regression formulations of GWAS and mediation analysis. We provide large-scale computational experiments that quantify when and how such information is useful, as well as a theoretical exposition. Our analysis of several phenotypes from the Framingham Heart Study illustrates the utility of this framework. Joint work with Cony Rojo, Pixu Shi, Qi Zhang, and Ming Yuan. |
Session: Statistical Methods | |||
Invited | Shannon Ellis | In Silico Phenotyping to Improve the Usefulness of Public Data | We recently developed and released recount2 (https://jhubiostatistics.shinyapps.io/recount/), a resource in which we aligned RNA-seq data for ~70,000 human samples. I’ll discuss what is available in this resource and how to access these data, and describe our effort to computationally re-phenotype the samples within recount2 so that these data can be easily integrated and used in future analyses. |
Invited | David Zhao | Do U C what I C? Some Methods for Integrative Genomics | This talk focuses on statistical methods motivated by integrative genomics, a collection of quantitative approaches in genomics research that centers around the joint analysis of multiple datasets. Three methods will be discussed: 1) achieving more powerful GWAS by incorporating gene expression data using mediation analysis regression models; 2) identifying pleiotropic SNPs across independent studies under false discovery rate control; and 3) improving the accuracy of genetic risk prediction by incorporating results from auxiliary GWAS studies using nonparametric empirical Bayes classification. Illustrative applications and some theoretical justifications will be provided. |
Invited | Shili Lin | tREX: A Statistical Inference Method for Chromatin 3D Structure | The expression of a gene is usually controlled by the regulatory elements in its promoter region. However, it has long been hypothesized that, in complex genomes, such as the human genome, a gene may be controlled by distant enhancers and repressors. A recent high throughput molecular technique, Hi-C, which uses formaldehyde cross-linking coupled with massively parallel sequencing technology, enables detection of genome-wide physical contacts between distant loci. Such communication is achieved through spatial organization (looping) of chromosomes to bring genes and their regulatory elements into close proximity. The availability of such data makes it possible to reconstruct the underlying three-dimensional (3D) spatial chromatin structure and to study spatial gene regulation. In this talk, I will describe a truncated Random effect EXpression (tREX) method for inference on the locations of genomic loci in a 3D Euclidean space. Results from Hi-C data will be visualized to illustrate spatial regulation and proximity of genomic loci that are far apart in their linear chromosomal locations. |
Invited | Peter Mueller | Reciprocal Graphical Models for Integrative Gene Regulatory Network Analysis | Constructing gene regulatory networks is a fundamental task in systems biology. We introduce a Gaussian reciprocal graphical model for inference about gene regulatory relationships by integrating mRNA gene expression and DNA level information including copy number and methylation. Data integration allows for inference on the directionality of certain regulatory relationships, which would be otherwise indistinguishable due to Markov equivalence. Efficient inference is developed based on simultaneous equation models. Bayesian model selection techniques are adopted to estimate the graph structure. We illustrate our approach by simulations and two applications in ZODIAC pairwise gene interaction analysis and colon adenocarcinoma pathway analysis. Y. Ni, Y. Ji and P. Mueller. Reciprocal Graphical Models for Integrative Gene Regulatory Network Analysis, https://arxiv.org/abs/1607.06849 |
Session: Posters and Speed Session | |||
Poster Session | Poster Exhibit with Presenters | https://graybill.natsci.colostate.edu/abstracts/ | |
Jun 6 | |||
Keynote | Xihong Lin | Analysis of Genome, Exposome and Phenome | Massive ‘ome data, including genome, exposome, and phenome data, are becoming available at an increasing rate with no apparent end in sight. Examples include Whole Genome Sequencing data, multiple metal exposure data, digital phenotyping data, and Electronic Medical Records. Whole genome sequencing data and different types of genomics data have become rapidly available. Two large ongoing whole genome sequencing programs (the Genome Sequencing Program (GSP) of NHGRI and the Trans-omics for Precision Medicine Program (TOPMed) of NHLBI) plan to sequence 300,000-350,000 whole genomes. These massive genetic and genomic data, as well as exposure and phenotype data, present many exciting opportunities as well as challenges in data analysis and result interpretation. In this talk, I will discuss analysis strategies for some of these challenges, including rare variant analysis of whole-genome sequencing association studies, analysis of multiple phenotypes (pleiotropy), and integrative analysis of different types of genetic, genomic, and environmental data using causal mediation analysis. The connection between mediation analysis and Mendelian Randomization will also be discussed. |
Session: Microbiome and Compositional Data Analysis | |||
Invited | Hongzhe Li | Composition Estimation from Sparse Count Data via a Regularized Likelihood | In microbiome studies, taxa composition is estimated based on the sequencing read counts in order to account for the large variability in total number of observed reads across different samples. Due to limited sequencing depth, some rare microbial taxa might not be captured in the metagenomic sequencing, which results in many zero counts. Naive composition estimation using count normalization leads to many zero proportions, which underestimates the underlying compositions, especially for the rare taxa. Such an estimate of the composition can further lead to biased estimates of taxa diversity and cause difficulty in downstream data analysis. In this work, the observed counts are assumed to be sampled from a Poisson-multinomial distribution with the composition being the probability parameter in a simplex space. Under the assumption that the composition matrix is approximately low rank, a maximum likelihood estimation with a nuclear norm penalty is developed to estimate the underlying compositions of the samples. The theoretical upper bounds and the minimax lower bounds of the estimation errors, measured by the Kullback-Leibler divergence and the Frobenius norm, are established. Simulation studies demonstrate that the proposed estimator outperforms the naive estimators. The methods are applied to an analysis of a human gut microbiome dataset. |
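As a rough illustration of the zero-proportion problem this abstract describes, the following sketch contrasts naive count normalization with a simple pseudocount-smoothed estimate. This is not the penalized maximum likelihood estimator of the talk; the simulation settings (`depth`, `alpha`) are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy study: 20 samples, 50 taxa, true compositions drawn from a sparse Dirichlet.
n_samples, n_taxa = 20, 50
true_comp = rng.dirichlet(np.full(n_taxa, 0.3), size=n_samples)  # many near-zero entries
depth = 500  # limited sequencing depth -> many observed zero counts

counts = np.vstack([rng.multinomial(depth, p) for p in true_comp])

# Naive estimate: count normalization; rare taxa collapse to exact zeros.
naive = counts / counts.sum(axis=1, keepdims=True)

# Smoothed estimate (a simple stand-in for regularization, NOT the talk's method):
# a small pseudocount gives every taxon positive mass.
alpha = 0.5
smoothed = (counts + alpha) / (counts + alpha).sum(axis=1, keepdims=True)

print("zero proportions (naive):   ", (naive == 0).mean())
print("zero proportions (smoothed):", (smoothed == 0).mean())  # always 0
```

The nuclear norm penalty in the talk goes further by sharing strength across samples through the low-rank structure of the composition matrix; the pseudocount above only smooths each sample separately.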
Invited | Wei Sun | Analyzing Cancer Omic Data as Compositional Data | Omic data collected from tumor samples represent the signals of both tumor and normal cells. The tumor cells can be further classified into different subclones, which is referred to as intra-tumor heterogeneity. The normal cells include fibroblast cells and multiple types of infiltrating immune cells. We first demonstrate that unknown cell type composition can lead to very strong confounding effects when analyzing cancer omic data. Then we will introduce an eQTL mapping method that can separate the genetic effect on gene expression in tumor and normal cells. Finally we briefly describe our on-going work on intra-tumor heterogeneity and estimation of immune cell composition in tumor microenvironment. |
Invited | Shyamal Peddada | Some Challenges in the Analysis of Microbiome Data | Over the past couple of decades, researchers have been interested in studying gene by (external) environment interactions on human health. Lately, however, there is considerable interest in studying the role of the internal microbial environment on human health. Numerous studies are routinely conducted to understand the association between the microbiome and various health outcomes. The 16S rRNA data generated from such studies are high dimensional count data containing a large number of zeros. Using these microbial count data, researchers are often interested in a wide range of problems, such as comparing various experimental groups and classifying subjects into groups (e.g. healthy and sick). Because of the intrinsic structure of these data, standard methods of analysis are not necessarily appropriate. A goal of this talk is to introduce some statistical issues relating to the analysis of these count data; for example, we shall discuss normalization, comparison of experimental groups, and classification of samples. We shall use some recently published data to illustrate the various methods described in this talk. |
Keynote | John Storey | Modeling Global Human Genomic Variation | Modern human population genetics studies often sample individuals from a global perspective, which results in a complex population structure present in the data sets. I will discuss flexible models of global human genetic variation from a genome-wide perspective that allow for generalizations of important population genetic models, such as Hardy-Weinberg equilibrium, F_ST, and admixture, and that also allow for more robust tests of genetic associations with complex traits. |
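For readers unfamiliar with the baseline models this keynote generalizes, here is a minimal sketch of the classical one-SNP Hardy-Weinberg equilibrium chi-square test. This is standard textbook material, not the speaker's genome-wide model; the genotype counts are made up for illustration.

```python
import numpy as np

def hwe_chi2(n_AA, n_Aa, n_aa):
    """Classical 1-df chi-square statistic for Hardy-Weinberg equilibrium at one SNP."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)      # sample allele frequency of A
    q = 1.0 - p
    expected = np.array([n * p * p, 2 * n * p * q, n * q * q])
    observed = np.array([n_AA, n_Aa, n_aa], dtype=float)
    return float(((observed - expected) ** 2 / expected).sum())

# Genotype counts exactly matching HWE proportions (p = 0.6) give statistic 0.
stat_ok = hwe_chi2(360, 480, 160)
# A deficit of heterozygotes (e.g. from hidden population structure) inflates it
# well past the 3.84 critical value of a chi-square with 1 degree of freedom.
stat_bad = hwe_chi2(450, 300, 250)
print(stat_ok, stat_bad)
```

The heterozygote-deficit example hints at why structure-aware generalizations matter: pooled samples from differentiated populations violate HWE even when each subpopulation satisfies it.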
Session: Single-Cell RNA Sequencing | |||
Invited | Hongkai Ji | Decoding Gene Regulation using Single Cell Genomic Data | Emerging single-cell regulome mapping technologies (e.g., single-cell ATAC-seq, DNase-seq or ChIP-seq) have made it possible to assay gene regulation in individual cells. However, single-cell regulome data are highly sparse and noisy. Using these data to analyze activities of each individual cis-regulatory element in a genome remains difficult. We present a series of tools to facilitate more effective use of single-cell genomic technologies for studying gene regulation. We develop SCRAT, a Single-Cell Regulome Analysis Toolbox, for analyzing cell heterogeneity using single-cell regulome data. We also show that one can predict regulatory element activities using RNA-seq. Predictions based on single-cell RNA-seq (scRNA-seq) can reconstruct bulk chromatin accessibility more accurately than pooling the same number of cells from single-cell ATAC-seq (scATAC-seq). Integrating ATAC-seq with predictions from RNA-seq increases the power of both methods. |
Invited | Nancy Zhang | Expression Recovery in Single Cell RNA Sequencing | In single cell RNA sequencing experiments, not all transcripts present in the cell are captured in the library, and not all molecules present in the library are sequenced. The efficiency, that is, the proportion of transcripts in the cell that are eventually represented by reads, can vary between 2-60%, and can be especially low in highly parallelized droplet-based technologies where the number of reads allocated for each cell can be very small. This leads to a severe case of not-at-random missing data, which hinders and confounds analysis, especially for low to moderately expressed genes. To address this issue, we introduce a noise reduction and missing-data imputation framework for single cell RNA sequencing, which allows for cell-specific efficiency parameters and borrows information across genes and cells to fill in the zeros in the expression matrix as well as improve the expression estimates derived from the low read counts. We demonstrate the accuracy of this procedure in two ways, through “thinning experiments” that subsample from real high quality scRNA-seq data sets, and through comparisons to gold-standard RNA-FISH measurements. We will also illustrate how this critical recovery step improves downstream analyses in single cell experiments. |
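The efficiency and dropout issues described in this abstract can be illustrated generically. The toy sketch below (not the speaker's method) simulates cell-specific capture efficiencies on a low-rank expression matrix, then borrows information across genes and cells with a truncated SVD; all dimensions and parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy expression matrix with low-rank structure (genes x cells), observed through
# low, cell-specific capture efficiencies that produce many zero counts.
genes, cells, rank = 200, 60, 3
truth = np.exp(rng.normal(size=(genes, rank)) @ rng.normal(size=(rank, cells)) * 0.5)
efficiency = rng.uniform(0.05, 0.4, size=cells)   # per-cell capture rate (assumed known here)
counts = rng.poisson(truth * efficiency)           # observed sparse counts

# Generic recovery sketch: normalize by the cell efficiencies, then share
# information across genes and cells via a rank-truncated SVD of log counts.
norm = np.log1p(counts / efficiency)
U, s, Vt = np.linalg.svd(norm, full_matrices=False)
denoised = np.expm1(U[:, :rank] * s[:rank] @ Vt[:rank])
denoised = np.clip(denoised, 0, None)

print(f"fraction of zero counts: {(counts == 0).mean():.0%}")
print(f"mean abs error, raw normalized: {np.abs(counts / efficiency - truth).mean():.2f}")
print(f"mean abs error, denoised:       {np.abs(denoised - truth).mean():.2f}")
```

In practice the efficiencies are unknown and must themselves be estimated, which is part of what makes the cell-specific modeling in the talk necessary.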
Invited | Raphael Gottardo | Statistical Methods for Single-Cell Genomics | Single-cell genomics enables the unprecedented interrogation of gene expression in single-cells. The stochastic nature of transcription is revealed in the bimodality of single-cell data, a feature shared across many single-cell platforms. I will present a new approach to analyze single-cell transcriptomic data that models this bimodality within a coherent generalized linear modeling framework. Our model permits direct inference on statistics formed by collections of genes, facilitating gene set enrichment analysis. The residuals defined by our model can be manipulated to interrogate cellular heterogeneity and gene-gene correlation across cells and conditions, providing insights into the temporal evolution of networks of co-expressed genes at the single-cell level. I will also discuss unwanted sources of variability in single-cell experiments and in particular the effect of the cellular detection rate defined as the fraction of genes turned on in a cell, and show how our model can account and adjust for such variability. Finally, I will illustrate this novel approach using several datasets that we have recently generated to characterize specific human immune cell subsets. |
Invited | Mengjie Chen | Removing Unwanted Variation Using both Control and Target Genes in Single Cell RNA Sequencing Studies | The single cell RNA sequencing (scRNAseq) technique is becoming increasingly popular for unbiased and high-resolution transcriptome analysis of heterogeneous cell populations. Despite its many advantages, scRNAseq, like any other genomic sequencing technique, is susceptible to the influence of confounding effects. Controlling for confounding effects in scRNAseq data is thus a crucial step for proper data normalization and accurate downstream analysis. Several recent methodological studies have demonstrated the use of control genes (including spike-ins) for controlling for confounding effects in scRNAseq studies. However, these methods can be suboptimal as they ignore the rich information contained in the target genes. Here, we develop an alternative statistical method, which we refer to as scPLS, for more accurate inference of confounding effects. Our method models control and target genes jointly to better infer and control for confounding effects. To accompany our method, we develop a novel expectation maximization algorithm for scalable inference. Our algorithm is an order of magnitude faster than standard ones, making scPLS applicable to hundreds of cells and hundreds of thousands of genes. With simulations and real data studies, we show the effectiveness of scPLS in removing technical confounding effects as well as cell cycle effects. Under the same framework, we will further discuss how to identify subpopulations using a Bayesian nonparametric approach. |
Jun 7 | Session: Integrative and Cancer Genomics ||
Invited | Qiongshi Lu | Inferring Genetic Architecture of Complex Diseases Through Integrated Analysis of Association Signals and Genomic Annotations | Genome-wide association studies (GWAS) have been a great success in the past decade. However, significant challenges still remain in both identifying new risk loci and interpreting results, even for samples with tens of thousands of subjects. In this presentation, we describe our recent efforts to develop functional annotations of the human genome from computational predictions (e.g. genomic conservation) and high-throughput experiments (e.g. the ENCODE and Roadmap Epigenomics Projects) and to integrate these annotations with GWAS test statistics. The effectiveness of our methods will be demonstrated through their applications to a large number of GWASs to identify tissues/cell types that are relevant to a specific disease, to infer shared genetic contributions to several diseases, and to improve genetic disease risk predictions. This is joint work with Hongyu Zhao, Ryan Powels, Yiming Hu, Qian Wang, and others. |
Invited | Matthew Stephens | An Invitation to a Multiple Testing Party | Multiple testing is often described as a “burden”. My goal is to convince you that multiple testing is better viewed as an opportunity, and that instead of laboring under this burden you should be looking for ways to exploit this opportunity. I invite you to a multiple testing party. |
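As a concrete baseline for the "opportunity" framing of the abstract above, here is a minimal Benjamini-Hochberg false discovery rate sketch. BH is standard methodology that multiple-testing talks typically build on, not necessarily the approach presented in this talk; the p-values are made up.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.10):
    """Return a boolean mask of rejections under BH FDR control at level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m      # step-up thresholds alpha*i/m
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])          # largest i with p_(i) <= alpha*i/m
        reject[order[: k + 1]] = True
    return reject

# Tiny example: three strong signals among ten tests.
pvals = [0.001, 0.002, 0.005, 0.2, 0.35, 0.4, 0.55, 0.6, 0.8, 0.9]
print(benjamini_hochberg(pvals, alpha=0.10))  # first three tests rejected
```

The "party" perspective is that the whole ensemble of tests carries information: with many tests one can estimate the signal distribution and do better than any fixed per-test threshold.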
Invited | Ronglai Shen | Integrative Omics Data for Cancer Classification | Large-scale comprehensive cancer genome studies, including the NCI-NHGRI Cancer Genome Atlas (TCGA) project, have generated a large amount of data of multiple “omic” dimensions, including somatic mutations, DNA copy number, DNA methylation, mRNA, miRNA, and protein expression simultaneously obtained in the same biological samples. In isolation, none of the individual data types can completely capture the complexity of the cancer genome. Collectively, the multiple types of genomic, epigenomic, and transcriptomic alterations can provide refined subtype classifications and increased power and precision to determine the molecular basis of clinical phenotypes. We conducted an integrative pan-cancer analysis of multiple omic platforms from thousands of patient samples in over a dozen TCGA tumor types. We identified shared molecular alterations in different cancer types which may indicate related disease etiology and provide unique opportunities to compare treatments and outcomes across cancer types. Furthermore, we developed a kernel learning approach to systematically investigate the prognostic value of germline and somatic mutation, DNA copy number, DNA methylation, mRNA, miRNA, protein expression, and combinations and subsets of these data for predicting patient survival outcome. We found that mRNA expression and DNA methylation are among the most informative data types for cancer prognosis, alone or in combination with clinical factors. The integration of omic profiles with clinical variables further improved the prognostic performance over using the clinical models alone. Moreover, the kernel learning method provides an efficient approach to integrate a large number of moderate effects, and thus consistently outperformed sparse methods such as lasso Cox regression. |
Invited | Ming Hu | Statistical Methods, Computational Tools and Visualization of Hi-C Data | Harnessing the power of high-throughput chromatin conformation capture (3C) based technologies, we have recently generated a compendium of datasets to characterize chromatin organization across human cell lines and primary tissues. Knowledge revealed from these data facilitates deeper understanding of long range chromatin interactions (i.e., peaks) and their functional implications for transcription regulation and for genetic mechanisms underlying complex human diseases and traits. However, various layers of uncertainty and a complex dependency structure complicate the analysis and interpretation of these data. We have proposed hidden Markov random field (HMRF) based statistical methods, which properly address the complicated dependency issue in Hi-C data, and further leverage such dependency by borrowing information from neighboring pairs of loci, for more powerful and more reproducible peak detection. Through extensive simulations and real data analysis, we demonstrate the power of our methods over existing peak callers. We have applied our methods to the compendium of Hi-C data from 21 human cell lines and tissues, and have further developed an online visualization tool to facilitate identification of potential target gene(s) for the vast majority of non-coding variants identified from the recent waves of genome-wide association studies. |
Session: Closing | |||
Closing | Debashis Ghosh | | |