## Welcome to the Big Data Analytics Lab at the University of Georgia

Our research focuses on developing statistical methodology and theory to address the striking new phenomena that have emerged under the big data regime. Over the past few years, Dr. Zhong and Dr. Ma have established diverse extramurally funded research programs to overcome the computational and theoretical challenges that arise in big data analysis. This basic statistical research has been successfully applied in modern genomics, epigenetics, metagenomics, text mining, chemical sensing, and brain imaging.

More specifically, our research focuses on the following research thrusts:

**Feature selection in high dimensional regression**

Classical statistical theory considered small datasets with n observations and p carefully chosen variables. The asymptotic theory is established by fixing p and letting n go to infinity. These theoretical results cannot be generalized to big data analytics because, when p is bounded, the large-n behavior already sets in at relatively small n. For big data analytics, it makes sense to assume that n and p are both large; in some cases, it even makes sense to assume that p is much larger than n, a situation that would have been totally forbidden classically. One of our research goals is to establish the theoretical underpinnings of existing statistical tools under the big data regime. This work was supported by NSF DMS 1120256 and NSF DMS 1406843 to WZ.

**Statistical theory and methods for big data**

Regression models are useful for predicting a response variable from p predictor variables or for describing relationships between predictor variables and a response variable. In modern massive data sets with n data units, p and/or n can be large, in which case conventional algorithms face computational challenges. Subsampling of rows and/or columns of a data matrix has traditionally been employed as a heuristic to reduce the size of large data sets. This work was supported by NSF DMS 1228288 and NSF DMS 1440038 to WZ.
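As a rough illustration of the row-subsampling heuristic, the sketch below fits ordinary least squares on a uniformly drawn subset of rows and compares it with the full-data fit. The simulated data, the helper name `subsampled_ols`, and the choice of uniform sampling are assumptions made for illustration only; they are not the lab's own method.

```python
import numpy as np


def subsampled_ols(X, y, r, rng):
    """Least-squares fit on r rows drawn uniformly without replacement.

    Hypothetical helper: uniform sampling is the simplest choice and is
    assumed here purely for illustration.
    """
    n = X.shape[0]
    idx = rng.choice(n, size=r, replace=False)
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return beta


# Simulated data (assumed): n rows, p predictors, small Gaussian noise.
rng = np.random.default_rng(0)
n, p = 100_000, 5
X = rng.standard_normal((n, p))
beta_true = np.arange(1.0, p + 1.0)
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Full-data fit versus a fit on 2% of the rows.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_sub = subsampled_ols(X, y, r=2_000, rng=rng)

print("full data :", np.round(beta_full, 3))
print("subsample :", np.round(beta_sub, 3))
```

In this well-conditioned simulated setting the subsample estimate stays close to the full-data estimate while touching only a small fraction of the rows; how well that carries over to other data depends on the sampling scheme and the data themselves.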

**Sparse tensor completion and its application in chemical sensing**

A tensor is a multidimensional array. Existing statistical analyses of tensor predictors simply ignore the structural constraints by stacking each tensor observation into a vector and applying vector-based methods. This solution, however, is far from satisfactory. First, vectorization destroys the original design information and leads to interpretation difficulties. Second, vectorization can significantly aggravate the curse of dimensionality. This work was supported by NIH U01 ES016011 to WZ (PI: Kenneth Suslick).
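The dimensionality point can be made concrete with a small sketch. It assumes, purely for illustration, a rank-1 coefficient tensor of shape 4 × 5 × 6; the shapes and the rank-1 structure are hypothetical and do not describe the lab's model. Stacking the tensor into a vector leaves the inner product unchanged but discards the mode structure, so a vector-based method must treat every entry as a free parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (4, 5, 6)  # assumed tensor dimensions, for illustration only

# Rank-1 coefficient tensor B = a (outer) b (outer) c: one vector per mode.
a, b, c = (rng.standard_normal(d) for d in shape)
B = np.einsum("i,j,k->ijk", a, b, c)

# One tensor-valued observation.
X = rng.standard_normal(shape)

# The inner product is the same whether or not the arrays are vectorized...
ip_tensor = np.sum(B * X)
ip_vector = B.ravel() @ X.ravel()

# ...but the vectorized model has one free coefficient per entry, whereas
# the structured (rank-1) model only needs one vector per mode.
n_params_vectorized = B.size    # 4 * 5 * 6
n_params_rank1 = sum(shape)     # 4 + 5 + 6

print(np.isclose(ip_tensor, ip_vector))
print(n_params_vectorized, n_params_rank1)
```

The gap between the two parameter counts widens quickly as the number of modes or their sizes grow, which is one way to see how vectorization aggravates the curse of dimensionality.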

**Large scale functional data analysis and its application in epigenetics**

Gene transcription is a complex and tightly regulated process. Accumulating evidence indicates that it is regulated in concert by regulatory proteins, mainly transcription factors (TFs), and epigenetic modifications. The role of TFs in the regulation of gene transcription has been extensively studied, but the role of epigenetic modification is much less understood. DNA methylation has recently been identified as a key controller of gene transcription as well. Aberrant DNA methylation changes can cause a number of human diseases, such as developmental diseases (ICF syndrome, Prader-Willi and Angelman syndromes, etc.), aging-related diseases (e.g., Alzheimer’s disease), heart disease, diabetes, and autoimmune diseases. Moreover, a large body of evidence indicates that DNA methylation is a key player in cancer development. This work was supported by NIH R01GM 113242-01 to WZ.

**Joint analysis of text and time series data to discover causal topics**

Discovering latent topics buried in large amounts of text data is a fundamental task in any text-data-related application. Most existing work on topic discovery has focused on analyzing text data alone. As a result, the discovered topics generally reflect clusters of words that co-occur in text data. In many applications, however, we have time series data available that are associated with text data (e.g., stock prices aligned with news data by time, presidential campaign polls aligned with social media by time), and need to perform a joint analysis of the text data and time series data to discover causal topics, which are topics that might potentially explain or be caused by the changes of an external time series variable. How to discover such causal topics is one of the research goals of our lab.

**Statistical analysis of singularities in huge volumes of seismogram data in geophysics**

Earth’s onion-like inner structure is composed of several layers: the crust, the mantle, the outer core, and the inner core. The deep Earth’s dynamic interior, which extends from the lowermost mantle at a depth of 2,890 km to the core center at a depth of 6,371 km, holds keys to understanding the planet’s early state and how its biology, hydrology and atmosphere evolved and shaped the planet on which we now live. Probing the deep Earth is challenging. Direct sampling of Earth’s deep interior through man-made probes and volcanism is currently, and perhaps indefinitely, impossible, due to the extreme pressures and temperatures involved. Our knowledge of Earth’s deep interior, therefore, is pieced together from a range of surface observations through indirect methods. Recently, the rapid deployment of dense global seismograph networks has brought an unprecedented amount of high-resolution seismic data that were inaccessible just a decade ago, offering researchers a new opportunity to explore Earth’s deep interior. Probing the Earth’s deep interior using these indirect observations is a main research goal of our lab. This work has been featured in many popular media outlets, such as National Geographic, Fox News, Malaysia Sun, and The Korean Times, and was supported by NSF DMS-1438957 and NSF DMS-0723759 to PM. In particular, Dr. Ma’s work has been selected by IRIS as an example of efforts to address the ninth and tenth seismological grand challenges in understanding the Earth’s dynamic system.

**Brain Imaging Analysis**

Understanding the organizational architecture of human brain function has been of intense interest since the inception of human neuroscience and is a key application of big data science. Transforming current cognitive neuroscience and neuroimaging into the field of human brain mapping is highly challenging. Computational limitations and the incredible complexity and heterogeneity of human subjects are major obstacles to obtaining reliable estimates of the brain’s functional network. Moreover, the functional neuroimaging technologies of today and the foreseeable future, while having high resolution, are also error-prone. Our lab aims to develop a suite of fast computational tools for simultaneously estimating the functional components of brain signals and the spatial distribution of those components at both the population and individual levels.

**Metagenomics**

Metagenomics refers to the study of a collection of genomes, typically microbial genomes, present in environmental samples, such as samples from the gastrointestinal tract of a human patient or samples of soil from a particular ecological origin. By sequencing bulk DNA that is directly extracted from environmental samples, one can bypass the difficulties arising in cell cultivation, such as the rapid death of a large number of microbial species as their environmental conditions change. Although this line of research holds tremendous scientific promise, that promise has not yet been fully realized, because there is a lack of effective and efficient analytical tools for handling these complicated metagenomic data. One major challenge arises from the incredible complexity and heterogeneity of genomic composition in the samples, and the fact that the sequencing technologies of today and the foreseeable future, while having ultra-high throughput, are also error-prone and refractory to assembly. Computational methods that can efficiently and accurately group sequenced genomic fragments by their species of origin are highly desirable. One of the major goals of our lab is to develop a suite of statistical and computational methods for simultaneously estimating known or unknown bacterial species and their corresponding distribution.