ABSTRACT

Advances in high-throughput technologies, and the associated reduction in costs, have enabled simultaneous profiling of many biological compartments and the collection of many data types from biological specimens. These high-dimensional data sets, in turn, have necessitated the development of novel integrative methods that consider the data holistically, a paradigm sometimes termed “systems biology.” In this section, we survey current approaches for integrating multiple high-dimensional omics data sets obtained on the same set of individuals. These approaches employ a variety of methods, such as factorization, message passing, multi-block methods, generalized canonical correlation analysis, classification and regression algorithms, and network-constrained and Bayesian methods. They can be categorized as either data-driven approaches, which are based only on empirical data, or knowledge-based approaches, which also incorporate prior biological knowledge to improve biological interpretability. We further distinguish unsupervised approaches, which seek common relationships between data sets, from supervised approaches, which use labeled data to identify discriminatory patterns (e.g., between phenotypic groups). We describe the objectives of each approach and give examples of its application to multiple omics data, along with important limitations.