ABSTRACT

Modern biomedical studies routinely collect complex high-dimensional data, so it is no longer possible for an applied biostatistician to avoid encountering bioinformatics problems. In some ways, bioinformatics is a sub-field of biostatistics. However, due to the inherent challenges involved in analyzing and searching for patterns in massive, high-dimensional data sets, statistical rigor is often put aside in favor of practically motivated algorithmic approaches. This has naturally led to an increasing number of computer scientists and machine learning experts with interests in bioinformatics. Algorithmic methods have certainly led to valuable insights. However, there are clear advantages to the use of coherent statistical model-based approaches that attempt to flexibly yet sparsely characterize high-dimensional data. Such methods can be used to discover sparse latent structure while accounting for uncertainty. Given the enormous model spaces that are routinely encountered in bioinformatics applications, it is crucial to account for uncertainty in model selection when conducting inferences and predictions. Optimization-based strategies that search for the best model based on some criterion have the intrinsic disadvantage that the selected model is almost surely not the true model, given that the sample size of the available data is dwarfed by the size of the model space. Bayesian methods represent a natural approach for dealing with uncertainty in model selection, while also incorporating available prior information, which is crucial in addressing the large p, small n problems in bioinformatics.

This chapter provides a review and commentary on the use of Bayesian nonparametrics in bioinformatics, while also suggesting a number of promising areas for new research. By “Bayesian nonparametrics”, I am referring to approaches utilizing random probability measures, such as the Dirichlet process (Ferguson, 1973, 1974). Section 1.2 provides a brief review of Bayesian nonparametrics, focusing on Dirichlet process mixture models and related approaches. Section 1.3 provides an overview of work on Bayesian nonparametric approaches for multiple testing and high-dimensional regression. Section 1.4 considers clustering and functional data analysis applications. Section 1.5 provides an overview of recent work on using innovative nonparametric priors in population genetics, EST library analyses and other areas. Section 1.6 contains a discussion and commentary.