ABSTRACT

PLS-DA is a special case of PLS that, like Linear Discriminant Analysis, discriminates samples based on a categorical outcome (e.g. cancer subtype). PLS-DA is a supervised method, and its sparse variant sPLS-DA selects variables that can classify samples of known phenotype as well as predict the class of new samples. However, because a large number of variables are combined to build the discriminant components, PLS-DA can be prone to overfitting. Performance assessment using repeated cross-validation is therefore essential in this type of analysis. This chapter explains the principle of PLS-DA and sPLS-DA. It describes the key inputs, including how to tune the number of components and the number of features, the key outputs, and a framework to manage overfitting. The srbct study available in mixOmics, which includes gene expression data from small round blue cell tumours, is analysed with both methods. A detailed example predicts the tumour subtype of samples, and further examples analyse microbiome data and data with repeated measurements. A FAQ is provided.