ABSTRACT

Manifold learning is a new class of non-linear embedding techniques designed to discover the structure of high-dimensional data that lies on or near a low-dimensional manifold. A manifold is a space that may have complex global geometry but that is locally well approximated by Euclidean space. Principal component analysis can sometimes suggest the appropriate embedding dimension, based on the number of “large” eigenvalues. A more widely applicable heuristic alternative is an information-theoretic approach that compares embeddings using the principle of minimum description length. Internal representations are vectorial representations that use only the object’s own properties to determine its mapping to vector space. Internal representations of a protein sequence, such as its amino acid composition, are simple and easy to compute, and they encode some information that pertains to the protein’s function. External representations are vectorial representations that use more than just the object itself to derive its mapping to vector space.
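
As an illustration of an internal representation, the following minimal Python sketch maps a protein sequence to a 20-dimensional vector of amino acid frequencies. The function name `amino_acid_composition`, the alphabetical ordering of the residues, and the choice to ignore non-standard symbols are assumptions made for this example, not details taken from the text.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (alphabetical order).
# The ordering is an arbitrary convention chosen for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of residue frequencies.

    Only the sequence itself is used, which is what makes this an
    internal representation. Symbols outside the 20 standard residues
    (for example 'X') are ignored; that is an assumption of this sketch.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical example sequence, used only to show the call.
vector = amino_acid_composition("MKVLAAGIVK")
```

The resulting vector has one entry per residue type and sums to one, so sequences of different lengths map to points in the same vector space.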