One of the great challenges in a wide variety of learning problems is obtaining sufficient labeled data for modeling purposes. Labeled data is often expensive to obtain, and frequently requires laborious human effort. In many domains, unlabeled data is copious, though labels can be attached to such data at a specific cost in the labeling process. Some examples of such data are as follows:
Document Collections: Large amounts of document data, usually unlabeled, may be available on the Web. In such cases, it is desirable to attach labels to documents in order to create a learning model. A common approach is to label the documents manually to create the training data, a process that is slow, painstaking, and laborious.
Privacy-Constrained Data Sets: In many scenarios, the labels on records may be sensitive information, which may be acquired at a significant query cost (e.g., obtaining permission from the relevant entity).
Social Networks: In social networks, it may be desirable to identify nodes with specific properties. For example, an advertising company may desire to identify nodes in the social network that are interested in “cosmetics.” However, labeled nodes with interests in a specific area are rarely available in the network. Relevant nodes may be identified only through manual examination of social network posts or through user surveys. Both processes are time-consuming and costly.