ABSTRACT

This chapter gives an example of multimedia data mining by addressing the automatic image annotation problem and its application to multimodal image data mining and retrieval. Specifically, we propose a probabilistic semantic model in which visual features and textual words are connected through a hidden layer of semantic concepts to be discovered, so that the synergy between the two modalities is exploited explicitly. The association between visual features and textual words is determined in a Bayesian framework, which also provides a confidence measure for each association. In the proposed model, the hidden concept layer connecting the visual-feature layer and the word layer is discovered by fitting a generative model to the training images and their annotation words. An Expectation-Maximization (EM) based iterative learning procedure determines the conditional probabilities of the visual features and the textual words given each hidden concept class. Based on the discovered hidden concepts and the corresponding conditional probabilities, both image annotation and text-to-image retrieval are performed in the Bayesian framework. A prototype system based on the model is evaluated on a large-scale, visually and semantically diverse image collection crawled from the Web, consisting of 17,000 images and 7,736 automatically extracted annotation words; the evaluations indicate that the model and the framework are superior to a state-of-the-art peer system in the literature.
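
To make the structure of such a hidden-concept model concrete, the following is a minimal sketch, assuming a factorization in the style of probabilistic latent semantic analysis; the number of concepts $K$ and the symbols $f$, $w$, $z_k$ are illustrative notation, not necessarily the chapter's own. Let $f$ denote a visual feature, $w$ an annotation word, and $z_1, \dots, z_K$ the hidden concept classes. The generative model couples the two modalities through the concepts:
\[
P(f, w) \;=\; \sum_{k=1}^{K} P(z_k)\, P(f \mid z_k)\, P(w \mid z_k).
\]
An EM procedure fits this model to the training pairs: the E-step computes the posterior over concepts,
\[
P(z_k \mid f, w) \;=\; \frac{P(z_k)\, P(f \mid z_k)\, P(w \mid z_k)}{\sum_{k'=1}^{K} P(z_{k'})\, P(f \mid z_{k'})\, P(w \mid z_{k'})},
\]
and the M-step re-estimates $P(z_k)$, $P(f \mid z_k)$, and $P(w \mid z_k)$ from the resulting expected counts. Once fitted, annotation follows from Bayesian inversion: a word is scored for an image's features by
\[
P(w \mid f) \;\propto\; \sum_{k=1}^{K} P(w \mid z_k)\, P(z_k \mid f),
\]
and text-to-image retrieval ranks images by the symmetric quantity $P(f \mid w)$. Under this sketch, the mixing weights $P(z_k \mid f, w)$ are what quantify the confidence of each feature-word association.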