ABSTRACT

Visual recognition applies machine learning methods to analyze images and extract information about their visual content. These methods typically learn visual categories such as “cats” and “dogs” from labeled example images. Humans, however, can learn visual categories through exposure that is accompanied by linguistic descriptions; teaching visual concepts to children, for instance, is often accompanied by descriptions in text or speech. In a machine learning context, these observations motivate computationally modeling this learning process to acquire visual facts. We aim at recognizing a visual category (e.g., parakeet auklet or Gerbera flower) from a language description without any training images, a setting known as zero-shot learning; see Figure 5.1. In simple terms, zero-shot learning extends supervised learning to settings where labeled examples are not available for all classes, for example a classification problem in which some classes have no training images at all.
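
To make the setting concrete, the following is a minimal sketch of zero-shot classification, assuming an image embedding produced by a visual encoder trained only on seen classes and per-class description embeddings produced by a text encoder. All names, the 64-dimensional embedding size, and the random placeholder vectors are hypothetical stand-ins for learned representations; the point is the decision rule, not the encoders.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical class-description embeddings (in practice, the output
# of a text encoder applied to a written description of each class;
# here, random placeholders).
class_names = ["parakeet auklet", "Gerbera flower", "cat", "dog"]
description_embeddings = rng.normal(size=(len(class_names), 64))

def classify_zero_shot(image_embedding, description_embeddings, class_names):
    """Assign an image to the class whose description embedding is
    closest in cosine similarity. No labeled image of that class is
    needed, only its textual description."""
    img = image_embedding / np.linalg.norm(image_embedding)
    desc = description_embeddings / np.linalg.norm(
        description_embeddings, axis=1, keepdims=True)
    scores = desc @ img  # cosine similarity to each class description
    return class_names[int(np.argmax(scores))], scores

# Hypothetical image embedding (in practice, from a visual encoder
# trained on images of seen classes only).
image_embedding = rng.normal(size=64)
predicted, scores = classify_zero_shot(
    image_embedding, description_embeddings, class_names)
print(predicted, scores)

In practice, the visual and text encoders are trained so that images of seen classes align with their descriptions in the shared embedding space; at test time, the same similarity rule then transfers to classes seen only as descriptions.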