ABSTRACT

This chapter focuses on the recent surge of interest in automating methods for describing audiovisual content, whether for image search and retrieval, visual storytelling, or in response to the rising demand for audio description following changes to regulatory frameworks. While computer vision communities have intensified research into the automatic generation of video descriptions (Bernardi et al., 2016), the automation of still image captioning remains a challenge in terms of accuracy (Husain and Bober, 2016). Moving images pose additional challenges linked to temporality, including co-referencing (Rohrbach et al., 2017) and other features of narrative continuity (Huang et al., 2016). Machine-generated descriptions are currently less sophisticated than their human equivalents, and are frequently incoherent or incorrect. By contrast, human descriptions are more elaborate and reliable, but they are expensive to produce. Nevertheless, they offer information about the visual and auditory elements of audiovisual content that can be exploited for research into machine training. Drawing on our research in the EU-funded MeMAD project, this chapter outlines a methodological approach for the systematic comparison of human- and machine-generated video descriptions, combining corpus-based and discourse-based approaches, with a view to identifying key characteristics and patterns in both types of description and to exploiting human knowledge about video description for machine training.