ABSTRACT

Contents

8.1 Introduction
8.2 System Architectures for Content Processing
    8.2.1 Web Services Model
    8.2.2 VoD Ingest Model
    8.2.3 Linear and Continuous Ingest Model
    8.2.4 Role of Standards
        8.2.4.1 Electronic Program Guide Metadata
        8.2.4.2 Representation of Automatically Extracted Metadata
        8.2.4.3 Media Encoding and Delivery
        8.2.4.4 On-Demand Metadata
8.3 Media Analysis
    8.3.1 Media Segmentation
        8.3.1.1 Shot Boundary Detection
    8.3.2 Audio Processing
        8.3.2.1 Speaker Segmentation and Clustering
    8.3.3 Closed Caption Processing
        8.3.3.1 Web Mining for Language Modeling
    8.3.4 Image Processing
        8.3.4.1 Semantic Concepts
        8.3.4.2 Face Detection
        8.3.4.3 Face Clustering
        8.3.4.4 Duplicate Detection
8.4 Clients and Applications
    8.4.1 Retrieval Services
    8.4.2 Content Analytics
    8.4.3 Mobile and Multiscreen Video Retrieval
8.5 Conclusions
References

8.1 Introduction

Over the years, the fidelity and quantity of TV content have steadily increased, but consumers still experience considerable difficulty in finding content that matches their personal interests. New mobile and IP consumption environments have emerged with the promise of ubiquitous delivery of desired content, but in many cases the available content descriptions, in the form of electronic program guides (EPGs), lack sufficient detail, and cumbersome human interfaces yield a less-than-positive user experience. Creating metadata through detailed manual annotation of TV content is costly and, in many cases, this metadata may be lost in the content life cycle as assets are repurposed for multiple distribution channels. Content organization can be daunting when considering domains that range from breaking news contributions, local or government channels, live sports, music videos, and documentaries up through dramatic series and feature films. As the line between TV content and Internet content continues to blur, more and more long-tail content will appear on TV, and the ability to automatically generate metadata for it becomes paramount. Research results from several disciplines must be brought together to address the complex challenge of cost-effectively augmenting existing content descriptions to facilitate content personalization and adaptation for users, given today’s range of content consumption contexts.