ABSTRACT

The amount of information on the World Wide Web has grown enormously since its creation in 1990. By February 2000, the web had over one billion uniquely indexed pages and 30 million audio, video and image links [1]. Since there is no central management on the web, duplication of content is inevitable. A study done in 1998 estimated that about 46% of all the text documents on the web have at least one “near-duplicate”, that is, a document which is identical except for low-level details such as formatting [2]. The problem is likely to be more severe for web video clips, as they are often stored in multiple locations and compressed with different algorithms and bitrates to facilitate downloading and streaming. Similar versions of the same video, in part or as a whole, can also be found on the web when users modify and combine original content with their own productions. Identifying such similar content is beneficial to many web video applications:

1. As users typically do not look beyond the first screen of results from a search engine, it is detrimental to have “near-duplicate” entries cluttering the top retrievals. It is preferable to group similar entries together before presenting the retrieval results to users.