ABSTRACT

2D object detection is a robot vision technique that allows robots to identify and locate multiple objects in planar images. It is broader in scope than image classification, which ascribes a single label to an entire image, and than classification with localization, which additionally locates a single object. The features of the PASCAL Visual Object Classes (VOC) dataset, a widely used benchmark for evaluating object detection algorithms and comparing their relative performance, are presented. A traditional sliding window algorithm for object localization is described. It systematically moves a fixed-size window across an image and analyzes the content within each window to determine the presence of an object of interest. Efficient subwindow search, an optimization technique based on a branch-and-bound scheme, is then described; it finds the optimal location of an object among a large set of candidate sub-images without evaluating each one, thereby significantly reducing computational cost compared with a brute-force exhaustive search. A deep learning model that identifies objects within an image, the region-based convolutional neural network (R-CNN), is discussed. It generates potential regions of interest in the image and then extracts features from those regions using a CNN. Its successively enhanced variants are Fast R-CNN, Faster R-CNN, and Mask R-CNN. Unsupervised object discovery and localization are explained, followed by object detection by self-supervised feature learning.
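
As a rough illustration of the sliding window idea summarized above, the following Python sketch moves a fixed-size window over a toy image and keeps the windows whose score meets a threshold. The function names (`sliding_windows`, `detect`), the mean-intensity scoring function, and the threshold are assumptions made for illustration only; a practical detector would substitute a trained classifier for the scoring function.

```python
import numpy as np

def sliding_windows(height, width, win_h, win_w, stride):
    """Yield the (top, left) corner of every fixed-size window that fits."""
    for top in range(0, height - win_h + 1, stride):
        for left in range(0, width - win_w + 1, stride):
            yield top, left

def detect(image, score_fn, win_h, win_w, stride, threshold):
    """Return (top, left, score) for each window scoring at or above threshold."""
    hits = []
    for top, left in sliding_windows(image.shape[0], image.shape[1],
                                     win_h, win_w, stride):
        patch = image[top:top + win_h, left:left + win_w]
        score = float(score_fn(patch))  # hypothetical stand-in for a classifier
        if score >= threshold:
            hits.append((top, left, score))
    return hits

# Toy image: a bright 4x4 square on a dark 8x8 background.
image = np.zeros((8, 8))
image[2:6, 3:7] = 1.0

# Score each window by its mean intensity (an assumption, not a real detector).
hits = detect(image, np.mean, win_h=4, win_w=4, stride=1, threshold=1.0)
print(hits)
```

Every window position is scored here, which is the exhaustive cost that branch-and-bound search is meant to avoid.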