ABSTRACT

Modern Computer Vision (CV) methods rely heavily on neural networks. Models based on convolutional and transformer blocks achieve state-of-the-art performance on challenging tasks, including classification, semantic and instance segmentation, object and visual relationship detection, monocular depth estimation, and image reconstruction [He et al., 2016, Dosovitskiy et al., 2020]. One reason behind the rapid growth of neural-based solutions is the advance of computational resources over the last few decades, which allows for training larger models that are capable of recognising more complex patterns [Brown et al., 2020]. Another essential condition is the accessibility of training data from which these patterns can be learned [Birhane and Prabhu, ]. Figure 5.1 shows the typical relation between accuracy and the amount of available training data.