ABSTRACT

In the setting where a user wants to retrieve an image corresponding to a sentence, deep learning frameworks have started to give very good results. More precisely, contrastive learning can be used to learn good representations of the same object presented under different modalities (text, image, video, etc.). The common representation of the same object is called a semantic embedding and, in the case of the image and text modalities, a visual semantic embedding (VSE). In this chapter, we propose an approach, MSE**, which extends the VSE approach VSE++ with the ability to handle multiple modalities and to exploit several positive items of the same modality for one object. We compare the two methods and show that, despite its greater expressivity, MSE** gives nearly the same results as VSE++ on several datasets. We also show that the loss function of MSE** is better suited to some hard cases of our dataset. This work opens several perspectives: (1) applying MSE** to other datasets that have many examples of each class (e.g., a sentence that could be linked with several images), (2) using a VSE model to find new positive pairs and to eliminate false negatives from the dataset, and (3) associating images with logical formulas. This last perspective could allow for reasoning as a post-processing step; it could also improve accuracy by letting us take the specificity of formulas into account when comparing the similarities of the images associated with them.
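For concreteness, here is a minimal sketch of the contrastive objective that VSE++ (Faghri et al.) builds on, using standard notation not spelled out in this abstract: s(i, c) denotes the similarity of image i and caption c in the joint embedding space, alpha is a margin, and i', c' range over the negative items in the batch. MSE** extends this kind of loss to multiple modalities and multiple positives per object, as described in the chapter.

% Hard-negative triplet loss of VSE++, shown for context.
% s(i, c): similarity of image i and caption c in the joint space;
% \alpha: margin; i', c': hardest negatives in the batch; [x]_+ = max(x, 0).
\[
\ell(i, c) =
    \max_{c'} \bigl[\alpha + s(i, c') - s(i, c)\bigr]_+
  + \max_{i'} \bigl[\alpha + s(i', c) - s(i, c)\bigr]_+
\]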