ABSTRACT

Transforming an image into a textual description has recently attracted considerable research attention. Generating sentences with correct semantics and syntactic structure remains a challenge. Object recognition, relations between objects, and multiple meanings of the same word make this task more difficult. Consequently, research on attention mechanisms has recently made great progress. In this paper, we first compare LSTM and BERT (large) models and fine-tune them to enhance their performance. We then develop a new image captioning model that concatenates a BERT model with LSTM and dense layers. We find that, when trained with the same parameters, our new model requires less training time than the others and achieves better results on all common metrics (BLEU, METEOR, and CIDEr) on the MS-COCO dataset.