ABSTRACT
The modern age is one of subtle permutation and combination of many dynamics. The objective behind these combinations is, in one way or another, to create success or to foster a peaceful work culture in which tasks are easy to perform. Expressing emotion through language to establish successful communication remains one of the most challenging parts of this effort. Everyone carries emotion and possesses the power of language to communicate with others, which makes communication a dynamic process with ample scope for creativity. This paper proposes a new method for image captioning that combines the BLIP-2 model with DeepFace emotion detection and the Google Gemini API for caption refinement. The framework generates captions that describe the image content while also reflecting emotional factors, and the captions are evaluated with the BLEU, METEOR, and ROUGE metrics. Using the Flickr 8k dataset, the model improves the quality of the captions. The method has applications in social networking and digital marketing, and it increases accessibility for people with visual impairments. The paper advocates integrating emotional and visual cues into AI-enabled content creation.
