ABSTRACT

This project presents an image caption generator that converts images into spoken descriptions to assist visually impaired users. It combines a VGG16-based CNN for feature extraction with an attention-equipped LSTM for caption generation. Greedy search and beam search are used as decoding strategies to improve caption quality, and pyttsx3 converts the generated captions to audio output in real time. After 40 epochs of training, the model reached a training loss of 0.4941 and a validation loss of 12.23. The system shows potential to enhance accessibility and environmental awareness for its users.
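
To illustrate how the two decoding strategies named above differ, the following is a minimal sketch, not the project's implementation. It assumes a hypothetical toy lookup table (`TOY_MODEL`) in place of the LSTM decoder, and the helper names (`next_token_probs`, `greedy_search`, `beam_search`) and the `<end>` token are illustrative only.

```python
import math

END = "<end>"  # hypothetical end-of-caption token

# Hypothetical toy "model": maps a partial caption to next-token
# probabilities. A real system would query the LSTM decoder here.
TOY_MODEL = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"dog": 0.5, "cat": 0.5},
    ("the",): {"dog": 0.9, "cat": 0.1},
}


def next_token_probs(prefix):
    # Prefixes missing from the table end the caption with probability 1.
    return TOY_MODEL.get(tuple(prefix), {END: 1.0})


def greedy_search(max_len=5):
    """Pick the single most probable token at every step."""
    caption, log_prob = [], 0.0
    for _ in range(max_len):
        probs = next_token_probs(caption)
        token = max(probs, key=probs.get)
        log_prob += math.log(probs[token])
        if token == END:
            break
        caption.append(token)
    return caption, log_prob


def beam_search(beam_width=2, max_len=5):
    """Keep the beam_width highest-scoring partial captions at every step."""
    beams = [([], 0.0, False)]  # (tokens, log probability, finished?)
    for _ in range(max_len):
        candidates = []
        for tokens, score, done in beams:
            if done:
                # Finished captions are carried forward unchanged.
                candidates.append((tokens, score, True))
                continue
            for token, p in next_token_probs(tokens).items():
                new_score = score + math.log(p)
                if token == END:
                    candidates.append((tokens, new_score, True))
                else:
                    candidates.append((tokens + [token], new_score, False))
        # Sort by log probability and keep only the best beam_width entries.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(done for _, _, done in beams):
            break
    best_tokens, best_score, _ = beams[0]
    return best_tokens, best_score


if __name__ == "__main__":
    print("greedy:", greedy_search())
    print("beam:  ", beam_search(beam_width=2))
```

With these toy probabilities, greedy search commits to "a" at the first step and returns "a dog" (probability 0.30), whereas beam search with a width of 2 keeps "the" alive and returns "the dog" (probability 0.36). A beam width of 1 reduces to greedy search.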