ABSTRACT

With the rise of artificial intelligence and deep learning, natural language processing and computer vision have become compelling research fields. Text-to-image generation, which synthesizes a realistic image matching a given text description, has become one of the most important of these applications and has been gaining considerable attention. This research work aims to identify a variety of suspects, including criminals, missing children, and fugitives: the goal of facial image generation from text is to generate the face of a suspect from the description provided by victims. Machine learning and deep learning techniques are leveraged because they can extrapolate specific use cases from the general training data provided to the learning architecture. Traditionally, deep learning architectures such as convolutional neural networks (CNNs) have been used; however, while these architectures are well suited to complex visual tasks such as image segmentation and image recognition, they have proved inadequate for image generation. To overcome this limitation, this research work proposes generative adversarial networks (GANs) to generate a suspect's face. The peculiarity of a GAN lies in its ability to perform indirect training: a generator is trained on the outputs of a discriminator that is itself being trained simultaneously. This unique ability of GANs has allowed researchers to produce vivid face images from myriad text descriptions. During training, the generator learns to create images that look real, while the discriminator learns to differentiate the real images from the fake ones. The authors propose to use the Universal Sentence Encoder to encode the text into vectors that can be further used for text classification.
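The adversarial objective described above can be sketched as a pair of binary cross-entropy losses: the discriminator is rewarded for scoring real faces as real and generated faces as fake, while the generator is rewarded when its fakes fool the discriminator. This is a minimal illustrative sketch of the standard GAN losses, not the paper's exact formulation; all names and numbers below are hypothetical.

```python
import numpy as np

def bce(pred, target):
    # binary cross-entropy on discriminator probabilities in (0, 1)
    eps = 1e-12
    return -np.mean(target * np.log(pred + eps) + (1 - target) * np.log(1 - pred + eps))

def discriminator_loss(d_real, d_fake):
    # discriminator wants real images scored 1 and generated images scored 0
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_loss(d_fake):
    # generator wants the discriminator to score its fakes as real (target 1)
    return bce(d_fake, np.ones_like(d_fake))

# toy discriminator outputs: confident on real faces, rejecting fakes
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.2, 0.1])
print(discriminator_loss(d_real, d_fake))  # low: discriminator is winning
print(generator_loss(d_fake))              # high: generator must improve
```

Training alternates between minimizing these two losses, which is the "indirect training" the abstract refers to: the generator's learning signal comes entirely from the discriminator's current judgments.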
Deep convolutional generative adversarial networks (DCGANs) are then employed to generate images based on the classified text. This approach is applied to the Text2FaceGAN dataset to generate facial images from textual descriptions.
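A common way to condition a DCGAN generator on text, which the pipeline above suggests, is to concatenate the sentence embedding with a random latent vector before the first layer of the generator. The sketch below assumes the Universal Sentence Encoder's 512-dimensional output and a 100-dimensional noise vector, a typical DCGAN choice; the paper's actual dimensions and conditioning scheme may differ.

```python
import numpy as np

# Assumed dimensions (illustrative, not taken from the paper):
EMBED_DIM = 512  # Universal Sentence Encoder output size
NOISE_DIM = 100  # common DCGAN latent size

def generator_input(text_embedding, rng):
    # condition the generator by concatenating the text embedding
    # with a freshly sampled latent noise vector
    z = rng.standard_normal(NOISE_DIM)
    return np.concatenate([text_embedding, z])

rng = np.random.default_rng(0)
embedding = rng.standard_normal(EMBED_DIM)  # stand-in for a real USE vector
g_in = generator_input(embedding, rng)
print(g_in.shape)  # (612,)
```

Sampling a new noise vector for each call lets the generator produce many distinct faces that are all consistent with the same textual description.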