ABSTRACT

Transformer neural networks have performed remarkably well in language processing tasks, leading to the introduction of the vision transformer (ViT), an alternative architecture for vision applications. The transformer's outstanding performance has made it a competitor to convolutional neural networks, which have already undergone several modifications for optimal performance in computer vision tasks. The ViT model and its variants can learn long-range dependencies and spatial correlations, among other benefits. In the medical domain, vision transformers have been applied to image classification, segmentation, registration, detection, and radiological report generation. This chapter describes the applications of transformers in image captioning and medical image analysis. It also highlights the medical image modalities most frequently used in health facilities for effective disease diagnosis. Vision transformers with the self-attention mechanism, proposed in the literature for various disease diagnoses and report generation, are analyzed and presented to introduce up-and-coming researchers and developers to computer-assisted applications for efficient healthcare delivery. Finally, we summarize the open problems and suggest potential avenues for future research in medical image processing and related tasks.