ABSTRACT

Conventional deep-learning-based methods for bridge structural health diagnosis require complicated network structures and lengthy from-scratch training with hyperparameter tuning. Because a pre-trained vision-language big model encodes general knowledge learned from large-scale image and text datasets, it has great potential for structural health diagnosis that makes full use of both image and text data. This study investigates the feasibility of establishing a big model for structural health diagnosis based on vision-language cross-modal learning. Specifically, an overall pipeline is proposed using the pre-trained vision-language big model OFA (One For All, released by DAMO Academy in 2022). A series of Transformer modules based on the self-attention mechanism is stacked to unify pre-training tasks and downstream tasks across pure vision, pure language, and vision-language cross-modal learning. The results preliminarily demonstrate the feasibility and effectiveness of the vision-language cross-modal learning paradigm using the OFA big model for structural health diagnosis.