Automatic Part-of-Speech Tagging of Intangible Cultural Heritage based on Large Language Models

doi:10.4324/9781003666530-13

Chapter

ABSTRACT

Large language models (LLMs) have revolutionised natural language processing and enabled advances in domain-specific applications. This study explores the use of LLMs for automatic part-of-speech (lexical) annotation of intangible cultural heritage (ICH) texts, a field with specialised vocabulary and cultural context. The research follows three phases: (1) collecting ICH data from China’s official Intangible Cultural Heritage website and creating a robust annotation dataset through human-machine collaboration; (2) developing fine-tuning data in several prompting formats (0-shot, 1-shot, 3-shot) for ICH texts; and (3) evaluating Qwen and GLM models under LoRA and full-parameter fine-tuning to assess performance across training data volumes. The results indicate that, compared with LoRA, full-parameter fine-tuning places higher demands on prompt consistency and is less robust. These findings provide insights into optimising LLMs for precise lexical annotation, with implications for ICH and other specialised fields, and highlight the importance of fine-tuning strategy, data preparation, and instruction structure in enhancing model performance.
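As a rough illustration of the second phase, the sketch below shows one way 0-shot, 1-shot and 3-shot prompts for part-of-speech tagging might be assembled. It is not the chapter's actual template: the instruction wording, the `word/tag` output format, the tag labels, the function names and the demonstration sentences are all assumptions made up for this example.

```python
from typing import List, Tuple

# A tagged sentence is a list of (token, tag) pairs.
Tagged = List[Tuple[str, str]]

# Hypothetical instruction text; the chapter's real wording is not given.
INSTRUCTION = (
    "Tag each word in the sentence with its part of speech. "
    "Output word/tag pairs separated by spaces."
)


def format_tagged(tokens: Tagged) -> str:
    """Render (token, tag) pairs as 'token/tag' separated by spaces."""
    return " ".join(f"{word}/{tag}" for word, tag in tokens)


def build_prompt(sentence: str, demos: List[Tuple[str, Tagged]], k: int) -> str:
    """Build a k-shot prompt: instruction, k worked demonstrations, then the query."""
    if k < 0 or k > len(demos):
        raise ValueError("k must be between 0 and the number of demonstrations")
    parts = [INSTRUCTION]
    for text, tagged in demos[:k]:
        parts.append(f"Sentence: {text}\nTags: {format_tagged(tagged)}")
    # The query sentence ends with an open 'Tags:' slot for the model to fill.
    parts.append(f"Sentence: {sentence}\nTags:")
    return "\n\n".join(parts)


# Made-up demonstrations with an assumed tag set (n = noun, v = verb, a = adjective).
DEMOS: List[Tuple[str, Tagged]] = [
    ("artisans weave silk", [("artisans", "n"), ("weave", "v"), ("silk", "n")]),
    ("old songs survive", [("old", "a"), ("songs", "n"), ("survive", "v")]),
    ("masters teach apprentices",
     [("masters", "n"), ("teach", "v"), ("apprentices", "n")]),
]

if __name__ == "__main__":
    for k in (0, 1, 3):
        print(f"--- {k}-shot ---")
        print(build_prompt("villagers perform opera", DEMOS, k))
```

Keeping the instruction and layout identical across the three settings, so that only the number of demonstrations changes, is one simple way to keep prompts consistent between fine-tuning and evaluation.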