ABSTRACT

This chapter discusses the problem of recognizing multiword units (MWUs) in medical-domain texts written in Croatian. MWUs have been studied by many authors since before the Natural Language Processing (NLP) era, and the rise of NLP has only broadened interest in them along multiple dimensions and directions. Spasic et al. give an overview of rule-based approaches to different levels of analysis of medical texts, ranging from simple regular expressions to commercial healthcare-oriented tools such as ClearForest, LEXIMER, and AeroText. Health care produces an abundance of free-form medical texts, yet these are almost impossible to obtain, even for research purposes. The creation of the lexicon supporting this work is an ongoing project divided into several phases. In previous phases, the Croatian medical corpus was collected; it is now being made available incrementally through the Sketch Engine interface as its documents are tagged with domain and subdomain markers.