ABSTRACT

Biomedical text mining is an important research area in both clinical research and computer science. Much information remains hidden in text data such as doctors’ prescriptions. Extracting useful knowledge from such text, for example about the signs and symptoms of diseases, tests, and treatments, is a real challenge. The present work addresses one of the challenges organized by the i2b2 (Informatics for Integrating Biology and the Bedside) National Center for Biomedical Computing. The challenge focused on the task of extracting medical concepts as recognized entities. Identifying multiword entities is a critical issue for clinical texts. This chapter presents a method that identifies multiword entities and categorizes them into the appropriate concept type. The i2b2 organizers provided 426 clinical notes, of which 170 were used as the training set and the remaining 256 as the test set. The proposed system is based on matrix-based multi-pattern matching. The system generates frequent patterns from the training data and uses them to build a multi-pattern trained matrix. A test matrix of the same size as the trained matrix is then created from the test data. The test matrix is updated dynamically as each sentence is traversed, and after every update the trained matrix is matched against it in parallel. Matched patterns are converted into their corresponding sequences of entities. These entities are mapped against a medical dictionary to obtain medical concepts and to prune non-medical ones. The system achieves an accuracy close to 75%, which is on a par with the internationally known natural language processing tool MetaMap.