ABSTRACT

Introduction
Recognition of Biomedical Entities in Text
    Short Methodological Overview
    Gene and Protein Name Recognition
    Recognition of Information on Mutations
    Concept-Based Identification of Functional Biological Entities
    Recognition of Medical Terminology
    Recognition of Chemical Entities in Text
    Chemical Entity Recognition in Chemical Structure Depictions
Fundamentals of NLP
Identification of Relationships that Link Biomedical Entities with Chemical Entities
State of Academic Research in Applying NLP Techniques to Link Biological and Chemical Entities
Unstructured Information Management Architecture (UIMA)
Navigation Tools for NLP Linking of Biological and Chemical Information
Commercial Solutions for NLP-Based Linking of Biological and Chemical Information
Summary and Conclusion
Notes and References

One of the great challenges in natural language processing (NLP) for life sciences is the identification and the extraction of relationships between chemical entities and biomedical entities with the goal of establishing links between chemical and biological information. The ability to do so would allow for systematic screening of the literature for biological activities of chemical compounds and thus is one of the core aims of text mining activities in both the academic world and the pharmaceutical industry. However, biology and chemistry are still quite distinct worlds that communicate their results in very different ways. Biologists and medical researchers, for example, tend to describe chemical compounds by using brand names. In the world of chemistry we prefer the far more informative, unambiguous International Chemical Identifier (InChI) descriptor or other suitable nomenclature-based designators for naming chemical entities.
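The gap between brand names and structure-based identifiers can be made concrete with a small sketch. The Python fragment below assumes a tiny hand-built lexicon that maps a few brand or trivial names to standard InChI strings; the lexicon `BRAND_TO_INCHI`, the function `link_compounds`, and the sample sentence are hypothetical and serve only as illustration. It is not the approach of any system discussed in this chapter, and a real lexicon would have to cope with synonyms, spelling variants, and ambiguous names.

```python
import re

# Hypothetical toy lexicon (an assumption for illustration only):
# lowercased brand or trivial name -> standard InChI of the active compound.
IBUPROFEN = ("InChI=1S/C13H18O2/c1-9(2)8-11-4-6-12(7-5-11)"
             "10(3)13(14)15/h4-7,9-10H,8H2,1-3H3,(H,14,15)")
ASPIRIN = ("InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)"
           "9(11)12/h2-5H,1H3,(H,11,12)")

BRAND_TO_INCHI = {
    "aspirin": ASPIRIN,
    "advil": IBUPROFEN,
    "motrin": IBUPROFEN,
}


def link_compounds(text):
    """Return (surface form, InChI) pairs for lexicon names found in text."""
    hits = []
    # Crude tokenization: runs of letters, digits, and hyphens.
    for token in re.findall(r"[A-Za-z][A-Za-z0-9-]*", text):
        inchi = BRAND_TO_INCHI.get(token.lower())
        if inchi is not None:
            hits.append((token, inchi))
    return hits


sentence = "Patients received Advil or aspirin before COX-2 expression was measured."
for name, inchi in link_compounds(sentence):
    print(name, "->", inchi)
```

Two different brand names ("Advil" and "Motrin") resolve to the same InChI here, which illustrates why a structure-based identifier is the more convenient key for linking a compound mentioned in biological text to chemical information.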