ABSTRACT

This chapter discusses natural language processing techniques that are useful for cybersecurity challenges. Syntax is concerned with form, including whether a given natural language expression is well formed according to the grammatical rules of the language. Basic text preprocessing techniques include noise removal, case normalization, tokenization, stemming, and stop word elimination. Words in any language are composed of building blocks called morphemes, and morphology is the study of morphemes. The goal of word sense disambiguation is to determine the sense of a word, given the word in context and a fixed inventory of senses. Topic modeling is a vibrant area of research with many supervised and unsupervised techniques available. Natural language generation is also an important area of research. It has many uses in cybersecurity, the primary one being to generate training data such as “benign” and spam email or “benign” user activity.
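
The basic preprocessing steps named above (noise removal, case normalization, tokenization, stop word elimination, and stemming) can be illustrated with a minimal Python sketch. It is not the chapter's own implementation: the stop word list, the suffix list, the sample message, and every function name are assumptions chosen for illustration, and the stemmer is a crude suffix stripper rather than an established algorithm such as Porter's.

```python
import re

# Tiny illustrative stop word list (an assumption, not a standard inventory).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "your", "for"}

# Suffixes stripped by the toy stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def remove_noise(text):
    """Noise removal: drop HTML-like tags and URLs."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"https?://\S+", " ", text)


def tokenize(text):
    """Tokenization: split lowercased text into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text)


def crude_stem(token):
    """Toy stemming: strip one known suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run the steps in order and return the list of stems."""
    text = remove_noise(text)
    text = text.lower()  # case normalization
    tokens = tokenize(text)
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word elimination
    return [crude_stem(t) for t in tokens]


if __name__ == "__main__":
    # Hypothetical phishing-style message used only as sample input.
    sample = "<p>Your account is LOCKED. Click http://example.test/reset to verify passwords</p>"
    print(preprocess(sample))
```

In practice, a maintained stemmer and a curated stop word list from an NLP library would normally replace the toy versions shown here.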