ABSTRACT

The main problem with the bag-of-words model is that it does not capture relations between words. In this model, each word or other feature of a text corresponds to a single dimension of the multidimensional space in which documents are represented. Modern text representations, by contrast, are based on the idea that similar words should have similar properties. This cannot be achieved in the classical bag-of-words model, because capturing such similarities would require more than one dimension per word or feature. In the modern approaches, each word is mapped to a continuous multidimensional space and is represented by a multidimensional word vector; we therefore speak of word embeddings. The chapter introduces basic concepts related to word embeddings, the algorithms used to compute them (the neural language model, word2vec, GloVe, fastText), and an example demonstrating the properties of word vectors computed from a real dataset.
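
As a minimal sketch of the contrast described above, the following Python fragment compares one-hot bag-of-words vectors with toy dense word vectors; the three-dimensional embedding values are hypothetical and chosen only for illustration (in practice they would be learned by an algorithm such as word2vec, GloVe, or fastText).

    # Minimal sketch: one dimension per word (bag-of-words) vs. dense word vectors.
    import math

    vocab = ["cat", "dog", "car"]

    def one_hot(word):
        # Bag-of-words style representation: one dimension per vocabulary word.
        return [1.0 if w == word else 0.0 for w in vocab]

    # Hypothetical toy embeddings; real values would be learned from a corpus.
    embedding = {
        "cat": [0.8, 0.1, 0.3],
        "dog": [0.7, 0.2, 0.3],
        "car": [0.1, 0.9, 0.6],
    }

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm

    # One-hot vectors of distinct words are always orthogonal,
    # so "cat" is no closer to "dog" than to "car".
    print(cosine(one_hot("cat"), one_hot("dog")))   # 0.0
    print(cosine(one_hot("cat"), one_hot("car")))   # 0.0

    # Dense vectors can place related words close together.
    print(round(cosine(embedding["cat"], embedding["dog"]), 3))  # high similarity
    print(round(cosine(embedding["cat"], embedding["car"]), 3))  # lower similarity

The point of the sketch is that in the one-hot (bag-of-words) setting the similarity between any two distinct words is zero by construction, whereas dense word vectors can encode the fact that, for example, "cat" is more similar to "dog" than to "car".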