ABSTRACT

Word embeddings represent text data as vectors of numbers, learned from a huge corpus of text so that the vectors capture semantic meaning from the contexts in which words appear. Mapping words (or other tokens) to embeddings in such a vector space is a powerful approach in natural language processing. To understand what word embeddings are at a fundamental level, we can determine these vectors ourselves for a corpus of text using word counts and matrix factorization; when appropriate, we can instead rely on more sophisticated algorithms or on pre-trained word embeddings. Perhaps more than any of the other preprocessing steps this book has covered so far, using word embeddings opens an analysis or model up to the possibility of being influenced by systemic unfairness and bias. Embeddings are trained or learned from a large corpus of text, and whatever human prejudice or bias exists in that corpus becomes imprinted into the vector data of the embeddings.
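
Since the abstract mentions building embeddings from word counts and matrix factorization, the following is a minimal sketch of that idea, not the book's own code: it assumes a toy corpus, a symmetric co-occurrence window of 2, a positive pointwise mutual information (PPMI) reweighting, and a truncated SVD factorization. All names (`cooc`, `ppmi`, `embeddings`) are illustrative.

```python
import numpy as np

# Toy corpus used only for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog played",
]

# Tokenize and build the vocabulary.
docs = [doc.split() for doc in corpus]
vocab = sorted({w for doc in docs for w in doc})
index = {w: i for i, w in enumerate(vocab)}

# Count word-word co-occurrences within a symmetric context window.
window = 2
cooc = np.zeros((len(vocab), len(vocab)))
for doc in docs:
    for i, word in enumerate(doc):
        for j in range(max(0, i - window), min(len(doc), i + window + 1)):
            if i != j:
                cooc[index[word], index[doc[j]]] += 1

# Reweight raw counts with positive PMI so frequent words do not dominate.
total = cooc.sum()
row = cooc.sum(axis=1, keepdims=True)
col = cooc.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((cooc * total) / (row * col))
ppmi = np.maximum(pmi, 0.0)
ppmi[~np.isfinite(ppmi)] = 0.0

# Factor the PPMI matrix and keep the top dimensions as dense word vectors.
dim = 3
U, S, _ = np.linalg.svd(ppmi)
embeddings = U[:, :dim] * S[:dim]

# Each row of `embeddings` is now a low-dimensional vector for one word.
for word, vec in zip(vocab, np.round(embeddings, 2)):
    print(word, vec)
```

In practice, the same count-then-factorize recipe is applied to much larger corpora and vocabularies, which is also where any bias present in the text gets encoded into the resulting vectors.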