ABSTRACT

As the volume of text data grows, an increasing number of companies are leveraging natural language processing. This chapter discusses common sources of text and presents Python code for reading text from the formats in which it is commonly found, including PDF, scanned images, web pages, CSV, JSON, Word documents, and more. A major challenge in leveraging natural language processing can be the lack of available data. Beyond data that a company generates and curates itself, several public and conditionally available data sources exist. This chapter lists numerous public text-based datasets and provides code samples for reading data from social media APIs, including the YouTube Data API and the Twitter API.

Once text data is at hand, data storage becomes a primary concern, especially if the scale does not permit storing the data in memory or in files on a computer. A database management system can be a solution. We discuss commonly used databases for storing text and share code to add, query, and perform text operations on the stored data. The databases discussed include Elasticsearch, MongoDB, and Google BigQuery. Finally, we share tips for data maintenance.