A methodology to detect and extract tables from born-digital PDF documents using deep learning

doi:10.1201/9781351227544-52

Chapter

ABSTRACT

Tables are an efficient and compact means of presenting data and statistics and are widely used in many kinds of documents. They allow information from different contexts to be generalized and understood instantly by the reader. Hence, table recognition is of significant importance in the field of document recognition and analysis. PDF is a widely known and widely used document format, and it guarantees consistent presentation across platforms. However, most PDF files contain little or no structural information, which makes detecting and extracting information a challenging task. The diversity of table layouts in use also makes table detection in PDF documents a formidable task. Tables in PDFs often contain very valuable data, so extracting data from such tables is also a significant task.

Deep learning is one of the latest breakthroughs in the machine learning field. Deep learning methods aim to learn features automatically at multiple levels, allowing systems to learn complex functions that map the input to the output for the given data. This paper proposes a methodology to detect tables and extract data from them using neural networks. The scope of this methodology is limited to born-digital PDF documents. Text information extracted from the PDF documents is considered alongside visual features of the documents to detect the tables.