Journal "Computational Technologies"

Article information

2024, Volume 29, № 6, p. 125-146

Shigarov A.O.

Table recognition in untagged PDF documents using PDF-specific features

PDF is one of the most popular formats for distributing print-oriented documents electronically. PDF documents are often untagged, i.e. their pages are represented only by low-level instructions for rendering text and graphics and are not accompanied by annotations of their structural components (headings, paragraphs, tables, etc.). Automatically recovering such annotations can make these structural components accessible. This requires solving a number of tasks, one of which is recognizing tables in untagged PDF documents: detecting the boundaries of their rows, columns, and cells. This paper proposes a method for recognizing tables in untagged PDF documents. Unlike existing analogues, it solves this task using PDF-specific features such as text output order, pen movement positions, etc. This made it possible to adapt to the task several known approaches and methods originally oriented towards raster images and unformatted text, including “word clustering”, “rows first” detection, whitespace segmentation, and connected component analysis. The presented performance evaluation results demonstrate the effectiveness of solutions implementing this method. Quantitative comparison with analogues indicates that these solutions are consistent with the current state of the art in this area. At the same time, qualitative comparison reveals the following advantages over analogues. The implementation of the proposed table recognition method does not require preliminary parameter adjustment or supervised learning. However, if ready-to-use neural network models are available, they can replace the rule-based table detection algorithms. At the same time, the quality of the final results can be improved by filtering the candidate cases.

Keywords: table recognition, table extraction, unstructured data, document tables, document page layout analysis

doi: 10.25743/ICT.2024.29.6.008

Author(s):
Shigarov Alexei Olegovich
PhD.
Position: Leading research officer
Office: Institute for System Dynamics and Control Theory, Siberian Branch of RAS
Address: 664033, Russia, Irkutsk
Phone Office: (3952) 45-31-07
E-mail: shigarov@icc.ru
SPIN-code: 5159-9006

Bibliography link:
Shigarov A.O. Table recognition in untagged PDF documents using PDF-specific features // Computational technologies. 2024. V. 29. № 6. P. 125-146