Article information
2024, Volume 29, No. 6, p. 125-146
Shigarov A.O.
Table recognition in untagged PDF documents using PDF-specific features
Nowadays, PDF is one of the most popular formats for distributing print-oriented documents in electronic form. PDF documents are often untagged, i.e., their pages are represented only by low-level instructions for rendering text and graphics and are not accompanied by annotations of their structural components (headings, paragraphs, tables, etc.). Automatically recovering such annotations can make these structural components accessible. This requires solving a number of tasks, one of which is recognizing tables in untagged PDF documents: detecting the boundaries of their rows, columns, and cells. This paper proposes a method for recognizing tables in untagged PDF documents. Unlike existing analogues, it solves this task using PDF-specific features such as text output order, pen movement positions, etc. This made it possible to adapt to the task several known approaches and methods originally oriented towards raster images and unformatted text, including “word clustering”, “rows first” detection, whitespace segmentation, and connected component analysis. The presented performance evaluation results demonstrate the effectiveness of the solutions implementing this method. Quantitative comparison with analogues indicates that these solutions correspond to the current state of the art in this area. At the same time, qualitative comparison reveals the following advantages over analogues. The implementation of the proposed table recognition method does not require preliminary parameter adjustment or supervised learning. However, if ready-to-use neural network models are available, they can replace the rule-based table detection algorithms.
In addition, the quality of the final results can be improved by filtering the candidate cases.
Keywords: table recognition, table extraction, unstructured data, document tables, document page layout analysis
doi: 10.25743/ICT.2024.29.6.008
Author(s): Shigarov Alexei Olegovich, PhD
Position: Senior Research Scientist
Office: Institute for System Dynamics and Control Theory, Siberian Branch of RAS
Address: 664033, Russia, Irkutsk
Phone Office: (3952) 45-31-02
E-mail: shigarov@icc.ru
Bibliography link: Shigarov A.O. Table recognition in untagged PDF documents using PDF-specific features // Computational technologies. 2024. V. 29. No. 6. P. 125-146