ACM/IEEE Joint Conference on Digital Libraries 2017
University of Toronto
JCDL 2017 | #JCDL@2017
Wednesday, June 21 • 14:00 - 15:30
Paper Session 06: Text Extraction and Analysis


Hannah Bast and Claudius Korzen. A Benchmark and Evaluation for Text Extraction from PDF (Full)
Extracting the body text from a PDF document is an important but surprisingly difficult task. The reason is that PDF is a layout-based format, which specifies the fonts and positions of individual characters rather than the semantic units of the text (e.g., words or paragraphs) and their roles in the document (e.g., body text, footnote, or caption). There is an abundance of extraction tools, but their quality and the range of their functionality are hard to determine.
In this paper, we show how to construct a high-quality benchmark of, in principle, arbitrary size from parallel TeX and PDF data. We construct such a benchmark of 12,099 scientific articles from arXiv.org and make it publicly available. We establish a set of criteria for a clean and independent assessment of the semantic abilities of a given extraction tool. We provide an extensive evaluation of 13 state-of-the-art tools for text extraction from PDF on our benchmark according to these criteria. We include our own method, Icecite, which significantly outperforms all other tools but is still not perfect. We outline the remaining steps needed to finally make text extraction from PDF a solved problem.
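As a rough illustration of what such an assessment involves, a minimal sketch of one simple scoring scheme is shown below: bag-of-words precision/recall of a tool's output against ground-truth body text. This is an assumption for illustration only, not the criteria actually defined in the paper.

```python
from collections import Counter

def word_prf(extracted: str, ground_truth: str):
    """Return (precision, recall, F1) over word multisets, ignoring case and order."""
    ext = Counter(extracted.lower().split())
    gold = Counter(ground_truth.lower().split())
    # Multiset intersection counts each shared word up to its smaller frequency.
    overlap = sum((ext & gold).values())
    precision = overlap / max(sum(ext.values()), 1)
    recall = overlap / max(sum(gold.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A tool that drops one word scores perfect precision but imperfect recall.
p, r, f = word_prf("the quick fox", "the quick brown fox")
```

Real benchmarks must go further than this, e.g., checking reading order and distinguishing body text from footnotes and captions, which word multisets cannot capture.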

Kresimir Duretec, Andreas Rauber and Christoph Becker. A text extraction software benchmark based on a synthesized dataset (Full)
Text extraction plays an important role in data processing workflows in digital libraries. For example, it is a crucial prerequisite for evaluating the quality of migrated textual documents. Complex file formats make the extraction process error-prone and have made it very challenging to verify the correctness of extraction components.
Based on digital preservation and information retrieval scenarios, three quality requirements on the effectiveness of text extraction tools are identified: 1) is a certain text snippet correctly extracted from a document, 2) does the extracted text appear in the right order relative to other elements, and 3) is the structure of the text preserved?
A number of text extraction tools are available that fulfill these three quality requirements to varying degrees. However, systematic benchmarks to evaluate those tools are still missing, mainly due to the lack of proper datasets with accompanying ground truth.
The contribution of this paper is two-fold. First we describe a dataset generation method based on model driven engineering principles and use it to synthesize a dataset and its ground truth directly from a model. Second, we define a benchmark for text extraction tools and complete an experiment to calculate performance measures for several tools that cover the three quality requirements. The results demonstrate the benefits of the approach in terms of scalability and effectiveness in generating ground truth for content and structure of text elements.
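The three quality requirements can be sketched as toy predicates on plain strings. These simplifications are assumptions for illustration; the paper's actual benchmark computes measures against structured, model-generated ground truth.

```python
def snippet_extracted(extracted: str, snippet: str) -> bool:
    """Requirement 1: was a given text snippet extracted at all?"""
    return snippet in extracted

def order_preserved(extracted: str, first: str, second: str) -> bool:
    """Requirement 2: does `first` still precede `second` in the output?"""
    i, j = extracted.find(first), extracted.find(second)
    return 0 <= i < j <= len(extracted)

def structure_preserved(extracted: str, expected_paragraphs: int) -> bool:
    """Requirement 3 (crudely): did paragraph breaks survive extraction?"""
    return extracted.count("\n\n") + 1 == expected_paragraphs

output = "Title\n\nFirst paragraph.\n\nSecond paragraph."
```

The synthesized-dataset approach matters precisely because checks like these need ground truth (the expected snippets, order, and paragraph count) that real-world documents rarely come with.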

Tokinori Suzuki and Atsushi Fujii. Mathematical Document Categorization with Structure of Mathematical Expressions (Full)
A mathematical document is a document used for mathematical communication, for example, a math paper or a discussion in an online Q&A community. Mathematical document categorization (MDC) is the task of classifying mathematical documents into mathematical categories, e.g., probability theory and set theory. This task is important for supporting user search in today's widespread digital libraries and archiving services. Although mathematical expressions (MEs) in a document carry essential information, being central to communication especially in mathematical fields, how to utilize MEs for MDC has not been well studied. In this paper, we propose a classification method based on text combined with the structure of MEs, which is expected to reflect conventions and rules specific to a category. We also present document collections built for evaluating MDC systems, along with an investigation of category settings and their statistics. Our classification results demonstrate that the proposed method outperforms existing methods with state-of-the-art ME modeling in terms of F-measure.
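One plausible way to combine text with expression structure is to linearize each ME's operator tree into parent-child path tokens and merge them with the word tokens, giving a classifier a single feature set. The tree encoding and feature names below are illustrative assumptions, not the paper's actual ME model.

```python
def expr_path_features(tree):
    """Linearize a nested (operator, children...) tuple into parent>child path tokens."""
    op, *children = tree
    feats = []
    for child in children:
        if isinstance(child, tuple):
            feats.append(f"ME:{op}>{child[0]}")
            feats.extend(expr_path_features(child))  # recurse into subtrees
        else:
            feats.append(f"ME:{op}>{child}")
    return feats

def document_features(text, expr_trees):
    """Merge lowercased word tokens with structural ME tokens into one feature list."""
    feats = text.lower().split()
    for tree in expr_trees:
        feats.extend(expr_path_features(tree))
    return feats

# A toy rendering of P(A | B) as a hypothetical operator tree.
feats = document_features("conditional probability of events",
                          [("P", ("given", "A", "B"))])
```

The intuition is that structural tokens like `ME:P>given` recur across documents of the same category (here, probability theory) even when the surrounding prose varies, which is what lets ME structure contribute discriminative signal beyond the text alone.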


Room 205, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6
