ACM/IEEE Joint Conference on Digital Libraries 2017
University of Toronto
JCDL 2017 | #JCDL@2017
Tuesday, June 20 • 14:00 - 15:30
Paper Session 03: Collection Access and Indexing

Martin Toepfer and Christin Seifert. Descriptor-invariant Fusion Architectures for Automatic Subject Indexing: Analysis and Empirical Results on Short Texts (Full)

*VB Best Paper Nominee

Documents indexed with controlled vocabularies enable users of libraries to discover relevant documents, even across language barriers. Due to the rapid growth of scientific publications, digital libraries require automatic methods that index documents accurately, especially with regard to explicit or implicit concept drift, that is, with respect to new descriptor terms and new types of documents, respectively. This paper first analyzes architectures of related approaches on automatic indexing. We show that their design determines individual strengths and weaknesses and justify research on their fusion. In particular, systems benefit from statistical associative components as well as from lexical components applying dictionary matching, ranking, and binary classification. The analysis emphasizes the importance of descriptor-invariant learning, that is, learning based on features which can be transferred between different descriptors. Theoretic and experimental results on economic titles and author keywords underline the relevance of the fusion methodology in terms of overall accuracy and adaptability to dynamic domains. Experiments show that fusion strategies combining a binary relevance approach and a thesaurus-based system outperform all other strategies on the tested data set. Our findings can help researchers and practitioners in digital libraries to choose appropriate methods for automatic indexing.

Guillaume Chiron, Antoine Doucet, Mickaël Coustaty, Muriel Visani and Jean-Philippe Moreux. Impact of OCR errors on the use of digital libraries. Towards a better access to information. (Short) 
Digital collections are increasingly used for a variety of purposes. In Europe alone, we can conservatively estimate that tens of thousands of users consult digital libraries daily. This use is often motivated by qualitative and quantitative research. However, caution is advised, as most digitized documents are indexed through their OCRed version, which is far from perfect, especially for ancient documents. In this paper, we aim to estimate the impact of OCR errors on the use of a major online platform: the Gallica digital library from the French National Library. It accounts for more than 100M OCRed documents and receives 80M search queries every year. In this context, we introduce two main contributions. First, an original corpus of OCRed documents composed of 12M characters, along with the corresponding gold standard, is presented and provided, with an equal share of English- and French-written documents. Next, statistics on OCR errors have been computed using a novel alignment method introduced in this paper. Making use of all the user queries submitted to the Gallica portal over 4 months, we take advantage of our error model to propose an indicator for predicting the relative risk that queried terms mismatch targeted resources due to OCR errors, underlining the critical extent to which OCR quality affects digital library access.

David M. Weigl, Kevin R. Page, Peter Organisciak and J. Stephen Downie. Information-Seeking in Large-Scale Digital Libraries: Strategies for Scholarly Workset Creation (Short) 
Large-scale digital libraries such as the HathiTrust contain massive quantities of content combined from heterogeneous collections, with consequential challenges in providing mechanisms for discovery, unified access, and analysis. The HathiTrust Research Center has proposed 'worksets' as a solution for users to conduct their research into the 15 million volumes of HathiTrust content; however, existing models of users' information-seeking behaviour, which might otherwise inform workset development, were established before digital library resources existed at such a scale.

We examine whether these information-seeking models can sufficiently articulate the emergent user activities of scholarly investigation as perceived during the creation of worksets. We demonstrate that a combination of established models by Bates, Ellis, and Wilson can accommodate many aspects of information seeking in large-scale digital libraries at a broad, conceptual level. We go on to identify the supplemental information-seeking strategies necessary to specifically describe several workset creation exemplars.

Finally, we propose complementary additions to the existing models: we classify strategies as instances of querying, browsing, and contribution. Similarly, we introduce a notion of scope according to the interaction of a strategy with content, content-derived metadata, or contextual metadata. Considering the scope and modality of new and existing strategies within the composite model allows us to better express (and so aid our understanding of) information-seeking behaviour within large-scale digital libraries.

Peter Darch and Ashley Sands. Uncertainty About the Long-Term: Digital Libraries, Astronomy Data, and Open Source Software (Short) 
Digital library developers make critical design and implementation decisions in the face of uncertainties about the future. We present a qualitative case study of the Large Synoptic Survey Telescope (LSST), a major astronomy project that will collect large-scale datasets and make them accessible through a digital library. LSST developers make decisions now, while facing uncertainties about its period of operations (2022-2032). Uncertainties we identify include the topics researchers will seek to address, tools and expertise, and the availability of other astronomy infrastructure to exploit LSST observations. LSST is developing, and already releasing, its data management software as open source. We evaluate benefits and burdens of this approach as a strategy for addressing uncertainty. Benefits include: enabling software to adapt to researchers’ changing needs; embedding LSST standards and tools as community practices; and promoting interoperability with other infrastructure. Burdens include: open source community management; documentation requirements; and trade-offs between software speed and accessibility.

Jaimie Murdock, Jacob Jett, Timothy W. Cole, Yu Ma, J. Stephen Downie and Beth Plale. Towards Publishing Secure Capsule-based Analysis (Short) 
HathiTrust Digital Library (HTDL) is an example of a next-generation Big Data collection of digitized content. Consisting of over 15 million digitized volumes (books), HTDL is of immense utility to scholars of all disciplines, from chemistry to literary studies. Researchers engage through user browsing and, more recently, through computational analysis (information retrieval, computational linguistics, and text mining), from focused studies of a few dozen volumes to large-scale experiments on millions of volumes. Computational engagement with HTDL is confounded by the in-copyright status of the majority of the content, which requires that computational analysis on HTDL be carried out in a secure environment. This is provided by the HathiTrust Research Center (HTRC) Data Capsule service. A researcher is given a Capsule through which they carry out research for weeks to months. As a Capsule inherently limits flow in and out to protect the in-copyright data, this environment poses unique challenges in supporting researchers who wish to publish results from their research with a Capsule. We discuss recent advancements in our activities on provenance, workflows, worksets, and non-consumptive exports to aid a researcher in publishing results from Big Data analysis.


Timothy W. Cole

Professor, Mathematics Librarian and CIRSS Coordinator for Library Applications, University of Illinois at Urbana-Champaign

Peter Darch

University of Illinois at Urbana-Champaign

J. Stephen Downie

Co-PI HathiTrust Research Center, University of Illinois at Urbana-Champaign

Jaimie Murdock

PhD Student, Indiana University
Jaimie Murdock is a joint PhD student in Cognitive Science and Informatics. He studies the construction of knowledge representations and the dynamics of expertise. While majoring in two scientific disciplines, most of Jaimie's research occurs in the digital humanities, where he u...

Peter Organisciak

University of Illinois at Urbana-Champaign

Beth Plale

Co-PI HathiTrust Research Center, Indiana University
Science Director, Pervasive Technology Institute | Director, Data To Insight Center | Professor, Informatics and Computing | Indiana University

Ashley Sands

Senior Library Program Officer, Institute of Museum and Library Services
Managing a portfolio of grants and funding opportunities encompassed under the National Digital Platform emphasis of the Office of Library Services. In this position, I have a bird's-eye perspective of the development of knowledge infrastructures across and between academic resear...

Christin Seifert

University of Passau

Martin Toepfer

ZBW – Leibniz Information Centre for Economics

Innis Town Hall 2 Sussex Ave, Toronto, ON M5S 1J5

