JCDL2017 has ended
ACM/IEEE Joint Conference on Digital Libraries 2017
University of Toronto
JCDL 2017 | #JCDL@2017
Papers
Tuesday, June 20
 

11:00 EDT

Paper Session 01: Web Archives
Justin Brunelle, Michele Weigle and Michael Nelson. Archival Crawlers and JavaScript: Discover More Stuff but Crawl More Slowly (Full)
The web is today's primary publication medium, making web archiving an important activity for historical and analytical purposes. Web pages are correspondingly interactive, resulting in pages that are increasingly difficult to archive. JavaScript enables interactions that can potentially change the client-side state of a representation. We refer to representations that load embedded resources via JavaScript as deferred representations. It is difficult to discover and crawl all of the resources in deferred representations, and the result of archiving deferred representations is archived web pages that are either incomplete or erroneously load embedded resources from the live web. We propose a method of discovering and archiving deferred representations and their descendants (representation states) that are only reachable through client-side events. Our approach identified an average of 38.5 descendants per seed URI crawled, 70.9% of which are reached through an onclick event. This approach also added 15.6 times more embedded resources than Heritrix to the crawl frontier, but at a crawl rate that was 38.9 times slower than simply using Heritrix. If our method were applied to the July 2015 Common Crawl dataset, a web-scale archival crawler would discover an additional 7.17 PB (5.12 times more) of information per year. This illustrates the significant increase in resources necessary for more thorough archival crawls.
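As a rough illustration of the crawling strategy the abstract describes (dereference a page, fire client-side events, and record any newly requested embedded resources), the following is a minimal sketch using Selenium and the browser Performance API. The seed URL, the event set (onclick only), and the fixed waits are placeholders for illustration, not the authors' crawler.

```python
# Minimal sketch: discover embedded resources reachable only through onclick
# events on a deferred representation. Assumes a local chromedriver; the seed
# URI and the sleep-based waiting strategy are placeholders.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

def resource_uris(driver):
    """URIs of embedded resources loaded so far (browser Performance API)."""
    return set(driver.execute_script(
        "return performance.getEntriesByType('resource').map(e => e.name);"))

driver = webdriver.Chrome()
driver.get("https://example.com/")          # seed URI (placeholder)
time.sleep(2)                               # let the initial representation settle
baseline = resource_uris(driver)

descendants = []
for element in driver.find_elements(By.CSS_SELECTOR, "[onclick]"):
    try:
        element.click()                     # fire the client-side event
        time.sleep(1)
        new = resource_uris(driver) - baseline
        if new:                             # a new client-side state (descendant)
            descendants.append(new)
        baseline |= new
    except Exception:
        continue                            # stale or non-interactable element

print(f"{len(descendants)} descendants; "
      f"{sum(len(d) for d in descendants)} extra URIs for the crawl frontier")
driver.quit()
```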

Faryaneh Poursardar and Frank Shipman. What is Part of that Resource? User Expectations for Personal Archiving (Short)
Users wish to preserve Internet resources for later use. But what is part of and what is not part of an Internet resource remains an open question. In this paper we examine how specific relationships between web pages affect user perceptions of their being part of the same resource. This study presented participants with pairs of pages and asked about their expectation for having access to the second page after they save the first. The primary-page content in the study comes from multi-page stories, multi-image collections, product pages with reviews and ratings on separate pages, and short single-page writings. Participants were asked to agree or disagree with three statements regarding their expectation for later access. Nearly 80% of participants agreed in the case of articles spread across multiple pages, images in the same collection, and additional details or assessments of product information. About 50% agreed for related content on pages linked to by the original page or related items, while only about 30% thought advertisements or wish lists linked to were part of the resource. Differences in responses to the same page pairs for the three statements regarding later access indicate users recognize the difference between what would be valuable to them and current implementations of saving web content.

Weijia Xu, Maria Esteva, Deborah Beck and Yi-Hsuan Hsieh. A Portable Strategy for Preserving Web Applications and Data Functionality (Short)
Increasingly, the value of research data resides not only in its content but also in how it is made available to users. To introduce complex research topics, data is often presented interactively through a web application whose design is the result of years of work by researchers. Therefore, preserving the data and the application's functionalities becomes equally important. In the current academic IT environment it is often the case that these web applications are developed by multiple people with different expertise, deployed within shared technology infrastructures, and that they evolve technically and in content over short periods of time. This lifecycle model presents challenges to the reproducibility and portability of the application across technology platforms over time. Preservation approaches such as virtualization and emulation may not be applicable to these cases due, among other issues, to the co-dependencies of the hosting infrastructure, to missing documentation about the original development, and to the evolving nature of these applications. To address these issues, we propose a functional preservation strategy that decouples web applications and their corresponding data from their hosting environment and re-launches data and web code in a more portable environment without compromising the look and feel or the interactive features. Crucial to the strategy is identifying discrepancies between the application and the existing hosting environment, including library dependencies and system configuration, and adapting them within a simplified virtual environment to bring up the application's functionality. Advantages over virtualization and emulation reside in not having to recreate one or all of the layers of the original hosting environment. We demonstrate this approach using the Speech Presentation in Homeric Epics database, a digital humanities project, as a case study, and evaluate portability by deploying the application in two different hosting environments. We also assess the strategy in relation to the ease with which a non-savvy user can re-launch the application in a new host.

Sawood Alam, Mat Kelly, Michele Weigle and Michael Nelson. Client-side Reconstruction of Composite Mementos Using ServiceWorker (Short)
We use the ServiceWorker (SW) web API to intercept HTTP requests for embedded resources and reconstruct Composite Mementos without the need for the conventional URL rewriting typically performed by web archives. URL rewriting is a problem for archival replay systems, especially for URLs constructed by JavaScript, and frequently results in incorrect URI references. By intercepting requests on the client using SW, we are able to strategically reroute instead of rewrite. Our implementation moves rewriting to clients, saving servers' computing resources and allowing servers to return responses more quickly. Our experiments show that retrieving the original instead of rewritten pages from the archive reduces time overhead by 35.66% and data overhead by 19.68%. Our system prevents Composite Mementos from leaking the live web while being easy to distribute and maintain.

Moderators

Martin Klein

Scientist, Los Alamos National Laboratory

Speakers

Sawood Alam

Researcher, Old Dominion University
I am commonly tagged by Web, Digital Library, Web Archiving, Ruby on Rails, PHP, HTML, CSS, JavaScript, ExtJS, Go, Urdu, RTL, Docker, and Linux.

Justin F. Brunelle

Lead Researcher, The MITRE Corporation
Lead Researcher at The MITRE Corporation and Adjunct Assistant Professor at Old Dominion University. Research interests include: web science, digital preservation, cloud computing, emerging technologies

Maria Esteva

Research Scientist, University of Texas at Austin

Michael Nelson

Professor, Old Dominion University

Faryaneh Poursardar

PhD Candidate, Texas A&M University
Web archive, HCI

Michele Weigle

Associate Professor, Old Dominion University


Tuesday June 20, 2017 11:00 - 12:30 EDT
Innis Town Hall 2 Sussex Ave, Toronto, ON M5S 1J5

14:00 EDT

Paper Session 02: Semantics and Linking
Pavlos Fafalios, Helge Holzmann, Vaibhav Kasturia and Wolfgang Nejdl. Building and Querying Semantic Layers for Web Archives (Full)

*VB Best Paper Award Nominee

Web archiving is the process of collecting portions of the Web to ensure the information is preserved for future exploitation. However, despite the increasing number of web archives worldwide, the absence of efficient and meaningful exploration methods still remains a major hurdle in the way of turning them into a usable and useful information source. In this paper, we elaborate on this problem and propose an RDF/S model and a distributed framework for building semantic profiles ("layers") that describe semantic information about the contents of web archives. A semantic layer allows describing metadata information about the archived documents, annotating them with useful semantic information (like entities, concepts and events), and publishing all this data on the Web as Linked Data. Such structured repositories offer advanced query and integration capabilities and make web archives directly exploitable by other systems and tools. To demonstrate their query capabilities, we build and query semantic layers for three different types of web archives. An experimental evaluation showed that a semantic layer can answer information needs that existing keyword-based systems are not able to sufficiently satisfy.
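As a loose sketch of what publishing such a "semantic layer" as Linked Data might look like (the vocabulary and URIs below are invented for illustration and are not the authors' RDF/S model), one could annotate an archived document with an extracted entity and query it with SPARQL via rdflib:

```python
# Toy semantic layer: annotate an archived document with an entity mention and
# query it. The "ex:" vocabulary is invented; the paper defines its own RDF/S model.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/semlayer#")
g = Graph()

doc = URIRef("http://archive.example.org/memento/20100301/page1")
g.add((doc, RDF.type, EX.ArchivedDocument))
g.add((doc, EX.crawlDate, Literal("2010-03-01", datatype=XSD.date)))
g.add((doc, EX.mentionsEntity, URIRef("http://dbpedia.org/resource/Web_archiving")))

# Which archived documents mention a given entity, and when were they crawled?
query = """
PREFIX ex: <http://example.org/semlayer#>
SELECT ?doc ?date WHERE {
    ?doc ex:mentionsEntity <http://dbpedia.org/resource/Web_archiving> ;
         ex:crawlDate ?date .
}
"""
for row in g.query(query):
    print(row.doc, row.date)
```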

Abhik Jana, Sruthi Mooriyath, Animesh Mukherjee and Pawan Goyal. WikiM: Metapaths based Wikification of Scientific Abstracts (Full)
In order to disseminate the exponential extent of knowledge being produced in the form of scientific publications, it would be best to design mechanisms that connect it with an already existing rich repository of concepts -- Wikipedia. Not only does this make scientific reading simple and easy (by connecting the concepts used in scientific articles to their Wikipedia explanations) but it also improves the overall quality of the article. In this paper, we present a novel metapath-based method, WikiM, to efficiently wikify scientific abstracts -- a topic that has been rarely investigated in the literature. One of the prime motivations for this work comes from the observation that wikified abstracts of scientific documents help a reader decide better, in comparison to plain abstracts, whether (s)he would be interested in reading the full article. We perform mention extraction mostly through traditional tf-idf measures coupled with a set of smart filters. The entity linking heavily leverages the rich citation and author publication networks. Our observation is that various metapaths defined over these networks can significantly enhance the overall performance of the system. For mention extraction and entity linking, we outperform most of the competing state-of-the-art techniques by a large margin, arriving at precision values of 72.42% and 73.8% respectively over a dataset from the ACL Anthology Network. In order to establish the robustness of our scheme, we wikify three other datasets and get precision values of 63.41%-94.03% and 67.67%-73.29% respectively for the mention extraction and the entity linking phases.
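The tf-idf-based mention extraction step could be sketched roughly as follows; the toy corpus, the n-gram range, and the simple stop-word filter are assumptions for illustration, not WikiM's actual pipeline or filters.

```python
# Rough sketch of tf-idf-based candidate mention extraction from an abstract.
# The toy corpus and filters stand in for WikiM's actual components.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "We present a neural network model for named entity recognition.",
    "Latent Dirichlet allocation is a generative topic model for text corpora.",
    "We study citation networks and author publication networks in digital libraries.",
]
abstract = corpus[2]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
vectorizer.fit(corpus)                        # idf statistics from the collection
scores = vectorizer.transform([abstract]).toarray()[0]
terms = vectorizer.get_feature_names_out()

# Keep the top-scoring candidate mentions for the entity-linking stage.
candidates = sorted(
    ((terms[i], s) for i, s in enumerate(scores) if s > 0),
    key=lambda x: x[1], reverse=True)[:5]
print(candidates)
```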

Jian Wu, Sagnik Ray Choudhury, Agnese Chiatti, Chen Liang and C. Lee Giles. HESDK: A Hybrid Approach to Extracting Scientific Domain Knowledge Entities (Short)
Automatic keyphrase extraction from scientific documents is a well-known problem. We investigate a variant of that problem: Scientific Domain Knowledge Entity (SDKE) extraction. Keyphrases are noun phrases that are important to the document. In contrast, an SDKE is a span of text that refers to a concept and can be classified as a process, material, task, dataset, etc. Supervised keyphrase extraction algorithms using non-sequential classifiers and global measures of informativeness (PMI, tf-idf) are good candidates for this task. Another approach is to use sequential labeling algorithms with local context from a sentence, as done in named entity recognition tasks. We show that these methods can complement each other and that a simple merging can improve the extraction accuracy by 5-7 percentage points. We further propose several heuristics to improve the extraction accuracy. Our preliminary experiments suggest that it is possible to improve the accuracy of the sequential learner itself by utilizing the predictions of the non-sequential model.
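A "simple merging" of the two extractors' outputs, of the kind the abstract alludes to, might be no more than a union that resolves overlapping spans; the heuristic below (prefer the longer span) is a guess for illustration, not the paper's rule.

```python
# Hypothetical merge of SDKE spans predicted by a non-sequential (keyphrase-style)
# model and a sequential (NER-style) labeler. Spans are (start, end) character
# offsets; on overlap we keep the longer span -- an illustrative heuristic only.
def merge_spans(non_sequential, sequential):
    merged = []
    for span in sorted(set(non_sequential) | set(sequential)):
        if merged and span[0] < merged[-1][1]:          # overlaps previous span
            if span[1] - span[0] > merged[-1][1] - merged[-1][0]:
                merged[-1] = span                       # keep the longer one
        else:
            merged.append(span)
    return merged

print(merge_spans([(0, 14), (30, 42)], [(5, 20), (50, 57)]))
# -> [(5, 20), (30, 42), (50, 57)]
```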

Xiao Yang, Dafang He, Wenyi Huang, Zihan Zhou, Alexander Ororbia, Daniel Kifer and C. Lee Giles. Smart Library: Identifying Books in a Library using Richly Supervised Deep Scene Text  (Short)
Physical library collections are valuable and long-standing resources for knowledge and learning. However, managing books on a large bookshelf and finding books on it often leads to tedious manual work, especially for large book collections where books might be missing or misplaced. Recently, deep neural-based models have achieved great success in scene text detection and recognition. Motivated by these recent successes, we aim to investigate their viability in facilitating book management, a task that introduces further challenges, including large amounts of cluttered scene text, distortion, and varied lighting conditions. In this paper, we present a library inventory building and retrieval system based on scene text reading methods. We specifically design our scene text recognition model using rich supervision to accelerate training and achieve state-of-the-art performance on several benchmark datasets. Our proposed system has the potential to greatly reduce the amount of human labor required in managing book inventories as well as the space needed to store book information.

Moderators

Faryaneh Poursardar

PhD Candidate, Texas A&M University
Web archive, HCI

Speakers

Tuesday June 20, 2017 14:00 - 15:30 EDT
Room 325, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6

14:00 EDT

Paper Session 03: Collection Access and Indexing
Martin Toepfer and Christin Seifert. Descriptor-invariant Fusion Architectures for Automatic Subject Indexing: Analysis and Empirical Results on Short Texts (Full)

*VB Best Paper Award Nominee

Documents indexed with controlled vocabularies enable users of libraries to discover relevant documents, even across language barriers. Due to the rapid growth of scientific publications, digital libraries require automatic methods that index documents accurately, especially with regard to explicit or implicit concept drift, that is, with respect to new descriptor terms and new types of documents, respectively. This paper first analyzes architectures of related approaches to automatic indexing. We show that their design determines individual strengths and weaknesses and justify research on their fusion. In particular, systems benefit from statistical associative components as well as from lexical components applying dictionary matching, ranking, and binary classification. The analysis emphasizes the importance of descriptor-invariant learning, that is, learning based on features which can be transferred between different descriptors. Theoretical and experimental results on economic titles and author keywords underline the relevance of the fusion methodology in terms of overall accuracy and adaptability to dynamic domains. Experiments show that fusion strategies combining a binary relevance approach and a thesaurus-based system outperform all other strategies on the tested data set. Our findings can help researchers and practitioners in digital libraries to choose appropriate methods for automatic indexing.

Guillaume Chiron, Antoine Doucet, Mickaël Coustaty, Muriel Visani and Jean-Philippe Moreux. Impact of OCR Errors on the Use of Digital Libraries: Towards a Better Access to Information (Short)
Digital collections are increasingly used for a variety of purposes. In Europe alone, we can conservatively estimate that tens of thousands of users consult digital libraries daily. The usages are often motivated by qualitative and quantitative research. However, caution must be advised, as most digitized documents are indexed through their OCRed version, which is far from perfect, especially for ancient documents. In this paper, we aim to estimate the impact of OCR errors on the use of a major online platform: the Gallica digital library from the French National Library. It accounts for more than 100M OCRed documents and receives 80M search queries every year. In this context, we introduce two main contributions. First, an original corpus of OCRed documents composed of 12M characters, along with the corresponding gold standard, is presented and provided, with an equal share of English- and French-written documents. Next, statistics on OCR errors have been computed thanks to a novel alignment method introduced in this paper. Making use of all the user queries submitted to the Gallica portal over 4 months, we take advantage of our error model to propose an indicator for predicting the relative risk that queried terms mismatch targeted resources due to OCR errors, underlining the critical extent to which OCR quality impacts digital library access.
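A very simplified version of such a risk indicator, assuming independent per-character OCR error rates (the paper's actual error model, learned from its aligned corpus, is more sophisticated), could look like this:

```python
# Simplified risk indicator: probability that at least one character of a query
# term is mis-recognized, assuming independent per-character error rates.
# The error-rate table below is made up for illustration.
def mismatch_risk(term, char_error_rate, default_rate=0.01):
    p_clean = 1.0
    for ch in term:
        p_clean *= 1.0 - char_error_rate.get(ch, default_rate)
    return 1.0 - p_clean        # risk that the OCRed form differs from the query

# Hypothetical per-character error rates (e.g. long 's', 'e'/'c' confusions).
rates = {"e": 0.02, "c": 0.03, "s": 0.04, "l": 0.05}
for query in ["liberte", "presse", "canal"]:
    print(query, round(mismatch_risk(query, rates), 3))
```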

David M. Weigl, Kevin R. Page, Peter Organisciak and J. Stephen Downie. Information-Seeking in Large-Scale Digital Libraries: Strategies for Scholarly Workset Creation (Short) 
Large-scale digital libraries such as the HathiTrust contain massive quantities of content combined from heterogeneous collections, with consequential challenges in providing mechanisms for discovery, unified access, and analysis. The HathiTrust Research Center has proposed 'worksets' as a solution for users to conduct their research into the 15 million volumes of HathiTrust content; however, existing models of users' information-seeking behaviour, which might otherwise inform workset development, were established before digital library resources existed at such a scale.

We examine whether these information-seeking models can sufficiently articulate the emergent user activities of scholarly investigation as perceived during the creation of worksets. We demonstrate that a combination of established models by Bates, Ellis, and Wilson can accommodate many aspects of information seeking in large-scale digital libraries at a broad, conceptual, level. We go on to identify the supplemental information-seeking strategies necessary to specifically describe several workset creation exemplars.

Finally, we propose complementary additions to the existing models: we classify strategies as instances of querying, browsing, and contribution. Similarly we introduce a notion of scope according to the interaction of a strategy with content, content-derived metadata, or contextual metadata. Considering the scope and modality of new and existing strategies within the composite model allows us to better express--and so aid our understanding of--information-seeking behaviour within large-scale digital libraries.

Peter Darch and Ashley Sands. Uncertainty About the Long-Term: Digital Libraries, Astronomy Data, and Open Source Software (Short) 
Digital library developers make critical design and implementation decisions in the face of uncertainties about the future. We present a qualitative case study of the Large Synoptic Survey Telescope (LSST), a major astronomy project that will collect large-scale datasets and make them accessible through a digital library. LSST developers make decisions now, while facing uncertainties about its period of operations (2022-2032). Uncertainties we identify include topics researchers will seek to address, tools and expertise, and the availability of other astronomy infrastructure to exploit LSST observations. LSST is developing, and already releasing, its data management software open source. We evaluate benefits and burdens of this approach as a strategy for addressing uncertainty. Benefits include: enabling software to adapt to researchers’ changing needs; embedding LSST standards and tools as community practices; and promoting interoperability with other infrastructure. Burdens include: open source community management; documentation requirements; and trade-offs between software speed and accessibility.

Jaimie Murdock, Jacob Jett, Timothy W. Cole, Yu Ma, J. Stephen Downie and Beth Plale. Towards Publishing Secure Capsule-based Analysis (Short) 
HathiTrust Digital Library (HTDL) is an example of a next-generation Big Data collection of digitized content. Consisting of over 15 million digitized volumes (books), HTDL is of immense utility to scholars of all disciplines, from chemistry to literary studies. Researchers engage through user browsing and, more recently, through computational analysis (information retrieval, computational linguistics, and text mining), from focused studies of a few dozen volumes to large-scale experiments on millions of volumes. Computational engagement with HTDL is confounded by the in-copyright status of the majority of the content, which requires that computational analysis on HTDL be carried out in a secure environment. This is provided by the HathiTrust Research Center (HTRC) Data Capsule service. A researcher is given a Capsule through which they carry out research for weeks to months. As a Capsule inherently limits flow in and out to protect the in-copyright data, this environment poses unique challenges for researchers who wish to publish results from their research with a Capsule. We discuss recent advancements in our activities on provenance, workflows, worksets, and non-consumptive exports to aid a researcher in publishing results from Big Data analysis.

Moderators
Speakers

Timothy W. Cole

University of Illinois at Urbana-Champaign

Peter Darch

University of Illinois at Urbana-Champaign

J. Stephen Downie

Associate Dean & Professor, School of Information Sciences, University of Illinois

Yu (Marie) Ma

HathiTrust Research Center

Jaimie Murdock

Indiana University Bloomington
Jaimie Murdock is a joint PhD student in Cognitive Science and Informatics. He studies the construction of knowledge representations and the dynamics of expertise. While majoring in two scientific disciplines, most of Jaimie's research occurs in the digital humanities, where he uses...

Peter Organisciak

University of Illinois at Urbana-Champaign

Kevin Page

University of Oxford

Beth Plale

Executive Director, Pervasive Technology Institute, Indiana University
Beth is the Michael A and Laurie Burns McRobbie Bicentennial Professor of Computer Engineering in the Luddy School of Informatics, Computing, and Engineering at Indiana University Bloomington (IUB). Pervasive Technology Institute at Indiana University is an Indiana University institute advancing innovation in computational science and cyberinfrastructure in areas of artificial intelligence, large scale data, workforce development, and computational infrastructure and software services. Indiana University...

Ashley Sands

Senior Library Program Officer, Institute of Museum and Library Services
Managing a portfolio of grants and funding opportunities encompassed under the National Digital Platform emphasis of the Office of Library Services. In this position, I have a birds-eye perspective of the development of knowledge infrastructures across and between academic research...

Christin Seifert

University of Passau

Martin Toepfer

ZBW – Leibniz Information Centre for Economics


Tuesday June 20, 2017 14:00 - 15:30 EDT
Innis Town Hall 2 Sussex Ave, Toronto, ON M5S 1J5
 
Wednesday, June 21
 

11:00 EDT

Paper Session 04: Citation Analysis
Saeed-Ul Hassan, Anam Akram and Peter Haddawy. Identifying Important Citations using Contextual Information from Full Text (Full)

*VB Best Paper Award Nominee

In this paper we address the problem of classifying cited work as important or non-important to the developments presented in a research publication. This task is vital for algorithmic techniques that detect and follow emerging research topics and for qualitatively measuring the impact of publications in the increasingly growing scholarly big data. We consider cited work as important to a publication if that work is used or extended in some way. If a reference is cited as background work or for the purpose of comparing results, the cited work is considered to be non-important. By employing five classification techniques (Support Vector Machine, Naïve Bayes, Decision Tree, K-Nearest Neighbors and Random Forest) on an annotated dataset of 465 citations, we explore the effectiveness of eight previously published features and six novel features (including context-based, cue-word-based and textual features). Within this set, our new features are among the best performing. Using the Random Forest classifier, we achieve an overall classification performance of 0.91 AUC.
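A bare-bones version of this classification setup (a Random Forest over a matrix of context, cue-word, and textual features, evaluated by AUC) might look like the sketch below; the random feature matrix stands in for the paper's fourteen real features and its annotated citations.

```python
# Sketch of the classification setup: Random Forest on citation features,
# evaluated with AUC. Random data stands in for the 465 annotated citations.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(465, 14))              # context, cue-word, textual features
y = rng.integers(0, 2, size=465)            # 1 = important citation, 0 = incidental

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```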

Luca Weihs and Oren Etzioni. Learning to Predict Citation-Based Impact Measures (Full)
Citations implicitly encode a community's judgment of a paper's importance and thus provide a unique signal by which to study scientific impact. Efforts in understanding and refining this signal are reflected in the probabilistic modeling of citation networks and the proliferation of citation-based impact measures such as Hirsch's h-index. While these efforts focus on understanding the past and present, they leave open the question of whether scientific impact can be predicted into the future. Recent work addressing this deficiency has employed linear and simple probabilistic models; we show that these results can be handily outperformed by leveraging non-linear techniques. In particular, we find that these AI methods can predict measures of scientific impact for papers and authors, namely citation rates and h-indices, with surprising accuracy, even 10 years into the future. Moreover, we demonstrate how existing probabilistic models for paper citations can be extended to better incorporate refined prior knowledge. Of course, predictions of "scientific impact" should be approached with healthy skepticism, but our results improve upon prior efforts and form a baseline against which future progress can be easily judged.

Mayank Singh, Ajay Jaiswal, Priya Shree, Arindam Pal, Animesh Mukherjee and Pawan Goyal. Understanding the Impact of Early Citers on Long-Term Scientific Impact (Full)

Moderators
Speakers

Saeed-Ul Hassan

Assistant Professor, Information Technology University

Mayank Singh

Indian Institute of Technology Kharagpur

Luca Weihs

University of Washington


Wednesday June 21, 2017 11:00 - 12:30 EDT
Innis Town Hall 2 Sussex Ave, Toronto, ON M5S 1J5

11:00 EDT

Paper Session 05: Exploring and Analyzing Collections
Felix Hamborg, Norman Meuschke and Bela Gipp. Matrix-based News Aggregation: Exploring Different News Perspectives (Full)

*Best Student Paper Award Nominee

News aggregators are able to cope with the large amount of news that is published nowadays. However, they focus on the presentation of important, common information, but do not reveal the different perspectives within one topic. Thus, they suffer from media bias, a phenomenon that describes differences in news, such as in their content or tone. Finding these differences is crucial to reduce the effects of media bias. This paper presents matrix-based news analysis (MNA), a novel design for news exploration. MNA helps users gain a broad and diverse news understanding by presenting them various news perspectives on the same topic. Furthermore, we present NewsBird, a news aggregator that implements MNA to find different perspectives on international news topics. The results of a case study demonstrate that NewsBird broadens the user’s news understanding while it also provides similar news aggregation functionalities as established systems.

Nicholas Cole, Alfie Abdul-Rahman and Grace Mallon. Quill: A Framework for Constructing Negotiated Texts - with a Case Study on the US Constitutional Convention of 1787 (Full)

*VB Best Paper Award Nominee


This paper describes a new approach to the presentation of records relating to formal negotiations and the texts that they create. It describes the architecture of a model, platform, and web interface that can be used by domain experts (with minimal training) to convert the records typical of formal negotiations into a model of decision-making. This model has implications for both research and teaching, by allowing for better qualitative and quantitative analysis of negotiations. The platform emphasizes reconstructing as closely as possible the context within which proposals and decisions are made. The generic platform, its usability, and its benefits are illustrated by a presentation of the records relating to the 1787 Constitutional Convention that wrote the Constitution of the United States.

Kevin Page, Sean Bechhofer, Georgy Fazekas, David Weigl and Thomas Wilmering. Realising a Layered Digital Library: Exploration and Analysis of the Live Music Archive through Linked Data (Full)
Building upon a collection with functionality for discovery and analysis has been described by Lynch as a 'layered' approach to digital libraries. Meanwhile, as digital corpora have grown in size, their analysis is necessarily supplemented by automated application of computational methods, which can create layers of information as intricate and complex as those within the content itself. This combination of layers -- aggregating homogeneous collections, specialised analyses, and new observations -- requires a flexible approach to systems implementation which enables pathways through the layers via common points of understanding, while simultaneously accommodating the emergence of previously unforeseen layers.

In this paper we follow a Linked Data approach to build a layered digital library based on content from the Internet Archive Live Music Archive. Starting from the recorded audio and basic information in the Archive, we first deploy a layer of catalogue metadata which allows an initial -- if imperfect -- consolidation of performer, song, and venue information. A processing layer extracts audio features from the original recordings, workflow provenance, and summary feature metadata. A further analysis layer provides tools for the user to combine audio and feature data, discovered and reconciled using interlinked catalogue and feature metadata from layers below.

Finally, we demonstrate the feasibility of the system through an investigation of 'key typicality' across performances. This highlights the need to incorporate robustness to inevitable 'imperfections' when undertaking scholarship within the digital library, be that from mislabelling, poor quality audio, or intrinsic limitations of computational methods. We do so not with the assumption that a 'perfect' version can be reached; but that a key benefit of a layered approach is to allow accurate representations of information to be discovered, combined, and investigated for informed interpretation.

Moderators
Speakers

Nicholas Cole

University of Oxford

Gyorgy Fazekas

Queen Mary University

Norman Meuschke

PhD Student, University of Konstanz, Germany
Research interests: Information Retrieval for text, images, and mathematical content; Plagiarism Detection; News Analysis; Citation and Link Analysis; Blockchain Technology; Information Visualization

Kevin Page

University of Oxford


Wednesday June 21, 2017 11:00 - 12:30 EDT
Room 205, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6

14:00 EDT

Paper Session 06: Text Extraction and Analysis
Hannah Bast and Claudius Korzen. A Benchmark and Evaluation for Text Extraction from PDF (Full)
Extracting the body text from a PDF document is an important but surprisingly difficult task. The reason is that PDF is a layout-based format which specifies the fonts and positions of the individual characters rather than the semantic units of the text (e.g., words or paragraphs) and their role in the document (e.g., body text or footnote or caption). There is an abundance of extraction tools, but their quality and the range of their functionality are hard to determine.
In this paper, we show how to construct a high-quality benchmark of principally arbitrary size from parallel TeX and PDF data. We construct such a benchmark of 12,099 scientific articles from arXiv.org and make it publicly available. We establish a set of criteria for a clean and independent assessment of the semantic abilities of a given extraction tool. We provide an extensive evaluation of 13 state-of-the-art tools for text extraction from PDF on our benchmark according to our criteria. We include our own method, Icecite, which significantly outperforms all other tools, but is still not perfect. We outline the remaining steps necessary to finally make text extraction from PDF a solved problem.

Kresimir Duretec, Andreas Rauber and Christoph Becker. A text extraction software benchmark based on a synthesized dataset (Full)
Text extraction plays an important role in data processing workflows in digital libraries. For example, it is a crucial prerequisite for evaluating the quality of migrated textual documents. Complex file formats make the extraction process error-prone and have made it very challenging to verify the correctness of extraction components.
Based on digital preservation and information retrieval scenarios, three quality requirements for the effectiveness of text extraction tools are identified: 1) is a certain text snippet correctly extracted from a document, 2) does the extracted text appear in the right order relative to other elements, and 3) is the structure of the text preserved.
A number of text extraction tools are available that fulfil these three quality requirements to various degrees. However, systematic benchmarks to evaluate those tools are still missing, mainly due to the lack of proper datasets with accompanying ground truth.
The contribution of this paper is two-fold. First, we describe a dataset generation method based on model-driven engineering principles and use it to synthesize a dataset and its ground truth directly from a model. Second, we define a benchmark for text extraction tools and complete an experiment to calculate performance measures for several tools that cover the three quality requirements. The results demonstrate the benefits of the approach in terms of scalability and effectiveness in generating ground truth for the content and structure of text elements.
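Requirements 1) and 2) above can be checked, in a toy form at least, by testing whether each ground-truth snippet occurs in the extracted text and whether the snippets occur in the same relative order. The functions below are an illustrative reduction, not the benchmark's actual measures.

```python
# Toy checks for two of the three requirements: snippet correctness (1) and
# relative order (2). The benchmark's real measures are richer than this.
def snippets_present(extracted, snippets):
    return [s in extracted for s in snippets]

def order_preserved(extracted, snippets):
    positions = [extracted.find(s) for s in snippets if s in extracted]
    return positions == sorted(positions)

ground_truth = ["Chapter 1", "It was the best of times", "Chapter 2"]
extracted = "Chapter 1 It was the best of times ... Chapter 2 ..."
print(snippets_present(extracted, ground_truth))   # [True, True, True]
print(order_preserved(extracted, ground_truth))    # True
```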

Tokinori Suzuki and Atsushi Fujii. Mathematical Document Categorization with Structure of Mathematical Expressions (Full)
A mathematical document is a document used for mathematical communication, for example a math paper or a discussion in an online Q&A community. Mathematical document categorization (MDC) is the task of classifying mathematical documents into mathematical categories, e.g., probability theory and set theory. This is an important task for supporting user search in today's widespread digital libraries and archiving services. Although mathematical expressions (MEs) in a document can carry essential information, being at the center of communication especially in math fields, how to utilize MEs for MDC has not matured. In this paper, we propose a classification method based on text combined with structures of MEs, which are supposed to reflect conventions and rules specific to a category. We also present document collections built for evaluating MDC systems, with an investigation of the categorical settings and their statistics. We demonstrate that our proposed method outperforms existing methods with state-of-the-art ME modeling in terms of F-measure.


Wednesday June 21, 2017 14:00 - 15:30 EDT
Room 205, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6

14:00 EDT

Paper Session 07: Collection Building
Federico Nanni, Simone Paolo Ponzetto and Laura Dietz. Building Entity-Centric Event Collections (Full)

*Best Student Paper Award Nominee

Web archives preserve an unprecedented abundance of primary sources for the diachronic tracking, examination and -- ultimately -- understanding of major events and transformations in our society. A topic of interest is, for example, the rise of Euroscepticism as a consequence of the recent economic crisis.
We present an approach for building event-centric sub-collections from large archives, which includes not only the core documents related to the event itself but, even more importantly, documents which describe related aspects (e.g., premises and consequences). This is achieved by 1) identifying relevant concepts and entities from a knowledge base, and 2) detecting mentions of these entities in documents, which is interpreted as an indicator for relevance. We extensively evaluate our system on two diachronic corpora, the New York Times Corpus and the US Congressional Record, and we test its performance on the TREC KBA Stream corpus, a large and publicly available web archive.

Jan R. Benetka, Krisztian Balog and Kjetil Nørvåg. Towards Building a Knowledge Base of Monetary Transactions from a News Collection (Full)
We address the problem of extracting structured representations of economic events from a large corpus of news articles, using a combination of natural language processing and machine learning techniques. The developed techniques allow for semi-automatic population of a financial knowledge base, which, in turn, may be used to support a range of data mining and exploration tasks. The key challenge we face in this domain is that the same event is often reported multiple times, with varying correctness of details. We address this challenge by first collecting all information pertinent to a given event from the entire corpus, then considering all possible representations of the event, and finally using a supervised learning method to rank these representations by their associated confidence scores. A main innovative element of our approach is that it jointly extracts and stores all attributes of the event as a single representation (quintuple). Using a purpose-built test set we demonstrate that our supervised learning approach can achieve a 25% improvement in F1-score over baseline methods that consider the earliest, the latest or the most frequent reporting of the event.

Alexander Nwala, Michael Nelson, Michele Weigle, Adam Ziegler and Anastasia Aizman. Local Memory Project: providing tools to build collections of stories for local events from local sources (Full)
The national (non-local) news media have different priorities than the local news media. If one seeks to build a collection of stories about local events, the national news media may be insufficient, with the exception of local news which "bubbles" up to the national news media. If we rely exclusively on national media, or build collections exclusively from their reports, we could be late to the important milestones which precipitate major local events and thus run the risk of losing important stories due to link rot and content drift. Consequently, it is important to consult local sources affected by local events. Our goal is to provide a suite of tools (beginning with two) under the umbrella of the Local Memory Project (LMP) to help users and small communities discover, collect, build, archive, and share collections of stories for important local events by leveraging local news sources. The first service (Geo) returns a list of local news sources (newspaper, TV, and radio stations) in order of proximity to a user-supplied zip code. The second service (Local Stories Collection Generator) discovers, collects and archives a collection of news stories about a story or event represented by a user-supplied query and zip code pair. We evaluated 20 pairs of collections - Local (generated by our system) and non-Local - by measuring archival coverage, tweet index rate, temporal range, precision, and sub-collections overlap. Our experimental results showed Local and non-Local collections with archive rates of 0.63 and 0.83, respectively, and tweet index rates of 0.59 and 0.80, respectively. Local collections produced older stories than non-Local collections, but had a lower precision (relevance) of 0.77 compared to a non-Local precision of 0.91. These results indicate that Local collections are less exposed, and thus less popular, than their non-Local counterparts.

Moderators

Justin F. Brunelle

Lead Researcher, The MITRE Corporation
Lead Researcher at The MITRE Corporation and Adjunct Assistant Professor at Old Dominion University. Research interests include: web science, digital preservation, cloud computing, emerging technologies

Speakers

Federico Nanni

Researcher, University of Mannheim

Michael Nelson

Professor, Old Dominion University

Alexander C. Nwala

PhD Student/Research Assistant, Old Dominion University

Michele Weigle

Associate Professor, Old Dominion University

Adam Ziegler

Managing Director, Harvard Library Innovation Lab, Harvard University
Adam Ziegler is an attorney and member of the Library Innovation Lab at Harvard Law School where he leads technology projects like Free the Law, Perma.cc and H2O. Before taking that role, he co-founded a legal tech startup named Mootus and represented clients for over a decade at...


Wednesday June 21, 2017 14:00 - 15:30 EDT
Innis Town Hall 2 Sussex Ave, Toronto, ON M5S 1J5

16:00 EDT

Paper Session 08: Classification and Clustering
Eduardo Castro, Saurabh Chakravarty, Eric Williamson, Denilson Pereira and Edward Fox. Classifying Short Unstructured Data using the Apache Spark Platform (Full)
People worldwide use Twitter to post updates about the events that concern them directly or indirectly. Study of these posts can help identify global events and trends of importance. Similarly, E-commerce applications organize their products in a way that can facilitate their management and satisfy the needs and expectations of their customers. However, classifying data such as tweets or product descriptions is still a challenge. These data are described by short texts, containing in their vocabulary abbreviations of sentences, emojis, hashtags, implicit codes, and other non-standard usage of written language. Consequently, traditional text classification techniques are not effective on these data. In this paper, we describe our use of the Spark platform to implement two classification strategies to process large data collections, where each datum is a short textual description. One of our solutions uses an associative classifier, while the other is based on a multiclass Logistic Regression classifier using Word2Vec as a feature selection and transformation technique. Our associative classifier captures the relationships among words that uniquely identify each class, and Word2Vec captures the semantic and syntactic context of the words. In our experimental evaluation, we compared our solutions, as well as Spark MLlib classifiers. We assessed effectiveness, efficiency, and memory requirements. The results indicate that our solutions are able to effectively classify the millions of data instances composed of thousands of distinct features and classes, found in our digital libraries.
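The second solution described above (Word2Vec features feeding a multiclass Logistic Regression) maps fairly directly onto a Spark ML pipeline; the tiny in-memory DataFrame and the parameter values below are placeholders for illustration, not the authors' configuration.

```python
# Sketch of the Word2Vec + multiclass Logistic Regression strategy as a Spark ML
# pipeline. The in-memory toy data and parameters are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, Word2Vec
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("short-text-classification").getOrCreate()
train = spark.createDataFrame([
    ("earthquake hits coastal city #breaking", 0.0),
    ("new phone case fits model x pro", 1.0),
    ("wildfire spreads near highway", 0.0),
    ("usb c charging cable 2m braided", 1.0),
], ["text", "label"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    Word2Vec(vectorSize=50, minCount=1, inputCol="words", outputCol="features"),
    LogisticRegression(maxIter=50, family="multinomial"),
])
model = pipeline.fit(train)
model.transform(train).select("text", "prediction").show(truncate=False)
spark.stop()
```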

Abel Elekes, Martin Schäler and Klemens Böhm. On the Various Semantics of Similarity in Word Embedding Models (Full)

*Best Student Paper Award Nominee

Finding similar words with the help of word embedding models has yielded meaningful results in many cases. However, the notion of similarity has remained ambiguous. In this paper, we examine when exactly similarity values in word embedding models are meaningful. To do so, we analyze the statistical distribution of similarity values systematically, in two series of experiments. The first one examines how the distribution of similarity values depends on the different embedding model algorithms and parameters. The second one starts by showing that intuitive similarity thresholds do not exist. We then propose a method stating which similarity values actually are meaningful for a given embedding model. In more abstract terms, our insights should give way to a better understanding of the notion of similarity in embedding models and to more reliable evaluations of such models.

Mirco Kocher and Jacques Savoy. Author Clustering Using Spatium (Short)
This paper presents the author clustering problem and compares it to related authorship attribution questions. The proposed model is based on a distance measure called Spatium, derived from the Canberra measure (a weighted version of the L1 norm). The selected features consist of the 200 most frequent words and punctuation symbols. An evaluation methodology is presented, and the test collections are extracted from the PAN CLEF 2016 evaluation campaign. In addition to those, we also consider two additional corpora reflecting the literature domain more closely. Based on four different languages, the evaluation measures demonstrate high precision and high F1 values for all 20 test collections. A more detailed analysis provides reasons explaining some of the failures of the Spatium model.
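A minimal sketch of a Canberra-style weighted L1 distance over the relative frequencies of the most frequent words and punctuation, in the spirit of Spatium, is shown below; the exact weighting, smoothing, and feature selection used in the paper may differ.

```python
# Canberra-style weighted L1 distance over relative frequencies of the top-k
# most frequent words/punctuation, in the spirit of Spatium. Illustrative only.
from collections import Counter
import re

def profile(text, vocabulary):
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in vocabulary]

def spatium_like(text_a, text_b, k=200):
    joint = Counter(re.findall(r"\w+|[^\w\s]", (text_a + " " + text_b).lower()))
    vocab = [w for w, _ in joint.most_common(k)]
    a, b = profile(text_a, vocab), profile(text_b, vocab)
    return sum(abs(x - y) / (x + y) for x, y in zip(a, b) if x + y > 0)

print(spatium_like("To be, or not to be: that is the question.",
                   "It was the best of times, it was the worst of times."))
```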

Shaobin Xu and David Smith. Retrieving and Combining Repeated Passages to Improve OCR (Short)
We present a novel approach to improve the output of optical character recognition (OCR) systems by first detecting duplicate passages in their output and then performing consensus decoding combined with a language model. This approach is orthogonal to, and may be combined with, previously proposed methods for combining the output of different OCR systems on the same text or the output of the same OCR system on differently processed images of the same text. It may also be combined with methods to estimate the parameters of a noisy channel model of OCR errors. Additionally, the current method generalizes previous proposals for simple majority-vote combination of known duplicated texts. On a corpus of historical newspapers, an annotated set of clusters has a baseline word error rate (WER) of 33%. A majority vote procedure reaches 23% on passages where one or more duplicates were found, and consensus decoding combined with a language model achieves 18% WER. In a separate experiment, newspapers were aligned to very widely reprinted texts such as State of the Union speeches, producing clusters with up to 58 witnesses. Beyond 20 witnesses, simple majority vote outperforms language model rescoring, though the gap between them is much less in this experiment.
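The simple majority-vote baseline over duplicate passages reduces, at its core, to picking the most common token at each aligned position; a toy version (assuming the witnesses are already aligned token-by-token, which is the hard part that the paper's alignment and consensus decoding handle) is shown below.

```python
# Toy per-position majority vote over already-aligned OCR witnesses of the same
# passage. Alignment itself (the hard part) is assumed to have been done.
from collections import Counter

def majority_vote(aligned_witnesses):
    corrected = []
    for column in zip(*aligned_witnesses):
        token, _ = Counter(column).most_common(1)[0]
        corrected.append(token)
    return " ".join(corrected)

witnesses = [
    "the state of the unlon address".split(),
    "the state ot the union address".split(),
    "tne state of the union address".split(),
]
print(majority_vote(witnesses))   # "the state of the union address"
```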

Moderators

Peter Organisciak

University of Illinois at Urbana-Champaign

Speakers

Edward Fox

Professor, Virginia Polytechnic Institute and State University
digital libraries, family, students, Reiki

Mirco Kocher

PhD student, University of Neuchâtel


Wednesday June 21, 2017 16:00 - 17:30 EDT
Room 205, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6

16:00 EDT

Paper Session 09: Content Provenance and Reuse
David Bamman, Michelle Carney, Jon Gillick, Cody Hennesy and Vijitha Sridhar. Estimating the date of first publication in a large-scale digital library (Full)
One prerequisite for cultural analysis in large-scale digital libraries is an accurate estimate of the date of composition--as distinct from publication--of the texts they contain. In this work, we present a manually annotated dataset of first dates of publication for three samples of books from the HathiTrust Digital Library (uniform random, uniform fiction, and stratified by decade), and empirically evaluate the disparity between these gold standard labels and several approximations used in practice (using the date of publication as provided in metadata, several deduplication methods, and automatically predicting the date of composition from the text of the book). We find that a simple heuristic of metadata-based deduplication works best in practice, and text-based composition dating is accurate enough to inform the analysis of "apparent time."
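The metadata-based deduplication heuristic that the abstract finds to work best amounts, in rough outline, to grouping volumes on a normalized title/author key and keeping the earliest publication year in each group; the normalization rule and the toy records below are assumptions for illustration, not the paper's exact heuristic.

```python
# Rough sketch of metadata-based deduplication: group volumes on a normalized
# (title, author) key and take the earliest publication year per group.
import re
from collections import defaultdict

def normalize(s):
    return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()

volumes = [
    {"title": "Moby-Dick; or, The Whale", "author": "Melville, Herman", "year": 1922},
    {"title": "Moby Dick, or the Whale",  "author": "Melville, Herman", "year": 1851},
    {"title": "Walden",                   "author": "Thoreau, Henry David", "year": 1910},
]

earliest = defaultdict(lambda: 9999)
for v in volumes:
    key = (normalize(v["title"]), normalize(v["author"]))
    earliest[key] = min(earliest[key], v["year"])
print(dict(earliest))   # both Moby-Dick records collapse to 1851
```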

George Buchanan and Dana Mckay. The Lowest form of Flattery: Characterising Text Re-use and Plagiarism Patterns in a Digital Library Corpus (Full)
The re-use of text—particularly misuse, or plagiarism—is a contentious issue for researchers, universities, libraries and publishers. Technological approaches to identifying student plagiarism, such as TurnItIn, are now widespread. Academic publishing, however, does not typically come under such scrutiny. While it is common knowledge that plagiarism occurs, we do not know how frequently or how extensively, nor where in a document it is likely to be found. This paper offers the first assessment of text re-use within the field of digital libraries. It also characterises text re-use generally (and plagiarism specifically) according to its location in the document, author seniority, publication venue and open access. As a secondary contribution, we suggest routes towards more rigorous plagiarism detection and management in the future.

Bela Gipp, Corinna Breitinger, Norman Meuschke and Joeran Beel. CryptSubmit: Introducing Securely Timestamped Manuscript Submission and Peer Review Feedback using the Blockchain (Short)
Manuscript submission systems are a central fixture in scholarly publishing. However, researchers who submit their unpublished work to a conference or journal must trust that the system and its provider will not accidentally or willfully leak unpublished findings. Additionally, researchers must trust that the program committee and the anonymous peer reviewers will not plagiarize unpublished ideas or results. To address these weaknesses, we propose a method that automatically creates a publicly verifiable, tamper-proof timestamp for manuscripts utilizing the decentralized Bitcoin blockchain. The presented method hashes each submitted manuscript and uses the API of the timestamping service OriginStamp to persistently embed this manuscript hash on Bitcoin’s blockchain. Researchers can use this tamper-proof trusted timestamp to prove that their manuscript existed in its specific form at the time of submission to a conference or journal. This verifiability allows researchers to stake a claim to their research findings and intellectual property, even in the face of vulnerable submission platforms or dishonest peer reviewers. Optionally, the system also associates trusted timestamps with the feedback and ideas shared by peer reviewers to increase the traceability of ideas. The proposed concept, which we introduce as CryptSubmit, is currently being integrated into the open-source conference management system OJS. In the future, the method could be integrated at nearly no overhead cost into other manuscript submission systems, such as EasyChair, ConfTool, or Ambra. The introduced method can also improve electronic pre-print services and storage systems for research data.
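The core of the scheme is hashing the submitted manuscript and anchoring that hash in the blockchain via a timestamping service; a minimal sketch of the hashing step is below. The submission call is deliberately left as a commented-out placeholder, since OriginStamp's real API endpoint and payload are not reproduced here.

```python
# Minimal sketch of the hashing step: compute a SHA-256 digest of the submitted
# manuscript. The timestamping call is a placeholder -- the paper uses the
# OriginStamp API, whose actual endpoint is not shown here.
import hashlib

def manuscript_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

digest = manuscript_hash("submission.pdf")   # placeholder file name
print("SHA-256:", digest)

# Placeholder: hand the digest to a trusted-timestamping service so it ends up
# anchored in the Bitcoin blockchain (hypothetical endpoint, for illustration):
# requests.post("https://timestamp.example.org/api/stamp", json={"hash": digest})
```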

Mayank Singh, Abhishek Niranjan, Divyansh Gupta, Nikhil Angad Bakshi, Animesh Mukherjee and Pawan Goyal. Citation sentence reuse behavior of scientists: A case study on massive bibliographic text dataset of computer science (Short)
Our current knowledge of scholarly plagiarism is largely based on the similarity between full-text research articles. In this paper, we propose an innovative and novel conceptualization of scholarly plagiarism in the form of reuse of explicit citation sentences in scientific research articles. Note that while full-text plagiarism is an indicator of gross-level behavior, copying of citation sentences is a more nuanced micro-scale phenomenon observed even for well-known researchers. The current work poses several interesting questions and attempts to answer them by empirically investigating a large bibliographic text dataset from computer science containing millions of lines of citation sentences. In particular, we report evidence of massive copying behavior. We also present several striking real examples throughout the paper to showcase widespread adoption of this undesirable practice. In contrast to popular perception, we find that copying tendency increases as an author matures. The copying behavior is reported to exist in all fields of computer science; however, the theoretical fields indicate more copying than the applied fields.

Moderators

Edie Rasmussen

University of British Columbia

Speakers

David Bamman

UC Berkeley

George Buchanan

University of Melbourne

Cody Hennesy

E-Learning Librarian, UC Berkeley Library

Dana McKay

University of Melbourne

Norman Meuschke

PhD Student, University of Konstanz, Germany
Research interests: Information Retrieval for text, images, and mathematical content; Plagiarism Detection; News Analysis; Citation and Link Analysis; Blockchain Technology; Information Visualization

Mayank Singh

Indian Institute of Technology Kharagpur


Wednesday June 21, 2017 16:00 - 17:30 EDT
Innis Town Hall 2 Sussex Ave, Toronto, ON M5S 1J5
 
Thursday, June 22
 

09:00 EDT

Paper Session 10: Scientific Collections and Libraries
Abdussalam Alawini, Leshang Chen, Susan Davidson and Gianmaria Silvello. Automating data citation: the eagle-i experience (Full)
Data citation is of growing concern for owners of curated databases, who wish to give credit to the contributors and curators responsible for portions of the dataset and enable the data retrieved by a query to be later examined. While several databases specify how data should be cited, they leave it to users to manually construct the citations and do not generate them automatically. 

We report our experiences in automating data citation for an RDF dataset called eagle-i, and discuss how to generalize this to a citation framework that can work across a variety of different types of databases (e.g. relational, XML, and RDF). We also describe how a database administrator would use this framework to automate citation for a particular dataset. 

Sandipan Sikdar, Matteo Marsili, Niloy Ganguly and Animesh Mukherjee. Influence of Reviewer Interaction Network on Long-term Citations: A Case Study of the Scientific Peer-Review System of the Journal of High Energy Physics (Full)

*Best Student Paper Award Nominee

A "peer-review system", in the context of judging research contributions, is one of the prime steps undertaken to ensure the quality of the submissions received; a significant portion of the publishing budget is spent towards successful completion of the peer review by the publication houses. Nevertheless, the scientific community is largely reaching a consensus that the peer-review system, although indispensable, is nonetheless flawed.

A very pertinent question therefore is "could this system be improved?" In this paper, we attempt to present an answer to this question by considering a massive dataset of around 29k papers with roughly 70k distinct review reports, together consisting of 12m lines of review text, from the Journal of High Energy Physics (JHEP) from 1997 to 2015. Specifically, we introduce a novel reviewer-reviewer interaction network (an edge exists between two reviewers if they were assigned by the same editor) and show that, surprisingly, simple structural properties of this network such as degree, clustering coefficient, and centrality (closeness, betweenness, etc.) serve as strong predictors of the long-term citations (i.e., the overall scientific impact) of a submitted paper. Compared to a set of baseline features built from the basic characteristics of the submitted papers, the authors and the referees (e.g., the popularity of the submitting author, the acceptance rate history of a referee, the linguistic properties laden in the text of the review reports, etc.), the network features perform manifold better. Although we do not claim to provide a full-fledged reviewer recommendation system (that could potentially replace an editor), our method could be extremely useful in assisting the editors in deciding the acceptance or rejection of a paper, thereby improving the effectiveness of the peer-review system.
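The reviewer-reviewer interaction network and the structural features named above can be computed directly with networkx; the toy editor-to-reviewer assignments below are invented for illustration.

```python
# Toy reviewer-reviewer interaction network: reviewers are linked when the same
# editor assigned them. Assignments are invented; the features mirror those
# named in the abstract (degree, clustering, closeness, betweenness).
from itertools import combinations
import networkx as nx

assignments = {          # editor -> reviewers assigned by that editor
    "editor_1": ["r1", "r2", "r3"],
    "editor_2": ["r2", "r4"],
    "editor_3": ["r1", "r4", "r5"],
}

G = nx.Graph()
for reviewers in assignments.values():
    G.add_edges_from(combinations(reviewers, 2))

features = {
    r: {
        "degree": G.degree(r),
        "clustering": nx.clustering(G, r),
        "closeness": nx.closeness_centrality(G, r),
        "betweenness": nx.betweenness_centrality(G)[r],
    }
    for r in G.nodes
}
print(features["r2"])
```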

Martin Klein and Herbert Van De Sompel. Discovering Scholarly Orphans Using ORCID (Full)
Archival efforts such as (C)LOCKSS and Portico are in place to ensure the longevity of traditional scholarly resources like journal articles. At the same time, researchers are depositing a broad variety of other scholarly artifacts into emerging online portals that are designed to support web-based scholarship. These web-native scholarly objects are largely neglected by current archival practices and hence they become scholarly orphans. We therefore argue for a novel paradigm that is tailored towards archiving these scholarly orphans. We are investigating the feasibility of using Open Researcher and Contributor ID (ORCID) as a supporting infrastructure for the process of discovery of web identities and scholarly orphans for active researchers. We analyze ORCID in terms of coverage of researchers, subjects, and location and assess the richness of its profiles in terms of web identities and scholarly artifacts. We find that ORCID currently falls short in all considered aspects and hence can only be considered in conjunction with other discovery sources. However, ORCID is growing fast, so there is potential that it could achieve a satisfactory level of coverage and richness in the near future.

Moderators
Speakers

Susan Davidson

University of Pennsylvania

Martin Klein

Scientist, Los Alamos National Laboratory

Sandipan Sikdar

Indian Institute of Technology Kharagpur

Gianmaria Silvello

Researcher, University of Padua
Data Citation

Herbert Van De Sompel

Chief Innovation Officer, DANS


Thursday June 22, 2017 09:00 - 10:30 EDT
Room 325, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6
 