ACM/IEEE Joint Conference on Digital Libraries 2017
University of Toronto
JCDL 2017 | #JCDL@2017

Monday, June 19
 

09:00

Doctoral Consortium

The Doctoral Consortium is a workshop for Ph.D. students from all over the world who are in the early phases of their dissertation work (i.e., the consortium is not intended for those who are finished or nearly finished with their dissertation). The goal of the Doctoral Consortium is to help students with their thesis and research plans by providing feedback and general advice in a constructive atmosphere. Students will present and discuss their research in the context of a well-known and established international conference, in a supportive atmosphere with other doctoral students and an international panel of established researchers. The workshop will take place on a single full day 9am-5pm (June 19, 2017).


Monday June 19, 2017 09:00 - 17:00
Room 313, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6

09:00

Introduction to Digital Libraries
This tutorial is a thorough and deep introduction to the Digital Libraries (DL) field, providing a firm foundation: covering key concepts and terminology, as well as services, systems, technologies, methods, standards, projects, issues, and practices. It introduces and builds upon a firm theoretical foundation (starting with the ‘5S’ set of intuitive aspects: Streams, Structures, Spaces, Scenarios, Societies), giving careful definitions and explanations of all the key parts of a ‘minimal digital library’, and expanding from that basis to cover key DL issues. Illustrations come from a set of case studies, including several current projects. Attendees will be exposed to four Morgan & Claypool books that elaborate on 5S, published 2012-2014. Complementing the coverage of ‘5S’ will be an overview of key aspects of the DELOS Reference Model and DL.org activities. Further, the use of a Hadoop cluster supporting big data DLs will be described.

Monday June 19, 2017 09:00 - 17:00
Room 325, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6

09:00

6th International Workshop on Mining Scientific Publications (WOSP 2017)

Digital libraries that store scientific publications are becoming increasingly central to the research process. They are not only used for traditional tasks, such as finding and storing research outputs, but also as a source for discovering new research trends or evaluating research excellence. With the current growth of scientific publications deposited in digital libraries, it is no longer sufficient to provide only access to content. To aid research, it is especially important to leverage the potential of text and data mining technologies to improve the process of how research is being done. 
This workshop aims to bring together people from different backgrounds who: (a) are interested in analysing and mining databases of scientific publications, (b) develop systems that enable such analysis and mining of scientific databases (especially those who run databases of publications), or (c) develop novel technologies that improve the way research is being done.

For more information, including submission instructions, see the workshop webpage.


Monday June 19, 2017 09:00 - 17:00
Room 205, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6

12:00

Lunch
Monday June 19, 2017 12:00 - 14:00
TBA

14:00

Introduction to the Digital Public Library of America API--CANCELLED
The Digital Public Library of America (DPLA) provides access to over 15 million objects from libraries, museums, and archives. In addition to serving as an open portal for cultural heritage, literature, art, and scientific materials, the DPLA provides access to extensive metadata related to these materials via an openly available, RESTful application programming interface (API). The open API enables third-party developers to create targeted applications that enable new and transformative uses of the items indexed by the DPLA. This half-day tutorial will introduce participants to the DPLA's data model, describe the API, explain how to retrieve data using the API, and show how to work with the retrieved data in freely available software, using both interactive and programmatic techniques.

This program has been cancelled as of June 14, 2017.
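Although the tutorial was cancelled, the flavor of the API it targeted can still be sketched. A minimal Python request against the DPLA v2 items endpoint might look like the following; the API key is a placeholder, and the sketch is illustrative rather than tutorial material.

    import requests

    API_KEY = "YOUR_DPLA_API_KEY"  # placeholder; request a real key from DPLA

    # Query the DPLA v2 items endpoint for records matching a keyword.
    resp = requests.get(
        "https://api.dp.la/v2/items",
        params={"q": "digital libraries", "page_size": 5, "api_key": API_KEY},
    )
    resp.raise_for_status()

    for doc in resp.json().get("docs", []):
        # sourceResource holds the descriptive metadata in DPLA's data model
        print(doc.get("sourceResource", {}).get("title"))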

Monday June 19, 2017 14:00 - 17:00
Room 417, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6

14:00

Scholarly Data Mining: Making Sense of the Scientific Literature
The objective of this tutorial is to provide a comprehensive overview of the issues involved in mining scientific textual information, identifying challenges, solutions, and opportunities to improve the way we access Scholarly Digital Libraries. The tutorial surveys the most relevant tasks related to the processing of scientific documents, including but not limited to in-depth analysis of the structure of scientific articles, their semantic interpretation, and content extraction and summarization.

Monday June 19, 2017 14:00 - 17:00
Room 520, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6
 
Tuesday, June 20
 

07:30

JCDL Steering Committee Meeting (Closed Meeting)
The annual meeting of the JCDL Steering Committee. This is a closed meeting for steering committee members only.

Tuesday June 20, 2017 07:30 - 08:30
Room 312, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6

09:00

Keynote: Liz Lyon

In her Opening Keynote, Liz Lyon will begin by presenting a short personal retrospective of the data successes and achievements of the last decade, citing examples from both policy and practice. She will examine where we are now as a reality check, in order to understand what research data services and data roles are currently established and operational. Liz will then explore a number of emerging or next generation data zones, drawing on recent media headlines, technological developments and societal concerns – these will include addressing data at risk, thinking beyond data analytics and promoting transparency and trust in open science. A suite of next gen data roles will be presented, which seek to expand our understanding of data science in the broadest sense, but also to raise questions for data practitioners, faculty scholars and the wider public. The talk will close with a look at some of the challenges and opportunities for data and information professionals in this brave new data world.

Moderators

Robert McDonald

Associate Dean for Research & Technology Strategies, Indiana University
As the Associate Dean for Research and Technology Strategies, Robert H. McDonald works to provide library information system services and discovery services to the entire IU system and manages projects related to scholarly communications, new model publishing, and technologies that enable the Libraries to support teaching and learning for the IU Bloomington campus. In his role as Deputy Director of the Data to Insight Center, he works on new research related to large data analysis, storage and preservation through grant-funded and collaborative projects such as the HathiTrust Research Center. He also serves as the Data Steward for the IU Libraries. His research interests include technology management and integration of lean and agile frameworks, data preservation, learning eco-systems, data cyberinfrastructure, and big data analytics. Robert frequently presents and writes on a variety of topics, and was editor of the E-Content column for EDUCAUSE Review in 2016…

Speakers

Liz Lyon

Visiting Professor, School of Information Sciences, University of Pittsburgh


Tuesday June 20, 2017 09:00 - 10:30
Innis Town Hall 2 Sussex Ave, Toronto, ON M5S 1J5

11:00

Paper Session 01: Web Archives
Justin Brunelle, Michele Weigle and Michael Nelson. Archival Crawlers and JavaScript: Discover More Stuff but Crawl More Slowly (Full)
The web is today's primary publication medium, making web archiving an important activity for historical and analytical purposes. Web pages are increasingly interactive, and as a result increasingly difficult to archive. JavaScript enables interactions that can potentially change the client-side state of a representation. We refer to representations that load embedded resources via JavaScript as deferred representations. It is difficult to discover and crawl all of the resources in deferred representations, and archiving them yields archived web pages that are either incomplete or erroneously load embedded resources from the live web. We propose a method of discovering and archiving deferred representations and their descendants (representation states) that are only reachable through client-side events. Our approach identified an average of 38.5 descendants per seed URI crawled, 70.9% of which are reached through an onclick event. This approach also added 15.6 times more embedded resources than Heritrix to the crawl frontier, but at a crawl rate that was 38.9 times slower than simply using Heritrix. If our method were applied to the July 2015 Common Crawl dataset, a web-scale archival crawler would discover an additional 7.17 PB (5.12 times more) of information per year. This illustrates the significant increase in resources necessary for more thorough archival crawls.
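Purely as an illustration of the core idea (load the page in a real browser so JavaScript runs, fire onclick events, and see which embedded resources appear only afterwards), a headless-browser sketch might look like this. Selenium and the browser's Resource Timing API are assumptions here, not the authors' crawler.

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    driver.get("https://example.com")

    def loaded_resources():
        # The Resource Timing API lists every embedded resource fetched so far.
        return set(driver.execute_script(
            "return performance.getEntriesByType('resource').map(e => e.name);"))

    before = loaded_resources()
    for element in driver.find_elements(By.CSS_SELECTOR, "[onclick]"):
        try:
            element.click()  # may reveal a deferred representation
        except Exception:
            pass  # element not interactable in the current client-side state

    deferred = loaded_resources() - before
    print(f"{len(deferred)} resources reachable only through onclick events")
    driver.quit()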

Faryaneh Poursardar and Frank Shipman. What is Part of that Resource? User Expectations for Personal Archiving (Short)
Users wish to preserve Internet resources for later use, but what is and is not part of an Internet resource remains an open question. In this paper we examine how specific relationships between web pages affect user perceptions of their being part of the same resource. The study presented participants with pairs of pages and asked about their expectation of having access to the second page after saving the first. The primary-page content in the study comes from multi-page stories, multi-image collections, product pages with reviews and ratings on separate pages, and short single-page writings. Participants were asked to agree or disagree with three statements regarding their expectation of later access. Nearly 80% of participants agreed in the case of articles spread across multiple pages, images in the same collection, and additional details or assessments of product information. About 50% agreed for related content on pages linked to by the original page or related items, while only about 30% thought advertisements or wish lists linked to were part of the resource. Differences in responses to the same page pairs for the three statements regarding later access indicate that users recognize the difference between what would be valuable to them and current implementations of saving web content.

Weijia Xu, Maria Esteva, Deborah Beck and Yi-Hsuan Hsieh. A Portable Strategy for Preserving Web Applications and Data Functionality (Short)
Increasingly, the value of research data resides not only in its content but also in how it is made available to users. To introduce complex research topics, data is often presented interactively through a web application whose design is the result of years of work by researchers. Therefore, preserving the data and the application's functionality becomes equally important. In the current academic IT environment it is often the case that these web applications are developed by multiple people with different expertise, deployed within shared technology infrastructures, and that they evolve technically and in content over short periods of time. This lifecycle model presents challenges to reproducibility and portability of the application across technology platforms over time. Preservation approaches such as virtualization and emulation may not be applicable to these cases due, among other issues, to the co-dependencies of the hosting infrastructure, to missing documentation about the original development, and to the evolving nature of these applications. To address these issues, we propose a functional preservation strategy that decouples web applications and their corresponding data from their hosting environment and re-launches data and web code in a more portable environment without compromising the look and feel or the interactive features. Crucial to the strategy is identifying discrepancies between the application and the existing hosting environment, including library dependencies and system configuration, and adapting them within a simplified virtual environment to bring up the application's functionality. Advantages over virtualization and emulation reside in not having to recreate one or all of the layers of the original hosting environment. We demonstrate this approach using as a case study the Speech Presentation in Homeric Epics database, a digital humanities project, and evaluate portability by deploying the application in two different hosting environments. We also assess the strategy in terms of the ease with which a non-savvy user can re-launch the application in a new host.

Sawood Alam, Mat Kelly, Michele Weigle and Michael Nelson. Client-side Reconstruction of Composite Mementos Using ServiceWorker (Short)
We use the ServiceWorker (SW) web API to intercept HTTP requests for embedded resources and reconstruct Composite Mementos without the conventional URL rewriting typically performed by web archives. URL rewriting is a problem for archival replay systems, especially for URLs constructed by JavaScript, and frequently results in incorrect URI references. By intercepting requests on the client using SW, we are able to strategically reroute instead of rewrite. Our implementation moves rewriting to clients, saving servers' computing resources and allowing servers to return responses more quickly. Our experiments show that retrieving the original instead of rewritten pages from the archive reduces time overhead by 35.66% and data overhead by 19.68%. Our system prevents Composite Mementos from leaking the live web while being easy to distribute and maintain.
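The ServiceWorker itself is written in JavaScript, but the routing rule it applies can be sketched language-agnostically. Assuming a Wayback-style replay URI layout (a simplification for illustration, not necessarily the paper's system), rerouting amounts to prefixing the intercepted URL at request time instead of rewriting it inside the stored page.

    from urllib.parse import urljoin

    ARCHIVE = "https://web.archive.org/web/"  # Wayback-style replay endpoint

    def reroute(request_url: str, page_url: str, datetime14: str) -> str:
        """Map an intercepted embedded-resource request to its archived copy.

        Instead of rewriting every URL inside the stored page on the server,
        the client resolves the resource against the original page URL and
        prefixes the archive endpoint at request time.
        """
        original = urljoin(page_url, request_url)  # handles relative URLs
        return f"{ARCHIVE}{datetime14}id_/{original}"

    # An image referenced as "/img/logo.png" inside an archived 2015 page:
    print(reroute("/img/logo.png", "http://example.com/index.html", "20150701000000"))
    # -> https://web.archive.org/web/20150701000000id_/http://example.com/img/logo.png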

Moderators

Martin Klein

Scientist, Los Alamos National Laboratory

Speakers

Sawood Alam

Researcher, Old Dominion University
I am commonly tagged by Web, Digital Library, Web Archiving, Ruby on Rails, PHP, HTML, CSS, JavaScript, ExtJS, Go, Urdu, RTL, Docker, and Linux.

Justin F. Brunelle

Lead Researcher, The MITRE Corporation
Lead Researcher at The MITRE Corporation and Adjunct Assistant Professor at Old Dominion University. Research interests include web science, digital preservation, cloud computing, and emerging technologies.

Michael Nelson

Professor, Old Dominion University

Faryaneh Poursardar

PhD Candidate, Texas A&M University
Web archive, HCI

Michele Weigle

Associate Professor, Old Dominion University


Tuesday June 20, 2017 11:00 - 12:30
Innis Town Hall 2 Sussex Ave, Toronto, ON M5S 1J5

12:30

Lunch
Tuesday June 20, 2017 12:30 - 14:00
Innis Café Complex 2 Sussex Ave, Toronto, ON M5S 1J5

14:00

Paper Session 02: Semantics and Linking
Pavlos Fafalios, Helge Holzmann, Vaibhav Kasturia and Wolfgang Nejdl. Building and Querying Semantic Layers for Web Archives (Full)

*Vannevar Bush Best Paper Award Nominee

Web archiving is the process of collecting portions of the Web to ensure the information is preserved for future exploitation. However, despite the increasing number of web archives worldwide, the absence of efficient and meaningful exploration methods still remains a major hurdle in the way of turning them into a usable and useful information source. In this paper, we elaborate on this problem and propose an RDF/S model and a distributed framework for building semantic profiles ("layers") that describe semantic information about the contents of web archives. A semantic layer allows describing metadata information about the archived documents, annotating them with useful semantic information (like entities, concepts and events), and publishing all this data on the Web as Linked Data. Such structured repositories offer advanced query and integration capabilities and make web archives directly exploitable by other systems and tools. To demonstrate their query capabilities, we build and query semantic layers for three different types of web archives. An experimental evaluation showed that a semantic layer can answer information needs that existing keyword-based systems are not able to sufficiently satisfy.

Abhik Jana, Sruthi Mooriyath, Animesh Mukherjee and Pawan Goyal. WikiM: Metapaths based Wikification of Scientific Abstracts (Full)
In order to disseminate the exponentially growing body of knowledge produced in the form of scientific publications, it would be best to design mechanisms that connect it with an already existing rich repository of concepts -- Wikipedia. Not only does this make scientific reading simple and easy (by connecting the concepts used in scientific articles to their Wikipedia explanations), but it also improves the overall quality of the article. In this paper, we present a novel metapath-based method, WikiM, to efficiently wikify scientific abstracts -- a topic that has rarely been investigated in the literature. One of the prime motivations for this work comes from the observation that wikified abstracts of scientific documents help a reader decide better, in comparison to plain abstracts, whether (s)he would be interested to read the full article. We perform mention extraction mostly through traditional tf-idf measures coupled with a set of smart filters. The entity linking leverages heavily the rich citation and author publication networks. Our observation is that various metapaths defined over these networks can significantly enhance the overall performance of the system. For mention extraction and entity linking, we outperform most of the competing state-of-the-art techniques by a large margin, arriving at precision values of 72.42% and 73.8% respectively over a dataset from the ACL Anthology Network. In order to establish the robustness of our scheme, we wikify three other datasets and get precision values of 63.41%-94.03% and 67.67%-73.29% respectively for the mention extraction and the entity linking phases.
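As a sketch of the first stage the abstract mentions, tf-idf-based candidate mention extraction can be done with scikit-learn; the "smart filters" are reduced here to a plain stopword list, an intentional simplification.

    from sklearn.feature_extraction.text import TfidfVectorizer

    abstracts = [
        "We present a metapath based method to wikify scientific abstracts.",
        "Entity linking leverages citation and author publication networks.",
    ]

    # Score unigram and bigram candidates by tf-idf; high-scoring phrases
    # become candidate mentions for the later entity-linking stage.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
    tfidf = vectorizer.fit_transform(abstracts)
    terms = vectorizer.get_feature_names_out()

    for row, text in zip(tfidf.toarray(), abstracts):
        top = sorted(zip(row, terms), reverse=True)[:3]
        print(text[:35], "->", [term for score, term in top if score > 0])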

Jian Wu, Sagnik Ray Choudhury, Agnese Chiatti, Chen Liang and C. Lee Giles. HESDK: A Hybrid Approach to Extracting Scientific Domain Knowledge Entities (Short)
Automatic keyphrase extraction from scientific documents is a well-known problem. We investigate a variant of that problem: Scientific Domain Knowledge Entity (SDKE) extraction. Keyphrases are noun phrases that are important to the document. On the contrary, an SDKE is a span of text that refers to a concept and can be classified as a process, material, task, dataset etc. Supervised keyphrase extraction algorithms using non-sequential classifiers and global measures of informativeness (PMI, tf-idf) are good candidates for this task. Another approach is to use sequential labeling algorithms with local context from a sentence, as done in the named entity recognition tasks. We show that these methods can complement each other and a simple merging can improve the extraction accuracy by 5-7 percentiles. We further propose several heuristics to improve the extraction accuracy. Our preliminary experiments suggest that it is possible to improve the accuracy of the sequential learner itself by utilizing the predictions of the non-sequential model.
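One plausible reading of the "simple merging" that yields the reported gain (an assumption, since the exact rule is not given here) is a span-level union of the two extractors' outputs:

    def merge_extractions(keyphrase_spans, sequence_spans):
        """Union of SDKE spans proposed by a non-sequential keyphrase model
        and a sequential (NER-style) tagger, with duplicates collapsed.

        Each span is a (start, end) character-offset tuple.
        """
        return sorted(set(keyphrase_spans) | set(sequence_spans))

    # The two models tend to recover complementary entities:
    print(merge_extractions([(0, 14), (30, 42)], [(30, 42), (55, 63)]))
    # -> [(0, 14), (30, 42), (55, 63)]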

Xiao Yang, Dafang He, Wenyi Huang, Zihan Zhou, Alexander Ororbia, Daniel Kifer and C. Lee Giles. Smart Library: Identifying Books in a Library using Richly Supervised Deep Scene Text  (Short)
Physical library collections are valuable and long standing resources for knowledge and learning. However, managing books in a large bookshelf and finding books on it often leads to tedious manual work, especially for large book collections where books might be missing or misplaced. Recently, deep neural-based models have achieved great success for scene text detection and recognition. Motivated by these recent successes, we aim to investigate their viability in facilitating book management, a task that introduces further challenges including large amounts of cluttered scene text, distortion, and varied lighting conditions. In this paper, we present a library inventory building and retrieval system based on scene text reading methods. We specifically design our scene text recognition model using rich supervision to accelerate training and achieve state-of-the-art performance on several benchmark datasets. Our proposed system has the potential to greatly reduce the amount of human labor required in managing book inventories as well as the space needed to store book information.

Moderators

Faryaneh Poursardar

PhD Candidate, Texas A&M University
Web archive, HCI

Speakers

Tuesday June 20, 2017 14:00 - 15:30
Room 325, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6

14:00

Paper Session 03: Collection Access and Indexing
Martin Toepfer and Christin Seifert. Descriptor-invariant Fusion Architectures for Automatic Subject Indexing: Analysis and Empirical Results on Short Texts (Full)

*Vannevar Bush Best Paper Award Nominee

Documents indexed with controlled vocabularies enable users of libraries to discover relevant documents, even across language barriers. Due to the rapid growth of scientific publications, digital libraries require automatic methods that index documents accurately, especially with regard to explicit or implicit concept drift, that is, with respect to new descriptor terms and new types of documents, respectively. This paper first analyzes architectures of related approaches on automatic indexing. We show that their design determines individual strengths and weaknesses and justify research on their fusion. In particular, systems benefit from statistical associative components as well as from lexical components applying dictionary matching, ranking, and binary classification. The analysis emphasizes the importance of descriptor-invariant learning, that is, learning based on features which can be transferred between different descriptors. Theoretic and experimental results on economic titles and author keywords underline the relevance of the fusion methodology in terms of overall accuracy and adaptability to dynamic domains. Experiments show that fusion strategies combining a binary relevance approach and a thesaurus-based system outperform all other strategies on the tested data set. Our findings can help researchers and practitioners in digital libraries to choose appropriate methods for automatic indexing.

Guillaume Chiron, Antoine Doucet, Mickaël Coustaty, Muriel Visani and Jean-Philippe Moreux. Impact of OCR Errors on the Use of Digital Libraries: Towards a Better Access to Information (Short)
Digital collections are increasingly used for a variety of purposes. In Europe alone, we can conservatively estimate that tens of thousands of users consult digital libraries daily. These usages are often motivated by qualitative and quantitative research. However, caution must be advised, as most digitized documents are indexed through their OCRed version, which is far from perfect, especially for ancient documents. In this paper, we aim to estimate the impact of OCR errors on the use of a major online platform: the Gallica digital library of the French National Library. It holds more than 100M OCRed documents and receives 80M search queries every year. In this context, we introduce two main contributions. First, an original corpus of OCRed documents comprising 12M characters, along with the corresponding gold standard, is presented and provided, with an equal share of English- and French-written documents. Next, statistics on OCR errors have been computed thanks to a novel alignment method introduced in this paper. Making use of all the user queries submitted to the Gallica portal over 4 months, we take advantage of our error model to propose an indicator for predicting the relative risk that queried terms mismatch targeted resources due to OCR errors, underlining the critical extent to which OCR quality impacts digital library access.
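As a toy illustration of such a risk indicator, with made-up per-character error rates standing in for the paper's alignment-derived error model, the chance that at least one character of a queried term was mis-recognized can be composed as follows.

    # Hypothetical per-character OCR error rates, of the kind estimated by
    # aligning OCR output against a gold standard.
    CHAR_ERROR_RATE = {"e": 0.02, "c": 0.05, "l": 0.08, "i": 0.06}
    DEFAULT_RATE = 0.01

    def mismatch_risk(term: str) -> float:
        """Probability that at least one character of the term is wrong,
        assuming independent per-character errors (a simplification)."""
        p_clean = 1.0
        for ch in term:
            p_clean *= 1.0 - CHAR_ERROR_RATE.get(ch, DEFAULT_RATE)
        return 1.0 - p_clean

    for query in ["liberte", "eglise"]:
        print(f"{query}: {mismatch_risk(query):.1%} risk of missed matches")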

David M. Weigl, Kevin R. Page, Peter Organisciak and J. Stephen Downie. Information-Seeking in Large-Scale Digital Libraries: Strategies for Scholarly Workset Creation (Short) 
Large-scale digital libraries such as the HathiTrust contain massive quantities of content combined from heterogeneous collections, with consequent challenges in providing mechanisms for discovery, unified access, and analysis. The HathiTrust Research Center has proposed 'worksets' as a solution for users to conduct their research into the 15 million volumes of HathiTrust content; however, existing models of users' information-seeking behaviour, which might otherwise inform workset development, were established before digital library resources existed at such a scale.

We examine whether these information-seeking models can sufficiently articulate the emergent user activities of scholarly investigation as perceived during the creation of worksets. We demonstrate that a combination of established models by Bates, Ellis, and Wilson can accommodate many aspects of information seeking in large-scale digital libraries at a broad, conceptual, level. We go on to identify the supplemental information-seeking strategies necessary to specifically describe several workset creation exemplars.

Finally, we propose complementary additions to the existing models: we classify strategies as instances of querying, browsing, and contribution. Similarly we introduce a notion of scope according to the interaction of a strategy with content, content-derived metadata, or contextual metadata. Considering the scope and modality of new and existing strategies within the composite model allows us to better express--and so aid our understanding of--information-seeking behaviour within large-scale digital libraries.

Peter Darch and Ashley Sands. Uncertainty About the Long-Term: Digital Libraries, Astronomy Data, and Open Source Software (Short) 
Digital library developers make critical design and implementation decisions in the face of uncertainties about the future. We present a qualitative case study of the Large Synoptic Survey Telescope (LSST), a major astronomy project that will collect large-scale datasets and make them accessible through a digital library. LSST developers make decisions now, while facing uncertainties about its period of operations (2022-2032). Uncertainties we identify include topics researchers will seek to address, tools and expertise, and the availability of other astronomy infrastructure to exploit LSST observations. LSST is developing, and already releasing, its data management software open source. We evaluate benefits and burdens of this approach as a strategy for addressing uncertainty. Benefits include: enabling software to adapt to researchers’ changing needs; embedding LSST standards and tools as community practices; and promoting interoperability with other infrastructure. Burdens include: open source community management; documentation requirements; and trade-offs between software speed and accessibility.

Jaimie Murdock, Jacob Jett, Timothy W. Cole, Yu Ma, J. Stephen Downie and Beth Plale. Towards Publishing Secure Capsule-based Analysis (Short) 
HathiTrust Digital Library (HTDL) is an example of a next-generation Big Data collection of digitized content. Consisting of over 15 million digitized volumes (books), HTDL is of immense utility to scholars of all disciplines, from chemistry to literary studies. Researchers engage through user browsing and, more recently, through computational analysis (information retrieval, computational linguistics, and text mining), from focused studies of a few dozen volumes to large-scale experiments on millions of volumes. Computational engagement with HTDL is confounded by the in-copyright status of the majority of the content, which requires that computational analysis on HTDL be carried out in a secure environment. This is provided by the HathiTrust Research Center (HTRC) Data Capsule service. A researcher is given a Capsule through which they carry out research for weeks to months. As a Capsule inherently limits flow in and out to protect the in-copyright data, this environment poses unique challenges in supporting researchers who wish to publish results from their research within a Capsule. We discuss recent advancements in our activities on provenance, workflows, worksets, and non-consumptive exports to aid a researcher in publishing results from Big Data analysis.

Moderators
Speakers

Timothy W. Cole

Professor, Mathematics Librarian and CIRSS Coordinator for Library Applications, University of Illinois at Urbana-Champaign

Peter Darch

University of Illinois at Urbana-Champaign

J. Stephen Downie

Co-PI HathiTrust Research Center, University of Illinois at Urbana-Champaign

Jaimie Murdock

PhD Student, Indiana University
Jaimie Murdock is a joint PhD student in Cognitive Science and Informatics. He studies the construction of knowledge representations and the dynamics of expertise. While majoring in two scientific disciplines, most of Jaimie's research occurs in the digital humanities, where he u…

Peter Organisciak

University of Illinois at Urbana-Champaign

Beth Plale

Co-PI HathiTrust Research Center, Indiana University
Science Director, Pervasive Technology Institute; Director, Data To Insight Center; Professor, Informatics and Computing, Indiana University

Ashley Sands

Senior Library Program Officer, Institute of Museum and Library Services
Managing a portfolio of grants and funding opportunities encompassed under the National Digital Platform emphasis of the Office of Library Services. In this position, I have a birds-eye perspective of the development of knowledge infrastructures across and between academic resear…

Christin Seifert

University of Passau

Martin Toepfer

ZBW – Leibniz Information Centre for Economics


Tuesday June 20, 2017 14:00 - 15:30
Innis Town Hall 2 Sussex Ave, Toronto, ON M5S 1J5

15:30

JCDL Plenary Community Meeting
Moderators

Michael Nelson

Professor, Old Dominion University

Tuesday June 20, 2017 15:30 - 16:00
Innis Town Hall 2 Sussex Ave, Toronto, ON M5S 1J5

16:00

Minute Madness
Moderators

Justin F. Brunelle

Lead Researcher, The MITRE Corporation
Lead Researcher at The MITRE Corporation and Adjunct Assistant Professor at Old Dominion University. Research interests include web science, digital preservation, cloud computing, and emerging technologies.

Tuesday June 20, 2017 16:00 - 17:00
Inforum, Faculty of Information, 4th Floor 140 St George St, Toronto, ON M5S 3G6

17:00

Poster Session & Reception
Moderators

Justin F. Brunelle

Lead Researcher, The MITRE Corporation
Lead Researcher at The MITRE Corporation and Adjunct Assistant Professor at Old Dominion University. Research interests include web science, digital preservation, cloud computing, and emerging technologies.

Tuesday June 20, 2017 17:00 - 19:00
Inforum, Faculty of Information, 4th Floor 140 St George St, Toronto, ON M5S 3G6
 
Wednesday, June 21
 

09:00

Keynote: Ray Siemens


Describing academic production up to early last century, literary theorist Northrop Frye observed of what he called the Wissenschaft period, one of building knowledge via dynamic systematic research, that its “imaginative model was the assembly line, to which each scholar ‘contributed’ something [to] an indefinitely expanding body of knowledge” (Northrop Frye [1991]. “Literary and Mechanical Models”).  Reflecting on tendencies in the mid- to latter stages of last century, William Winder has noted that “Amassing knowledge is relatively simple … [but] organizing, retrieving, and understanding the interrelations of the information is another matter” — encouraging us to imagine the challenge of our information age, a neo-Wissenschaft period  perhaps, as one that “brings with it … issues of retrieval and reuse” and requiring us “to be just as efficient at retrieving the information we produce as we are at stockpiling it” (William Winder [1997]. “Texpert Systems.”).  Today, as we revisit issues related to the production, accumulation, organization, retrieval, and navigation of knowledge, we typically do so with attention to contemporary technologies that have worked to redefine roles associated with these issues, introducing new imaginative models for academic knowledge production and engagement.


My talk builds on this foundation and considers ways in which open social scholarship’s framing of these elements, and beyond, encourage building knowledge to scale in a Humanistic context and others.  Open social scholarship involves creating and disseminating research and research technologies to a broad audience of specialists and active non-specialists in ways that are accessible and significant. As a concept, it has grown from roots in open access and open scholarship movements, the digital humanities’ methodological commons and community of practice, contemporary online practices, and public facing “citizen scholarship” to include i) developing, sharing, and implementing research in ways that consider the needs and interests of both academic specialists and communities beyond academia; ii) providing opportunities to co-create, interact with, and experience openly-available cultural data; iii) exploring, developing, and making public tools and technologies under open licenses to promote wide access, education, use, and repurposing; and iv) enabling productive dialogue between academics and non-academics.


Moderators

Ian Milligan

Digital Historian, University of Waterloo

Speakers

Ray Siemens

Distinguished Professor, U Victoria
A leader of collaborative, transformative, interdisciplinary scholarship and pedagogy, Dr. Raymond Siemens is Distinguished Professor in the Faculty of Humanities at the University of Victoria, in English with cross appointment in Computer Science, appointed also 2004-15 as Canad…


Wednesday June 21, 2017 09:00 - 10:30
Innis Town Hall 2 Sussex Ave, Toronto, ON M5S 1J5

11:00

Paper Session 04: Citation Analysis
Saeed-Ul Hassan, Anam Akram and Peter Haddawy. Identifying Important Citations using Contextual Information from Full Text (Full)

*VB Best Paper Award Nominee

In this paper we address the problem of classifying cited work as important or non-important to the developments presented in a research publication. This task is vital for algorithmic techniques that detect and follow emerging research topics and that qualitatively measure the impact of publications in ever-growing scholarly big data. We consider cited work important to a publication if that work is used or extended in some way. If a reference is cited as background work or for the purpose of comparing results, the cited work is considered non-important. By employing five classification techniques (Support Vector Machine, Naïve Bayes, Decision Tree, K-Nearest Neighbors and Random Forest) on an annotated dataset of 465 citations, we explore the effectiveness of eight previously published features and six novel features (including context-based, cue-word-based and textual features). Within this set, our new features are among the best performing. Using the Random Forest classifier we achieve an overall classification accuracy of 0.91 AUC.
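A hedged sketch of the classification setup the abstract describes, with synthetic stand-ins for both the annotated citations and the fourteen features:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Toy feature matrix, e.g. [in-text citation count, cited-in-methods flag,
    # cue-word score]; labels: 1 = important citation, 0 = incidental.
    rng = np.random.default_rng(0)
    X = rng.random((465, 3))
    y = (X[:, 0] + X[:, 2] > 1.0).astype(int)  # synthetic ground truth

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(f"mean AUC over 5 folds: {scores.mean():.2f}")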

Luca Weihs and Oren Etzioni. Learning to Predict Citation-Based Impact Measures (Full)
Citations implicitly encode a community's judgment of a paper's importance and thus provide a unique signal by which to study scientific impact. Efforts in understanding and refining this signal are reflected in the probabilistic modeling of citation networks and the proliferation of citation-based impact measures such as Hirsch's h-index. While these efforts focus on understanding the past and present, they leave open the question of whether scientific impact can be predicted into the future. Recent work addressing this deficiency has employed linear and simple probabilistic models; we show that these results can be handily outperformed by leveraging non-linear techniques. In particular, we find that these AI methods can predict measures of scientific impact for papers and authors, namely citation rates and h-indices, with surprising accuracy, even 10 years into the future. Moreover, we demonstrate how existing probabilistic models for paper citations can be extended to better incorporate refined prior knowledge. Of course, predictions of "scientific impact" should be approached with healthy skepticism, but our results improve upon prior efforts and form a baseline against which future progress can be easily judged.
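A minimal sketch of the non-linear regression framing, with gradient-boosted trees standing in for the paper's models and synthetic author features:

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    # Toy author features: [current h-index, papers published, years active].
    rng = np.random.default_rng(1)
    X = rng.random((1000, 3)) * [40, 100, 20]
    y = 1.5 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 2, 1000)  # future h-index

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
    model = GradientBoostingRegressor(random_state=1).fit(X_tr, y_tr)
    print(f"R^2 on held-out authors: {model.score(X_te, y_te):.2f}")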

Mayank Singh, Ajay Jaiswal, Priya Shree, Arindam Pal, Animesh Mukherjee and Pawan Goyal. Understanding the Impact of Early Citers on Long-Term Scientific Impact (Full)
Our current knowledge of scholarly plagiarism is largely based on the similarity between full-text research articles. In this paper, we propose a novel conceptualization of scholarly plagiarism in the form of reuse of explicit citation sentences in scientific research articles. Note that while full-text plagiarism is an indicator of gross-level behavior, copying of citation sentences is a more nuanced micro-scale phenomenon observed even for well-known researchers. The current work poses several interesting questions and attempts to answer them by empirically investigating a large bibliographic text dataset from computer science containing millions of lines of citation sentences. In particular, we report evidence of massive copying behavior. We also present several striking real examples throughout the paper to showcase the widespread adoption of this undesirable practice. In contrast to popular perception, we find that copying tendency increases as an author matures. The copying behavior is reported to exist in all fields of computer science; however, the theoretical fields indicate more copying than the applied fields.

Moderators
Speakers

Saeed-Ul Hassan

Assistant Professor, Information Technology University

Mayank Singh

Indian Institute of Technology Kharagpur

Luca Weihs

University of Washington


Wednesday June 21, 2017 11:00 - 12:30
Innis Town Hall 2 Sussex Ave, Toronto, ON M5S 1J5

11:00

Paper Session 05: Exploring and Analyzing Collections
Felix Hamborg, Norman Meuschke and Bela Gipp. Matrix-based News Aggregation: Exploring Different News Perspectives (Full)

*Best Student Paper Award Nominee

News aggregators are able to cope with the large amount of news that is published nowadays. However, they focus on the presentation of important, common information, but do not reveal the different perspectives within one topic. Thus, they suffer from media bias, a phenomenon that describes differences in news, such as in their content or tone. Finding these differences is crucial to reduce the effects of media bias. This paper presents matrix-based news analysis (MNA), a novel design for news exploration. MNA helps users gain a broad and diverse news understanding by presenting them various news perspectives on the same topic. Furthermore, we present NewsBird, a news aggregator that implements MNA to find different perspectives on international news topics. The results of a case study demonstrate that NewsBird broadens the user’s news understanding while it also provides similar news aggregation functionalities as established systems.

Nicholas Cole, Alfie Abdul-Rahman and Grace Mallon. Quill: A Framework for Constructing Negotiated Texts - with a Case Study on the US Constitutional Convention of 1787 (Full)

*Vannevar Bush Best Paper Award Nominee


This paper describes a new approach to the presentation of records relating to formal negotiations and the texts that they create. It describes the architecture of a model, platform, and web interface that domain experts can use (with minimal training) to convert the records typical of formal negotiations into a model of decision-making. This model has implications for both research and teaching, allowing better qualitative and quantitative analysis of negotiations. The platform emphasizes reconstructing as closely as possible the context within which proposals and decisions are made. The generic platform, its usability, and its benefits are illustrated by a presentation of the records relating to the 1787 Constitutional Convention that wrote the Constitution of the United States.

Kevin Page, Sean Bechhofer, Georgy Fazekas, David Weigl and Thomas Wilmering. Realising a Layered Digital Library: Exploration and Analysis of the Live Music Archive through Linked Data (Full)
Building upon a collection with functionality for discovery and analysis has been described by Lynch as a 'layered' approach to digital libraries. Meanwhile, as digital corpora have grown in size, their analysis is necessarily supplemented by automated application of computational methods, which can create layers of information as intricate and complex as those within the content itself. This combination of layers -- aggregating homogeneous collections, specialised analyses, and new observations -- requires a flexible approach to systems implementation which enables pathways through the layers via common points of understanding, while simultaneously accommodating the emergence of previously unforeseen layers.

In this paper we follow a Linked Data approach to build a layered digital library based on content from the Internet Archive Live Music Archive. Starting from the recorded audio and basic information in the Archive, we first deploy a layer of catalogue metadata which allows an initial -- if imperfect -- consolidation of performer, song, and venue information. A processing layer extracts audio features from the original recordings, workflow provenance, and summary feature metadata. A further analysis layer provides tools for the user to combine audio and feature data, discovered and reconciled using interlinked catalogue and feature metadata from layers below.

Finally, we demonstrate the feasibility of the system through an investigation of 'key typicality' across performances. This highlights the need to incorporate robustness to inevitable 'imperfections' when undertaking scholarship within the digital library, be that from mislabelling, poor quality audio, or intrinsic limitations of computational methods. We do so not with the assumption that a 'perfect' version can be reached; but that a key benefit of a layered approach is to allow accurate representations of information to be discovered, combined, and investigated for informed interpretation.

Moderators
Speakers

Nicholas Cole

University of Oxford

Norman Meuschke

PhD Student, University of Konstanz, Germany
Research interests: Information Retrieval for text, images, and mathematical content; Plagiarism Detection; News Analysis; Citation and Link Analysis; Blockchain Technology; Information Visualization


Wednesday June 21, 2017 11:00 - 12:30
Room 205, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6

12:30

Lunch
Wednesday June 21, 2017 12:30 - 14:00
Innis Café Complex 2 Sussex Ave, Toronto, ON M5S 1J5

12:45

JCDL 2018 Organizing Committee Meeting (Closed Meeting)
This is the meeting of members of the JCDL 2018 Organizing Committee. This is a closed meeting for members of the organizing committee only.

Wednesday June 21, 2017 12:45 - 14:00
Room 312, Innis College 2 Sussex Ave, Toronto, ON M5S 1J5

14:00

Paper Session 06: Text Extraction and Analysis
Hannah Bast and Claudius Korzen. A Benchmark and Evaluation for Text Extraction from PDF (Full)
Extracting the body text from a PDF document is an important but surprisingly difficult task. The reason is that PDF is a layout-based format which specifies the fonts and positions of the individual characters rather than the semantic units of the text (e.g., words or paragraphs) and their role in the document (e.g., body text or footnote or caption). There is an abundance of extraction tools, but their quality and the range of their functionality are hard to determine.
In this paper, we show how to construct a high-quality benchmark of principally arbitrary size from parallel TeX and PDF data. We construct such a benchmark of 12,099 scientific articles from arXiv.org and make it publicly available. We establish a set of criteria for a clean and independent assessment of the semantic abilities of a given extraction tool. We provide an extensive evaluation of 13 state-of-the-art tools for text extraction from PDF on our benchmark according to our criteria. We include our own method, Icecite, which significantly outperforms all other tools, but is still not perfect. We outline the remaining steps necessary to finally make text extraction from PDF a solved problem.
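One criterion of this kind, agreement between extracted body text and the TeX-derived ground truth, can be approximated with a word-level diff; this generic measure is a stand-in, not the paper's exact criteria.

    import difflib

    def word_accuracy(ground_truth: str, extracted: str) -> float:
        """Fraction of ground-truth words recovered, via a word-level diff."""
        truth, out = ground_truth.split(), extracted.split()
        matcher = difflib.SequenceMatcher(a=truth, b=out, autojunk=False)
        matched = sum(block.size for block in matcher.get_matching_blocks())
        return matched / max(len(truth), 1)

    truth = "the quick brown fox jumps over the lazy dog"
    extracted = "the quick brown fox jumps ovcr the dog"  # one error, one omission
    print(f"word accuracy: {word_accuracy(truth, extracted):.0%}")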

Kresimir Duretec, Andreas Rauber and Christoph Becker. A text extraction software benchmark based on a synthesized dataset (Full)
Text extraction plays an important function for data processing workflows in digital libraries. For example, it is a crucial prerequisite for evaluating the quality of migrated textual documents. Complex file formats make the extraction process error-prone and have made it very challenging to verify the correctness of extraction components.
Based on digital preservation and information retrieval scenarios, three quality requirements for the effectiveness of text extraction tools are identified: 1) is a certain text snippet correctly extracted from a document, 2) does the extracted text appear in the right order relative to other elements, and 3) is the structure of the text preserved?
A number of text extraction tools are available that fulfill these three quality requirements to varying degrees. However, systematic benchmarks to evaluate those tools are still missing, mainly due to the lack of proper datasets with accompanying ground truth.
The contribution of this paper is two-fold. First we describe a dataset generation method based on model driven engineering principles and use it to synthesize a dataset and its ground truth directly from a model. Second, we define a benchmark for text extraction tools and complete an experiment to calculate performance measures for several tools that cover the three quality requirements. The results demonstrate the benefits of the approach in terms of scalability and effectiveness in generating ground truth for content and structure of text elements.

Tokinori Suzuki and Atsushi Fujii. Mathematical Document Categorization with Structure of Mathematical Expressions (Full)
A mathematical document is a document used for mathematical communication, for example a math paper or a discussion in an online Q&A community. Mathematical document categorization (MDC) is the task of classifying mathematical documents into mathematical categories, e.g. probability theory and set theory. This is an important task for supporting user search in today's widespread digital libraries and archiving services. Although mathematical expressions (MEs) in a document can carry essential information, being a central part of communication especially in math fields, how to utilize MEs for MDC has not matured. In this paper, we propose a classification method based on text combined with the structure of MEs, which are supposed to reflect conventions and rules specific to a category. We also present document collections built for evaluating MDC systems, with an investigation of category settings and their statistics. We demonstrate that our proposed method outperforms existing methods with state-of-the-art ME modeling in terms of F-measure.


Wednesday June 21, 2017 14:00 - 15:30
Room 205, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6

14:00

Paper Session 07: Collection Building
Federico Nanni, Simone Paolo Ponzetto and Laura Dietz. Building Entity-Centric Event Collections (Full)

*Best Student Paper Award Nominee

Web archives preserve an unprecedented abundance of primary sources for the diachronic tracking, examination and -- ultimately -- understanding of major events and transformations in our society. A topic of interest is, for example, the rise of Euroscepticism as a consequence of the recent economic crisis.
We present an approach for building event-centric sub-collections from large archives, which includes not only the core documents related to the event itself but, even more importantly, documents which describe related aspects (e.g., premises and consequences). This is achieved by 1) identifying relevant concepts and entities from a knowledge base, and 2) detecting mentions of these entities in documents, which is interpreted as an indicator for relevance. We extensively evaluate our system on two diachronic corpora, the New York Times Corpus and the US Congressional Record, and we test its performance on the TREC KBA Stream corpus, a large and publicly available web archive.
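Step 2 of the approach, treating detected entity mentions as a relevance indicator, can be sketched as a count-based ranking; exact string matching and the entity list below are illustrative simplifications of the paper's entity detection.

    import re

    # Entities expanded from a knowledge base for the event of interest,
    # e.g. the rise of Euroscepticism (an illustrative list).
    EVENT_ENTITIES = {"euroscepticism", "eurozone", "austerity", "troika"}

    def relevance(document: str) -> int:
        """Number of event-entity mentions, used to rank archive documents."""
        tokens = re.findall(r"[a-z]+", document.lower())
        return sum(tokens.count(entity) for entity in EVENT_ENTITIES)

    docs = [
        "Austerity measures fueled euroscepticism across the eurozone.",
        "The local football season opened on Saturday.",
    ]
    for doc in sorted(docs, key=relevance, reverse=True):
        print(relevance(doc), doc)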

Jan R. Benetka, Krisztian Balog and Kjetil Nørvåg. Towards Building a Knowledge Base of Monetary Transactions from a News Collection (Full)
We address the problem of extracting structured representations of economic events from a large corpus of news articles, using a combination of natural language processing and machine learning techniques. The developed techniques allow for semi-automatic population of a financial knowledge base, which, in turn, may be used to support a range of data mining and exploration tasks. The key challenge we face in this domain is that the same event is often reported multiple times, with varying correctness of details. We address this challenge by first collecting all information pertinent to a given event from the entire corpus, then considering all possible representations of the event, and finally ranking these representations by their associated confidence scores using a supervised learning method. A main innovative element of our approach is that it jointly extracts and stores all attributes of the event as a single representation (quintuple). Using a purpose-built test set we demonstrate that our supervised learning approach can achieve a 25% improvement in F1-score over baseline methods that consider the earliest, the latest or the most frequent reporting of an event.

Alexander Nwala, Michael Nelson, Michele Weigle, Adam Ziegler and Anastasia Aizman. Local Memory Project: providing tools to build collections of stories for local events from local sources (Full)
The national (non-local) news media has different priorities than the local news media. If one seeks to build a collection of stories about local events, the national news media may be insufficient, with the exception of local news which "bubbles" up to the national news media. If we rely exclusively on national media, or build collections exclusively on their reports, we could be late to the important milestones which precipitate major local events, and thus run the risk of losing important stories due to link rot and content drift. Consequently, it is important to consult local sources affected by local events. Our goal is to provide a suite of tools (beginning with two) under the umbrella of the Local Memory Project (LMP) to help users and small communities discover, collect, build, archive, and share collections of stories for important local events by leveraging local news sources. The first service (Geo) returns a list of local news sources (Newspaper, TV and Radio stations) in order of proximity to a user-supplied zip code. The second service (Local Stories Collection Generator) discovers, collects and archives a collection of news stories about a story or event represented by a user-supplied query and zip code pair. We evaluated 20 pairs of collections - Local (generated by our system) and non-Local - by measuring archival coverage, tweet index rate, temporal range, precision, and sub-collections overlap. Our experimental results showed Local and non-Local collections with archive rates of 0.63 and 0.83, respectively, and tweet index rates of 0.59 and 0.80, respectively. Local collections produced older stories than non-Local collections, but had a lower precision (relevance) of 0.77 compared to a non-Local precision of 0.91. These results indicate that Local collections are less exposed, and thus less popular, than their non-Local counterparts.
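The Geo service's core operation, ordering sources by proximity to a zip code, reduces to a great-circle-distance sort; the outlets and coordinates below are illustrative placeholders, not LMP data.

    from math import asin, cos, radians, sin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance between two points, in kilometres."""
        dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
        a = (sin(dlat / 2) ** 2
             + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
        return 2 * 6371 * asin(sqrt(a))

    # Hypothetical news sources with coordinates from a zip-code gazetteer.
    sources = [
        ("WTKR (TV)", 36.85, -76.29),
        ("The Virginian-Pilot (Newspaper)", 36.86, -76.28),
        ("WAVY (TV)", 36.84, -76.35),
    ]
    user_lat, user_lon = 36.89, -76.26  # centroid of the user-supplied zip code

    for name, lat, lon in sorted(
            sources, key=lambda s: haversine_km(user_lat, user_lon, s[1], s[2])):
        print(name)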

Moderators

Justin F. Brunelle

Lead Researcher, The MITRE Corporation
Lead Researcher at The MITRE Corporation and Adjunct Assistant Professor at Old Dominion University. Research interests include web science, digital preservation, cloud computing, and emerging technologies.

Speakers

Federico Nanni

Researcher, University of Mannheim

Michael Nelson

Professor, Old Dominion University

Alexander C. Nwala

PhD Student/Research Assistant, Old Dominion University

Michele Weigle

Associate Professor, Old Dominion University

Adam Ziegler

Managing Director, Harvard Library Innovation Lab, Harvard University
Adam Ziegler is an attorney and member of the Library Innovation Lab at Harvard Law School where he leads technology projects like Free the Law, Perma.cc and H2O. Before taking that role, he co-founded a legal tech startup named Mootus and represented clients for over a decade at…


Wednesday June 21, 2017 14:00 - 15:30
Innis Town Hall 2 Sussex Ave, Toronto, ON M5S 1J5

16:00

Paper Session 08: Classification and Clustering
Eduardo Castro, Saurabh Chakravarty, Eric Williamson, Denilson Pereira and Edward Fox. Classifying Short Unstructured Data using the Apache Spark Platform (Full)
People worldwide use Twitter to post updates about the events that concern them directly or indirectly. Study of these posts can help identify global events and trends of importance. Similarly, E-commerce applications organize their products in a way that can facilitate their management and satisfy the needs and expectations of their customers. However, classifying data such as tweets or product descriptions is still a challenge. These data are described by short texts, containing in their vocabulary abbreviations of sentences, emojis, hashtags, implicit codes, and other non-standard usage of written language. Consequently, traditional text classification techniques are not effective on these data. In this paper, we describe our use of the Spark platform to implement two classification strategies to process large data collections, where each datum is a short textual description. One of our solutions uses an associative classifier, while the other is based on a multiclass Logistic Regression classifier using Word2Vec as a feature selection and transformation technique. Our associative classifier captures the relationships among words that uniquely identify each class, and Word2Vec captures the semantic and syntactic context of the words. In our experimental evaluation, we compared our solutions, as well as Spark MLlib classifiers. We assessed effectiveness, efficiency, and memory requirements. The results indicate that our solutions are able to effectively classify the millions of data instances composed of thousands of distinct features and classes, found in our digital libraries.
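A minimal PySpark rendering of the second strategy (Word2Vec features feeding a multiclass logistic regression); the schema and the toy rows are stand-ins for the paper's collections.

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import Tokenizer, Word2Vec
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("short-text-classification").getOrCreate()

    # Toy short texts; in the paper these are millions of tweets or
    # product descriptions with thousands of classes.
    train = spark.createDataFrame(
        [("usb cable 2m fast charge", 0.0), ("earthquake hits coastal city", 1.0)],
        ["text", "label"],
    )

    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="text", outputCol="words"),
        Word2Vec(inputCol="words", outputCol="features", vectorSize=50, minCount=1),
        LogisticRegression(maxIter=10),
    ])
    model = pipeline.fit(train)
    model.transform(train).select("text", "prediction").show(truncate=False)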

Abel Elekes, Martin Schäler and Klemens Böhm. On the Various Semantics of Similarity in Word Embedding Models (Full)

*Best Student Paper Award Nominee

Finding similar words with the help of word embedding models has yielded meaningful results in many cases. However, the notion of similarity has remained ambiguous. In this paper, we examine when exactly similarity values in word embedding models are meaningful. To do so, we analyze the statistical distribution of similarity values systematically, in two series of experiments. The first one examines how the distribution of similarity values depends on the different embedding model algorithms and parameters. The second one starts by showing that intuitive similarity thresholds do not exist. We then propose a method stating which similarity values actually are meaningful for a given embedding model. In more abstract terms, our insights should give way to a better understanding of the notion of similarity in embedding models and to more reliable evaluations of such models.
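The kind of distributional analysis the paper performs can be sketched with gensim; the vector file and the sample size are placeholders.

    import random

    from gensim.models import KeyedVectors

    # Load pretrained vectors (the path stands for any word2vec-format file).
    vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    # Empirical distribution of cosine similarities of random word pairs:
    # if "similar" had a universal threshold, this distribution would look
    # alike across models and parameters; the paper shows it does not.
    vocab = list(vectors.key_to_index)
    random.seed(0)
    sims = sorted(vectors.similarity(random.choice(vocab), random.choice(vocab))
                  for _ in range(10_000))

    for q in (0.05, 0.5, 0.95):
        print(f"{q:.0%} quantile: {sims[int(q * len(sims))]:.3f}")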

Mirco Kocher and Jacques Savoy. Author Clustering Using Spatium (Short)
This paper presents the author clustering problem and compares it to related authorship attribution questions. The proposed model is based on a distance measure called Spatium, derived from the Canberra measure (a weighted version of the L1 norm). The selected features consist of the 200 most frequent words and punctuation symbols. An evaluation methodology is presented, with test collections extracted from the PAN CLEF 2016 evaluation campaign; in addition, we consider two further corpora reflecting the literature domain more closely. Across four different languages, the evaluation measures demonstrate high precision and high F1 values for all 20 test collections. A more detailed analysis explains some of the failures of the Spatium model.
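
For intuition, here is a minimal sketch of a Canberra-style distance over relative frequencies of the 200 most frequent tokens; the exact Spatium weighting from the paper is not reproduced, and the two-document corpus is a placeholder:

    import re
    from collections import Counter

    def tokens(text):
        # Words and individual punctuation marks, as in the paper's feature set.
        return re.findall(r"\w+|[^\w\s]", text.lower())

    def profile(text, vocab):
        counts = Counter(tokens(text))
        total = max(sum(counts.values()), 1)
        return [counts[w] / total for w in vocab]

    def canberra(a, b):
        # Sum of |x - y| / (x + y) over dimensions where either frequency is non-zero.
        return sum(abs(x - y) / (x + y) for x, y in zip(a, b) if x or y)

    texts = {  # placeholder corpus
        "doc1": "It was the best of times, it was the worst of times.",
        "doc2": "It is a truth universally acknowledged that a single man...",
    }
    vocab = [w for w, _ in Counter(
        t for txt in texts.values() for t in tokens(txt)).most_common(200)]
    d = canberra(profile(texts["doc1"], vocab), profile(texts["doc2"], vocab))
    print(f"Canberra distance: {d:.3f}")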

Shaobin Xu and David Smith. Retrieving and Combining Repeated Passages to Improve OCR (Short)
We present a novel approach to improving the output of optical character recognition (OCR) systems by first detecting duplicate passages in their output and then performing consensus decoding combined with a language model. This approach is orthogonal to, and may be combined with, previously proposed methods for combining the output of different OCR systems on the same text, or the output of the same OCR system on differently processed images of the same text. It may also be combined with methods that estimate the parameters of a noisy-channel model of OCR errors. Additionally, the current method generalizes previous proposals for simple majority-vote combination of known duplicated texts. On a corpus of historical newspapers, an annotated set of clusters has a baseline word error rate (WER) of 33%. A majority-vote procedure reaches 23% WER on passages where one or more duplicates were found, and consensus decoding combined with a language model achieves 18% WER. In a separate experiment, newspapers were aligned to very widely reprinted texts such as State of the Union speeches, producing clusters with up to 58 witnesses. Beyond 20 witnesses, simple majority vote outperforms language-model rescoring, though the gap between them is much smaller in this experiment.
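
The majority-vote baseline is easy to state concretely. A minimal sketch, assuming the witnesses have already been aligned token-by-token (the alignment step and the language-model rescoring are out of scope here):

    from collections import Counter

    def majority_vote(aligned_witnesses):
        """aligned_witnesses: token lists of equal length, with None marking
        a gap in that witness at that position."""
        out = []
        for column in zip(*aligned_witnesses):
            votes = Counter(t for t in column if t is not None)
            if votes:                       # skip columns that are all gaps
                out.append(votes.most_common(1)[0][0])
        return " ".join(out)

    witnesses = [
        ["the", "qnick", "brown", "fox"],
        ["the", "quick", "brown", "fox"],
        ["tho", "quick", "brown", None],
    ]
    print(majority_vote(witnesses))         # -> "the quick brown fox"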

Moderators

Peter Organisciak

University of Illinois at Urbana-Champaign

Speakers

Edward Fox

Professor, Virginia Tech (VPI&SU)
digital libraries, family, students, Reiki

Mirco Kocher

PhD student, University of Neuchâtel


Wednesday June 21, 2017 16:00 - 17:30
Room 205, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6

16:00

Paper Session 09: Content Provenance and Reuse
David Bamman, Michelle Carney, Jon Gillick, Cody Hennesy and Vijitha Sridhar. Estimating the date of first publication in a large-scale digital library (Full)
One prerequisite for cultural analysis in large-scale digital libraries is an accurate estimate of the date of composition, as distinct from publication, of the texts they contain. In this work, we present a manually annotated dataset of first dates of publication for three samples of books from the HathiTrust Digital Library (uniform random, uniform fiction, and stratified by decade), and empirically evaluate the disparity between these gold-standard labels and several approximations used in practice (using the date of publication as provided in metadata, several deduplication methods, and automatically predicting the date of composition from the text of the book). We find that a simple heuristic of metadata-based deduplication works best in practice, and that text-based composition dating is accurate enough to inform the analysis of "apparent time."
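
A minimal sketch of a metadata-based deduplication heuristic of the kind the paper evaluates (field names and the normalization are illustrative assumptions, not the authors' exact method): group records by a normalized author/title key and take the earliest catalogued year.

    import re
    from collections import defaultdict

    def dedup_key(record):
        # Order-insensitive bag of words, so "Dickens, Charles" matches
        # "Charles Dickens"; real systems use more careful normalization.
        toks = lambda s: tuple(sorted(re.findall(r"\w+", s.lower())))
        return (toks(record["author"]), toks(record["title"]))

    records = [  # illustrative catalogue entries, not HathiTrust fields
        {"author": "Dickens, Charles", "title": "Bleak House.", "year": 1912},
        {"author": "Charles Dickens",  "title": "Bleak house",  "year": 1853},
    ]
    groups = defaultdict(list)
    for r in records:
        groups[dedup_key(r)].append(r["year"])
    for k, years in groups.items():
        print(k, "-> estimated first publication:", min(years))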

George Buchanan and Dana Mckay. The Lowest form of Flattery: Characterising Text Re-use and Plagiarism Patterns in a Digital Library Corpus (Full)
The re-use of text, particularly its misuse or plagiarism, is a contentious issue for researchers, universities, libraries and publishers. Technological approaches to identifying student plagiarism, such as TurnItIn, are now widespread. Academic publishing, however, does not typically come under such scrutiny. While it is common knowledge that plagiarism occurs, we do not know how frequently or how extensively it does, nor where in a document it is likely to be found. This paper offers the first assessment of text re-use within the field of digital libraries. It also characterises text re-use generally (and plagiarism specifically) according to its location in the document, author seniority, publication venue and open access status. As a secondary contribution, we suggest routes towards more rigorous plagiarism detection and management in the future.

Bela Gipp, Corinna Breitinger, Norman Meuschke and Joeran Beel. CryptSubmit: Introducing Securely Timestamped Manuscript Submission and Peer Review Feedback using the Blockchain (Short)
Manuscript submission systems are a central fixture in scholarly publishing. However, researchers who submit their unpublished work to a conference or journal must trust that the system and its provider will not accidentally or willfully leak unpublished findings. Additionally, researchers must trust that the program committee and the anonymous peer reviewers will not plagiarize unpublished ideas or results. To address these weaknesses, we propose a method that automatically creates a publicly verifiable, tamper-proof timestamp for manuscripts utilizing the decentralized Bitcoin blockchain. The presented method hashes each submitted manuscript and uses the API of the timestamping service OriginStamp to persistently embed this manuscript hash on Bitcoin’s blockchain. Researchers can use this tamper-proof trusted timestamp to prove that their manuscript existed in its specific form at the time of submission to a conference or journal. This verifiability allows researchers to stake a claim to their research findings and intellectual property, even in the face of vulnerable submission platforms or dishonest peer reviewers. Optionally, the system also associates trusted timestamps with the feedback and ideas shared by peer reviewers to increase the traceability of ideas. The proposed concept, which we introduce as CryptSubmit, is currently being integrated into the open-source conference management system OJS. In the future, the method could be integrated at nearly no overhead cost into other manuscript submission systems, such as EasyChair, ConfTool, or Ambra. The introduced method can also improve electronic pre-print services and storage systems for research data.
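
The core timestamping step is straightforward to sketch. Below, the manuscript hash is computed with SHA-256; the hand-off of that hash to a timestamping service is indicated only as a placeholder, since OriginStamp's actual API is not reproduced here:

    import hashlib

    # In practice the submitted PDF would be read from disk; bytes stand in here.
    manuscript = b"%PDF-1.5 ... (manuscript bytes)"
    digest = hashlib.sha256(manuscript).hexdigest()
    print("SHA-256:", digest)

    # Hypothetical submission of the hash to a timestamping service; the URL
    # and payload are placeholders, not OriginStamp's documented API.
    # import requests
    # requests.post("https://timestamping.example/api/stamp", json={"hash": digest})

Only the hash leaves the submission system, so the manuscript itself is never disclosed; anyone holding the original file can later recompute the hash and verify the blockchain timestamp.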

Mayank Singh, Abhishek Niranjan, Divyansh Gupta, Nikhil Angad Bakshi, Animesh Mukherjee and Pawan Goyal. Citation sentence reuse behavior of scientists: A case study on massive bibliographic text dataset of computer science (Short)
Our current knowledge of scholarly plagiarism is largely based on the similarity between full-text research articles. In this paper, we propose a novel conceptualization of scholarly plagiarism as the reuse of explicit citation sentences in scientific research articles. Note that while full-text plagiarism is an indicator of gross-level behavior, the copying of citation sentences is a more nuanced, micro-scale phenomenon observed even among well-known researchers. The current work poses several interesting questions and attempts to answer them by empirically investigating a large bibliographic text dataset from computer science containing millions of lines of citation sentences. In particular, we report evidence of massive copying behavior, and we present several striking real examples throughout the paper to showcase the widespread adoption of this undesirable practice. Contrary to popular perception, we find that the tendency to copy increases as an author matures. The copying behavior exists in all fields of computer science; however, the theoretical fields indicate more copying than the applied fields.
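
A minimal sketch of one way to surface candidate reuse (the paper's actual pipeline is more elaborate): normalize citation sentences, mask the citation markers, and flag verbatim matches across papers with disjoint author sets.

    import re
    from collections import defaultdict

    def normalize(sentence):
        s = re.sub(r"\[\d+(?:,\s*\d+)*\]", "[REF]", sentence.lower())  # mask markers
        return re.sub(r"\s+", " ", s).strip()

    papers = {  # paper id -> (author set, citation sentences); toy data
        "P1": ({"a. roy"}, ["Smith et al. [3] proposed a scalable parser."]),
        "P2": ({"b. lee"}, ["Smith et al. [7] proposed a scalable parser."]),
    }
    occurrences = defaultdict(list)
    for pid, (authors, sentences) in papers.items():
        for s in sentences:
            occurrences[normalize(s)].append((pid, authors))
    for sentence, occ in occurrences.items():
        author_sets = [a for _, a in occ]
        if len(occ) > 1 and not set.intersection(*author_sets):
            print("possible reuse:", sentence, "in", [p for p, _ in occ])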

Moderators

Edie Rasmussen

University of British Columbia

Speakers

David Bamman

UC Berkeley

George Buchanan

University of Melbourne

Cody Hennesy

E-Learning Librarian, UC Berkeley Library

Dana McKay

University of Melbourne

Norman Meuschke

PhD Student, University of Konstanz, Germany
Research interests: Information Retrieval for text, images, and mathematical content; Plagiarism Detection; News Analysis; Citation and Link Analysis; Blockchain Technology; Information Visualization

Mayank Singh

Indian Institute of Technology Kharagpur


Wednesday June 21, 2017 16:00 - 17:30
Innis Town Hall 2 Sussex Ave, Toronto, ON M5S 1J5

18:30

Conference Banquet
The conference banquet will be held at the restaurant Sassafraz, 100 Cumberland Street, Toronto, Ontario, M5R 1A6, 416-964-2222 (http://www.sassafraz.ca).

The following link shows a Google map with directions from the conference hotel (Toronto Marriott Bloor Yorkville Hotel) to the Sassafraz restaurant: https://goo.gl/maps/UxNZywG3x3K2

Wednesday June 21, 2017 18:30 - 20:00
Sassafraz 100 Cumberland St, Toronto, ON M5R 1A6
 
Thursday, June 22
 

09:00

Panel Session: “Can We Really Show This?”: Ethics, Representation and Social Justice in Sensitive Digital Space
This panel addresses the ethical issues faced by curators who work with contentious, uncomfortable, or “sensitive” content relating to marginalized populations and underrepresented histories.

Deborah Maron
Dorothy Berry
Raegan Swanson
Erin White


Moderators

Debbie

Doctoral Student and Teaching Fellow, School of Information and Library Science
Critical Information Science, Philosophy of information science, metadata

Speakers

Dorothy J. Berry

Umbra Search Digitization and Metadata Lead, University of Minnesota Libraries

Erin White

Head, Digital Engagement, Virginia Commonwealth University
Talk to me about library technology, the web, user experience, digital collections and digital humanities.


Thursday June 22, 2017 09:00 - 10:30
Innis Town Hall 2 Sussex Ave, Toronto, ON M5S 1J5

09:00

Paper Session 10: Scientific Collections and Libraries
Abdussalam Alawini, Leshang Chen, Susan Davidson and Gianmaria Silvello. Automating data citation: the eagle-i experience (Full)
Data citation is of growing concern for owners of curated databases, who wish to give credit to the contributors and curators responsible for portions of the dataset and to enable the data retrieved by a query to be examined later. While several databases specify how data should be cited, they leave it to users to construct the citations manually, and do not generate them automatically.

We report our experiences in automating data citation for an RDF dataset called eagle-i, and discuss how to generalize this to a citation framework that can work across a variety of different types of databases (e.g., relational, XML, and RDF). We also describe how a database administrator would use this framework to automate citation for a particular dataset.

Sandipan Sikdar, Matteo Marsili, Niloy Ganguly and Animesh Mukherjee. Influence of Reviewer Interaction Network on Long-term Citations: A Case Study of the Scientific Peer-Review System of the Journal of High Energy Physics (Full)

*Best Student Paper Award Nominee

A peer-review system, in the context of judging research contributions, is one of the prime steps undertaken to ensure the quality of the submissions received; a significant portion of the publishing budget is spent by publication houses on the successful completion of peer review. Nevertheless, the scientific community is largely reaching a consensus that the peer-review system, although indispensable, is nonetheless flawed.

A very pertinent question therefore is: could this system be improved? In this paper, we attempt to answer this question by considering a massive dataset of around 29k papers with roughly 70k distinct review reports, together comprising 12m lines of review text, from the Journal of High Energy Physics (JHEP) from 1997 to 2015. Specifically, we introduce a novel reviewer-reviewer interaction network (an edge exists between two reviewers if they were assigned by the same editor) and show that, surprisingly, simple structural properties of this network, such as degree, clustering coefficient, and centrality (closeness, betweenness, etc.), serve as strong predictors of the long-term citations (i.e., the overall scientific impact) of a submitted paper. Compared to a set of baseline features built from the basic characteristics of the submitted papers, the authors, and the referees (e.g., the popularity of the submitting author, the acceptance-rate history of a referee, the linguistic properties of the review reports, etc.), the network features perform manifold better. Although we do not claim to provide a full-fledged reviewer recommendation system (one that could potentially replace an editor), our method could be extremely useful in assisting editors in deciding the acceptance or rejection of a paper, thereby improving the effectiveness of the peer-review system.
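
A minimal sketch of the network construction and feature extraction, using networkx and synthetic assignment data (the aggregation of these values into a per-paper citation predictor is omitted):

    from itertools import combinations
    import networkx as nx

    # editor -> reviewers they assigned; co-assigned reviewers become linked.
    assignments = {"ed1": ["r1", "r2", "r3"], "ed2": ["r2", "r4"]}

    G = nx.Graph()
    for reviewers in assignments.values():
        G.add_edges_from(combinations(reviewers, 2))

    features = {
        "degree": dict(G.degree()),
        "clustering": nx.clustering(G),
        "closeness": nx.closeness_centrality(G),
        "betweenness": nx.betweenness_centrality(G),
    }
    for name, values in features.items():
        print(name, values)
    # Per-reviewer values like these, aggregated over a paper's referees,
    # feed the predictor of the paper's long-term citation count.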

Martin Klein and Herbert Van De Sompel. Discovering Scholarly Orphans Using ORCID (Full)
Archival efforts such as (C)LOCKSS and Portico are in place to ensure the longevity of traditional scholarly resources like journal articles. At the same time, researchers are depositing a broad variety of other scholarly artifacts into emerging online portals designed to support web-based scholarship. These web-native scholarly objects are largely neglected by current archival practices, and hence they become scholarly orphans. We therefore argue for a novel paradigm tailored towards archiving these scholarly orphans. We investigate the feasibility of using the Open Researcher and Contributor ID (ORCID) as a supporting infrastructure for discovering the web identities and scholarly orphans of active researchers. We analyze ORCID in terms of coverage of researchers, subjects, and locations, and assess the richness of its profiles in terms of web identities and scholarly artifacts. We find that ORCID currently falls short in all considered aspects and hence can only be used in conjunction with other discovery sources. However, ORCID is growing fast, so it could achieve a satisfactory level of coverage and richness in the near future.
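
A minimal sketch of pulling web identities from an ORCID profile via the public API; the v2.1 endpoint and response structure shown match the public API of that period, but should be verified against ORCID's documentation (the identifier is ORCID's published example record):

    import requests

    orcid_id = "0000-0002-1825-0097"   # ORCID's published example record
    resp = requests.get(f"https://pub.orcid.org/v2.1/{orcid_id}/record",
                        headers={"Accept": "application/json"})
    record = resp.json()

    # Researcher URLs are one source of the 'web identities' the paper mines.
    person = record.get("person") or {}
    researcher_urls = person.get("researcher-urls") or {}
    for u in researcher_urls.get("researcher-url") or []:
        print(u.get("url", {}).get("value"))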

Speakers

Susan Davidson

University of Pennsylvania

Martin Klein

Scientist, Los Alamos National Laboratory

Sandipan Sikdar

Indian Institute of Technology Kharagpur

Gianmaria Silvello

Researcher, University of Padua
Data Citation

Herbert Van De Sompel

Scientist, Los Alamos National Laboratory
Herbert Van de Sompel graduated in Mathematics and Computer Science at Ghent University (Belgium), and in 2000 obtained a Ph.D. in Communication Science there. For many years, he headed Library Automation at Ghent University. After leaving Ghent in 2000, he was Visiting Professor in Computer Science at Cornell University, and Director of e-Strategy and Programmes at the British Library. Currently, he is the team leader of the Prototyping Team at the Research Library of the Los Alamos National Laboratory. The team does research regarding various aspects of scholarly communication in the digital age, including information infrastructure, interoperability, digital preservation and indicators for the assessment of the quality of units of scholarly communication. Herbert has played a major role in creating the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), the Open Archives Initiative Object Reuse...


Thursday June 22, 2017 09:00 - 10:30
Room 325, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6

09:00

WCSA Working Meeting (Closed Meeting)
Thursday June 22, 2017 09:00 - 15:00
Room 312, Innis College 2 Sussex Ave, Toronto, ON M5S 1J5

11:00

Keynote: Salvatore Mele


Preprints have shaped scholarly communication in High-Energy Physics for more than half a century. Pioneering preprint servers and community-driven digital libraries created a unique ecosystem for information discovery and access in the discipline (as well as in Astronomy and, to some extent, branches of Economics and the Social Sciences). These infrastructures have long coexisted with academic journals, serving distinct needs along the spectrum from dissemination to certification. Recently, some academic journals ‘flipped’ entirely to Open Access, modifying some of those roles. We report on the results of two data-driven studies assessing this coexistence and complementarity.

First, leveraging information from the INSPIREHEP.net platform, we analyzed millions of citations to and from preprints and journal articles to study the effect of the early availability of scientific information on citation patterns in the discipline.

Second, with the gracious help of arXiv.org and leading scholarly publishers, we compared download statistics to assess access patterns for ‘fresh’ and ‘archival’ material and how Open Access modifies researchers’ practices.

These findings are particularly relevant in the current scenario of renewed attention to preprints as a medium for scholarly communication.


Speakers

Salvatore Mele

Head of Open Access, CERN
Salvatore Mele holds a PhD in Physics and is head of Open Access at CERN, where he architected the SCOAP3 initiative [scoap3.org]: a partnership of 3,000 libraries and funding agencies in 46 countries which converted the majority of High-Energy Physics articles to Open Access. It is transparent for authors and leverages the CERN model of international collaboration...


Thursday June 22, 2017 11:00 - 12:30
Innis Town Hall 2 Sussex Ave, Toronto, ON M5S 1J5

12:30

Lunch
Thursday June 22, 2017 12:30 - 14:00
Innis Café Complex 2 Sussex Ave, Toronto, ON M5S 1J5

14:00

PSDL 2017: Physical Samples and Digital Libraries Part 1--CANCELLED

Research in disciplines such as the earth and biological sciences depends on the availability of representative physical samples, which often have been collected at substantial cost and effort, and some of which are irreplaceable. The EarthCube iSamples (Internet of Samples in the Earth Sciences) RCN (Research Coordination Network), funded by the National Science Foundation, aims to connect physical samples and sample collections across the Earth Sciences with digital data infrastructures to revolutionize their utility in the support of science. The goal of this workshop is to attract a broad audience of biologists, earth scientists, others working with physical samples, and data curators, along with computer and information scientists, so they can learn from each other about the requirements of physical as well as digital sample and collection management. This is the fourth in a series of workshops, following those held at JCDL 2016 and the ASIS&T 2016 Annual Meeting and a forthcoming workshop at iConference 2017, which aim to develop a global community of scholars whose work relates to physical samples.

This workshop runs over two days:

  • June 22 2-5pm
  • June 23 9am-12pm

 

For more information, including submission instructions, see workshop webpage.

This program has been cancelled as of June 14, 2017.


Thursday June 22, 2017 14:00 - 17:00
Room 325, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6

14:00

Rich Semantics and Direct Representation for Digital Collections

Rich semantics supports detailed information organization for the contents of documents, across documents, and even across resources in different modalities. In its strongest form, rich semantics provides highly-structured direct representations. This workshop welcomes papers on new directions for frameworks using such rich information organization. Rich semantics goes beyond simple models for linked data such as those using RDF-based triples and beyond ad hoc ontologies. Rather, rich semantic frameworks may include complex entities, dynamic models, schemas, systems, and descriptive programs.

For more information, including submission instructions, see workshop webpage.


Thursday June 22, 2017 14:00 - 17:00
Room 4036, Blackburn Room, Robarts Library 140 St. George Street, Toronto, ON

14:00

RUMOUR-2017 Workshop On Social Media and the Web of Linked Data
RUMOUR-2017 aims to gather innovative approaches for the exploitation of social media using semantic web technologies and linked data, bringing together research on the Semantic Web, Linked Data, and the Social Sciences. The aim of this workshop is to expand an internationally recognized forum for scientific research in ICT, drawing on fields such as the semantic web, social networks and multi-agent systems, knowledge integration, etc.

Workshop Website: https://profs.info.uaic.ro/~rumour/

Thursday June 22, 2017 14:00 - 17:00
Room 417, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6

14:00

Web Archiving and Digital Libraries (WADL) Part 1

This workshop will explore the integration of Web archiving and digital libraries, covering the complete life cycle involved: creation/authoring, uploading/publishing on the Web (2.0), (focused) crawling, indexing, exploration (searching, browsing), archiving (of events), etc. It will include particular coverage of current topics of interest, such as big data, mobile web archiving, and systems (e.g., Memento, SiteStory, Hadoop processing).

This workshop runs over two days:

  • June 22 2-5pm
  • June 23 9am-12pm

 

For more information, including submission instructions, see workshop webpage.


Thursday June 22, 2017 14:00 - 17:00
Room 205, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6
 
Friday, June 23
 

09:00

PSDL 2017: Physical Samples and Digital Libraries Part 2--CANCELLED

Research in disciplines such as the earth and biological sciences depends on the availability of representative physical samples, which often have been collected at substantial cost and effort, and some of which are irreplaceable. The EarthCube iSamples (Internet of Samples in the Earth Sciences) RCN (Research Coordination Network), funded by the National Science Foundation, aims to connect physical samples and sample collections across the Earth Sciences with digital data infrastructures to revolutionize their utility in the support of science. The goal of this workshop is to attract a broad audience of biologists, earth scientists, others working with physical samples, and data curators, along with computer and information scientists, so they can learn from each other about the requirements of physical as well as digital sample and collection management. This is the fourth in a series of workshops, following those held at JCDL 2016 and the ASIS&T 2016 Annual Meeting and a forthcoming workshop at iConference 2017, which aim to develop a global community of scholars whose work relates to physical samples.

This workshop runs over two days:

  • June 22 2-5pm
  • June 23 9am-12pm

 

For more information, including submission instructions, see workshop webpage.

This program has been cancelled as of June 14, 2017.


Friday June 23, 2017 09:00 - 12:00
Room 325, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6

09:00

Web Archiving and Digital Libraries (WADL) Part 2

This workshop will explore the integration of Web archiving and digital libraries, covering the complete life cycle involved: creation/authoring, uploading/publishing on the Web (2.0), (focused) crawling, indexing, exploration (searching, browsing), archiving (of events), etc. It will include particular coverage of current topics of interest, such as big data, mobile web archiving, and systems (e.g., Memento, SiteStory, Hadoop processing).

This workshop runs over two days:

  • June 22 2-5pm
  • June 23 9am-12pm

 

For more information, including submission instructions, see workshop webpage.


Friday June 23, 2017 09:00 - 12:00
Room 205, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6

12:00

Lunch
Friday June 23, 2017 12:00 - 14:00
TBA