ACM/IEEE Joint Conference on Digital Libraries 2017
University of Toronto
JCDL 2017 | #JCDL@2017
Tuesday, June 20 • 11:00 - 12:30
Paper Session 01: Web Archives

Justin Brunelle, Michele Weigle and Michael Nelson. Archival Crawlers and JavaScript: Discover More Stuff but Crawl More Slowly (Full)
The web is today's primary publication medium, making web archiving an important activity for historical and analytical purposes. Web pages are correspondingly interactive, resulting in pages that are increasingly difficult to archive. JavaScript enables interactions that can potentially change the client-side state of a representation. We refer to representations that load embedded resources via JavaScript as deferred representations. It is difficult to discover and crawl all of the resources in deferred representations, and the result of archiving deferred representations is archived web pages that are either incomplete or erroneously load embedded resources from the live web. We propose a method of discovering and archiving deferred representations and their descendants (representation states) that are only reachable through client-side events. Our approach identified an average of 38.5 descendants per seed URI crawled, 70.9% of which are reached through an onclick event. This approach also added 15.6 times more embedded resources than Heritrix to the crawl frontier, but at a crawl rate that was 38.9 times slower than simply using Heritrix. If our method were applied to the July 2015 Common Crawl dataset, a web-scale archival crawler would discover an additional 7.17 PB (5.12 times more) of information per year. This illustrates the significant increase in resources necessary for more thorough archival crawls.

Faryaneh Poursardar and Frank Shipman. What is Part of that Resource? User Expectations for Personal Archiving (Short)
Users wish to preserve Internet resources for later use. But what is part of and what is not part of an Internet resource remains an open question. In this paper we examine how specific relationships between web pages affect user perceptions of their being part of the same resource. This study presented participants with pairs of pages and asked about their expectation for having access to the second page after they save the first. The primary-page content in the study comes from multi-page stories, multi-image collections, product pages with reviews and ratings on separate pages, and short single-page writings. Participants were asked to agree or disagree with three statements regarding their expectation for later access. Nearly 80% of participants agreed in the case of articles spread across multiple pages, images in the same collection, and additional details or assessments of product information. About 50% agreed for related content on pages linked to by the original page or related items, while only about 30% thought advertisements or wish lists linked to were part of the resource. Differences in responses to the same page pairs for the three statements regarding later access indicate that users recognize the difference between what would be valuable to them and current implementations of saving web content.

Weijia Xu, Maria Esteva, Deborah Beck and Yi-Hsuan Hsieh. A Portable Strategy for Preserving Web Applications and Data Functionality (Short)
Increasingly, the value of research data resides not only in its content but also in how it is made available to users. To introduce complex research topics, data is often presented interactively through a web application, the design of which is the result of years of work by researchers. Therefore, preserving the data and the application's functionalities becomes equally important. In the current academic IT environment it is often the case that these web applications are developed by multiple people with different expertise, deployed within shared technology infrastructures, and that they evolve technically and in content over short periods of time. This lifecycle model presents challenges to reproducibility and portability of the application across technology platforms over time. Preservation approaches such as virtualization and emulation may not be applicable to these cases due, among other issues, to the co-dependencies of the hosting infrastructure, to missing documentation about the original development, and to the evolving nature of these applications. To address these issues, we propose a functional preservation strategy that decouples web applications and their corresponding data from their hosting environment and re-launches data and web code in a more portable environment without compromising the look and feel or the interactive features. Crucial to the strategy is identifying discrepancies between the application and the existing hosting environment, including library dependencies and system configuration, and adapting them within a simplified virtual environment to bring up the application's functionality. Advantages over virtualization and emulation reside in not having to recreate one or all of the layers of the original hosting environment.
We demonstrate this approach using as a case study the Speech Presentation in Homeric Epics database, a digital humanities project, and evaluate portability by deploying the application in two different hosting environments. We also assess the strategy in relation to the ease with which a non-savvy user can re-launch the application in a new host.

Sawood Alam, Mat Kelly, Michele Weigle and Michael Nelson. Client-side Reconstruction of Composite Mementos Using ServiceWorker (Short)
We use the ServiceWorker (SW) web API to intercept HTTP requests for embedded resources and reconstruct Composite Mementos without the need for conventional URL rewriting typically performed by web archives. URL rewriting is a problem for archival replay systems, especially for URLs constructed by JavaScript, frequently resulting in incorrect URI references. By intercepting requests on the client using SW, we are able to strategically reroute instead of rewrite. Our implementation moves rewriting to clients, saving servers' computing resources and allowing servers to return responses more quickly. Our experiments show that retrieving the original instead of rewritten pages from the archive reduces time overhead by 35.66% and data overhead by 19.68%. Our system prevents Composite Mementos from leaking the live web while being easy to distribute and maintain.
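
The "reroute instead of rewrite" idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the archive prefix, the fixed capture datetime, and the `rerouteToArchive` helper are hypothetical names chosen for illustration, and a real replay system would derive the datetime from the memento being viewed rather than from a constant.

```typescript
// Hypothetical archive endpoint and capture timestamp (assumptions, not from the paper).
const ARCHIVE_PREFIX = "https://archive.example.org/memento";
const CAPTURE_DATETIME = "20170620110000";

// Map a requested URL to the URL of its archived copy.
// Requests that already point into the archive are left unchanged.
function rerouteToArchive(
  requestUrl: string,
  archivePrefix: string = ARCHIVE_PREFIX,
  datetime: string = CAPTURE_DATETIME
): string {
  if (requestUrl.indexOf(archivePrefix + "/") === 0) {
    return requestUrl;
  }
  return `${archivePrefix}/${datetime}/${requestUrl}`;
}

// Register the fetch handler only when running inside a ServiceWorker,
// so the file can also be loaded elsewhere without side effects.
const scope = globalThis as any;
if (
  typeof scope.ServiceWorkerGlobalScope !== "undefined" &&
  scope instanceof scope.ServiceWorkerGlobalScope
) {
  scope.addEventListener("fetch", (event: any) => {
    // Simplification: only the URL is carried over; the original request's
    // method and headers are not forwarded in this sketch.
    const target = rerouteToArchive(event.request.url);
    event.respondWith(scope.fetch(target));
  });
}
```

Because the page's markup is served unmodified and the decision is made per request in the client, URLs that JavaScript constructs at runtime pass through the same handler as URLs written in the HTML.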

Martin Klein

Scientist, Los Alamos National Laboratory

Sawood Alam

Researcher, Old Dominion University
I am commonly tagged by Web, Digital Library, Web Archiving, Ruby on Rails, PHP, HTML, CSS, JavaScript, ExtJS, Go, Urdu, RTL, Docker, and Linux.

Justin F. Brunelle

Lead Researcher, The MITRE Corporation
Lead Researcher at The MITRE Corporation and Adjunct Assistant Professor at Old Dominion University. Research interests include: web science, digital preservation, cloud computing, emerging technologies

Michael Nelson

Professor, Old Dominion University

Faryaneh Poursardar

PhD Candidate, Texas A&M University
Web archive, HCI

Michele Weigle

Associate Professor, Old Dominion University

Tuesday June 20, 2017 11:00 - 12:30 EDT
Innis Town Hall 2 Sussex Ave, Toronto, ON M5S 1J5