ACM/IEEE Joint Conference on Digital Libraries 2017
University of Toronto
JCDL 2017 | #JCDL@2017
Wednesday, June 21 • 16:00 - 17:30
Paper Session 09: Content Provenance and Reuse

David Bamman, Michelle Carney, Jon Gillick, Cody Hennesy and Vijitha Sridhar. Estimating the date of first publication in a large-scale digital library (Full)
One prerequisite for cultural analysis in large-scale digital libraries is an accurate estimate of the date of composition--as distinct from publication--of the texts they contain. In this work, we present a manually annotated dataset of first dates of publication of three samples of books from the HathiTrust Digital Library (uniform random, uniform fiction, and stratified by decade), and empirically evaluate the disparity between these gold standard labels and several approximations used in practice (using the date of publication as provided in metadata, several deduplication methods, and automatically predicting the date of composition from the text of the book). We find that a simple heuristic of metadata-based deduplication works best in practice, and text-based composition dating is accurate enough to inform the analysis of "apparent time."
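As a rough illustration of what a metadata-based deduplication heuristic might look like, the sketch below groups catalogue records by a normalized (author, title) key and keeps the earliest year seen for each work. The record layout, the `normalize` helper, and the grouping key are all assumptions made for this example; the abstract does not say which heuristic the paper actually evaluates.

```python
import re


def normalize(text):
    """Lowercase and collapse punctuation so near-identical metadata strings match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def earliest_dates(records):
    """Map each (author, title) key to the earliest publication year seen.

    `records` is an iterable of (author, title, year) tuples; this layout
    is hypothetical and not taken from the paper.
    """
    earliest = {}
    for author, title, year in records:
        key = (normalize(author), normalize(title))
        if key not in earliest or year < earliest[key]:
            earliest[key] = year
    return earliest


if __name__ == "__main__":
    sample = [
        ("Austen, Jane", "Emma", 1896),   # later reprint
        ("Austen, Jane.", "Emma.", 1816), # earlier edition, noisier metadata
        ("Melville, Herman", "Moby Dick", 1851),
    ]
    for key, year in earliest_dates(sample).items():
        print(key, year)
```

A reprint then inherits the earliest date recorded for any copy that shares its normalized key, which approximates first publication only as well as the metadata allows.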

George Buchanan and Dana McKay. The Lowest form of Flattery: Characterising Text Re-use and Plagiarism Patterns in a Digital Library Corpus (Full)
The re-use of text, particularly misuse or plagiarism, is a contentious issue for researchers, universities, libraries and publishers. Technological approaches to identifying student plagiarism, such as TurnItIn, are now widespread. Academic publishing, however, does not typically come under such scrutiny. While it is common knowledge that plagiarism occurs, we do not know how frequently or how extensively, nor where in a document it is likely to be found. This paper offers the first assessment of text re-use within the field of digital libraries. It also characterises text re-use generally (and plagiarism specifically) according to its location in the document, author seniority, publication venue and open access. As a secondary contribution, we suggest routes towards more rigorous plagiarism detection and management in the future.

Bela Gipp, Corinna Breitinger, Norman Meuschke and Joeran Beel. CryptSubmit: Introducing Securely Timestamped Manuscript Submission and Peer Review Feedback using the Blockchain (Short)
Manuscript submission systems are a central fixture in scholarly publishing. However, researchers who submit their unpublished work to a conference or journal must trust that the system and its provider will not accidentally or willfully leak unpublished findings. Additionally, researchers must trust that the program committee and the anonymous peer reviewers will not plagiarize unpublished ideas or results. To address these weaknesses, we propose a method that automatically creates a publicly verifiable, tamper-proof timestamp for manuscripts utilizing the decentralized Bitcoin blockchain. The presented method hashes each submitted manuscript and uses the API of the timestamping service OriginStamp to persistently embed this manuscript hash on Bitcoin’s blockchain. Researchers can use this tamper-proof trusted timestamp to prove that their manuscript existed in its specific form at the time of submission to a conference or journal. This verifiability allows researchers to stake a claim to their research findings and intellectual property, even in the face of vulnerable submission platforms or dishonest peer reviewers. Optionally, the system also associates trusted timestamps with the feedback and ideas shared by peer reviewers to increase the traceability of ideas. The proposed concept, which we introduce as CryptSubmit, is currently being integrated into the open-source conference management system OJS. In the future, the method could be integrated at nearly no overhead cost into other manuscript submission systems, such as EasyChair, ConfTool, or Ambra. The introduced method can also improve electronic pre-print services and storage systems for research data.
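The hashing step described above can be sketched as follows. This is a minimal illustration, assuming SHA-256 as the hash function (the abstract does not name one); the function names are hypothetical, and the call to the timestamping service is deliberately left out because its API is not described here.

```python
import hashlib


def manuscript_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of a manuscript's raw bytes.

    SHA-256 is an assumption for this sketch, not a detail from the abstract.
    """
    return hashlib.sha256(data).hexdigest()


def matches_recorded_digest(data: bytes, recorded_digest: str) -> bool:
    """Check whether a file still matches a digest that was timestamped earlier."""
    return manuscript_digest(data) == recorded_digest.strip().lower()


if __name__ == "__main__":
    submitted = b"Example manuscript contents"
    digest = manuscript_digest(submitted)
    print(digest)
    # Any change to the file, however small, produces a different digest.
    print(matches_recorded_digest(submitted, digest))
    print(matches_recorded_digest(submitted + b" (revised)", digest))
```

Only the digest would need to be embedded in the blockchain, so the manuscript itself stays private while its existence at submission time remains verifiable.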

Mayank Singh, Abhishek Niranjan, Divyansh Gupta, Nikhil Angad Bakshi, Animesh Mukherjee and Pawan Goyal. Citation sentence reuse behavior of scientists: A case study on massive bibliographic text dataset of computer science (Short)
Our current knowledge of scholarly plagiarism is largely based on the similarity between full-text research articles. In this paper, we propose a novel conceptualization of scholarly plagiarism in the form of reuse of explicit citation sentences in scientific research articles. Note that while full-text plagiarism is an indicator of a gross-level behavior, copying of citation sentences is a more nuanced micro-scale phenomenon observed even for well-known researchers. The current work poses several interesting questions and attempts to answer them by empirically investigating a large bibliographic text dataset from computer science containing millions of lines of citation sentences. In particular, we report evidence of massive copying behavior. We also present several striking real examples throughout the paper to showcase widespread adoption of this undesirable practice. Contrary to popular perception, we find that the tendency to copy increases as an author matures. The copying behavior is reported to exist in all fields of computer science; however, the theoretical fields indicate more copying than the applied fields.


Edie Rasmussen

University of British Columbia


David Bamman

UC Berkeley

George Buchanan

University of Melbourne

Cody Hennesy

E-Learning Librarian, UC Berkeley Library

Dana McKay

University of Melbourne

Norman Meuschke

PhD Student, University of Konstanz, Germany
Research interests: | Information Retrieval for text, images, and mathematical content | Plagiarism Detection | News Analysis | Citation and Link Analysis | Blockchain Technology | Information Visualization

Mayank Singh

Indian Institute of Technology Kharagpur

Wednesday June 21, 2017 16:00 - 17:30
Innis Town Hall, 2 Sussex Ave, Toronto, ON M5S 1J5

