ACM/IEEE Joint Conference on Digital Libraries 2017
University of Toronto
JCDL 2017 | #JCDL@2017
Wednesday, June 21 • 16:00 - 17:30
Paper Session 08: Classification and Clustering

Eduardo Castro, Saurabh Chakravarty, Eric Williamson, Denilson Pereira and Edward Fox. Classifying Short Unstructured Data using the Apache Spark Platform (Full)
People worldwide use Twitter to post updates about the events that concern them directly or indirectly. Study of these posts can help identify global events and trends of importance. Similarly, E-commerce applications organize their products in a way that can facilitate their management and satisfy the needs and expectations of their customers. However, classifying data such as tweets or product descriptions is still a challenge. These data are described by short texts, containing in their vocabulary abbreviations of sentences, emojis, hashtags, implicit codes, and other non-standard usage of written language. Consequently, traditional text classification techniques are not effective on these data. In this paper, we describe our use of the Spark platform to implement two classification strategies to process large data collections, where each datum is a short textual description. One of our solutions uses an associative classifier, while the other is based on a multiclass Logistic Regression classifier using Word2Vec as a feature selection and transformation technique. Our associative classifier captures the relationships among words that uniquely identify each class, and Word2Vec captures the semantic and syntactic context of the words. In our experimental evaluation, we compared our solutions, as well as Spark MLlib classifiers. We assessed effectiveness, efficiency, and memory requirements. The results indicate that our solutions are able to effectively classify the millions of data instances composed of thousands of distinct features and classes, found in our digital libraries.

Abel Elekes, Martin Schäler and Klemens Böhm. On the Various Semantics of Similarity in Word Embedding Models (Full)

*Best Student Paper Award Nominee

Finding similar words with the help of word embedding models has yielded meaningful results in many cases. However, the notion of similarity has remained ambiguous. In this paper, we examine when exactly similarity values in word embedding models are meaningful. To do so, we analyze the statistical distribution of similarity values systematically, in two series of experiments. The first one examines how the distribution of similarity values depends on the different embedding model algorithms and parameters. The second one starts by showing that intuitive similarity thresholds do not exist. We then propose a method stating which similarity values actually are meaningful for a given embedding model. In more abstract terms, our insights should give way to a better understanding of the notion of similarity in embedding models and to more reliable evaluations of such models.
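As a rough illustration of what "analyzing the distribution of similarity values" can look like in practice, the sketch below computes all pairwise cosine similarities for a tiny hand-made embedding table and reads off an empirical quantile. The toy vectors, the function names, and the choice of cosine similarity with a simple quantile are assumptions made for this example; the abstract does not describe the authors' actual procedure.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarities(embeddings):
    """Sorted list of cosine similarities over all unordered word pairs."""
    words = sorted(embeddings)
    return sorted(
        cosine(embeddings[a], embeddings[b]) for a, b in combinations(words, 2)
    )


def empirical_quantile(sorted_values, q):
    """Value at fraction q (0..1) of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("need at least one similarity value")
    index = min(int(q * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]


# Hypothetical 3-dimensional vectors, for illustration only.
toy = {
    "king": [0.9, 0.1, 0.0],
    "queen": [0.8, 0.2, 0.1],
    "apple": [0.0, 0.1, 0.9],
}
sims = pairwise_similarities(toy)
```

With a real model, comparing one pair's similarity against quantiles of this distribution is one way to judge whether the value stands out for that particular model, rather than relying on a fixed threshold.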

Mirco Kocher and Jacques Savoy. Author Clustering Using Spatium (Short)
This paper presents the author clustering problem and compares it to related authorship attribution questions. The proposed model is based on a distance measure called Spatium derived from the Canberra measure (weighted version of L1 norm). The selected features consist of the 200 most frequent words and punctuation symbols. An evaluation methodology is presented and the test collections are extracted from the PAN CLEF 2016 evaluation campaign. In addition to those, we also consider two additional corpora reflecting the literature domain more closely. Based on four different languages, the evaluation measures demonstrate a high precision and high F1 values for all 20 test collections. A more detailed analysis provides reasons explaining some of the failures of the Spatium model.
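For readers unfamiliar with the Canberra measure mentioned above, here is a minimal sketch of the standard Canberra distance applied to relative term frequencies. It is not the Spatium model itself, whose details the abstract does not give; the helper names and the tiny vocabulary are hypothetical.

```python
from collections import Counter


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary term in a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[term] / total if total else 0.0 for term in vocabulary]


def canberra(x, y):
    """Canberra distance: sum of |a - b| / (|a| + |b|), skipping 0/0 terms."""
    distance = 0.0
    for a, b in zip(x, y):
        denominator = abs(a) + abs(b)
        if denominator > 0:
            distance += abs(a - b) / denominator
    return distance


# Hypothetical vocabulary; the paper uses the 200 most frequent words and
# punctuation symbols, which are not listed in the abstract.
vocabulary = ["the", "of", ","]
text_a = "the cat , the dog".split()
text_b = "of the night , of the day".split()
distance = canberra(
    relative_frequencies(text_a, vocabulary),
    relative_frequencies(text_b, vocabulary),
)
```

Each term contributes at most 1 to the sum, so rare and frequent features carry comparable weight, which is the sense in which the measure is a weighted L1 norm.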

Shaobin Xu and David Smith. Retrieving and Combining Repeated Passages to Improve OCR (Short)
We present a novel approach to improve the output of optical character recognition (OCR) systems by first detecting duplicate passages in their output and then performing consensus decoding combined with a language model. This approach is orthogonal to, and may be combined with, previously proposed methods for combining the output of different OCR systems on the same text or the output of the same OCR system on differently processed images of the same text. It may also be combined with methods to estimate the parameters of a noisy channel model of OCR errors. Additionally, the current method generalizes previous proposals for simple majority-vote combination of known duplicated texts. On a corpus of historical newspapers, an annotated set of clusters has a baseline word error rate (WER) of 33%. A majority vote procedure reaches 23% on passages where one or more duplicates were found, and consensus decoding combined with a language model achieves 18% WER. In a separate experiment, newspapers were aligned to very widely reprinted texts such as State of the Union speeches, producing clusters with up to 58 witnesses. Beyond 20 witnesses, simple majority vote outperforms language model rescoring, though the gap between them is much less in this experiment.
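The simple majority-vote baseline mentioned in this abstract can be sketched as follows. The sketch assumes the duplicate passages (witnesses) have already been aligned token by token to the same length, which is a simplification; the alignment step and the language-model consensus decoding described by the authors are not shown, and the function name is hypothetical.

```python
from collections import Counter


def majority_vote(witnesses):
    """Pick the most common token at each aligned position.

    `witnesses` is a list of token lists of equal length. Ties go to the
    token seen first at that position.
    """
    if not witnesses:
        return []
    length = len(witnesses[0])
    if any(len(witness) != length for witness in witnesses):
        raise ValueError("witnesses must be aligned to the same length")
    consensus = []
    for column in zip(*witnesses):
        token, _count = Counter(column).most_common(1)[0]
        consensus.append(token)
    return consensus


# Hypothetical OCR outputs of the same reprinted passage.
witnesses = [
    ["the", "qnick", "brown"],
    ["the", "quick", "brovvn"],
    ["tbe", "quick", "brown"],
]
corrected = majority_vote(witnesses)
```

Each witness here contains a different error, so the vote recovers the intended reading at every position even though no single witness is clean.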

Peter Organisciak

University of Illinois at Urbana-Champaign

Edward Fox

Professor, Virginia Tech (VPI&SU)
digital libraries, family, students, Reiki

Mirco Kocher

PhD student, University of Neuchâtel

Wednesday June 21, 2017 16:00 - 17:30
Room 205, Faculty of Information 140 St. George Street, Toronto, ON, M5S 3G6