Research Datasets & Demos

SUIT: Semantic User Interest Tracing

User interest tracing is a common practice in many Web use-cases including, but not limited to, search, recommendation or intelligent assistants. The overall aim is to provide the user a personalized “Web experience” by aggregating and exploiting a plenitude of user data derived from collected logs, accessed contents, and/or mined community context. As such, fairly basic features such as terms and graph structures can be utilized in order to model a user’s interest. While there are clearly positive aspects in the before mentioned application scenarios, the user’s privacy is highly at risk. In order to study inherent privacy risks, this paper studies Semantic User Interest Tracing (SUIT in short) by investigating a user’s publishing/editing behavior of Web contents. In contrast to existing approaches, SUIT solely exploits the (semantic) concepts [categories] inherent in documents derived via entity-level analytics. By doing so, we raise Web contents to the entity-level. Thus, we are able to abstract the user interest from plain text strings to “things”. In particular, we utilize the inherited structural relationships present among the concepts derived from a knowledge graph in order to identify the user associated with a specific Web content. Our extensive experiments on Wikipedia show that our approach outperforms state of the art approaches in tracing and predicting user behavior in a single language. In addition, we also demonstrate the viability of our semantic (language-agnostic) approach in multilingual experiments. As such, SUIT is capable of revealing the user’s identity, which demonstrates the fine line between personalization and surveillance, raising questions regarding ethical considerations at the same time.

Downloads and Datasets

SUIT

Publication

A. Kumar and M. Spaniol
There is a fine Line between Personalization and Surveillance: Semantic User Interest Tracing via Entity-level Analytics
Proceedings of the 14^th International ACM Web Science Conference (WebSci'22), Barcelona, Spain, June 26-29, 2022, pp. 22-33.
BibTeX

AnnoTag: Concise Content Annotation via LOD Tags derived from Entity-level Analytics

Digital libraries build on classifying contents by capturing their semantics and (optionally) aligning the description with an underlying categorization scheme. This process is usually based on human intervention, either by the content creator or a curator. As such, this procedure is highly time-consuming and - thus - expensive. In order to support the human in data curation, we introduce an annotation tagging system called “AnnoTag”. AnnoTag aims at providing concise content annotations by employing entity-level analytics in order to derive semantic descriptions in the form of tags. In particular, we are generating “Semantic LOD Tags” (linked open data) that allow an interlinking of the derived tags with the LOD cloud. Based on a qualitative evaluation on Web news articles we prove the viability of our approach and the high-quality of the automatically extracted information.

Video and Demo

Download and Datasets

Publication

A. Kumar and M. Spaniol
AnnoTag: Concise Content Annotation via LOD Tags derived from Entity-level Analytics
Proceedings of the 25^th International Conference on Theory and Practice of Digital Libraries (TPDL 2021), virtual conference, September 13-17, 2021, pp. 175-180.
BibTeX

FETD²: A Framework for Enabling Textual Data Denoising via Robust Contextual Embeddings

Efforts by national libraries, institutions, and (inter-) national projects have led to an increased effort in preserving textual contents - including nondigitally born data - for future generations. These activities have resulted in novel initiatives in preserving the cultural heritage by digitization. However, a systematic approach toward Textual Data Denoising (TD²) is still in its infancy and commonly limited to a primarily dominant language (mostly English). However, digital preservation requires a universal approach. To this end, we introduce a “Framework for Enabling Data Denoising via robust contextual embeddings” (FETD²). FETD² improves data quality by training language-specific data denoising modelsbased on a small number of language-specific training data. Our approach employs a bi-directional language modeling in order to produce noise-resilient deep contextualized embeddings. In experiments we show the superiority compared with the state-of-the-art.

Downloads and Datasets

FETD² Data

Publication

Govind, C. Alec, J.-L. Manguin and M. Spaniol
FETD²: A Framework for Enabling Textual Data Denoising via Robust Contextual Embeddings
Proceedings of the 25^th International Conference on Theory and Practice of Digital Libraries (TPDL 2021), virtual conference, September 13-17, 2021, pp. 3-16.
BibTeX

Semantic Search via Entity-Types: The SEMANNOREX Framework

Capturing and exploiting a content's semantic is a key success factor for Web search. To this end, it is crucial to - ideally automatically - extract the core semantics of the data being processed and link this information with some formal representation, such as an ontology. By intertwining both, search becomes semantic by simultaneously allowing end-users a structured access to the data via the underlying ontology. Connecting both, we introduce the SEMANNOREX framework in order to provide semantically enriched access to a news corpus from Websites and Wikinews.

Video and Demo

Downloads and Datasets

SEMANNOREX dataset

Publication

A. Kumar, Govind and M. Spaniol
Semantic Search via Entity-Types: The SEMANNOREX Framework
Proceedings of the 30^th Web Conference (www2021), virtual conference, April 12-23, 2021, pp. 690-694.
BibTeX

PURE: Pattern Utilization for Representative Entity type classification

Thirty years of the Web have led to a tremendous amount of contents. While contents of the early years have been predominantly “simple” HTML documents, more recent ones have become more and more “machine-interpretable”. Named entities - ideally explicitly and intentionally annotated - pave the way toward a semantic exploration and exploitation of the data. While this appears to be the golden sky toward a more human-centricWeb, it not necessarily is. The key-point is simple: “the more the merrier” is not necessarily the case along all dimensions. For instance, each and every named entity provides via the Web of data a plenitude of information potentially overwhelming the end-user. In particular, named entities are predominantly annotated with multiple types without any order of importance associated. In order to depict the most concise type information, we introduce an approach towards Pattern Utilization for Representative Entity type classification called PURE. To this end, PURE aims at exploiting solely structural patterns derived from knowledge graphs in order to “purify” the most representative type(s) associated with a named entity. Our experiments with named entities in Wikipedia demonstrate the viability of our approach and improvement over competing strategies.

Downloads and Datasets

PURE

Publication

A. Kumar, Govind, C. Alec and M. Spaniol
Blogger or President? Exploitation of Patterns in Entity Type Graphs for Representative Entity Type Classification
Proceedings of the 12^th International ACM Web Science Conference (WebSci '20), Southampton, UK, July 7-10, 2020, pp. 59-68.
BibTeX

CALVADOS: A Tool for the Semantic Analysis and Digestion of Web Contents

Web users these days are confronted with an abundance of information. While this is clearly beneficial in general, there is a risk of "information overload". To this end, there is an increasing need of filtering, classifying and/or summarizing Web contents automatically. In order to help consumers in efficiently deriving the semantics from Web contents, we have developed the CALVADOS (Content AnaLytics ViA Digestion Of Semantics) system. To this end, CALVADOS raises contents to the entity-level and digests its inherent semantics. In this demo, we present how entity-level analytics can be employed to automatically classify the main topic of a Web content and reveal the semantic building blocks associated with the corresponding document.

Demo

CALVADOS Demo

Publication

Govind, A. Kumar, C. Alec, M. Spaniol
CALVADOS: A Tool for the Semantic Analysis and Digestion of Web Contents
Proceedings of the 16^th Extended Semantic Web Conference (ICWE 2019), Portorož, Slovenia, June 2-6, 2019, pp. 84-89.
BibTeX

Semantic Fingerprinting: A Novel Method for Entity-level Content Classification

With the constantly growing Web, there is a need for automatically analyzing, interpreting and organizing contents. A particular need is given by the management of Web contents with respect to classification systems, e.g. based on ontologies in the LOD (Linked Open Data) cloud. Research in deep learning recently has shown great progress in classifying data based on large volumes of training data. However, "targeted" and fine-grained information systems require classification methods based on a relatively small number of "representative" samples. For that purpose, we present an approach that allows a semantic exploitation of Web contents and - at the same time - computationally efficient processing based on "semantic fingerprinting". To this end, we raise Web contents to the entity-level and exploit entity-related information that allows "distillation" and fine-grained classification of the Web content by its "semantic fingerprint". In experimental results on Web contents classified in Wikipedia, we show the superiority of our approach against state-of-the-art methods.

Downloads and Datasets

Publication

Govind, C. Alec, M. Spaniol
Semantic Fingerprinting: A Novel Method for Entity-level Content Classification
Proceedings of the 18th International Conference on Web Engineering (ICWE 2018), Cáceres, Spain, June 6-8, 2018, pp. 279-287.
BibTeX

ELEVATE-live: Assessment and Visualization of Online News Virality via Entity-level Analytics

Recent research has shown significant progress in forecasting the impact and spread of societal relevant events into online communities of different languages. Here, raising contents to the entity-level has been the driving force in \understanding" Web contents. In this demonstration paper, we present a novel Web-based tool that exploits entity information from online news in order to assess and visualize their virality.

Demo

ELEVATE-live Demo

Publication

Govind, C. Alec, M. Spaniol
ELEVATE-live: Assessment and Visualization of Online News Virality via Entity-level Analytics
Proceedings of the 18th International Conference on Web Engineering (ICWE 2018), Cáceres, Spain, June 6-8, 2018, pp. 482-486.
BibTeX

ELEVATE: A Framework for Entity-level Event Diffusion Prediction into Foreign Language Communities

The accessibility to news via the Web or other “traditional” media allows a rapid diffusion of information into almost every part of the world. These reports cover the full spectrum of events, ranging from locally relevant ones up to those that gain global attention. The societal impact of an event can be relatively easily “measured” by the attention it attracts (e.g. in the number of responses it receives and provokes) in the news or social media. However, this does not necessarily reflect its inter-cultural impact and its diffusion into other communities. In order to address the issue of predicting the spread of information into foreign language communities we introduce the ELEVATE framework. ELEVATE exploits entity information from Web contents and harnesses location related data for language-related event diffusion prediction. Our experiments on event spreading across Wikipedia communities of different language demonstrate the viability of our approach and improvement over state-of-the-art approaches.

Downloads and Datasets

ELEVATE

Publication

Govind and M. Spaniol
ELEVATE: A Framework for Entity-level Event Diffusion Prediction into Foreign Language Communities
Proceedings of the 9^th International ACM Web Science Conference (WebSci '17), Troy, NY, USA, June 26-28, 2017, pp. 111-120.
BibTeX

Research Datasets & Demos

SUIT: Semantic User Interest Tracing

Downloads and Datasets

Publication

AnnoTag: Concise Content Annotation via LOD Tags derived from Entity-level Analytics

Video and Demo

Download and Datasets

Publication

FETD2: A Framework for Enabling Textual Data Denoising via Robust Contextual Embeddings

Downloads and Datasets

Publication

Semantic Search via Entity-Types: The SEMANNOREX Framework

Video and Demo

Downloads and Datasets

Publication

PURE: Pattern Utilization for Representative Entity type classification

Downloads and Datasets

Publication

CALVADOS: A Tool for the Semantic Analysis and Digestion of Web Contents

Demo

Publication

Semantic Fingerprinting: A Novel Method for Entity-level Content Classification

Downloads and Datasets

Publication

ELEVATE-live: Assessment and Visualization of Online News Virality via Entity-level Analytics

Demo

Publication

ELEVATE: A Framework for Entity-level Event Diffusion Prediction into Foreign Language Communities

Downloads and Datasets

Publication

FETD²: A Framework for Enabling Textual Data Denoising via Robust Contextual Embeddings