User interest tracing is a common practice in many Web use-cases including, but not limited to, search, recommendation, or intelligent assistants. The overall aim is to provide the user with a personalized “Web experience” by aggregating and exploiting a plenitude of user data derived from collected logs, accessed contents, and/or mined community context. As such, fairly basic features such as terms and graph structures can be utilized in order to model a user’s interest. While there are clearly positive aspects in the aforementioned application scenarios, the user’s privacy is highly at risk. In order to study these inherent privacy risks, this paper investigates Semantic User Interest Tracing (SUIT for short) based on a user’s publishing and editing behavior on Web contents. In contrast to existing approaches, SUIT solely exploits the semantic concepts (categories) inherent in documents, derived via entity-level analytics. By doing so, we raise Web contents to the entity level and are thus able to abstract the user interest from plain text strings to “things”. In particular, we utilize the inherent structural relationships among the concepts derived from a knowledge graph in order to identify the user associated with a specific Web content. Our extensive experiments on Wikipedia show that our approach outperforms state-of-the-art approaches in tracing and predicting user behavior in a single language. In addition, we demonstrate the viability of our semantic (language-agnostic) approach in multilingual experiments. As such, SUIT is capable of revealing a user’s identity, which demonstrates the fine line between personalization and surveillance and raises ethical questions at the same time.
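As a rough illustration (not the authors' exact method), the core attribution step can be thought of as comparing the knowledge-graph categories of an unattributed document against per-user category profiles and picking the most similar profile. All user names, category labels, and counts below are hypothetical:

```python
# Minimal sketch of the idea behind category-based user tracing: each user is
# profiled by the knowledge-graph categories of documents they edited; a new
# document is attributed to the user with the most similar category profile.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse category-count vectors."""
    shared = set(a) & set(b)
    dot = sum(a[c] * b[c] for c in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical user profiles: category -> frequency in the user's past edits.
profiles = {
    "user_A": Counter({"Category:Machine_learning": 12, "Category:Statistics": 7}),
    "user_B": Counter({"Category:Football_clubs": 9, "Category:Sports_events": 5}),
}

# Categories extracted (via entity-level analytics) from an unattributed document.
doc_categories = Counter({"Category:Machine_learning": 3, "Category:Statistics": 1})

best_user = max(profiles, key=lambda u: cosine(profiles[u], doc_categories))
print(best_user)  # -> "user_A"
```

Because the comparison operates on language-independent concepts rather than surface strings, the same matching scheme carries over to multilingual settings.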
Digital libraries build on classifying contents by capturing their semantics and (optionally) aligning the description with an underlying categorization scheme. This process is usually based on human intervention, either by the content creator or a curator, and is therefore highly time-consuming and thus expensive. In order to support humans in data curation, we introduce an annotation tagging system called “AnnoTag”. AnnoTag aims at providing concise content annotations by employing entity-level analytics in order to derive semantic descriptions in the form of tags. In particular, we generate “Semantic LOD Tags” (Linked Open Data) that allow interlinking the derived tags with the LOD cloud. Based on a qualitative evaluation of Web news articles, we demonstrate the viability of our approach and the high quality of the automatically extracted information.
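A minimal sketch of what such LOD tagging amounts to, assuming entities have already been spotted in the text (the `extract_entities` stub and the naive name-to-URI mapping below are placeholders, not AnnoTag's actual pipeline):

```python
# Entities spotted in a text are mapped to Linked Open Data identifiers,
# here DBpedia resource URIs built by a simple name-to-URI convention.
from urllib.parse import quote

def extract_entities(text: str) -> list[str]:
    # Placeholder: a real system would run an entity linker here.
    return ["Angela Merkel", "European Union"]

def to_lod_tag(entity: str) -> str:
    # Naive mapping of a surface form to a DBpedia resource URI.
    return "http://dbpedia.org/resource/" + quote(entity.replace(" ", "_"))

article = "..."  # news article text
tags = [to_lod_tag(e) for e in extract_entities(article)]
print(tags)
# ['http://dbpedia.org/resource/Angela_Merkel', 'http://dbpedia.org/resource/European_Union']
```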
Initiatives by national libraries, institutions, and (inter-)national projects have led to increased efforts in preserving textual contents - including non-digitally born data - for future generations. These activities have resulted in novel initiatives for preserving cultural heritage by digitization. However, a systematic approach toward Textual Data Denoising (TD2) is still in its infancy and is commonly limited to a single dominant language (mostly English), whereas digital preservation requires a universal approach. To this end, we introduce a “Framework for Enabling Data Denoising via robust contextual embeddings” (FETD2). FETD2 improves data quality by training language-specific data denoising models based on a small amount of language-specific training data. Our approach employs bi-directional language modeling in order to produce noise-resilient deep contextualized embeddings. In experiments, we show the superiority of our approach compared with the state of the art.
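To give a flavor of the kind of bi-directional model such an approach builds on, the sketch below shows a tiny character-level bi-LSTM encoder in PyTorch that yields contextualized embeddings; hyper-parameters, the toy vocabulary, and the example token are illustrative and not taken from the paper:

```python
# A character-level bi-directional LSTM encoder producing contextual
# embeddings that are comparatively robust to character noise (OCR errors, typos).
import torch
import torch.nn as nn

class BiLMEncoder(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, seq_len) -> contextual embeddings (batch, seq_len, 2*hidden)
        out, _ = self.lstm(self.embed(char_ids))
        return out

# Toy usage: encode a noisy token ("librarry") character by character.
chars = sorted(set("abcdefghijklmnopqrstuvwxyz "))
char2id = {c: i for i, c in enumerate(chars)}
ids = torch.tensor([[char2id[c] for c in "librarry"]])
encoder = BiLMEncoder(vocab_size=len(chars))
print(encoder(ids).shape)  # torch.Size([1, 8, 128])
```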
Capturing and exploiting a content's semantics is a key success factor for Web search. To this end, it is crucial to - ideally automatically - extract the core semantics of the data being processed and link this information with a formal representation, such as an ontology. By intertwining both, search becomes semantic while simultaneously giving end-users structured access to the data via the underlying ontology. Connecting both aspects, we introduce the SEMANNOREX framework in order to provide semantically enriched access to a news corpus drawn from Websites and Wikinews.
Thirty years of the Web have led to a tremendous amount of contents. While contents of the early years were predominantly “simple” HTML documents, more recent ones have become increasingly “machine-interpretable”. Named entities - ideally explicitly and intentionally annotated - pave the way toward a semantic exploration and exploitation of the data. While this appears to be a golden path toward a more human-centric Web, it is not necessarily so. The key point is simple: “the more the merrier” does not hold along all dimensions. For instance, each and every named entity provides, via the Web of Data, a plenitude of information that can potentially overwhelm the end-user. In particular, named entities are predominantly annotated with multiple types without any associated order of importance. In order to depict the most concise type information, we introduce an approach toward Pattern Utilization for Representative Entity type classification, called PURE. To this end, PURE aims at exploiting solely structural patterns derived from knowledge graphs in order to “purify” the most representative type(s) associated with a named entity. Our experiments with named entities in Wikipedia demonstrate the viability of our approach and its improvement over competing strategies.
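One simple structural signal of this kind (a sketch of the intuition only, since PURE's actual patterns are richer) is type specificity within the knowledge graph's type hierarchy: among the many types of an entity, deeper types tend to be more informative. The toy ontology and entity below are hypothetical:

```python
# Prefer the most specific type according to a toy type hierarchy.
parents = {  # child type -> parent type
    "Scientist": "Person", "Person": "Agent", "Agent": "Thing",
    "Award_Winner": "Person",
}

def depth(t: str) -> int:
    """Number of hierarchy steps from the type up to the root."""
    d = 0
    while t in parents:
        t, d = parents[t], d + 1
    return d

entity_types = ["Thing", "Agent", "Person", "Scientist"]  # e.g. for "Marie Curie"
representative = max(entity_types, key=depth)
print(representative)  # -> "Scientist"
```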
Web users these days are confronted with an abundance of information. While this is clearly beneficial in general, there is a risk of "information overload". To this end, there is an increasing need for filtering, classifying, and/or summarizing Web contents automatically. In order to help consumers efficiently derive the semantics of Web contents, we have developed the CALVADOS (Content AnaLytics ViA Digestion Of Semantics) system. CALVADOS raises contents to the entity level and digests their inherent semantics. In this demo, we present how entity-level analytics can be employed to automatically classify the main topic of a Web content and to reveal the semantic building blocks associated with the corresponding document.
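In the spirit of this entity-level view (and only as a simplified illustration, not the system's actual pipeline), topic classification can be approximated by aggregating the topical classes of the entities spotted in a document; the entity-to-class lookup below is hypothetical:

```python
# The document's main topic is taken to be the class most frequent among its entities.
from collections import Counter

entity_topic = {
    "FC Barcelona": "Sports", "Lionel Messi": "Sports",
    "UEFA": "Sports", "Camp Nou": "Sports", "Spain": "Geography",
}

doc_entities = ["Lionel Messi", "FC Barcelona", "Camp Nou", "Spain"]
topic_counts = Counter(entity_topic[e] for e in doc_entities if e in entity_topic)
main_topic, _ = topic_counts.most_common(1)[0]
print(main_topic)  # -> "Sports"
```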
With the constantly growing Web, there is a need for automatically analyzing, interpreting, and organizing contents. A particular need arises from the management of Web contents with respect to classification systems, e.g. based on ontologies in the LOD (Linked Open Data) cloud. Recent research in deep learning has shown great progress in classifying data based on large volumes of training data. However, "targeted" and fine-grained information systems require classification methods that work with a relatively small number of "representative" samples. For that purpose, we present an approach that allows a semantic exploitation of Web contents and - at the same time - computationally efficient processing based on "semantic fingerprinting". To this end, we raise Web contents to the entity level and exploit entity-related information that allows "distillation" and fine-grained classification of a Web content by its "semantic fingerprint". In experiments on Web contents classified in Wikipedia, we show the superiority of our approach over state-of-the-art methods.
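A minimal sketch of the fingerprinting idea, assuming entity categories have already been extracted (the dimensions, sample data, and nearest-neighbour classifier below are illustrative, not the paper's exact construction):

```python
# A document is condensed into a small fixed-length vector over a few
# top-level categories; classification reduces to a cheap similarity lookup.
import math

DIMS = ["Politics", "Sports", "Science", "Culture"]

def fingerprint(entity_categories: list[str]) -> list[float]:
    """Normalized count vector of the document's entity categories."""
    counts = [entity_categories.count(d) for d in DIMS]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

# Tiny labelled set of fingerprints ("representative" samples).
train = [(fingerprint(["Politics", "Politics", "Culture"]), "Politics"),
         (fingerprint(["Science", "Science"]), "Science")]

doc = fingerprint(["Science", "Science", "Culture"])
label = max(train, key=lambda fv: cosine(fv[0], doc))[1]
print(label)  # -> "Science"
```

The appeal of such a compact representation is that it needs only a handful of labelled samples per class and keeps the comparison computationally cheap.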
Recent research has shown significant progress in forecasting the impact and spread of societally relevant events into online communities of different languages. Here, raising contents to the entity level has been the driving force in "understanding" Web contents. In this demonstration paper, we present a novel Web-based tool that exploits entity information from online news in order to assess and visualize their virality.
The accessibility of news via the Web or other “traditional” media allows a rapid diffusion of information into almost every part of the world. These reports cover the full spectrum of events, ranging from locally relevant ones up to those that gain global attention. The societal impact of an event can be relatively easily “measured” by the attention it attracts (e.g. in the number of responses it receives and provokes) in the news or social media. However, this does not necessarily reflect its inter-cultural impact and its diffusion into other communities. In order to address the issue of predicting the spread of information into foreign-language communities, we introduce the ELEVATE framework. ELEVATE exploits entity information from Web contents and harnesses location-related data for language-related event diffusion prediction. Our experiments on event spreading across Wikipedia communities of different languages demonstrate the viability of our approach and its improvement over state-of-the-art approaches.
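To make the feature idea concrete (as a sketch only: the feature names and weights below are illustrative placeholders, not learned values from the paper), entity and location signals for a source-language event can be combined into a single spread score for a target-language community:

```python
# Logistic scoring of cross-language spread from entity- and location-based features.
import math

def spread_score(features: dict, weights: dict) -> float:
    """Sigmoid of a weighted feature sum (higher = more likely to spread)."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

features = {
    "num_shared_entities": 4,                 # entities already covered in the target edition
    "entity_avg_popularity": 0.7,             # e.g. normalized page-view-based salience
    "event_location_in_target_region": 1.0,   # event location tied to the target community
}
weights = {"num_shared_entities": 0.6, "entity_avg_popularity": 1.2,
           "event_location_in_target_region": 1.5}

print(round(spread_score(features, weights), 2))  # -> ~0.99
```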