
Research Datasets & Demos



PURE: Pattern Utilization for Representative Entity type classification


Thirty years of the Web have led to a tremendous amount of content. While the contents of the early years were predominantly “simple” HTML documents, more recent ones have become more and more “machine-interpretable”. Named entities - ideally explicitly and intentionally annotated - pave the way toward a semantic exploration and exploitation of the data. While this appears to be the golden path toward a more human-centric Web, it is not necessarily so. The key point is simple: “the more the merrier” does not hold along all dimensions. For instance, each and every named entity provides, via the Web of Data, a plenitude of information that can potentially overwhelm the end user. In particular, named entities are predominantly annotated with multiple types without any associated order of importance. In order to depict the most concise type information, we introduce PURE, an approach towards Pattern Utilization for Representative Entity type classification. To this end, PURE exploits solely structural patterns derived from knowledge graphs in order to “purify” the most representative type(s) associated with a named entity. Our experiments with named entities in Wikipedia demonstrate the viability of our approach and its improvement over competing strategies.
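The abstract does not spell out the concrete structural patterns PURE uses, but the core idea of ranking an entity's asserted types by structural signals alone can be sketched as follows. The toy type hierarchy (TYPE_PARENT), the entity catalogue (ENTITY_TYPES), and the depth-over-frequency score are illustrative assumptions, not the patterns from the paper.

```python
# Minimal sketch: pick an entity's most "representative" type(s) using only
# structural signals from a (toy) knowledge graph -- no textual features.

# Hypothetical type hierarchy: each type maps to its parent (None = root).
TYPE_PARENT = {
    "Agent": None,
    "Person": "Agent",
    "Athlete": "Person",
    "SoccerPlayer": "Athlete",
    "Place": None,
}

# Hypothetical entity catalogue: entity -> set of asserted types.
ENTITY_TYPES = {
    "Lionel_Messi": {"Agent", "Person", "Athlete", "SoccerPlayer"},
    "Caen": {"Place"},
}

def depth(t: str) -> int:
    """Depth of a type in the hierarchy; deeper usually means more specific."""
    d = 0
    while TYPE_PARENT.get(t) is not None:
        t = TYPE_PARENT[t]
        d += 1
    return d

def type_frequency(t: str) -> int:
    """Number of entities carrying this type; rarer types tend to be more telling."""
    return sum(t in types for types in ENTITY_TYPES.values())

def representative_types(entity: str, k: int = 1) -> list[str]:
    """Rank an entity's types by specificity discounted by global frequency."""
    scores = {t: depth(t) / type_frequency(t) for t in ENTITY_TYPES[entity]}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(representative_types("Lionel_Messi"))  # ['SoccerPlayer']
```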

Downloads and Datasets

Publication




CALVADOS: A Tool for the Semantic Analysis and Digestion of Web Contents


Web users these days are confronted with an abundance of information. While this is clearly beneficial in general, there is a risk of "information overload". To this end, there is an increasing need for filtering, classifying and/or summarizing Web contents automatically. In order to help consumers efficiently derive the semantics from Web contents, we have developed the CALVADOS (Content AnaLytics ViA Digestion Of Semantics) system. CALVADOS raises contents to the entity level and digests their inherent semantics. In this demo, we present how entity-level analytics can be employed to automatically classify the main topic of a Web document and reveal the semantic building blocks associated with it.
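As a rough illustration of what "raising a document to the entity level" can mean in practice, the sketch below links mentions to a toy category catalogue and takes the dominant category as the main topic. The gazetteer lookup and the ENTITY_CATEGORY table are stand-ins for a real entity linker and knowledge graph, not the CALVADOS implementation.

```python
from collections import Counter

# Toy entity catalogue standing in for a knowledge-graph lookup.
ENTITY_CATEGORY = {
    "Paris": "Place",
    "Eiffel Tower": "Place",
    "Emmanuel Macron": "Politician",
    "Louvre": "Place",
}

def extract_entities(text: str) -> list[str]:
    """Naive gazetteer matching; a real system would use an entity linker."""
    return [name for name in ENTITY_CATEGORY if name in text]

def digest(text: str) -> tuple[str, Counter]:
    """Return the dominant topic plus the document's 'semantic building blocks'."""
    blocks = Counter(ENTITY_CATEGORY[e] for e in extract_entities(text))
    topic = blocks.most_common(1)[0][0] if blocks else "Unknown"
    return topic, blocks

doc = "The Eiffel Tower and the Louvre make Paris a magnet for visitors."
print(digest(doc))  # ('Place', Counter({'Place': 3}))
```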

Demo

Publication




Semantic Fingerprinting: A Novel Method for Entity-level Content Classification


With the constantly growing Web, there is a need for automatically analyzing, interpreting and organizing contents. A particular need is the management of Web contents with respect to classification systems, e.g. ontologies in the LOD (Linked Open Data) cloud. Recent research in deep learning has shown great progress in classifying data based on large volumes of training data. However, "targeted" and fine-grained information systems require classification methods based on a relatively small number of "representative" samples. For that purpose, we present an approach that allows a semantic exploitation of Web contents and - at the same time - computationally efficient processing based on "semantic fingerprinting". To this end, we raise Web contents to the entity level and exploit entity-related information that allows "distillation" and fine-grained classification of a Web document by its "semantic fingerprint". In experiments on Web contents classified in Wikipedia, we show the superiority of our approach over state-of-the-art methods.
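A minimal sketch of the fingerprinting idea, assuming a fingerprint is a normalized histogram over the knowledge-graph categories of a document's entities and classification picks the nearest labelled fingerprint by cosine similarity. The category list, training samples, and similarity measure are assumptions for illustration, not the exact method from the paper.

```python
import math

# Assumed fixed category inventory; each document becomes a vector over it.
CATEGORIES = ["Person", "Place", "Organisation", "Event"]

def fingerprint(entity_categories: list[str]) -> list[float]:
    """L2-normalized category histogram of a document's linked entities."""
    counts = [entity_categories.count(c) for c in CATEGORIES]
    norm = math.sqrt(sum(v * v for v in counts)) or 1.0
    return [v / norm for v in counts]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# A handful of "representative" labelled samples, as the abstract suggests.
TRAIN = [
    (fingerprint(["Person", "Person", "Organisation"]), "Politics"),
    (fingerprint(["Place", "Place", "Event"]), "Travel"),
]

def classify(entity_categories: list[str]) -> str:
    """Assign the label of the most similar labelled fingerprint."""
    fp = fingerprint(entity_categories)
    return max(TRAIN, key=lambda sample: cosine(fp, sample[0]))[1]

print(classify(["Place", "Event", "Place"]))  # Travel
```

Because the fingerprint is a small fixed-length vector rather than a learned representation, classification stays computationally cheap even with few training samples.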

Downloads and Datasets

Publication




ELEVATE: A Framework for Entity-level Event Diffusion Prediction into Foreign Language Communities


The accessibility of news via the Web and other “traditional” media allows a rapid diffusion of information into almost every part of the world. These reports cover the full spectrum of events, ranging from locally relevant ones up to those that gain global attention. The societal impact of an event can be relatively easily “measured” by the attention it attracts in the news or social media (e.g. the number of responses it receives and provokes). However, this does not necessarily reflect its inter-cultural impact and its diffusion into other communities. In order to address the issue of predicting the spread of information into foreign-language communities, we introduce the ELEVATE framework. ELEVATE exploits entity information from Web contents and harnesses location-related data for language-related event diffusion prediction. Our experiments on event spreading across Wikipedia communities of different languages demonstrate the viability of our approach and its improvement over state-of-the-art approaches.
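A hedged sketch of the kind of entity- and location-based features such a predictor might consume: the Entity record, the feature names, and the coverage heuristics below are assumptions, not the ELEVATE feature set.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    country: str | None          # location signal, if the entity has one
    language_editions: set[str]  # Wikipedia editions describing the entity

def diffusion_features(entities: list[Entity], target_lang: str) -> dict:
    """Simple signals for whether an event may spread to a target community."""
    n = len(entities) or 1
    covered = sum(target_lang in e.language_editions for e in entities)
    located = sum(e.country is not None for e in entities)
    return {
        "entity_coverage_in_target": covered / n,  # familiarity to the community
        "share_of_located_entities": located / n,  # geographic grounding
        "num_entities": len(entities),
    }

event = [
    Entity("Mont-Saint-Michel", "France", {"en", "fr", "de"}),
    Entity("Normandy", "France", {"en", "fr", "de", "es"}),
]
print(diffusion_features(event, "de"))
# {'entity_coverage_in_target': 1.0, 'share_of_located_entities': 1.0, 'num_entities': 2}
```

Such feature vectors could then be fed to any standard binary classifier to predict whether the event diffuses into the target language community.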

Downloads and Datasets

Publication