Improving Effectiveness of Knowledge Graph Generation Rule Creation and Execution

Summary

More and more data is generated by an increasing number of agents (applications and devices) that are connected to the Internet, all contributing to data that is available on the Web. When this data is analyzed, combined, and shared, powerful new techniques can be designed and deployed, such as artificial intelligence applied by personal assistants (e.g., Apple's Siri and Amazon Alexa), improved search engines (e.g., Google Search and Microsoft's Bing), and decentralized data stores (as opposed to the large, central stores used by big companies).

Due to this increase in data, methods originally used to model data have become insufficient: they isolate the data, which resides in different data sources, and make it hard for agents to exchange data on the Web. Data cannot be easily exchanged between agents, because every agent uses a different data model to describe the concepts and relationships of the data they generate and use.

The Semantic Web offers a solution to this problem: knowledge graphs, which are directed graphs that provide semantic descriptions of entities and their relationships. These graphs are often generated from other data sources, such as multiple databases and files, each of which can be organized and structured in a different way. For instance, the DBpedia knowledge graph is generated from Wikipedia, and the Google knowledge graph is generated from different sources, including data coming from Google's own services and Wikipedia.

A common way to generate these knowledge graphs is by using rules, which are created and executed. The rules attach semantic annotations, using concepts and relationships, to data in those sources. Which concepts and relationships are used and how they are applied to data sources is contained in a semantic model. Rules thereby determine how data sources are modeled using specific concepts and relationships during knowledge graph generation. The syntax and grammar of these rules are determined by a knowledge graph generation language, such as R2RML and RML. Thus, the rule creation is influenced by three aspects:

data sources,
concepts and relationships, and
semantic model.

Once (a subset of) the rules are created, they are executed to generate a knowledge graph. This is done via a processor: a software tool that, given a set of rules and data sources, generates knowledge graphs. Multiple processors can be developed for a single knowledge graph generation language, each with a different set of features: conformance to the specification of the rule language, API, programming language, scalability, and so on. Users can switch between them without changing the rules; thus, the selection of the most suitable processor depends on the use case at hand.
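To make the division of labor between rules and a processor concrete, the following is a minimal Python sketch, not an RML processor. The dictionary-based rule format, the example.org IRIs, and the function name generate_triples are assumptions made purely for illustration; real R2RML and RML rules have their own syntax and cover far more cases.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical rule format (NOT RML syntax): a subject template, a concept
# (class), and (relationship, column) pairs that annotate the source data.
RULES = {
    "subject_template": "http://example.org/person/{id}",
    "class": "http://xmlns.com/foaf/0.1/Person",
    "predicate_object": [
        ("http://xmlns.com/foaf/0.1/name", "name"),
        ("http://xmlns.com/foaf/0.1/age", "age"),
    ],
}


def generate_triples(rules: Dict, rows: List[Dict[str, str]]) -> List[Triple]:
    """Toy 'processor': apply the rules to every row of a data source."""
    triples: List[Triple] = []
    for row in rows:
        subject = rules["subject_template"].format(**row)
        triples.append((subject, RDF_TYPE, rules["class"]))
        for predicate, column in rules["predicate_object"]:
            value = row.get(column)
            # Skip missing or empty values instead of emitting empty literals.
            if value:
                triples.append((subject, predicate, value))
    return triples


# A data source, here already parsed into rows (for example, from a CSV file).
rows = [
    {"id": "1", "name": "Alice", "age": "30"},
    {"id": "2", "name": "Bob", "age": ""},
]
triples = generate_triples(RULES, rows)
for triple in triples:
    print(triple)
```

In this sketch the rules stay fixed while generate_triples could be swapped for another implementation, which mirrors the point that users can change processors without changing the rules.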

In this dissertation, we look at three challenges and discuss our contributions to tackle them. The first challenge deals with improving users' understanding of the components of knowledge graph generation. The second challenge deals with the avoidance and removal of inconsistencies introduced by concepts, relationships, and the semantic model, which influence rule creation. Both can lead to inconsistencies in the generated knowledge graphs. Inconsistencies in knowledge graphs are introduced when concepts and relationships are used without adhering to their restrictions, and this affects the graphs' quality. Possible root causes for these inconsistencies include:

semantic models that introduce new inconsistencies by, for example, not using suitable concepts; and
concept and relationship definitions that do not model the domain as desired.

The third challenge deals with the selection of the most suitable processor for the use case at hand, which influences rule execution. A processor is used to generate knowledge graphs. If multiple processors are available, users need to select the most suitable one for the use case at hand. However, this is not trivial if each processor has a different set of features, such as conformance to the specification, API, programming language, scalability, and so on.

Our contributions to tackle the challenges are MapVOWL, the RMLEditor, Resglass, and the RML test cases. MapVOWL is a visual notation for knowledge graph generation rules. Relying on MapVOWL's unified graphical elements, rules can be created entirely using visual representations, while the rules in the underlying language's syntax and grammar are generated without user intervention. Our evaluation shows that MapVOWL is preferred over using a knowledge graph generation language, such as RML, directly. MapVOWL contributes to tackling the first and second challenges. In future research, we want to investigate how to visualize common data structures, such as lists and bags. Furthermore, rules do not have to be represented only via visualizations: there are users who desire a text-based approach. To this end, initial steps have been taken by introducing YARRRML, a human-readable text-based representation for knowledge graph generation rules.

The RMLEditor is a graphical user interface (GUI) for knowledge graph generation rules, using MapVOWL. It allows users to load data sources, create rules, and preview the resulting knowledge graphs. The GUI pays special attention to

the scalability issues that accompany the use of graphs for visualizations;
the visualization of heterogeneous data sources, besides showing the raw data; and
the integration of transformations of the data sources in the visualizations of the rules.

Our evaluation shows that the RMLEditor is rated as good by users and is preferred over using a form-based GUI, such as RMLx. The RMLEditor contributes to tackling the first and second challenges. In future research, we want to combine MapVOWL with other representations, such as YARRRML, and add new features to improve the overall rule creation process, such as importing ontologies.

Resglass is a rule-driven method for the resolution of inconsistencies in rules, concepts, and relationships. The rules, concepts, and relationships are automatically ranked in order of inspection based on a score that considers the number of inconsistencies a rule, concept, or relationship is involved in. Our evaluation shows that, compared to a random ranking, Resglass' ranking improves the overlap with experts' rankings by 41% for rules and 23% for concepts and relationships. Resglass contributes to the second challenge. In future research, we want to investigate how to further assist users in both detecting and resolving inconsistencies, for example by suggesting possible resolutions based on the ontologies in use, other ontologies, and the rules of other use cases.
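The idea of ranking by involvement in inconsistencies can be sketched as follows. This is a simplified illustration under stated assumptions, not Resglass' actual scoring: the score here is just the raw count of inconsistencies an element takes part in, ties are broken alphabetically, rules and concepts are ranked in one list, and the function name and identifiers such as rule:PersonMap are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def rank_by_inconsistencies(inconsistencies: List[List[str]]) -> List[Tuple[str, int]]:
    """Rank elements by the number of inconsistencies they are involved in.

    Each inconsistency is given as the list of rules, concepts, and
    relationships that take part in it (an assumed input format).
    """
    counts: Counter = Counter()
    for involved in inconsistencies:
        # Count each element at most once per inconsistency.
        counts.update(set(involved))
    # Highest count first; alphabetical order as a deterministic tie-breaker.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical detected inconsistencies.
inconsistencies = [
    ["rule:PersonMap", "foaf:age"],
    ["rule:PersonMap", "foaf:Person"],
    ["rule:OrgMap", "foaf:age"],
]
ranking = rank_by_inconsistencies(inconsistencies)
for element, score in ranking:
    print(score, element)
```

Elements at the top of the list would be inspected first, since changing them is likely to affect the most inconsistencies.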

The RML test cases allow

developers to determine how conformant their RML processors are to the RML specification, and
users to select the most appropriate processor for a specific use case, based on the test case results.

This is a step towards designing test cases that are independent of RML and are applicable to all languages that generate knowledge graphs from heterogeneous data sources. The test cases contribute to the third challenge. In future research, we want to investigate the characteristics of test cases where data streams are used instead of static files or databases, and other metrics that further improve the selection of the most suitable processor, such as speed, memory consumption, usability, and extensibility.
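How test case results could feed into processor selection can be sketched in a few lines of Python. Everything here is assumed for illustration: the test case layout (rules, input data, expected triples), the identifiers TC-1 and TC-2, and toy_processor, which stands in for a real processor and deliberately ignores language tags. The actual RML test cases are defined differently.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]


def toy_processor(rules: Dict[str, str], rows: List[Dict[str, str]]) -> List[Triple]:
    """Hypothetical processor that ignores the optional 'language' rule key."""
    return [
        (rules["subject"].format(**row), rules["predicate"], row[rules["column"]])
        for row in rows
    ]


# Hypothetical test cases: rules, input data, and the expected triples.
TEST_CASES = [
    {
        "id": "TC-1",
        "rules": {"subject": "http://example.org/{id}",
                  "predicate": "http://example.org/name", "column": "name"},
        "data": [{"id": "1", "name": "Alice"}],
        "expected": [("http://example.org/1", "http://example.org/name", "Alice")],
    },
    {
        "id": "TC-2",
        "rules": {"subject": "http://example.org/{id}",
                  "predicate": "http://example.org/name", "column": "name",
                  "language": "en"},
        "data": [{"id": "1", "name": "Alice"}],
        "expected": [("http://example.org/1", "http://example.org/name", '"Alice"@en')],
    },
]


def run_test_cases(processor: Callable, test_cases: List[Dict]) -> Dict[str, bool]:
    """A test case passes when the generated triples equal the expected ones."""
    return {
        case["id"]: set(processor(case["rules"], case["data"])) == set(case["expected"])
        for case in test_cases
    }


results = run_test_cases(toy_processor, TEST_CASES)
pass_rate = sum(results.values()) / len(results)
print(results, pass_rate)
```

A user who needs language-tagged literals would see from the failing TC-2 that this processor does not fit the use case, even though it passes TC-1.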
