Keywords

Knowledge graphs, Machine learning, Representation learning, Semantic Web, Linked Data, NLP

Title: AI-driven Bottom-Up Data Linking: from Knowledge Graph Profiling to Meaningful and Interpretable Links

Supervisors

Konstantin Todorov, assoc. professor (HDR) at the University of Montpellier
Pierre Larmande, researcher (HDR) at IRD, Montpellier

Context

The PhD position is part of the ANR project “DACE-DL: DAta-CEntric AI-driven Data Linking”, funded by the French National Research Agency (ANR). The project is a partnership between LIRMM (Montpellier), INRAE (Montpellier) and IRIT (Toulouse).

The successful candidate will join the Web-Cube group at LIRMM for a period of three years and will collaborate with a team of two postdoctoral researchers and six senior researchers from the three institutes mentioned above.

Overview and challenges

Linked data [1,13] and knowledge graphs (KG) [2,3,4,5] have been gaining popularity over the years, due to the means they offer for information access, (meta)data reuse, federation, increased visibility and sharing on the Web. Linked data weave the Web of structured knowledge and are a relevant technical answer to the challenges of open and FAIR data [6], carrying the promise to enable interoperability between resources and communities that adopt these standards. Data linking is defined as the scientific challenge of automatically establishing typed links between entities coming from two or more different structured datasets or KGs [1,13]. It is as crucial for the Web of today as HTTP links were for the Web of the 90s. A variety of data linking systems have been proposed over the years [13] and a number of benchmarks have been shared publicly in order to enable the evaluation of these systems, driven largely by the Ontology Alignment Evaluation Initiative (OAEI) [7]. While this has allowed for the generation of vast amounts of linked data, as demonstrated by the well-known LOD project (https://lod-cloud.net) or schema.org-related initiatives [14], designing generic solutions benchmarked over competition-oriented datasets has also led to undesired effects, such as benchmark overfitting [8,9]. This limits the applicability of these solutions in real-world scenarios where data, in addition to being highly heterogeneous, incomplete and dynamic, are often very strongly domain-specific [1].

DACE-DL proposes a paradigm shift in the way the data linking problem is approached. Instead of devising incremental generic solutions, the project will develop data-centric bottom-up approaches leveraging artificial intelligence (AI), specifically machine learning (ML) and representation learning (RL) models. Rather than trying to fit one generic solution to every linking problem and dataset, we propose to first gain a better understanding of the underlying data and then apply a targeted solution best suited to the datasets at hand. DACE-DL will deliver hybrid AI-based data linking approaches and tools that can learn from the large number of existing links and systems, as well as from the semantic structure of the linked datasets, reducing the end-user effort in this process.

Research agenda

DACE-DL’s paradigm rests on two ideas: automatically identifying, via machine learning techniques, the data linking problem types (LPTs) that two knowledge graphs manifest, and applying the modular linking solutions that best fit the identified problem types. The PhD project will focus on:

(1) the learning and generation of joint dataset profile features, which will enable the training and validation of transferable and interpretable ML models for data linking. This comprises (a) identifying hand-crafted features based on joint graph profiling [10], (b) learning joint graph embeddings for two KGs by studying the transferability of current models, e.g. [11], (c) proposing a hybrid method to fuse the hand-crafted and automatically learned features for classifying pairs of datasets with respect to their LPTs via supervised and interpretable machine learning models.
(2) research into the interpretability of the AI models developed in (1). We will build on our previous work [15] and on insights and resources (e.g. a taxonomy of LPTs or the semantic structure of the KGs) gained and built in the workflow of the project in order to develop interpretable linking models, where (co-)relations between dataset features, data linking problem types, modular solutions and their combination will be made explicit.
(3) the application of the developed models on (a) real-world and (b) benchmark data. For (a), we will rely on datasets related to the COVID-19 pandemic [12] and to the agronomy field, while for (b) we will rely on a wide range of datasets coming from the OAEI initiative.
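
To give candidates a concrete feel for the hand-crafted joint profiling features mentioned in the research agenda, here is a minimal Python sketch. It is an illustration only and not the project's method: the three features (predicate_jaccard, literal_ratio_gap, size_ratio), the helper names and the toy triples are all assumptions made for this example, and a literal is crudely approximated as any object that does not start with "http".

```python
# Illustrative sketch only: feature names and toy data are assumptions,
# not features defined by the DACE-DL project.

Triple = tuple[str, str, str]  # (subject, predicate, object)


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def literal_ratio(kg: list[Triple]) -> float:
    """Share of triples whose object is a literal.

    Assumption: an object is treated as a literal if it does not start
    with 'http'. Real RDF data would need proper term typing.
    """
    if not kg:
        return 0.0
    return sum(1 for _, _, o in kg if not o.startswith("http")) / len(kg)


def joint_profile(kg_a: list[Triple], kg_b: list[Triple]) -> dict[str, float]:
    """Compute a few joint (pairwise) profile features for two KGs."""
    preds_a = {p for _, p, _ in kg_a}
    preds_b = {p for _, p, _ in kg_b}
    largest = max(len(kg_a), len(kg_b))
    return {
        # How much vocabulary (predicates) the two graphs share.
        "predicate_jaccard": jaccard(preds_a, preds_b),
        # How differently the two graphs rely on literal values.
        "literal_ratio_gap": abs(literal_ratio(kg_a) - literal_ratio(kg_b)),
        # Size balance between the two graphs, in (0, 1].
        "size_ratio": min(len(kg_a), len(kg_b)) / largest if largest else 0.0,
    }


RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

kg_a = [
    ("http://a.example/1", FOAF_NAME, "Alice"),
    ("http://a.example/1", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
]
kg_b = [
    ("http://b.example/7", FOAF_NAME, "Alice B."),
    ("http://b.example/7", "http://schema.org/birthDate", "1990-01-01"),
    ("http://b.example/7", RDF_TYPE, "http://schema.org/Person"),
    ("http://b.example/8", FOAF_NAME, "Bob"),
]

features = joint_profile(kg_a, kg_b)
print(features)
```

A feature vector of this kind, possibly concatenated with learned graph embeddings, could then be passed to an interpretable supervised classifier that predicts the LPT of a dataset pair; which features and models actually work is precisely what the thesis is meant to investigate.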

The PhD project is expected to produce high-quality publications in top-ranked scientific journals and conferences in the fields of AI, machine learning, web data science and semantic web.

Expected profile

We are looking for a motivated junior researcher with experience in machine learning, knowledge graphs, semantic web and linked data. The candidate should match most of the following criteria:

High motivation for scientific research
Knowledge of semantic web technologies and knowledge graphs
Background in machine learning, NLP and representation learning for both text and graphs
Excellent technical skills to conduct experiments with real-world and benchmark data (e.g. Python, scikit-learn, PyTorch/PyKEEN)
Excellent spoken and written English, basic knowledge of French
Autonomy and initiative, ability to take technical decisions within the project

Application

Applications for this position will be accepted EXCLUSIVELY as a single PDF document, with your name in its file name, made available for download through a link sent by email to Konstantin Todorov (todorov@lirmm.fr). Please do not attach documents to the email; include links instead if you would like to send additional documents.

Required documents are:

a curriculum vitae
a motivation letter describing your interest in the position and how you match the expected profile
a link to your master's thesis or relevant related publications
copies of your transcripts of records (master's, bachelor's)
names and contact details of referees and/or letters of recommendation

Contract

The successful candidate will be employed by the University of Montpellier for a period of three years (approx. 1700€/month). Social security and benefits are included. It will be possible (but not mandatory) to complement the salary with teaching activities.

LIRMM - Laboratory of Computer Science, Robotics and Microelectronics of Montpellier; UM - University of Montpellier

PhD position in AI, machine learning and Knowledge Graphs in Montpellier, France