Thesis on analytical queries for data discovery and exploration (M/F)

French National Centre for Scientific Research (CNRS)

Job Description

General information

Job title: Thesis on analytical queries for data discovery and exploration (M/F)
Reference: UMR5217-SILMAN-006
Number of positions: 1
Work location: Saint-Martin-d'Hères
Publication date: Thursday, June 5, 2025
Type of contract: Fixed-term contract, doctoral student
Contract duration: 36 months
Thesis start date: October 1, 2025
Workload: Full-time
Remuneration: Minimum of €2,200.00 per month
Section(s) CN: 06 – Information sciences: foundations of computer science, calculations, algorithms, representations, exploitations

Description of the thesis subject

Dataset discovery is the process of identifying and collecting datasets. Its primary objective is to create a new, potentially virtual, dataset. This can be done, for example, directly through a search, by browsing from related datasets, or by browsing datasets using a specific annotation. Dataset exploration involves understanding the properties of datasets and the relationships between them. This can be done, for example, by exploring the relationships of a given dataset, by visualizing shared annotations at the dataset or attribute level, or by exploring relationships that are shared by multiple datasets. Data mining is the process of sequentially querying a given dataset. The objective of this thesis is to bridge the conceptual gap between dataset discovery, dataset exploration, and data mining, in order to accomplish a specific task.

The main objective is to explore different approaches to building analytical process planning models for data discovery, using a data model that captures data and metadata as well as machine learning and transformation operators, and to apply these approaches to the project's three use cases: education, lifelong learning, and weather data analysis.

Tasks:
1. Design semantics for analytical queries and efficient algorithms
2. Develop algorithms for analysis patterns for the project’s practical cases
3. Implement and evaluate prototypes (performance); disseminate results (publication, source code)

Desired skills: capacity for abstraction, proficiency in the C/C++ and Python programming languages, proficiency in graph and sequential learning algorithms, and proficiency in English.

Work context

The work will be carried out at the Grenoble Computer Science Laboratory (LIG). The LIG brings together nearly 450 researchers, lecturer-researchers, doctoral students, and research support staff. They come from various organizations and are spread across three LIG sites: the campus, Minatec, and Montbonnot. The LIG aims to be a laboratory focused on the foundations and development of computer science while remaining ambitiously open to society in order to support new challenges. Its ambition is to draw on the complementarity and recognized quality of its 22 research teams to contribute to the fundamental aspects of computer science (models, languages, methods, algorithms) and to develop synergy between the conceptual, technological, and societal challenges associated with the discipline. These challenges align with the five thematic research axes explored at the LIG.
The host team, DAISY, is a joint CNRS, Grenoble INP, and UGA research team that addresses research challenges at the intersection of AI and data management, as well as data from interdisciplinary fields such as education and medicine.

Recruitment within the framework of the H2024-INFRA DataGEMS project
Data is an asset that drives innovation, guides decision-making, improves operations and impacts several domains, including science, environment, health, energy, education, industry and society as a whole. A growing number of open datasets from governments, academic institutions and businesses offer new opportunities for innovation, economic growth and societal benefits. From real-time to historical data, from structured data in tables to unstructured text, images or videos, data is very heterogeneous. Moreover, its volume and complexity create a “needle-in-the-haystack” problem: it is extremely difficult and time-consuming to discover, mine, and combine data within this expanding ocean of data. Data discovery systems, such as Google Datasets, and open data portals, such as the EOSC portal, promise to bring data closer to users but fail for the following reasons: (a) Limited data discovery capabilities, (b) Poor metadata, (c) Superficial query response, and (d) Single-table datasets. Existing tools can search spreadsheets or data published in formats such as CSV or JSON, but not complex datasets, e.g., collections of tables, text, or time-series data.

To address the above limitations, the DataGEMS project proposes a data discovery platform with generalized exploration, management, and search capabilities. DataGEMS is based on the principles of data FAIRness (findable, accessible, interoperable, reusable), openness, and reuse. It aims to seamlessly integrate data sharing, discovery, and analysis into a system that covers the entire data lifecycle, i.e., sharing, storage, management, discovery, analysis, and reuse (data and/or metadata), bridging the gap between data provider and data consumer.
DataGEMS is a HORIZON-INFRA-2024-EOSC-01-05 – HORIZON-RIA Research and Innovation Action whose goal is to build a fully operational and sustainable ecosystem of free, open source tools for data FAIRness. The project has 12 partners across 8 European countries who will collaborate to develop new tools and services that enable faster access to FAIR-by-design datasets than ever before. These tools and services facilitate the collection and analysis of heterogeneous and/or large-scale datasets, ensure the automatic production of FAIR data at research instrument level (e.g., weather stations), and support infrastructures with metadata automation tools and techniques.

The position is located in a sector under the protection of scientific and technical potential (PPST), and therefore, in accordance with the regulations, requires that your arrival be authorized by the competent authority of the MESR.
