Date: June 22, 2023, 10 a.m. - Thi Thu Trang NGO - Room A104
Towards new methods for the integration and interrogation of geo-referenced sensor data applied to the environmental cloud for the benefit of agriculture (CEBA).
In recent years, the widespread use of sensors has revolutionized many sectors, particularly agriculture and environmental applications, with the emergence of the Internet of Things (IoT). Equipped with batteries, sensors can work autonomously in remote locations, without the need for maintenance or repairs. They are typically deployed as clusters, in which each sensor collects data on its surroundings and communicates wirelessly with one or several gateways, which in turn communicate with the cloud. The collected data is analyzed using various techniques and visualizations to extract insights and drive decisions. This entire process is known as the "sensor-to-decision chain". However, data analysis across heterogeneous sources is a complex task, especially when the data comes from sensors. One difficulty is interoperability, since sensors come from many different vendors. The Open Geospatial Consortium (OGC) has defined several standards for accessing sensor data to overcome this issue. Additionally, stream processing presents numerous challenges due to the high rate at which sensor data is generated.
In this thesis, we first address the problem of integrating and analyzing geo-referenced data from multiple sources. We propose both a method and an architecture to represent and query a spatial data warehouse (SDW) model with the ELK (Elasticsearch, Logstash, Kibana) stack. We demonstrate how to implement a data warehouse of geo-referenced data in an ELK-based architecture, using a component called IAT (Integration and Aggregation Tool) that operates like a streaming ETL to integrate different sensor data and load it into Elasticsearch. We illustrate the approach with two multidimensional models relevant to environmental sensor data and show the value of the system with some real-world user queries. Additionally, we evaluate several aspects of the system on a benchmark dataset.
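As an illustration of the kind of user query such a warehouse might serve, the sketch below builds an Elasticsearch search body that averages a measure per day inside a geographic bounding box. It is a minimal sketch and not the thesis's implementation: the field names `location`, `timestamp` and `value`, and the function name, are assumptions made for this example.

```python
from typing import Any, Dict


def build_daily_average_query(
    top_left: Dict[str, float],
    bottom_right: Dict[str, float],
    start: str,
    end: str,
) -> Dict[str, Any]:
    """Build a search body: daily average of a measure inside a bounding box.

    Assumed (hypothetical) index fields: `location` (geo_point),
    `timestamp` (date) and `value` (numeric measure).
    """
    return {
        "size": 0,  # only aggregation buckets are needed, not the raw hits
        "query": {
            "bool": {
                "filter": [
                    # Spatial dimension: keep sensors located inside the box.
                    {
                        "geo_bounding_box": {
                            "location": {
                                "top_left": top_left,
                                "bottom_right": bottom_right,
                            }
                        }
                    },
                    # Temporal dimension: keep readings in [start, end).
                    {"range": {"timestamp": {"gte": start, "lt": end}}},
                ]
            }
        },
        "aggs": {
            "per_day": {
                # Roll readings up to the day level of the time dimension.
                "date_histogram": {
                    "field": "timestamp",
                    "calendar_interval": "day",
                },
                "aggs": {"avg_value": {"avg": {"field": "value"}}},
            }
        },
    }
```

The resulting dictionary could be sent as the body of a search request. The filter clauses play the role of the spatial and temporal dimensions, and the nested aggregation plays the role of the roll-up.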
In the second part, we address the problem of big and streaming spatial data analysis. We propose a system based on mediation techniques for analyzing spatial streams in real time with seamless integration. We also address the issue of integrating different data sources under a uniform schema for efficient analysis. We propose an interface and a customized SQL grammar to express queries with streaming and spatial semantics. The proposed system allows an administrator to configure it by designing a mediated schema and defining the mappings between the mediated schema and the data sources. Users can express queries on the mediated schema in a dedicated SQL grammar, and the system rewrites each query into an Apache Spark application. The results are returned to the user continuously. Moreover, we implement in the mediation system an optimizer that improves on the query plans computed natively by Spark. Our experiments show that these optimizations reduce query execution time by up to one order of magnitude.
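The idea of a mediated schema with per-source mappings can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the source names, attribute names and functions are hypothetical, the code iterates over in-memory records rather than real streams, and it does not use Spark or the thesis's SQL grammar.

```python
from typing import Any, Dict, Iterable, Iterator

# Hypothetical mappings: for each source, mediated attribute -> source attribute.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "station_a": {"sensor_id": "id", "time": "ts", "temperature": "temp_c",
                  "lat": "latitude", "lon": "longitude"},
    "station_b": {"sensor_id": "device", "time": "timestamp", "temperature": "t",
                  "lat": "y", "lon": "x"},
}


def to_mediated(source: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """Rename one source record into the mediated schema."""
    mapping = MAPPINGS[source]
    return {mediated: record[local] for mediated, local in mapping.items()}


def mediated_stream(
    sources: Dict[str, Iterable[Dict[str, Any]]]
) -> Iterator[Dict[str, Any]]:
    """Yield records from all sources under the uniform mediated schema."""
    for name, records in sources.items():
        for record in records:
            yield to_mediated(name, record)


def readings_in_box(
    rows: Iterable[Dict[str, Any]],
    min_temperature: float,
    south: float, west: float, north: float, east: float,
) -> Iterator[Dict[str, Any]]:
    """A query written once against the mediated schema, whatever the source."""
    for row in rows:
        inside = south <= row["lat"] <= north and west <= row["lon"] <= east
        if inside and row["temperature"] >= min_temperature:
            yield row
```

The point of the design is that `readings_in_box` never mentions source-specific attribute names. Adding a source only requires a new entry in `MAPPINGS`.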
Finally, we address the problem of modeling IoT sensor data. We propose a generic multidimensional model for SensorThings API compatible data sources, which is based on a UML profile. Furthermore, we model the ETL process and present our proposal through a case study.
The jury is composed of:
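To give a feel for how SensorThings API entities can map to a multidimensional structure, the sketch below flattens one Observation into a fact row. It assumes the Observation was retrieved with its Datastream, Thing, Sensor and ObservedProperty expanded inline. The choice of measure and dimensions here is an assumption made for illustration, not the model proposed in the thesis, and the function name is hypothetical.

```python
from typing import Any, Dict


def observation_to_fact(observation: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten an expanded SensorThings Observation into one fact row.

    Assumed mapping: the Observation result is the measure; Thing, Sensor,
    ObservedProperty and phenomenonTime act as dimensions.
    """
    datastream = observation["Datastream"]
    return {
        "result": observation["result"],                       # measure
        "phenomenon_time": observation["phenomenonTime"],      # time dimension
        "thing": datastream["Thing"]["name"],                  # monitored object
        "sensor": datastream["Sensor"]["name"],                # instrument
        "observed_property": datastream["ObservedProperty"]["name"],
        "unit": datastream["unitOfMeasurement"]["symbol"],
    }
```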
Mr. David SARRAMIA, MCF, LPC, Université Clermont Auvergne, Thesis co-supervisor
Mr. Didier DONSEZ, PR, LIG, Université Joseph Fourier - Grenoble 1, Reviewer
Mr. Dino IENCO, DR, TETIS, INRAE Occitanie-Montpellier, Examiner
Mr. François PINET, DR, TSCF, INRAE, Clermont-Auvergne-Rhône-Alpes, Thesis supervisor
Mr. Jérôme DARMONT, PR, ERIC, Université Lyon 2, Reviewer
Ms. Myoung-Ah KANG, MCF, LIMOS, Université Clermont Auvergne, Thesis co-supervisor