Semantic enrichment and secure management of dematerialized documents

Project start: Oct. 1, 2016 - Project end: Sept. 30, 2019

After the printing revolution initiated by Gutenberg, the 21st century is that of the digital revolution. The rise of digital technologies in general, and of information technology in particular, has brought about a profound change in modern societies. This change is visible in global networking and new forms of communication but also, at the industrial and commercial levels, in the increased automation of business processes and an increasingly significant dematerialization of services and information.
Today, most exchanges and transactions are automated and a very large number of services are available in electronic form. This dematerialization is accompanied by an expansion of digitization, the process of converting information from any medium (e.g., paper) into digital data. At present, a large amount of data still in paper form, such as invoices, pay slips, medical test results, etc., is scanned and the resulting images are stored on computer media, often archived with a "trusted third party". This raises key questions about the management of this type of data and its integration into business applications. Very often, the raw form of this data makes it very difficult, if not impossible, for computer programs to manipulate its content automatically. As a result, many processing operations on this data, such as verifying the data contained in a scanned document image, are mostly done manually.
This project is part of the general problem of the dematerialization of documents and aims to develop techniques that facilitate the management and exploitation of paperless documents. The project is supported by two university laboratories associated with the CNRS, the computer science laboratory (LIMOS) and the cognitive psychology laboratory (LAPSCO), in liaison with a consortium of companies. From a scientific point of view, the project focuses on two fundamental questions that are illustrated in the figure below:


● the semantic enrichment of document images. The goal here is to analyze a document image in order to interpret its content and extract its semantics. The planned research program builds on technical approaches to image analysis and machine learning, but will also explore the cognitive aspects related to the use of textual metaknowledge in the semantic enrichment process.

● storage and management of scanned documents. The focus is on reliability issues, including the security and efficiency of physical storage models.


From an operational point of view, the project is structured around three complementary research topics:

1. Analysis and semantic enrichment of document images (LIMOS). The aim is to develop techniques that make it possible to dematerialize and analyze a set of documents. This process consists in "recognizing" the documents and processing them (filtering, handling missing and erroneous data) so that they can be interpreted (information extraction, data analysis, semantic enrichment) according to the business context. Applications of this work may include reducing detection and processing errors and aggregating similar data from internal (enterprise) and external (Web, social networks) sources.

2. Metaknowledge and knowledge of document organization for document understanding (LAPSCO). The analysis of dematerialized documents could identify the relevant information, which would increase the reliability of extraction. Relevance could be identified not only by analyzing the semantics of the information (the content ... the important information) but also from knowledge of how a text is organized, its structure, etc. (this is referred to as textual metaknowledge).

3. Security of scanning and storage of documents (LIMOS). Digitization of documents poses two main security challenges. The first is how to ensure the authentication, integrity and traceability of scanned documents throughout their entire life cycle, from the scanning process to secure archiving. The second concerns the development of physical storage models that are secure and adapted to users' needs. The objective of this project is to study these two questions, in connection with practical use cases raised by the companies interested in this subject, and to propose solutions that are sound both theoretically and practically.
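
To make the integrity and traceability requirements of topic 3 more concrete, here is a minimal illustrative sketch in Python. It is not the project's method, which this description does not specify. It assumes that each life-cycle event (scanning, archiving, etc.) is recorded in a log whose entries are chained together and authenticated with an HMAC under a secret key. The function names `record_event` and `verify_log`, the entry fields, and the key handling are all hypothetical.

```python
import hashlib
import hmac
import json


def record_event(log, document_bytes, action, key):
    """Append a life-cycle event for a document to a tamper-evident log.

    Hypothetical helper: each entry stores the document's SHA-256 digest
    and the tag of the previous entry, then is authenticated with an HMAC.
    """
    prev_tag = log[-1]["tag"] if log else ""
    entry = {
        "action": action,
        "doc_sha256": hashlib.sha256(document_bytes).hexdigest(),
        "prev_tag": prev_tag,
    }
    payload = json.dumps(entry, sort_keys=True).encode("utf-8")
    entry["tag"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    log.append(entry)
    return entry


def verify_log(log, document_bytes, key):
    """Return True if the log is intact and matches the given document."""
    prev_tag = ""
    doc_hash = hashlib.sha256(document_bytes).hexdigest()
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "tag"}
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
        # Authentication: the entry must carry a valid HMAC tag.
        if not hmac.compare_digest(expected, entry["tag"]):
            return False
        # Traceability: entries must chain; integrity: digest must match.
        if body["prev_tag"] != prev_tag or body["doc_sha256"] != doc_hash:
            return False
        prev_tag = entry["tag"]
    return True
```

In this sketch, modifying the document or any recorded event after the fact makes `verify_log` return False. A real deployment would also need key management and trusted timestamps, which are outside the scope of this illustration.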

