
Laurent Vanni

From statistical text analysis to neuronal networks. Towards deep linguistic motifs.

Supervised by Damon Mayaffre & Frédéric Precioso

Thesis defended on 10 Dec. 2021.

This thesis presents a new methodology for text analysis situated at the intersection of textual statistics, automatic language analysis and deep learning. It draws on the architecture of neural networks and its potential to extract information from texts. The accuracy of convolutional models for text classification depends on the quality of the linguistic markers they identify. To make these markers accessible to linguists, we implement an algorithm that combines the following computational and linguistic approaches:
1) Deconvolution (Zeiler and Fergus 2014) is an algorithmic process used in image analysis, which we apply here to written texts in order to interpret the word representations learned by the neural network. In particular, we analyse each layer of a deep model to understand the mechanisms linking the raw data that forms the input to the network (the text) with the predictions that constitute its output (the classification). The aim is to express these connections in ways that are useful to linguists. We propose a new measure to express the relative weight of words in a deep model: Text Deconvolution Saliency (TDS).
2) Our approach also draws on the linguistic theory of textual motifs, or patterns, developed by Mellet and Longrée (2009), which informs the design of deep architectures that allow texts to be analysed as complex, multidimensional linguistic objects. The motif theory permits the detection of key passages in the output of deconvolution, and even the identification of deep motifs in cases where the model proposes several representations of words (e.g., full form, part of speech, and lemma).
The new methodology developed in our research connects empirical observations (TDS) with linguistic analysis in ways that open up new possibilities for the interpretation of texts.
Several studies applying this protocol in different contexts have served to test the methodology and are used to illustrate its effectiveness throughout the thesis. Although the interpretation of deep neural networks for linguistic analysis is a very new field, the initial results presented here are encouraging, and the thesis concludes with suggestions for further research in this domain.

Jury:
Beaudouin Valérie, Professor, Télécom Paris (Reviewer)
Huet Fabrice, Professor, Université Côte d’Azur (Examiner)
Longrée Dominique, Professor, Université de Liège (Examiner)
Mayaffre Damon, Research Scientist, CNRS (Thesis Supervisor)
Précioso Frédéric, Professor, Université Côte d’Azur (Thesis Supervisor)
Sèdes Florence, Professor, Université Toulouse III (Reviewer)
