The "data science research topics " of the Applied Mathematics teaching unit are currently concentrating their research activities on 4 major themes.

High-throughput data analysis
The analysis of complex systems based on technologies generating large volumes of data is a challenge common to many scientific disciplines. Among these, integrative biology is one of the fields whose questions drive the most significant theoretical advances in the statistical methodology for analysing high-dimensional heterogeneous data. In this context, living organisms are seen as a set of interacting components (genome, transcriptome, metabolome, proteome) for which biotechnologies provide information on a large scale, characterized by its large volume (linked to the size of the system studied) but also by its sensitivity to various factors of heterogeneity.

Modeling of brain activity using Event-Related Potentials data

Modeling of an epidemiological dynamic from multi-source information
A collaboration with the BioEpaR Unit (Oniris, Nantes) has given rise to the supervision of Anne Lehebel's thesis (2010-2013, Matisse doctoral school) on the modeling of an unobserved animal epidemiological dynamic from multi-source information. This thesis aims to set up an epidemiological surveillance system by exploring the possibility of transposing to an epidemiological context the state-space models studied by Pierre Tandeo in a thesis defended in October 2010 (VAS doctoral school) on the spatio-temporal modeling of sea-surface temperature variations.
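
To fix ideas about the kind of machinery involved, here is a minimal sketch of a linear-Gaussian state-space model filtered with the Kalman recursions, in which the latent state plays the role of the unobserved dynamic and the observations play the role of the surveillance signal. This is a generic illustration, not the model developed in the thesis; all parameter values are arbitrary assumptions.

```python
import numpy as np

# Minimal linear-Gaussian state-space model (illustrative values only):
#   x_t = a * x_{t-1} + w_t,  w_t ~ N(0, q)   (latent, unobserved dynamic)
#   y_t = x_t + v_t,          v_t ~ N(0, r)   (noisy surveillance signal)
rng = np.random.default_rng(0)
a, q, r, T = 0.95, 0.05, 0.5, 100

x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + rng.normal(scale=np.sqrt(q))
    y[t] = x[t] + rng.normal(scale=np.sqrt(r))

# Kalman filter: recursive estimate of the hidden state from the observations.
m, p = 0.0, 1.0                 # filtering mean and variance
means = np.zeros(T)
for t in range(T):
    m_pred, p_pred = a * m, a * a * p + q      # predict
    k = p_pred / (p_pred + r)                  # Kalman gain
    m = m_pred + k * (y[t] - m_pred)           # update with observation y_t
    p = (1 - k) * p_pred
    means[t] = m

print("RMSE of filtered estimate:", np.sqrt(np.mean((means - x) ** 2)))
```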

Multidimensional data analysis
For this theme, our research focuses on Euclidean representations of multidimensional objects. An important part of this research is dedicated to the development of new methods for the analysis of multiple tables, in particular extensions of Multiple Factor Analysis (AFM).
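
To illustrate the principle behind AFM, the following minimal sketch (for quantitative groups only, and not the FactoMineR implementation) standardizes each group of variables, divides it by the square root of the first eigenvalue of its own PCA so that no group dominates, and then runs a single global PCA on the weighted table. The function name and the toy data are illustrative assumptions.

```python
import numpy as np

def mfa_scores(groups, n_components=2):
    """Minimal Multiple Factor Analysis sketch for quantitative groups.

    Each group is standardized, then divided by the square root of the
    first eigenvalue of its own PCA, so that no single group dominates
    the global analysis; a global PCA is then run on the weighted table.
    """
    n = groups[0].shape[0]
    weighted = []
    for X in groups:
        Z = (X - X.mean(axis=0)) / X.std(axis=0)            # standardize the group
        lam1 = np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)).max()
        weighted.append(Z / np.sqrt(lam1))                  # balance the groups
    T = np.hstack(weighted)
    U, s, Vt = np.linalg.svd(T / np.sqrt(n), full_matrices=False)
    return (U * s)[:, :n_components] * np.sqrt(n)           # individual coordinates

rng = np.random.default_rng(1)
g1 = rng.normal(size=(50, 4))       # e.g. one block of sensory descriptors
g2 = rng.normal(size=(50, 10))      # e.g. one block of physico-chemical measures
print(mfa_scores([g1, g2]).shape)   # (50, 2)
```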

Factorial analysis of data of various types
The simultaneous factor analysis of quantitative and qualitative variables is a frequent and long-standing problem, to which we have contributed by proposing a panoply of methodologies: factor analysis of mixed data (AFDM) for the case of a single group of variables (of both types); multiple factor analysis (AFM) in the case of several groups, each comprising quantitative and/or qualitative variables; and hierarchical multiple factor analysis (AFMH) in the case where the variables (of both types) are structured according to a hierarchy.
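
The core idea of AFDM can be sketched as follows: quantitative variables are standardized, each category indicator of the qualitative variables is weighted by the square root of its frequency and centered, and a single PCA is run on the combined table. The sketch below is a simplified illustration of this principle, not the FactoMineR implementation; the function name and toy data are assumptions.

```python
import numpy as np

def famd_coordinates(X_quanti, X_quali, n_components=2):
    """Minimal factor analysis of mixed data (AFDM) sketch."""
    n = X_quanti.shape[0]
    Z = (X_quanti - X_quanti.mean(axis=0)) / X_quanti.std(axis=0)

    blocks = [Z]
    for j in range(X_quali.shape[1]):
        col = X_quali[:, j]
        cats = np.unique(col)
        ind = (col[:, None] == cats[None, :]).astype(float)  # indicator matrix
        prop = ind.mean(axis=0)
        ind = ind / np.sqrt(prop)                             # weight categories
        blocks.append(ind - ind.mean(axis=0))                 # center

    T = np.hstack(blocks)
    U, s, Vt = np.linalg.svd(T / np.sqrt(n), full_matrices=False)
    return (U * s)[:, :n_components] * np.sqrt(n)             # individual coordinates

rng = np.random.default_rng(2)
quanti = rng.normal(size=(60, 3))
quali = rng.choice(["a", "b", "c"], size=(60, 2))
print(famd_coordinates(quanti, quali).shape)  # (60, 2)
```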

At the same time, an extension of AFM to a set of contingency tables with homologous rows was developed with M. Bécue (University of Barcelona): the AFM of contingency tables (AFMTC). Starting from the AFMTC, we considered introducing quantitative and/or qualitative variables simultaneously with frequency-type variables (that is, a block of columns which, crossed with the individuals, constitutes a contingency table). This resulted in an extension of AFM with wide application potential.

Incomplete data
This theme is treated in two frameworks: factor analysis and the linear model.

Factor analysis
The issue of missing data in Principal Component Analysis (PCA) is a frequently encountered problem that has given rise to a large body of work. We have defined a common framework that allows several existing algorithms to be compared (NIPALS, an approach by weighted alternating least squares, and an approach by iterative PCA). Two major problems appeared: over-fitting and the choice of the number of dimensions of the model (necessary for the reconstruction of the data). The probabilistic formulation of PCA (Tipping & Bishop, 1997) made it possible to propose a regularization term to overcome the over-fitting problem. The proposed algorithms rely on a number of axes which must be specified, a question we addressed as a study in its own right. For the choice of the number of axes, cross-validation is a well-suited strategy but computationally very expensive. We have proposed an approximation of cross-validation which overcomes the computation-time problem and provides good estimates of the number of axes.
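
The iterative PCA idea can be sketched as follows: missing cells are initialized with column means, the table is approximated by a low-rank PCA reconstruction, the missing cells are replaced by their reconstructed values, and the procedure is iterated until convergence. The simplified numpy sketch below mimics the regularization by shrinking the retained singular values; it is an illustration of the principle, not the missMDA implementation, and the shrinkage rule used here is an assumption.

```python
import numpy as np

def iterative_pca_impute(X, n_components=2, shrink=0.0, n_iter=200, tol=1e-6):
    """Impute missing values by iterative (EM-like) PCA.

    Missing cells are initialized with column means, then repeatedly
    replaced by their rank-`n_components` PCA reconstruction.  `shrink`
    mimics a regularization term by shrinking the retained singular
    values, which limits over-fitting when many cells are missing.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        s_hat = np.maximum(s[:n_components] - shrink, 0.0)   # shrunken singular values
        recon = mu + (U[:, :n_components] * s_hat) @ Vt[:n_components]
        new = np.where(missing, recon, X)                     # only missing cells change
        if np.linalg.norm(new - filled) < tol * np.linalg.norm(filled):
            filled = new
            break
        filled = new
    return filled

rng = np.random.default_rng(3)
data = rng.normal(size=(40, 6))
data[rng.random(data.shape) < 0.15] = np.nan                  # 15% missing at random
print(np.isnan(iterative_pca_impute(data, n_components=2, shrink=0.5)).any())  # False
```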

The proposed methods provide a straightforward estimation of the parameters and of the missing data. We have also studied the uncertainty of the PCA results due to missing data, through an adaptation of multiple imputation to the PCA framework. This uncertainty can be visualized using confidence ellipses around the parameter estimates.
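
The following sketch is a strong simplification of that idea, not the multiple-imputation algorithm developed by the group: the imputed cells are perturbed with Gaussian noise whose scale is taken from the PCA residuals (an assumed noise model), each completed table is projected onto the axes of a reference PCA, and the spread of an individual's coordinates across imputations provides the covariance from which a confidence ellipse can be drawn.

```python
import numpy as np

def pca_scores(X, k):
    """Coordinates of the rows of X on its first k principal axes."""
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return (X - mu) @ Vt[:k].T, Vt[:k], mu

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
mask = rng.random(X.shape) < 0.2
X_obs = np.where(mask, np.nan, X)                     # 20% missing at random

# Point-estimate imputation (simply column means, for brevity) and a
# reference PCA on the completed table.
point = np.where(mask, np.nanmean(X_obs, axis=0), X_obs)
ref_scores, axes, mu = pca_scores(point, 2)
sigma = (point - (mu + ref_scores @ axes)).std()      # residual variability

# Multiple imputation: perturb the imputed cells, project each completed
# table onto the reference axes, and collect each individual's coordinates.
M = 50
coords = np.empty((M, X.shape[0], 2))
for m in range(M):
    noisy = point.copy()
    noisy[mask] += rng.normal(scale=sigma, size=mask.sum())
    coords[m] = (noisy - mu) @ axes.T

# The 2x2 covariance of coords[:, i, :] quantifies the uncertainty for
# individual i and defines its confidence ellipse.
print(np.cov(coords[:, 0, :].T))
```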

Linear model
The problem initially tackled within the framework of D. Causeur's doctoral thesis was the construction of sampling plans that optimally integrate auxiliary information. This work has progressively positioned itself in the more general context of the optimal estimation of the parameters of regression models in the presence of incomplete data. Transposing to this situation the methods of optimal quadratic estimation of the variance components of mixed models now gives rise to the development of testing strategies that extend the classical linear approach to the case of incomplete data. The sampling plans resulting from these approaches are part of the recommendations of the European project Eupigclass (5th FPRDT) for questions of pig-carcass classification.

This work is now the subject of a collaboration with Brigitte Gelein and Guillaume Chauvet (CREST-ENSAI) and David Haziza (University of Montreal) on the optimal treatment of non-response in survey theory.

Sensometry
Sensory "data science research topics " is a privileged field of application of the methods and methodologies that we develop. Originally, this area was only used to illustrate the interest for the user of our results in the field of statistics. Then, little by little, we acquired expertise in the field: the connection between the two areas of expertise, statistics and sensory analysis, made it possible to propose new methodologies for collecting sensory data which have been very successful in the profession. This is primarily the case with napping®, a methodology for direct collection of sensory distances now commonly used in industry, in France and abroad. This method was recently enriched by the addition by the subjects of qualitative elements.
