
# Methods

## Mathematical Modelling

Our team focuses on developing mathematical methodologies and algorithms for the study of biological systems and data. At the forefront of this activity lies mathematical modelling, which consists in building mathematical models that emulate the behaviour of biological systems of interest. A good model, while being a simplified version of reality, captures the essential ingredients driving the system’s behaviour and helps identify which physical mechanisms and/or biochemical processes play a critical role. It also makes it possible to conduct virtual experiments through model simulations and provides quantitative predictions regarding the system’s functioning.

Systems of interest to us, whether it be plant growth, population dynamics, chemical reaction networks or cancerous cell proliferation, generally have in common their complexity, meaning that their behavior often emerges from the interaction of a large number of underlying processes. Suitably accounting for these processes can require building mathematical models which depend on a moderate-to-large number of parameters, and the determination of such parameters given noisy experimental measurements is a task in itself (see section Statistical Inference). Another point worth highlighting is that the systems we study are open, *i.e.* they interact with an environment which can occasionally be controlled, but which is most of the time unknown. Mathematical modelling in this context may therefore also require taking into account stochasticity in the formulation, especially if one is interested in providing a more statistical description.

Functional-structural plant models have historically been an important part of the modelling activity of our team and have contributed to forging our reputation of cross-border expertise. These models rely on a description at the level of the individual (a single plant) and establish connections between the architectural development of a plant (the way it grows in space, the number of leaves and branches it generates, etc.) and its use of environmental resources through source-sink relationships (allocation of produced biomass to demanding organs). If one is interested in large-scale properties such as the yield of a field crop, one can resort to simpler models where understanding the behavior of the mean individual is sufficient. Our research interests have since gradually shifted towards mixed models, which rely on a description of the entire population of individuals. Inter-individual variability may originate from many different sources, genetic heritage being one example. Mixed models explicitly take this variability into account by considering each individual as an independent realization of an underlying population characterized by a probability distribution in parameter space. A challenging though natural extension of such models, which we have recently started working on, consists in including interactions between individuals. While the nature of these interactions may vary, competition for resources is often a dominant driving force in biological systems. If competition is either sufficiently short-ranged or equally distributed between all individuals (mean-field approximation), an elegant description can be obtained in the macroscopic limit of a large number of individuals in terms of a stochastic differential equation over the population state variables. This formulation has the added benefit of addressing concerns shared with agent-based models and game theory, and we look forward to exciting future developments.

## Mathematical Analysis

Once a model has been built, the next step is to study its mathematical properties. Depending on the type of model under consideration, several questions can arise, ranging from model identifiability to the existence of a solution for a system of partial differential equations (PDE) or to uncertainty and sensitivity analysis.

Identifiability is a key aspect when working with complex systems, and can be seen as a prerequisite for parameter estimation (see section Statistical Inference). Several approaches can be used to study the structural identifiability of a model, based for example on a Taylor series expansion of the model or on the similarity transformation approach. Another issue, more closely related to Statistical Inference, is *practical* non-identifiability, due for example to a lack of observations or to the presence of noise in the observations.

The study of the mathematical properties of a model can also help identify important characteristics of the system. For example, in their study of the functional-structural plant model Greenlab, Mathieu et al. (ref https://hal-ecp.archives-ouvertes.fr/hal-00780592) identified oscillating patterns corresponding to the fructification and branching processes.

When building a model using a set of PDEs, the study of the mathematical properties of the system is of crucial interest in order to determine whether the system admits a solution and under which conditions, and to obtain either an analytical expression for this solution or a numerical approximation.

References:

– https://hal.archives-ouvertes.fr/hal-01272213

Another important aspect of model analysis and evaluation is the quantification of uncertainty, i.e. the study of how uncertainties in the inputs propagate to the outputs of the system. The objective of uncertainty quantification is to evaluate the probability distribution of the model’s outputs. Different sources of uncertainty can be considered, but we mainly focus on parameter uncertainty, an issue which is also closely related to Statistical Inference and the construction of confidence intervals for the predictions. This aspect has been studied for example in Yuting Chen’s thesis (https://tel.archives-ouvertes.fr/tel-01165038), where several Bayesian approaches for assessing parameter uncertainties were compared.
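As an illustration of how parameter uncertainty can be propagated to a model output, here is a minimal Monte Carlo sketch. It is not the method used in the thesis cited above: the logistic growth model, the Gaussian parameter distributions and all function names are assumptions chosen only to keep the example short.

```python
import math
import random


def logistic_growth(r, k, t=5.0, y0=1.0):
    """Closed-form logistic growth, a hypothetical stand-in for a real model."""
    return k / (1.0 + (k / y0 - 1.0) * math.exp(-r * t))


def propagate_uncertainty(n_samples, seed=0):
    """Draw uncertain parameters and push each draw through the model."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_samples):
        r = rng.gauss(0.5, 0.05)    # assumed uncertainty on the growth rate
        k = rng.gauss(100.0, 10.0)  # assumed uncertainty on the carrying capacity
        outputs.append(logistic_growth(r, k))
    return outputs


def summarize(outputs, level=0.95):
    """Return the mean and an empirical central interval of the outputs."""
    ordered = sorted(outputs)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(tail * n)]
    upper = ordered[min(n - 1, int((1.0 - tail) * n))]
    return sum(ordered) / n, lower, upper
```

Calling `summarize(propagate_uncertainty(2000))` gives an empirical mean and a 95% interval for the output; the sorted sample itself is a crude approximation of the output distribution.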

Comparison of direct prediction via uncertainty analysis and data assimilation with the convolution particle filter: Yield prediction for sugar beet with the LNAS model (Fig. a) and for wheat with the STICS model (Fig. b). Work of Yuting Chen, cf. [Chen and Cournède, 2014]

References:

– https://hal.archives-ouvertes.fr/hal-00776551

Finally, sensitivity analysis can be used to assess the relative importance of the inputs and how the variability of the outputs can be partitioned into variability of the inputs. It provides useful insights on the model’s properties such as non-linearity for example (see https://hal.inria.fr/hal-01192293v2).

Several results have been obtained in the team concerning sensitivity analysis. Efficient simulation of Sobol’s indices for global sensitivity analysis was obtained during Qiongli Wu’s thesis. Work has also been initiated towards the consideration of correlated inputs (see for example https://hal.archives-ouvertes.fr/hal-00826104 or more recently http://journal-sfds.fr/index.php/J-SFdS/article/view/603).
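To make the notion of a first-order Sobol index concrete, the sketch below uses a standard pick-freeze Monte Carlo estimator with independent inputs uniform on [0, 1]. It is a textbook estimator, not the efficient scheme developed in the thesis mentioned above, and the function name and sampling choices are assumptions.

```python
import random


def first_order_sobol(model, dim, n, seed=0):
    """Estimate first-order Sobol indices of `model` for independent U(0, 1) inputs."""
    rng = random.Random(seed)
    # Two independent input samples A and B of size n.
    a = [[rng.random() for _ in range(dim)] for _ in range(n)]
    b = [[rng.random() for _ in range(dim)] for _ in range(n)]
    f_a = [model(x) for x in a]
    f_b = [model(x) for x in b]
    pooled = f_a + f_b
    mean = sum(pooled) / len(pooled)
    variance = sum((y - mean) ** 2 for y in pooled) / (len(pooled) - 1)
    indices = []
    for i in range(dim):
        acc = 0.0
        for j in range(n):
            # Row j of A with its i-th coordinate taken from B:
            # it shares only input i with row j of B.
            mixed = list(a[j])
            mixed[i] = b[j][i]
            acc += (f_b[j] - mean) * (model(mixed) - f_a[j])
        indices.append(acc / n / variance)
    return indices
```

For the additive test function `lambda x: 2.0 * x[0] + x[1]`, the exact indices are 0.8 and 0.2, which gives a quick sanity check of the estimator.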

Linearity assessment in the Greenlab model using three types of linearity indices (Fig. a) and first order Sobol’s sensitivity indices (Fig. b)

References:

– https://hal.inria.fr/hal-01192293v2

– http://journal-sfds.fr/index.php/J-SFdS/article/view/603

## Statistical Inference

As mentioned in Mathematical Modelling and Mathematical Analysis, statistical inference is of primary importance given the potentially large number of parameters involved in the models our team deals with. If model parameters are badly calibrated, the model output will not be relevant, even when the mathematical description of the underlying physical phenomena is sound; good calibration is therefore crucial for prediction and data assimilation. It is thus necessary to estimate likely values for these parameters.

In order to know which parameters have the most influence on the model output, a sensitivity analysis method can be used (see Mathematical Analysis). Once this has been done, the least influential parameters can be assumed to be known and fixed, whereas the others, the most influential ones, are estimated.

All the models take observation noise into account, meaning that the observed data are noisy; in practice, this is mostly due to measurement errors.

Having said that, we consider two kinds of models:

- those that are deterministic in nature, i.e. where for a given set of parameters the states of the model will always be the same,
- those that are stochastic, i.e. where the same set of parameters can lead to different outcomes. This is made possible by the introduction of stochasticity through modelling noise, which accounts for either environmental randomness or inter-individual variability.

For the first type of model, statistical inference reduces to the estimation of the model parameters, whereas for the second type of model, both parameters and hidden states need to be estimated.

The estimation of observation noise and/or modelling noise is also an important issue that our team is currently trying to tackle (see below).

We have been concerned with both frequentist and Bayesian points of view. In the frequentist approach, we use algorithms such as:

- Generalized Least Squares (GLS) \cite{Paul-Henry},
- Expectation-Maximization (EM) \cite{Samis}.

In the Bayesian approach, several classes of algorithms are used:

- Markov chain Monte Carlo: MCMC methods \cite{}
- Particle filters: notably Unscented Kalman Filter (UKF) and Convolution Particle Filter (CPF) \cite{Yuting}
- Particle Markov chain Monte Carlo (PMCMC): a combination of the two previous methods where at each MCMC iteration, a particle filter is run \cite{}. It provides better estimates of the hidden states and is a promising method for the estimation of model noises.
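To illustrate how a particle filter estimates hidden states, here is a minimal sketch of the basic bootstrap particle filter. It is neither the UKF nor the CPF mentioned above, and the model is an assumption made for brevity: a Gaussian random walk hidden state observed with Gaussian noise, with `q` and `r` the standard deviations of the modelling and observation noise.

```python
import math
import random


def simulate_random_walk(n_steps, q, r, seed=0):
    """Simulate hidden states x_t = x_{t-1} + N(0, q^2) and observations y_t = x_t + N(0, r^2)."""
    rng = random.Random(seed)
    x = 0.0
    states, observations = [], []
    for _ in range(n_steps):
        x += rng.gauss(0.0, q)
        states.append(x)
        observations.append(x + rng.gauss(0.0, r))
    return states, observations


def bootstrap_particle_filter(observations, n_particles, q, r, seed=0):
    """Return the filtered mean of the hidden state after each observation."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]  # assumed prior
    filtered_means = []
    for y in observations:
        # Propagate each particle through the state equation.
        particles = [x + rng.gauss(0.0, q) for x in particles]
        # Weight each particle by the Gaussian observation likelihood.
        weights = [math.exp(-0.5 * ((y - x) / r) ** 2) for x in particles]
        total = sum(weights)
        if total == 0.0:  # numerical underflow: fall back to uniform weights
            weights = [1.0] * n_particles
            total = float(n_particles)
        filtered_means.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # Multinomial resampling according to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return filtered_means
```

In a PMCMC scheme, a filter of this kind would be run at each MCMC iteration for the current parameter values; that outer loop is not shown here.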

A more recent topic of interest in statistical inference is parameter estimation within populations, again within both the frequentist and Bayesian paradigms.

In the frequentist case, Stochastic Approximation Expectation-Maximization (SAEM) and MCMC-EM, two variants of the EM algorithm, were used \cite{Charlotte}.

In the Bayesian case, an appropriate choice of prior distributions for the population parameters conveniently yields explicit conditional distributions for these parameters, which can then be sampled directly within a Gibbs sampler.
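The role of conjugate priors in a Gibbs sampler can be sketched on a deliberately simple hierarchical model, which is an assumption and not one of the team's models: individual parameters `theta_i ~ N(mu, tau^2)`, observations `y_ij ~ N(theta_i, sigma^2)` with `sigma` and `tau` known, and a Gaussian prior `mu ~ N(m0, s0^2)`. Both conditional distributions are then Gaussian and can be sampled directly.

```python
import random


def gibbs_population_mean(data, sigma, tau, m0, s0,
                          n_iter=2000, burn_in=500, seed=0):
    """Sample the population mean mu of a Gaussian hierarchical model by Gibbs sampling."""
    rng = random.Random(seed)
    n_individuals = len(data)
    counts = [len(y) for y in data]
    means = [sum(y) / len(y) for y in data]
    mu = m0
    mu_samples = []
    for iteration in range(n_iter):
        # Step 1: theta_i | mu, y_i is Gaussian (conjugacy at the individual level).
        thetas = []
        for y_bar, n in zip(means, counts):
            precision = n / sigma ** 2 + 1.0 / tau ** 2
            mean = (n * y_bar / sigma ** 2 + mu / tau ** 2) / precision
            thetas.append(rng.gauss(mean, precision ** -0.5))
        # Step 2: mu | theta is Gaussian (conjugacy at the population level).
        precision = n_individuals / tau ** 2 + 1.0 / s0 ** 2
        mean = (sum(thetas) / tau ** 2 + m0 / s0 ** 2) / precision
        mu = rng.gauss(mean, precision ** -0.5)
        if iteration >= burn_in:
            mu_samples.append(mu)
    return mu_samples
```

With a weakly informative prior (large `s0`), the average of the returned samples should be close to the mean of the individual sample means.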

## Optimization and Optimal Control

If a model is well designed, well identified and well validated, it can be used for two main purposes:

- prediction: what will be the system outputs in new conditions,
- optimization: what would be ‘better’ system inputs in order to optimize the outputs.

Prediction is straightforward, especially in the Bayesian framework, and is strongly related to uncertainty analysis.

Optimization takes several forms. First, we may be interested in optimizing model parameters. This is the case, for example, in plant genetic improvement, where parameters are linked to plant genotypes and the aim is to find ideotypes, that is to say genotypes that would be best suited to a given environment (Qi *et al.*). Mathematically, a first difficulty is that the problems are usually non-convex, so heuristic optimization algorithms have to be used; we particularly focus on particle swarm optimization, other evolutionary algorithms and simulated-annealing-like algorithms. A second difficulty is that optimization often has to be conducted under uncertain conditions, in which case the framework of stochastic approximation is used.
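As a rough illustration of particle swarm optimization, here is a minimal sketch of the standard global-best variant with box constraints. The coefficient values, the function names and the Rastrigin test function are assumptions chosen for the example, not settings used in our work.

```python
import math
import random


def rastrigin(x):
    """A classical non-convex test function with many local minima (minimum 0 at the origin)."""
    return 10.0 * len(x) + sum(v * v - 10.0 * math.cos(2.0 * math.pi * v) for v in x)


def particle_swarm_minimize(objective, bounds, n_particles=30, n_iter=200,
                            inertia=0.7, cognitive=1.5, social=1.5, seed=0):
    """Minimize `objective` over the box `bounds`; return (best position, best value)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # personal best positions
    best_val = [objective(p) for p in pos]      # personal best values
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]  # global best
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + cognitive * r1 * (best_pos[i][d] - pos[i][d])
                             + social * r2 * (g_pos[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Move the particle and clip it to the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val
```

A call such as `particle_swarm_minimize(rastrigin, [(-5.12, 5.12)] * 2)` returns the best point found; like any heuristic, it comes with no guarantee of reaching the global minimum.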

In dynamic systems, when we want to dynamically optimize the external variables of the system, we have to solve an optimal control problem. For example, we may be interested in:

- finding the proper times and doses of treatments in pharmacokinetic / pharmacodynamic models,
- artificially expressing or repressing genes to impact a gene-protein regulation network,
- in agriculture, finding the best irrigation or fertilization strategy to optimize the farmers’ final income under environmental constraints (there is no Planet B!).

As for direct optimization, we are often confronted with non-convex and stochastic problems (Ramanathan et al., 2013).

Finally, optimal design of experiments is critical in the parameterization process. The objective is to design (always under cost constraints…) the experimental protocol that will bring the most information for model parameterization and uncertainty reduction. Llamosi *et al.* (2015) propose a strategy to find successive experiments with targeted gene expression variations for the parameterization of large-scale gene-protein regulation networks.

# Fields of Interest

## Agronomy and Environment

« *Die Rose ist ohne Warum* » (the rose is without why) wrote Silesius in the 17th century. Does the same apply to the potato, to sugar beet, to wheat, or to any of the plants that have provided our sustenance since the very dawn of humanity? Understanding the plant world is a key issue of our times.

During the Green Revolution of the 1970s, agricultural practices were characterized by heavy use of farm inputs and intensive irrigation. Production did increase significantly in those years, and this contributed to reducing economic discrepancies at the world scale, but soils and ecosystems were damaged in the long term, and the aftermath of this rush for productivity can now be seen: infertility, watershed pollution, desertification, monoculture economies… With growing concern for the environment, these drawbacks can no longer be neglected, which is why agricultural industries are so eager for technologies that help them better understand plants and their needs, and gain clear insights into the complex relationship between a plant and its environment. Extensive use of data collected on crops, coupled with mathematical modelling, could well entail a paradigm shift in agricultural practices.

The ßiomathematics team was born from the methodological expertise developed in the Digiplante project, with the ambition of expanding the application fields to other biological systems. Since 2004, Digiplante has developed so-called functional-structural models for numerous plant species, in keeping with the work of Philippe de Reffye on the GreenLab model (see the link for a brief history of plant modeling). These mechanistic models are intended to accurately represent the plant architecture in conjunction with its inner functioning. They are also formulated so as to make a clear distinction between the contributions of the genotype and of the environment to the plant phenotype (its global appearance or, from a modeling perspective, its features of interest).

Conceptual research has been carried out in ßiomathematics/Digiplante to define formally mechanistic plant models, ranging from the individual scale to the landscape scale (see the thesis of Vincent Le Chevalier on functional landscapes). Original statistical methodologies, *e.g.* Cournède *et al.*, 2011, have been developed for the calibration of such models on data collected on crops, and for the straightforward exploitation of their results by practitioners, *e.g.* to predict rigorously crop yield from environmental conditions. Once validated on experimental data, such models enable an optimization of the quantities of interest, such as yield, by deriving the best management strategy for water resources or farm inputs, or by selecting the variety that is the fittest for a specific environment. This varietal selection is made possible by linking the model parameters to quantitative trait loci (QTL) at the genotype level (see Véronique Letort *et al.* 2008). Currently, our research in the field of agronomy and environment remains associated with a multi-scale approach, from the genotype locus to global metabolic functions, from the organ to the individual, from individuals to a heterogeneous population with inner competitions.

On the one hand, our collaborative projects with industrial partners (especially our strong partnership with the young company CybeleTech) are a way to promote innovations based on our research and to stay in touch with the practical issues faced by farmers. On the other hand, our long-standing collaborations with botanists (AMAP in Montpellier or INRA Guyane), ecophysiologists (INRA Grignon) and ecologists (IEES) aim to demonstrate the capacity of plant models to rigorously validate biological assumptions and to assess the genericity of deductions. We are convinced in ßiomathematics that this interdisciplinary approach is the best way to advance the forefront of knowledge. Maybe one day, who knows, we will pick up along the way the *because* of the rose.

## Epidemiology

Mathematics and statistics have been essential to infectious disease epidemiology since the first mathematical model in epidemiology was formulated in 1760 by Daniel Bernoulli to evaluate the impact of universal variolation on human life expectancy. In particular, mechanistic approaches in epidemiology allow a synthetic approach that makes explicit the mechanisms underlying the system of interest and that is especially useful for exploring new phenomena for which few or no data are available. Such models can be used to help formulate hypotheses, understand an epidemic pattern, inform data-collection strategies, etc. When data are available, they make it possible to estimate key parameters of the models and to confront hypotheses with reality.

When studying the dynamics of infectious diseases at the population level, the most popular approach relies on the so-called compartmental transmission models, represented as a set of ordinary differential equations:

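$$\frac{dx}{dt} = f(x, u, p, t)$$

together with initial conditions $x_0$, where $x(t) \in X \subset \mathbb{R}^{n_x}$ is the vector of state variables, $u(t) \in U \subset \mathbb{R}^{n_u}$ the control variables, $p \in P \subset \mathbb{R}^{n_p}$ the parameters and $t \in \mathbb{R}_+$ the time.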
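As a concrete instance of such a compartmental model, the sketch below writes a classical SIR model in the form f(x, u, p, t) and integrates it with a fixed-step fourth-order Runge-Kutta scheme. The choice of SIR, the interpretation of the control `u` as a constant vaccination rate, and all names are assumptions made for illustration.

```python
def sir_rhs(x, u, p, t):
    """Right-hand side f(x, u, p, t) for x = (S, I, R) expressed as population fractions."""
    s, i, r = x
    beta, gamma = p               # transmission and recovery rates
    new_infections = beta * s * i
    vaccinated = u * s            # assumed control: vaccination moves S directly to R
    return [-new_infections - vaccinated,
            new_infections - gamma * i,
            gamma * i + vaccinated]


def rk4_integrate(f, x0, u, p, t_end, dt):
    """Integrate dx/dt = f(x, u, p, t) from t = 0 to t_end; return a list of (t, x)."""
    n_steps = int(round(t_end / dt))
    x, t = list(x0), 0.0
    trajectory = [(t, list(x))]
    for _ in range(n_steps):
        k1 = f(x, u, p, t)
        k2 = f([xi + 0.5 * dt * k for xi, k in zip(x, k1)], u, p, t + 0.5 * dt)
        k3 = f([xi + 0.5 * dt * k for xi, k in zip(x, k2)], u, p, t + 0.5 * dt)
        k4 = f([xi + dt * k for xi, k in zip(x, k3)], u, p, t + dt)
        x = [xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
             for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]
        t += dt
        trajectory.append((t, list(x)))
    return trajectory
```

For example, `rk4_integrate(sir_rhs, [0.99, 0.01, 0.0], 0.0, (0.5, 0.1), 100.0, 0.1)` simulates an uncontrolled epidemic, and rerunning with a positive `u` shows the effect of the control on the infection peak.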

Traditionally, epidemiological studies have focused on a single pathogen in a single host population. But the dynamics of a pathogen in a population and the impact of control measures might depend not only on the interaction of the pathogen with the host population, but also on the interaction of the host with other host populations, the interaction of the host with other pathogens or the interaction of the pathogen with other pathogens. Thus the classical one-host-one-pathogen approach is inadequate to understand the dynamics of many infectious diseases or to anticipate the impact of many control measures: it yields incorrect or incomplete conclusions because some pieces of the puzzle needed for a full understanding of the disease dynamics are missing. Instead, a community ecology perspective including communities of hosts and/or communities of pathogens is essential to address several epidemiological problems. Indeed, various current problems in infectious disease epidemiology involve infectious agents that infect more than one host species, host populations in which multiple parasite species or strains co-circulate, or even hosts infected by multiple parasite species or strains. In view of these requirements, the traditional modeling framework of one pathogen circulating in one host population has been extended to consider multiple pathogens or strains and multiple host populations. This extended conceptual framework is known as the theory of community epidemiology [1].

In collaboration with the Pasteur Institute (Phemi, PhD of Margarita Pons-Salort), we focused on interactions between pathogens or strains and on the phenomenon of vaccine-induced pathogen strain replacement. First, regarding human papillomavirus vaccination and its impact on the incidence of cervical cancer, we studied how vaccination may change the prevalence of non-vaccine types and, in particular, how likely an increase in non-vaccine type prevalence is [2]. Second, regarding *S. pneumoniae* serotype replacement, we explored the interplay between antibiotic use and vaccination on the incidence of pneumococcal meningitis, and how antibiotic use modulates the phenomenon of serotype replacement [3].

With the additional collaboration of the Centre de Recerca en Infeccions Víriques, Illes Balears (CRIVIB) / IRBIO, Universitat de Barcelona, we explore the mechanisms responsible for European Bat *Lyssavirus* persistence in a system of multi-species bat colonies, using a stochastic model based on the Gillespie algorithm [4].
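For readers unfamiliar with the Gillespie algorithm, here is a minimal sketch of its direct method applied to a single-population stochastic SIR model. This is a much simpler setting than the multi-species metapopulation model of [4]; the two events, the rate expressions and the function name are assumptions for illustration only.

```python
import random


def gillespie_sir(s, i, r, beta, gamma, t_max, seed=0):
    """Simulate a stochastic SIR epidemic; return a list of (time, S, I, R) after each event."""
    rng = random.Random(seed)
    n = s + i + r
    t = 0.0
    path = [(t, s, i, r)]
    while i > 0:
        rate_infection = beta * s * i / n
        rate_recovery = gamma * i
        total_rate = rate_infection + rate_recovery
        # The waiting time to the next event is exponential with the total rate.
        t += rng.expovariate(total_rate)
        if t > t_max:
            break
        # Pick which event occurs, with probability proportional to its rate.
        if rng.random() * total_rate < rate_infection:
            s, i = s - 1, i + 1
        else:
            i, r = i - 1, r + 1
        path.append((t, s, i, r))
    return path
```

Because the population is finite, each trajectory eventually reaches extinction of the infection (I = 0); repeating the simulation with different seeds gives an empirical distribution of persistence times.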

[1] Keeling, M. J. and Rohani, P. Modeling infectious diseases in humans and animals (Princeton University Press, 2008).

[2] Pons-Salort M, Letort V, Favre M, Heard I, Dervaux B, Opatowski L, Guillemot D. Exploring individual HPV coinfections is essential to predict HPV-vaccination impact on genotype distribution: A model-based approach. Vaccine 2013; 31(8):1238–45.

[3] De Cellès, M. D., Pons-Salort, M., Varon, E., Vibet, M.-A., Ligier, C., Letort, V., Opatowski, L., Guillemot, D. (2015). Interaction of Vaccination and Reduction of Antibiotic Use Drives Unexpected Increase of Pneumococcal Meningitis. Nature – Scientific Reports, 5, 11293. http://doi.org/10.1038/srep11293

[4] Insights into persistence mechanisms of a zoonotic virus in bat colonies using a multispecies metapopulation model. Margarita Pons-Salort, Jordi Serra-Cobo, Flora Jay, Marc Lopez-Roig, Rachel Lavenir, Didier Guillemot, Veronique Letort, Hervé Bourhy, Lulla Opatowski. 2014. PLoS ONE 9(4): e95610. doi:10.1371/journal.pone.0095610

## Immunology and Oncology

The amount of biological data collected in biomedical research has exploded with the technical and experimental progress in molecular biology (cytometry, PCR, RNA-seq…). The availability of these data, especially with the efforts of the bioinformatics community to standardize and organize databases, opens new perspectives for developing quantitative approaches for a better understanding of disease and for fostering precision medicine, whose objective is to tailor prevention, diagnosis and treatment to the molecular profile of each patient.

The ßiomathematics team aims at developing statistical methods for mining these data to help answer complex biological questions, especially in oncology and immunology.

A few examples:

- Cancer immunotherapy (collaboration with Hôpital Saint-Louis / CEA) aims at helping our immune system attack tumour cells to cure cancer. Proteins **called immune checkpoints** are located around the tumour cells and block the action of the immune system. Anti-PD-1 and anti-PD-L1 treatments against bladder cancer have recently shown promising results. The principle of the treatment is illustrated in the figure below.

A key issue is thus the identification of the key checkpoints for each type of cancer and each patient, as well as their interactions. This can be done through the statistical analysis of gene expression data (for example RNA-seq).

- Parameter inference in regulation networks

Biological regulation networks and signaling pathways can be modelled by systems of ordinary differential equations, or by more complex stochastic models like continuous time Markov chains, semi-Markov chains, Markov regenerative processes or generalized semi-Markov processes. The parameters of such models (*e.g.* kinetic rates) are generally difficult to estimate from observations of the population of proteins, either because of the size of the system or simply because there is no explicit formulation of the model likelihood. In all cases, specific statistical methods and algorithms for parameter inference have to be devised.

For example, the Wnt pathways are a family of intracellular signal transduction pathways known for playing a key role in cell development (*e.g.* cell proliferation, stem cell maintenance, differentiation, cardiac development). They are the subject of intensive studies in relation both to cancer development and to embryonic development pathologies.

In (Koutroupmas *et al.*, 2016), we propose a method based on the SMC-ABC algorithm for the parameter estimation of a Wnt pathway (with the approximate posterior distribution of some parameters given below).
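The idea behind approximate Bayesian computation (ABC) can be sketched with its simplest form, rejection ABC, which is a precursor of the SMC-ABC algorithm and not the method of the paper cited above. The toy problem is an assumption: a single kinetic rate is inferred from exponentially distributed waiting times, using the sample mean as summary statistic and a uniform prior.

```python
import random


def simulate_waiting_times(rate, n, rng):
    """Simulate n exponential waiting times for a given kinetic rate."""
    return [rng.expovariate(rate) for _ in range(n)]


def abc_rejection(observed, prior_low, prior_high, tolerance, n_draws, seed=0):
    """Return prior draws whose simulated summary lies within `tolerance` of the observed one."""
    rng = random.Random(seed)
    observed_mean = sum(observed) / len(observed)
    accepted = []
    for _ in range(n_draws):
        # Draw a candidate rate from the uniform prior.
        rate = rng.uniform(prior_low, prior_high)
        # Simulate a data set of the same size; no likelihood evaluation is needed.
        simulated = simulate_waiting_times(rate, len(observed), rng)
        simulated_mean = sum(simulated) / len(simulated)
        if abs(simulated_mean - observed_mean) < tolerance:
            accepted.append(rate)
    return accepted
```

The accepted values form a sample from an approximate posterior distribution. SMC-ABC improves on this scheme by progressively decreasing the tolerance and reusing accepted particles, which is much more efficient when the prior is far from the posterior.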

## Neurosciences

In neuroscience, new measurement techniques have made it possible to acquire a wealth of experimental data. To study these data, it is important to understand how neurons interact.

Neurons transmit neural information through the brain as electrical and chemical signals. A neuron receives impulses from other neurons through a number of « dendrites ». These signals increase the membrane potential of the soma (the cell body of the neuron); if this potential exceeds a given threshold, called the excitation threshold, it increases quickly before dropping. This brief and stereotyped depolarization, referred to as the action potential, propagates along the axon, and we observe a discharge of the neuron. When the action potentials reach the synapses, these send new chemical signals to the dendrites of one or several neurons, and the signal is thus transmitted to other neurons.

We are able to record two kinds of data:

- spike trains, which are sequences of the occurrence times of action potentials exceeding the excitation thresholds of neurons; these data are discrete,
- a continuous signal of the action potential of one neuron.

To model the spike trains, we consider counting processes, and we are often interested in estimating the intensity of the counting process to obtain the interaction graph of the neurons. Recording the continuous signal of the action potential is a more invasive procedure, because an electrode has to be introduced into the neuron. The action potential can be modelled by a diffusion process.

In order to link a continuous signal of the action potential of one fixed neuron to spikes from other external neurons, we consider a stochastic differential equation to which we add a sum of counting processes that model the interactions of the external neurons with the fixed neuron. We use Hawkes processes to model the spike trains, and we want to estimate the parameters of this stochastic differential equation.
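To give an idea of what a Hawkes spike train looks like, the sketch below simulates a univariate Hawkes process with Ogata's thinning method. The exponential kernel, with intensity `mu + sum(alpha * exp(-beta * (t - t_i)))` over past spikes `t_i`, and the parameter names are assumptions; the coupling with the stochastic differential equation for the membrane potential is not shown.

```python
import math
import random


def simulate_hawkes(mu, alpha, beta, t_end, seed=0):
    """Simulate spike times on [0, t_end) for a Hawkes process with exponential kernel."""
    rng = random.Random(seed)
    events = []
    t = 0.0
    excitation = 0.0  # current value of the self-exciting part of the intensity
    while True:
        # Between spikes the intensity only decays, so its current value is an upper bound.
        upper_bound = mu + excitation
        wait = rng.expovariate(upper_bound)
        t += wait
        if t >= t_end:
            break
        # Decay the excitation to the candidate time.
        excitation *= math.exp(-beta * wait)
        intensity = mu + excitation
        # Thinning: accept the candidate with probability intensity / upper_bound.
        if rng.random() * upper_bound <= intensity:
            events.append(t)
            excitation += alpha  # each accepted spike raises the intensity
    return events
```

The process is stationary when `alpha / beta < 1`, in which case the long-run spike rate is `mu / (1 - alpha / beta)`; comparing this value with the empirical rate of a long simulation is a simple check.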

## Healthcare System and Data

Medical equipment, IT systems and software are used by physicians on a daily basis for diagnosis, surgery preparation or follow-up treatment. These connected devices generate huge amounts of data, most of the time recorded in log files. These crucial data, when organized and analyzed properly, can be transformed into actionable insights and powerful decision-making tools.

Indeed, studying hardware resource consumption, for instance the evolution of various parameters over time, makes it possible to highlight anomalies. These defects might have severe clinical consequences, all the more so in the case of interventional applications or procedures, in which patient safety is at stake. Therefore, reducing or eliminating the defect rate is a major line of the continuous improvement effort regarding the overall quality of medical devices. Statistical learning algorithms make it possible to tackle these issues, and robust methods might be developed to perform the following tasks:

- real-time anomaly detection and anticipation
- preventive maintenance management
- risk monitoring

On the other hand, the analysis of practitioners’ workflows also provides extremely valuable information. The study of their command history makes it possible to highlight frequent patterns of use, such as specific user profiles. These precious data can be used to enhance overall software usability, especially regarding the following aspects:

- personalization of the user interface
- automation of the user’s favorite routines
- anticipation of the user’s needs

Indeed, these key points facilitate user interaction with the interface and therefore ensure a strongly reduced diagnosis time.