1. Introduction
Interacting particle systems (IPSs), or synonymously multiagent systems (MASs), are dynamical systems that model the interaction of a group of homogeneous particles or agents. These system classes have attracted an enormous amount of attention in the mathematical community, primarily because they exhibit emergent phenomena like flocking, swarming or consensus. For an overview and introduction to the vast literature on this subject, we refer to refs. [Reference Albi, Bellomo and Fermo1, Reference Bellomo, Degond and Tadmor7–Reference Bellomo and Soler9, Reference Gibelli37, Reference Gibelli and Bellomo38]. IPSs arise in a wide range of domains, from physical processes like gas dynamics [Reference Cercignani, Illner and Pulvirenti20], to biology, sociology and very recently even data science [Reference Cristiani, Piccoli and Tosin21, Reference Cucker and Smale22, Reference D’Orsogna, Chuang, Bertozzi and Chayes26, Reference Gómez-Serrano, Graham and Le Boudec39, Reference Herty, Pareschi and Visconti43, Reference Motsch and Tadmor57, Reference Pareschi and Toscani59, Reference Toscani69]. In many applications of IPS, one has to work with a large number of particles, for example, in gas dynamics, the modelling of large animal populations, human crowds or large-scale traffic models. Analytical investigations (especially of emergent phenomena) and numerical methods (like simulation, optimisation and control) become very challenging or even impossible for such large-scale IPS.
Recently, a fruitful exchange between the fields of IPS and machine learning has gained momentum. On the one hand, theory and methods from IPS – in particular, large-scale IPS and mean-field limits – have been used to analyse and design methods in machine learning, for example, in the context of clustering problems [Reference Herty, Pareschi and Visconti43], deep neural networks [Reference Herty, Trimborn and Visconti44, Reference Mei, Misiakiewicz and Montanari56] or ensemble-based optimisation methods [Reference Benfenati, Borghi and Pareschi10, Reference Carrillo, Jin, Li and Zhu19, Reference Fornasier, Huang, Pareschi and Sünnen35, Reference Herty and Visconti45, Reference Herty and Zanella46, Reference Pinnau, Totzeck, Tse and Martin61]. On the other hand, machine learning methods are increasingly used to investigate IPS, and enhance or even replace modelling with learning-based methods. In particular, while physical processes like gas dynamics have been very successfully treated using first-principles modelling [Reference Cercignani, Illner and Pulvirenti20], complex phenomena like animal motion, crowd dynamics or traffic flow are much more challenging to model. Due to the increased availability of large data sets and computational resources, it is natural to try to learn the interaction rules of IPS from data. This question has received considerable attention lately, from both a theoretical and a practical perspective, cf. [Reference Bongini, Fornasier, Hansen and Maggioni12, Reference Lu, Maggioni and Tang53–Reference Lu, Zhong, Tang and Maggioni55]. A particularly important class of machine learning approaches consists of kernel methods [Reference Scholkopf and Smola65], which encompass, for example, Gaussian processes (GPs) [Reference Williams and Rasmussen71] and support vector machines [Reference Steinwart and Christmann68].
Kernel methods allow the systematic modelling of domain knowledge [Reference Shawe-Taylor and Cristianini66], are supported by a well-developed theory [Reference Kanagawa, Hennig, Sejdinovic and Sriperumbudur48, Reference Steinwart and Christmann68] and lead to efficient and reliable numerical algorithms [Reference Scholkopf and Smola65], capable of scaling up to very large data sets [Reference Liu, Huang, Chen and Suykens51, Reference Liu, Ong, Shen and Cai52]. All of this makes kernel methods natural candidates for machine learning on IPS. In this work, we consider two novel developments in this context.
First, surrogate models are a common approach to tackle large-scale and expensive modelling, simulation and optimisation tasks in scientific computing, statistics and machine learning [Reference Gramacy40, Reference Santner, Williams, Notz and Williams63]. The basic idea is to approximate a computationally expensive function by a surrogate model that is cheap to evaluate. Kernel-based methods, in particular GPs, are a standard tool here as well. We provide initial numerical investigations of this approach in the context of IPS by considering two prototypical tasks, for which we use a kernel method for the approximation.
Second, in kinetic theory, transitioning from microscopic to mesoscopic levels (considering distributions of particles rather than individual particles) is a common approach to managing the complexity of IPS. The mean-field limit, in which the number of particles tends to infinity, is a well-established method for this transition. Extensive studies on mean-field limits have provided rigorous analytical frameworks for understanding the emergent behaviour in IPS [Reference Boudin and Salvarani13, Reference Canizo, Carrillo and Rosado14, Reference Carrillo, Choi and Hauray16, Reference Carrillo, Fornasier, Toscani and Vecil18, Reference Degond, Herty and Liu23, Reference Degond and Motsch24, Reference Fornasier, Haskovec and Toscani34, Reference Ha and Tadmor41].
Building on these advances, recent theoretical work has explored the application of kernel methods in the mean-field limit [Reference Fiedler, Herty, Rom, Segala and Trimpe31, Reference Fiedler, Herty and Trimpe33]. These methods enable learning state-dependent features of IPS from data, facilitating the analysis of large-scale systems. The primary motivation is to infer maps that measure or estimate specific aspects of the system’s state, such as reaction to stimuli in swarming or susceptibility in opinion dynamics. In this work, we provide the first numerical experiments evaluating this approach.
We now provide an outline of the remaining paper. In Section 2, we introduce the necessary notation and provide background on reproducing kernel Hilbert spaces (RKHSs) and their application to interpolation and approximation problems. Section 3 delves into IPS models and their numerical treatments, highlighting well-known examples and the corresponding mean-field limits. In Section 4, we investigate the use of kernel methods in the context of IPS, starting with surrogate models in Section 4.1. In Section 4.2, we provide a self-contained exposition of the mean-field limit of kernels and kernel methods, tailored to our needs. These developments are illustrated with a concrete class of kernels having a mean-field limit, which then forms the foundation of the numerical investigations. The paper concludes in Section 5 with an outlook on possible future applications.
2. Background on kernel methods
We now present necessary background material. First, in Section 2.1 we concisely give the main definitions and results on kernels and RKHSs, since these function spaces will be used as candidate spaces for the interpolation and approximation problems considered later on. Next, in Section 2.2, we outline the standard approach to function interpolation in RKHSs. Finally, in Section 2.3, we recall some concepts and results related to learning and approximation with optimisation problems in RKHSs.
2.1. Kernels and RKHSs
In the following, we provide a concise overview of the necessary background on kernels and their RKHSs, following [Reference Steinwart and Christmann68, Chapter 4]. We will work primarily with Hilbert spaces of functions of the following type.
Definition 2.1. Let $\mathcal{X} \not = \emptyset$ be a non-empty set, $H\subseteq{\mathbb{R}}^{\mathcal{X}}$ a Hilbert space of functions, and $k\,:\,\mathcal{X}\times \mathcal{X}\rightarrow{\mathbb{R}}$ some bivariate map. $k$ is called a reproducing kernel for a Hilbert space $H \subseteq{\mathbb{R}}^{\mathcal{X}}$ of real-valued functions if
1. $k(\cdot, x)\in H \ \ \forall x \in \mathcal{X}$
2. $f(x) = \left\langle f, k(\cdot, x) \right\rangle _H \ \ \forall f \in H, x \in \mathcal{X}.$
To effectively work with these function spaces, we need also the following concept.
Definition 2.2. Let $\mathcal{X} \not = \emptyset$ be a non-empty set. A map $k\,:\,\mathcal{X}\times \mathcal{X}\rightarrow{\mathbb{R}}$ is called positive semidefinite if for all $x,x^{\prime}\in \mathcal{X}$ we have $k(x,x^{\prime})=k(x^{\prime},x)$ , and for all $N\in \mathbb{N}_+$ , $x_1,\ldots, x_N\in \mathcal{X}$ , $\alpha _1,\ldots, \alpha _N\in{\mathbb{R}}$ we have $\sum _{i,j=1}^N \alpha _i \alpha _j k(x_j,x_i)\geq 0$ . If in the case of pairwise distinct inputs $x_1,\ldots, x_N$ , equality in this latter condition holds only if $\alpha _1=\ldots =\alpha _N=0$ , we call $k$ positive definite. In the following, we call a positive semidefinite (or positive definite) map $k$ a kernel (on $\mathcal{X})$ .
Remark 2.3.
1. A map $k\,:\,\mathcal{X}\times \mathcal{X}\rightarrow{\mathbb{R}}$ is symmetric and positive (semi)definite if and only if all matrices $(k(x_j,x_i))_{i,j=1,\ldots, N}$ are symmetric and positive (semi)definite, for all $N\in \mathbb{N}_+$ and $x_1,\ldots, x_N\in \mathcal{X}$ . This motivates the terminology of symmetric and positive (semi)definite.
2. Unfortunately, the terminology regarding kernels is highly non-uniform in the literature. What we call symmetric and positive semidefinite is often called of positive type or positive definite, and what we call symmetric and positive definite is often called strictly positive definite. Furthermore, the terminology kernel is often used for Mercer kernels, which are symmetric and positive semidefinite continuous bivariate functions, often on a compact metric space.
A Hilbert space of functions has at most one reproducing kernel, and such a reproducing kernel is a kernel (i.e., symmetric and positive semidefinite), cf. [Reference Steinwart and Christmann68, Lemma 4.19, Theorem 4.20]. Furthermore, a map $k$ is a reproducing kernel for some Hilbert space of functions if and only if $k$ is a kernel, and in this case this Hilbert space of functions is unique [Reference Steinwart and Christmann68, Theorem 4.21]. We call this Hilbert space the reproducing kernel Hilbert space (RKHS) corresponding or associated to $k$ , and denote it by $(H_k,\langle \cdot, \cdot \rangle _k)$ with corresponding induced norm $\| \cdot \|_k$ . Finally, if $k$ is a kernel, the linear space
\begin{equation*} H_k^{\mathrm {pre}} \,:\!=\, \mathrm {span}\left \{ k(\cdot, x) \,:\, x \in \mathcal {X} \right \} \end{equation*}
is dense in $H_k$ and is called the pre-RKHS associated with $k$ .
Remark 2.4. We tailored the exposition of kernels and RKHSs to our needs. In the machine learning literature, slightly different, but equivalent definitions are used.
1. A map $k\,:\,\mathcal{X}\times \mathcal{X}\rightarrow{\mathbb{R}}$ is called a kernel if there exists a Hilbert space $\mathcal{H}$ (called feature space) and a map $\Phi\,:\,\mathcal{X} \rightarrow \mathcal{H}$ (called feature map) such that
\begin{equation*} k(x,x^{\prime}) = \left\langle \Phi (x^{\prime}), \Phi (x) \right\rangle _{\mathcal {H}} \ \ \forall x,x^{\prime} \in \mathcal {X}. \end{equation*}
The motivation for this definition comes from the kernel trick, which allows one to use linear algorithms on inputs lifted (usually by a nonlinear map) to a high-dimensional (even infinite-dimensional) new space, as long as inner products of the transformed inputs can be computed efficiently. This is exactly the situation described in the preceding definition. It turns out that this definition is equivalent to our definition of a kernel, cf. [Reference Steinwart and Christmann68, Theorem 4.16] for more details.
2. A Hilbert space of functions $H\subseteq{\mathbb{R}}^{\mathcal{X}}$ is called a reproducing kernel Hilbert space (RKHS) if $\delta _x\,:\,H \rightarrow{\mathbb{R}}, \; \delta _x(f)=f(x)$ is continuous for all $x \in \mathcal{X}$ . This property holds if and only if $H$ has a reproducing kernel [Reference Steinwart and Christmann68, Lemma 4.19, Theorem 4.20], so this is again equivalent to our definition.
Two features make kernels and their RKHSs particularly interesting in the context of function interpolation and approximation, as well as learning. On the one hand, they allow efficient algorithmic solutions of such problems, often amounting to solving a linear system of equations or a finite-dimensional convex optimisation problem, cf. the next two sections. On the other hand, since an RKHS is generated from its associated reproducing kernel, the latter determines the properties of the functions in the RKHS. In particular, by designing appropriate kernels, one can systematically construct function spaces containing only functions with desirable properties. For example, if $\mathcal{X}$ is a topological space, then a bounded and continuous kernel $k$ ensures that $H_k$ contains only bounded and continuous functions [Reference Steinwart and Christmann68, Lemma 4.28]. Similarly, if $\mathcal{X}$ is a metric space, then a Lipschitz- or more generally Hölder-continuous kernel enforces corresponding continuity properties for its RKHS functions, cf. [Reference Fiedler30] for an in-depth discussion. Another relevant property is invariance w.r.t. specified transformations of the input, as described in the following result. Please refer to Appendix A for the proof of Lemma 2.5.
Lemma 2.5. Let $\mathcal{X}\not =\emptyset$ be a set, $k\,:\,\mathcal{X}\times \mathcal{X}\rightarrow{\mathbb{R}}$ a kernel on $\mathcal{X}$ , and $T\,:\,\mathcal{X}\rightarrow \mathcal{X}$ some map. The following two statements are equivalent.
1. $k(T(x),x^{\prime})=k(x,x^{\prime})$ for all $x,x^{\prime}\in \mathcal{X}$ .
2. $f(T(x))=f(x)$ for all $x\in \mathcal{X}$ and $f\in H_k$ .
Finally, a wide variety of kernels as well as construction techniques for kernels are available, see [Reference Steinwart and Christmann68, Chapter 4], [Reference Shawe-Taylor and Cristianini66] and [Reference Rasmussen and Williams62, Chapter 4]. For simplicity, we will use in the following one of the most popular choices, the Gaussian or Squared-Exponential (SE) kernel on ${\mathbb{R}}^d$ , defined by
\begin{equation*} k(x,x^{\prime}) \,:\!=\, \exp \left ({-}\frac {\| x - x^{\prime} \|^2}{2\gamma ^2}\right ), \qquad \textrm {(2.1)} \end{equation*}
where $\gamma \in \mathbb{R}_{\gt 0}$ is called the length scale of the kernel. The corresponding RKHS $H_k$ contains very smooth functions, in particular, $H_k\subseteq C^\infty ({\mathbb{R}}^d,{\mathbb{R}})$ , cf. [Reference Steinwart and Christmann68, Section 4.4] for more details.
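To make this concrete, the following short Python snippet (our own illustration; the paper's experiments use MATLAB) evaluates the SE kernel with the length-scale convention $k(x,x^{\prime})=\exp ({-}\|x-x^{\prime}\|^2/(2\gamma ^2))$ and numerically checks symmetry and positive semidefiniteness of a resulting kernel matrix, in line with Definition 2.2.

```python
import numpy as np

def se_kernel(x, y, gamma=1.0):
    """SE kernel value k(x, y) = exp(-||x - y||^2 / (2 gamma^2))."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2.0 * gamma ** 2))

def gram_matrix(points, gamma=1.0):
    """Kernel (Gram) matrix K = (k(x_j, x_i))_{i,j} for a set of inputs."""
    n = len(points)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = se_kernel(points[j], points[i], gamma)
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                    # six random points in R^2
K = gram_matrix(X, gamma=0.8)

assert np.allclose(K, K.T)                     # symmetry
assert np.linalg.eigvalsh(K).min() > -1e-12    # positive semidefinite up to round-off
```

The eigenvalue check only verifies the semidefiniteness condition for this particular point set; the kernel itself is positive definite for pairwise distinct inputs.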
2.2. Kernel interpolation
Let us recall the basic setting of function interpolation with finite data. Consider two sets $\mathcal{X},\mathcal{Y}\not =\emptyset$ (the input and output space), a collection $\mathcal{F}\subseteq \mathcal{Y}^{\mathcal{X}}$ of functions from $\mathcal{X}$ to $\mathcal{Y}$ (the space of candidate functions), and $(x_1,y_1),\ldots, (x_N,y_N)\in \mathcal{X}\times \mathcal{Y}$ (the data). We say that a function $f\in \mathcal{F}$ interpolates the data if $f(x_n)=y_n$ for all $n=1,\ldots, N$ , and the interpolation problem consists in finding (if it exists) such a function $f$ , potentially with additional properties. Consider now the case $\mathcal{Y}={\mathbb{R}}$ and $\mathcal{F}=H_k$ , where $k$ is a kernel on $\mathcal{X}$ . The resulting problem is usually called kernel interpolation, which is well-understood, cf. [Reference Paulsen and Raghupathi60, Chapter 3]. The following result summarises the elements of the corresponding theory, which will be relevant for us.
Proposition 2.6. Let $\mathcal{X}\not =\emptyset$ be some set, $k$ a kernel on $\mathcal{X}$ , and $(x_1,y_1),\ldots, (x_N,y_N)\in \mathcal{X}\times{\mathbb{R}}$ . The kernel interpolation problem of finding $f\in H_k$ with $f(x_n)=y_n$ for all $n=1,\ldots, N$ is solvable if and only if $\vec{y} \in \mathrm{im}(K)$ , where we defined
\begin{equation*} K \,:\!=\, (k(x_j,x_i))_{i,j=1,\ldots, N} \in {\mathbb{R}}^{N\times N}, \qquad \vec {y} \,:\!=\, (y_1,\ldots, y_N)^\top \in {\mathbb{R}}^N. \end{equation*}
Furthermore, if the interpolation problem is solvable, then $f=\sum _{n=1}^N \alpha _n k(\cdot, x_n)$ is the unique solution to the optimisation problem
\begin{equation*} \min _{f \in H_k} \| f \|_k \quad \text {subject to} \quad f(x_n) = y_n \ \text {for all } n=1,\ldots, N, \end{equation*}
where $\vec{\alpha }\in{\mathbb{R}}^N$ is such that $\vec{y}=K \vec{\alpha }$ .
In particular, if $k$ is positive definite, then any interpolation problem with pairwise distinct $x_1,\ldots, x_N$ is solvable, and the coefficients $\vec{\alpha }$ of the minimum norm interpolating function are given by $\vec{\alpha }=K^{-1} \vec{y}$ .
Note that the matrix $K$ defined above is usually called kernel matrix or Gram matrix. For the proof of Proposition 2.6, see [Reference Paulsen and Raghupathi60, Theorem 3.4, Corollary 3.5]. This result states that the search for an interpolating function in the (in general infinite-dimensional) space $H_k$ boils down to solving a finite-dimensional linear equation system, where the problem matrix is even positive semidefinite. Furthermore, we even get the minimum norm interpolating function, i.e., the closed-form solution of a (in general infinite-dimensional) optimisation problem over the function space $H_k$ . Finally, if a kernel is positive definite, then any interpolation problem (with pairwise distinct input points) can be solved in the corresponding RKHS. This also explains the importance of positive definite kernels.
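As an illustration of Proposition 2.6, the following Python sketch (our own, using the SE kernel with the length-scale convention assumed above) solves the linear system $K\vec{\alpha }=\vec{y}$ and verifies that the resulting minimum-norm function interpolates the data.

```python
import numpy as np

def se_gram(X1, X2, gamma):
    """Matrix of SE kernel values between the rows of X1 and X2."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * gamma ** 2))

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(8, 1))      # pairwise distinct inputs
y = np.sin(3 * X[:, 0])                  # values to interpolate

gamma = 0.5
K = se_gram(X, X, gamma)                 # SE kernel is positive definite
alpha = np.linalg.solve(K, y)            # coefficients of the minimum-norm interpolant

def f(x_new):
    """Minimum-norm interpolant f = sum_n alpha_n k(., x_n)."""
    return se_gram(np.atleast_2d(x_new), X, gamma) @ alpha

assert np.allclose(f(X), y, atol=1e-6)   # interpolation conditions hold
```

For larger or nearly coincident point sets, the kernel matrix can become ill-conditioned, in which case a small regularisation (as in kernel ridge regression below) is the standard remedy.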
2.3. Approximation with kernels: kernel machines and kernel ridge regression
Next, we turn to function approximation with kernels, as motivated by supervised machine learning problems. Consider some unknown function $f_\ast\,:\,\mathcal{X}\rightarrow{\mathbb{R}}$ , which is only accessible through noisy samples $y_n=f_\ast (x_n)+\eta _n$ , $n=1,\ldots, N$ , where the additive noise variables $\eta _1,\ldots, \eta _N$ are modelled, for example, as independent centred random variables with finite variances. The goal is to find a good approximation $\vec{f}$ of $f_\ast$ from the data $(x_1,y_1),\ldots, (x_N,y_N)$ . To do so, we first have to fix a space of candidate functions, for which we choose here an RKHS, so the goal is to search for a good approximation $\vec{f} \in H_k$ . One standard approach from machine learning is regularised empirical risk minimisation (on RKHSs), which amounts to the optimisation problem
\begin{equation*} \min _{f \in H_k} \frac {1}{N} \sum _{n=1}^N \ell (x_n, y_n, f(x_n)) + \lambda \| f \|_k^2, \qquad \textrm {(2.2)} \end{equation*}
where $\ell\,:\,\mathcal{X}\times{\mathbb{R}}\times{\mathbb{R}}\rightarrow \mathbb{R}_{\geq 0}$ is called the loss function, and $\lambda \in \mathbb{R}_{\gt 0}$ the regularisation parameter. Intuitively, if at input $x\in \mathcal{X}$ the “true” output is $y\in{\mathbb{R}}$ and our approximation predicts output $t\in{\mathbb{R}}$ , then we incur loss $\ell (x,y,t)$ . The optimisation problem (2.2) contains two terms: the first term $\frac{1}{N} \sum _{n=1}^N \ell (x_n, y_n, f(x_n))$ is a data-fit term, measuring how well a given candidate function $f\in H_k$ performs on the data set, as measured by $\ell$ . However, since we only have noisy outputs (the $y_n$ are not the real outputs $f_\ast (x_n)$ ), interpolating the data $(x_1,y_1),\ldots, (x_N,y_N)$ is not meaningful in general. In fact, forcing $\vec{f}$ to exactly match the data might result in a bad prediction of $f_\ast$ , a phenomenon called overfitting in machine learning. To avoid this, the second term $\lambda \|f\|_k^2$ acts as a regularisation. The RKHS norm $\|\cdot \|_k$ is used as a complexity measure, and the regularisation parameter $\lambda \in \mathbb{R}_{\gt 0}$ determines the strength of the regularisation. An optimisation problem over an RKHS like (2.2) is often referred to as a kernel machine.
Finally, the loss function $\ell$ , the kernel $k$ (inducing the RKHS $H_k$ ) and the regularisation parameter $\lambda \in \mathbb{R}_{\gt 0}$ need to be chosen. The loss function is often determined by the problem setting, or chosen for theoretical and computational convenience, see [Reference Steinwart and Christmann68, Chapters 2, 3] for many examples of loss functions and theoretical considerations regarding their choice in concrete learning problems. The kernel $k$ is usually chosen based on properties of the associated RKHS $H_k$ that are deemed appropriate for the problem at hand. For example, if it is known or suspected that the underlying function $f_\ast$ is very smooth, then a kernel inducing an RKHS of smooth functions is chosen, like the SE kernel (2.1). In practice, one usually fixes a class of kernels up to some parameters, in this context called hyperparameters. For example, one might decide to use the SE kernel, and then the length scale is a hyperparameter that remains to be set. The regularisation parameter $\lambda$ is also a hyperparameter, and is usually chosen according to the noise level (the larger the noise magnitude, the larger the regularisation parameter). This intuitive notion will be made more precise towards the end of this section, when we discuss kernel ridge regression. In practice, the choice of hyperparameters is very important in machine learning problems, and different strategies like dataset splitting, cross-validation or structural risk minimisation can be employed, cf. [Reference Murphy58, Chapter 4].
In general, the candidate space $H_k$ will be infinite-dimensional, so the question of existence and uniqueness of a solution of (2.2), and computational tractability of this optimisation problem, becomes very important. As another advantage of kernel methods and RKHSs, these questions actually do not pose a problem, as assured by the following result.
Proposition 2.7. Let $\mathcal{X}\not =\emptyset$ be some set, $k$ a kernel on $\mathcal{X}$ , $L\,:\,{\mathbb{R}}^N\rightarrow \mathbb{R}_{\geq 0}$ a lower semicontinuous and strictly convex function, and $\lambda \in \mathbb{R}_{\gt 0}$ . For all $x_1,\ldots, x_N \in \mathcal{X}$ , the optimisation problem
\begin{equation*} \min _{f \in H_k} L(f(x_1),\ldots, f(x_N)) + \lambda \| f \|_k^2 \end{equation*}
has a unique solution $\vec{f}$ , which is of the form
\begin{equation*} \vec {f} = \sum _{n=1}^N \alpha _n k(\cdot, x_n), \end{equation*}
where $\alpha _1,\ldots, \alpha _N\in{\mathbb{R}}$ .
This result is known as the Representer Theorem, with many variants and generalisations available, see, for example, [Reference Schölkopf, Herbrich and Smola64] or [Reference Steinwart and Christmann68, Section 5.1, 5.2] for more details and proofs. If $\ell$ is a benign loss function (so that $(t_1,\ldots, t_N)\mapsto \frac{1}{N} \sum _{n=1}^N \ell (x_n,y_n,t_n)$ is lower semicontinuous and strictly convex), then Proposition 2.7 assures that a unique solution of the (in general infinite-dimensional) optimisation problem (2.2) exists, and that it can be computed by solving a finite-dimensional optimisation problem.
Kernel ridge regression. In the following, we will focus on the $\ell _2$ - or least-squares loss function $\ell (x,y,t)=(y-t)^2$ , so that (2.2) fulfils the conditions of Proposition 2.7. The resulting optimisation problem is
\begin{equation*} \min _{f \in H_k} \frac {1}{N} \sum _{n=1}^N (y_n - f(x_n))^2 + \lambda \| f \|_k^2, \qquad \textrm {(2.3)} \end{equation*}
and finding a prediction $\vec{f}$ by solving this problem is called kernel ridge regression (KRR). It turns out that (2.3) has a closed-form solution, given as
\begin{equation*} \vec {f} = \sum _{n=1}^N \alpha _n k(\cdot, x_n), \end{equation*}
where we defined
\begin{equation*} \vec {\alpha } \,:\!=\, \left ( K + N \lambda I_N \right )^{-1} \vec {y}, \qquad K \,:\!=\, (k(x_j,x_i))_{i,j=1,\ldots, N}, \qquad \vec {y} \,:\!=\, (y_1,\ldots, y_N)^\top . \end{equation*}
Furthermore, in contrast to a more general kernel machine like (2.2), KRR can be given a concrete probabilistic interpretation. If we assume that the noise variables $\eta _1,\ldots, \eta _N$ are independent and identically distributed centred Gaussian random variables, then the KRR solution is exactly the resulting posterior mean function occurring in GP regression, when using a zero prior mean function, and the kernel as the covariance function, cf. [Reference Kanagawa, Hennig, Sejdinovic and Sriperumbudur48]. Similarly, the KRR solution can also be linked to the maximum a posteriori solution of Bayesian linear regression when using a Gaussian prior on the weights, cf. [Reference Murphy58, Section 11.3]. In both cases, the regularisation parameter is linked to the noise level.
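The closed-form KRR solution can be sketched in a few lines of Python (an illustration of ours with arbitrarily chosen target function, noise level and hyperparameters, not code from the paper): the coefficients are obtained by solving the regularised linear system $(K+N\lambda I_N)\vec{\alpha }=\vec{y}$.

```python
import numpy as np

def se_gram(X1, X2, gamma):
    """Matrix of SE kernel values between the rows of X1 and X2."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * gamma ** 2))

rng = np.random.default_rng(2)
N, lam, gamma = 40, 1e-3, 0.7
X = rng.uniform(-2, 2, size=(N, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=N)   # noisy samples of f_*

# KRR coefficients: alpha = (K + N * lambda * I)^{-1} y
K = se_gram(X, X, gamma)
alpha = np.linalg.solve(K + N * lam * np.eye(N), y)

def f_hat(x_new):
    """KRR prediction, a kernel expansion over the training inputs."""
    return se_gram(np.atleast_2d(x_new), X, gamma) @ alpha

# The regularised fit approximates the underlying function on a test grid:
x_test = np.linspace(-2, 2, 50)[:, None]
err = np.max(np.abs(f_hat(x_test) - np.sin(2 * x_test[:, 0])))
```

Unlike the interpolant of Section 2.2, the KRR fit does not match the noisy data exactly; the term $N\lambda I_N$ both regularises the estimate and improves the conditioning of the linear system.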
3. IPS models and their numerical treatment
In this section, we recall some well-known examples of interacting particle systems describing the agent dynamics on the microscopic level, as well as their corresponding mean-field limits. Furthermore, we describe established numerical methods used to simulate these systems, both on the microscopic and mesoscopic levels. These example systems and the numerical methods will form the foundation for the numerical experiments in the next section.
3.1. Example systems and their mean-field limits
Various authors have introduced and studied multiagent models on the microscopic level that model social and political phenomena, aiming at understanding collective behaviours and self-organisation within society [Reference Albi, Pareschi and Zanella4, Reference Düring, Markowich, Pietschmann and Wolfram28, Reference Hegselmann and Krause42, Reference Weidlich70]. Our first example, a well-known first-order model on the microscopic level, stems from this literature and has been used for example in opinion dynamics, cf. [Reference Toscani69].
Let $x_i \,:\!=\,x_i(t)\in{\mathbb{R}}^d$ be the state of agent $i=1,\dots, M$ at time $t\geq 0$ . In the case of opinion dynamics, $x_i$ represents the opinion(s) of individual $i$ . The interaction of the agents is modelled by a function $P\,:\,{\mathbb{R}}^d\times{\mathbb{R}}^d\rightarrow{\mathbb{R}}$ , often called interaction function, and the dynamics are described by the first-order ordinary differential equation system
\begin{equation*} \dot {x}_i = \frac {1}{M} \sum _{j=1}^M P(x_i, x_j) (x_j - x_i), \qquad i=1,\ldots, M, \qquad \textrm {(3.1)} \end{equation*}
starting from an initial condition $x_i(t_0)=x_i^0, \ i=1,\ldots, M$ .
For our second example, we focus on second-order systems aiming at describing swarming or flocking behaviour. The latter refers to the aggregation behaviour of a group of similar entities, for example, animals of the same species. For concreteness, we focus on Cucker-Smale systems, which consider only the alignment behaviour of a group of agents [Reference Bailo, Bongini, Carrillo and Kalise6, Reference Cucker and Smale22]. In the most common formulation of this model, the state of agent $i$ is now given by its position $x_i\in{\mathbb{R}}^d$ and velocity $v_i\in{\mathbb{R}}^d$ , evolving under the dynamics described by
\begin{equation*} \begin{cases} \dot {x}_i = v_i, \\[3pt] \dot {v}_i = \dfrac {1}{M} \sum _{j=1}^M H_\beta (x_i, x_j) (v_j - v_i), \end{cases} \qquad i=1,\ldots, M, \qquad \textrm {(3.2)} \end{equation*}
and initial conditions $x_i(t_0)=x^0_i, \ v_i(t_0)=v^0_i, \ i=1,\ldots, M$ . The function $H_\beta$ quantifies the intensity of interaction between individuals $i$ and $j$ , usually varying based on their mutual distance, with the underlying assumption that closer individuals have a greater influence compared to those farther apart. A common choice for the function $H_\beta$ is
\begin{equation*} H_\beta (x, y) \,:\!=\, \frac {1}{\left ( 1 + \| x - y \|^2 \right )^{\beta }}, \qquad \textrm {(3.3)} \end{equation*}
where $\| \cdot \|$ is the usual Euclidean norm on ${\mathbb{R}}^d$ . Under the assumption $\beta \geq 0$ , it can be shown that the system is forward-complete and that mass and momentum are preserved.
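The conservation of momentum can be checked numerically. The following Python sketch (an illustrative re-implementation with arbitrarily chosen parameters, not the paper's MATLAB code) integrates a small Cucker-Smale system with an explicit Euler step and verifies that the mean velocity is preserved, which follows from the symmetry $H_\beta (x_i,x_j)=H_\beta (x_j,x_i)$ , while the velocities align over time.

```python
import numpy as np

def H_beta(xi, xj, beta=0.5):
    """Cucker-Smale communication rate (1 + ||x - y||^2)^(-beta)."""
    return (1.0 + np.sum((xi - xj) ** 2)) ** (-beta)

rng = np.random.default_rng(3)
M, d, dt, steps = 20, 2, 0.05, 200
x = rng.normal(size=(M, d))              # positions
v = rng.normal(size=(M, d))              # velocities
p0 = v.mean(axis=0).copy()               # initial mean velocity
spread0 = np.linalg.norm(v - p0)         # initial velocity spread

for _ in range(steps):
    dv = np.zeros_like(v)
    for i in range(M):
        for j in range(M):
            dv[i] += H_beta(x[i], x[j]) * (v[j] - v[i])
    x = x + dt * v
    v = v + dt * dv / M

# Mean velocity is preserved (the pairwise terms cancel by symmetry of H_beta),
# while the velocity spread shrinks, i.e., the agents align:
assert np.allclose(v.mean(axis=0), p0, atol=1e-9)
assert np.linalg.norm(v - v.mean(axis=0)) < spread0
```

The double loop makes the $\mathcal{O}(M^2)$ cost per time step explicit, which is precisely the bottleneck discussed next.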
Simulating dynamics of this nature for large systems of individuals requires significant computational resources, and for many interesting applications, such microscopic models can indeed involve very large populations of interacting individuals, ranging from several hundred thousand to millions. From a mathematical modelling perspective, these challenges have been addressed within the mean field research community, where the derivation of mean field equations serves as an initial step toward mitigating computational complexity: one transitions from a microscopic description, centred on phase-space particles, to a mesoscopic level, where the focus shifts to particle distributions. Let us briefly recall the formalisation of this.
Consider a continuous-time multiagent system with $M$ agents, and suppose that the state space of an individual agent is $Z$ . For example, for the first-order model (3.1), we have $Z={\mathbb{R}}^d$ , and for the second-order model (3.2), we have $Z={\mathbb{R}}^d\times{\mathbb{R}}^d$ . The state of the whole system at time $t\geq 0$ is then $(z_1(t),\ldots, z_M(t))\in Z^M$ , and assuming indistinguishability of the agents, this corresponds to a time-varying empirical measure
\begin{equation*} \mu ^M(t) \,:\!=\, \frac {1}{M} \sum _{i=1}^M \delta _{z_i(t)}, \end{equation*}
where $\delta _z$ is the Dirac distribution centred at atom $z\in Z$ . The idea is now, using a weak formulation, to derive an evolution equation for the empirical measures, and go to the limit $M\rightarrow \infty$ , which then leads to an evolution equation for probability measures over the state-space $Z$ . Making this mean-field limit precise is a classic subject, and well-posedness results are available for a broad range of models, for example [Reference Canizo, Carrillo and Rosado14, Reference Carrillo, Fornasier, Rosado and Toscani17].
Under regularity assumptions on the interaction function $P$ and $H_\beta$ , respectively, one can compute the mean-field limit of the microscopic models (3.1) and (3.2) introduced above. For the first-order model (3.1), one obtains the following strong form of the evolution equation for the agent distribution
\begin{equation*} \partial _t \mu (t,x) + \nabla _x \cdot \left ( \left ( \int _{{\mathbb{R}}^d} P(x,y) (y - x) \, \mathrm {d}\mu (t,y) \right ) \mu (t,x) \right ) = 0. \qquad \textrm {(3.4)} \end{equation*}
In the case of the Cucker-Smale model (3.2), the evolution equation for the agent distribution in strong form is
\begin{equation*} \partial _t \mu (t,x,v) + v \cdot \nabla _x \mu (t,x,v) + \nabla _v \cdot \left ( \left ( \int _{{\mathbb{R}}^{2d}} H_\beta (x,y) (w - v) \, \mathrm {d}\mu (t,y,w) \right ) \mu (t,x,v) \right ) = 0. \qquad \textrm {(3.5)} \end{equation*}
Apart from enabling numerical tractability, the mean field Equations (3.4)–(3.5) can simplify the analysis of interacting particle systems, and allow one to gain insight into the macroscopic properties of the model, such as its overall density, velocity and direction, see, for example, [Reference Albi, Bellomo and Fermo1, Reference Bellomo and Soler9, Reference Carrillo, Fornasier, Toscani and Vecil18]. This can be useful for studying the emergence of global behaviour and patterns, understanding phase transitions and analysing the stability of collective behaviours [Reference Albi and Ferrarese2, Reference Carrillo, Choi and Hauray16, Reference Estrada-Rodriguez and Gimperlein29].
3.2. Numerics for the IPS models
We now turn to numerical approaches to approximate the first-order opinion dynamics model (3.1) and the second-order alignment model (3.2), as well as their mean field counterparts (3.4) and (3.5).
In the numerical experiments of Section 4, the dynamics on the microscopic level are discretized by a forward Euler scheme with time step $\Delta t$ over the time horizon $[t_0, \, T]$ , so for (3.1), we obtain the following discretization
\begin{equation*} x_i^{n+1} = x_i^n + \frac {\Delta t}{M} \sum _{j=1}^M P(x_i^n, x_j^n) (x_j^n - x_i^n), \end{equation*}
while for (3.2) we have
\begin{equation*} \begin{cases} x_i^{n+1} = x_i^n + \Delta t \, v_i^n, \\[3pt] v_i^{n+1} = v_i^n + \dfrac {\Delta t}{M} \sum _{j=1}^M H_\beta (x_i^n, x_j^n) (v_j^n - v_i^n), \end{cases} \end{equation*}
where $x_i^n \approx x_i(t_n), \ v_i^n \approx v_i(t_n)$ with $t_n = n \, \Delta t \in [t_0, \, T].$
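As a minimal illustration of the first-order discretization, the following Python sketch (our own, with the simplest possible interaction function $P\equiv 1$ as an illustrative choice) applies the forward Euler update: the scheme then reduces to $x_i^{n+1}=(1-\Delta t)\,x_i^n+\Delta t\,\bar{x}^n$ , so the mean opinion is preserved and all agents contract towards it.

```python
import numpy as np

rng = np.random.default_rng(4)
M, dt, steps = 50, 0.1, 100
x = rng.uniform(-1, 1, size=(M, 1))      # initial opinions
mean0 = x.mean(axis=0).copy()

P = lambda xi, xj: 1.0                   # illustrative constant interaction function

for _ in range(steps):
    xnew = np.empty_like(x)
    for i in range(M):
        # forward Euler step for agent i, cost O(M) per agent
        interaction = sum(P(x[i], x[j]) * (x[j] - x[i]) for j in range(M))
        xnew[i] = x[i] + dt * interaction / M
    x = xnew

# The mean opinion is preserved and all agents reach consensus at the mean:
assert np.allclose(x.mean(axis=0), mean0, atol=1e-9)
assert np.allclose(x, mean0, atol=1e-3)
```

With a state-dependent $P$ , only the line computing `interaction` changes; the $\mathcal{O}(M^2)$ cost per time step remains, motivating the Monte-Carlo methods below.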
We consider now the corresponding mean field counterparts, starting with the first-order model (3.1) and the associated mean field Equation (3.4). In order to approximate the latter, we use mean field Monte-Carlo (MFMC) methods as developed in ref. [Reference Albi and Pareschi3]. These methods fall in the class of fast algorithms for interacting particle systems such as direct simulation Monte-Carlo methods (DSMCs) [Reference Babovsky and Neunzert5, Reference Bobylev and Nanbu11, Reference Dimarco, Caflisch and Pareschi25], or most recently Random Batch Methods [Reference Jin, Li and Liu47]. For the MFMC method, we consider $\hat{M}$ particles $x^0\equiv \left \{x^0_{i}\right \}_i$ sampled from the initial distribution $\mu ^0(x)$ , and we introduce the following approximation for the mean field dynamics (3.4)
where the quantities $\vec{P}_i^n$ and $\vec{X}^n_i$ are computed from a sub-sample of $\hat{M}_s$ particles randomly selected from the whole ensemble of $\hat{M}$ sampled particles,
Using this type of MC algorithm, we can reduce the cost due to the computation of the interaction term from $\mathcal{O}(\hat{M}^2)$ to $\mathcal{O}(\hat{M}_s \, \hat{M})$ . Observe that for $\hat{M}_s = \hat{M}$ , we obtain the explicit Euler scheme for the original particle system (3.1) with $\hat{M}$ particles.
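The precise construction of $\vec{P}_i^n$ and $\vec{X}^n_i$ is given in ref. [Reference Albi and Pareschi3]; the following Python sketch only illustrates the sub-sampling idea behind the $\mathcal{O}(\hat{M}_s \, \hat{M})$ cost, again with the illustrative choice $P\equiv 1$ (our own simplification): each particle interacts with $\hat{M}_s$ randomly drawn partners per time step.

```python
import numpy as np

rng = np.random.default_rng(5)
M_hat, M_s, dt, steps = 1000, 10, 0.1, 50
x = rng.normal(size=M_hat)               # samples from the initial distribution mu^0
std0 = x.std()

for _ in range(steps):
    xnew = np.empty_like(x)
    for i in range(M_hat):
        # random sub-sample of M_s interaction partners (P == 1 for illustration)
        idx = rng.integers(0, M_hat, size=M_s)
        xnew[i] = x[i] + dt * np.mean(x[idx] - x[i])
    x = xnew

# Cost per step is O(M_s * M_hat) instead of O(M_hat^2), and the consensus
# behaviour of the full dynamics is reproduced in a statistical sense:
assert x.std() < std0
```

Setting `M_s = M_hat` and sampling without replacement would recover the full explicit Euler scheme for the original particle system, as noted above.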
Finally, for the implementation of the second-order mean field model (3.5), we use a two-dimensional Lax–Friedrichs scheme [Reference Lax49, Reference LeVeque50], which is known to be very stable, at the price of substantial numerical diffusion. It is a numerical method based on finite differences, forward in time and centred in space. For simplicity, we focus on the case $d=1$ , and consider a compact space-time domain $[t_0, \, T] \times [a_x, \, b_x] \times [a_v, \, b_v]$ . Rewriting (3.5) in the compact form
\begin{equation*} \partial _t \mu + \partial _x \, g(\mu ) + \partial _v \, h(\mu ) = 0, \end{equation*}
where
\begin{equation*} g(\mu )(t,x,v) \,:\!=\, v \, \mu (t,x,v), \qquad h(\mu )(t,x,v) \,:\!=\, \left ( \int H_\beta (x,y) (w - v) \, \mathrm {d}\mu (t,y,w) \right ) \mu (t,x,v), \end{equation*}
the numerical scheme is given by
\begin{equation*} \mu ^{n+1}_{i,j} = \frac {1}{4} \left ( \mu ^{n}_{i+1,j} + \mu ^{n}_{i-1,j} + \mu ^{n}_{i,j+1} + \mu ^{n}_{i,j-1} \right ) - \frac {\Delta t}{2 \Delta x} \left ( g^{n}_{i+1,j} - g^{n}_{i-1,j} \right ) - \frac {\Delta t}{2 \Delta v} \left ( h^{n}_{i,j+1} - h^{n}_{i,j-1} \right ), \end{equation*}
where now $\mu ^{n}_{i,j} \approx \mu (t_n,x_i,v_j)$ , and the domain $[t_0, \, T] \times [a_x, \, b_x] \times [a_v, \, b_v]$ is discretized using equally spaced points with a spacing of $\Delta t, \, \Delta x, \, \Delta v$ in the $t,\, x,\, v$ direction, respectively, and $g^{n}_{i,j}, \, h^{n}_{i,j}$ are the numerical fluxes.
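To illustrate the structure of such a scheme, the following Python sketch applies a two-dimensional Lax–Friedrichs update with periodic boundary conditions to a toy problem with constant transport speeds (our own simplified fluxes `a` and `b`, not the mean field fluxes of (3.5)) and checks conservation of the total mass.

```python
import numpy as np

nx, nv = 64, 64
dx = dv = 1.0 / nx
dt = 0.2 * dx                            # time step respecting a CFL-type condition
a, b = 1.0, 0.5                          # constant transport speeds (toy fluxes)

X, V = np.meshgrid(np.linspace(0, 1, nx, endpoint=False),
                   np.linspace(0, 1, nv, endpoint=False), indexing="ij")
mu = np.exp(-100 * ((X - 0.5) ** 2 + (V - 0.5) ** 2))   # initial bump
mass0 = mu.sum() * dx * dv

for _ in range(100):
    g = a * mu                           # flux in x-direction
    h = b * mu                           # flux in v-direction
    # Lax-Friedrichs: four-point average plus centred flux differences
    avg = 0.25 * (np.roll(mu, 1, 0) + np.roll(mu, -1, 0)
                  + np.roll(mu, 1, 1) + np.roll(mu, -1, 1))
    mu = (avg
          - dt / (2 * dx) * (np.roll(g, -1, 0) - np.roll(g, 1, 0))
          - dt / (2 * dv) * (np.roll(h, -1, 1) - np.roll(h, 1, 1)))

# With periodic boundaries, the update conserves the total mass:
assert abs(mu.sum() * dx * dv - mass0) < 1e-10
```

For the mean field model, the flux in the $v$-direction additionally requires evaluating the nonlocal interaction integral at every grid point and time step, which dominates the computational cost.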
4. Kernel methods for IPS: Applications and numerical tests
We now present two novel applications of kernel methods to IPS, which we illustrate using numerical experiments based on the example models and associated numerical methods outlined in the preceding section. First, in Section 4.1, we describe the use of kernel methods for surrogate modelling in the context of IPS, which to the best of our knowledge is a novel use case. Section 4.2 is concerned with kernel-based learning of state-dependent features of IPS in the mean field setting, which naturally leads to mean-field limits of kernels. This scenario and the associated theory have been introduced in refs. [Reference Fiedler, Herty, Rom, Segala and Trimpe31, Reference Fiedler, Herty and Trimpe33], and we provide a concise recap of the setting, the basic concepts and results from these references. We then consider a specific class of kernels, for which we can provide more concrete results than the general theory in the latter two references. These developments then form the basis for numerical experiments.
All experiments have been implemented in MATLAB $^{\circledR }$ . For convenience, the experimental parameters are summarised in Table 1.
4.1. Surrogate modelling of IPS related properties
Consider an IPS with $M\in \mathbb{N}_+$ agents or particles and state-space $X^M$ , where $X$ is the state-space of an individual particle. Frequently, one is not directly interested in a trajectory $\vec{x}_M\,:\,[0,T]\rightarrow X^M$ , but rather in a functional of this trajectory. Such a functional is usually provided as a closed-form expression, or described implicitly by a numerical algorithm. In practice, one first computes the trajectory $\vec{x}_M$ , using methods like the ones described in Section 3.2, and then applies the functional of interest to this trajectory. However, if the system is very large, i.e., $M \gg 1$ , then the computation of $\vec{x}_M$ becomes very expensive. Similarly, for a very complex functional, even the second step can be very expensive. In this section, we propose to use kernel-based surrogate models to reduce the computational effort needed.
For illustrative purposes, we focus on simple but prototypical scenarios, which allow us to effectively evaluate the kernel-based techniques.
Test A1. Surrogate variance for the Cucker and Smale model. First, we consider the case of a pointwise-in-time functional of the state, i.e., we have a map $F_M\,:\,X^M\rightarrow{\mathbb{R}}$ that is applied to a state $\vec{x} \in X^M$ . Given the trajectory $\vec{x}_M$ as above, this induces a corresponding trajectory of the functional, $t\mapsto F_M(\vec{x}_M(t))$ . If the latter needs to be evaluated on a fine grid on $[0,T]$ , for example, for visualisation purposes, this can become very expensive, in particular, if $F_M$ requires complex computations. We therefore approximate $t \mapsto F_M(\vec{x}_M(t))$ by a kernel method as
from samples $(t_1,F_M(\vec{x}_M(t_1))),\ldots, (t_N,F_M(\vec{x}_M(t_N)))$ , where $\alpha _1,\ldots, \alpha _N\in{\mathbb{R}}$ are the coefficients determined by the chosen kernel method. Since the samples are the result of a computation and not a measurement, we do not incur measurement errors, and hence we have an interpolation problem, for which we use kernel interpolation, cf. Section 2.2.
Remark 4.1. A related problem is to learn the evolution of a functional from few measurements. In this case, the data will be noisy and kernel interpolation is inappropriate; one could use KRR instead, for example. However, investigating this scenario in detail is beyond the scope of the present article.
As a concrete example, we use the microscopic Cucker-Smale model (3.2), so $X=({\mathbb{R}}^d)^2$ , for fixed initial conditions, and for ease of visualisation, we work with $d=1$ . As a functional of the state, we consider the (pointwise-in-time) variance of the velocities, which we denote by $\mathcal{V}_M$ . The goal is therefore to approximate $\mathcal{V}_M$ by
from data $(t_1,\mathcal{V}_M(t_1)),\ldots, (t_N,\mathcal{V}_M(t_N))$ , where for concreteness we chose the SE kernel (2.1) with $\gamma =\frac{1}{\sqrt 2}$ . The underlying dynamics follow the second-order microscopic model (3.2), discretized as explained in Section 3.2. The experiment involves a swarm of $M=30$ agents with $N = 4$ measurements in time of the true variance over a total of $1000$ time-steps. The initial input data are uniformly distributed, namely $x_i^0, v_i^0 \sim \mathcal{U}([1,2])$ , for every $i=1,\dots, M$ . Figure 1 depicts the evolution of positions and velocities over time and shows a comparison between the exact function $\mathcal{V}_M$ and the approximation $\hat{\mathcal{V}}_M$ . We observe good agreement; in fact, the error $t\mapsto |\mathcal{V}_M(t)-\hat{\mathcal{V}}_M(t)|$ lies in the interval $[10^{-18},\, 10^{-2}]$ . Please refer to Table 1 for all the simulation parameters.
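The interpolation step described above can be sketched in a few lines. This is an illustrative stand-in, not the article's MATLAB implementation: the node values `V` are a hypothetical decaying toy trajectory replacing the simulated variance $\mathcal{V}_M(t_n)$, so that the sketch is self-contained.

```python
import numpy as np

def se_kernel(s, t, gamma=1.0 / np.sqrt(2.0)):
    """Squared-exponential kernel (2.1) with gamma = 1/sqrt(2)."""
    return np.exp(-np.subtract.outer(s, t) ** 2 / (2.0 * gamma ** 2))

# N = 4 interpolation nodes; the values stand in for the computed
# variance V_M(t_n) of the simulated swarm (a toy trajectory here)
t_nodes = np.array([0.0, 1.0, 2.0, 3.0])
V = 0.08 * np.exp(-2.0 * t_nodes)

# kernel interpolation: solve K alpha = y for the coefficients, then
# evaluate the surrogate hat{V}_M(t) = sum_n alpha_n k(t, t_nodes[n])
alpha = np.linalg.solve(se_kernel(t_nodes, t_nodes), V)
V_hat = lambda t: se_kernel(np.atleast_1d(t), t_nodes) @ alpha
```

Once the $N$ coefficients are computed, evaluating `V_hat` on an arbitrarily fine time grid costs only $\mathcal{O}(N)$ kernel evaluations per point, which is the source of the speed-up.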
Test A2. Surrogate cost in a minimisation problem. We now consider the case of a functional that depends on the whole trajectory, and not only pointwise in time. As an example, we consider an optimisation problem whose objective function is a functional of a trajectory of a (parameterized) IPS. For concreteness, consider the microscopic Cucker-Smale system (3.2) with $M\in \mathbb{N}_+$ agents and interaction function $H_\beta$ given by (3.3); we want to optimise the time-integrated variance of the velocities (w.r.t. a fixed initial condition) over the interaction parameter $\beta \in{\mathbb{R}}$ . Furthermore, we consider the constraint $\beta \in K$ , where $K\subseteq{\mathbb{R}}$ is compact. The corresponding minimisation problem can be formalised as
This is a one-dimensional minimisation problem, which could be solved without difficulty if the functional $ \mathcal{J}$ were easy to evaluate. However, numerical optimisation methods usually evaluate the underlying objective function many times. Observe that in the present situation, each evaluation of the objective function involves the simulation of an IPS, followed by the integration of (a functional of) the whole trajectory. For large $T$ and $M$ , this can become very expensive. Hence, it is reasonable to use a surrogate model for the objective function $\mathcal{J}$ , which can be cheaply evaluated. We again use kernel interpolation for this task, applying it to the data $(\beta _1, \mathcal{J}(\beta _1)),\ldots, (\beta _N,\mathcal{J}(\beta _N))$ . The corresponding surrogate function is hence given by
where for concreteness, we chose again the SE kernel with $\gamma =\frac{1}{\sqrt 2}$ . For our experiment, the time integral in the definition of $\mathcal{J}$ is computed using the rectangular rule, and the underlying microscopic dynamics are simulated as described in Section 3.2. The results for $N=5$ are depicted in Figure 2, and we observe close agreement between the surrogate function $\hat{\mathcal{J}}$ and the true cost $\mathcal{J}$ . This shows that kernel approximation could be used for efficient minimisation based on surrogate models. Additionally, one could easily seek the optimal value of $\beta$ numerically by employing a gradient descent approach, since the gradient of $\hat{\mathcal{J}}$ can be computed cheaply. However, these considerations are beyond the scope of the present work.
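The surrogate-based optimisation workflow can be sketched as follows. The expensive objective `J` is a smooth hypothetical placeholder (the article's $\mathcal{J}$ requires simulating the swarm and integrating the velocity variance in time), and the constraint set $K=[0,2]$ and node count are illustrative choices.

```python
import numpy as np

def se_kernel(s, t, gamma=1.0 / np.sqrt(2.0)):
    return np.exp(-np.subtract.outer(s, t) ** 2 / (2.0 * gamma ** 2))

# expensive objective J(beta): a smooth, hypothetical placeholder so
# that the sketch runs standalone
J = lambda beta: 1.0 / (1.0 + beta) + 0.1 * beta

# N = 5 expensive evaluations on the compact constraint set K = [0, 2]
beta_nodes = np.linspace(0.0, 2.0, 5)
alpha = np.linalg.solve(se_kernel(beta_nodes, beta_nodes), J(beta_nodes))
J_hat = lambda b: se_kernel(np.atleast_1d(b), beta_nodes) @ alpha

# surrogate-based minimisation: densely evaluate the cheap surrogate
# instead of calling the expensive J again
grid = np.linspace(0.0, 2.0, 2001)
beta_opt = grid[np.argmin(J_hat(grid))]
```

Only $N$ expensive evaluations of $\mathcal{J}$ are needed; all subsequent objective queries hit the cheap surrogate.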
4.2. Kernels in the mean-field limit
We now turn to kernels in the mean-field limit, which naturally arise when learning state-dependent functionals of large IPS. We first recall this latter learning problem as described in [Reference Fiedler, Herty and Trimpe33], and then describe some of the general theory of kernels in the mean-field limit, following [Reference Fiedler, Herty, Rom, Segala and Trimpe31, Reference Fiedler, Herty and Trimpe33]. We then specialise the theory to our concrete setting, which forms the foundation for the numerical experiments in the remainder of this section.
Introduction. Consider a multiagent system consisting of $M\in \mathbb{N}_+$ agents or particles, and assume that the system state at each time instant $t\geq 0$ is completely described by all the individual agent states $x_i(t)\in{\mathbb{R}}^d$ , $i=1,\ldots, M$ , at time $t$ , so the state of the whole system at time $t$ is just $\vec{x}(t)=(x_i(t))_{i=1,\ldots, M}\in ({\mathbb{R}}^d)^M$ . We are interested in a certain state-dependent feature of the IPS. For example, in opinion dynamics, the individual state $x_i(t)$ corresponds to the opinion (in some potentially high-dimensional opinion space) of agent $i$ , $i=1,\ldots, M$ . A feature of interest could then be a measure of disagreement between the agents, or a measure of susceptibility to adversarial external influence. Since the IPS is completely described by its state $\vec{x}(t)$ , it appears reasonable to model such a feature as a functional on the state space, i.e., if the system is in state $\vec{x}$ , then the feature takes the value $f_M(\vec{x})$ , where $f_M\,:\,({\mathbb{R}}^d)^M\rightarrow \mathcal{Y}$ , with $\mathcal{Y}$ some real vector space. Simple examples of such a feature are the mean and the variance of the agent states. These two cases are essentially trivial since the maps describing the features are given by analytical formulas. However, in modern applications of IPS like opinion dynamics, it can be very difficult to describe features using first-principles modelling. Instead, we can learn them from data. To simplify the following exposition, we consider only scalar-valued features (corresponding to $\mathcal{Y}={\mathbb{R}}$ in the notation from above), since the vector-valued (or even matrix-valued) case can be covered by treating each component separately.
We assume that potentially noisy measurements of the feature of interest are available at certain time instances. More formally, let $0 \leq t_1 \lt \ldots \lt t_N$ , then we assume access to state snapshots $\vec{x}(t_1),\ldots, \vec{x}(t_N)$ and noisy measurements of the feature at these times modelled as
where $\eta _1,\ldots, \eta _N$ is additive noise. We can treat this as a standard supervised learning problem, with data set $(\vec{x}(t_1),y_1),\ldots, (\vec{x}(t_N),y_N)$ . If we want to use a kernel method like (2.2), we need a kernel of the form $k_M\,:\,({\mathbb{R}}^d)^M \times ({\mathbb{R}}^d)^M \rightarrow{\mathbb{R}}$ , and the resulting approximation of the map describing the feature then becomes
In particular, if we have at any time $t\geq 0$ a state snapshot $\vec{x}(t)$ , then we can predict the feature at this time by $\vec{f}_M(\vec{x}(t))$ . Moreover, if we have a model of the IPS dynamics, then we can predict the evolution of the feature by $t \mapsto \vec{f}_M(\vec{x}(t))$ .
We now consider the case of a very large MAS, i.e., $M \gg 1$ . As described in Section 3 for concrete examples, the modelling, simulation and prediction of the dynamics in this setting can be made tractable by going to the kinetic level, for instance using the mean-field limit, corresponding to the limit $M\rightarrow \infty$ . A key modelling assumption for this is the homogeneity of agents, i.e., the agents are indistinguishable. Under this condition, it appears reasonable to assume that the feature of interest also does not depend on the order of the agents, and standard results like [Reference Carmona and Delarue15, Lemma 1.2] suggest that the maps $f_M$ also have a mean-field limit. But what happens to the learning problems in the limit $M\rightarrow \infty$ ? Recall that we need kernels of the form $k_M\,:\,({\mathbb{R}}^d)^M \times ({\mathbb{R}}^d)^M \rightarrow{\mathbb{R}}$ , so we have to consider a sequence of kernels whose inputs are tuples of increasing length. It turns out that we can formulate an appropriate mean-field limit of kernels and their RKHSs, and that the resulting theory can be used for the learning problems.
Mean-field limit of functions and kernels. We first recall some preliminaries from measure theory. Let $(X,d_X)$ be a compact metric space and denote by $\mathcal{P}(X)$ the set of Borel probability measures on $X$ , which we endow with the topology of weak convergence. It is well-known that this topology can be metrized by the Kantorovich–Rubinstein metric
Since $X$ is separable as a compact metric space, this metric coincides with the 1-Wasserstein metric. Furthermore, since $X$ is compact, the metric space $(\mathcal{P}(X),d_{\text{KR}})$ is also compact. Given $\vec{x}\in X^M$ , we denote the $i$ -th component of $\vec{x}$ by $x_i$ , and we define the empirical measure with atoms in $\vec{x}$ by $\hat{\mu }[\vec{x}] = \frac{1}{M}\sum _{i=1}^M \delta _{x_i}$ , where $\delta _x$ denotes the Dirac measure centred at $x\in X$ . It is well-known that the empirical measures are dense in $\mathcal{P}(X)$ w.r.t. the Kantorovich–Rubinstein metric. For more details and background, we refer to [Reference Düring, Markowich, Pietschmann and Wolfram28, Chapter 11].
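For empirical measures on a compact subset of $\mathbb{R}$ with the same number of atoms, the Kantorovich–Rubinstein (1-Wasserstein) distance has a simple closed form that we can sketch directly; the helper name and test data below are illustrative.

```python
import numpy as np

def w1_empirical(x, y):
    """Kantorovich-Rubinstein (= 1-Wasserstein) distance between the
    empirical measures mu_hat[x] and mu_hat[y] on a compact subset of R.
    For the same number M of atoms, the optimal coupling matches the
    sorted samples, so the distance is the mean absolute gap."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

rng = np.random.default_rng(1)
x = rng.uniform(1.0, 2.0, size=500)
d_shift = w1_empirical(x, x + 0.1)            # all atoms translated by 0.1
d_perm = w1_empirical(x, rng.permutation(x))  # atoms reordered
```

The second evaluation illustrates that $\hat{\mu}[\vec{x}]$ depends only on the multiset of atoms, which is exactly the permutation-invariance exploited throughout this section.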
The following definition makes precise the intuitive concept of a mean-field limit of a sequence of functions with an increasing number of arguments.
Definition 4.2. Consider functions $f_M\,:\,X^M \rightarrow{\mathbb{R}}$ , $M\in \mathbb{N}_+$ and $f\,:\,\mathcal{P}(X)\rightarrow{\mathbb{R}}$ . We say that $f$ is the mean-field limit of $(f_M)_M$ , or that $(f_M)_M$ converges in mean field to $f$ , if
This notion originated in the literature on mean field games [Reference Carmona and Delarue15], and is now common in the context of mean-field limits of IPS, cf. [Reference Fornasier, Lisini, Orrieri and Savaré36] and [Reference Fiedler, Herty and Trimpe32], for examples, in continuous and discrete-time, respectively.
The next well-known result, cf. [Reference Carmona and Delarue15, Lemma 1.2], ensures the existence of a (subsequential) mean-field limit of functions in the sense of Definition 4.2. For $M\in \mathbb{N}_+$ , denote by $\mathcal{S}_M$ the set of permutations on $\{1,\ldots, M\}$ , and for a tuple $\vec{x} \in X^M$ and permutation $\sigma \in \mathcal{S}_M$ , define $\sigma \vec{x}=(x_{\sigma (1)},\ldots, x_{\sigma (M)})$ .
Proposition 4.3. Let $f_M\,:\,X^M \rightarrow{\mathbb{R}}$ , $M\in \mathbb{N}_+$ , be a sequence of functions fulfilling the following conditions.
1. (Permutation-invariance) For all $M\in \mathbb{N}_+$ , $\sigma \in \mathcal{S}_M$ and $\vec{x} \in X^M$ , we have $f_M(\sigma \vec{x})=f_M(\vec{x})$ .
2. (Uniform boundedness) There exists $B_f\in \mathbb{R}_{\geq 0}$ such that for all $M\in \mathbb{N}_+$ , $\vec{x} \in X^M$ , we have $|f_M(\vec{x})|\leq B_f$ .
3. (Uniform Lipschitz continuity) There exists $L_f\in \mathbb{R}_{\geq 0}$ such that for all $M\in \mathbb{N}_+$ , $\vec{x}, \vec{x}^{\prime}\in X^M$ we have
\begin{equation*} |f_M(\vec {x}) - f_M(\vec {x}^{\prime})|\leq L_f d_{\text {KR}}(\hat {\mu }[\vec {x}], \hat {\mu }[\vec {x}^{\prime}]). \end{equation*}
Then there exists a Lipschitz continuous function $f\,:\,\mathcal{P}(X)\rightarrow{\mathbb{R}}$ and a subsequence $(f_{M_\ell })_\ell$ such that
This result can be used to justify the assumption that a mean-field limit map exists to model a state-dependent feature at the kinetic level. These developments can be extended to the case of kernels, as has been done first in [Reference Fiedler, Herty, Rom, Segala and Trimpe31].
Definition 4.4. Consider bivariate functions $\kappa _M\,:\,X^M \times X^M \rightarrow{\mathbb{R}}$ , $M\in \mathbb{N}_+$ and $\kappa\,:\,\mathcal{P}(X) \times \mathcal{P}(X)\rightarrow{\mathbb{R}}$ . We say that $\kappa$ is the mean-field limit of $(\kappa _M)_M$ , or that $(\kappa _M)_M$ converges in mean field to $\kappa$ , if
For the next result, define the metric
on $\mathcal{P}(X)\times \mathcal{P}(X)$ , which metrizes the product topology, so that $(\mathcal{P}(X)\times \mathcal{P}(X), d_{\text{KR}}^2)$ is again a compact metric space.
Proposition 4.5. Let $k_M\,:\,X^M\times X^M \rightarrow{\mathbb{R}}$ , $M\in \mathbb{N}_+$ , be a sequence of kernels that fulfils the following conditions.
1. (Permutation-invariance) For all $M\in \mathbb{N}_+$ , $\sigma \in \mathcal{S}_M$ and $\vec{x}, \vec{x}^{\prime}\in X^M$ , we have $k_M(\sigma \vec{x}, \vec{x}^{\prime})=k_M(\vec{x}, \vec{x}^{\prime})$ .
2. (Uniform boundedness) There exists $B_k\in \mathbb{R}_{\geq 0}$ such that for all $M\in \mathbb{N}_+$ , $\vec{x}, \vec{x}^{\prime}\in X^M$ , we have $|k_M(\vec{x}, \vec{x}^{\prime})|\leq B_k$ .
3. (Uniform Lipschitz continuity) There exists $L_k\in \mathbb{R}_{\geq 0}$ such that for all $M\in \mathbb{N}_+$ , $\vec{x}, \vec{x}^{\prime}, \vec{y}, \vec{y}^{\prime}\in X^M$ we have
\begin{equation*} |k_M(\vec {x}, \vec {x}^{\prime}) - k_M(\vec {y}, \vec {y}^{\prime})|\leq L_k d_{\text {KR}}^2\left ((\hat {\mu }[\vec {x}], \hat {\mu }[\vec {x}^{\prime}]), (\hat {\mu }[\vec {y}], \hat {\mu }[\vec {y}^{\prime}])\right ). \end{equation*}
Then there exists a Lipschitz continuous kernel $k$ on $\mathcal{P}(X)$ and a subsequence $(k_{M_\ell })_\ell$ such that
In the preceding result, since $k_{M_\ell }$ and $k$ are all kernels, each comes with its unique RKHS. Importantly, the corresponding RKHS functions inherit from their reproducing kernels properties that are relevant for the mean-field limit of functions, e.g., permutation-invariance, cf. Lemma 2.5. Moreover, the mean-field limit of the kernels induces a certain limiting behaviour inside the RKHSs. Every function from the RKHS $H_k$ arises as a mean-field limit of functions from the RKHSs corresponding to the kernels $k_M$ , and conversely, every uniformly norm-bounded sequence of functions from the RKHSs corresponding to the kernels $k_M$ has a subsequence that converges in mean field to a function contained in $H_k$ , cf. [Reference Fiedler, Herty and Trimpe33, Theorem 2.3] for details.
Mean-field limit of the kernel learning problem. All the preceding discussion suggests that we can use such kernels to learn feature functionals in the mean-field limit context. For this, we have to connect the learning problems for finite $M\in \mathbb{N}_+$ and for the mean-field limit. This can be done with the following mean field variant of the Representer Theorem, cf. [Reference Fiedler, Herty and Trimpe33, Theorem 3.3, Remark 3.4]. To simplify the exposition, assume from now on the situation of Proposition 4.5, relabel the convergent subsequence again by $M$ , and define $H_M=H_{k_M}$ for all $M\in \mathbb{N}_+$ .
Proposition 4.6. Let $N\in \mathbb{N}_+$ , $\mu _1,\ldots, \mu _N\in \mathcal{P}(X)$ , and for $n=1,\ldots, N$ consider $\vec{x}^{[M]}_n \in X^M$ , $M\in \mathbb{N}_+$ , such that $\hat{\mu }[\vec{x}^{[M]}_n] \overset{d_{\text{KR}}}{\longrightarrow } \mu _n$ for $M\rightarrow \infty$ . Let $L\,:\,{\mathbb{R}}^N \rightarrow \mathbb{R}_{\geq 0}$ be continuous and strictly convex and $\lambda \gt 0$ . For each $M\in \mathbb{N}_+$ , consider the problem
as well as the problem
Then for each $M\in \mathbb{N}_+$ , problem (4.4) has a unique solution $\vec{f}_M$ , which is of the form
with $\alpha ^{[M]}_1,\ldots, \alpha ^{[M]}_N\in{\mathbb{R}}$ , and problem (4.5) has a unique solution $\vec{f}$ , which is of the form
with $\alpha _1,\ldots, \alpha _N\in{\mathbb{R}}$ . Furthermore, there exists a subsequence $(\vec{f}_{M_j})_j$ such that $\vec{f}_{M_j} \rightarrow \vec{f}$ for $j\rightarrow \infty$ in mean field, and
for $j\rightarrow \infty$ .
Kernel ridge regression. We can immediately specialise this result to the case of KRR. Let $(\mu _1,y_1),\ldots, (\mu _N,y_N)\in \mathcal{P}(X)\times{\mathbb{R}}$ , $(\vec{x}_1^{[M]}, y_1^{[M]}),\ldots, (\vec{x}_N^{[M]}, y_N^{[M]})\in X^M\times{\mathbb{R}}$ , $M\in \mathbb{N}_+$ , such that for all $n=1,\ldots, N$ , it holds that $\hat{\mu }[\vec{x}^{[M]}_n] \overset{d_{\text{KR}}}{\longrightarrow } \mu _n$ for $M\rightarrow \infty$ . Consider the KRR problems
where $\lambda \in \mathbb{R}_{\gt 0}$ is the regularisation parameter. The problems have unique solutions
where we defined for $M\in \mathbb{N}_+$
and
According to Proposition 4.6, there exists a strictly increasing sequence $(M_j)_j$ such that $\vec{f}_{M_j} \rightarrow \vec{f}$ in mean field, and
for $j\rightarrow \infty$ .
A concrete example of mean field kernels. As a concrete example of kernels in the mean-field limit, we use the double-sum kernel, cf. [Reference Fiedler, Herty, Rom, Segala and Trimpe31, Section 5.2]. In contrast to the latter reference, we proceed with a more elementary approach. Let $k_0$ be a bounded kernel on $X$ , so there exists $B_0\in \mathbb{R}_{\geq 0}$ such that $|k_0(x,x^{\prime})|\leq B_0$ for all $x,x^{\prime}\in X$ . Define now $k\,:\,\mathcal{P}(X)\times \mathcal{P}(X)\rightarrow{\mathbb{R}}$ by
Since $\mu, \nu \in \mathcal{P}(X)$ are finite measures and $k_0$ is bounded, the double integral above is well-defined. Furthermore, we have
where the integrals in the last line are in the sense of Bochner, cf. [Reference Sriperumbudur, Gretton, Fukumizu, Schölkopf and Lanckriet67, Theorem 1], and we used in the last step that the scalar product as a continuous linear functional commutes with the Bochner integral. The above equality shows that $k$ is indeed a kernel on $\mathcal{P}(X)$ , cf. Remark 2.3.
For $M\in \mathbb{N}_+$ , define $k_M\,:\,X^M \times X^M \rightarrow{\mathbb{R}}$ by
These bivariate maps are called double-sum kernels, and it is well-known that they are indeed kernels, and permutation-invariant. Furthermore, since for $M\in \mathbb{N}_+$ and $\vec{x}, \vec{x}^{\prime}\in X^M$ we have
the kernels $k_M$ are uniformly bounded in the sense of Proposition 4.5.
Observe now that for all $M\in \mathbb{N}_+$ and $\vec{x}, \vec{x}^{\prime}\in X^M$ , we have
which implies that
so the kernels $k_M$ converge to $k$ in mean field in the sense of Definition 4.4.
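This convergence can be checked numerically: for i.i.d. samples, the double-sum kernel is exactly the limit kernel evaluated at the empirical measures, so its values approach the double integral (4.8) at the Monte-Carlo rate. The choice $\mu = \nu = \mathcal{U}([1,2])$ and all parameters are illustrative.

```python
import numpy as np

# limit kernel (4.8) for mu = nu = U([1, 2]) and the Gaussian k0,
# computed by tensorised midpoint quadrature
gamma = 1.0 / np.sqrt(2.0)
k0 = lambda a, b: np.exp(-np.subtract.outer(a, b) ** 2 / (2.0 * gamma ** 2))
h = 1.0 / 2000.0
q = 1.0 + h * (np.arange(2000) + 0.5)   # quadrature midpoints of [1, 2]
k_limit = float(k0(q, q).mean())

# double-sum kernels k_M on i.i.d. samples from the same distribution:
# the discrepancy to k_limit shrinks as M grows, roughly like M^(-1/2)
rng = np.random.default_rng(3)
errs = {}
for M in (10, 100, 1000):
    x = rng.uniform(1.0, 2.0, size=M)
    xp = rng.uniform(1.0, 2.0, size=M)
    errs[M] = abs(float(k0(x, xp).mean()) - k_limit)
```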
Remark 4.7. The preceding developments work for any measurable space $(X,\mathcal{A}_X)$ , where $\mathcal{P}(X)$ is now the set of probability measures defined on this measurable space, and any $\mathcal{A}_X\otimes \mathcal{A}_X$ - $\mathcal{B}({\mathbb{R}})$ -measurable (here $\mathcal{B}({\mathbb{R}})$ is the Borel $\sigma$ -algebra on $\mathbb{R}$ ) and bounded kernel $k_0\,:\,X \times X \rightarrow{\mathbb{R}}$ .
Note that while we established the mean field convergence of $k_M$ directly, without relying on Proposition 4.5, we still need all the assumptions of this latter result for the mean-field limit variant of the Representer Theorem stated as Proposition 4.6, cf. the corresponding proofs in [Reference Fiedler, Herty and Trimpe33]. The only missing property for the kernels $k_M$ is uniform Lipschitz continuity. For a broad class of kernels, the following result provides a sufficient condition for this property; to the best of our knowledge, this result is new. Please refer to Appendix A for its proof.
Proposition 4.8. Let $\mathcal{X}$ be a normed vector space, $X \subseteq \mathcal{X}$ a nonempty Borel-measurable subset, $\phi\,:\,\mathbb{R}_{\geq 0} \rightarrow{\mathbb{R}}$ an $L$ -Lipschitz continuous function, define $\kappa _0\,:\,X \times X \rightarrow{\mathbb{R}}$ by $\kappa _0(x,x^{\prime})=\phi (\|x-x^{\prime}\|)$ , and for $M\in \mathbb{N}_+$ define $\kappa _M(\vec{x}, \vec{x}^{\prime})=\frac{1}{M^2}\sum _{i,j=1}^M \kappa _0(x_i, x_j^{\prime})$ . We then have for all $M\in \mathbb{N}_+$ , $\vec{x}, \vec{x}^{\prime}, \vec{y}, \vec{y}^{\prime} \in X^M$ that
Consider now a nonempty subset $X \subseteq \mathcal{H}$ of a Hilbert space. A kernel $k_0$ on $X$ of the form $k_0(x,x^{\prime}) = \phi (\|x-x^{\prime}\|)$ is called a radial kernel, or a radial basis function (kernel). In the following, we consider $\mathcal{H}={\mathbb{R}}^d$ , $X\subseteq{\mathbb{R}}^d$ a nonempty compact subset, and choose $k_0$ as the Gaussian kernel, so in this case, $\phi (s)=\exp (-\frac{s^2}{2\gamma ^2})$ . Observe that this $\phi$ is bounded, and (globally) Lipschitz continuous with Lipschitz bound given by $\max _{s\in{\mathbb{R}}_{\geq 0}} |\phi ^{\prime}(s)|$ , so the resulting sequence of double-sum kernels fulfils all conditions from Proposition 4.5.
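For the Gaussian profile, this Lipschitz bound can be computed in closed form:

```latex
\phi(s) = e^{-s^2/(2\gamma^2)}, \qquad
\phi'(s) = -\frac{s}{\gamma^2}\, e^{-s^2/(2\gamma^2)},
% |\phi'| is maximised where \phi''(s) = 0, i.e. at s = \gamma, so
L \;=\; \max_{s \geq 0} |\phi'(s)| \;=\; \frac{1}{\gamma}\, e^{-1/2}
\;=\; \frac{1}{\gamma \sqrt{e}}.
```

In particular, for the choice $\gamma = \frac{1}{\sqrt{2}}$ used in the experiments, $L = \sqrt{2/e} \approx 0.86$.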
Remark 4.9. We would like to point out the following delicate aspect of the preceding developments. By direct calculation, we have established the mean field convergence of the double-sum kernels $k_M$ , as defined in (4.9), to $k$ given by (4.8). Furthermore, the sequence of double-sum kernels based on the Gaussian kernel fulfils all the conditions of Proposition 4.5, so there exists a mean-field limit kernel that is bounded and Lipschitz continuous, and a subsequence of the double-sum kernel sequence that converges in mean field to this latter kernel. However, we did not prove that this limit kernel is (4.8). If we had uniqueness of the mean-field limit kernel in Proposition 4.5, then this would follow trivially. Investigation of this uniqueness question is beyond the scope of the present work. However, it is clear that (4.8) is bounded, and by using mutatis mutandis the arguments from the proof of Proposition 4.8, one can verify that (4.8) is Lipschitz continuous. This means that (4.8) fulfils the properties of the limit kernel in Proposition 4.5.
Experimental setup. Below, we conduct numerical experiments concerning mean field kernel methods, particularly emphasising learning tasks within large-scale IPS. These tests numerically validate the theoretical insights presented in this section. In particular, we focus on the kernel approximation of the variance $v_M\,:\,({\mathbb{R}}^d)^M\to{\mathbb{R}}$ and the skewness $s_M\,:\,({\mathbb{R}}^d)^M\to{\mathbb{R}}$ of the agent system $\vec{x}=(x_i)_{i=1,\ldots, M}\in ({\mathbb{R}}^d)^M$ , i.e.
where $m_M(\vec{x})=\frac 1M \sum _{i=1}^M x_i$ is the mean of the agents and $\| \cdot \|$ denotes the usual Euclidean norm in ${\mathbb{R}}^d$ . The mean-field limits of these functions are $v_\infty, s_\infty\,:\,\mathcal{P} ({\mathbb{R}}^d) \to{\mathbb{R}}$ given by
where the mean is $m_\infty (\mu ) = \int _{{\mathbb{R}}^d} x \, d\mu (x).$ Please refer to Table 1 for simulation parameters of all the numerical tests, and to Section 3.2 for the models’ discretization approaches.
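For scalar agents ($d=1$), the finite-$M$ functionals can be sketched as follows; the exact normalisation of the article's variance and skewness formulas is an assumption here, and the standard central-moment definitions are used instead.

```python
import numpy as np

# empirical mean, variance and skewness of an ensemble of scalar agents
# (standard central-moment definitions, assumed to match the article)
def m_M(x):
    return np.mean(x)

def v_M(x):
    return np.mean((x - m_M(x)) ** 2)

def s_M(x):
    c = x - m_M(x)
    return np.mean(c ** 3) / v_M(x) ** 1.5

# both functionals depend on the state only through the empirical
# measure, hence they are permutation-invariant (Propositions 4.3, 4.5)
rng = np.random.default_rng(4)
x = rng.uniform(1.0, 2.0, size=30)
v_perm_gap = abs(v_M(x) - v_M(rng.permutation(x)))
```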
Regarding the choice of the kernel, in the following numerical tests, we consider the double-sum kernel $k_M$ in Equation (4.9) and the corresponding mean-field limit kernel $k$ in Equation (4.8). In both cases, we take $k_0=k_\gamma$ given by (2.1).
Test B1. Microscopic first-order model. In this section, we present numerical tests for the opinion formation model (3.1). The interaction between the agents is described by $P$ , which is given by
Since $P$ is non-negative, it promotes mutual attraction among the agents. The initial conditions $x^0_{i}$ for the agents $i=1,\dots, M,$ are randomly chosen with uniform distribution in the interval $[1,2]$ , i.e., $\vec{x}(t_0) \sim \mathcal{U}([1,2])^M.$ For the given function $P$ , it is known that $x_i(t) \in [1,2]$ for all $t\geq 0$ , due to the non-negativity of the interaction rule. We first consider both functionals $v_M$ and $s_M$ in the noise-free case.
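The generation of such trajectories can be sketched as follows. The concrete interaction function of (4.10) is not reproduced here, so a bounded non-negative stand-in `P` and illustrative step parameters are assumed; the qualitative properties (states confined to $[1,2]$, shrinking variance) match the discussion above.

```python
import numpy as np

# explicit-Euler simulation of a first-order opinion model of the form
# dx_i/dt = (1/M) sum_j P(x_i, x_j)(x_j - x_i), with a hypothetical
# bounded, non-negative interaction function standing in for (4.10)
P = lambda xi, xj: 1.0 / (1.0 + (xi - xj) ** 2)

rng = np.random.default_rng(5)
M, dt, steps = 30, 0.01, 1000
x = rng.uniform(1.0, 2.0, size=M)     # x(t_0) ~ U([1, 2])^M
var0 = x.var()
for _ in range(steps):
    # each update is a convex combination of the current states (for
    # dt * max P <= 1), so x_i(t) stays inside [1, 2]
    diff = x[None, :] - x[:, None]    # diff[i, j] = x_j - x_i
    x = x + dt * np.mean(P(x[:, None], x[None, :]) * diff, axis=1)
```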
The numerical results presented in Figure 3 illustrate the inference of $v_M$ and $s_M$ for varying numbers of measurements $N \in \lbrace 2,4,8\rbrace$ , with $M=30$ agents. These results underscore the high approximation quality of the RKHS approach, even when the number of measurements is rather low. The figure shows the time evolution of the errors $| v_M - \hat{v}_M |$ and $| s_M - \hat{s}_M |$ , indicating a small discrepancy between the inferred functionals $\hat{v}_M(\vec{x}(t)), \hat{s}_M(\vec{x}(t))$ , derived from solving the interpolation problem presented in Proposition 2.6, and the true values $v_M(\vec{x}(t)), s_M(\vec{x}(t))$ across all time points $t\geq 0$ . This alignment is further substantiated by the numerical data in Table 2. Notably, as the number of known data points $N$ increases, the error in the approximated variance and skewness diminishes.
As an additional scenario, we introduce random noise perturbations $\eta$ to the evaluations, as described in Section 2.3. Maintaining the same initial condition $\vec{x}(t_0) \sim \mathcal{U}([1,2])^M$ , we now explore the approximation of $v_M$ and $s_M$ in the presence of noise. Specifically, we consider $\eta _n$ following a normal distribution $\eta _n \sim \mathcal{N}(0,\sigma ^2)$ , for $n=1,\dots, N$ , with $\sigma ^2 = 0.01$ . In this noisy scenario, we solve the minimisation problem (2.3) with $\lambda = \sigma = 0.1$ .
Figure 4 illustrates the estimation of $v_M$ and $s_M$ for $N=4$ measurements and a swarm size of $M=30$ agents. These results demonstrate that the unknown functionals can be successfully approximated, with the approximations exhibiting behaviour similar to that of the true functions despite the presence of noise.
Test B2. Mean field first-order model. In this section, we explore the same opinion formation model, examining it first at the microscopic level while gradually increasing the number of particles, and subsequently in the context of the mean-field limit described by Equation (3.4). The interaction function $P$ is still given by (4.10). As initial conditions of the agents in the microscopic model (3.1), we consider a uniform distribution $\mathcal{U}([1,2])$ , i.e., we set $\vec{x}(t_0) \sim \mathcal{U}([1,2])^M$ , so the state space is again $[1,2] \subset{\mathbb{R}}$ .
The numerical results presented in Figure 5 show the estimation error of $v_M(\vec{x} (t))$ and $s_M(\vec{x} (t))$ for $N = 8$ measurements and various values of $M \in \lbrace 10, 100, 1000\rbrace$ . We observe that the approximation quality remains consistent across the different population sizes, confirming that a well-defined mean-field limit exists. Consequently, the inference problem appears to be independent of the number of agents; these findings are further substantiated by the numerical values provided in Table 3. The inference problem is then addressed at the mean field level, employing the MFMC simulation method explained in Section 3.2. We choose a sub-sample of ${\hat{M}_s}= 100$ out of a sample of $\hat{M} = 10,000$ particles for the approximation of the density, and we reconstruct the evolution in phase space by a histogram approach. As initial condition, we adopt $\mu _0(x)=\chi _{[1,2]}(x)$ , the indicator function of the interval $[1,2]$ , i.e., the function that equals $1$ when $x$ is within the interval and $0$ otherwise. As before, we consider $N=8$ measurements, marked as red dots in Figure 5. At these measurement times, the complete state $\mu (t_n,\cdot )$ is recorded, and the variance and skewness functionals $v_\infty$ and $s_\infty$ are evaluated at the times $t_n$ , $n=1,\dots, N$ . The error in the mean field case (see Figure 5, row 4) is of the same order as in the scenario with a finite number of agents.
As for the microscopic model in Test B1, we also investigate the noisy scenario in the mean field case. We introduce random noise perturbations $\eta$ into the variance evaluations
and we solve the approximation problem as detailed in the mean-field kernel ridge regression paragraph in Section 4.2. We still consider $\mu _0(x)=\chi _{[1,2]}(x)$ as initial condition and examine the approximation of $v_\infty$ in the presence of noise. Here, $\eta _n$ is again assumed to follow a normal distribution, specifically $\eta _n \sim \mathcal{N}(0,\sigma ^2)$ , for $n=1,\dots, N$ , with $\sigma ^2 = 0.01$ . In this noisy setting, we tackle the minimisation problem (4.7) using $\lambda = \sigma = 0.1$ . Figure 6 showcases the estimation of $v_\infty$ for $N=8$ noisy measurements of the system variance. Despite the presence of noise, the results indicate that the unknown functional can still be approximated effectively.
Test B3. Microscopic second-order model. In this section, we conduct numerical experiments on the second-order model (3.2) and the corresponding mean field PDE (3.5). The agent interactions, encoded by the function $H_\beta$ , are governed by the Cucker–Smale function (3.3) with $\beta = 2$ . In this model, the agents in the swarm align their velocities with the average velocity of their nearby neighbours while also being attracted to them, which helps to maintain group cohesion. Our primary focus is on approximating the velocity variance $v_M$ in a noise-free scenario. Both initial positions and velocities are drawn from a uniform distribution on the interval $[1,2]$ , specifically $\vec{x}(t_0), \vec{v}(t_0) \sim \mathcal{U}([1,2])^M$ .
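The alignment part of such a second-order system can be sketched in one dimension with an explicit Euler scheme. The communication rate is assumed to be of the standard Cucker–Smale form $H_\beta(r) = (1+r^2)^{-\beta}$, and the attraction term of the model is omitted for brevity; $M=30$, $\beta=2$ and the uniform initial data follow the text.

```python
import numpy as np

rng = np.random.default_rng(2)

M, beta = 30, 2.0
dt, n_steps = 0.01, 500

# x(t0), v(t0) ~ U([1,2])^M (1D positions and velocities for simplicity).
x = rng.uniform(1.0, 2.0, size=M)
v = rng.uniform(1.0, 2.0, size=M)
v0 = v.copy()

def H(r2):
    """Assumed standard Cucker-Smale rate H_beta(r) = (1 + r^2)^(-beta)."""
    return (1.0 + r2) ** (-beta)

for _ in range(n_steps):
    diff_x = x[None, :] - x[:, None]   # x_j - x_i
    diff_v = v[None, :] - v[:, None]   # v_j - v_i
    # Alignment force: (1/M) * sum_j H(|x_j - x_i|^2) * (v_j - v_i).
    align = (H(diff_x**2) * diff_v).mean(axis=1)
    x = x + dt * v
    v = v + dt * align

print(np.var(v0), np.var(v))  # velocity variance shrinks as the swarm aligns
```

The decay of the velocity variance is exactly the quantity $v_M$ whose time evolution the kernel estimator is trained on.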
The numerical results displayed in Figure 7 demonstrate the inference process for a case with $N=4$ measurements and a swarm consisting of $M=30$ agents. These results highlight the remarkable accuracy of the RKHS approach, even when applied to a second-order microscopic system.
Test B4. Mean field second-order model. We now address the inference problem at the mean field level, using the two-dimensional Lax–Friedrichs scheme [Reference LeVeque50], as elaborated in Section 3.2. We impose a Dirichlet initial condition and Neumann boundary conditions. For the discretisation of both the $x$ and $v$ spaces, we take $\Delta x = \Delta v = 0.05$ on the interval $[0, 3]$ . In time, in order to respect the Courant–Friedrichs–Lewy (CFL) stability condition, we take $\Delta t = 0.001$ on $[0, 1]$ .
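A minimal sketch of the two-dimensional Lax–Friedrichs update with the grid parameters from the text is given below. The drift in the $v$ direction is set to zero here as a placeholder for the mean-field alignment force of the actual model, and zero-gradient padding stands in for the Neumann boundary treatment.

```python
import numpy as np

# Grid from the text: Delta x = Delta v = 0.05 on [0, 3], Delta t = 0.001.
dx = dv = 0.05
dt = 0.001
xs = np.linspace(0.0, 3.0, 61)
vs = np.linspace(0.0, 3.0, 61)
X, V = np.meshgrid(xs, vs, indexing="ij")

# Initial density mu_0 = chi_{[1,2]} x chi_{[1,2]}.
mu = ((X >= 1.0) & (X <= 2.0) & (V >= 1.0) & (V <= 2.0)).astype(float)

def lax_friedrichs_step(mu, fx, fv):
    """One 2D Lax-Friedrichs step for mu_t + (fx)_x + (fv)_v = 0,
    with zero-gradient (Neumann-type) padding at the boundaries."""
    mx = np.pad(mu, ((1, 1), (0, 0)), mode="edge")
    mv = np.pad(mu, ((0, 0), (1, 1)), mode="edge")
    gx = np.pad(fx, ((1, 1), (0, 0)), mode="edge")
    gv = np.pad(fv, ((0, 0), (1, 1)), mode="edge")
    return (0.25 * (mx[2:, :] + mx[:-2, :] + mv[:, 2:] + mv[:, :-2])
            - dt / (2.0 * dx) * (gx[2:, :] - gx[:-2, :])
            - dt / (2.0 * dv) * (gv[:, 2:] - gv[:, :-2]))

# CFL check: dt * max|v| / dx = 0.001 * 3 / 0.05 = 0.06 << 1.
for _ in range(100):
    # Transport in x with speed v; the v-drift is a zero placeholder
    # instead of the alignment force of the actual model.
    mu = lax_friedrichs_step(mu, V * mu, np.zeros_like(mu))

print(mu.sum() * dx * dv)  # total mass, approximately conserved
```

The averaging of the four neighbours is the source of the numerical diffusion visible in the density snapshots.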
In Figure 8, we provide three snapshots illustrating the density evolution over time. As initial condition, we opt for $\mu _0(x,v) = \chi _{[1,2]}(x) \times \chi _{[1,2]}(v)$ , depicted in the first plot on the left. As expected from the model, the density is seen moving upwards in the $x$ direction while concentrating in the $v$ dimension, reflecting alignment behaviour. The observed diffusion is a result of the chosen numerical scheme.
Similar to the scenario with a finite number of agents, Figure 9 shows that no discernible differences exist between the kernel-based estimated variance and the actual variance.
5. Conclusion and outlook
In this paper, we have outlined recent and novel kernel-based approaches for numerical problems involving IPS. After a self-contained presentation of background on kernels and kernel methods, as well as IPS and their numerical treatment, we presented interesting problem classes amenable to kernel methods. First, we investigated the use of kernel methods for surrogate modelling in the context of IPS, which can provide a route to reducing the computational cost of evaluating properties of IPS. Our initial numerical results indicate that this could be a promising avenue for future research, in particular in the context of large-scale IPS. Second, we conducted the first numerical experiments on kernels in the mean-field limit, a recent development started in refs. [Reference Fiedler, Herty, Rom, Segala and Trimpe31, Reference Fiedler, Herty and Trimpe33]. These experiments show that this approach can indeed connect learning and approximation problems on the microscopic and mesoscopic levels. In future work, we plan to explore the learning rates of kernel approximations in a mean-field context and to conduct a numerical error analysis.
Financial support
The authors thank the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) for the financial support under Germany’s Excellence Strategy EXC-2023 Internet of Production 390621612 and under the Excellence Strategy of the Federal Government and the Länder, 333849990/GRK2379 (IRTG Hierarchical and Hybrid Approaches in Modern Inverse Problems), 320021702/GRK2326, 442047500/SFB1481 within the projects B04, B05 and B06, through SPP 2410 Hyperbolic Balance Laws in Fluid Mechanics: Complexity, Scales, Randomness (CoScaRa) within the projects HE5386/26-1 and HE5386/27-1, and through SPP 2298 Theoretical Foundations of Deep Learning within the project HE5386/23-1, mean field Theorie zur Analysis von Deep Learning Methoden (462234017). Support through the EU DATAHYKING No. 101072546 is also acknowledged. C. Segala is a member of the Italian National Group of Scientific Calculus (INdAM GNCS).
Competing interests
The authors declare none.
A. Additional material
Proof of Lemma 2.5
Let $x,x^{\prime}\in \mathcal{X}$ be arbitrary, then $k(\cdot, x^{\prime})\in H_k$ , and hence the second statement implies that $k(T(x),x^{\prime})=k(\cdot, x^{\prime})(T(x))=k(\cdot, x^{\prime})(x)=k(x,x^{\prime})$ . Conversely, observe that by symmetry of $k$ we have for all $x,x^{\prime}\in \mathcal{X}$ that $k(x,T(x^{\prime}))=k(T(x^{\prime}),x)=k(x^{\prime},x)=k(x,x^{\prime})$ . Since this holds for all $x,x^{\prime}\in \mathcal{X}$ , we find that $k(\cdot, T(x^{\prime}))=k(\cdot, x^{\prime})$ as an element of $H_k$ , so we get for all $f\in H_k$ and $x\in \mathcal{X}$ that $f(T(x))=\langle f, k(\cdot, T(x))\rangle _k=\langle f, k(\cdot, x)\rangle _k=f(x)$ by the reproducing property of $k$ .
Proof of Proposition 4.8
Without loss of generality, we can assume that $L\in \mathbb{R}_{\gt 0}$ . Define for $x \in X$ the function $\varphi _x\,:\,X \rightarrow{\mathbb{R}}$ by $\varphi _x(x^{\prime})=L^{-1}\phi (\|x^{\prime}-x\|)$ , and observe that, since for all $x^{\prime},y^{\prime}\in X$
\begin{equation*} |\varphi _x(x^{\prime})-\varphi _x(y^{\prime})| = L^{-1}\left |\phi (\|x^{\prime}-x\|)-\phi (\|y^{\prime}-x\|)\right | \leq \left |\, \|x^{\prime}-x\|-\|y^{\prime}-x\| \,\right | \leq \|x^{\prime}-y^{\prime}\|, \end{equation*}
the function $\varphi _x$ is 1-Lipschitz continuous.
Let now $M\in \mathbb{N}_+$ , $\vec{x}, \vec{x}^{\prime}, \vec{y}, \vec{y}^{\prime} \in X^M$ , then we get
Observe now that for all Borel-measurable $f\,:\,X \rightarrow{\mathbb{R}}$ , we have
so we can continue with
establishing the claim.