Sensitivity analysis of Wasserstein distributionally robust optimization problems

We consider sensitivity of a generic stochastic optimization problem to model uncertainty. We take a non-parametric approach and capture model uncertainty using Wasserstein balls around the postulated model. We provide explicit formulae for the first-order correction to both the value function and the optimizer and further extend our results to optimization under linear constraints. We present applications to statistics, machine learning, mathematical finance and uncertainty quantification. In particular, we provide an explicit first-order approximation for square-root LASSO regression coefficients and deduce coefficient shrinkage compared to the ordinary least-squares regression. We consider robustness of call option pricing and deduce a new Black–Scholes sensitivity, a non-parametric version of the so-called Vega. We also compute sensitivities of optimized certainty equivalents in finance and propose measures to quantify robustness of neural networks to adversarial examples.


Introduction
We consider a generic stochastic optimization problem

$$\inf_{a \in A} \int_S f(x, a)\, \mu(dx), \tag{1.1}$$

where A is the set of actions or choices, f is the loss function and μ is a probability measure over the state space S. Such problems are found across the whole of applied mathematics. The measure μ is the crucial input and it could represent, for example, a dynamic model of the system, as is often the case in mathematical finance or mathematical biology, or the empirical measure of observed data points, or the training set, as is the case in statistics and machine learning applications. In virtually all cases, there is a certain degree of uncertainty around the choice of μ, coming from modelling choices and simplifications, incomplete information, data errors, finite-sample error, etc. It is thus very important to understand the influence of changes in μ on (1.1), both on its value and on its optimizer.

Often, the choice of μ is done in two stages: first a parametric family of models is adopted and then the values of the parameters are fixed. Sensitivity analysis of (1.1) with changing parameters is a classical topic explored in parametric programming and statistical inference, e.g. [1][2][3]. It also underscores much of the progress in the field of uncertainty quantification, see [4]. Considering μ as an abstract parameter, the mathematical programming literature has looked into qualitative and quantitative stability of (1.1); we refer to [5,6] and the references therein. When μ represents data samples, there has been considerable interest in the optimization community in designing algorithms which are robust and, in particular, do not require excessive hypertuning, see [7] and the references therein. A more systematic approach to model uncertainty in (1.1) is offered by the distributionally robust optimization problem

$$V(\delta) = \inf_{a \in A} \sup_{\nu \in B_\delta(\mu)} \int_S f(x, a)\, \nu(dx), \tag{1.2}$$

where B_δ(μ) is a ball of radius δ around μ in the space of probability measures, as specified below.
Such problems greatly generalize more classical robust optimization and have been studied extensively in operations research and machine learning in particular; we refer the reader to [8] and the references therein. Our goal in this paper is to understand the behaviour of these problems for small δ. Our main results compute the first-order behaviour of V(δ) and its optimizer for small δ. This offers a measure of sensitivity to errors in model choice and/or specification, and identifies the abstract direction, in the space of models, in which the change is most pronounced. We use examples to show that our results can be applied across a wide spectrum of science. This paper is organized as follows. We first present the main results and then, in §3, explore their applications. Further discussion of our results and the related literature is found in §4, which is then followed by the proofs. The online appendix [9] contains many supplementary results and remarks, as well as some more technical arguments from the proofs.

Main results
Take d, k ∈ N, endow R^d with the Euclidean norm |·| and write Γ^o for the interior of a set Γ. Assume that S is a closed convex subset of R^d. Let P(S) denote the set of all (Borel) probability measures on S. Further fix a seminorm ||·|| on R^d and denote by ||·||_* its (extended) dual norm, i.e. ||y||_* := sup{⟨x, y⟩ : ||x|| ≤ 1}. In particular, for ||·|| = |·| we also have ||·||_* = |·|. For μ, ν ∈ P(S), we define the p-Wasserstein distance as

$$W_p(\mu, \nu) = \inf\left\{ \left( \int_{S \times S} \|x - y\|_*^p\, \pi(dx, dy) \right)^{1/p} : \pi \in \mathrm{Cpl}(\mu, \nu) \right\},$$

where Cpl(μ, ν) is the set of all probability measures π on S × S with first marginal π₁ := π(· × S) = μ and second marginal π₂ := π(S × ·) = ν. Denote by B_δ(μ) = {ν ∈ P(S) : W_p(μ, ν) ≤ δ} the Wasserstein ball of size δ ≥ 0 around μ. Note that, taking a suitable probability space (Ω, F, P) and a random variable X ∼ μ, we have the following probabilistic representation of V(δ, a):

$$V(\delta, a) = \sup\; \mathbb{E}_P\big[ f(X + Z, a) \big],$$

where the supremum is taken over all Z satisfying X + Z ∈ S almost surely and E_P[||Z||_*^p] ≤ δ^p. Wasserstein distances and optimal transport techniques have proved to be powerful and versatile tools in a multitude of applications, from economics [10,11] to image recognition [12]. The idea to use Wasserstein balls to represent model uncertainty was pioneered in [13] in the context of investment problems. When sampling from a measure with a finite pth moment, the empirical measures converge to the true distribution, and Wasserstein balls around the empirical measures have the interpretation of confidence sets, see [14]. In this set-up, the radius δ can then be chosen as a function of a given confidence level α and the sample size N. This yields finite-sample guarantees and asymptotic consistency, see [15,16], and justifies the use of the Wasserstein metric to capture model uncertainty. The value V(δ, a) in (1.2) has a dual representation, see [17,18], which has led to significant new developments in distributionally robust optimization, e.g. [15,[19][20][21].
Naturally, other choices for the distance on the space of measures are also possible: the Kullback–Leibler divergence, see [22] for general sensitivity results and [23] for applications in portfolio optimization, or the Hellinger distance, see [24] for a statistical robustness analysis. We refer to §4 for a more detailed analysis of the state of the art in these fields. Both of these approaches have good analytic properties and often lead to theoretically appealing closed-form solutions. However, they are also very restrictive since any measure in the neighbourhood of μ has to be absolutely continuous with respect to μ. In particular, if μ is the empirical measure of N observations, then measures in its neighbourhood have to be supported on those fixed N points. To obtain meaningful results, it is thus necessary to impose additional structural assumptions, which are often hard to justify solely on the basis of the data at hand and, equally importantly, create another layer of model uncertainty themselves. We refer to [17, sec. 1.1] for further discussion of potential issues with φ-divergences. The Wasserstein distance, while harder to handle analytically, is more versatile and does not require any such additional assumptions.
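The contrast with φ-divergences can be seen in a two-line numerical experiment (a sketch; the data are illustrative): two empirical measures with disjoint supports are at infinite KL-divergence, yet at small Wasserstein distance.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# An empirical sample and a slightly shifted copy: their supports are
# disjoint (a.s.), so the KL-divergence between them is infinite.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + 0.1  # every atom moved by 0.1

# W_1 between the two empirical measures is exactly the size of the shift:
# the monotone coupling moves each atom by 0.1.
w1 = wasserstein_distance(x, y)
print(round(w1, 6))  # 0.1
```

This is exactly why Wasserstein neighbourhoods of an empirical measure contain measures supported off the observed data points, while KL-neighbourhoods do not.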
Throughout the paper, we take the convention that continuity and closure are understood w.r.t. |·|. We assume that A ⊂ R^k is convex and closed and that the seminorm ||·|| is strictly convex in the sense that, for two elements x, y ∈ R^d with ||x|| = ||y|| = 1 and ||x − y|| ≠ 0, we have ||(x + y)/2|| < 1 (note that this is satisfied for every l_s-norm |x|_s := (Σ_{i=1}^d |x_i|^s)^{1/s} with s > 1). We fix p ∈ (1, ∞), let q := p/(p − 1) so that 1/p + 1/q = 1, and fix μ ∈ P(S) such that the boundary of S ⊂ R^d has μ-measure zero and ∫_S |x|^p μ(dx) < ∞. Denote by A_δ the set of optimizers for V(δ) in (1.2).
Assumption. - The map x ↦ f(x, a) is differentiable and, for every r > 0, there is a constant c > 0 such that |∇_x f(x, a)| ≤ c(1 + |x|^{p−1}) for all x ∈ S and a ∈ A with |a| ≤ r. - For all δ ≥ 0 sufficiently small, we have A_δ ≠ ∅, and for every sequence (δ_n)_{n∈N} such that lim_{n→∞} δ_n = 0 and (a_n)_{n∈N} such that a_n ∈ A_{δ_n} for all n ∈ N, there is a subsequence which converges to some a ∈ A_0.
The above assumption is not restrictive: the first part merely ensures existence of ||∇_x f(·, a)||_{L^q(μ)}, while the second part is satisfied as soon as either A is compact or V(0, ·) is coercive, which is the case in most examples of interest; see [9, lemma 7.14].
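For orientation, the first-order expansion at the heart of theorem 2.2 can be sketched as follows; this is a reconstruction, consistent with the special cases used later (e.g. V'(0) = (∫ (f'(x))² μ(dx))^{1/2} for p = q = 2 and f independent of a, recalled in §4a):

```latex
V'(0) \;:=\; \lim_{\delta \downarrow 0} \frac{V(\delta) - V(0)}{\delta}
\;=\; \inf_{a^\star \in A_0} \left( \int_S \big\| \nabla_x f(x, a^\star) \big\|^q \, \mu(dx) \right)^{1/q}.
```

The infimum runs over the optimizers of the non-robust problem, matching the remark below that restricting to a ∈ A_0 incurs no first-order loss.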

Remark.
Inspecting the proof, defining Ṽ(δ) := inf_{a∈A_0} sup_{ν∈B_δ(μ)} ∫_S f(x, a) ν(dx), we obtain Ṽ'(0) = V'(0). This means that for small δ > 0 there is no first-order gain from optimizing over all a ∈ A in the definition of V(δ) when compared with restricting simply to a ∈ A_0, as in Ṽ(δ).
The above result naturally extends to computing sensitivities of robust problems, i.e. V'(r) for r > 0, see [9, corollary 7.5], as well as to the case of stochastic optimization under linear constraints, see [9, theorem 7.7]. We recall that V(0, a) = ∫_S f(x, a) μ(dx).
Above and throughout, the convention is that ∇ = ∇_x, that ∇_x∇_a f(x, a) ∈ R^{k×d} and that 0/0 = 0. The assumed existence and convergence of optimizers holds, e.g. with suitable convexity of f in a; see [9, lemma 7.14] for a worked-out setting. In line with financial economics practice, we gave our sensitivities letter symbols: Υ, loosely motivated by Υπόδειγμα, the Greek for 'model', and a second symbol motivated by בקרה, the Hebrew for 'control'.
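To see the expansion at work, here is a minimal numerical check in a case where the worst case over the Wasserstein ball is available in closed form (f(x) = x², p = q = 2, ||·|| = |·|; the choice of f and μ is purely illustrative):

```python
import numpy as np

# Baseline measure mu: an empirical sample on the real line.
rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=0.5, size=5000)

# Loss f(x) = x^2 (independent of the action a), p = q = 2, Euclidean norm.
m2 = np.mean(x**2)
V0 = m2

# First-order sensitivity from the expansion: (int |f'(x)|^2 mu(dx))^{1/2}.
upsilon = np.sqrt(np.mean((2 * x) ** 2))

# For this f the worst case over the 2-Wasserstein ball is exact:
# sup_{E|Z|^2 <= d^2} E[(X + Z)^2] = (sqrt(E[X^2]) + d)^2, by Cauchy-Schwarz.
def V(delta):
    return (np.sqrt(m2) + delta) ** 2

delta = 1e-4
numeric_slope = (V(delta) - V0) / delta
print(abs(numeric_slope - upsilon) < 1e-3)  # True
```

The numerical slope of δ ↦ V(δ) at 0 matches (∫ |f'|² dμ)^{1/2} up to an O(δ) error, as the expansion predicts.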

Applications
We now illustrate the universality of theorems 2.2 and 2.4 by considering their applications in a number of different fields. Unless otherwise stated, S = R^d, A = R^k and ∫ means ∫_S.

(a) Financial economics
We start with the simple example of risk-neutral pricing of a call option written on an underlying asset (S_t)_{t≤T}. Here, T, K > 0 are the maturity and the strike, respectively, f(x, a) = (S_0 x − K)^+ and μ is the distribution of S_T/S_0. We set interest rates and dividends to zero for simplicity. In [25], the model μ is lognormal, i.e. log(S_T/S_0) ∼ N(−σ²T/2, σ²T) is Gaussian with mean −σ²T/2 and variance σ²T. In this case, V(0) is given by the celebrated Black–Scholes formula. Note that this example is particularly simple since f is independent of a. However, to ensure risk-neutral pricing, we have to impose a linear constraint on the measures in B_δ(μ). To compute the sensitivity of the constrained problem, we encode the constraint using a Lagrangian and apply theorem 2.2, see [9, remark 7.3, theorem 7.7]. For p = 2, letting k = K/S_0 and μ_k = μ([k, ∞)), the resulting formula (3.1) is derived in [9, example 7.10]. Let us specialize to the lognormal distribution of the Black–Scholes model above and denote the quantity in (3.1) as RBS(δ). It may be computed exactly using methods from [26]. Figure 1 compares the exact value and the first-order approximation. It is also insightful to compare Υ with a parametric sensitivity. If, instead of Wasserstein balls, we consider {N(−σ̃²T/2, σ̃²T) : |σ − σ̃| ≤ δ}, the resulting sensitivity is known as the Black–Scholes Vega. We plot the two sensitivities in figure 2. It is remarkable how, for the range of strikes of interest, the non-parametric model sensitivity Υ traces out the usual shape of the Vega but shifted upwards, to account for the idiosyncratic risk of departure from the lognormal family. More generally, given a book of options with payoff f = f⁺ − f⁻ at time T, with f⁺, f⁻ ≥ 0, we could say that the book is Υ-neutral if the sensitivity Υ were the same for f⁺ and for f⁻. In analogy to the Delta–Vega hedging standard, one could develop a non-parametric, model-agnostic Delta–Upsilon hedging.
We believe these ideas offer potential for exciting industrial applications and we leave them to further research.
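For concreteness, the Black–Scholes quantities used in this example can be sketched in a few lines (zero rates and dividends, as in the text; the numerical values are illustrative):

```python
import numpy as np
from scipy.stats import norm

def bs_call(S0, K, sigma, T):
    """Black-Scholes call price with zero rates and dividends."""
    d1 = (np.log(S0 / K) + 0.5 * sigma**2 * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S0 * norm.cdf(d1) - K * norm.cdf(d2)

def bs_vega(S0, K, sigma, T):
    """Parametric sensitivity dPrice/dsigma (the classical Vega)."""
    d1 = (np.log(S0 / K) + 0.5 * sigma**2 * T) / (sigma * np.sqrt(T))
    return S0 * norm.pdf(d1) * np.sqrt(T)

S0, K, sigma, T = 1.0, 1.0, 0.2, 1.0
price = bs_call(S0, K, sigma, T)

# Sanity-check the Vega against a central finite difference.
h = 1e-5
fd = (bs_call(S0, K, sigma + h, T) - bs_call(S0, K, sigma - h, T)) / (2 * h)
print(0.079 < price < 0.080, abs(fd - bs_vega(S0, K, sigma, T)) < 1e-6)  # True True
```

The Vega computed here is the parametric sensitivity against which the non-parametric Υ is compared in figure 2.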
We turn now to the classical notion of the optimized certainty equivalent (OCE) of [27]. It is a decision theoretic criterion designed to split a liability between today's and tomorrow's payments. It is also a convex risk measure in the sense of [28] and covers many of the popular risk measures such as expected shortfall or entropic risk, see [29]. We fix a convex monotone function l : R → R which is bounded from below and g : R d → R. Here, g represents the payoff of a financial position and l is the negative of a utility function, or a loss function. We take || · || = | · | and refer to [9, lemma 7.14] for generic sufficient conditions for assumptions 2.1 and 2.3 to hold in this setup.
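As a concrete instance of an OCE, taking the hinge loss l(x) = x⁺/(1 − α) recovers the Rockafellar–Uryasev representation of expected shortfall; a numerical sketch (the sample data and the level α are illustrative):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# OCE: inf_m { m + E[ l(L - m) ] } for a loss sample L, with the hinge loss
# l(x) = x^+ / (1 - alpha); this is the Rockafellar-Uryasev representation
# of expected shortfall at level alpha.
rng = np.random.default_rng(2)
L = rng.normal(size=100_000)   # illustrative loss sample
alpha = 0.9

def oce_objective(m):
    return m + np.mean(np.maximum(L - m, 0.0)) / (1 - alpha)

res = minimize_scalar(oce_objective, bounds=(-5.0, 5.0), method="bounded")

# Direct estimate: the average of the worst 10% of losses.
tail_mean = np.sort(L)[int(alpha * len(L)):].mean()
print(abs(res.fun - tail_mean) < 1e-2)  # True
```

The minimizing m is the value-at-risk at level α, and the optimal split of the liability between today (m) and tomorrow (L − m) reproduces the tail average.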
Here, for simplicity, we took p = q = 2 for the latter. A related problem considers hedging strategies which minimize the expected loss of the hedged position, i.e. f(x, a) = l(g(x) + ⟨a, x − x₀⟩), where A = R^k and (x₀, x) represent today's and tomorrow's traded prices. We compute Υ as an infimum over a ∈ A_0 of an explicit expression.

Furthermore, we can combine loss minimization with the OCE.
The above formulae capture non-parametric sensitivity to model uncertainty for key risk measurements in financial economics. To the best of our knowledge, this has not been achieved before. Finally, we consider briefly the classical mean–variance optimization of [30]. Here μ represents the loss distribution across the assets and a ∈ R^d with Σ_{i=1}^d a_i = 1 are the relative investment weights. The original problem is to minimize the sum of the expectation and γ standard deviations of the return ⟨a, X⟩, with X ∼ μ. Using the ideas in [31, example 2] and considering measures on R^d × R^d, we can recast the problem as (1.1). While [31] focused on the asymptotic regime δ → ∞, their non-asymptotic statements are related to our theorem 2.2 and either result could be used here to obtain that V(δ) ≈ V(0) + √(1 + γ²) δ for small δ.

(b) Neural networks
We specialize now to quantifying robustness of neural networks (NN) to adversarial examples. This has been an important topic in machine learning since [32] observed that NN consistently misclassify inputs formed by applying small worst-case perturbations to a dataset. This produced a number of works offering either explanations for these effects or algorithms to create such adversarial examples, e.g. [33][34][35][36][37][38][39] to name just a few. The main focus of research works in this area, see [40], has been on faster algorithms for finding adversarial examples, typically leading to an overfit to these examples without any significant generalization properties. The viewpoint has been mainly pointwise, e.g. [32], with some generalizations to probabilistic robustness, e.g. [39].
In contrast, we propose a simple metric for measuring robustness of NN which is independent of the architecture employed and of the algorithms for identifying adversarial examples. In fact, theorem 2.2 offers a simple and intuitive way to formalize robustness of NN: for simplicity, consider a 1-layer neural network trained on a given distribution μ of pairs (x, y),
where the inf in (1.1) is taken over the network parameters a = (A_1, . . .), and the robust problem (1.2) offers a robust training procedure. The first-order quantification of the NN sensitivity to adversarial data is then given by the Υ of theorem 2.2. A similar viewpoint, capturing robustness to adversarial examples through the optimal transport lens, has recently been adopted by other authors. The dual formulation of (1.2) was used by [21] to reduce the training of neural networks to tractable linear programs, and by [41] to propose new stochastic gradient descent algorithms with inbuilt robustness to adversarial data.
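A sketch of the proposed metric for a toy 1-layer network, taking (as an assumption, in the spirit of theorem 2.2 with p = q = 2) the L²(μ)-norm of the input-gradient of the loss as the robustness measure; the architecture and data below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 1-layer network x -> <A2, relu(A1 x + b1)> with squared loss.
d, h, n = 4, 8, 500
A1, b1, A2 = rng.normal(size=(h, d)), rng.normal(size=h), rng.normal(size=h)
X, y = rng.normal(size=(n, d)), rng.normal(size=n)

def loss(x, t):
    return (A2 @ np.maximum(A1 @ x + b1, 0.0) - t) ** 2

def loss_input_grad(x, t):
    z = A1 @ x + b1
    out = A2 @ np.maximum(z, 0.0)
    return 2.0 * (out - t) * (A1.T @ (A2 * (z > 0)))  # chain rule; relu' = 1{z>0}

# Robustness metric: (E_mu ||grad_x loss||^2)^{1/2}. It depends only on the
# trained map and the data distribution, not on any adversarial search.
grads = np.array([loss_input_grad(x, t) for x, t in zip(X, y)])
metric = np.sqrt(np.mean(np.sum(grads**2, axis=1)))

# Sanity-check the analytic gradient with a central finite difference.
e = np.zeros(d); e[0] = 1e-6
fd = (loss(X[0] + e, y[0]) - loss(X[0] - e, y[0])) / 2e-6
print(metric > 0, abs(fd - grads[0, 0]) < 1e-4)  # True True
```

A smaller metric means a flatter loss surface in the input, i.e. first-order insensitivity to small worst-case data perturbations.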

(c) Uncertainty quantification
In the context of UQ, the measure μ represents input parameters of a (possibly complicated) operation G in a physical, engineering or economic system. We consider the so-called reliability or certification problem: for a given set E of undesirable outcomes, one wants to control sup_{ν∈P} ν(G(X) ∈ E) for a set of probability measures P. The distributionally robust adversarial classification problem considered recently by [42] is also of this form, with Wasserstein balls P around an empirical measure of N samples. Using the dual formulation of [18], they linked the problem to minimization of the conditional value-at-risk and proposed a reformulation, and numerical methods, in the case of linear classification. We propose instead a regularized version of the problem: we consider the average distance to the undesirable set, d(G(x), E) := inf_{e∈E} |G(x) − e|, and not just its probability, and look for the largest radius δ(α) for which the worst-case average distance remains above a given safety level α. The quantity δ(α) could then be used to quantify the implicit uncertainty of the certification problem, where a higher δ corresponds to less uncertainty. Taking statistical confidence bounds of the empirical measure in Wasserstein distance into account, see [14], δ would then determine the minimum number of samples needed to estimate the empirical measure.
Assume that E is convex. Then x ↦ d(x, E) is differentiable everywhere except at the boundary of E, with ∇_x d(x, E) = 0 for x ∈ E^o and |∇_x d(x, E)| = 1 for all x ∈ (Ē)^c. Furthermore, assume μ is absolutely continuous w.r.t. the Lebesgue measure on S. Theorem 2.2, using [9, remark 7.3], gives a first-order expansion for the above problem. In the special case ∇_x G(x) = cI this simplifies further, and the minimal measure ν pushes every point G(x) not contained in E in the direction of the orthogonal projection onto E. This recovers the intuition of [43, theorem 1], which in turn relies on [17, corollary 2, example 7]. Note, however, that our result holds for general measures μ. We also note that such an approximation could provide an ansatz for dimension reduction: identifying the dimensions for which the partial derivatives are negligible and projecting G onto the corresponding lower-dimensional subspace (thus providing a simpler surrogate for G). This would be an alternative to a basis expansion (e.g. in orthogonal polynomials) used in UQ and would exploit the interplay between the properties of G and μ simultaneously.
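The properties of the distance function used above are easy to verify numerically for a concrete convex set; a sketch with E the closed unit ball (chosen purely for illustration):

```python
import numpy as np

# Distance to a convex undesirable set E (here the closed unit ball):
# grad d(., E) vanishes on the interior of E and has norm 1 outside E-bar.
def dist_to_ball(x):
    return max(np.linalg.norm(x) - 1.0, 0.0)

def grad_dist_to_ball(x):
    r = np.linalg.norm(x)
    return x / r if r > 1.0 else np.zeros_like(x)

inside, outside = np.array([0.3, 0.2]), np.array([2.0, -1.0])
g_in, g_out = grad_dist_to_ball(inside), grad_dist_to_ball(outside)

# Finite-difference check of the gradient at the outside point.
e = np.array([1e-6, 0.0])
fd = (dist_to_ball(outside + e) - dist_to_ball(outside - e)) / 2e-6
print(np.linalg.norm(g_in) == 0.0,
      abs(np.linalg.norm(g_out) - 1.0) < 1e-12,
      abs(fd - g_out[0]) < 1e-6)  # True True True
```

Outside the ball the gradient is the unit vector pointing away from the orthogonal projection, which is exactly the direction in which the worst-case measure moves mass.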

(d) Statistics
We discuss two applications of our results in the realm of statistics. We start by highlighting the link between our results and the so-called influence curves (IC) in robust statistics. For a functional μ ↦ T(μ), its IC is defined as IC(y) := lim_{ε↓0} (T(εδ_y + (1 − ε)μ) − T(μ))/ε, wherever the limit exists. Computing the IC, if it exists, is in general hard and closed-form solutions may be unachievable.
Theorem 2.4 links the first-order behaviour of the robust optimizer to the IC under assumption 2.3 and the normalization ||∇_x f(x, T(μ))||_{L^p(μ)} = 1. To investigate the connection, let us Taylor-expand IC(y) around x; choosing y = x + δ∇_x f(x, T(μ)), integrating both sides over μ and dividing by δ, we obtain an asymptotic equality between the two objects. If instead of the Euclidean metric we take ||(x, y)||_* = |x|_r 1_{y=0} + ∞·1_{y≠0}, for some r > 1 and (x, y) ∈ R^k × R, in the definition of the Wasserstein distance, then [19] showed that

$$\inf_a \sup_{\nu \in B_\delta(\mu)} \left( \int (y - \langle a, x\rangle)^2\, \nu(dx, dy) \right)^{1/2} = \inf_a \left\{ \left( \int (y - \langle a, x\rangle)^2\, \mu(dx, dy) \right)^{1/2} + \delta |a|_s \right\} \tag{3.3}$$

holds, where 1/r + 1/s = 1. The δ = 0 case is the ordinary least-squares regression. For δ > 0, the r.h.s. for s = 2 is directly related to Ridge regression, while the limiting case s = 1 is the square-root LASSO, a regularized variant of linear regression well known for its good empirical performance. Closed-form solutions to (3.3) do not exist in general and it is common practice to solve it approximately using numerical routines. Theorem 2.4 offers instead an explicit first-order approximation of a_δ for small δ. We denote by a the ordinary least-squares estimator and by I the k × k identity matrix. Note that the first-order condition on a implies that ∫ (y − ⟨a, x⟩)x_i μ(dx, dy) = 0 for all 1 ≤ i ≤ k. In particular, V(0) = ∫ (y² − ⟨a, x⟩y) μ(dx, dy) and a = D^{−1} ∫ yx μ(dx, dy), where we assume the system is overdetermined so that D = ∫ xx^T μ(dx, dy) is invertible. A direct computation, see [9], yields explicit first-order approximations, (3.4) and (3.5), for the square-root Ridge and square-root LASSO coefficients, respectively. This corresponds to parameter shrinkage: proportional for square-root Ridge and a shift towards zero for square-root LASSO. To the best of our knowledge, these are the first such results and we stress that our formulae are valid in a general context; in particular, the parameter shrinkage depends on the direction through the D^{−1} factor. Figure 3 compares the first-order approximation with the actual results and shows a remarkable fit. Furthermore, our results agree with what is known in the canonical test case for the (standard) Ridge and LASSO, see [46].
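The coefficient shrinkage described above can be observed directly by solving the square-root LASSO (the s = 1 case) numerically; a sketch with illustrative data and a generic optimizer:

```python
import numpy as np
from scipy.optimize import minimize

# Square-root LASSO: min_a sqrt(mean((y - <a,x>)^2)) + delta * |a|_1,
# the s = 1 case of the regularized regression (illustrative data).
rng = np.random.default_rng(4)
N, k = 400, 3
X = rng.normal(size=(N, k))
a_true = np.array([1.5, -0.7, 0.0])
y = X @ a_true + 0.5 * rng.normal(size=N)

a_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

def sqrt_lasso(delta):
    obj = lambda a: np.sqrt(np.mean((y - X @ a) ** 2)) + delta * np.abs(a).sum()
    return minimize(obj, a_ols, method="Nelder-Mead",
                    options={"xatol": 1e-8, "fatol": 1e-10}).x

a_small, a_large = sqrt_lasso(0.01), sqrt_lasso(0.1)

# Coefficient shrinkage relative to OLS, growing with delta.
print(np.abs(a_small).sum() < np.abs(a_ols).sum() + 1e-3,
      np.abs(a_large).sum() < np.abs(a_small).sum())  # True True
```

The l1-norm of the regularized coefficients is non-increasing in δ (a standard comparison argument for penalized objectives), matching the shift-towards-zero shrinkage discussed above.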
When μ = μ_N is the empirical measure of N i.i.d. observations, suppose further that the data are centred and the covariates are orthogonal, i.e. D = (1/N)I. In that case, (3.5) simplifies to an expression in terms of R², the usual coefficient of determination. The case of μ_N is naturally of particular importance in statistics and data science and we continue to consider it in the next subsection. In particular, we characterize the asymptotic distribution of the (suitably rescaled) difference between a_δ ∈ A_δ(μ_N) and the optimizer a ∈ A_0(μ_∞) of the non-robust problem for the data-generating measure. This recovers the central limit theorem of [47], a link we explain further in §4b.

(e) Out-of-sample error
A benchmark of paramount importance in optimization is the so-called out-of-sample error, also known as the prediction error in statistical learning. Consider the setup above when μ_N is the empirical measure of N i.i.d. observations sampled from the 'true' distribution μ = μ_∞ and take, for simplicity, ||·|| = |·|_s with s > 1. Our aim is to compute the optimal a⋆ which solves the original problem (1.1) for μ_∞; the out-of-sample error quantifies the error from using a^{⋆,N}_δ as opposed to the true optimizer a⋆. While this expression seems hard to compute explicitly for finite samples, theorem 2.4 offers a way to find the asymptotic distribution of a (suitably rescaled version of) the out-of-sample error. We suppose the assumptions in theorem 2.4 are satisfied and note that the first-order condition for a⋆ gives ∇_a V(0, a⋆) = 0. A second-order Taylor expansion then expresses the out-of-sample error through ∇_a∇_a V(0, ã) for some ã (coordinate-wise) between a⋆ and a^{⋆,N}_δ. Next we write a^{⋆,N}_δ − a⋆ = (a^{⋆,N}_δ − a^{⋆,N}) + (a^{⋆,N} − a⋆), where we define a^{⋆,N} as the optimizer of the non-robust problem (1.1) with μ replaced by μ_N. In particular, the δ-method for M-estimators implies that √N(a^{⋆,N} − a⋆) ⇒ H, where H ∼ N(0, ∫ (∇_a f(x, a⋆))^T ∇_a f(x, a⋆) μ(dx)) and ⇒ denotes convergence in distribution. On the other hand, for a fixed N ∈ N, theorem 2.4 applied to μ_N yields a first-order expansion of a^{⋆,N}_δ around a^{⋆,N}, with an explicit correction term computed under μ_N. Almost surely (w.r.t. the sampling of μ_N), we know that μ_N → μ in W_p as N → ∞ and, under the regularity and growth assumptions on f in [9, equation (8.2)], we check that the error in this expansion vanishes a.s., see [9, example 8.4] for details. In particular, taking δ = 1/√N and combining the above with (3.7), we obtain (3.9). This recovers the central limit theorem of [47], as discussed in more detail in §4b below. Together, (3.6) and (3.9) give us the a.s. asymptotic behaviour of the out-of-sample error. These results also extend and complement [48, prop. 17]. The authors of [48] investigate when the distributionally robust optimizers a^{⋆,N}_δ yield, on average, better performance than the simple in-sample optimizer a^{⋆,N}.
To this end, they consider the expectation, over the realizations of the empirical measure μ_N, of the difference in out-of-sample performance. This is closely related to the out-of-sample error and our derivations above can easily be modified.
The first-order term in the Taylor expansion no longer vanishes and, instead of (3.6), we obtain a modified expansion, which holds, e.g. if for any r > 0 there exists c > 0 such that Σ_{i=1}^k |∇_a ∇_{a_i} f(x, a)| ≤ c(1 + |x|^p) for all x ∈ S, |a| ≤ r. Combined with (3.8), this gives asymptotics in small δ for a fixed N. For quadratic f and taking q ↑ ∞, we recover the result in [48, prop. 17], see [9, example 8.4] for details.
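The 1/N scaling of the mean out-of-sample error can be illustrated on the simplest possible example, f(x, a) = (x − a)², where the in-sample optimizer is the sample mean (a toy sketch, not the general setting of the theorems):

```python
import numpy as np

# Toy out-of-sample error: f(x, a) = (x - a)^2, a* = E[X], a*N = sample mean,
# so V(0, a*N) - V(0, a*) = (a*N - a*)^2, of order 1/N on average.
rng = np.random.default_rng(5)
reps = 2000

def mean_oos_error(N):
    a_N = rng.normal(loc=0.0, size=(reps, N)).mean(axis=1)  # in-sample optimizers
    return np.mean(a_N**2)  # average out-of-sample error over realizations

ratio = mean_oos_error(100) / mean_oos_error(1000)
print(5 < ratio < 20)  # True: roughly tenfold decay for 10x more data
```

This is exactly why the rescaling by √N (equivalently, the choice δ = 1/√N) produces a non-degenerate limit in the CLTs above.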

Further discussion and literature review
We start with an overview of related literature and then focus specifically on a comparison of our results with the CLT of [47] mentioned above.

(a) Discussion of related literature
Let us first remark that, while theorem 2.2 bears some superficial similarity to a classical maximum theorem, which is usually concerned with continuity properties of δ ↦ V(δ), in this work we are instead interested in the exact first derivative of the function δ ↦ V(δ). Indeed, the convergence lim_{δ→0} sup_{ν∈B_δ(μ)} ∫ f(x) ν(dx) = ∫ f(x) μ(dx) follows, for all f satisfying f(x) ≤ c(1 + |x|^p), directly from the definition of convergence in the Wasserstein metric (e.g. [49, def. 6.8]). The main issue is thus to quantify the rate of this convergence by calculating the first derivative V'(δ). Our work investigates model uncertainty broadly conceived: it includes errors related to the choice of models from a particular (parametric or not) class of models as well as the misspecification of such a class altogether (or, indeed, its absence). In the decision theoretic literature, these aspects are sometimes referred to as model ambiguity and model mis-specification, respectively, see [50]. However, seeing our main problem (1.2) in decision theoretic terms is not necessarily helpful as we think of f as given and not coming from some latent expected utility type of problem. In particular, our actions a ∈ A are just constants.
In our work, we decided to capture the uncertainty in the specification of μ using neighbourhoods in the Wasserstein distance. As already mentioned, other choices are possible and have been used in the past. Possibly the most often used alternative is the relative entropy, or Kullback–Leibler (KL) divergence. In particular, it has been used in this context in economics, see [51]. To the best of our knowledge, the only comparable study of sensitivities with respect to relative entropy balls is [22], see also [45] allowing for additional marginal constraints. However, this only considered the specific case f(x, a) = f(x), where the reward function is independent of the action. Its main result is a two-term expansion whose first-order term is √(2 Var_μ(f(X))) δ, where B^{KL}_δ(μ) is a ball of radius δ² centred around μ in KL-divergence and Var_μ(f(X)) and κ₃(f(X)) denote the variance and kurtosis of f under the measure μ, respectively. In particular, the first-order sensitivity involves the function f itself. By contrast, our theorem 2.2 states V'(0) = (∫ (f'(x))² μ(dx))^{1/2} and involves the first derivative f'. In the trivial case of a point mass μ = δ_x, we recover the intuitive sensitivity V'(0) = |f'(x)|, while the results of [22] do not apply in this case. We also note that [22] requires exponential moments of the function f under the baseline measure μ, while we only require polynomial moments. In particular, in applications in econometrics (or any field in which μ typically has fat tails), the scope of application of the corresponding results might then be decisively different. We remark, however, that this requirement can be substantially weakened (to the existence of polynomial moments) when replacing KL-divergences by α-divergences, e.g. [52,53]. We expect a sensitivity analysis similar to [22] to hold in this setting but, to the best of our knowledge, no explicit results are available in the literature.
To understand the relative technical difficulties and merits, it is insightful to go into the details of the statements. In the case of relative entropy, in the one-period set-up we are considering, the exact form of the optimizing density can be determined (see [22, proposition 3.1]) up to a one-dimensional Lagrange multiplier. This is well known and is the reason behind the usual elegant formulae obtained in this context. It reduces the problem in [22] to a one-dimensional problem, which can be well approximated via a Taylor expansion. By contrast, when we consider balls in the Wasserstein distance, the form of the optimizing measure is not known (apart from some degenerate cases). In fact, a key insight of our results is that the optimizing measure can be approximated by a deterministic shift in the direction of the gradient, i.e. by (x + δ∇f(x))_*μ (this is, in general, not exact but only true as a first-order approximation). The reason for these contrasting starting points of the analyses is the fact that Wasserstein balls contain a more heterogeneous set of measures, while in the case of relative entropy, exponentiating f will always do the trick. We remark, however, that this is no longer true for the finite-horizon problems considered in [22, section 3.2], where the worst-case measure is found using an elaborate fixed-point equation.
A point which further emphasizes the fact that the topology introduced by the Wasserstein metric is less tractable is the approximation

$$W_p^p(\mu, \nu) = \lim_{\varepsilon \downarrow 0}\; \inf_{\pi \in \mathrm{Cpl}(\mu, \nu)} \varepsilon\, H(\pi \mid R_\varepsilon),$$

where H(π | R_ε) = ∫ log(dπ/dR_ε) dπ is the relative entropy and dR_ε(x, y) = c₀ exp(−|x − y|^p/ε) dx dy for some normalizing constant c₀ > 0 (e.g. [54]). This is known as the entropic optimal transport formulation and has received considerable interest in the ML community in past years (e.g. [55]). In particular, the Wasserstein distance can be approximated by relative entropy, but only with respect to reference measures on the product space. As we optimize over ν above, this amounts to changing the reference measure. In consequence, the topological structure imposed by Wasserstein distances is more intricate than that of relative entropy, but also more flexible.
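The entropic approximation of the transport cost can be sketched with a log-domain Sinkhorn iteration on two small uniform measures (the points, ε and the tolerance are illustrative):

```python
import numpy as np
from itertools import permutations
from scipy.special import logsumexp

# Uniform measures on points x and y; squared cost. As eps -> 0 the entropic
# value approaches W_2^2, in line with the product-reference formulation.
x = np.array([0.0, 1.0, 3.0])
y = np.array([0.1, 1.2, 2.9])
C = (x[:, None] - y[None, :]) ** 2
n = len(x)

# Exact W_2^2: for uniform weights an optimal plan sits on a permutation.
exact = min(np.mean([C[i, p[i]] for i in range(n)]) for p in permutations(range(n)))

def sinkhorn_cost(eps, iters=500):
    f, log_w = np.zeros(n), -np.log(n)   # log-domain potentials for stability
    for _ in range(iters):
        g = -eps * logsumexp((f[:, None] - C) / eps + log_w, axis=0)
        f = -eps * logsumexp((g[None, :] - C) / eps + log_w, axis=1)
    P = np.exp((f[:, None] + g[None, :] - C) / eps + 2 * log_w)  # primal plan
    return float(np.sum(P * C))

print(abs(sinkhorn_cost(0.01) - exact) < 1e-3)  # True
```

Note that the entropy here is taken relative to a Gibbs reference measure on the product space, which is precisely the structural difference from the KL-balls discussed above.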
The other well-studied distance is the Hellinger distance. [24] calculates influence curves for the minimum Hellinger distance estimator a^{Hell,⋆} on a countable sample space. Their main result is that, for the choice f(x, a) = log ℓ(x, a) (where (ℓ(·, a))_{a∈A} is a collection of parametric densities), the IC is the product of the inverse Fisher information matrix and the score function, which is the same as for the classical maximum-likelihood estimator. Denote by μ_N the empirical measure of N data samples and by a^{Hell,⋆}(N) the corresponding minimum Hellinger distance estimator for μ_N. The above result then implies the same CLT as for M-estimators, √N(a^{Hell,⋆}(N) − a^{Hell,⋆}) ⇒ H, where H ∼ N(0, ∫ ∇_a f(x, a^{Hell,⋆})^T ∇_a f(x, a^{Hell,⋆}) μ(dx)). As we discuss in the next section, our theorem 2.4 yields a similar CLT for the Wasserstein-robust estimators, but with a shifted mean. Thus, the Wasserstein worst-case approach leads to a shift of the mean of the normal distribution compared with the non-robust case. In the simple case μ = N(0, σ²) with standard deviation σ > 0, we obtain the MLE (σ^{⋆,N})² = (1/N) Σ_{i=1}^N X_i². We can directly compute (for a = σ⋆) the resulting shift: the robust approach accounts for a shift of 1/(σ⋆)² (of order 1 if multiplied by the inverse Fisher information), to allow for a possibly higher variance in the underlying data. In particular, in our approach, the so-called neutral spaces considered, e.g. in [56, eqn (21)], should also take this shift into account, i.e. their definition should be adjusted accordingly. Lastly, let us mention another situation where our approach provides directly interpretable insights in the context of a parametric family of models. Namely, if one considers a family of models P such that the worst-case model in the Wasserstein ball remains in P, i.e. (x + δ∇f(x))_*μ ∈ P, then considering (the first-order approximation to) model uncertainty over Wasserstein balls actually reduces to considerations within the parametric family.
While uncommon, such a situation would arise, for example, for a scale-location family P, with μ ∈ P and a linear/quadratic f.

(b) Link to the central limit theorem of [47]

As observed in §3e above, theorem 2.4 allows us to recover the main results in [47]. We explain this now in detail. Set ||·|| = |·|_s, p = q = 2, S = R^d. Let μ_N denote the empirical measure of N i.i.d. samples from μ. We impose the assumptions on μ and f from [47], including Lipschitz continuity of the gradients of f and strict convexity. These, in particular, imply that the optimizers a^{⋆,N}_δ, a^{⋆,N} and a⋆, as defined in §3e, are well defined and unique, and further a^{⋆,N}_{1/√N} → a⋆ as N → ∞.
[47, thm. 1] implies that, as N → ∞, a CLT of the form (4.1) holds. We note that, for ||·|| = |·|_s, (4.1) agrees with (3.10); this is justified by the Lipschitz growth assumptions on f, ∇_x f(x, a) and ∇_a∇_x f(x, a) from [47], see [9, equation (8.2)]. In particular, theorem 2.4 implies (4.1) as a special case. While this connection is insightful to establish, it is also worth stressing that the proofs in [47] pass through the dual formulation and are thus substantially different from ours. Furthermore, while theorem 2.4 holds under milder assumptions on f than those in [47], the last argument in our reasoning above requires the stronger assumptions on f. It is thus not clear if our results could help to significantly weaken the assumptions in the central limit theorems of [47].

Proofs
We consider the case S = R d and || · || = | · | here. For the general case and additional details, we refer to [9]. When clear from the context, we do not indicate the space over which we integrate.
We turn now to the opposite '≥' inequality. As V(δ) ≥ V(0) for every δ > 0, there is no loss of generality in assuming that the right-hand side is not equal to zero. Take any, for notational simplicity not relabelled, subsequence of (δ)_{δ>0} which attains the liminf of (V(δ) − V(0))/δ and pick a⋆_δ ∈ A_δ. By assumption, for a (again not relabelled) subsequence, one has a⋆_δ → a⋆ ∈ A_0. Further define, for x ∈ R^d,

$$T(x) := \nabla_x f(x, a^\star)\, |\nabla_x f(x, a^\star)|^{q-2} \left( \int |\nabla_x f(y, a^\star)|^q\, \mu(dy) \right)^{-1/p},$$

with the convention 0/0 = 0, and let π_δ := (x, x + δT(x))_*μ. Note that the integral is well defined since, as before in (5.1), one has |∇_x f(x, a⋆)|^q ≤ C(1 + |x|^p) for some C > 0 and the latter is integrable under μ. Using that pq − p = q, it further follows that ∫ |x − y|^p π_δ(dx, dy) = δ^p ∫ |T(x)|^p μ(dx) = δ^p. In particular, π_δ ∈ C_δ(μ) and we can use it to estimate from below the supremum over C_δ(μ). For any t ∈ [0, 1], as δ → 0, the inner integral converges to ∫ ⟨∇_x f(x, a⋆), T(x)⟩ μ(dx) = (∫ |∇_x f(x, a⋆)|^q μ(dx))^{1/q}; the last equality follows from the definition of T and a simple calculation. To justify the convergence, first note that ⟨∇_x f(x + tδT(x), a⋆_δ), T(x)⟩ → ⟨∇_x f(x, a⋆), T(x)⟩ for all x ∈ R^d by continuity of ∇_x f and since a⋆_δ → a⋆. Moreover, as before in (5.1), one has |T(x)| ≤ c(1 + |x|) for some c > 0, hence |⟨∇_x f(x + tδT(x), a⋆), T(x)⟩| ≤ C(1 + |x|^p) for some C > 0 and all t ∈ [0, 1]. The latter is integrable under μ; hence convergence of the integrals follows from the dominated convergence theorem. This concludes the proof. Let δ > 0 and recall that a⋆_δ ∈ A_δ converge to a⋆ ∈ A_0. Let B_δ(μ, a⋆_δ) denote the set of ν ∈ B_δ(μ) which attain the value: ∫ f(x, a⋆_δ) ν(dx) = V(δ). This is non-empty by assumption. + λ⟨∇_x f(x + t(y − x), a⋆_δ), y − x⟩ π(dx, dy) dt − λ sup_{π∈C_δ(μ)} ∫₀¹ ∫ ⟨∇_x f(x + t(y − x), a⋆_δ), y − x⟩ π(dx, dy) dt, where we used a minimax argument as well as Fubini's theorem. We note that the functions above satisfy the assumptions of theorem 2.2 for a fixed λ. In particular, using exactly the same arguments as in the proof of theorem 2.2 (i.e.
Hölder's inequality and a specific transport attaining the supremum), we obtain, by exchanging the order of lim sup and inf, that ∫ ∇_{a_i} f(y, a⋆_δ) ν(dy) − ∇_{a_i} V(0, a⋆_δ) = inf_{ν∈B_δ(μ)} sup_{λ∈R} { ∫ (∇_{a_i} f(y, a⋆_δ) + λ(f(y, a⋆_δ) − V(δ))) ν(dy) − ∫ (∇_{a_i} f(x, a⋆_δ) + λ(f(x, a⋆_δ) − V(0, a⋆_δ))) μ(dx) }.
From here on, we argue as in the '≤' inequality and conclude that indeed (5.2) holds.
Data accessibility. The codes used to generate figures in the paper are available on GitHub: http://github.com/JanObloj/Robust-uncertainty-sensitivity-analysis.