# Sensitivity analysis of Wasserstein distributionally robust optimization problems

## Abstract

We consider sensitivity of a generic stochastic optimization problem to model uncertainty. We take a non-parametric approach and capture model uncertainty using Wasserstein balls around the postulated model. We provide explicit formulae for the first-order correction to both the value function and the optimizer and further extend our results to optimization under linear constraints. We present applications to statistics, machine learning, mathematical finance and uncertainty quantification. In particular, we provide an explicit first-order approximation for square-root LASSO regression coefficients and deduce coefficient shrinkage compared to the ordinary least-squares regression. We consider robustness of call option pricing and deduce a new Black–Scholes sensitivity, a non-parametric version of the so-called Vega. We also compute sensitivities of optimized certainty equivalents in finance and propose measures to quantify robustness of neural networks to adversarial examples.

### 1. Introduction

We consider a generic stochastic optimization problem

A more systematic approach to model uncertainty in (1.1) is offered by the distributionally robust optimization problem

This paper is organized as follows. We first present the main results and then, in §3, explore their applications. Further discussion of our results and the related literature is found in §4, which is then followed by the proofs. The online appendix [9] contains many supplementary results and remarks, as well as some more technical arguments from the proofs.

### 2. Main results

Take $d,k\in \mathbb{N}$, endow ${\mathbb{R}}^{d}$ with the Euclidean norm $|\cdot |$ and write ${\Gamma}^{o}$ for the interior of a set $\Gamma $. Assume that $\mathcal{S}$ is a closed convex subset of ${\mathbb{R}}^{d}$. Let $\mathcal{P}(\mathcal{S})$ denote the set of all (Borel) probability measures on $\mathcal{S}$. Further fix a seminorm $||\cdot ||$ on ${\mathbb{R}}^{d}$ and denote by $||\cdot |{|}_{\ast}$ its (extended) dual norm, i.e. $||y|{|}_{\ast}:=sup\{\u27e8x,y\u27e9:||x||\le 1\}$. In particular, for $||\cdot ||=|\cdot |$ we also have $||\cdot |{|}_{\ast}=|\cdot |$. For $\mu ,\nu \in \mathcal{P}(\mathcal{S})$, we define the $p$-Wasserstein distance as

Naturally, other choices for the distance on the space of measures are also possible: such as the Kullblack–Leibler divergence, see [22] for general sensitivity results and [23] for applications in portfolio optimization, or the Hellinger distance, see [24] for a statistical robustness analysis. We refer to §4 for a more detailed analysis of the state of the art in these fields. Both of these approaches have good analytic properties and often lead to theoretically appealing closed-form solutions. However, they are also very restrictive since any measure in the neighbourhood of $\mu $ has to be absolutely continuous with respect to $\mu $. In particular, if $\mu $ is the empirical measure of $N$ observations then measures in its neighbourhood have to be supported on those fixed $N$ points. To obtain meaningful results, it is thus necessary to impose additional structural assumptions, which are often hard to justify solely on the basis of the data at hand and, equally importantly, create another layer of model uncertainty themselves. We refer to [17, sec. 1.1] for further discussion of potential issues with $\varphi $-divergences. The Wasserstein distance, while harder to handle analytically, is more versatile and does not require any such additional assumptions.

Throughout the paper, we take the convention that continuity and closure are understood w.r.t. $|\cdot |$. We assume that $\mathcal{A}\subset {\mathbb{R}}^{k}$ is convex and closed and that the seminorm $||\cdot ||$ is strictly convex in the sense that for two elements $x,y\in {\mathbb{R}}^{d}$ with $||x||=||y||=1$ and $||x-y||\ne 0$, we have $||{\textstyle \frac{1}{2}}x+{\textstyle \frac{1}{2}}y||<1$ (note that this is satisfied for every ${l}^{s}$-norm $|x{|}_{s}:={(\sum _{i=1}^{d}|{x}_{i}{|}^{s})}^{1/s}$ for $s>1$). We fix $p\in (1,\mathrm{\infty})$, let $q:=p/(p-1)$ so that $1/p+1/q=1$, and fix $\mu \in \mathcal{P}(\mathcal{S})$ such that the boundary of $\mathcal{S}\subset {\mathbb{R}}^{d}$ has $\mu $–zero measure and ${\int}_{\mathcal{S}}|x{|}^{p}\hspace{0.17em}\mu (\text{d}x)<\mathrm{\infty}$. Denote by ${\mathcal{A}}_{\delta}^{\star}$ the set of optimizers for $V(\delta )$ in (1.2).

### Assumption 2.1.

The loss function $f:\mathcal{S}\times \mathcal{A}\to \mathbb{R}$ satisfies

— | $x\mapsto f(x,a)$ is differentiable on ${\mathcal{S}}^{o}$ for every $a\in \mathcal{A}$. Moreover, $(x,a)\mapsto {\mathrm{\nabla}}_{x}f(x,a)$ is continuous and for every $r>0$ there is $c>0$ such that $|{\mathrm{\nabla}}_{x}f(x,a)|\le c(1+|x{|}^{p-1})$ for all $x\in \mathcal{S}$ and $a\in \mathcal{A}$ with $|a|\le r$. | ||||

— | For all $\delta \ge 0$ sufficiently small, we have ${\mathcal{A}}_{\delta}^{\star}\ne \mathrm{\varnothing}$ and for every sequence ${({\delta}_{n})}_{n\in \mathbb{N}}$ such that $\underset{n\to \mathrm{\infty}}{lim}{\delta}_{n}=0$ and ${({a}_{n}^{\star})}_{n\in \mathbb{N}}$ such that ${a}_{n}^{\star}\in {\mathcal{A}}_{{\delta}_{n}}^{\star}$ for all $n\in \mathbb{N}$ there is a subsequence which converges to some ${a}^{\star}\in {\mathcal{A}}_{0}^{\star}$. |

The above assumption is not restrictive: the first part merely ensures existence of $||{\mathrm{\nabla}}_{x}f(\cdot ,{a}^{\star})|{|}_{{L}^{q}(\mu )}$, while the second part is satisfied as soon as either $\mathcal{A}$ is compact or $V(0,\cdot )$ is coercive, which is the case in most examples of interest; see [9, lemma 7.15] for further comments.

### Theorem 2.2.

*If assumption 2.1 holds then* ${V}^{\prime}(0)$ *is given by*

### Remark.

Inspecting the proof, defining

The above result naturally extends to computing sensitivities of robust problems, i.e. ${V}^{\prime}(r)$, see [9, corollary 7.5], as well as to the case of stochastic optimization under linear constraints, see [9, theorem 7.7]. We recall that $V(0,a)={\int}_{\mathcal{S}}f(x,a)\hspace{0.17em}\mu (\text{d}x)$.

### Assumption 2.3.

Suppose the $f$ is twice continuously differentiable, ${a}^{\star}\in {\mathcal{A}}_{0}^{\star}\cap {\mathcal{A}}^{o}$ and

— | $\sum _{i=1}^{k}|{\mathrm{\nabla}}_{{a}_{i}}{\mathrm{\nabla}}_{x}f(x,a)|\le c(1+|x{|}^{p-1-\epsilon})$ for some $\epsilon >0$, $c>0$, all $x\in \mathcal{S}$ and all $a$ close to ${a}^{\star}$. | ||||

— | The function $a\mapsto V(0,a)$ is twice continuously differentiable in the neighbourhood of ${a}^{\star}$ and the matrix ${\mathrm{\nabla}}_{a}^{2}V(0,{a}^{\star})$ is invertible. |

### Theorem 2.4.

*Suppose* ${a}^{\star}\in {\mathcal{A}}_{0}^{\star}$ *and* ${a}_{\delta}^{\star}\in {\mathcal{A}}_{\delta}^{\star}$ *such that* ${a}_{\delta}^{\star}\to {a}^{\star}$ *as* $\delta \to 0$ *and assumptions 2.1 and 2.3 are satisfied. If* ${\mathrm{\nabla}}_{x}f(x,{a}^{\star})\ne 0$ $\mu $-*a.e. or if* ${\mathrm{\nabla}}_{x}{\mathrm{\nabla}}_{a}f(x,{a}^{\star})=0$ $\mu $-*a.e., then*

*where*$h:{\mathbb{R}}^{d}\setminus \{0\}\to \{x\in {\mathbb{R}}^{d}\hspace{0.17em}:\hspace{0.17em}||x|{|}_{\ast}=1\}$

*is the unique function satisfying*$\u27e8\cdot ,h(\cdot )\u27e9=||\cdot ||$,

*see*[9,

*Lemma*6.2].

*In particular*, $h(\cdot )=\cdot /|\cdot |$

*if*$||\cdot ||=|\cdot |$.

Above and throughout the convention is that ${\mathrm{\nabla}}_{x}f(x,a)\in {\mathbb{R}}^{d\times 1},$ ${\mathrm{\nabla}}_{{a}_{i}}{\mathrm{\nabla}}_{x}f(x,a)\in {\mathbb{R}}^{d\times 1}$, ${\mathrm{\nabla}}_{a}f(x,a)\in {\mathbb{R}}^{k\times 1}$, ${\mathrm{\nabla}}_{x}{\mathrm{\nabla}}_{a}f(x,a)\in {\mathbb{R}}^{k\times d}$ and $0/0=0$. The assumed existence and convergence of optimizers holds, e.g. with suitable convexity of $f$ in $a$; see [9, lemma 7.14] for a worked out setting. In line with the financial economics practice, we gave our sensitivities letter symbols, ${\rm Y}$ and $\beth $, loosely motivated by ${\rm Y}\pi \stackrel{\xb4}{o}\delta \epsilon \iota \gamma \mu \alpha $, the Greek for *Model*, and , the Hebrew for *control*.

### 3. Applications

We now illustrate the universality of theorems 2.2 and 2.4 by considering their applications in a number of different fields. Unless otherwise stated, $\mathcal{S}={\mathbb{R}}^{d}$, $\mathcal{A}={\mathbb{R}}^{k}$ and $\int $ means ${\int}_{\mathcal{S}}$.

#### (a) Financial economics

We start with the simple example of risk-neutral pricing of a call option written on an underlying asset ${({S}_{t})}_{t\le T}$. Here, $T,K>0$ are the maturity and the strike, respectively, $f(x,a)={({S}_{0}x-K)}^{+}$ and $\mu $ is the distribution of ${S}_{T}/{S}_{0}$. We set interest rates and dividends to zero for simplicity. In [25], the model $\mu $ is a lognormal distribution, i.e. $\mathrm{log}({S}_{T}/{S}_{0})\sim \mathcal{N}(-{\sigma}^{2}T/2,{\sigma}^{2}T)$ is Gaussian with mean $-{\sigma}^{2}T/2$ and variance ${\sigma}^{2}T$. In this case, $V(0)$ is given by the celebrated Black–Scholes formula. Note that this example is particularly simple since $f$ is independent of $a$. However, to ensure risk-neutral pricing, we have to impose a linear constraint on the measures in ${B}_{\delta}(\mu )$, giving

We turn now to the classical notion of the optimized certainty equivalent (OCE) of [27]. It is a decision theoretic criterion designed to split a liability between today's and tomorrow’s payments. It is also a convex risk measure in the sense of [28] and covers many of the popular risk measures such as expected shortfall or entropic risk, see [29]. We fix a convex monotone function $l:\mathbb{R}\to \mathbb{R}$ which is bounded from below and $g:{\mathbb{R}}^{d}\to \mathbb{R}$. Here, $g$ represents the payoff of a financial position and $l$ is the negative of a utility function, or a loss function. We take $||\cdot ||=|\cdot |$ and refer to [9, lemma 7.14] for generic sufficient conditions for assumptions 2.1 and 2.3 to hold in this setup. The OCE corresponds to $V$ in (1.1) for $f(x,a)=l(g(x)-a)+a$ and $\mathcal{A}=\mathbb{R}$, $\mathcal{S}={\mathbb{R}}^{d}$. Theorems 2.2 and 2.4 yield the sensitivities

A related problem considers hedging strategies which minimize the expected loss of the hedged position, i.e. $f(x,a)=l(g(x)+\u27e8a,x-{x}_{0}\u27e9)$, where $\mathcal{A}={\mathbb{R}}^{k}$ and $({x}_{0},x)$ represent today's and tomorrow’s traded prices. We compute ${\rm Y}$ as

Finally, we consider briefly the classical mean-variance optimization of [30]. Here $\mu $ represents the loss distribution across the assets and $a\in {\mathbb{R}}^{d}$, $\sum _{i=1}^{d}{a}_{i}=1$ are the relative investment weights. The original problem is to minimize the sum of the expectation and $\gamma $ standard deviations of returns $\u27e8a,X\u27e9$, with $X\sim \mu $. Using the ideas in [31, Example 2] and considering measures on ${\mathbb{R}}^{d}\times {\mathbb{R}}^{d}$, we can recast the problem as (1.1). While [31] focused on the asymptotic regime $\delta \to \mathrm{\infty}$, their non-asymptotic statements are related to our theorem 2.2 and either result could be used here to obtain that $V(\delta )\approx V(0)+\sqrt{1-{\gamma}^{2}}\delta $ for small $\delta $.

#### (b) Neural networks

We specialize now to quantifying robustness of neural networks (NN) to adversarial examples. This has been an important topic in machine learning since [32] observed that NN consistently misclassify inputs formed by applying small worst-case perturbations to a dataset. This produced a number of works offering either explanations for these effects or algorithms to create such adversarial examples, e.g. [33–39] to name just a few. The main focus of research works in this area, see [40], has been on faster algorithms for finding adversarial examples, typically leading to an overfit to these examples without any significant generalization properties. The viewpoint has been mainly pointwise, e.g. [32], with some generalizations to probabilistic robustness, e.g. [39].

In contrast, we propose a simple metric for measuring robustness of NN which is independent of the architecture employed and the algorithms for identifying adversarial examples. In fact, theorem 2.2 offers a simple and intuitive way to formalize robustness of NN: for simplicity consider a $1$-layer neural network trained on a given distribution $\mu $ of pairs $(x,y)$, i.e. $({A}_{1}^{\star},{A}_{2}^{\star},{b}_{1}^{\star},{b}_{2}^{\star})$ solve

#### (c) Uncertainty quantification

In the context of UQ, the measure $\mu $ represents input parameters of a (possibly complicated) operation $G$ in a physical, engineering or economic system. We consider the so-called *reliability* or *certification problem*: for a given set $E$ of undesirable outcomes, one wants to control $\underset{\nu \in \mathcal{P}}{sup}\nu (G(x)\in E)$, for a set of probability measures $\mathcal{P}$. The distributionally robust adversarial classification problem considered recently by [42] is also of this form, with Wasserstein balls $\mathcal{P}$ around an empirical measure of $N$ samples. Using the dual formulation of [18], they linked the problem to minimization of the conditional value-at-risk and proposed a reformulation, and numerical methods, in the case of linear classification. We propose instead a regularized version of the problem and look for

Assume that $E$ is convex. Then $x\mapsto d(x,E)$ differentiable everywhere except at the boundary of $E$ with ${\mathrm{\nabla}}_{x}d(x,E)=0$ for $x\in {E}^{o}$ and $|{\mathrm{\nabla}}_{x}d(x,E)|=1$ for all $x\in {\overline{E}}^{c}$. Furthermore, assume $\mu $ is absolutely continuous w.r.t. Lebesgue measure on $\mathcal{S}$. Theorem 2.2, using [9, remark 7.3], gives a first-order expansion for the above problem:

#### (d) Statistics

We discuss two applications of our results in the realm of statistics. We start by highlighting the link between our results and the so-called *influence curves* (IC) in robust statistics. For a functional $\mu \mapsto T(\mu )$ its IC is defined as

Our second application in statistics exploits the representation of the LASSO/Ridge regressions as robust versions of the standard linear regression. We consider $\mathcal{A}={\mathbb{R}}^{k}$ and $\mathcal{S}={\mathbb{R}}^{k+1}$. If instead of the Euclidean metric we take $||(x,y)|{|}_{\ast}=|x{|}_{r}{\mathbf{1}}_{\{y=0\}}+\mathrm{\infty}{\mathbf{1}}_{\{y\ne 0\}}$, for some $r>1$ and $(x,y)\in {\mathbb{R}}^{k}\times \mathbb{R}$, in the definition of the Wasserstein distance, then [19] showed that

^{1}${a}_{\delta}^{\star}$ is approximately

The case of ${\mu}_{N}$ is naturally of particular importance in statistics and data science and we continue to consider it in the next subsection. In particular, we characterize the asymptotic distribution of $\sqrt{N}({a}_{1/\sqrt{N}}^{\star}-{a}^{\star})$, where ${a}_{\delta}^{\star}\in {\mathcal{A}}_{\delta}^{\star}({\mu}_{N})$ and ${a}^{\star}\in {\mathcal{A}}_{0}^{\star}({\mu}_{\mathrm{\infty}})$ is the optimizer of the non-robust problem for the data-generating measure. This recovers the central limit theorem of [47], a link we explain further in §4b.

#### (e) Out-of-sample error

A benchmark of paramount importance in optimization is the so-called *out-of-sample error*, also known as the *prediction error* in statistical learning. Consider the setup above when ${\mu}_{N}$ is the empirical measure of $N$ i.i.d. observations sampled from the ‘true’ distribution $\mu ={\mu}_{\mathrm{\infty}}$ and take, for simplicity, $||\cdot ||=|\cdot {|}_{s}$, with $s>1$. Our aim is to compute the optimal ${a}^{\star}$ which solves the original problem (1.1). However, we only have access to the training set, encoded via ${\mu}_{N}$. Suppose we solve the distributionally robust optimization problem (1.2) for ${\mu}_{N}$ and denote the robust optimizer ${a}_{\delta}^{\star ,N}$. Then the *out-of-sample error*

While this expression seems to be hard to compute explicitly for finite samples, theorem 2.4 offers a way to find the asymptotic distribution of a (suitably rescaled version of) the out-of-sample error. We suppose the assumptions in theorem 2.4 are satisfied and note that the first order condition for ${a}^{\star}$ gives ${\mathrm{\nabla}}_{a}V(0,{a}^{\star})=0$. Then, a second-order Taylor expansion gives

### 4. Further discussion and literature review

We start with an overview of related literature and then focus specifically on a comparison of our results with the CLT of [47] mentioned above.

#### (a) Discussion of related literature

Let us first remark, that while theorem 2.2 offers some superficial similarities to a classical maximum theorem, which is usually concerned with continuity properties of $\delta \mapsto V(\delta )$, in this work, we are instead interested in the exact first derivative of the function $\delta \mapsto V(\delta )$. Indeed, the convergence $\underset{\delta \to 0}{lim}\underset{\nu \in {B}_{\delta}(\mu )}{sup}\int f(x)\hspace{0.17em}\nu (\text{d}x)=\int f(x)\hspace{0.17em}\mu (\text{d}x)$ follows for all $f$ satisfying $f(x)\le c(1+|x{|}^{p})$ directly from the definition of convergence in Wasserstein metric (e.g. [49, Def. 6.8]). In conclusion, the main issue is to quantify the rate of this convergence by calculating the first derivative ${V}^{\prime}(\delta )$.

Our work investigates model uncertainty broadly conceived: it includes errors related to the choice of models from a particular (parametric or not) class of models as well as the mis-specification of such a class altogether (or indeed, its absence). In the decision theoretic literature, these aspects are sometimes referred to as model ambiguity and model mis-specification, respectively, see [50]. However, seeing our main problem (1.2) in decision theoretic terms is not necessarily helpful as we think of $f$ as given and not coming from some latent expected utility type of problem. In particular, our actions $a\in \mathcal{A}$ are just constants.

In our work, we decided to capture the uncertainty in the specification of $\mu $ using neighbourhoods in the Wasserstein distance. As already mentioned, other choices are possible and have been used in past. Possibly, the most often used alternative is the relative entropy, or the Kullblack–Leibler divergence. In particular, it has been used in this context in economics, see [51]. To the best of our knowledge, the only comparable study of sensitivities with respect to relative entropy balls is [22], see also [45] allowing for additional marginal constraints. However, this only considered the specific case $f(x,a)=f(x)$ where the reward function is independent of the action. Its main result is

To understand the relative technical difficulties and merits, it is insightful to go into the details of the statements. In fact, in the case of relative entropy and the one-period set-up we are considering, the exact form of the optimizing density can be determined exactly (see [22, Proposition 3.1]) up to a one-dimensional Langrange parameter. This is well known and is the reason behind the usual elegant formulae obtained in this context. But this then reduces the problem in [22] to a one-dimensional problem, which can be well-approximated via a Taylor approximation. By contrast, when we consider balls in the Wasserstein distance, the form of the optimizing measure is not known (apart from some degenerate cases). In fact, a key insight of our results is that the optimizing measure can be approximated by a deterministic shift in the direction ${(x+{f}^{\mathrm{\prime}}(x)\delta )}_{\ast}\mu $ (this is, in general, not exact but only true as a first-order approximation). The reason for these contrastive starting points of the analyses is the fact that Wasserstein balls contain a more heterogeneous set of measures, while in the case of relative entropy, exponentiating $f$ will always do the trick. We remark however that this is not true for the finite-horizon problems considered in [22, Section 3.2] any more, where the worst-case measure is found using an elaborate fixed-point equation.

A point which further emphasizes the fact that the topology introduced by the Wasserstein metric is less tractable is the fact that

The other well-studied distance is the Hellinger distance. [24] calculates influence curves for the minimum Hellinger distance estimator ${a}^{\mathrm{Hell},\star}$ on a countable sample space. Their main result is that for the choice $f(x,a)=\mathrm{log}(\ell (x,a))$ (where ${(\ell (x,a))}_{a\in \mathcal{A}}$ is a collection of parametric densities)

#### (b) Link to the central limit theorem of [47]

As observed in §3e above, theorem 2.4 allows to recover the main results in [47]. We explain this now in detail. Set $||\cdot ||=|\cdot {|}_{s}$, $p=q=2$, $\mathcal{S}={\mathbb{R}}^{d}$. Let ${\mu}_{N}$ denote the empirical measure of $N$ i.i.d. samples from $\mu $. We impose the assumptions on $\mu $ and $f$ from [47], including Lipschitz continuity of gradients of $f$ and strict convexity. These, in particular, imply that the optimizers ${a}_{\delta}^{\star ,N},{a}^{\star ,N}$ and ${a}^{\star}$, as defined in §3e are well defined and unique, and further ${a}_{1/\sqrt{N}}^{\star ,N}\to {a}^{\star}$ as $N\to \mathrm{\infty}$. [47, Thm. 1] implies that, as $N\to \mathrm{\infty}$,

^{2}it is also worth stressing that the proofs in [47] pass through the dual formulation and are thus substantially different from ours. Furthermore, while theorem 2.4 holds under milder assumptions on $f$ than those in [47], the last argument in our reasoning above requires the stronger assumptions on $f$. It is thus not clear if our results could help to significantly weaken the assumptions in the central limit theorems of [47].

### 5. Proofs

We consider the case $\mathcal{S}={\mathbb{R}}^{d}$ and $||\cdot ||=|\cdot |$ here. For the general case and additional details, we refer to [9]. When clear from the context, we do not indicate the space over which we integrate.

### Proof of theorem 2.2.

For every $\delta \ge 0,$ let ${C}_{\delta}(\mu )$ denote those $\pi \in \mathcal{P}({\mathbb{R}}^{d}\times {\mathbb{R}}^{d})$ which satisfy

We start by showing the ‘$\le $’ inequality in the statement. For any ${a}^{\star}\in {\mathcal{A}}_{0}^{\star},$ one has $V(\delta )\le \underset{\nu \in {B}_{\delta}(\mu )}{sup}\int f(y,{a}^{\star})\hspace{0.17em}\nu (\text{d}y)$ with equality for $\delta =0$. Therefore, differentiating $f(\cdot ,{a}^{\star})$ and using both Fubini’s theorem and Hölder’s inequality, we obtain that

We turn now to the opposite ‘$\ge $’ inequality. As $V(\delta )\ge V(0)$ for every $\delta >0,$ there is no loss of generality in assuming that the right-hand side is not equal to zero. Now take any, for notational simplicity not relabelled, subsequence of ${(\delta )}_{\delta >0}$ which attains the liminf in $(V(\delta )-V(0))/\delta $ and pick ${a}_{\delta}^{\star}\in {\mathcal{A}}_{\delta}^{\star}$. By assumption, for a (again not relabelled) subsequence, one has ${a}_{\delta}^{\star}\to {a}^{\star}\in {\mathcal{A}}_{0}^{\star}$. Further note that $V(0)\le \int f(x,{a}_{\delta}^{\star})\hspace{0.17em}\mu (\text{d}x)$ which implies

### Proof of theorem 2.4.

We first show that

The proof of the ‘$\ge $’ inequality in (5.2) follows by the very same arguments. Indeed, [9, lemma 8.5] implies that

By assumption, the matrix ${\mathrm{\nabla}}_{a}^{2}V(0,{a}^{\star})$ is invertible. Therefore, in a small neighbourhood of ${a}^{\star}$, the mapping ${\mathrm{\nabla}}_{a}V(0,\cdot )$ is invertible. In particular, ${a}_{\delta}^{\star}={({\mathrm{\nabla}}_{a}V(0,\cdot ))}^{-1}({\mathrm{\nabla}}_{a}V(0,{a}_{\delta}^{\star}))$ and by the first-order condition ${a}^{\star}={({\mathrm{\nabla}}_{a}V(0,\cdot ))}^{-1}(0)$. Applying the chain rule and using (5.2) gives

## Footnotes

### Data accessibility

The codes used to generate figures in the paper are available on GitHub: http://github.com/JanObloj/Robust-uncertainty-sensitivity-analysis.

### Authors' contributions

D.B., S.D., J.O. and J.W. formulated the mathematical problem, carried out the analysis, established the main results and drew conclusions. J.O. and J.W. wrote the first draft of the paper. D.B. and J.W. wrote the first draft of the appendix. S.D. and J.W. performed the numerical analysis. All the authors proof read and corrected the manuscript, gave final approval for publication and agree to be held accountable for the work performed therein.

### Competing interests

The authors declare no competing interests.

### Funding

This work was supported by the European Research Council [7th FP/ERC grant agreement no. 335421], the Vienna Science and Technology Fund (WWTF) [project MA16-021], the Austrian Science Fund (FWF) [project P28661] and the National Science Foundation of China (grant nos 11971310 and 11671257).

## Acknowledgements

We thank Jose Blanchet, Mike Giles, Daniel Kuhn and Peyman Mohajerin Esfahani for their helpful comments on an earlier draft of this paper.

### References

- 1.
Armacost RL, Fiacco AV . 1974 Computational experience in sensitivity analysis for nonlinear programming.**Math. Program.**, 301-326. (doi:10.1007/BF01580247) Crossref, Google Scholar**6** - 2.
Vogel S . 2007 Stability results for stochastic programming problems.**Optimization**, 269-288. (doi:10.1080/02331938808843343) Crossref, Google Scholar**19** - 3.
Bonnans JF, Shapiro A . 2013**Perturbation analysis of optimization problems**. New York, NY: Springer. Google Scholar - 4.
Ghanem R, Higdon D, Owhadi H eds. 2017**Handbook of uncertainty quantification**. Cham, Switzerland: Springer. Crossref, Google Scholar - 5.
Dupacova J . 1990 Stability and sensitivity analysis for stochastic programming.**Ann. Oper. Res.**, 115-142. (doi:10.1007/BF02055193) Crossref, Google Scholar**27** - 6.
Romisch W . 2003 Stability of stochastic programming problems. In*Stochastic programming*, pp. 483–554. Amsterdam, The Netherlands: Elsevier. (doi:10.1016/S0927-0507(03)10008-4) Google Scholar - 7.
Asi H, Duchi JC . 2019 The importance of better models in stochastic optimization.**Proc. Natl Acad. Sci. USA**, 22 924-22 930. (doi:10.1073/pnas.1908018116) Crossref, Web of Science, Google Scholar**116** - 8.
Rahimian H, Mehrotra S . 2019 Distributionally robust optimization: a review. (http://arxiv.org/abs/1908.05659) Google Scholar - 9.
Bartl D, Drapeau S, Obłój J, Wiesel J . 2021 Supplementary material from “Sensitivity analysis of Wasserstein distributionally robust optimization problems”. The Royal Society. Collection. (https://doi.org/10.6084/m9.figshare.c.5730987) Google Scholar - 10.
Chiappori PA, McCann RJ, Nesheim L . 2010 Hedonic price equilibria, stable matching, and optimal transport: equivalence, topology, and uniqueness.**Econ. Theory**, 317-354. (doi:10.1007/s00199-009-0455-z) Crossref, Web of Science, Google Scholar**42** - 11.
Carlier G, Ekeland I . 2010 Matching for teams.**Econ. Theory**, 397-418. (doi:10.1007/s00199-008-0415-z) Crossref, Web of Science, Google Scholar**42** - 12.
Peyré G, Cuturi M . 2019 Computational optimal transport.**Found. Trends Mach. Learn.**, 355-607. (doi:10.1561/2200000073) Crossref, Web of Science, Google Scholar**11** - 13.
Pflug G, Wozabal D . 2007 Ambiguity in portfolio selection.**Quant. Finance**, 435-442. (doi:10.1080/14697680701455410) Crossref, Web of Science, Google Scholar**7** - 14.
Fournier N, Guillin A . 2014 On the rate of convergence in Wasserstein distance of the empirical measure.**Probab. Theory Relat. Fields**, 707-738. (doi:10.1007/s00440-014-0583-7) Crossref, Web of Science, Google Scholar**162** - 15.
Mohajerin Esfahani P, Kuhn D . 2018 Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations.**Math. Program.**, 115-166. (doi:10.1007/s10107-017-1172-1) Crossref, Web of Science, Google Scholar**171** - 16.
Obłój J, Wiesel J . 2021 Robust estimation of superhedging prices.**Ann. Stat.**, 508-530. (doi:10.1214/20-AOS1966) Crossref, Web of Science, Google Scholar**49** - 17.
Gao R, Kleywegt AJ . 2016 Distributionally robust stochastic optimization with Wasserstein distance. (http://arxiv.org/abs/1604.02199) Google Scholar - 18.
Blanchet J, Murthy K . 2019 Quantifying distributional model risk via optimal transport.**Math. Oper. Res.**, 565-600. (doi:10.1287/moor.2018.0936) Crossref, Web of Science, Google Scholar**44** - 19.
Blanchet J, Kang Y, Murthy K . 2019 Robust Wasserstein profile inference and applications to machine learning.**J. Appl. Probab.**, 830-857. (doi:10.1017/jpr.2019.49) Crossref, Web of Science, Google Scholar**56** - 20.
Kuhn D, Esfahani PM, Nguyen VA, Shafieezadeh-Abadeh S . 2019 Wasserstein distributionally robust optimization: theory and applications in machine learning. In*Operations research & management science in the age of analytics*, pp. 130–166. INFORMS. (doi:10.1287/educ.2019.0198) Google Scholar - 21.
Shafieezadeh-Abadeh S, Kuhn D, Esfahani PM . 2019 Regularization via mass transportation.**J. Mach. Learn. Res.**, 1-68. Web of Science, Google Scholar**20** - 22.
Lam H . 2016 Robust sensitivity analysis for stochastic systems.**Math. Oper. Res.**, 1248-1275. (doi:10.1287/moor.2015.0776) Crossref, Web of Science, Google Scholar**41** - 23.
Calafiore GC . 2007 Ambiguous risk measures and optimal robust portfolios.**SIAM J. Optim.**, 853-877. (doi:10.1137/060654803) Crossref, Web of Science, Google Scholar**18** - 24.
Lindsay BG . 1994 Efficiency versus robustness: the case for minimum Hellinger distance and related methods.**Ann. Stat.**, 1081-1114. (doi:10.1214/aos/1176325512) Crossref, Web of Science, Google Scholar**22** - 25.
Black F, Scholes M . 1973 The pricing of options and corporate liabilities.**J. Political Econ.**, 637-654. (doi:10.1086/260062) Crossref, Web of Science, Google Scholar**81** - 26.
Bartl D, Drapeau S, Tangpi L . 2020 Computational aspects of robust optimized certainty equivalents and option pricing.**Math. Finance**, 287-309. (doi:10.1111/mafi.12203) Crossref, Web of Science, Google Scholar**30** - 27.
Ben Tal A, Teboulle M . 1986 Expected utility, penalty functions, and duality in stochastic nonlinear programming.**Manage. Sci.**, 1445-1466. (doi:10.1287/mnsc.32.11.1445) Crossref, Web of Science, Google Scholar**32** - 28.
Artzner P, Delbaen F, Eber J, Heath D . 1999 Coherent measures of risk.**Math. Finance**, 203-228. (doi:10.1111/1467-9965.00068) Crossref, Web of Science, Google Scholar**9** - 29.
Ben Tal A, Teboulle M . 2007 An old-new concept of convex risk measures: the optimized certainty equivalent.**Math. Finance**, 449-476. (doi:10.1111/j.1467-9965.2007.00311.x) Crossref, Web of Science, Google Scholar**17** - 30.
Markowitz H . 1952 Portfolio selection.**J. Finance**, 77-91. (doi:10.2307/2975974) Web of Science, Google Scholar**7** - 31.
Pflug GC, Pichler A, Wozabal D . 2012 The 1/N investment strategy is optimal under high model ambiguity.**J. Bank. Finance**, 410-417. (doi:10.1016/j.jbankfin.2011.07.018) Crossref, Web of Science, Google Scholar**36** - 32.
Szegedy C, Zaremba W, Sutskever I, Bruna J, Erhan D, Goodfellow I, Fergus R . 2013 Intriguing properties of neural networks. (http://arxiv.org/abs/1312.6199) Google Scholar - 33.
Goodfellow IJ, Shlens J, Szegedy C . 2014 Explaining and harnessing adversarial examples. (http://arxiv.org/abs/1412.6572) Google Scholar - 34.
Li L, Zhong Z, Li B, Xie T . 2019 Robustra: training provable robust neural networks over reference adversarial space. In*Proc. 28th Int. Joint Conf. on Artificial Intelligence*, pp. 4711–4717. AAAI Press. (doi:10.24963/ijcai.2019/654) Google Scholar - 35.
Carlini N, Wagner D . 2017 Towards evaluating the robustness of neural networks. In*2017 IEEE Symp. on Security and Privacy (SP)*, pp. 39–57. IEEE. (doi:10.1109/SP.2017.49) Google Scholar - 36.
Wong E, Kolter JZ . 2017 Provable defenses against adversarial examples via the convex outer adversarial polytope. (http://arxiv.org/abs/1711.00851) Google Scholar - 37.
Weng TW, Zhang H, Chen PY, Yi J, Su D, Gao Y, Hsieh CJ, Daniel L . 2018 Evaluating the robustness of neural networks: an extreme value theory approach. (http://arxiv.org/abs/1801.10578) Google Scholar - 38.
Araujo A, Pinot R, Negrevergne B, Meunier L, Chevaleyre Y, Yger F, Atif J . 2019 Robust neural networks using randomized adversarial training. (http://arxiv.org/abs/1903.10219) Google Scholar - 39.
Mangal R, Nori AV, Orso A . 2019 Robustness of neural networks: a probabilistic and practical approach. In*Proc. 41st Int. Conf. on Software Engineering: New Ideas and Emerging Results*, pp. 93–96. IEEE Press. (doi:10.1109/ICSE-NIER.2019.00032) Google Scholar - 40.
Bastani O, Ioannou Y, Lampropoulos L, Vytiniotis D, Nori A, Criminisi A . 2016 Measuring neural net robustness with constraints. (https://arxiv.org/abs/1605.07262) Google Scholar - 41.
Sinha A, Namkoong H, Volpi R, Duchi J . 2020 Certifying some distributional robustness with principled adversarial training. (http://arxiv.org/abs/1710.10571v5) Google Scholar - 42.
Ho-Nguyen N, Wright SJ . 2020 Adversarial classification via distributional robustness with wasserstein ambiguity. (http://arxiv.org/abs/2005.13815) Google Scholar - 43.
Chen Z, Kuhn D, Wiesemann W . 2018 Data-driven chance constrained programs over Wasserstein balls. (http://arxiv.org/abs/1809.00210) Google Scholar - 44.
Huber P, Ronchetti E . 1981**Robust statistics**.Wiley Series in Probability and Mathematical Statistics ,. New York, NY: Wiley-IEEE. Crossref, Google Scholar**vol. 52** - 45.
Lam H . 2018 Sensitivity to serial dependency of input processes: a robust approach.**Manage. Sci.**, 1311-1327. (doi:10.1287/mnsc.2016.2667) Crossref, Web of Science, Google Scholar**64** - 46.
Tibshirani R . 1996 Regression shrinkage and selection via the Lasso.**J. R. Stat. Soc. B. Stat. Methodol.**, 267-288. Crossref, Google Scholar**58** - 47.
Blanchet J, Murthy K, Si N . 2019 Confidence regions in Wasserstein distributionally robust estimation. (http://arxiv.org/abs/1906.01614) Google Scholar - 48.
Anderson EJ, Philpott AB . 2019 Improving sample average approximation using distributional robustness.*Optimization Online*. See http://www.optimization-online.org/DB_HTML/2019/10/7405.html. Google Scholar - 49.
- 50.
Hansen LP, Marinacci M . 2016 Ambiguity aversion and model misspecification: an economic perspective.**Stat. Sci.**, 511-515. (doi:10.1214/16-STS570) Crossref, Web of Science, Google Scholar**31** - 51.
Hansen LP, Sargent T . 2007**Robustness**. Princeton, NJ: Princeton University Press. Crossref, Google Scholar - 52.
Atar R, Chowdhary K, Dupuis P . 2015 Robust bounds on risk-sensitive functionals via Rényi divergence.**SIAM/ASA J. Uncertain. Quantif.**, 18-33. (doi:10.1137/130939730) Crossref, Google Scholar**3** - 53.
Glasserman P, Xu X . 2014 Robust risk measurement and model risk.**Quant. Finance**, 29-58. (doi:10.1080/14697688.2013.822989) Crossref, Web of Science, Google Scholar**14** - 54.
Carlier G, Duval V, Peyré G, Schmitzer B . 2017 Convergence of entropic schemes for optimal transport and gradient flows.**SIAM J. Math. Anal.**, 1385-1418. (doi:10.1137/15M1050264) Crossref, Web of Science, Google Scholar**49** - 55.
Peyré G, Cuturi M . 2019 Computational optimal transport: with applications to data science.**Found. Trends Mach. Learn.**, 355-607. (doi:10.1561/2200000073) Crossref, Web of Science, Google Scholar**11** - 56.
Komorowski M, Costa MJ, Rand DA, Stumpf MP . 2011 Sensitivity, robustness, and identifiability in stochastic chemical kinetics models.**Proc. Natl Acad. Sci. USA**, 8645-8650. (doi:10.1073/pnas.1015814108) Crossref, PubMed, Web of Science, Google Scholar**108**