Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences
Open AccessResearch articles

Sensitivity analysis of Wasserstein distributionally robust optimization problems

Daniel Bartl

Daniel Bartl

Department of Mathematics, University of Vienna, Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria

Google Scholar

Find this author on PubMed

,
Samuel Drapeau

Samuel Drapeau

School of Mathematical Sciences & Shanghai Advanced Institute of Finance, Shanghai Jiao Tong University, 211 West Huaihai Road, Shanghai 200030, People’s Republic of China

Google Scholar

Find this author on PubMed

, and
Johannes Wiesel

Johannes Wiesel

Department of Statistics, Columbia University, 1255 Amsterdam Avenue, New York, NY 10027, USA

Google Scholar

Find this author on PubMed

Published:https://doi.org/10.1098/rspa.2021.0176

    Abstract

    We consider sensitivity of a generic stochastic optimization problem to model uncertainty. We take a non-parametric approach and capture model uncertainty using Wasserstein balls around the postulated model. We provide explicit formulae for the first-order correction to both the value function and the optimizer and further extend our results to optimization under linear constraints. We present applications to statistics, machine learning, mathematical finance and uncertainty quantification. In particular, we provide an explicit first-order approximation for square-root LASSO regression coefficients and deduce coefficient shrinkage compared to the ordinary least-squares regression. We consider robustness of call option pricing and deduce a new Black–Scholes sensitivity, a non-parametric version of the so-called Vega. We also compute sensitivities of optimized certainty equivalents in finance and propose measures to quantify robustness of neural networks to adversarial examples.

    1. Introduction

    We consider a generic stochastic optimization problem

    infaASf(x,a)μ(dx), 1.1
    where A is the set of actions or choices, f is the loss function and μ is a probability measure over the state space S. Such problems are found across the whole of applied mathematics. The measure μ is the crucial input and it could represent, for example, a dynamic model of the system, as is often the case in mathematical finance or mathematical biology, or the empirical measure of observed data points, or the training set, as is the case in statistics and machine learning applications. In virtually all the cases, there is a certain degree of uncertainty around the choice of μ coming from modelling choices and simplifications, incomplete information, data errors, finite sample error, etc. It is thus very important to understand the influence of changes in μ on (1.1), both on its value and on its optimizer. Often, the choice of μ is done in two stages: first a parametric family of models is adopted and then the values of the parameters are fixed. Sensitivity analysis of (1.1) with changing parameters is a classical topic explored in parametric programming and statistical inference, e.g. [13]. It also underscores a lot of progress in the field of uncertainty quantification, see [4]. Considering μ as an abstract parameter, the mathematical programming literature looked into qualitative and quantitative stability of (1.1). We refer to [5,6] and the references therein. When μ represents data samples, there has been a considerable interest in the optimization community in designing algorithms which are robust and, in particular, do not require excessive hypertuning, see [7] and the references therein.

    A more systematic approach to model uncertainty in (1.1) is offered by the distributionally robust optimization problem

    V(δ):=infaAV(δ,a):=infaAsupνBδ(μ)Sf(x,a)ν(dx), 1.2
    where Bδ(μ) is a ball of radius δ around μ in the space of probability measures, as specified below. Such problems greatly generalize more classical robust optimization and have been studied extensively in operations research and machine learning in particular; we refer the reader to [8] and the references therein. Our goal in this paper is to understand the behaviour of these problems for small δ. Our main results compute first-order behaviour of V(δ) and its optimizer for small δ. This offers a measure of sensitivity to errors in model choice and/or specification as well as points in the abstract direction, in the space of models, in which the change is most pronounced. We use examples to show that our results can be applied across a wide spectrum of science.

    This paper is organized as follows. We first present the main results and then, in §3, explore their applications. Further discussion of our results and the related literature is found in §4, which is then followed by the proofs. The online appendix [9] contains many supplementary results and remarks, as well as some more technical arguments from the proofs.

    2. Main results

    Take d,kN, endow Rd with the Euclidean norm || and write Γo for the interior of a set Γ. Assume that S is a closed convex subset of Rd. Let P(S) denote the set of all (Borel) probability measures on S. Further fix a seminorm |||| on Rd and denote by |||| its (extended) dual norm, i.e. ||y||:=sup{x,y:||x||1}. In particular, for ||||=|| we also have ||||=||. For μ,νP(S), we define the p-Wasserstein distance as

    Wp(μ,ν)=inf{S×S||xy||pπ(dx,dy):πCpl(μ,ν)}1/p,
    where Cpl(μ,ν) is the set of all probability measures π on S×S with first marginal π1:=π(×S)=μ and second marginal π2:=π(S×)=ν. Denote the Wasserstein ball
    Bδ(μ)={νP(S):Wp(μ,ν)δ},
    of size δ0 around μ. Note that, taking a suitable probability space (Ω,F,P) and a random variable Xμ, we have the following probabilistic representation of V(δ,a):
    supνBδ(μ)Sf(x,a)ν(dx)=supZEP[f(X+Z,a)],
    where the supremum is taken over all Z satisfying X+ZS almost surely and EP[||Z||p]δp. Wasserstein distances and optimal transport techniques have proved to be powerful and versatile tools in a multitude of applications, from economics [10,11] to image recognition [12]. The idea to use Wasserstein balls to represent model uncertainty was pioneered in [13] in the context of investment problems. When sampling from a measure with a finite pth moment, the measures converge to the true distribution and Wasserstein balls around the empirical measures have the interpretation of confidence sets, see [14]. In this set-up, the radius δ can then be chosen as a function of a given confidence level α and the sample size N. This yields finite sample guarantees and asymptotic consistency, see [15,16], and justifies the use of the Wasserstein metric to capture model uncertainty. The value V(δ,a) in (1.2) has a dual representation, see [17,18], which has led to significant new developments in distributionally robust optimization, e.g.[15,1921].

    Naturally, other choices for the distance on the space of measures are also possible: such as the Kullblack–Leibler divergence, see [22] for general sensitivity results and [23] for applications in portfolio optimization, or the Hellinger distance, see [24] for a statistical robustness analysis. We refer to §4 for a more detailed analysis of the state of the art in these fields. Both of these approaches have good analytic properties and often lead to theoretically appealing closed-form solutions. However, they are also very restrictive since any measure in the neighbourhood of μ has to be absolutely continuous with respect to μ. In particular, if μ is the empirical measure of N observations then measures in its neighbourhood have to be supported on those fixed N points. To obtain meaningful results, it is thus necessary to impose additional structural assumptions, which are often hard to justify solely on the basis of the data at hand and, equally importantly, create another layer of model uncertainty themselves. We refer to [17, sec. 1.1] for further discussion of potential issues with ϕ-divergences. The Wasserstein distance, while harder to handle analytically, is more versatile and does not require any such additional assumptions.

    Throughout the paper, we take the convention that continuity and closure are understood w.r.t. ||. We assume that ARk is convex and closed and that the seminorm |||| is strictly convex in the sense that for two elements x,yRd with ||x||=||y||=1 and ||xy||0, we have ||12x+12y||<1 (note that this is satisfied for every ls-norm |x|s:=(i=1d|xi|s)1/s for s>1). We fix p(1,), let q:=p/(p1) so that 1/p+1/q=1, and fix μP(S) such that the boundary of SRd has μ–zero measure and S|x|pμ(dx)<. Denote by Aδ the set of optimizers for V(δ) in (1.2).

    Assumption 2.1.

    The loss function f:S×AR satisfies

    xf(x,a) is differentiable on So for every aA. Moreover, (x,a)xf(x,a) is continuous and for every r>0 there is c>0 such that |xf(x,a)|c(1+|x|p1) for all xS and aA with |a|r.

    For all δ0 sufficiently small, we have Aδ and for every sequence (δn)nN such that limnδn=0 and (an)nN such that anAδn for all nN there is a subsequence which converges to some aA0.

    The above assumption is not restrictive: the first part merely ensures existence of ||xf(,a)||Lq(μ), while the second part is satisfied as soon as either A is compact or V(0,) is coercive, which is the case in most examples of interest; see [9, lemma 7.15] for further comments.

    Theorem 2.2.

    If assumption 2.1 holds then V(0) is given by

    Υ:=limδ0V(δ)V(0)δ=infaA0(S||xf(x,a)||qμ(dx))1/q.

    Remark.

    Inspecting the proof, defining

    V~(δ)=infaA0supνBδ(μ)Sf(x,a)ν(dx)
    we obtain V~(0)=V(0). This means that for small δ>0 there is no first-order gain from optimizing over all aA in the definition of V(δ) when compared with restricting simply to aA0, as in V~(δ).

    The above result naturally extends to computing sensitivities of robust problems, i.e. V(r), see [9, corollary 7.5], as well as to the case of stochastic optimization under linear constraints, see [9, theorem 7.7]. We recall that V(0,a)=Sf(x,a)μ(dx).

    Assumption 2.3.

    Suppose the f is twice continuously differentiable, aA0Ao and

    i=1k|aixf(x,a)|c(1+|x|p1ε) for some ε>0, c>0, all xS and all a close to a.

    The function aV(0,a) is twice continuously differentiable in the neighbourhood of a and the matrix a2V(0,a) is invertible.

    Theorem 2.4.

    Suppose aA0 and aδAδ such that aδa as δ0 and assumptions 2.1 and 2.3 are satisfied. If xf(x,a)0 μ-a.e. or if xaf(x,a)=0 μ-a.e., then

    :=limδ0aδaδ=(S||xf(x,a)||qμ(dx))(1/q)1(a2V(0,a))1Sxaf(x,a)h(xf(x,a))||xf(x,a)||1qμ(dx),
    where h:Rd{0}{xRd:||x||=1} is the unique function satisfying ,h()=||||, see [9, Lemma 6.2]. In particular, h()=/|| if ||||=||.

    Above and throughout the convention is that xf(x,a)Rd×1, aixf(x,a)Rd×1, af(x,a)Rk×1, xaf(x,a)Rk×d and 0/0=0. The assumed existence and convergence of optimizers holds, e.g. with suitable convexity of f in a; see [9, lemma 7.14] for a worked out setting. In line with the financial economics practice, we gave our sensitivities letter symbols, Υ and , loosely motivated by Υπo´δειγμα, the Greek for Model, and Inline Graphic, the Hebrew for control.

    3. Applications

    We now illustrate the universality of theorems 2.2 and 2.4 by considering their applications in a number of different fields. Unless otherwise stated, S=Rd, A=Rk and means S.

    (a) Financial economics

    We start with the simple example of risk-neutral pricing of a call option written on an underlying asset (St)tT. Here, T,K>0 are the maturity and the strike, respectively, f(x,a)=(S0xK)+ and μ is the distribution of ST/S0. We set interest rates and dividends to zero for simplicity. In [25], the model μ is a lognormal distribution, i.e. log(ST/S0)N(σ2T/2,σ2T) is Gaussian with mean σ2T/2 and variance σ2T. In this case, V(0) is given by the celebrated Black–Scholes formula. Note that this example is particularly simple since f is independent of a. However, to ensure risk-neutral pricing, we have to impose a linear constraint on the measures in Bδ(μ), giving

    supνBδ(μ):xν(dx)=1(S0xK)+ν(dx). 3.1
    To compute its sensitivity we encode the constraint using a Lagrangian and apply theorem 2.2, see [9, remark 7.3, theorem 7.7]. For p=2, letting k=K/S0 and μk=μ([k,)), the resulting formula, see [9, example 7.10], is given by
    Υ=S0(1xkμk)2μ(dx)=S0μk(1μk).
    Let us specialize to the lognormal distribution of the Black–Scholes model above and denote the quantity in (3.1) as RBS(δ). It may be computed exactly using methods from [26]. Figure 1 compares the exact value and the first-order approximation. We have Υ=S0Φ(d)(1Φ(d)), where d=log(S0/K)σ2T/2/σT and Φ is the cdf of N(0,1) distribution. It is also insightful to compare Υ with a parametric sensitivity. If instead of Wasserstein balls, we consider {N(σ~2T/2,σ~2T):|σσ~|δ} the resulting sensitivity is known as the Black–Scholes Vega and given by V=S0Φ(d+σT). We plot the two sensitivities in figure 2. It is remarkable how, for the range of strikes of interest, the non-parametric model sensitivity Υ traces out the usual shape of V but shifted upwards to account for the idiosyncratic risk of departure from the lognormal family. More generally, given a book of options with payoff f=f+f at time T, with f+, f0, we could say that the book is Υ-neutral if the sensitivity Υ was the same for f+ and for f. In analogy to Delta-Vega hedging standard, one could develop a non-parametric model-agnostic Delta-Upsilon hedging. We believe these ideas offer potential for exciting industrial applications and we leave them to further research.
    Figure 1.

    Figure 1. DRO value RBS(δ) versus the first order (FO) approximation RBS(0)+Υδ, S0=T=1, K=1.2, σ=0.2. (Online version in colour.)

    Figure 2.

    Figure 2. Black–Scholes model: Υ versus V, S0=T=1, σ=0.2. (Online version in colour.)

    We turn now to the classical notion of the optimized certainty equivalent (OCE) of [27]. It is a decision theoretic criterion designed to split a liability between today's and tomorrow’s payments. It is also a convex risk measure in the sense of [28] and covers many of the popular risk measures such as expected shortfall or entropic risk, see [29]. We fix a convex monotone function l:RR which is bounded from below and g:RdR. Here, g represents the payoff of a financial position and l is the negative of a utility function, or a loss function. We take ||||=|| and refer to [9, lemma 7.14] for generic sufficient conditions for assumptions 2.1 and 2.3 to hold in this setup. The OCE corresponds to V in (1.1) for f(x,a)=l(g(x)a)+a and A=R, S=Rd. Theorems 2.2 and 2.4 yield the sensitivities

    Υ=infaA0(|l(g(x)a)g(x)|qμ(dx))1/q,=(|l(g(x)a)g(x)|2μ(dz))1/2l(g(x)a)l(g(x)a)(g(x))2μ(dx)l(g(x)a)μ(dx),
    where, for simplicity, we took p=q=2 for the latter.

    A related problem considers hedging strategies which minimize the expected loss of the hedged position, i.e. f(x,a)=l(g(x)+a,xx0), where A=Rk and (x0,x) represent today's and tomorrow’s traded prices. We compute Υ as

    infaA0(|l(g(x)+a,xx0)(g(x)+a)|qμ(dx))1/q.
    Furthermore we can combine loss minimization with OCE and consider a=(H,m)Rk×R, f(x,(h,m))=l(g(x)+H,xx0+m)m. This gives V(0) as the infimum over (H,m)A0 of
    (|l(g(x)+H,xx0+m)(g(x)+H)|qμ(dx))1/q.
    The above formulae capture non-parametric sensitivity to model uncertainty for examples of key risk measurements in financial economics. To the best of our knowledge, this has not been achieved before.

    Finally, we consider briefly the classical mean-variance optimization of [30]. Here μ represents the loss distribution across the assets and aRd, i=1dai=1 are the relative investment weights. The original problem is to minimize the sum of the expectation and γ standard deviations of returns a,X, with Xμ. Using the ideas in [31, Example 2] and considering measures on Rd×Rd, we can recast the problem as (1.1). While [31] focused on the asymptotic regime δ, their non-asymptotic statements are related to our theorem 2.2 and either result could be used here to obtain that V(δ)V(0)+1γ2δ for small δ.

    (b) Neural networks

    We specialize now to quantifying robustness of neural networks (NN) to adversarial examples. This has been an important topic in machine learning since [32] observed that NN consistently misclassify inputs formed by applying small worst-case perturbations to a dataset. This produced a number of works offering either explanations for these effects or algorithms to create such adversarial examples, e.g. [3339] to name just a few. The main focus of research works in this area, see [40], has been on faster algorithms for finding adversarial examples, typically leading to an overfit to these examples without any significant generalization properties. The viewpoint has been mainly pointwise, e.g. [32], with some generalizations to probabilistic robustness, e.g. [39].

    In contrast, we propose a simple metric for measuring robustness of NN which is independent of the architecture employed and the algorithms for identifying adversarial examples. In fact, theorem 2.2 offers a simple and intuitive way to formalize robustness of NN: for simplicity consider a 1-layer neural network trained on a given distribution μ of pairs (x,y), i.e. (A1,A2,b1,b2) solve

    inf|y((A2()+b2)σ(A1()+b1))(x)|pμ(dx,dy),
    where the inf is taken over a=(A1,A2,b1,b2)A=Rk×d×Rd×k×Rk×Rd, for a given activation function σ:RR, where the composition above is understood componentwise. Set f(x,y;A,b):=|y(A2()+b2)σ(A1()+b1)(x)|p. Data perturbations are captured by νBδp(μ) and (1.2) offers a robust training procedure. The first-order quantification of the NN sensitivity to adversarial data is then given by
    (|f(x,y;A,b)|qμ(dx,dy))1/q.
    A similar viewpoint, capturing robustness to adversarial examples through the optimal transport lens, has been recently adopted by other authors. The dual formulation of (1.2) was used by [21] to reduce the training of neural networks to tractable linear programs. [41] modified (1.2) to consider a penalized problem infaAsupνP(S)Sf(x,a)ν(dx)γWp(μ,ν) to propose new stochastic gradient descent algorithms with inbuilt robustness to adversarial data.

    (c) Uncertainty quantification

    In the context of UQ, the measure μ represents input parameters of a (possibly complicated) operation G in a physical, engineering or economic system. We consider the so-called reliability or certification problem: for a given set E of undesirable outcomes, one wants to control supνPν(G(x)E), for a set of probability measures P. The distributionally robust adversarial classification problem considered recently by [42] is also of this form, with Wasserstein balls P around an empirical measure of N samples. Using the dual formulation of [18], they linked the problem to minimization of the conditional value-at-risk and proposed a reformulation, and numerical methods, in the case of linear classification. We propose instead a regularized version of the problem and look for

    δ(α):=sup{δ0:infνBδ(μ)d(G(x),E)ν(dx)α},
    for a given safety level α. We thus consider the average distance to the undesirable set, d(G(x),E):=infeE|G(x)e|, and not just its probability. The quantity δ(α) could then be used to quantify the implicit uncertainty of the certification problem, where higher δ corresponds to less uncertainty. Taking statistical confidence bounds of the empirical measure in Wasserstein distance into account, see [14], δ would then determine the minimum number of samples needed to estimate the empirical measure.

    Assume that E is convex. Then xd(x,E) differentiable everywhere except at the boundary of E with xd(x,E)=0 for xEo and |xd(x,E)|=1 for all xE¯c. Furthermore, assume μ is absolutely continuous w.r.t. Lebesgue measure on S. Theorem 2.2, using [9, remark 7.3], gives a first-order expansion for the above problem:

    infνBδ(μ)d(G(x),E)ν(dx)=d(G(x),E)μ(dx)(|xd(G(x),E)xG(x)|qμ(dx))1/qδ+o(δ).
    In the special case xG(x)=cI this simplifies to
    d(G(x),E)μ(dx)c(μ(G(x)E))1/qδ+o(δ),
    and the minimal measure ν pushes every point G(x) not contained in E in the direction of the orthogonal projection. This recovers the intuition of [43, theorem 1], which in turn relies on [17, corollary 2, example 7]. Note however that our result holds for general measures μ. We also note that such an approximation could provide an ansatz for dimension reduction, by identifying the dimensions for which the partial derivatives are negligible and then projecting G on to the corresponding lower-dimensional subspace (thus providing a simpler surrogate for G). This would be an alternative to a basis expansion (e.g. in orthogonal polynomials) used in UQ and would exploit the interplay between the properties of G and μ simultaneously.

    (d) Statistics

    We discuss two applications of our results in the realm of statistics. We start by highlighting the link between our results and the so-called influence curves (IC) in robust statistics. For a functional μT(μ) its IC is defined as

    IC(y)=limt0T(tδy+(1t)μ)T(μ)t.
    Computing the IC, if it exists, is in general hard and closed form solutions may be unachievable. However, for the so-called M-estimators, defined as optimizers for V(0),
    T(μ):=argminaf(x,a)μ(dx),
    for some f (e.g. f(x,a)=|xa| for the median), we have
    IC(y)=af(y,T(μ))a2f(s,T(μ))μ(ds),
    under suitable assumptions on f, see [44, section 3.2.1]. In comparison, writing Tδ for the optimizer for V(δ), theorem 2.4 yields
    limδ0TδT(μ)δ=xaf(x,T(μ))xf(x,T(μ))μ(dx)a2f(s,T(μ))μ(ds), 3.2
    under assumption 2.3 and normalization ||xf(x,T(μ))||Lp(μ)=1. To investigate the connection let us Taylor-expand IC(y) around x to obtain
    IC(y)IC(x)=axf(x,T(μ))a2f(s,T(μ))μ(ds)(yx).
    Choosing y=x+δfx(x,T(μ)) and integrating both sides over μ and dividing by δ, we obtain the asymptotic equality
    IC(x+δxf(x,T(μ)))IC(x)δμ(dx)TδT(μ)δ,
    for δ0 by (3.2). We conclude that considering the average directional derivative of IC in the direction of fx(x,T(μ)) gives our first-order sensitivity. For an interesting conjecture regarding the comparison of influence functions and sensitivities in KL-divergence, we refer to [45, Section 7.3] and [22, Section 3.4.2].

    Our second application in statistics exploits the representation of the LASSO/Ridge regressions as robust versions of the standard linear regression. We consider A=Rk and S=Rk+1. If instead of the Euclidean metric we take ||(x,y)||=|x|r1{y=0}+1{y0}, for some r>1 and (x,y)Rk×R, in the definition of the Wasserstein distance, then [19] showed that

    infaRksupνBδ(μ)(yx,a)2ν(dx,dy)=infaRk((ya,x)2μ(dx,dy)+δ|a|s)2 3.3
    holds, where 1/r+1/s=1. The δ=0 case is the ordinary least-squares regression. For δ>0, the r.h.s. for s=2 is directly related to the Ridge regression, while the limiting case s=1 is called the square-root LASSO regression, a regularized variant of linear regression well known for its good empirical performance. Closed-form solutions to (3.3) do not exist in general and it is a common practice to use numerical routines to solve it approximately. Theorem 2.4 offers instead an explicit first-order approximation of aδ for small δ. We denote by a the ordinary least-squares estimator and by I the k×k identity matrix. Note that the first-order condition on a implies that (ya,x)xiμ(dx,dy)=0 for all 1ik. In particular, V(0)=(y2a,xy)μ(dx,dy) and a=D1yxμ(dx,dy), where we assume the system is overdetermined so that D=xxTμ(dx,dy) is invertible. A direct computation, see [9, example 8.2], yields
    aδaV(0)D1h(a)δ. 3.4
    For s=2, h(a)=a/|a|2 and for s=1, h(a)=sign(a) and hence1 aδ is approximately
    (1V(0)|a|2D1δ)aandaV(0)D1sign(a)δ, 3.5
    respectively. This corresponds to parameter shrinkage: proportional for square-root Ridge and a shift towards zero for square-root LASSO. To the best of our knowledge, these are first such results and we stress that our formulae are valid in a general context and, in particular, parameter shrinkage depends on the direction through the D1 factor. Figure 3 compares the first-order approximation with the actual results and shows a remarkable fit. Furthermore, our results agree with what is known in the canonical test case for the (standard) Ridge and LASSO, see [46]. When μ=μN is the empirical measure of N i.i.d. observations, the data are centred and the covariates are orthogonal, i.e. D=(1/N)I. In that case, (3.5) simplifies to
    (1δN(1R21))aandaN|y|1R2sign(a)δ,
    where R2 is the usual coefficient of determination.
    Figure 3.

    Figure 3. Square-root LASSO parameter shrinkage aδa0: exact (o) and the first-order approximation (x) in (3.5). 2000 observations generated according to Y=1.5X13X22X3+0.3X40.5X50.7X6+0.2X7+0.5X8+1.2X9+0.8X10+ε with all Xi, ε i.i.d. N(0,1). (Online version in colour.)

    The case of μN is naturally of particular importance in statistics and data science and we continue to consider it in the next subsection. In particular, we characterize the asymptotic distribution of N(a1/Na), where aδAδ(μN) and aA0(μ) is the optimizer of the non-robust problem for the data-generating measure. This recovers the central limit theorem of [47], a link we explain further in §4b.

    (e) Out-of-sample error

    A benchmark of paramount importance in optimization is the so-called out-of-sample error, also known as the prediction error in statistical learning. Consider the setup above when μN is the empirical measure of N i.i.d. observations sampled from the ‘true’ distribution μ=μ and take, for simplicity, ||||=||s, with s>1. Our aim is to compute the optimal a which solves the original problem (1.1). However, we only have access to the training set, encoded via μN. Suppose we solve the distributionally robust optimization problem (1.2) for μN and denote the robust optimizer aδ,N. Then the out-of-sample error

    V(0,aδ,N)V(0,a)=f(x,aδ,N)μ(dx)f(x,a)μ(dx)
    quantifies the error from using aδ,N as opposed to the true optimizer a.

    While this expression seems to be hard to compute explicitly for finite samples, theorem 2.4 offers a way to find the asymptotic distribution of a (suitably rescaled version of) the out-of-sample error. We suppose the assumptions in theorem 2.4 are satisfied and note that the first order condition for a gives aV(0,a)=0. Then, a second-order Taylor expansion gives

    V(0,aδ,N)V(0,a)=12(aδ,Na)Ta2V(0,a~)(aδ,Na), 3.6
    for some a~ (coordinate-wise) between a and aδ,N. Now we write
    aδ,Na=aδ,Na,N+a,Na,
    where we define a,N as the optimizer of the non-robust problem (1.1) with μ replaced by μN. In particular, the δ-method for M-estimators implies that
    N(a,Na)(a2V(0,a))1H, 3.7
    where HN(0,(af(x,a))Taf(x,a)μ(dx)) and denotes the convergence in distribution. On the other hand, for a fixed NN, theorem 2.4 applied to μN yields
    aδ,Na,N=(|xf(x,a,N)|sqμN(dx))(1/q)1(a2f(x,a,N)μN(dx))1xaf(x,a,N)h(xf(x,a,N))|xf(x,a,N)|s1qμN(dx)δ+o(δ) 3.8
    =((a2V(0,a))1Θ+ΔN)δ+o(δ), 3.9
    where
    Θ:=(|xf(x,a)|sqμ(dx))(1/q)1xaf(x,a)h(xf(x,a))|xf(x,a)|s1qμ(dx),ΔN:=(|xf(x,a,N)|sqμN(dx))(1/q)1(a2f(x,a,N)μN(dx))1xaf(x,a,N)h(xf(x,a,N))|xf(x,a,N)|s1qμN(dx)(a2V(0,a))1Θ.
    Almost surely (w.r.t. sampling of μN), we know that μNμ in Wp as N, and under the regularity and growth assumptions on f in [9, equation (8.2)] we check that ΔN0 a.s., see [9, example 8.4] for details. In particular, taking δ=1/N and combining the above with (3.7) we obtain
    N(a1/N,Na)(a2V(0,a))1(HΘ). 3.10
    This recovers the central limit theorem of [47], as discussed in more detail in §4b below. Together, (3.6) and (3.9) give us the a.s. asymptotic behaviour of the out-of-sample error
    V(0,aδ,N)V(0,a)=12N(HΘ)T(a2V(0,a))1(HΘ)+o(1N). 3.11
    These results also extend and complement [48, Prop. 17]. [48] investigate when the distributionally robust optimizers aδ,N yield, on average, better performance than the simple in-sample optimizer a,N. To this end, they consider the expectation, over the realizations of the empirical measure μN of
    V(0,aδ,N)V(0,a,N)=f(x,aδ,N)μ(dx)f(x,a,N)μ(dx).
    This is closely related to the out-of-sample error and our derivations above can be easily modified. The first-order term in the Taylor expansion no longer vanishes and, instead of (3.6), we now have
    V(0,aδ,N)V(0,a,N)=aV(0,a,N)(aδ,Na,N)+o(|aδ,Na,N|),
    which holds, e.g. if for any r>0, there exists c>0 such that i=1k|aaif(x,a)|c(1+|x|p) for all xS, |a|r. Combined with (3.8), this gives asymptotics in small δ for a fixed N. For quadratic f and taking q, we recover the result in [48, Prop. 17], see [9, example 8.4] for details.

    4. Further discussion and literature review

    We start with an overview of related literature and then focus specifically on a comparison of our results with the CLT of [47] mentioned above.

    (a) Discussion of related literature

    Let us first remark, that while theorem 2.2 offers some superficial similarities to a classical maximum theorem, which is usually concerned with continuity properties of δV(δ), in this work, we are instead interested in the exact first derivative of the function δV(δ). Indeed, the convergence limδ0supνBδ(μ)f(x)ν(dx)=f(x)μ(dx) follows for all f satisfying f(x)c(1+|x|p) directly from the definition of convergence in Wasserstein metric (e.g. [49, Def. 6.8]). In conclusion, the main issue is to quantify the rate of this convergence by calculating the first derivative V(δ).

    Our work investigates model uncertainty broadly conceived: it includes errors related to the choice of models from a particular (parametric or not) class of models as well as the mis-specification of such a class altogether (or indeed, its absence). In the decision theoretic literature, these aspects are sometimes referred to as model ambiguity and model mis-specification, respectively, see [50]. However, seeing our main problem (1.2) in decision theoretic terms is not necessarily helpful as we think of f as given and not coming from some latent expected utility type of problem. In particular, our actions aA are just constants.

    In our work, we decided to capture the uncertainty in the specification of μ using neighbourhoods in the Wasserstein distance. As already mentioned, other choices are possible and have been used in past. Possibly, the most often used alternative is the relative entropy, or the Kullblack–Leibler divergence. In particular, it has been used in this context in economics, see [51]. To the best of our knowledge, the only comparable study of sensitivities with respect to relative entropy balls is [22], see also [45] allowing for additional marginal constraints. However, this only considered the specific case f(x,a)=f(x) where the reward function is independent of the action. Its main result is

    supνBδKL(μ)f(x)ν(dx)=f(x)μ(dx)+2Varμ(f(X))δ+13κ3(f(X))Varμ(f(X))δ2+O(δ3),
    where BδKL(μ) is a ball of radius δ2 centred around μ in KL-divergence, Varμ(f(X)) and κ3(f(X)) denote the variance and kurtosis of f under the measure μ respectively. In particular, the first-order sensitivity involves the function f itself. By contrast, our theorem 2.2 states V(δ)=((f(x))2μ(dx))1/2 and involves the first derivative f. In the trivial case of a point mass μ=δx, we recover the intuitive sensitivity V(δ)=|f(x)|, while the results of [22] do not apply for this case. We also note that [22] requires exponential moments of the function f under the baseline measure μ, while we only require polynomial moments. In particular, in applications in econometrics (or any field in which μ typically has fat tails), the scope of application of the corresponding results might then be decisively different. We remark however, that this requirement can be substantially weakened (to the existence of polynomial moments) when replacing KL-divergences by α-divergences, e.g. [52,53]. We expect a sensitivity analysis similar to [22] to hold in this setting. However, to the best of our knowledge no explicit results seem to be available in the literature.

    To understand the relative technical difficulties and merits, it is insightful to go into the details of the statements. In fact, in the case of relative entropy and the one-period set-up we are considering, the exact form of the optimizing density can be determined exactly (see [22, Proposition 3.1]) up to a one-dimensional Langrange parameter. This is well known and is the reason behind the usual elegant formulae obtained in this context. But this then reduces the problem in [22] to a one-dimensional problem, which can be well-approximated via a Taylor approximation. By contrast, when we consider balls in the Wasserstein distance, the form of the optimizing measure is not known (apart from some degenerate cases). In fact, a key insight of our results is that the optimizing measure can be approximated by a deterministic shift in the direction (x+f(x)δ)μ (this is, in general, not exact but only true as a first-order approximation). The reason for these contrastive starting points of the analyses is the fact that Wasserstein balls contain a more heterogeneous set of measures, while in the case of relative entropy, exponentiating f will always do the trick. We remark however that this is not true for the finite-horizon problems considered in [22, Section 3.2] any more, where the worst-case measure is found using an elaborate fixed-point equation.

    A point which further emphasizes the fact that the topology introduced by the Wasserstein metric is less tractable is the fact that

    Wpp(μ,ν)=limε0infπΠ(μ,ν)|xy|pπ(dx,dy)+εH(πμν)=limε0εinfπΠ(μ,ν)H(πRε),
    where H(πRε)=log(dπdRε)dπ is the relative entropy and
    dRε=c0exp(|xy|pε)d(μν),
    for some normalizing constant c0>0 (e.g. [54]). This is known as the entropic optimal transport formulation and has received considerable interest in the ML community in the past years (e.g. [55]). In particular, the Wasserstein distance can be approximated by relative entropy, but only with respect to reference measures on the product space. As we consider optimization over ν above it amounts to changing the reference measure. In consequence, the topological structure imposed by Wasserstein distances is more intricate compared to relative entropy, but also more flexible.

    The other well-studied distance is the Hellinger distance. [24] calculates influence curves for the minimum Hellinger distance estimator aHell, on a countable sample space. Their main result is that for the choice f(x,a)=log((x,a)) (where ((x,a))aA is a collection of parametric densities)

    IC(x)=(a2V(0,aHell,))1alog((x,aHell,)),
    the product of the inverse Fisher information matrix and the score function, which is the same as for the classical maximum-likelihood estimator. Denote by μN the empirical measure of N data samples and by aHell,(N) the corresponding minimum Hellinger distance estimator for μN. In particular, the above result then implies the same CLT as for M-estimators given by
    N1/2(aHell,(N)aHell,)(a2V(0,aHell,))1H,
    where HN(0,af(x,aHell,)Taf(x,aHell,)μ(dx)). As we discuss in the next section, our theorem 2.4 yields a similar CLT, namely
    N1/2(a1/N,Na)(a2V(0,a))1(Ha|xf(x,a)|s2μ(dx)).
    Thus the Wasserstein worst-case approach leads to a shift of the mean of the normal distribution in the direction
    a|xf(x,a)|s2μ(dx),
    compared to the non-robust case. In the simple case μ=N(0,σ2) with standard deviation σ>0, we obtain the MLE σ,N=1Nk=1NXi2. We can directly compute (for a=σ) that
    σ|x(const.+log(exp(x22(σ)2)))|s2μ(dx)=σx2(σ)4μ(dx)=σσ(σ)2=σ1σ=1(σ)2.
    Thus the robust approach accounts for a shift of 1/(σ)2 (of order 1 if mulitplied with inverse Fisher information) to account for a possibly higher variance in the underlying data. In particular, in our approach, the so-called neutral spaces considered (e.g. [56], eqn (21)]) as
    {a:(aa)Ta2V(0,a)(aa)δ}
    should also take this shift into account, i.e. their definition should be adjusted to
    {a:(aa+a|xf(x,a)|s2μ(dx))Ta2V(0,a)(aa+a|xf(x,a)|s2μ(dx))δ}.
    Lastly, let us mention another situation when our approach provides directly interpretable insights in the context of a parametric family of models. Namely, if one considers a family of models P such that the worst-case model in the Wasserstein ball remains in P, i.e. (x+f(x)δ)μP, then considering (the first-order approximation to) model uncertainty over Wasserstein balls actually reduces to considerations within the parametric family. While uncommon, such a situation would arise, for example, for a scale-location family P, with μP and a linear/quadratic f.

    (b) Link to the central limit theorem of [47]

    As observed in §3e above, theorem 2.4 allows to recover the main results in [47]. We explain this now in detail. Set ||||=||s, p=q=2, S=Rd. Let μN denote the empirical measure of N i.i.d. samples from μ. We impose the assumptions on μ and f from [47], including Lipschitz continuity of gradients of f and strict convexity. These, in particular, imply that the optimizers aδ,N,a,N and a, as defined in §3e are well defined and unique, and further a1/N,Na as N. [47, Thm. 1] implies that, as N,

    N(a1/N,Na)(a2V(0,a))1(Ha|xf(x,a)|s2μ(dx)), 4.1
    where HN(0,af(x,a)Taf(x,a)μ(dx)). We note that for ||||=||s we have
    h(x)=(sign(x1)|x1|s1,,sign(xk)|xk|s1)|x|s1s=x|x|s.
    Thus
    a|xf(x,a)|s2μ(dx)=|xf(x,a)|sh(xf(x,a))xaf(x,a)μ(dx)|xf(x,a)|s2μ(dx),
    and (4.1) agrees with (3.10) which is justified by the Lipschitz growth assumptions on f, xf(x,a) and axf(x,a) from [47], see [9, equation (8.2)]. In particular, theorem 2.4 implies (4.1) as a special case. While this connection is insightful to establish2 it is also worth stressing that the proofs in [47] pass through the dual formulation and are thus substantially different from ours. Furthermore, while theorem 2.4 holds under milder assumptions on f than those in [47], the last argument in our reasoning above requires the stronger assumptions on f. It is thus not clear if our results could help to significantly weaken the assumptions in the central limit theorems of [47].

    5. Proofs

    We consider the case S=Rd and ||||=|| here. For the general case and additional details, we refer to [9]. When clear from the context, we do not indicate the space over which we integrate.

    Proof of theorem 2.2.

    For every δ0, let Cδ(μ) denote those πP(Rd×Rd) which satisfy

    π1=μand(|xy|pπ(dx,dy))1/pδ.
    As the infimum in the definition of Wp(μ,ν) is attained (see [49, Theorem 4.1, p. 43]) one has Bδ(μ)={π2:πCδ(μ)}.

    We start by showing the ‘’ inequality in the statement. For any aA0, one has V(δ)supνBδ(μ)f(y,a)ν(dy) with equality for δ=0. Therefore, differentiating f(,a) and using both Fubini’s theorem and Hölder’s inequality, we obtain that

    V(δ)V(0)supπCδ(μ)f(y,a)f(x,a)π(dx,dy)=supπCδ(μ)01xf(x+t(yx),a),(yx)π(dx,dy)dtδsupπCδ(μ)01(|xf(x+t(yx),a)|qπ(dx,dy))1/qdt.
    Any choice πδCδ(μ) converges in p-Wasserstein distance on P(Rd×Rd) to the pushforward measure of μ under the mapping x(x,x), which we denote [x(x,x)]μ. This can be seen by, for example, considering the coupling [(x,y)(x,y,x,x)]πδ between πδ and [x(x,x)]μ. Now note that q=p/(p1) and the growth assumption on xf(,a) implies
    |xf(x+t(yx),a)|qc(1+|x|p+|y|p), 5.1
    for some c>0 and all x,yRd, t[0,1]. In particular, |xf(x+t(yx),a)|qπδ(dx,dy)C for all t[0,1] and small δ>0, for another constant C>0. As further (x,y)|xf(x+t(yx),a)|q is continuous for every t, the p-Wasserstein convergence of πδ to [x(x,x)]μ implies that
    |xf(x+t(yx),a)|qπδ(dx,dy)|xf(x,a)|qμ(dx),
    for every t[0,1] for δ0, see [9, lemma 7.13]. Dominated convergence (in t) then yields ‘’ in the statement of the theorem.

    We turn now to the opposite ‘’ inequality. As V(δ)V(0) for every δ>0, there is no loss of generality in assuming that the right-hand side is not equal to zero. Now take any, for notational simplicity not relabelled, subsequence of (δ)δ>0 which attains the liminf in (V(δ)V(0))/δ and pick aδAδ. By assumption, for a (again not relabelled) subsequence, one has aδaA0. Further note that V(0)f(x,aδ)μ(dx) which implies

    V(δ)V(0)supπCδ(μ)f(y,aδ)f(x,aδ)π(dx,dy).
    Now define πδ:=[x(x,x+δT(x))]μ, where
    T(x):=xf(x,a)|xf(x,a)|2q(|xf(z,a)|qμ(dz))1/q1
    for xRd with the convention 0/0=0. Note that the integral is well defined since, as before in (5.1), one has |xf(x,a)|qC(1+|x|p) for some C>0 and the latter is integrable under μ. Using that pqp=q it further follows that
    |xy|pπδ(dx,dy)=δp|T(x)|pμ(dx)=δp|xf(x,a)|p+pq2pμ(dx)(|xf(z,a)|qμ(dz))p(11/q)=δp.
    In particular, πδCδ(μ) and we can use it to estimate from below the supremum over Cδ(μ) giving
    V(δ)V(0)δ1δf(x+δT(x),aδ)f(x,aδ)μ(dx)=01xf(x+tδT(x),aδ),T(x)μ(dx)dt.
    For any t[0,1], with δ0, the inner integral converges to
    xf(x,a),T(x)μ(dx)=(|xf(x,a)|qμ(dx))1/q.
    The last equality follows from the definition of T and a simple calculation. To justify the convergence, first note that xf(x+tδT(x),aδ),T(x)xf(x,a),T(x) for all xRd by continuity of xf and since aδa. Moreover, as before in (5.1), one has |T(x)|c(1+|x|) for some c>0, hence |xf(x+tδT(x),a),T(x)|C(1+|x|p) for some C>0 and all t[0,1]. The latter is integrable under μ; hence convergence of the integrals follows from the dominated convergence theorem. This concludes the proof.

    Proof of theorem 2.4.

    We first show that

    limδ0aiV(0,aδ)δ=xaif(x,a)xf(x,a)|xf(x,a)|2qμ(dx)(|xf(x,a)|qμ(dx))1/q1 5.2
    for all i{1,,k}. We start with the ‘’ inequality. For any aAo, we have
    af(y,a)af(x,a)=01xaf(x+t(yx),a)(yx)dt.
    Let δ>0 and recall that aδAδ converge to aA0. Let Bδ(μ,aδ) denote the set of νBδ(μ) which attain the value: f(x,aδ)ν(dx)=V(δ). This is non-empty by assumption 2.3 and [9, lemma 7.16]. By [9, lemma 8.5] the function aV(δ,a) is (one-sided) directionally differentiable at aδ for all δ>0 small and thus for all i{1,,k}
    supνBδ(μ,aδ)aif(x,aδ)ν(dx)0.
    Then, using Lagrange multipliers to encode the optimality of Bδ(μ,aδ) in Bδ(μ), we obtain
    aiV(0,aδ)supνBδ(μ,aδ)aif(y,aδ)ν(dy)aiV(0,aδ)=supνBδ(μ)infλR([aif(y,aδ)+λ(f(y,aδ)V(δ))]ν(dy)[aif(x,aδ)+λ(f(x,aδ)V(0,aδ))]μ(dx))=infλR(supπCδ(μ)01xaif(x+t(yx),aδ)+λxf(x+t(yx),aδ),yxπ(dx,dy)dtλsupπCδ(μ)01xf(x+t(yx),aδ,yxπ(dx,dy)dt),
    where we used a minimax argument as well as Fubini’s theorem. We note that the functions above satisfy the assumptions of theorem 2.2 for a fixed λ. In particular, using exactly the same arguments as in the proof of theorem 2.2 (i.e. Hölder’s inequality and a specific transport attaining the supremum) we obtain by exchanging the order of lim sup and inf that
    lim supδ0aiV(0,aδ)δinfλR((|xaif(x,a)+λxf(x,a)|qμ(dx))1/qλ(|xf(x,a)|qμ(dx))1/q). 5.3
    For q=2, the infimum can be computed explicitly and equals
    xaif(x,a),xf(x,a)μ(dx)|xf(x,a)|2μ(dx).
    For the general case, we refer to [9, lemma 8.6], noting that by assumption xf(x,a)0, we see that the r.h.s. above is equal to the r.h.s. in (5.2).

    The proof of the ‘’ inequality in (5.2) follows by the very same arguments. Indeed, [9, lemma 8.5] implies that

    infνBδ(μ,aδ)aif(x,aδ)ν(dx)0,
    for all i{1,,k} and we can write
    aiV(0,aδ)infνBδ(μ,aδ)aif(y,aδ)ν(dy)aiV(0,aδ)=infνBδ(μ)supλR([aif(y,aδ)+λ(f(y,aδ)V(δ))]ν(dy)[aif(x,aδ)+λ(f(x,aδ)V(0,aδ))]μ(dx)).
    From here on, we argue as in the ‘’ inequality and conclude that indeed (5.2) holds.

    By assumption, the matrix a2V(0,a) is invertible. Therefore, in a small neighbourhood of a, the mapping aV(0,) is invertible. In particular, aδ=(aV(0,))1(aV(0,aδ)) and by the first-order condition a=(aV(0,))1(0). Applying the chain rule and using (5.2) gives

    limδ0aδaδ=(a2V(0,a))1limδ0aV(0,aδ)δ=(a2V(0,a))1(|xf(z,a)|qμ(dz))1/q1xaf(x,a)xf(x,a)|xf(x,a)|2qμ(dx).
    This completes the proof.

    Footnotes

    1 In the case s=1, inspecting the proof, we see that theorem 2.4 still holds since a does not have zero components μ-a.s., which are the only points of discontinuity of h.

    2 We thank Jose Blanchet for pointing out the possible link and encouraging us to explore it.

    Data accessibility

    The codes used to generate figures in the paper are available on GitHub: http://github.com/JanObloj/Robust-uncertainty-sensitivity-analysis.

    Authors' contributions

    D.B., S.D., J.O. and J.W. formulated the mathematical problem, carried out the analysis, established the main results and drew conclusions. J.O. and J.W. wrote the first draft of the paper. D.B. and J.W. wrote the first draft of the appendix. S.D. and J.W. performed the numerical analysis. All the authors proof read and corrected the manuscript, gave final approval for publication and agree to be held accountable for the work performed therein.

    Competing interests

    The authors declare no competing interests.

    Funding

    This work was supported by the European Research Council [7th FP/ERC grant agreement no. 335421], the Vienna Science and Technology Fund (WWTF) [project MA16-021], the Austrian Science Fund (FWF) [project P28661] and the National Science Foundation of China (grant nos 11971310 and 11671257).

    Acknowledgements

    We thank Jose Blanchet, Mike Giles, Daniel Kuhn and Peyman Mohajerin Esfahani for their helpful comments on an earlier draft of this paper.

    Electronic supplementary material is available online at https://doi.org/10.6084/m9.figshare.c.5730987.

    Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.

    References