Risk-aware multi-armed bandit problem with application to portfolio selection

Sequential portfolio selection has attracted increasing interest in the machine learning and quantitative finance communities in recent years. As a mathematical framework for reinforcement learning policies, the stochastic multi-armed bandit problem addresses the primary difficulty in sequential decision-making under uncertainty, namely the exploration versus exploitation dilemma, and therefore provides a natural connection to portfolio selection. In this paper, we incorporate risk awareness into the classic multi-armed bandit setting and introduce an algorithm to construct a portfolio. By filtering assets based on the topological structure of the financial market and combining the optimal multi-armed bandit policy with the minimization of a coherent risk measure, we achieve a balance between risk and return.


Introduction
Portfolio selection is a popular area of study in finance, attracting both academic researchers and fund managers. The problem involves determining the best combination of assets to be held in the portfolio in order to achieve the investor's objectives, such as maximizing the cumulative return relative to some risk measure. In the finance community, the traditional approach to this problem can be traced back to 1952 with Markowitz's seminal paper [1], which introduces mean-variance analysis, also known as modern portfolio theory (MPT), and suggests choosing the allocation that maximizes the expected return for a certain risk level quantified by variance. On the other hand, sequential portfolio selection models have been developed in the mathematics and computer science communities, for example, Cover's universal portfolio strategy [2] and Helmbold's multiplicative update portfolio strategy [3]; see also Li & Hoi [4] for a comprehensive survey. In recent years, with the unprecedented success of AI and machine learning methods evidenced by AlphaGo defeating the world champion and OpenAI's bot beating professional Dota players, more creative machine learning-based portfolio selection strategies have also emerged [5,6].
Many practical problems, including portfolio selection, clinical trials, online advertising and robotics, can be modelled as sequential decision-making under uncertainty [7]. In such a process, at each trial the learner faces the trade-off between acting ambitiously to acquire new knowledge and acting conservatively to take advantage of current knowledge, which is commonly known as the exploration versus exploitation dilemma. Often understood as a single-state Markov decision process (MDP), the stochastic multi-armed bandit problem provides an extremely intuitive mathematical framework to study sequential decision-making.
An abstraction of this setting involves a set of K slot machines and a sequence of N trials. At each trial $t = 1, \ldots, N$, the learner chooses to play one of the machines $I_t \in \{1, \ldots, K\}$ and receives a reward $R_{I_t,t}$ drawn randomly from the corresponding fixed but unknown probability distribution $\nu_{I_t}$, whose mean is $\mu_{I_t}$. In the classic setting, the random rewards of the same machine across time are assumed to be independent and identically distributed, and the rewards of different machines are also independent. The objective of the learner is to develop a policy, an algorithm that specifies which machine to play at each trial, to maximize cumulative rewards. A popular measure for the performance of a policy is the regret after some n trials, which is defined to be

$$\xi(n) = \max_{i = 1, \ldots, K} \sum_{t=1}^{n} R_{i,t} - \sum_{t=1}^{n} R_{I_t,t}.$$

However, in a stochastic model it is more intuitive to compare rewards in expectation and use pseudo-regret [8]. Let $T_i(n)$ be the number of times machine i is played during the first n trials and let $\mu^* = \max\{\mu_1, \ldots, \mu_K\}$. Then,

$$\bar{\xi}(n) = n\mu^* - \sum_{t=1}^{n} \mathbb{E}[\mu_{I_t}] = \sum_{i=1}^{K} (\mu^* - \mu_i)\, \mathbb{E}[T_i(n)].$$

Thus, the learner's objective of maximizing cumulative rewards is equivalent to minimizing regret. The asymptotic lower bound on the best possible growth rate of total regret was proved by Lai & Robbins [9]; it is of order log n, with a coefficient determined by the suboptimality of each machine and the Kullback-Leibler divergence. Since then, various online learning policies have been proposed [10], among which the UCB1 policy developed in Auer et al. [11] is considered optimal and will be introduced in detail in the Methods and model section.
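As a small illustration of the pseudo-regret just described, the sketch below evaluates $\sum_i (\mu^* - \mu_i) T_i(n)$ for one realized sequence of plays. The machine means and the uniformly random policy are hypothetical, and a single run replaces the expectation of $T_i(n)$ by its realized value.

```python
import random

def pseudo_regret(means, pulls):
    """Sum over machines of (mu* - mu_i) * T_i(n) for one realized sequence of plays."""
    best = max(means)
    counts = [0] * len(means)
    for arm in pulls:
        counts[arm] += 1
    return sum((best - mu) * c for mu, c in zip(means, counts))

# Hypothetical example: three machines and a policy that plays uniformly at random.
means = [0.2, 0.5, 0.7]
rng = random.Random(0)
pulls = [rng.randrange(len(means)) for _ in range(1000)]
print(pseudo_regret(means, pulls))
```

A policy that always plays the best machine has zero pseudo-regret, whereas the uniformly random policy above accumulates regret linearly in the number of trials.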
Although the classic multi-armed bandit has been well studied in academia, a number of variants of this problem have been proposed to model different real-world scenarios. For example, Agrawal & Goyal [12] considers a contextual bandit with a linear reward function and analyses the performance of the Thompson sampling algorithm. Koulouriotis & Xanthopoulos [13] studies the non-stationary setting where the reward distributions of machines change at a fixed time. A more important variant is the risk-aware setting, where the learner considers risk in the objective instead of simply maximizing the cumulative reward. This variant is closely related to the portfolio selection problem, where risk management is an indispensable concern, and has been discussed in several papers. For example, Sani et al. [14] studies the problem where the learner's objective is to minimize the mean-variance defined as $\sigma^2 - \rho\mu$ and proposes two algorithms, MV-LCB and ExpExp. In a similar setting, Vakili & Zhao [15] provides a finer analysis of the performance of the algorithms proposed in Sani et al. [14]. In addition, Vakili & Zhao [16] extends this setting by considering the mean-variance and value-at-risk of total rewards at the end of the time horizon. In a more general case, Zimin et al. [17] sets the objective to be a function of the mean and the variance, $f(\mu, \sigma^2)$, and defines the ϕ-LCB algorithm that achieves desirable performance under certain conditions. Moreover, Galichet et al. [18] chooses the conditional value-at-risk to be the objective and proposes the MARAB algorithm.
These works serve as the inspiration for us to consider risk in the model, but they are not directly applicable to the portfolio selection problem, owing to the primary obstacle that these methods only choose the best single machine to play at each trial. To address this issue, a basket of candidate portfolios needs to be first selected in the preliminary stage in a strategic and logical way. For example, Shen et al. [19] uses principal component analysis (PCA) to select candidate portfolios, namely the normalized eigenvectors of the covariance matrix of asset returns.
In our model, we first take a graph theory approach to filter and select a basket of assets, which we use to construct the portfolio. Then, at each trial we combine the single-asset portfolio determined by the optimal multi-armed bandit algorithm with the portfolio that globally minimizes a coherent risk measure, the conditional value-at-risk. The rest of this paper is organized as follows. In the Methods and model section, we formulate the portfolio selection problem in the multi-armed bandit setting, and describe our methodology in detail. In the Results section, we present our simulation results using the proposed method. In the Discussion and conclusion section, we discuss results and also provide directions for future research.

Problem formulation
In this section, we modify the classic multi-armed bandit setting to model portfolio selection. Consider a financial market with a large set of assets, from which the learner selects a basket of K assets to invest in over a sequence of N trials. At each trial $t = 1, \ldots, N$, the learner chooses a portfolio $\omega_t = (\omega_{1,t}, \ldots, \omega_{K,t})^\top$, where $\omega_{i,t}$ is the weight of asset i. As we only consider long-only and self-financed trading, we must have $\omega_t \in W$, where $W = \{u \in \mathbb{R}^K_+ : u^\top \mathbf{1} = 1\}$ and $\mathbf{1}$ is a column vector of ones. The returns of assets are then revealed at trial t + 1 and denoted by $R_t = (R_{1,t}, \ldots, R_{K,t})^\top$. In particular, the return for each asset $R_{i,t}$ is viewed as a random draw from the corresponding probability distribution $\nu_i$ with mean $\mu_i$ and can be simply defined as the log price ratio $R_{i,t} = \log(P_{i,t+1}/P_{i,t})$, where we use the natural log, and $P_{i,t}$, $P_{i,t+1}$ denote the prices at trial t and t + 1, respectively. For the trading period from t to t + 1, the learner receives $\omega_t^\top R_t$ as the reward for the portfolio. The investment strategy of the learner is thus a sequence of N mappings from the accumulated knowledge to W.
We make the following assumptions. First, we assume we always have access to historical returns $H_{i,t}$ of every asset i in the market for $t = 1, \ldots, \delta$. The historical return is defined similarly to $R_{i,t}$ as the log price ratio but corresponds to the time horizon immediately before our investment period. Historical returns are only used to estimate the correlation structure and risk level. Second, we make no assumption on the dependency of returns either across time or across assets. We only assume that, for each trial t and for all $i \in \{1, \ldots, K\}$, $R_{i,t} \sim \nu_i$ and $H_{i,t} \sim \nu_i$ with a relatively small δ. Note that the UCB1 algorithm we use later is proved to be optimal under a weaker assumption, $\mathbb{E}[R_{i,t} \mid R_{i,1}, \ldots, R_{i,t-1}] = \mu_i$, allowing us to waive the assumptions in the classic setting [11]. Third, transaction costs and market liquidity will not be considered. See Model 1 for a summary of the problem.

Model 1: Sequential portfolio selection problem
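Parameters: δ, N
Receive historical returns $H_{i,t}$ of each asset i for t = 1, . . . , δ;
Filter to select a basket of K assets;
for t = 1, . . . , N do
    Choose a portfolio $\omega_t \in W$;
    Observe returns $R_t$;
    Receive portfolio reward $\omega_t^\top R_t$;
end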
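The quantities in the problem formulation can be sketched in a few lines. The snippet below computes log price ratios, checks membership in the long-only simplex W, and evaluates the per-trial portfolio reward. The price series and the fixed equally weighted portfolio are hypothetical and serve only as an illustration.

```python
import math

def log_returns(prices):
    """Log price ratios log(P_{t+1} / P_t) for one asset's price path."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]

def is_valid_portfolio(weights, tol=1e-9):
    """Membership in W: non-negative weights that sum to one."""
    return all(w >= -tol for w in weights) and abs(sum(weights) - 1.0) <= tol

def portfolio_reward(weights, returns):
    """Reward for one trading period: the inner product of weights and returns."""
    return sum(w * r for w, r in zip(weights, returns))

# Hypothetical prices for K = 2 assets over 3 trials (4 price observations each).
prices = [[100.0, 101.0, 99.0, 102.0], [50.0, 50.5, 51.0, 50.0]]
returns_by_asset = [log_returns(p) for p in prices]
weights = [0.5, 0.5]  # equally weighted portfolio, held fixed for illustration
assert is_valid_portfolio(weights)
rewards = [portfolio_reward(weights, [r[t] for r in returns_by_asset]) for t in range(3)]
print(rewards)
```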

Portfolio construction by filtering assets
Graph theory has been widely applied in various disciplines to model networks, where the vertices represent individuals of interest and the edges represent their interactions. For example, in evolutionary game theory, graphs are used to analyse the dynamics of cooperation within different population structures [20-25]. In financial markets, the minimum spanning tree (MST) is accepted as a robust method to visualize the structure of assets [26], allowing one to capture different market sectors from empirical data [27-29].
For our purpose, as we have a large pool of assets, we first want to select a basket of K to invest in. Recall that the return of each asset is $R_{i,t} = \log(P_{i,t+1}/P_{i,t})$, where $P_{i,t}$ and $P_{i,t+1}$ are the prices at trial t and t + 1. Following Mantegna [27] and Mantegna & Stanley [30], we use δ trials of historical returns to find the correlation matrix, whose entries are

$$\rho_{i,j} = \frac{\langle H_i H_j \rangle - \langle H_i \rangle \langle H_j \rangle}{\sqrt{\left(\langle H_i^2 \rangle - \langle H_i \rangle^2\right)\left(\langle H_j^2 \rangle - \langle H_j \rangle^2\right)}},$$

where $\langle \cdot \rangle$ is the historical mean, namely $\langle H_i \rangle = \frac{1}{\delta} \sum_{t=1}^{\delta} H_{i,t}$ for each asset i in the market. For δ small, we can improve our estimation by taking advantage of the shrinkage method in Ledoit & Wolf [31]. We then define the metric distance between two vertices as $d_{i,j} \overset{\mathrm{def}}{=} \sqrt{2(1 - \rho_{i,j})}$. The Euclidean distance matrix D whose entries are $d_{i,j}$ is then used to compute the undirected graph G = {V, E}, where V is the set of vertices representing assets and E is the set of weighted edges representing distance. To extract the most important edges from G, we construct the MST T. In particular, T is the subgraph of G that connects all vertices without cycles and minimizes the total edge weight.
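The pipeline above (correlation, distance, MST) can be sketched as follows. This is a minimal illustration that uses the plain sample correlation without the shrinkage step of [31] and Prim's algorithm for the MST; the return series are hypothetical.

```python
import math

def correlation(x, y):
    """Sample Pearson correlation of two equally long return series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def distance_matrix(returns):
    """Entries sqrt(2 (1 - rho_ij)) for a list of per-asset return series."""
    k = len(returns)
    return [[math.sqrt(max(0.0, 2.0 * (1.0 - correlation(returns[i], returns[j]))))
             for j in range(k)] for i in range(k)]

def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense distance matrix; returns a list of edges (i, j)."""
    k = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < k:
        # Cheapest edge leaving the current tree.
        i, j = min(((a, b) for a in in_tree for b in range(k) if b not in in_tree),
                   key=lambda e: dist[e[0]][e[1]])
        edges.append((i, j))
        in_tree.add(j)
    return edges

# Hypothetical historical returns for four assets (delta = 5 trials each).
history = [
    [0.010, -0.020, 0.015, 0.000, 0.005],
    [0.012, -0.018, 0.010, 0.002, 0.004],
    [-0.005, 0.010, -0.010, 0.003, -0.002],
    [0.000, 0.004, -0.003, 0.010, -0.008],
]
tree = minimum_spanning_tree(distance_matrix(history))
print(tree)
```

For a large asset pool, a library routine for minimum spanning trees would normally replace the quadratic search used here.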
One way to classify vertices is based on their relative positions in the graph, central versus peripheral. In financial markets, this classification method turns out to have significant implications for systemic risk, which is the risk that an economic shock causes the collapse of a chain of institutions [32]. Several empirical studies suggest that such risk can be associated with certain characteristics of the correlation structure of the market. For example, Kritzman et al. [33] defines the absorption ratio as the fraction of total variance explained by a fixed number of principal components, namely the eigenvectors of the covariance matrix, and shows that this ratio increased dramatically during both domestic and global financial crises including the housing bubble, the dot-com bubble, the 1997 Asian financial crisis and so on. Drozdz et al. [34] finds a similar result and suggests that the maximum eigenvalue of the correlation matrix rises during a crisis and exhausts the total variance. Hence, graph theory can be naturally applied to this setting and provides significant insights into managing systemic risk. In particular, Huang et al. [35] gives an intuitive simulation of the contagion process of systemic risk on a bipartite graph. Onnela et al. [36] shows that the MST of assets shrinks during a crisis, which supports the above arguments on the compactness of the eigenvalues of the correlation matrix. More importantly, Onnela et al. [36], Pozzi et al. [37] and Ren et al. [38] suggest that investing in the assets located on the peripheral parts of the MST can facilitate diversification and reduce the exposure to systemic risk during a crisis.
For our study, we select 30 S&P 500 stocks, which consist of 15 financial institutions (JPM, WFC, BAC, C, GS, USB, MS, KEY, PNC, COF, AXP, PRU, SCHW, BBT, STI) and 15 randomly selected companies from other sectors (KR, PFE, XOM, WMT, DAL, CSCO, HCP, EQIX, DUK, NFLX, GE, APA, F, REGN, CMS). We use the daily close price of 44 trading days during the subprime mortgage crisis to construct the MST and investigate the advantage of investing in peripheral vertices using the equally weighted portfolio strategy. Although the number of stocks is small, our results similarly show that investing in peripheral vertices can reduce loss during financial crisis (figure 1). Figure 1a shows the complete graph of 30 stocks. Figure 1b is the MST we obtain following the above method. Observe that this tree has a total of 14 leaves (WFC, C, GS, KEY, PNC, SCHW, KR, DAL, HCP, EQIX, DUK, NFLX, GE, F), and selecting from these leaves to construct a portfolio almost always reduces the median daily loss compared with the portfolio with all vertices. For example, figure 1c provides the performance of the portfolio with 10 randomly selected vertices from the 14 leaves, which increases the median daily log price ratio from −0.0101 to −0.0079 and the median daily percentage return from −0.0095 to −0.0070. Furthermore, figure 1d shows that the eigenvalue spectrum of the covariance matrix becomes less compact. Finally, we acknowledge the dynamic nature of the market structure, but for simplicity this aspect will not be considered in our study.
Therefore, we select the K most peripheral vertices from the MST T as our basket of assets to invest in. We note that for any graph G with distinct edge weights, which is often the case for financial data with high precision, the MST T is proved to be unique. Our selection of vertices tends to lie on the leaves for a star-like graph, on the two ends of the longest edge for a cycle, and on the corners for a lattice. Among the numerous centrality measures discussed in graph theory [39], we use the most straightforward measure and select the K vertices with the least degree. The value of K is subjective and can be determined based on the learner's view of the economic state. Assuming K assets are selected, we proceed to portfolio construction as described in what follows.
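The least-degree selection rule can be sketched as below. The tree is given as a list of edges; breaking ties by vertex index is an assumption made here for determinism, as the text does not specify a tie-breaking rule.

```python
def select_peripheral(edges, num_vertices, k):
    """Return the k vertices of smallest degree in the tree; ties broken by index."""
    degree = [0] * num_vertices
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return sorted(range(num_vertices), key=lambda v: (degree[v], v))[:k]

# Hypothetical star-like tree: vertex 0 is the hub, vertices 1..4 are leaves.
star_edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
print(select_peripheral(star_edges, 5, 3))  # → [1, 2, 3]
```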

Combined sequential portfolio selection algorithm
We design a sequential portfolio selection algorithm by combining the optimal multi-armed bandit policy, namely the UCB1 proposed in Auer et al. [11], with the minimization of a coherent risk measure, namely the conditional value-at-risk. Recall that the return $R_{i,t}$ of each asset i is defined as the log price ratio, namely $R_{i,t} = \log(P_{i,t+1}/P_{i,t})$. The UCB1 policy is defined as follows. First, select each asset once and observe its return during the first K trials. Then, for each trial t > K, select the asset that maximizes an estimated upper confidence bound of return with a certain confidence level. Precisely, at each trial t we select

$$I_t = \arg\max_{i = 1, \ldots, K} \left( \bar{R}_i(t) + \sqrt{\frac{2 \log t}{T_i(t-1)}} \right), \tag{2.1}$$

where $\bar{R}_i(t)$ is the empirical mean of return for asset i and recall that $T_i(t-1)$ is the number of times asset i has been selected during the past t − 1 trials. Theorem 2.1 below, provided in Auer et al. [11], proves the optimality of UCB1.
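Theorem 2.1 [11]. For all K > 1 assets whose returns have support in [0, 1], the pseudo-regret of the UCB1 algorithm after any number n of trials satisfies

$$\bar{\xi}(n) \le 8 \sum_{i : \mu_i < \mu^*} \frac{\log n}{\mu^* - \mu_i} + \left(1 + \frac{\pi^2}{3}\right) \sum_{i=1}^{K} (\mu^* - \mu_i),$$

where $\mu_i$ is the mean return of asset i and $\mu^* = \max\{\mu_1, \ldots, \mu_K\}$.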
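A minimal sketch of the UCB1 policy of Auer et al. [11] is given below, assuming rewards already scaled to [0, 1]. The callable `reward_fn` and the Bernoulli arms are hypothetical stand-ins for observed asset returns.

```python
import math
import random

def ucb1(reward_fn, num_arms, num_trials):
    """Run UCB1; reward_fn(arm) returns a reward assumed to lie in [0, 1]."""
    counts = [0] * num_arms      # number of times each arm has been selected
    means = [0.0] * num_arms     # empirical mean reward of each arm
    choices = []
    for t in range(1, num_trials + 1):
        if t <= num_arms:
            arm = t - 1          # initialization: play every arm once
        else:
            # Empirical mean plus the confidence radius sqrt(2 log t / T_i).
            arm = max(range(num_arms),
                      key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = reward_fn(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean update
        choices.append(arm)
    return choices, counts

# Hypothetical Bernoulli arms with success probabilities 0.3, 0.5 and 0.8.
rng = random.Random(1)
probs = [0.3, 0.5, 0.8]
choices, counts = ucb1(lambda i: 1.0 if rng.random() < probs[i] else 0.0, 3, 2000)
print(counts)
```

The suboptimal arms are still sampled occasionally because their confidence radius grows with log t while their counts stay fixed, which is how the policy keeps exploring.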
The proof makes no assumption on the dependency and distribution of asset returns besides $\mathbb{E}[R_{i,t} \mid R_{i,1}, \ldots, R_{i,t-1}] = \mu_i$. Therefore, by scaling the returns to [0, 1] we can achieve optimality. In addition, we can use historical returns and observed returns of unselected assets to further improve performance, but we do not discuss details here. Let $e_i \in \mathbb{R}^K$ be the vector with a single 1 in entry i and 0 elsewhere. Our single-asset multi-armed bandit portfolio at t chosen according to equation (2.1) is

$$\omega^M_t = e_{I_t}. \tag{2.2}$$

Now, let us incorporate risk awareness into our algorithm by finding the portfolio that achieves the global minimum of the conditional value-at-risk. We define risk measures and associated properties following Artzner et al. [40] and Bäuerle & Rieder [41].

Definition 2.2. Let (Ω, F, P) be a probability space and denote by L(Ω, F, P) the set of integrable random variables, where any element of L(Ω, F, P) represents a portfolio return. A function Ψ : L(Ω, F, P) → R is called a risk measure.

Definition 2.3. Let Ψ be a risk measure; we say Ψ is a coherent risk measure if, for all $X_1, X_2 \in L(\Omega, F, P)$, $c \in \mathbb{R}$ and $d \in \mathbb{R}_+ \cup \{0\}$, it satisfies
- translation invariance: $\Psi(X_1 + c) = \Psi(X_1) - c$;
- subadditivity: $\Psi(X_1 + X_2) \le \Psi(X_1) + \Psi(X_2)$;
- positive homogeneity: $\Psi(d X_1) = d\, \Psi(X_1)$;
- monotonicity: $X_1 \le X_2$ implies $\Psi(X_1) \ge \Psi(X_2)$.

Definition 2.4. Let X ∈ L(Ω, F, P); the risk measure value-at-risk of X at confidence level β ∈ (0, 1) is defined as

$$\mathrm{VaR}_\beta(X) = \inf\{x \in \mathbb{R} : P(X + x < 0) \le 1 - \beta\}.$$

In addition, the risk measure conditional value-at-risk at confidence level γ ∈ (0, 1) is defined as

$$\mathrm{CVaR}_\gamma(X) = \frac{1}{1 - \gamma} \int_{\gamma}^{1} \mathrm{VaR}_\beta(X)\, d\beta.$$

In the literature, the above risk measures are sometimes expressed in terms of the portfolio loss variable, namely positive values represent loss and negative values represent gain. We note that these definitions are equivalent. Intuitively, the value-at-risk denotes the maximum threshold of loss under a certain confidence level, and conditional value-at-risk is the conditional expectation of loss given that it exceeds such a threshold.
Although more widely used in practice, value-at-risk fails certain mathematical properties such as subadditivity, which contradicts Markowitz's MPT and implies that diversification may not reduce investment risk. As a result, it is not a coherent risk measure. On the other hand, Pflug [42] proves that conditional value-at-risk is coherent and satisfies some extra properties such as convexity, monotonicity with respect to first-order stochastic dominance (FSD) and second-order monotonic dominance.

Theorem 2.5 [42]. The conditional value-at-risk is a coherent risk measure.
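The failure of subadditivity for value-at-risk can be seen in a small numerical example. The sketch below uses the loss-variable convention (positive values are losses) and a hypothetical bond that loses 100 with probability 0.04; neither the bond nor the helper function comes from the paper.

```python
from itertools import product

def value_at_risk(loss_dist, beta):
    """Smallest x with P(loss <= x) >= beta for a discrete loss distribution.

    loss_dist is a list of (loss, probability) pairs.
    """
    cumulative = 0.0
    for loss, prob in sorted(loss_dist):
        cumulative += prob
        if cumulative >= beta:
            return loss
    return max(loss for loss, _ in loss_dist)

# Hypothetical bond: loses 100 with probability 0.04, otherwise loses nothing.
bond = [(0.0, 0.96), (100.0, 0.04)]

# Loss distribution of holding two independent copies of the bond.
two_bonds = {}
for (l1, p1), (l2, p2) in product(bond, bond):
    two_bonds[l1 + l2] = two_bonds.get(l1 + l2, 0.0) + p1 * p2

single = value_at_risk(bond, 0.95)
combined = value_at_risk(list(two_bonds.items()), 0.95)
print(single, combined)  # → 0.0 100.0
```

Each bond alone has a 95% value-at-risk of 0, yet the two-bond portfolio has a value-at-risk of 100, which exceeds the sum of the individual values. The diversified position therefore appears riskier under this measure.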
Therefore, we would like to minimize risk using the conditional value-at-risk at confidence level γ as the risk measure. We recall that $W = \{u \in \mathbb{R}^K_+ : u^\top \mathbf{1} = 1\}$ is the set of possible portfolios. At each trial t, the learner would like to solve the following optimization problem:

$$\min_{u \in W} \mathrm{CVaR}_\gamma(u^\top R_t).$$

Note that as γ → 0, the problem becomes minimizing expected loss, and as γ → 1, it becomes minimizing the worst outcome. In this study, we use γ = 0.95. Rockafellar & Uryasev [43] provides a convenient method to solve this problem. Recall that we assume that both historical returns and present returns follow the same distribution; let $p(R_t)$ be the density. Define the performance function as

$$F_\gamma(u, \alpha) = \alpha + \frac{1}{1 - \gamma} \int_{\mathbb{R}^K} [-u^\top R_t - \alpha]^+ \, p(R_t)\, dR_t,$$

where $[m]^+ \overset{\mathrm{def}}{=} \max\{m, 0\}$. Then, we have the following theorem.
Theorem 2.6 [43]. The minimization of $\mathrm{CVaR}_\gamma(u^\top R_t)$ over u ∈ W is equivalent to the minimization of $F_\gamma(u, \alpha)$ over all pairs (u, α) ∈ W × R. Moreover, as $F_\gamma(u, \alpha)$ is convex with respect to (u, α), the loss function $-u^\top R_t$ is convex with respect to u and W is a convex set due to linearity, the minimization of $F_\gamma(u, \alpha)$ is an instance of convex programming.
Moreover, as the density $p(R_t)$ is unknown, we would like to approximate the performance function using not only historical returns but also knowledge gained as we proceed in this learning process. From the received $H_{i,1}, \ldots, H_{i,\delta}$ for all i, we extract the historical returns of our K assets, $H_1, \ldots, H_\delta$. Let $R_1, \ldots, R_{t-1}$ be the t − 1 trials of returns observed so far. Then our approximation of $F_\gamma(u, \alpha)$ at trial t is the following convex and piecewise linear function:

$$\tilde{F}_\gamma(u, \alpha, t) = \alpha + \frac{1}{(\delta + t - 1)(1 - \gamma)} \left( \sum_{s=1}^{\delta} [-u^\top H_s - \alpha]^+ + \sum_{s=1}^{t-1} [-u^\top R_s - \alpha]^+ \right). \tag{2.3}$$

Note that the approximation function is implicitly also a function of the current trial t, hence we have added an extra parameter and denote it as $\tilde{F}_\gamma(u, \alpha, t)$. As the learner proceeds in time, the learner accumulates data and obtains an increasingly precise approximation. As a result, the minimization of conditional value-at-risk is solved by convex programming and generates the following optimal solution. At each trial t, the risk-aware portfolio constructed according to equation (2.3) is

$$\omega^C_t = \arg\min_{u \in W} \min_{\alpha \in \mathbb{R}} \tilde{F}_\gamma(u, \alpha, t). \tag{2.4}$$

Now, we have found both the single-asset multi-armed bandit portfolio by (2.2) and the risk-aware portfolio by (2.4). Note that they are dynamic and update based on the learner's accumulated knowledge. For each trial t, the learner combines them with a factor λ ∈ [0, 1] to form the balanced portfolio

$$\omega^*_t = \lambda\, \omega^M_t + (1 - \lambda)\, \omega^C_t. \tag{2.5}$$

In particular, λ is the proportion of wealth invested in the single-asset multi-armed bandit portfolio and 1 − λ is the proportion invested in the risk-aware portfolio. The value of λ denotes the risk preference of the learner. As λ → 1, our algorithm reverts to the UCB1 policy, whereas for λ → 0, it becomes the minimization of conditional value-at-risk. Therefore, the commonly discussed trade-off between reward and risk is illustrated here in the choice of λ. Finally, the following algorithm summarizes our sequential portfolio selection algorithm.
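The sample-based objective and the λ-combination can be sketched as follows. The snippet evaluates the piecewise linear approximation for a fixed portfolio and minimizes it over α only. Minimizing over the portfolio as well requires a convex or linear programming solver and is not attempted here. The pooled return samples are hypothetical.

```python
def empirical_objective(u, alpha, samples, gamma):
    """alpha plus the scaled average of [-u^T r - alpha]^+ over pooled return samples."""
    excess = sum(max(-sum(w * r for w, r in zip(u, row)) - alpha, 0.0) for row in samples)
    return alpha + excess / (len(samples) * (1.0 - gamma))

def empirical_cvar(u, samples, gamma):
    """Minimize the objective over alpha for a fixed portfolio u.

    The objective is convex and piecewise linear in alpha, so a minimizer
    lies at one of its breakpoints, that is, at one of the sample losses.
    """
    losses = [-sum(w * r for w, r in zip(u, row)) for row in samples]
    return min(empirical_objective(u, a, samples, gamma) for a in losses)

def combine(bandit_portfolio, risk_portfolio, lam):
    """Convex combination of the bandit portfolio and the risk-aware portfolio."""
    return [lam * m + (1.0 - lam) * c for m, c in zip(bandit_portfolio, risk_portfolio)]

# Hypothetical pooled samples (historical and observed returns) for K = 2 assets.
samples = [[0.01, -0.02], [-0.03, 0.01], [0.02, 0.00], [-0.01, 0.02]]
print(empirical_cvar([1.0, 0.0], samples, 0.5))
print(empirical_cvar([0.5, 0.5], samples, 0.5))
print(combine([1.0, 0.0], [0.5, 0.5], 0.9))
```

With γ = 0.5 and four samples, the value for the first asset alone equals the average of its two largest losses, which matches the interpretation of conditional value-at-risk as the expected loss beyond the value-at-risk threshold.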

Algorithm 1: Our proposed sequential portfolio selection algorithm
Input: K, γ, λ
Select K peripheral assets from the market according to §2.2;
for t = 1, . . . , N do
    Compute the single-asset multi-armed bandit portfolio $\omega^M_t$ by (2.2);
    Compute the risk-aware portfolio $\omega^C_t$ at confidence level γ by (2.4);
    Select the combined portfolio $\omega^*_t$ with a factor λ by (2.5);
    Observe returns $R_t$ and update accumulated knowledge for (2.2) and (2.4);
    Receive portfolio reward $\omega^{*\top}_t R_t$;
end
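In practice, the risk-aware step of the algorithm is often handed to a linear programming solver. The block below states the standard linearization of Rockafellar & Uryasev [43] for a sample-based conditional value-at-risk objective. Here $r_1, \ldots, r_{\delta + t - 1}$ denote the pooled historical and observed return vectors and $z_s$ are auxiliary variables. Both symbols are notation introduced only for this illustration.

```latex
\begin{aligned}
\min_{u,\,\alpha,\,z}\quad
  & \alpha + \frac{1}{(\delta + t - 1)(1 - \gamma)} \sum_{s=1}^{\delta + t - 1} z_s \\
\text{subject to}\quad
  & z_s \ge -u^{\top} r_s - \alpha, \qquad z_s \ge 0, \qquad s = 1, \ldots, \delta + t - 1, \\
  & u \ge 0, \qquad u^{\top} \mathbf{1} = 1 .
\end{aligned}
```

At an optimum each $z_s$ equals $[-u^{\top} r_s - \alpha]^+$, so the linear program attains the same value as the piecewise linear objective, and its u-component gives a risk-aware portfolio.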

Results
In this section, we design experiments and report the performance of the proposed algorithm (algorithm 1) in comparison with several benchmarks.

Monte Carlo simulation method
For simplicity, we consider stocks as our assets and adopt the Black-Scholes model [44] to simulate stock prices as geometric Brownian motion (GBM) paths. As a Nobel Prize-winning model, it provides a partial differential equation to price a European option by computing the initial wealth for perfectly hedging a short position in that option. The underlying asset, usually a stock, is modelled to follow a GBM. Although this assumption may not hold perfectly in reality, it provides an extremely convenient and widely used method to simulate any number of stock paths. For our purpose, as we never make any assumption on the dependency of asset returns, we consider the general case where stock paths can be correlated, as is almost always the case in the financial market. We use definitions similar to ch. 4 of Shreve [45] and describe our method below.

Definition 3.1. Let (Ω, F, P) be a probability space. The stock price $P_i(t)$ is said to follow a GBM if it satisfies the following stochastic differential equation:

$$dP_i(t) = \alpha_i P_i(t)\, dt + \sigma_i P_i(t)\, dW_i(t),$$

where $W_i(t)$ is a Brownian motion, $\alpha_i$ is the drift and $\sigma_i$ is the volatility.

Definition 3.2. Two stock paths $P_i(t)$ and $P_j(t)$ modelled by GBMs are correlated if their associated Brownian motions satisfy

$$dW_i(t)\, dW_j(t) = \rho_{i,j}\, dt.$$

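Proposition 3.3. For two correlated stock prices $P_i(t)$ and $P_j(t)$ that satisfy $dW_i(t)\, dW_j(t) = \rho_{i,j}\, dt$, the following properties hold:

$$\mathbb{E}[W_i(t) W_j(t)] = \rho_{i,j}\, t, \qquad \mathrm{Cov}(W_i(t), W_j(t)) = \rho_{i,j}\, t, \qquad \mathrm{Cov}(\sigma_i W_i(t), \sigma_j W_j(t)) = \sigma_i \sigma_j \rho_{i,j}\, t,$$

where $\sigma_i$ and $\sigma_j$ are the volatility parameters of $P_i(t)$ and $P_j(t)$, respectively.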
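Proof. We prove the first claim and the rest follow immediately after some computations. By the Itô-Doeblin formula, which can be found in Shreve [45], we have

$$d\big(W_i(t) W_j(t)\big) = W_i(t)\, dW_j(t) + W_j(t)\, dW_i(t) + dW_i(t)\, dW_j(t) = W_i(t)\, dW_j(t) + W_j(t)\, dW_i(t) + \rho_{i,j}\, dt.$$

Integrating on both sides, we have

$$W_i(t) W_j(t) = \int_0^t W_i(s)\, dW_j(s) + \int_0^t W_j(s)\, dW_i(s) + \rho_{i,j}\, t.$$

By the martingale property of Itô integrals, we simply take the expectation on both sides to obtain

$$\mathbb{E}[W_i(t) W_j(t)] = \rho_{i,j}\, t.$$

Recall that we have K stocks whose prices $P_1(t), \ldots, P_K(t)$ are modelled by correlated GBMs. By definition, they must satisfy the following two equations:

$$dP_i(t) = \alpha_i P_i(t)\, dt + \sigma_i P_i(t)\, dW_i(t) \tag{3.1}$$

and

$$dW_i(t)\, dW_j(t) = \rho_{i,j}\, dt. \tag{3.2}$$

In particular, the solution to equation (3.1) can be expressed as follows [46]. For any time u < l, we have

$$P_i(l) = P_i(u) \exp\left( \left(\alpha_i - \tfrac{1}{2}\sigma_i^2\right)(l - u) + \sigma_i \big(W_i(l) - W_i(u)\big) \right). \tag{3.3}$$

We first would like to express the scaled correlated Brownian motions $\sigma_i W_i(t)$ using independent ones. By proposition 3.3, we have the following instantaneous covariance matrix:

$$\Theta = \big(\sigma_i \sigma_j \rho_{i,j}\big)_{i,j = 1, \ldots, K}.$$

As Θ has to be symmetric and positive definite, it has a square root and we apply Cholesky decomposition to find the matrix A such that $A A^\top = \Theta$. We can then find independent Brownian motions $X_1(t), \ldots, X_K(t)$ such that

$$\sigma_i W_i(t) = \sum_{m=1}^{K} A_{i,m} X_m(t).$$

Then equation (3.1) becomes

$$dP_i(t) = \alpha_i P_i(t)\, dt + P_i(t) \sum_{m=1}^{K} A_{i,m}\, dX_m(t) \tag{3.4}$$

and equation (3.3) becomes, for any time u < l,

$$P_i(l) = P_i(u) \exp\left( \left(\alpha_i - \tfrac{1}{2}\sigma_i^2\right)(l - u) + \sum_{m=1}^{K} A_{i,m} \big(X_m(l) - X_m(u)\big) \right). \tag{3.5}$$

As the Brownian motions $X_m(t)$ for $m \in \{1, \ldots, K\}$ above are independent and each increment $X_m(l) - X_m(u)$ is Gaussian with mean 0 and variance l − u, let $Z(t) = (Z_1(t), \ldots, Z_K(t))^\top$ be standard multivariate Gaussian; then equation (3.5) becomes

$$P_i(l) = P_i(u) \exp\left( \left(\alpha_i - \tfrac{1}{2}\sigma_i^2\right)(l - u) + \sqrt{l - u} \sum_{m=1}^{K} A_{i,m} Z_m(l) \right). \tag{3.6}$$

Therefore, at each time we can conveniently generate a sample from Z(t) to compute the price increment. Specifically, equation (3.6) leads to the following recursive algorithm that can also be found in Glasserman [46]. For $0 = t_0 < t_1 < \cdots < t_\infty$, we have

$$P_i(t_{k+1}) = P_i(t_k) \exp\left( \left(\alpha_i - \tfrac{1}{2}\sigma_i^2\right)(t_{k+1} - t_k) + \sqrt{t_{k+1} - t_k} \sum_{m=1}^{K} A_{i,m} Z_m(t_{k+1}) \right), \qquad k = 0, 1, \ldots$$

Also note that when the paths are independent, $dW_i(t)\, dW_j(t) = \delta_{i,j}\, dt$, where $\delta_{i,j}$ is the Kronecker delta function, and the covariance matrix Θ is diagonal. In this special case, it is equivalent to compute K paths separately in the one-dimensional space. For our purpose, we first find some appropriate covariance matrix and generate K price paths following the above algorithm.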
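The recursive scheme above can be sketched in pure Python. The snippet builds the instantaneous covariance matrix from volatilities and correlations, factors it with a hand-written Cholesky routine, and advances the prices on a uniform time grid. The drift, volatility and correlation values are hypothetical.

```python
import math
import random

def cholesky(theta):
    """Lower-triangular A with A A^T = theta (theta symmetric positive definite)."""
    k = len(theta)
    a = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(a[i][m] * a[j][m] for m in range(j))
            if i == j:
                a[i][j] = math.sqrt(theta[i][i] - s)
            else:
                a[i][j] = (theta[i][j] - s) / a[j][j]
    return a

def simulate_gbm(p0, drift, vol, corr, dt, steps, rng):
    """Simulate K correlated GBM paths on a uniform time grid with step dt."""
    k = len(p0)
    theta = [[vol[i] * vol[j] * corr[i][j] for j in range(k)] for i in range(k)]
    a = cholesky(theta)
    paths = [[p] for p in p0]
    for _ in range(steps):
        z = [rng.gauss(0.0, 1.0) for _ in range(k)]  # independent standard normals
        for i in range(k):
            shock = sum(a[i][m] * z[m] for m in range(i + 1))
            growth = (drift[i] - 0.5 * vol[i] ** 2) * dt + math.sqrt(dt) * shock
            paths[i].append(paths[i][-1] * math.exp(growth))
    return paths

# Hypothetical parameters: two stocks with correlation 0.6 over 250 daily steps.
paths = simulate_gbm(p0=[100.0, 50.0], drift=[0.08, 0.05], vol=[0.2, 0.3],
                     corr=[[1.0, 0.6], [0.6, 1.0]], dt=1.0 / 250, steps=250,
                     rng=random.Random(42))
print(paths[0][-1], paths[1][-1])
```

Because the exponent is sampled exactly at each grid point, the scheme introduces no discretization error in the marginal distribution of prices at the grid times.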
We then uniformly divide the total time horizon into δ + N trials and use the prices at the beginning and end of each trial to calculate the return, which was defined earlier as the log price ratio. We run our sequential portfolio selection algorithm on these data and compare the performance with four benchmark portfolios, namely UCB1 (2.2), the risk-aware portfolio (2.4), ε-greedy and the equally weighted portfolio.
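For reference, a minimal sketch of an ε-greedy benchmark is given below. The text does not specify which variant or which value of ε is used, so the fixed exploration rate `epsilon` and the initial round-robin over all arms are assumptions made for illustration.

```python
import random

def epsilon_greedy(reward_fn, num_arms, num_trials, epsilon, rng):
    """With probability epsilon explore uniformly; otherwise play the best empirical arm."""
    counts = [0] * num_arms
    means = [0.0] * num_arms
    for t in range(num_trials):
        if t < num_arms:
            arm = t                                             # play every arm once first
        elif rng.random() < epsilon:
            arm = rng.randrange(num_arms)                       # explore
        else:
            arm = max(range(num_arms), key=lambda i: means[i])  # exploit
        reward = reward_fn(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]       # running mean update
    return counts

# Hypothetical arms with deterministic rewards 0.1 and 0.9.
print(epsilon_greedy(lambda i: [0.1, 0.9][i], 2, 500, 0.1, random.Random(0)))
```

Unlike UCB1, a fixed ε keeps exploring at a constant rate, so the number of suboptimal plays grows linearly with the number of trials.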

Simulation results
After we repeatedly generate price paths and compare the performance, we can see that the results agree well with our prediction (figure 2). The UCB1 portfolio almost always achieves the most cumulative wealth but has high variations in its path. On the other hand, the risk-aware portfolio achieves a relatively low cumulative wealth but also has low variations. As a result, our combined portfolio achieves a middle ground between the two extremes of maximizing reward and minimizing risk. For example, figure 2a-c illustrates a typical simulation, where figure 2a shows K = 5 GBM paths, figure 2b shows the optimality of UCB1 compared to ε-greedy and figure 2c shows the cumulative wealth at the end of N = 200 trials.
With an initial wealth of 1 and λ = 0.9, the cumulative wealth is 2.1615 for UCB1, 2.1024 for the combined portfolio, 1.9168 for ε-greedy, 1.6355 for the risk-aware portfolio and 1.4640 for the equally weighted portfolio.
In addition, we observe that when the market is volatile and when different stock paths are similar in expectation, it takes more trials for the UCB1 policy to reach optimality (figure 2d-f). In this case, unlike the simulation presented in figure 2a-c, the risk-aware portfolio achieves the most cumulative wealth with a similarly low variation in its path. From the above discussion, it is evident that the value of λ is vital to the performance of our sequential portfolio selection algorithm and should be determined based on the market condition. In particular, Way et al. [47] discusses the trade-off between specialization to achieve high rewards and diversification to hedge against risk, and similarly shows that such a choice depends on the underlying parameters and initial conditions.

Discussion and conclusion
In this paper, we have studied the multi-armed bandit problem as a mathematical model for sequential decision-making under uncertainty. In particular, we focus on its application in financial markets and construct a sequential portfolio selection algorithm. We first apply graph theory and select the peripheral assets from the market to invest in. Then at each trial, we combine the optimal multi-armed bandit policy with the minimization of a coherent risk measure. By adjusting the parameter λ, we are able to achieve a balance between maximizing reward and minimizing risk. We adopt the Black-Scholes model to repeatedly simulate stock paths and observe the performance of our algorithm. We conclude that the results agree well with our prediction when the market is stable. In addition, when the market is volatile, risk awareness becomes more crucial to achieving high performance. Therefore, parameter selection should be based on the market condition.
For future research, one may consider the optimal selection of the parameter λ for combining the two portfolios. One may also consider portfolio selection strategies based on the MDP, which is a generalization of the multi-armed bandit to multiple states. In addition, one may pay more attention to a chaotic market environment where stock paths can be affected by various factors instead of simply following a stochastic process. For example, Junior & Mart [48] uses random matrix theory and transfer entropy to show that news articles can possibly affect the market. Finally, one may consider transaction costs and market liquidity. For example, Reiter et al. [49] illustrates the trade-off between reward and cost in a biological auction setting and might provide some important insights for the researcher.