Multiscale computing for science and engineering in the era of exascale performance

In this position paper, we discuss two relevant topics: (i) generic multiscale computing on emerging exascale high-performing computing environments, and (ii) the scaling of such applications towards the exascale. We will introduce the different phases when developing a multiscale model and simulating it on available computing infrastructure, and argue that we could rely on it both on the conceptual modelling level and also when actually executing the multiscale simulation, and maybe should further develop generic frameworks and software tools to facilitate multiscale computing. Next, we focus on simulating multiscale models on high-end computing resources in the face of emerging exascale performance levels. We will argue that although applications could scale to exascale performance relying on weak scaling and maybe even on strong scaling, there are also clear arguments that such scaling may no longer apply for many applications on these emerging exascale machines and that we need to resort to what we would call multi-scaling. This article is part of the theme issue ‘Multiscale modelling, simulation and computing: from the desktop to the exascale’.

In this position paper, we discuss two relevant topics: (i) generic multiscale computing on emerging exascale high-performing computing environments, and (ii) the scaling of such applications towards the exascale. We will introduce the different phases when developing a multiscale model and simulating it on available computing infrastructure, and argue that we could rely on it both on the conceptual modelling level and also when actually executing the multiscale simulation, and maybe should further develop generic frameworks and software tools to facilitate multiscale computing. Next, we focus on simulating multiscale models on high-end computing resources in the face of emerging exascale performance levels. We will argue that although applications could scale to exascale performance relying on weak scaling and maybe even on strong scaling, there are also clear arguments

Introduction
In science, our goal is to convincingly explain the processes at work in phenomena that we observe, as well as to predict what will occur before it does so. Predictions of real-world events all need substantial quantities of data and validated computational models together with the execution of many high-fidelity simulations. In many cases, the models that describe the phenomena are multiscale, as their accuracy and reliability depend on the correct representation of processes taking place on several length and time scales. Multiscale phenomena are everywhere around us [1][2][3][4][5][6][7]. If we study the origin and evolution of the Universe [8,9] or properties of materials [10][11][12][13][14], if we try to understand health and disease [3,[15][16][17][18][19][20][21][22] or develop fusion as a potential energy source of the future [23], in all these cases and many more we find that processes on quite disparate length and time scales interact in strong and nonlinear ways. In short, multiscale modelling is ubiquitous and progress in most of these cases is determined by our ability to design and implement multiscale models of the particular systems under study [1,6,24,25].
The increasing importance of multiscale modelling in many domains of science and engineering is clearly demonstrated in numerous publications (e.g. [1,26]). Therefore, we must anticipate that multiscale simulations will become an ever-more important form of scientific application on high-end computing resources, necessitating the development of sustainable and reusable solutions for such applications. That is, we expect to need generic algorithms for multiscale computing.
We therefore require innovative new ways of computing to face the challenges posed both by multiscale modelling and simulation and by the emerging high-end computing ecosystem [27]. This will contribute to our ability to solve multiscale problems and, as we will argue, can also offer an avenue for new ways to efficiently exploit exascale resources. Multiscale computing could face these challenging by deploying its various single-scale components across heterogeneous architectures of exascale resources, mapped to produce optimal performance and designed to bridge both temporal and spatial scales [28][29][30][31]. Therefore, we should embark upon a programme to efficiently deploy multiscale codes on today's and future high-performance computers and, thereby, establish a new and more effective paradigm for exploiting current and emerging computing resources.
In this position paper, we explore and discuss generic multiscale computing on emerging exascale high-performing computing (HPC) environments. We will first discuss the different phases when developing a multiscale model and simulating it on available computing infrastructure, and analyse where in our view we could, and maybe should, continue to further develop generic frameworks and software tools to facilitate multiscale computing. Next, we will focus on simulating multiscale models on high-end computing resources, which we call High Performance Multiscale Computing (HPMC), in the face of emerging exascale performance levels. We will argue that strong and weak parallel scaling of monolithic applications often may reach its limits at the exascale and that we need to invoke what we would call multi-scaling. Note that although our analysis is driven by the needs of modelling multiscale systems, our arguments with respect to the scalability challenge for multiscale systems to the exascale also point to the necessity to consider new approaches to increase concurrency within complex (single-scale) models through new algorithms and corresponding implementations.

Generic multiscale modelling and simulation (a) Simulation-based science
Simulation-based science is all about formulating computational models of phenomena that we observe, and performing computer simulations in order to deepen our understanding of the systems that underpin these phenomena. The aim is to predict their future behaviour or to find adaptations that would change their behaviour in some desired way. Simulationbased science is sometimes called the third pillar of science and complements theory and experiments. Together they underpin the scientific method and strongly interact. Theory provides the necessary framework for computational models, experiments provide the data against which the computational models need to be validated, and numerical simulations may lead to new insights and theory or new hypotheses that are tested experimentally.
In performing simulation-based science, we usually go through a generic modelling and simulation cycle (figure 1). Based on available (observational) data and knowledge and theory of the phenomenon under study, a conceptual model is formulated, which is then turned into a computational model. This is then implemented on a computer after which we perform numerical experiments with the computational model. These simulations provide results that are used to validate our models and, once validated, to predict the behaviour of the system we study.
The power of this approach lies in the fact that the computational sciences have developed a large collection of well-established generic modelling paradigms (such as ODE or PDE-based methods, particle-based methods, agent-based methods, fully discrete methods, etc.) and a large collection of well-established (numerical) methods to discretize these computational models and implement them very efficiently on the full range of available computers (from the laptop via cloud to petaflop/s supercomputers). We argue that also in multiscale modelling and simulation such generic approaches are possible and have been demonstrated in a number of successful projects. Having been exposed to such solutions over the last decade, we will discuss the potential of generic multiscale modelling and simulation, with emphasis on high-end performance levels, projecting forward to the era of exascale computing.   Many applications in computational science involve large ranges of spatial and temporal scales. Multiscale modelling amounts to splitting the spatio-temporal scales into what are often called single-scale submodels. These submodels are then coupled through various scale-bridging techniques. Such a multiscale model, that is, a collection of single-scale submodels coupled via scale bridging algorithms, should then be a sufficiently accurate representation of the behaviour of the problem where all the scales are present, yet with a substantial reduction of computing needs. Figure 2 illustrates the multiscale approach as defined in the Multiscale Modelling and Simulation Framework (MMSF) [28,32].
The MMSF is a theoretical and practical method for modelling, characterizing and simulating multiscale phenomena. The MMSF has been developed and applied over the past 10 years [1,28,29,31,[33][34][35], and comprises a four-stage pipeline from developing a multiscale model to executing the multiscale simulation. This is shown in figure 3. First, we model a phenomenon by identifying relevant processes that are well described by single-scale submodels, and their relevant scales, using the Scale Separation Map (SSM, the right panel in figure 2 is an example of an SSM). The architecture of the multiscale model, that is the communication between the single-scale models and details of the scale-bridging methods, are then specified in the Multiscale Modelling Language (MML) [36]. This specification is then used to (semi) automatically glue the single scale components (software implementations of the single-scale models) and the scale bridging components together using some dedicated coupling toolkit (such as, for instance, Muscle [33,37], MaMiCo [38] or AMUSE [9,39]). Finally, the multiscale simulation can be executed, by using dedicated environments such as, e.g. QCG [40]. This can in principle be done in a highly automated, optimized way, certainly when targeting HPC infrastructure [31,41].

(c) Generic frameworks
From the start of the development of the MMSF and comparable frameworks for multiscale modelling and simulation (for an overview, see the paper by Groen et al. [42] in this special issue), the ambition has always been to be as generic as possible, in the sense that such environments should be applicable for a large range of applications from different domains, that software components (implementing single-scale models or e.g. scale bridging) be reusable and interchangeable, and relieve the users as much as possible of the intricacies of launching and executing complex multiscale models on equally complex distributed HPC infrastructure.
We argue, based on a range of projects in multiscale modelling and computing, that the four phases of the MMSF ( This observation would warrant generic all-encompassing software environments to define a conceptual model, which is then translated by this environment into a computational model, and is finally executed on a range of computing infrastructures, again facilitated by such generic environments. Although this is certainly possible, as demonstrated, e.g. by our own developments [32], it is however not surprising that a range of different environments have been developed to, e.g. connect single codes into a full multiscale simulation. Usually, this is within specific communities, e.g. weather/climate [43], fusion [23] or astrophysics [44]. These environments run in production and are tailored to the needs of their communities. This then triggers the question as to what the benefits of generic frameworks could be, if any, and where they could be most useful. We believe that such frameworks should lead to a separation of concerns, where the user can focus on the modelling and simulation itself. A main ingredient in multiscale modelling is coupling single-scale models using scale bridging to convert information from one scale to another. Such coupling increases the complexity of multiscale modelling and simulation, and software should help mitigate that, both on the conceptual level and the simulation level. Moreover, generic frameworks should allow the reuse of validated and tested modules (submodels) and this could help in effectively standardizing data formats.
In our opinion, generic tools could be beneficial to the users both on the level of developing a conceptual multiscale model and on the level of executing the multiscale model. We call this the multiscale computing hourglass (see figure 4). It is important to continue to develop frameworks for multiscale computing on both levels, as we will argue below in more detail.
In MMSF, a conceptual multiscale model is defined as a collection of single-scale models, their connection including scale-bridging methods, and the details of the connection scheme [28]. The MML provides a formal way to specify the conceptual multiscale model [28,36], and using its xml version called xMML, it can be fully specified in a machine-readable text file. Such formal specification, in the form of an xMML file, could then be input to the neck of the hourglass in figure 4, where actual 'glue' code needs to be implemented. The benefits of agreeing on a formal specification of a multiscale model, relying, e.g. on MML, should not be underestimated and will bring many benefits. First, it allows clear and well-defined communication with colleagues, as well as the sharing of multiscale models. One could, for instance, imagine repositories of xMML files, very much like existing repositories of SBML [45] or CellML [46][47][48] models. It could also form the starting point for a mathematical framework to reason about properties of  a multiscale model, ranging from proving that it does not contain a deadlock [28], via using it as input for multiscale uncertainty quantification methods [49], to estimating and optimizing its computational footprint [31].
The neck of the hourglass contains the actual creation of the multiscale computational model, where all the application-specific details need to be established (figure 4). One important component is software to glue together the single-scale components using some coupling toolkit. Although some of these toolkits have been developed to be again generic and suitable across scientific domains (e.g. Muscle [33,37]), most coupling libraries used in production today are developed and maintained in specific scientific communities, e.g. astrophysics [39], complex fluids [38,50], quantum chemistry [51] or climate modelling [43]. Although we would very much welcome an exchange of ideas and technologies between these communities, to avoid reinventing the wheel, we also acknowledge that given the current state of the art, it would be unrealistic to push for generic coupling toolkits and environments, comparable to, e.g. MPI in parallel computing.
What would be profitable is to link generic concepts for conceptual multiscale modelling, as laid down in xMML files, to these coupling environments. For instance, one could compile xMML files into 'glue code' and 'wrapper code' for a specific environment, as we have demonstrated with Muscle [28,33]. Such technology would mitigate much of the aforementioned complexity in relation to coupling together single-scale components and scale-bridging code snippets. In our experience, such high level inter-operability does not necessarily induce overheads (e.g. [29]), and even if so, we believe that the benefits in terms of maintainability and extensibility of complex multiscale models would outweigh such trade-offs. Moreover, this approach also allows for straightforward 'plug-and-play', where, for instance, different implementations of a single-scale model, that may perform optimally on different architectures, are quickly interchanged.
Finally, at the bottom of the hourglass, when simulating the multiscale model as efficiently as possible on available computing resources, we find again ample room for generic solutions that would alleviate the burden of mapping a complex multiscale simulation in the most efficient way to hardware. Certainly when the targeted hardware consists of HPC machines, distributed systems, cloud environments or a combination of all these, and when issues like load balancing, advanced reservations, fault tolerance and energy efficiency need to be taken into account and optimized, we believe that generic solutions, to which the coupling libraries discussed above could interface, would be highly beneficial.
To do so we have proposed the concept of Multiscale Computing Patterns (MCPs), which are generic and recurring call sequences at the level of the single-scale components of a multiscale simulation [31]. It is beyond the scope of this manuscript to discuss the details of MCPs. It suffices to note that we proposed three such patterns, which capture most, if not all, multiscale simulations. We have proposed and implemented a pilot of MCPs software that takes as input the generic xMML description of a multiscale model, in combination with performance measurements of the components that make up the multiscale model (both in terms of computational performance and energy usage on massively parallel machines), and then interfaces with middleware, such as in our case QCG [40], to propose an optimal mapping of the multiscale simulation to available hardware, and to schedule and finally execute the simulation [41]. Note that such a generic deployment of the multiscale simulation only requires knowledge of the control structure of the multiscale model, as available via the xMML, and detailed knowledge of the performance of the single-scale models (both in terms of execution time, but also in terms of e.g. energy usage). If this is available for a range of (heterogeneous) architectures, the MCPs software can select, in a generic way and for each type of pattern, an optimal way to perform the multiscale simulation, given available hardware and even considering expected queuing times for that hardware [41]. Although these are very recent developments, in our view such generic solutions for scheduling and executing multiscale simulations on complex computing ecosystems consisting of large HPC machines, distributed environments and clouds, and interfacing these solutions to the most popular multiscale coupling libraries and to the most relevant scheduling environments and queuing systems, could be extremely beneficial to users, who no longer need to worry about this time-consuming and complex phase of multiscale computing.

Towards the exascale (a) Strong and weak scaling
As the clock speeds of individual cores are no longer increasing, emerging exascale machines can only reach their anticipated exaflop/s computational speeds by aggregating larger and larger numbers of nodes and cores. As a result, these machines are becoming 'fatter', not faster. We have argued before that as a result of this we are approaching the limits of what is achievable using monolithic codes [31]. Our argument was that 'because the parallelism is usually applied to the spatial domain, we are increasingly simulating larger slabs of matter and bigger chunks of the Universe, applying weak scaling by using more particles, a higher grid resolution or more finite elements. Yet it often is the temporal behaviour that one is really interested in, and that behaviour is not extended by adopting larger computers of this nature, or by making the problem physically larger. Since the scientific problems of interest usually have time scales which scale as a nonlinear function of the volume of the system under investigation, each temporal update requires more wall clock time for larger physical problems. This is in fact a recipe for disaster: we are not getting closer to studying large space and long time behaviour with monolithic codes'. We consider this to be a cogent argument to suggest that monolithic codes could be the exception on exascale machines, as weak scaling may not produce the desired results. If so, we should invest in developing other scenarios, including high-performance multiscale computing, or scenarios where monolithic codes (or coupled multiscale codes for that matter) are repeated over and over again in ensembles, e.g. in parameter sweeping scenarios or for uncertainty quantification (see also Portegies Zwart, who recently made a similar argument [27]).
We will now further analyse this argument by looking in detail at different scaling scenarios, relying on the Scale Separation Map as introduction in figure 2 and high-level modelling of parallel performance. If we would assume that we can continue using emerging exascale machines efficiently by relying on weak scaling, we basically hope that much larger systems can be simulated. And that this is possible without new additional algorithmic strategies, but by just relying on the huge additional computing power offered by such exascale machines. Figure 5 shows different scaling scenarios on a scale map. For instance, in climate modelling, the system size L is likely to stay the same (the full Earth), as well as the time span of the simulation T. But accuracy may be improved, which requires reductions in t and x. Decreasing t would imply more iterations to reach the time span T, and therefore our argument holds.
For other applications, T might already be sufficient, but L needs to be increased. As an example, one may consider the transport of sediments in a river over a given time period. The capability to consider a longer stretch of the river is important. Also, when computing blood flow, we may want to add more vessels in the simulation, thus increasing L. In this case, the argument above would not hold and weak scaling remains a good option.
It is likely, however, that both L and T need to be increased in proportion because phenomena at a larger time scale usually also imply a larger spatial scale. It is in such scaling scenarios, which we believe are most common, where we may be reaching the limits of what is achievable by just increasing the number of processors, as we need to do in order to reach exascale performance levels.
We will now explore the limits of increasing L and/or T, assuming that a much larger computer becomes available. We will show that for some applications exascale machines can bring a substantial gain, but for others scaling is limited, and new multiscale techniques need to be developed to exploit the additional computing power. Note that here we mainly focus the discussion on increasing L or T, but the same approach can be used to study the possibility to decrease t or x to improve accuracy.
Let us consider a parallel code whose execution time T par reads where W is the total work (e.g. the number of operations required to solve the problem), p the number of cores, R the speed of the core (number of operations per second) and Ω the overhead resulting from the parallelization. The efficiency E is then Note that the quantity pRΩ/W is the total fractional overhead [52]. In all generality, we can write showing that the effective power of the processors is reduced by a factor equal to the efficiency of the parallel implementation. The relation W = W(p) that guarantees that E stays constant is called the isoefficiency function. Weak scaling is achieved by increasing W as p increases, typically according to the isoefficiency relation. Strong scaling is achieved by increasing p while keeping W constant. In the former case, one solves a larger problem on more processors, within about the same time T par . In the latter case, one expects to solve a given problem faster by using more processors.
For instance, for an iterative stencil-based calculation on a three-dimensional mesh, one has a communication overhead between each subdomain, at each iteration. Let us assume that we have n iterations and that w = W/n is the work per iteration, which is proportional to the sum of the volumes of the subdomains. The communication overhead is then proportional to the boundary of one subdomain, namely to (W/(np)) 2/3 = (w/p) 2/3 . Thus, where C is a constant depending on the speed of the interconnection network. Thus Ω = nC(w/p) 2/3 and The isoefficiency function is then W(p) ∼ p. Many scientific applications correspond to the time evolution of a spatial quantity. The typical spatial and temporal scales that can be resolved depend on the space and time discretization x and t, as well as L, the spatial extension of the system, and T, the duration of the simulation. Spatial and temporal scales outside this interval will not be resolved. One can estimate the CPU time required for such a simulation. The total work W is typically where β is a proportionality factor and d is the spatial dimension. The number of iterations is n = T/ t and the work per iteration is If we again assume a stencil-based calculation on a three-dimensional mesh, combining equations (3.3), (3.5)-(3.7) results in We can write this as (3.8) The above equation relates the physical time T that can be simulated within a parallel execution time T par as a function of the spatial extension L of the system and the number of processors p. This equation T = T(L) can be seen as an iso-performance relation, or an iso-T par curve. Let us now take specific values for the performance model given in equation (3.8) (table 1). Here we assume a Lattice Boltzmann (LB) simulation 1    size, resolved at 10 µm, for two heart beats. Time resolution is 10 −5 s, and we assume cores able to perform 0.4 × 10 9 double precision operations per second. A typical LB iteration requires β = 200 arithmetic operations and the exchange of typically 20 populations of 8 bytes. Thus, we set C = 1 GB s −1 × 20 × 8 = 1.5 × 10 −7 . Figure 6 show the iso-performance curve (3.8), assuming that we consider a computation of T par = 1.5 days. These curves indicate how L and T can be varied for the same T par , while keeping x and t fixed.
When getting a faster machine, with m times more processors, we see that we can consider the same physical time T, and increase the system size by the expected factor m 1/3 . This is weak scaling. On the other hand, one can also keep the same system size and increase the simulation time T. This corresponds to a strong scaling situation, except that here we keep T par = 1.5 days constant and compute a larger time scale. As we can see from figure 6, the potential is less significant, although still interesting. For instance, taking p = 1 000 000 and the same L allows us to reach T = 10 3 s, that is an increase of a factor 500, with respect to 1000 times more processors. With p = 1 000 000 000, the value of T becomes 2 × 10 5 , hence an increase of a factor 10 5 with 10 6 additional processors. We also notice that these iso-curves stop when L becomes too small, corresponding to the situation where each processor would hold less than one data element.
It is, however, usually more likely that both L and T need to be increased. Typically, when analysing convective processes, L and T must grow proportionally (for a diffusive process, L would grow as √ T). Figure 6 shows with a dotted line an increase of L and T by the same factor (up to a 1000). Its intersection with the exascale iso-performance line gives T = 8 × 10 1 s and L = 4 × 10 −1 m. This is a 40 times longer simulation time, for a system 40 times larger, thus with 40 3 = 64 000 more mesh points. Therefore, if such a constraint holds between L and T, the possibility to reach larger physical time scales is very limited, even with a one millionfold performance increase. This is in line with our original qualitative argument that weak scaling starts to break down if we only increase processor numbers without increasing processor speeds.
Next we consider another application that scales less well than LB models. We assume an N-body problem with long-range forces. We consider a O(N 2 ) implementation, assuming that the Barnes-Hut algorithm does not apply here because the particles are on average too close to each other. The sequential execution time is where the first term corresponds to the computation of the pairs interaction forces (e.g. gravity needs 11 operations) and the second is the time integration (e.g. Verlet requires six operations). R is the speed of a core, T/ t is the number of time steps. 2 With p cores and N/p particles per core, the force computation requires an all-to-all communication, which can be realized with p − 1 communications involving all cores P i in parallel. At stage k, k = 1, . . . , p − 1, P i sends 3N/p coordinates to P i+k modulo p. The resulting communication time is  Thus, the parallel time T par is and, from this relation, we obtain This relation links the number of iterations T/ t that are possible with N particles within a computational time T par . We call this relation the T par iso-line. Following the same approach as in the previous case, we show in figure 7 this iso-line for different scales of computing resources.
Here, we took R = 10 9 s −1 , C = 10 9 s. The T par iso-lines stops when the number of particles per core reaches 1. From figure 7, we see first that adding more cores allows one to consider more particles, but not in proportion to the increase of computing power. Second, the possibility of running the simulation to larger physical time is rather limited. And worse, the maximum possible number of iterations may even decrease as the number of cores increases, as is the case for the reference simulation (N, T/ t). This behaviour is obviously due to the poor scaling of an all-to-all communication and suggests that the exascale may not allow one to solve larger N-body problems with a simple brute force approach.
These results indicate that some single-scale monolithic applications (e.g. lattice Boltzmann solvers) still have good potential to exploit exascale computers, but for others (e.g. N-body simulations) exascale may not bring any benefit to simulate larger systems for a longer physical time. Smarter than brute force approaches are thus needed, such as for instance multiscale simulation where scales are separated within different sub-models.

(b) Multi-scaling
One way out of this dilemma is by taking highly efficient monolithic codes that perform well on the petascale, and combining them in loosely coupled ways to reach exascale performance. We call this multi-scaling, where 'scale' in this case does not refer to spatial or temporal scales, but to processor scales. We can identify a few scenarios for multi-scaling (figure 8).
One could run multiple instances of the code, e.g. in replica computing (parameter sweeping) applications or uncertainty quantification. One could also create a multi-process application, effectively coupling different processes together that overlap in spatio-temporal scales. Finally, one could create multiscale applications, coupling single-scale monolithic codes on different spatio-temporal scales. And, of course, combinations of these scenarios are possible, e.g. performing uncertainty quantification on a multiscale model [49].
It would be wise to explore further these types of applications and better understand how multi-scaling could help unleash the power of future exascale machines, but at the same time also face the challenges in relation to energy consumption and fault tolerance. We have already seen convincing examples, e.g. using the multiple instances approach, in replica computing for calculating binding affinaties [53,54].

Conclusion
Multiscale computing has turned into a mature paradigm, supported by many communities that have invested in production software. Yet, we see room for further development of generic solutions in conceptual multiscale modelling as well as in executing multiscale simulations. When seen in the light of the anticipated need for multi-scaling as opposed to weak scaling in order to reach exaflop/s performance, we conclude that such developments are very much needed for the optimal exploitation of future exascale HPC systems.
Data accessibility. This article has no additional data. Authors' contributions. A.G.H. conceived the research that led to this discussion paper and wrote the first version of the manuscript; P.C. formulated the original weak scaling argument, and B.C. developed the scaling analysis; all authors contributed to the content of the manuscript and in revising the manuscript.
Competing interests. We declare we have no competing interests.