Digital blood in massively parallel CPU/GPU systems for the study of platelet transport

We propose a highly versatile computational framework for the simulation of cellular blood flow focusing on extreme performance without compromising accuracy or complexity. The tool couples the lattice Boltzmann solver Palabos for the simulation of blood plasma, a novel finite-element method (FEM) solver for the resolution of deformable blood cells, and an immersed boundary method for the coupling of the two phases. The design of the tool supports hybrid CPU–GPU executions (fluid, fluid–solid interaction on CPUs, deformable bodies on GPUs), and is non-intrusive, as each of the three components can be replaced in a modular way. The FEM-based kernel for solid dynamics outperforms other FEM solvers and its performance is comparable to state-of-the-art mass–spring systems. We perform an exhaustive performance analysis on Piz Daint at the Swiss National Supercomputing Centre and provide case studies focused on platelet transport, implicitly validating the accuracy of our tool. The tests show that this versatile framework combines unprecedented accuracy with massive performance, rendering it suitable for upcoming exascale architectures.


Methods
The computational framework used in this study simulates the particulate nature of blood, i.e. blood cells submerged in blood plasma. In more detail, we propose a high-performance computational framework for fully resolved 3D blood flow simulations. The tool is based on three computational modules, namely the fluid solver, the solid solver and the fluid-solid interaction (FSI). The fluid solver is based on the lattice Boltzmann method (LBM) [1] and uses Palabos [2,3] for the implementation of LBM. Palabos stands for Parallel Lattice Boltzmann Solver and is an open-source software maintained by the Scientific and Parallel Computing Group (SPC) at the University of Geneva. The FSI is handled via an immersed boundary method known as the multidirect forcing scheme, implemented in Palabos. Lastly, the solid solver is based on the nodal projective finite elements method (npFEM) [4] and uses the npFEM library developed and maintained by the Palabos team on top of the open-source ShapeOp library [5]. The npFEM library is integrated into the Palabos core library.
In this section, a brief outline of the methods is given for the sake of completeness of the article. For a thorough presentation of the numerical methods, the interested reader should consult Kotsalos et al. [4].

Nodal Projective FEM (npFEM)
The deformable blood cells are simulated by the nodal projective finite elements method. The npFEM is a mass-lumped linear FE solver that resolves both the trajectories and the deformations of blood cells with high accuracy. The solver captures the rich, non-linear viscoelastic behaviour of red blood cells, as shown and validated in [4]. The rest of the blood cells, platelets for the current study, are simulated as nearly rigid bodies by increasing the stiffness of the material in the solid solver.
Let us assume a surface mesh consisting of n vertices with positions x ∈ R^{n×3} and velocities v ∈ R^{n×3}. The evolution of the body (trajectory & deformed shape) in time follows Newton's laws of motion. At time t_n, the system is described by {x_n, v_n}. The external forces F_ext ∈ R^{n×3} are due to the fluid-solid interaction, while the internal forces F_int ∈ R^{n×3} are due to the material properties. The internal forces are given by

F_int(x) = −∑_i ∇E_i(x),

where E_i(x) is a scalar discrete elemental potential energy (the summation of all elemental potential energies results in the total elastic potential energy of the body). The implicit Euler time integration leads to the following advancement rules:

x_{n+1} = x_n + h v_{n+1},   (1)
v_{n+1} = v_n + h M^{−1} (F_int(x_{n+1}) + F_ext − C v_{n+1}),   (2)

where M ∈ R^{n×n} is the mass matrix, h is the time step, subscripts n and n+1 refer to times t and t + h, respectively, and C ∈ R^{n×n} is the damping matrix acting as a proxy for viscoelasticity. In addition to Rayleigh damping, we augment the viscoelastic behaviour through another proxy called position-based dynamics damping [6] (more details in [4]). The mass matrix is built by lumping the total mass of the body (including the interior fluid) on the mesh vertices, resulting in a diagonal structure. We should mention as well that the implicit nature of the npFEM solver renders it capable of resolving very large deformations with unconditional stability for arbitrary time steps. Other integration schemes, such as explicit Runge-Kutta, have been tested for even higher accuracy; however, since we are interested in the collective behaviour of many blood cells, this enhanced accuracy in the trajectory of a single blood cell did not affect the overall behaviour. Equations (1) & (2) can be combined and converted into the optimisation problem

x_{n+1} = argmin_x [ 1/(2h²) ||M̃^{1/2} (x − y_n)||_F² + ∑_i E_i(x) ],   (3)

where M̃ = M + hC, y_n = x_n + h M̃^{−1} M v_n + h² M̃^{−1} F_ext and || · ||_F is the Frobenius norm.
Indeed, setting the derivative of equation (3) to zero (thereby minimising the objective function, denoted g), we recover Newton's second law of motion. The solution to the above optimisation problem gives the system's state at the next time step, i.e. {x_{n+1}, v_{n+1}}, and thus resolves the trajectory and deformed state of the body at time t + h. The steps to minimise equation (3) are thoroughly described in Kotsalos et al. [4]; roughly, a quasi-Newton approach is employed that converges to x_{n+1} through a sequence of iterates (k = 1, ...)

x^{(k+1)} = x^{(k)} − H^{−1} ∇g(x^{(k)}),   (4)

where H is an approximation of the Hessian of g(x_{n+1}). The Hessian approximation builds on the observation that most of the potential energies describing the behaviour of a red blood cell are of quadratic form [4]. The theory introduced above is not restricted to membranes; it can straightforwardly resolve fully 3D bodies, i.e. bodies discretised by tetrahedra. This section gave a broad overview of the topic; the main novelty of this approach is a cleaner description of elasticity from a variational point of view.
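The variational time stepping above can be illustrated with a minimal sketch: a 1D chain of vertices connected by Hookean springs stands in for the elemental membrane energies, and one implicit Euler step is performed by minimising the objective of equation (3) with the constant-Hessian iteration of equation (4). All parameter values (spring stiffness, damping, step size) are illustrative assumptions, not the npFEM material model.

```python
import numpy as np

n_verts, h, k_spring, rest = 5, 1e-2, 100.0, 1.0
M = np.eye(n_verts)                   # lumped (diagonal) mass matrix
C = 5.0 * np.eye(n_verts)             # Rayleigh-like damping proxy (assumed value)
Mt = M + h * C                        # M~ = M + hC

def grad_E(x):
    """Gradient of the total spring energy E(x) = sum_i k/2 (x[i+1]-x[i]-rest)^2."""
    g = np.zeros_like(x)
    d = k_spring * (np.diff(x) - rest)
    g[:-1] -= d
    g[1:] += d
    return g

def hess_E():
    """Exact (constant) Hessian of the quadratic spring energy."""
    K = np.zeros((n_verts, n_verts))
    for i in range(n_verts - 1):
        K[i:i+2, i:i+2] += k_spring * np.array([[1.0, -1.0], [-1.0, 1.0]])
    return K

def step(x, v, F_ext):
    """One implicit Euler step via minimisation of the objective in equation (3)."""
    y = x + h * np.linalg.solve(Mt, M @ v) + h**2 * np.linalg.solve(Mt, F_ext)
    H = Mt / h**2 + hess_E()          # quasi-Newton Hessian approximation
    x_new = x.copy()
    for _ in range(10):               # iterates of equation (4)
        g = Mt @ (x_new - y) / h**2 + grad_E(x_new)
        x_new -= np.linalg.solve(H, g)
    return x_new, (x_new - x) / h     # equation (1) recovers the velocity

x = np.linspace(0.0, 4.0, n_verts) * 1.1   # slightly stretched chain
v = np.zeros(n_verts)
for _ in range(200):
    x, v = step(x, v, np.zeros(n_verts))   # chain relaxes towards rest spacing
```

Because the toy energy is quadratic, H is the exact Hessian and the iteration converges in one step; for the red blood cell energies it is only an approximation, which is precisely the quasi-Newton setting of [4].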

Lattice Boltzmann Method (LBM)
The lattice Boltzmann method (LBM) is employed to solve the Navier-Stokes equations indirectly. Virtual particles, also known as populations, move on a regular grid/lattice and collide at the lattice nodes. LBM is a second-order accurate solver for the weakly compressible Navier-Stokes equations, where the weak compressibility refers to errors that grow as the Mach number tends to one [1].
Let us consider a 3D incompressible fluid flow with density ρ and kinematic viscosity ν. The fluid domain is discretised into a regular grid with spacing ∆x in all directions. The fluid is represented as a group of fictitious particles that reside at the lattice sites and move to the neighbouring nodes along a fixed set of discrete directions with given discrete velocities at discrete time steps ∆t [7]. For this study, we use a D3Q19 stencil, where the fluid populations at each site are allowed to move to its 19 immediate neighbours with 19 different velocities {c_i}. The evolution of the populations f_i,

f_i(x + c_i ∆t, t + ∆t) − f_i(x, t) = Ω_i(x, t),   (5)

describes the collision and propagation of the mesoscopic particle packets. In all our simulations we use the linearised single-relaxation-time BGK [8] formulation for Ω_i,

Ω_i(x, t) = −(∆t/τ) [f_i(x, t) − f_i^eq(x, t)],   (6)

where τ is the relaxation time and f_i^eq the local equilibrium distribution. The macroscopic properties of the fluid, such as the density ρ, velocity u and hydrodynamic stress tensor σ, can be recovered directly at each lattice node and per time step from the moments of the distribution functions f_i. External forcing terms (like the FSI force f_imm) can be incorporated in the LBM through a modification of the collision term Ω using the Shan-Chen forcing scheme [9]. Important quantities in LBM are the lattice speed C = ∆x/∆t, the relaxation time τ and the fluid speed of sound C_s = C/√3; they are coupled through the physical kinematic viscosity ν as

ν = C_s² (τ − ∆t/2).   (7)

A particular point of interest is that, in lattice Boltzmann simulations, once the spatial discretisation and the relaxation time are fixed for a given fluid (known viscosity), there is no freedom in the choice of the time discretisation: it is dictated by the diffusive scaling formula (7). If ∆t is not defined through equation (7), the physics is violated. It can be proved that the updating rule (5) (with BGK for Ω) recovers the incompressible Navier-Stokes equations to second order in both space and time.
Since the LBM is obtained by the linearised expansion of the original kinetic theory-based LB equation, the resulting macroscopic variables converge to the Navier-Stokes equations with order Ma² [7], where Ma = u_max/C is the computational Mach number and u_max is the maximum simulated velocity in the flow. Therefore, for numerical accuracy and stability, it is required that Ma ≪ 1. For a complete overview of the lattice Boltzmann method, the reader should consult the works of Krüger et al. [1,10] and Feng et al. [7].
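The constraint imposed by equation (7) can be made concrete with a short sketch: fixing ∆x and the dimensionless relaxation time τ* = τ/∆t for a fluid of known viscosity pins down ∆t, and the resulting lattice speed bounds the admissible flow velocities through Ma. The numerical values (lattice spacing, plasma viscosity, peak velocity) are assumptions for illustration only.

```python
def lbm_time_step(dx, nu, tau_star):
    """dt dictated by nu = C_s^2 (tau - dt/2), with C_s = dx/(sqrt(3) dt)
    and tau = tau_star * dt; i.e. dt = (tau_star - 0.5) * dx**2 / (3 * nu)."""
    return (tau_star - 0.5) * dx**2 / (3.0 * nu)

dx = 0.5e-6          # lattice spacing in m (assumed, typical for cellular flow)
nu = 1.2e-6          # kinematic viscosity of plasma in m^2/s (approximate)
tau_star = 1.0       # relaxation time in lattice units (assumed)

dt = lbm_time_step(dx, nu, tau_star)      # note the diffusive scaling dt ~ dx^2

# Sanity check: plugging dt back satisfies nu = C_s^2 (tau - dt/2)
C_s2 = dx**2 / (3.0 * dt**2)
assert abs(C_s2 * (tau_star * dt - 0.5 * dt) - nu) < 1e-12 * nu

u_max = 0.01                     # m/s, assumed peak velocity in the channel
Ma = u_max / (dx / dt)           # computational Mach number; must stay << 1
```

Halving ∆x here quarters ∆t, which is why refining the fluid grid is so costly in time-to-solution and why the performance of the coupled framework matters.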

Immersed Boundary Method (IBM)
The coupling between blood cells (solid phase) and blood plasma (fluid phase) is realised through the immersed boundary method (IBM). Essentially, IBM imposes a no-slip boundary condition, so that each point of the surface and the ambient fluid move with the same velocity. The key idea is that the surface of the deformable object is viewed as a set of Lagrangian points, which do not have to conform with the fluid mesh/lattice. The fluid feels the presence of the body only via a force field f imm . The interaction between the off-lattice marker points (Lagrangian points) and the fluid is done via interpolation stencils, i.e. regularised Dirac delta functions. There are many variations of IBM [11] and in this study we are using a modified version of the multi-direct forcing method introduced by Wang and colleagues [12], in a form close to the one proposed by Ota et al. [13].
The goal is to compute a forcing term f_imm along the immersed boundaries and apply it as a body force to the fluid (Shan-Chen forcing scheme). Let us denote by x and X_k the Eulerian and Lagrangian points respectively, by U_k the velocity of the Lagrangian point k as computed by the solid solver (npFEM), and by u*(x, t) the fluid velocity as recovered from the distribution functions f_i. Using an interpolation stencil W, the velocity at the Lagrangian point X_k is given by

U*(X_k) = ∑_x u*(x, t) W(x − X_k) ∆x³,

where the interpolation stencil is the φ4 kernel with support 4∆x [14].
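The interpolation step can be sketched as follows: the 1D φ4 kernel is applied as a tensor product over the 4×4×4 lattice sites surrounding each off-lattice marker. This is a minimal illustration assuming a dense velocity array and a marker well inside the domain; array layout and names are not the Palabos API.

```python
import numpy as np

def phi4(r):
    """Peskin's 4-point kernel (argument in lattice units, support 4)."""
    r = abs(r)
    if r < 1.0:
        return (3.0 - 2.0*r + np.sqrt(1.0 + 4.0*r - 4.0*r*r)) / 8.0
    if r < 2.0:
        return (5.0 - 2.0*r - np.sqrt(-7.0 + 12.0*r - 4.0*r*r)) / 8.0
    return 0.0

def interpolate_velocity(u, X, dx):
    """Interpolate the Eulerian velocity u (shape [nx, ny, nz, 3]) at the
    Lagrangian point X via the tensor-product phi4 stencil; the support
    (4x4x4 sites) is assumed to lie inside the grid."""
    Xl = X / dx                              # marker position in lattice units
    base = np.floor(Xl).astype(int)
    U = np.zeros(3)
    for i in range(base[0] - 1, base[0] + 3):
        for j in range(base[1] - 1, base[1] + 3):
            for k in range(base[2] - 1, base[2] + 3):
                w = phi4(Xl[0] - i) * phi4(Xl[1] - j) * phi4(Xl[2] - k)
                U += w * u[i, j, k]
    return U

dx = 1e-6
u = np.zeros((12, 12, 12, 3))
u[...] = [1.0, 2.0, 3.0]                     # uniform flow field
U = interpolate_velocity(u, np.array([5.3, 5.7, 5.1]) * dx, dx)
```

Because φ4 is a partition of unity with a vanishing first moment, constant and linear velocity fields are interpolated exactly, which is what makes the regularised delta a faithful stand-in for the sharp no-slip condition.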
The body force f_imm is determined by the following iterative procedure:
Step 0. Initial estimate of the body force at the Lagrangian points,

f(X_k) = (U_k − U*(X_k))/∆t · (A_k ∆x)/∆x³,

where A_k is the surface area associated with the Lagrangian point k.
Step 1. Distribute the force estimate from the Lagrangian points X_k to the neighbouring lattice sites within the range dictated by the interpolation kernel φ4. Correct the fluid velocity as u* + f_l ∆t and re-interpolate the corrected velocity at the Lagrangian points.
Step 2. Update the body force at the Lagrangian points X_k again and repeat from Step 1.
The above iterative procedure can be repeated for as many cycles as the application demands; for simulations with multiple blood cells, even one cycle gives satisfactory results. The body force computed on the Eulerian grid in the last cycle is f_imm. The immersed body force f_imm could be used as F_ext for the solid solver. However, we make use of the hydrodynamic stress tensor for higher accuracy: to compute the force on the Lagrangian point k, we project the stress tensor σ onto the surface normal n_k.
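The multi-direct forcing cycles can be sketched in one dimension: interpolate the fluid velocity at each marker, estimate the force from the velocity mismatch, spread it back through the kernel and correct the fluid. For simplicity the per-marker weight A_k ∆x/∆x³ is taken as 1 (marker spacing comparable to ∆x) and markers are handled sequentially; this is an assumption of the sketch, not of the method.

```python
import numpy as np

def phi4(r):
    """Peskin 4-point interpolation kernel (1D, lattice units)."""
    r = abs(r)
    if r < 1.0:
        return (3.0 - 2.0*r + np.sqrt(1.0 + 4.0*r - 4.0*r*r)) / 8.0
    if r < 2.0:
        return (5.0 - 2.0*r - np.sqrt(-7.0 + 12.0*r - 4.0*r*r)) / 8.0
    return 0.0

def multi_direct_forcing(u, X, U, dt=1.0, cycles=3):
    """Iteratively build the body force f_imm so that the fluid velocity
    interpolated at the markers X approaches the solid velocities U."""
    u = u.copy()
    f_imm = np.zeros_like(u)
    sites = np.arange(len(u))
    for _ in range(cycles):
        for Xk, Uk in zip(X, U):
            w = np.array([phi4(Xk - i) for i in sites])
            Uk_star = w @ u                  # interpolate u* at X_k
            fk = (Uk - Uk_star) / dt         # Step 0 / Step 2: force estimate
            f_imm += fk * w                  # Step 1: spread to the lattice
            u += fk * w * dt                 # correct the fluid velocity
    return u, f_imm

# A uniform stream past two stationary markers (no-slip targets U = 0):
u0 = np.ones(20)
X = np.array([8.3, 11.6])    # off-lattice marker positions (lattice units)
u, f_imm = multi_direct_forcing(u0, X, np.zeros(2))
```

Each cycle shrinks the residual slip at the markers by a constant factor, which is why a single cycle already suffices in dense suspensions while more cycles sharpen the no-slip condition further.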
We should highlight that in the classic immersed boundary method the same fluid covers the whole computational domain. However, this contradicts the existence of interior fluid in blood cells, which has a different viscosity and density than the exterior blood plasma. Our approach is to completely disregard the interior fluid and view it as an artefact of IBM. To do so, we compute F_ext on the solid phase by summing only over the points x that lie outside the bodies. The effect of the interior fluid of the blood cells is implicitly taken into account by the npFEM solver, which considers both its incompressibility and viscosity as additional terms to the viscoelasticity of the membrane. Thus, the bodies are simulated as entities that encapsulate all the different components of blood cells [4].

Particle In Kernel (PIK) technique
The correct handling of colliding bodies, especially for flows at high hematocrit, is of paramount importance for both accuracy and stability. Bodies that approach each other very closely activate the collision framework, which has to assist or replace the hydrodynamic interactions because the latter are under-resolved. Additionally, IBM and the computation of external forces on the immersed bodies use interpolation kernels that are drastically affected when they are "contaminated" by "foreign" particles: the interpolated values then take into account regions that are not well resolved, e.g. the interior fluid of blood cells, which can lead to accuracy and stability problems. The Particle In Kernel (PIK) technique is essentially part of the collision framework and ensures that the correct forces are used at all times. Figure 1 gives a visual description of PIK for an investigated point x_i (dot) and two potential collision candidates (stars). If there is no "foreign" particle, i.e. a particle belonging to another body, inside the kernel (a sphere of diameter 4∆x in 3D), then the fluid resolution is sufficient to handle the flow field. Otherwise, we disregard the fluid force and the collision framework provides the force instead. The closest "foreign" particle is chosen as the colliding neighbour, and the desired location of x_i is denoted p_i (projection). Subsequently, a collision potential energy and the corresponding force are introduced in the solid solver. The threshold for disregarding the fluid forces can be either the boundary of the interpolation kernel or user-specified according to various factors, e.g. boundary conditions, hematocrit, flow field. This approach guarantees the use of only non-contaminated interpolation kernels and a clean collision framework. The technique affects only the computation of F_ext on the immersed boundaries and has no impact on the IBM steps.
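The PIK decision for a single vertex can be sketched as follows. The repulsive spring with stiffness k_col used here is an illustrative stand-in for the collision potential energy introduced in the solid solver; the function and parameter names are assumptions, not the framework's API.

```python
import numpy as np

def external_force_with_pik(x_i, body_i, F_fluid, particles, bodies,
                            dx=1.0, k_col=1.0):
    """Keep the fluid (IBM) force at vertex x_i of body body_i only if no
    vertex of another body lies inside the interpolation kernel (sphere of
    diameter 4*dx); otherwise replace it with a collision force pushing x_i
    towards its projection p_i (illustrative spring model)."""
    radius = 2.0 * dx                              # half the kernel support
    d = np.linalg.norm(particles - x_i, axis=1)
    foreign = (bodies != body_i) & (d < radius)
    if not foreign.any():
        return F_fluid                             # clean kernel: trust the fluid
    j = np.argmin(np.where(foreign, d, np.inf))    # closest "foreign" vertex
    n = (x_i - particles[j]) / max(d[j], 1e-12)    # push-apart direction
    p_i = particles[j] + radius * n                # desired (projected) location
    return k_col * (p_i - x_i)                     # collision force replaces fluid force

# Two vertices of different bodies closer than the kernel radius:
particles = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
bodies = np.array([0, 1])
F = external_force_with_pik(particles[0], 0, np.zeros(3), particles, bodies)
```

In the actual framework the switch feeds a collision potential energy into npFEM rather than returning a raw force, but the kernel-contamination test is the essential idea.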

Piz Daint @ CSCS
Piz Daint is the current flagship system of the Swiss National Supercomputing Centre (CSCS). This supercomputer is a hybrid Cray XC40/XC50 system. According to the TOP500 list (November 2019), Piz Daint is ranked 6th worldwide and is the most powerful system in Europe. The specifications of the supercomputer are the following: running the CUDA "deviceQuery" utility on an XC50 node, we obtain the characteristics of the GPU with which it is equipped (see Figure 5).

Performance Analysis
The goal of this section is to complement the performance analysis of the main article. Figure 2 summarises the weak-scaling case studies (we omit the cases at 45% hematocrit). Figure 3 presents the weak-scaling comparison between the hybrid version (npFEM on GPUs, LBM/IBM on CPUs) and the CPU-only version. Porting part of a library to GPUs can introduce extra data management, book-keeping and overhead from new data structures, so the efficiency of a ported hybrid library often degrades. As Figure 3 shows, this is not the case for our computational framework: we maintain the same performance, keeping the overhead at a minimal level. Strong scaling plays a critical role when there is an abundance of computational power; it also partially indicates the scalability of a tool and is thus an interesting metric where applicable. Figure 4 presents the strong-scaling speedup for a domain of size 50×1000×50 µm³. The speedup is given as

S = t_{N_0} / t_N,

where t_{N_0} is the time spent on N_0 GPU nodes and t_N on N GPU nodes. The speedup is good but not ideal; the reason is the extra book-keeping needed to preserve the modularity and genericity of the framework.
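For reference, the strong-scaling metrics above amount to the following small computation; the timings here are made-up placeholders, not measurements from Piz Daint.

```python
def speedup(t_n0, t_n):
    """S = t_{N0} / t_N, relative to the baseline node count N0."""
    return t_n0 / t_n

def efficiency(t_n0, n0, t_n, n):
    """Parallel efficiency: achieved speedup over the ideal ratio N/N0."""
    return speedup(t_n0, t_n) / (n / n0)

t = {1: 100.0, 2: 55.0, 4: 30.0}     # node count -> wall time in s (assumed)
S4 = speedup(t[1], t[4])             # speedup on 4 nodes vs 1 node
E4 = efficiency(t[1], 1, t[4], 4)    # fraction of ideal scaling retained
```

An efficiency below 1 quantifies the book-keeping overhead discussed above: ideal strong scaling would give S = N/N_0 exactly.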