Passively parallel regularized stokeslets

Stokes flow, discussed by G.G. Stokes in 1851, describes many microscopic biological flow phenomena, including cilia-driven transport and flagellar motility; the need to quantify and understand these flows has motivated decades of mathematical and computational research. Regularized stokeslet methods, which have been used and refined over the past 20 years, offer significant advantages in simplicity of implementation, with a recent modification based on nearest-neighbour interpolation providing significant improvements in efficiency and accuracy. Moreover this method can be implemented with the majority of the computation taking place through built-in linear algebra, entailing that state-of-the-art hardware and software developments in the latter, in particular multicore and GPU computing, can be exploited through minimal modifications (‘passive parallelism’) to existing Matlab computer code. Hence, and with widely available GPU hardware, significant improvements in the efficiency of the regularized stokeslet method can be obtained. The approach is demonstrated through computational experiments on three model biological flows: undulatory propulsion of multiple Caenorhabditis elegans, simulation of progression and transport by multiple sperm in a geometrically confined region, and left–right symmetry breaking particle transport in the ventral node of the mouse embryo. In general an order-of-magnitude improvement in efficiency is observed. This development further widens the complexity of biological flow systems that are accessible without the need for extensive code development or specialist facilities. This article is part of the theme issue ‘Stokes at 200 (part 2)’.


Introduction
Stokes flow describes the fluid mechanics of a vast range of microscopic life, for example the motility and generation of feeding currents by microorganisms, and organ cleansing, gamete/embryo transport and developmental patterning in higher organisms.Typically flow and locomotion involve the action of individual or multiple slender organelles termed flagella and cilia, with the effects of cell surfaces, surrounding cavities and sometimes free surfaces playing crucial roles through hydrodynamic interaction.
This biological relevance has motivated decades of research into what we now refer to as the Stokes flow equations, originally studied (with the inclusion of an unsteady term) by G.G. Stokes [1], which describe Newtonian fluid dynamics in the inertialess regime associated with microscopic length scales.The field is too broad to do justice to here, so we mention a few highlights, including Gray & Hancock's study [2] of the mechanism of sea urchin sperm propulsion and associated slender body theory (ref.[3], see also [4], [5] and [6]), Chwang & Wu's work on helical propulsion (ref.[7], see also [8] and [9]), Blake's squirmer model [10] and method of images [11] (see also [12,13,14,15,16,17]), Pedley, Kessler and colleagues' studies of algal suspension dynamics and gyrotaxis [18,19], the discovery of Stokes flow as the first leftright asymmetric event in vertebrate embryo development [20,21] (see also [22]), Cortez and colleagues' development of the method of regularized stokeslets [23,24], and Goldstein and colleagues' discoveries on algal cilia fluid mechanices, from single cell to colony scales [25].For further review see for example references [26,27,28,29,30].

Stokes flow and stokeslets
The Stokes flow equations describing the inertialess dynamics of an incompressible Newtonian fluid are where p = p(x, t) is pressure, u = u(x, t) is velocity, x is position, t is time and the constant µ is dynamic viscosity.These equations are typically accompanied by the no-slip, no-penetration condition u(X, t) = ∂ t X on boundary points X, along with u → 0 or u ∼ U ambient as |x| → ∞ in exterior flows.
The dynamic geometric complexity characterising many biological flows complicates both analytical and numerical solution approaches, however the linearity of equations (2) mean that solutions can be constructed by superposing 'fundamental' solutions in the absence of a boundary driven by point forces, moments, sources/sinks and higher order derivatives.For example, Hancock's [3] slender body theory was based on replacing the sperm flagellum with a line integral of point forces (which he termed stokeslets) combined with appropriately-weighted source dipoles to account for the finite radius of the flagellum.Similarly, the flow due to a translating sphere in Stokes flow can be decomposed exactly as a point force and source dipole.
Another key feature of equations ( 2) is the absence of an explicit time derivative, which means that flow is instantaneously determined by the boundary conditions.This absence results in the famous scallop theorem [31], i.e. that a swimmer must undertake a time-irreversible motion in order to achieve a net translation (see also the earlier movie of G.I. Taylor [32]); moreover a microscopic swimmer cannot 'kick and glide', continuous motion requires continuous expenditure of energy.From a computational perspective, the absence of a time derivative entails that the flow problem itself does not require time-stepping, although the time-evolving transport of suspended particles or migration of swimmers of course does.
Focusing on fundamental solutions, the Stokes flow equations with a finite but spatiallyconcentrated force located at y and pointing in the k-direction are where δ(x) is the three-dimensional Dirac delta distribution and êk is a unit basis vector pointing in the k-direction.Equations ( 4) have solution u = (8πµ) −1 (S 1k , S 2k , S 3k ) and p = (4π) −1 P k , where This Oseen tensor [33] or stokeslet [3] forms the basis for slender body theory [4,3], along with the boundary integral method [34,35,36].The boundary integral method enables the flow exterior to or bounded by a smooth surface B to be expressed as where f is the traction, i.e. the force per unit area exerted by the fluid on the surface, n is the unit normal pointing into the fluid, and T jk is the stress tensor −P k δ j + µ(∂ S jk + ∂ j S k ).The summation convection over repeated Cartesian indices is assumed above and throughout.
For problems satisfying B u • ndS y = 0, i.e. no change in volume, the second integral term ('double layer potential') in equation ( 7) can be eliminated (see for example [37]), motivating focus on the single layer boundary integral equation Using the fact that S jk (x, y) = S kj (y, x) and relabelling, equation ( 8) may be rewritten [36] The well-known advantages of the boundary integral are (1) reducing the computational domain from a 3D volume to a 2D surface (or possibly 1D line) B, and (2) the natural treatment of open boundaries without the need to make artificial truncations, enabling excellent computational accuracy and efficiency.For example, Phan-Thien et al. were able to model propulsion of a physiologically-shaped bull sperm cell in the mid-1980s, making use of machines with the order of 1 MB memory [35].
The standard numerical approach to the solution of equation ( 9) is to discretize the traction as where f [1], . . ., f [N ] are vector constants and φ 1 (y), . . ., φ N (y) are basis functions.In the simplest case, the latter may be piecewise constant functions defined on a partition B = B 1 ∪ . . .∪ B N via φ n (y) = 1 for y ∈ B n and φ n (y) = 0 otherwise.The semi-discrete numerical problem then reads The resistance problem (prescribed velocity u on the surface, unknown traction f ) can be solved as follows: equation ( 11) can be converted to a 3N × 3N linear system by applying point collocation, i.e. x = x[m] for m = 1, . . ., N where x[m] is the centroid of element B m .As discussed below, the swimming problem is a minor modification of the resistance problem involving the introduction of an unknown rigid body motion and additional force and moment balance equations.However, scientists who are not computational specialists may encounter two challenges in implementing the above approach: 1.The generation of a mesh, i.e. the connected, non-overlapping partition B 1 , . . ., B N of what may be a complex geometry.Typically this mesh includes a set of vertices, a connectivity table associating each element B n to 3 or 4 vertices, a mapping between local coordinates (ξ, η) in each element and global coordinates y and surface metric dS y .
2. The evaluation of the (weakly) singular integrals occurring in equation ( 11) when x = x[m] and n = m.These integrals involve terms which behave as r −1 as r → 0. A related difficulty occurs 'downstream' of the problem for the traction if subsequent computation of the velocity u(x) at a fluid point x close to B is required.

Regularized stokeslets
Issues (i) and (ii) are certainly not insurmountable, and codes such as the Fortran library BEMLIB [38] provide considerable assistance.However, given the scientific breadth of (often experimentally-focused or multidisciplinary) researchers working in biological fluid dynamics, there is great value in methods which avoid or alleviate them.One appealing strategy is to regularize the singularity occurring so that Sjk (x, x) < ∞, which immediately solves issue (ii).Additionally issue (i) can then be side-stepped by simply covering B with a set of points {x [1], . . ., x[N ]} (without the need for a connectivity table or local coordinates), and discretizing the integral directly with a quadrature rule, where ), i.e. absorbing the quadrature weight and metric into the unknown force.The linear system for the resistance problem then becomes, Equation ( 13) is referred to as a Nyström discretization of the integral equation [39].
There are clearly many ways to regularize the kernel, for example replacing |x − y| −1 with (|x − y| 2 + ε 2 ) −1/2 for some small parameter ε > 0. However, this ad hoc approach has the unwanted side-effect of producing potentially unphysical "solutions" that no longer satisfy the original equations, for example by violating the incompressibility condition ∇ • u = 0. Cortez and colleagues [23,24] proposed instead to start by regularizing the point force, i.e. to consider the exact solution to the Stokes flow equations with spatially-smoothed point forces, where φ ε (x) is a family of "blob" functions which approximates δ(x) as ε → 0. The particular choice is plotted in figure 2. This choice leads to the regularized stokeslet solutions It can be seen that P ε k (x) ∼ P k (x) and S ε jk (x, y) as ε → 0; moreover the corresponding single layer boundary integral equation is where p = 1 for x on or near B and p = 2 otherwise [24].Alternative forms for the blob function have been derived [17] to improve the order of the error, however the form ( 16) is by far the most commonly-used.The O(ε) regularization error associated with boundary collocation is inherited by the resistance problem for finding the traction associated with a given rigid body motion, and the swimming problem for finding the traction and translation/rotation resulting from a given boundary deformation and force/moment balance.It is distinct from the errors associated with discretization of the traction and numerical quadrature which we will consider shortly.
Regularization enables a particularly convenient discretization of the single layer boundary integral equation; this simplicity however comes at a cost.The Nyström method (13) corresponds to using identical discretizations for the traction f k (y) and S jk (x, y) when considered as functions of position y ∈ B (the latter for fixed x), despite the fact that the stokeslet kernel varies much more rapidly than the traction, as shown in figure 1, the variability becoming more rapid as ε is reduced (figure 2).The number of degrees of freedom of the system 3N and hence the 3N × 3N matrix size is tied to the discretization for the traction, and so reducing the regularization error via reducing ε entails rapid growth in memory and computational requirements associated with system assembly and solution.Cortez et al. [24] initially suggested a quadrature error O(h 2 /ε 3 ) where h is discretization length; more recently [40] this estimate has been improved to O(h 2 /ε) + O(P ε −1/P h 1−P ) for any integer P > 3.
As described in equations ( 10), (11), boundary element methods separate out the traction and quadrature discretization by expanding the traction in terms of basis functions, leaving the stokeslet integrals to be evaluated by the most appropriate means (e.g.adaptive quadrature) without affecting the number of degrees of freedom of the system.Alongside being the standard method for the classical boundary integral equation, this approach has been employed in the context of regularized stokeslets to model cilia-driven flow [41,42], autophoretic swimmers [43] and to explore the evolution of bacterial morphology [44].These studies would have been challenging or practically impossible via the Nyström method, which computational experiments suggest would have required above 3 orders of magnitude greater memory and processing time [45].A related idea of pre-integrating line distributions of regularized stokeslets to model cilia and flagella, which is closely related to slender body theory [45] has recently been extended to higher order basis functions [46,47], and to model flagellar elastohydrodynamics [48].However, problems involving surface integrals require true mesh generation and as such despite being suggested over 10 years ago [45], the 'element' approach has not been very widely adopted in comparison with the Nyström method for regularized stokeslets.
With the aim of preserving the meshlessness of the Nyström discretization but decoupling the traction discretization from the numerical quadrature, we recently [49] suggested an alternative approach based on an idea from meshless interpolation.The concept is to use two point cloud discretizations, one 'coarse force' set {x [1], . . ., x[N ]} which is sufficient to capture the variation of the traction, and which dictates the size of the linear system, and another finer quadrature discretization set {X [1], . . ., X[Q]} which has sufficient resolution to capture the rapidly-varying kernel S ε jk (x, y) for y in the vicinity of x, as in figure 1c.The force at a quadrature point f (X[q])dS(X[q]) is then approximated by its value at the nearest point on the coarse force set, f (x[n])dS(x[n]), where the index n is given by Denoting ν[q, n] = 1 and ν[q, n] = 0 for n = n, we may write where Defining h f to be the length scale associated with the traction points and h q to be the length scale associated with the quadrature points, equation ( 23) has a regularization error O(ε) and traction discretization error O(h f ).The quadrature error depends on whether the traction points are 'contained' (i.e.within distance O(ε) or closer) in the quadrature set.Associated to each traction point which is contained in the quadrature set is an error ) for all integer P > 3, the former term being dominant [40].Associated to each traction point which is no closer than distance δ ε > 0 to the quadrature points is an error O(h 3 q δ −2 ) + O(P h 1−2/P q ) [40].In the latter case there is no asymptotic dependence on ε in the limit ε → 0.
The "NEAREST" method ( 23) is not intended to compete directly with boundary element methods, still less techniques based on treecodes and fast multipole methods [50,51,52], rather it is aimed at providing a simple and accessible meshfree method for non-specialists that is capable of dealing with problems of moderate complexity, with relatively standard workstation hardware.As an example, the resistance matrix (forces and moments due to the translation and rotation of a sphere, with regularization parameter ε = 0.01 can be calculated to within 2% error with N = 864 and Q = 3456, taking 6 seconds on a notebook computer.The same computation with the Nyström method (N = Q = 3456) yielded a 5% error and required 350 seconds.Other early applications of the nearest neighbour discretization by our group have included the diffusion tensor of a macromolecular structure [49], cell motility [53] and embryonic nodal flow [54].
A further feature of this method is that equation ( 23) can be viewed as the product of a (dense) 3N × 3Q stokeslet matrix with a 3Q × 3N (sparse) nearest neighbour matrix and a 3N × 1 column vector of tractions, which naturally lends itself to implementation in numerical linear algebra software such as MATLAB or GNU Octave in terms of matrix-matrix and matrixvector operations.Limited use of an iterated loop to handle the 'Q' dimension in block prevents memory overflow in in this respect.The use of 'under the hood' numerical linear algebra enables ongoing progress in hardware and software optimizations -particularly those involving multicore and graphical processing unit (GPU) developments -to be exploited 'passively', i.e. with minimal changes to the code.In this article we present some experiments along these lines; it will be found that the use of a modest compute-GPU in constructing and solving swimming-type problems can lead to a reduction in the required computational time that can be in excess of an order of magnitude when using the nearest neighbour regularized stokeslet method.

Parallelising NEAREST
In a recent article [53], we showed how the trajectories of multiple flagellated swimmers with specified beat patterns can be calculated through formulating and solving an initial-value problem (IVP) of the form Ẏ = F (Y, t), where with X 0 (t) being the position of the swimmers at time t, and B(t) the matrix of basis vectors describing the body-frame of the swimmers.The rates Ẋ0 (t), and Ḃ(t) are determined from the linear system for the swimmers' translational velocity U (3 scalar unknowns per swimmer), angular velocity Ω (3 scalar unknowns per swimmer), and tractions f [•] (3N scalar unknowns per swimmer).This formulation allows for the use of built-in IVP solvers such as MATLAB's ode45, or GNU Octave's lsode.Approximately 90% − 99% of the computational time needed to solve such a problem using NEAREST is due to a combination, at each time step, of: 1. Construction of the stokeslet matrix multiplied by the nearest neighbour matrix (A s ), such as in (23); 2. Construction of the matrix (A) from pre-calculated matrix blocks; 3. Solution of the linear system.
The construction of the matrix A s can itself be decomposed into two steps: the calculation of the regularized stokeslet kernel at each force node with respect to each quadrature node (3N M × 3QM calculations for M swimmers), and, then, the multiplying of the stokeslet matrix with the nearest neighbour matrix ν[q, n].This is exactly the type of problem that is suited to GPU optimisation; a large number of small calculations can take advantage of the large number of processing cores available on modern GPUs, with relatively modest hardware enabling thousands of calculations to be performed simultaneously.Similarly, it is well known that there are significant gains to be made when using a GPU for large matrix construction [55], and with the NEAREST method we are able to exploit commercially optimised algorithms for solving linear systems on the GPU, such as the MATLAB "\" command.
With the aim of maintaining the simplicity of the original method we take a passively parallel approach whereby we do not rewrite the entire method to fully optimise GPU use, but make minimal changes that nonetheless bring an order of magnitude performance improvement.For detailed description of the previously-published code and algorithms, see refs.[49,53]; the entirety of the code changes to make use of an available compute GPU are as follows: 1. To construct the matrix A s , the vectors containing force and quadrature points (x and X respectively) are 'cast' to the GPU via the commands There is a small amount of computational overhead in transferring data between the CPU and GPU, but many MATLAB operators are 'overloaded' to work on both standard and GPU arrays without additional intervention by the user.Due to the use of x and X as gpuArrays, the constructed matrix A s is itself a gpuArray, and therefore the large matrix A is constructed on the GPU, again with no additional user intervention.
2. The MATLAB "\" command for solving the linear system works without modification on a gpuArray, outputting a gpuArray solution vector Ḟ .Currently the MATLAB built-in ODE solvers (such as ode45) are not currently GPU optimised, and so the solution must be 'gathered' from the GPU back to the CPU via the command In what follows we will investigate the effect that these minimal user changes can have on the run time of simulations, including an assessment of the savings for each of the three major costs (assembling A s , assembling A, linear solver), and how these changes impact the time required to solve 'real-world' problems of tracking 25 swimming Caenorhabditis (C.) elegans, 9 sperm swimming between two solid plate boundaries, and of tracking particle deposition in the euciliated embryonic node of the mouse.
Comparisons will be made between calculations on M1 (with and without GPU acceleration), and between M2 (CPU) and M3 (GPU).Although M2 and M3 form part of an HPC cluster, each calculation will be limited to a single computational node per simulation.
For benchmarking we take the swimming problem of a single model sperm from [53].The model sperm comprises a tail of length L = 50 µm, discretized as a line of stokeslets, and an ellipsoidal head with semi-axes of length 2 µm, 1.6 µm, and 1 µm, and is nondimensionalised with respect to the length-scale L. The beat pattern follows the planar activated beat of Dresdner and Katz [56].We discretize the model sperm tail using N t = 100 and Q t = 400, and an increasing number of traction points on the head N h with Q h = 4N h , to obtain a problem in terms of 3 (N h + 100) scalar degrees of freedom (sDOF), and corresponding 3 (4N h + 400) quadrature points.
In Figure 3 we show the time required for calculation of the model swimming sperm at the first time point t = 0.In these figures, the times shown are the mean time calculated over 100 runs.Figure 3a-c shows the total time taken for constructing A s , A and for solving the linear system, with the relative increases (M1 (CPU)) / (M1 (GPU)) and M2 / M3 shown in Figure 3d-f.In each case we see that for small numbers of sDOF (< 500) the CPU calculations show a modest performance increase over those with GPU acceleration.When increasing the number of sDOF, we see greater than an order of magnitude speed up for A s (maximum gains are a factor of 29 (M1) and 23 (M2/M3)), A (maximum gains are a factor of 19 (M1) and 20 (M2/M3)), and for solving the linear system on M2/M3 (34 times faster).When solving the linear system on M1 for > 500 sDOF, we see a factor of 2 increase when using the GPU as opposed to the CPU.This reduced increase is due to the availability of 24 CPU cores on M1, allowing for significant performance increases over M2.For very large numbers of sDOF (> 10 4 ) there is a slight reduction in computational gains when using the GPU machines.This is due to the limited amount of GPU memory available, meaning additional overhead in transferring data between the CPU and GPU. Figure 3g,h shows the total time taken to construct and solve the swimming problem for the first time step t = 0.Here same pattern occurs, with an order of magnitude speed up when using the GPU for problems with > 10 3 sDOF, with a maximum 21 times increase for M1, and 18 times increase for M2/M3.

Multiple swimming C. elegans
In the first of our 'real-world' tests, we simulate an array of 25 model C. elegans over three beat cycles (see Figure 4b).The swimmers are discretized as a line of stokeslets with 75N sDOF, and corresponding Q = 25 × 4N vector quadrature points.The C. elegans beat pattern follows that of Thomases and Guy [57], illustrated for a single swimmer in Figure 4a.The time taken to solve this swimming problem when N = 2 n (n = 3, 4, . . ., 7) is shown in Figure 4c, with the speed up when using the GPU on M1, or using M3 instead of M2, shown in Figure 4d.As with the single iteration calculations in §4.1, when using a reasonable number of points (> 96 sDOF per swimmer) we see an order of magnitude speed up in calculations.When using 384 sDOF to discretize each swimmer, the real-world time required is just over 4.5 hours when using the CPU of M2, reducing down to around 15 minutes on the GPU of M3.(d) -(f) The relative speed increase when incorporating the GPU for M1, and when using M3 instead of M2 (higher is better).(g) The total time to solve a single iteration for the single sperm on M1 (CPU), M1 (GPU) (lower is better), M2 and M3.(h) The relative speed increase when incorporating the GPU for M1 and M3 instead of M2 (higher is better).

Multiple swimming sperm between parallel plates
Regularized stokeslet methods enable solid boundaries to be taken into account via the inclusion of additional surface distributions.In the present case we consider the situation of two parallel plane walls approximating a microscope slide and cover slip.Between these, we simulate an array of 9 model sperm, again using the model of Dresdner and Katz [56], initialised with varying wave number k ∈ [2π, 4π] and phase φ ∈ [0, 2π], and solved for three beat cycles.Boundaries are included as a pair of parallel rectangular plates 4 × 4.5 flagellar lengths (equivalently 180 × 202.5 µm), separated by a distance of 0.4 flagellar lengths (equivalently 18 µm), with the sperm swimming in the plane a distance of 0.2 flagellar lengths from each plate.The characteristic beat pattern of one of these swimmers is shown in Figure 5b, with the position of each swimmer at time t = 0, and the tracks traced out over three beat cycles shown in Figure 5a.The boundaries are discretized with a total of 1440 sDOF (and corresponding 7680 vector quadrature points), with the sperm tails discretized using 120 sDOF (and corresponding 160 vector quadrature points), and the heads are discretized with an increasing number of (CPU) and M4 (GPU) (lower is better).(e) The relative speed when using a GPU compared to CPU on M1, and M3 compared to M2 (higher is better).sDOF (162-648 per swimmer).For the range of sDOF shown in Figure 5c,d, we see in excess of an order of magnitude speed increase when using the GPU accelerated machines, with a maximal 19 × speed-up when using the GPU on M1 with 3978 sDOF.

Particle tracking in the euciliated mouse node
The flow driven by beating cilia in a fluid-filled structure called the embryonic node is the initiator of symmetry breaking in early development [41,42,30], although the mechanism by which this flow is converted into anatomical asymmetry is still to be understood [22].The role of morphogen-bearing vesicles is believed to play a central role.
To investigate this problem, we recently modelled the enclosed embryonic node of the mouse using NEAREST [54], with a biologically realistic 112 beating cilia, with particular focus on how the calculated flow can transport particles potentially needed for chemical signalling.The large number of beating cilia, and non-planar boundaries in this problem, means that longtimescale computational simulation of particle transport is particularly challenging for most existing methods.
Taking the forces calculated for the 39, 979 sDOF from the previous work [54], we calculate the time required to track a set of 1, 10, and 100 particles for 1000 beat cycles of the 112 cilia.A sketch of the euciliated mouse node is shown in Figure 6a, with the paths of 10 particles shown in Figure 6b, time requirements in Figure 6c and relative speed increase in Figure 6d.We see in the tracks the characteristic leftward-moving, "loopy drift" of particles as they follow the flow generated by the array of beating cilia.The ability to model and perturb a physiologically-accurate node, with particle transport over a long time, alongside the mechanical stresses produced by cilia movement, may enable us to probe more deeply the flow conversion mechanism that has proved to be so elusive, and moreover to assess the diverse morphologies observed in different species.
For the GPU accelerated calculations, the size of the system requires significant data transfer between the GPU and the CPU, and so the relative speed increases are not as significantan order of magnitude on M1, and half an order of magnitude when using M3 rather than M2.However the real-world implication of this speed increase is significant; when solving for a system of 100 particles, the time required is a little over 42 hours when using M2 (CPU), but less than 8 hours on M3 (GPU).

Discussion
The regularized stokeslet method provides a relatively elementary access point to numerical simulation of the geometrically complex Stokes flows characterising many microscale biological systems.In this paper we reviewed the nearest-neighbour discretisation approach, which decouples the number of degrees of freedom from the quadrature process, hence controlling the size of the linear system that needs to be solved.We then described the implementation of this method on GPU-enabled hardware; the fact that the method is already based on linear algebra operations means that only two lines of MATLAB code needed to be added in order to exploit GPU acceleration.Assessing the problem of calculating swimming motion due to multiple undulatory swimmers in an infinite fluid, multiple sperm in a confined channel, and particle transport in a ciliated organ, we consistently observed at least an order of magnitude time reduction when using the GPU over the CPU.While we carried out some of the computations on an HPC cluster, we limited our use to a single computational node per simulation.Further advances may be possible by re-analysing the code and/or utilising a compiler, although our primary focus is simplicity of implementation.
Parallel computing methods for microscale flow have been explored since the development of the completed double layer boundary integral equation method of Phan-Thien and Tullock in 1994 [58] (see also the earlier work of Power and Miranda [59] and later by Keaveny & Shelley [60]), enabling a multiparticle system with 81, 000 degrees of freedom to be solved on a Cray-YMP class supercomputer.The most powerful approaches to large scale microscale flow problems are based on approximating the multipole expansion of the stokeslet, including the kernel-independent fast multipole method [61,50,62,51], treecode [52] and the forcecoupling method [63,64], which enable O(N ) or O(N log N ) computational complexity, by contrast with the O(N 3 ) complexity of the method described here.These ideas have been used to simulate large suspensions of swimmers and elastic fibres, which are beyond the scope of the 'original' formulations of the boundary integral method or regularized stokeslet method, arguably at the cost of significantly greater implementational complexity.More 'mainstream' computational fluid dynamics approaches based on volumetric meshes and the finite volume and finite element methods have also been deployed very successfully for multicilia simulations for example [65].It would certainly be interesting to compare the present code with these methods, however the spirit of the nearest-neighbour approach is, rather than aiming to reduce asymptotic complexity, to minimise unnecessary degrees of freedom (i.e. to reduce N ), via algorithms that can be implemented in a few linear algebra operations, and without intricate mesh construction.The present contribution shows that with fairly minimal further modifications and accessible hardware, some fairly complex biological flow systems can be simulated.Future efforts along these lines may consider how rapidly-advancing algorithmic and hardware developments can continue to be made available to the non-specialist community.

Figure 1 :
Figure 1: The problem of discretizing the regularized stokeslet boundary integral equations via quadrature rules, illustrated via the resistance problem for a rigid sphere, with associated coarse discretization illustrated via dots.(a) The traction field associated with pure rotation, with associated fine discretization illustrated via crosses.(b) The regularized stokeslet kernel with ε = 0.1 for fixed stokeslet point y and varying field point x.(c) The combination of coarse (black dot) and fine (red cross) discretizations as employed by the nearest neighbour regularized stokeslet method.

Figure 3 :
Figure 3: Comparison of the computational time for constructing the matrices A s and A, and solving the linear system for the problem of a single sperm swimming.(a) -(c) The time taken for M1 (CPU), M1 (GPU), M2 and M3 (lower is better).(d) -(f)The relative speed increase when incorporating the GPU for M1, and when using M3 instead of M2 (higher is better).(g) The total time to solve a single iteration for the single sperm on M1 (CPU), M1 (GPU) (lower is better), M2 and M3.(h) The relative speed increase when incorporating the GPU for M1 and M3 instead of M2 (higher is better).

Figure 4 :
Figure 4: Simulation of an array of 25 model C. elegans for three beat cycles.(a) The model beat pattern in time.(b) The position of each swimmer at t = 0 is shown, with the paths traced out by the leading point as they swim overlain.(c) The total simulation time required on each of M1 (CPU), M2 (GPU), M3 and M4 (lower is better).(d) The relative speed when using a GPU compared to CPU on M1 and M3 compared to M2 (higher is better).

Figure 5 :
Figure 5: Simulation of an array of 9 model sperm for three beat cycles.(a) The model beat pattern in time.(b) The position of each swimmer at t = 0 is shown, with the paths traced out by the leading point as they swim overlain.(c) Side-on view of swimmers showing location of plate boundaries.(d) The total simulation time required on each of M1 (CPU), M2 (GPU), M3(CPU) and M4 (GPU) (lower is better).(e) The relative speed when using a GPU compared to CPU on M1, and M3 compared to M2 (higher is better).

Figure 6 :
Figure 6: Particle tracking in the euciliated mouse node.(a) A sketch of the mouse node geometry with overlaying Reichert's membrane and 112 beating cilia.(b) Paths traced out by 10 particles after 1000 beats.(c) The total simulation time required on each of M1 (CPU), M2 (GPU), M3 (CPU) and M4 (GPU) (lower is better).(d) The relative speed when using a GPU compared to CPU on M1, and M3 compared to M2 (higher is better).