Improving axial resolution in Structured Illumination Microscopy using deep learning

Structured Illumination Microscopy (SIM) is a widely used method for imaging live and fixed biological structures smaller than the diffraction limit of conventional optical microscopy. Using recent advances in image up-scaling through deep learning models, we demonstrate a method to reconstruct 3D SIM image stacks with twice the axial resolution attainable through conventional SIM reconstructions. We further demonstrate that our method is robust to noise and evaluate it against two-point cases and axial gratings. Finally, we discuss potential adaptations of the method to further improve resolution. This article is part of the Theo Murphy meeting issue ‘Super-resolution structured illumination microscopy (part 1)’.


Introduction
Structured Illumination Microscopy (SIM) is a super-resolution technique which enables a twofold increase in lateral resolution (X/Y axes, perpendicular to the line of sight of the microscope) when compared to conventional fluorescence microscopy [1,2]. SIM works by projecting structured light onto a specimen to obtain spectral information that would otherwise lie outside the range visible to a wide-field microscope. Repeating this procedure across a set of structured light patterns constructs a stack of raw SIM images of a specimen, the ensemble of which contains a wider range of spatial frequencies than is obtained via a single exposure. Image stacks are then reconstructed into a single super-resolved image using computational algorithms. A wide range of illumination patterns can be used, including 3D patterns that allow for improvements in axial resolution (Z axis, parallel to the line of sight of the microscope) [3]. This has proven to be a useful investigative tool for the analysis of live biological processes, as it avoids most of the sample damage traditionally associated with electron microscopy [4]. Example applications include in vivo imaging of synapses in mice [5], or analysing the distribution of proteins on chromosomes, a task previously unfeasible due to the relative scale of the protein and a confocal microscope's diffraction limit [6]. As such, further improvement of image reconstruction methodologies for microscopy may allow the observation of previously unobserved live biological processes.
Improvements in axial resolution have historically been achieved through advances in optical set-up and illumination techniques, such as 3D-SIM [7], TIRF-SIM [8], spot-scanning SR-SIM [9] or 2-photon SR-SIM [10]. As these techniques can incur significant cost and complexity, recent developments in deep learning have been adapted and optimized to address these issues through a purely computational approach. Proposed deep learning methods for SIM have successfully demonstrated reductions in aberrations and reconstruction artefacts due to movements of the specimen [11,12] or low-light conditions [13-15], but as yet, deep learning networks have not been applied to the problem of reconstructing SIM image stacks to a resolution beyond that attainable by standard SIM methodologies. Training a network whose input is a set of SIM images as might be produced by a 2D-SIM microscope set-up (2-beam, 3-angle), and whose target has the resolution characteristics of a 3D-SIM microscope, would allow the network to achieve an axial resolution comparable to 3D SIM microscopes without the motion artefacts or photo-bleaching caused by high image counts [16].
As the range of imaged light frequencies is reduced in this microscope set-up, the use of convolutional layers will allow us to infer additional information (as is currently done through Richardson-Lucy deconvolution [17]). Building on work applying deep learning networks to SIM reconstruction, a modified residual channel attention network (RCAN) [18] is trained to reconstruct SIM image stacks at double the axial resolution attainable via the original SIM reconstruction methodology [2]. High-resolution (HR) images are simulated as confocal images with an illumination wavelength reduced by a factor of √2 to match the lateral resolution obtained via seminal SIM reconstructions, while axial resolution is improved by decreasing the axial slice thickness of each image in the stack. Simulated training and test data are generated from both dense and biologically realistic 3D point clouds, at both standard resolution (SR) and HR, as detailed in §3. The network is evaluated against test data to investigate the method's effectiveness at resolving structures composed of point illuminations (§4). As we expect the network to output spatial frequencies which are not inherent in the input data (as is already achieved through deconvolution techniques [17]), the network should be able to infer these from the widefield images using the mapping from simulated widefield images to simulated HR confocal images. Further analysis demonstrates robustness to low-lighting conditions (§4). Limitations in network generalizability and the broader potential of the work are discussed in §5.

Existing work
Computational methods for SIM image reconstruction have set benchmarks of attainable resolution and demonstrated flexibility to imprecise or varying illumination patterns. An example is blind-SIM [19], in which gradient descent is used to minimize the least-squared error between the measured data and the data predicted by the convolution of an estimated biological sample with an estimated illumination pattern. By alternately fixing the estimated data and estimated illumination pattern, a HR image can be retrieved with limited knowledge of the illumination pattern, but at the cost of significant compute time. Similarly, the use of Richardson-Lucy deconvolutions [17] demonstrated significant improvements in resolution when the PSF of the optical apparatus was experimentally measured. By contrast, deep learning has been applied to SIM reconstruction with varying goals, such as accelerating the imaging process by reducing the required number of raw frames (i.e. the number of variations in illumination patterns) and inferring the spectral information from a smaller subset of SIM frames. These demonstrate an objective of high-throughput imaging, where a known microscope configuration is coupled with deep learning methods to minimize the imaging and processing time while improving the obtained resolution.

Figure 1. Modified RCAN network architecture, adapted from [18]. Note the use of a 3D convolution at the input, and the use of a single input channel and three output channels. (Online version in colour.)
This has been achieved using U-Nets [14] and CycleGAN (a variation of a conditional generative adversarial network) [15]. These diverge from our goal, which is to attain a higher axial resolution while maintaining a 2× increase in lateral resolution. Image super-resolution for generic photography has progressed considerably in the past decade, resulting in a large variety of architectures and techniques [20]. Notably, these form paradigms regarding the pre- or post-upsampling of inputs (relative to convolutional or other image processing), methods of image interpolation, residual learning, network recursiveness and parallel paths, in addition to a large number of architecture-specific features.
Among these is the RCAN [18], which is engineered to use inter-channel features (RGB colours in the original network) in addition to sub-pixel upsampling. These are particularly suited to SIM reconstruction, as this process entails the recombination of multiple raw images (i.e. channels) into an up-scaled image. An RCAN for SIM reconstruction was successfully trained using simulated SIM stacks of generic stock imagery [13], which forms the basis for our architecture (figure 1).

Methods
(a) Deep learning
(i) Network architecture
The network architecture is based on the RCAN [18], which was previously used to implement a form of SIM reconstruction without up-scaling [13]. We modified the network to re-implement the lateral up-scaling module found in the original implementation, thereby doubling the lateral output image size. The up-scaling module was necessary to generate images comparable to our target data, which is Nyquist sampled without aliasing. The use of an up- or down-scaling algorithm to adapt the target to the network output, or vice versa, could cause undesirable patterns to be learned by the network, while using the up-scaling module between learned layers allows for some learned adaptation to the underlying mapping between the low-resolution image stacks and HR confocal target images. Furthermore, the first 2D convolutional layer was replaced by a 3D convolution in order to infer information from other SIM stacks within the same processing chunk (see the section below for the chunk-generating procedure) prior to higher-order operations (which take place in the residual blocks of the network). The number of output channels was also changed according to the chunk size, such that three consecutive greyscale (i.e. single-channel) images were produced for every inputted SIM stack.
The final network configuration used 12 groups of three residual blocks, a kernel size of 3×3 and a chunk size of 3; these parameters were chosen experimentally, as detailed in electronic supplementary material, Section A6 (Network architecture & Optimization).
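The input/output modifications described above can be sketched as a minimal PyTorch stub; the residual channel-attention trunk is omitted, and all layer widths and names are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ModifiedRCANStub(nn.Module):
    """Sketch of the modified input/output stages: a 3D convolution at
    the input, sub-pixel lateral up-scaling, and 3 output frames per
    inputted SIM stack. The RCAN residual groups are omitted."""

    def __init__(self, chunk_size=3, features=32, frames_per_stack=7):
        super().__init__()
        # 3D convolution fuses information across raw SIM frames and
        # neighbouring stacks within the same processing chunk
        self.head = nn.Conv3d(1, features, kernel_size=3, padding=1)
        # collapse the frame axis into channels before the 2D trunk
        self.fuse = nn.Conv2d(features * frames_per_stack * chunk_size,
                              features, kernel_size=3, padding=1)
        # sub-pixel (PixelShuffle) up-scaling doubles the lateral size
        self.up = nn.Sequential(
            nn.Conv2d(features, features * 4, kernel_size=3, padding=1),
            nn.PixelShuffle(2),
        )
        self.tail = nn.Conv2d(features, 3 * chunk_size, kernel_size=3, padding=1)

    def forward(self, x):             # x: (batch, 1, 7*chunk, H, W)
        f = self.head(x)              # (batch, F, 7*chunk, H, W)
        f = f.flatten(1, 2)           # (batch, F*7*chunk, H, W)
        return self.tail(self.up(self.fuse(f)))

x = torch.randn(2, 1, 21, 64, 64)     # chunk size 3 -> 21 raw SIM frames
y = ModifiedRCANStub()(x)
print(y.shape)                        # torch.Size([2, 9, 128, 128])
```

For a chunk size of 3, the stub maps 21 raw frames to 9 output frames at twice the lateral size, mirroring the 7-in/3-out ratio per SIM stack described above.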
(ii) Chunk processing
Input SIM stacks were processed in chunks to allow varying depths of image stacks to be compatible with the network, and to improve the model's performance by inferring relations between consecutive axial images. In this scheme, a chunk size of two corresponds to a network input of 14 images (7 frames per SIM image × chunk size of 2) and a network output of 6 frames (3 frames per SIM image × chunk size of 2). Chunk size was determined experimentally by modifying the training data and input channels for the architecture described in §3; these were then evaluated using 24 withheld chromatin structure images and a series of images of spherical point clouds with increasing point densities. The evaluation of chunk size on network performance can be found in electronic supplementary material, Section A5.
The chunk size had no significant effect on the network's mean squared error (MSE) (Mann-Whitney U-Test, p > 0.05 for all pairs of chunked data), but the structural similarity index (SSIM, a comparative metric of the conservation of structural information between an image and a reference image [21]) was higher for a chunk size of 3 (U-Test, p < 0.05 for 6/9 comparisons). This indicates that a chunk size of 3 maximizes the conservation of the structure of the images, as reflected by the mean lateral FWHM (smaller than all other chunk sizes' means, U-Test, p < 0.05) and the mean axial FWHM (smaller than all other chunk sizes, U-Test, p < 0.05).
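The frame bookkeeping above can be illustrated with a short NumPy helper (function and constant names are ours, not the paper's implementation):

```python
import numpy as np

FRAMES_PER_SIM = 7   # raw frames per SIM image (hexagonal 3-beam pattern)
OUT_PER_SIM = 3      # reconstructed frames per SIM image

def chunk_stack(raw_frames, chunk_size):
    """Split a raw SIM stack into network-sized chunks.
    raw_frames: array of shape (n_sim * 7, H, W)."""
    n_sim = raw_frames.shape[0] // FRAMES_PER_SIM
    chunks = []
    for i in range(0, n_sim - chunk_size + 1, chunk_size):
        lo = i * FRAMES_PER_SIM
        hi = (i + chunk_size) * FRAMES_PER_SIM
        chunks.append(raw_frames[lo:hi])
    return chunks

stack = np.zeros((7 * 12, 256, 256), dtype=np.float32)  # 12 SIM images
chunks = chunk_stack(stack, chunk_size=2)
print(len(chunks), chunks[0].shape)   # 6 (14, 256, 256)
```

Each chunk of size 2 carries 14 input frames, and the corresponding network output holds 3 × 2 = 6 frames.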
(iii) Data pre- and post-processing
Pairs of SIM image stacks and HR confocal images were axially sliced to produce chunked sets of images as described above. As the simulated images extended beyond the simulated underlying data, image pairs with little signal (mean pixel value below 5 × 10−7) were discarded. The remaining set of chunked images was then normalized to a range of [0, 1].
A total of 100 chromatin structures and 68 spherical datasets were simulated to generate 168 image stacks. Of these, 10 chromatin structures and 20 spherical datasets were reserved for a test dataset, leaving 138 stacks for training and validation. The input/output images were then axially chunked into sets of 3 SIM images, generating 12 training datapoints per stack for a total of 1656 training points. As the simulated stacks contain little signal at the start and end of the image stacks, chunks with a mean pixel value (measured as a 16-bit float) below 10−7 were discarded, leaving a total of 1178 training points and 130 validation points.
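A minimal sketch of the discard-and-normalize step, assuming per-chunk [0, 1] normalization (the paper does not state whether normalization is applied per chunk or globally):

```python
import numpy as np

def preprocess(chunks, threshold=1e-7):
    """Drop low-signal chunks, then normalise each survivor to [0, 1].
    The threshold follows the paper; the helper name is illustrative."""
    kept = [c.astype(np.float32) for c in chunks if c.mean() >= threshold]
    # small epsilon avoids division by zero for constant chunks
    return [(c - c.min()) / (c.max() - c.min() + np.float32(1e-12))
            for c in kept]

chunks = [np.zeros((21, 64, 64)), np.random.rand(21, 64, 64)]
clean = preprocess(chunks)
print(len(clean))   # the all-zero chunk is discarded
```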
Negative pixel values occurred in the RCAN outputs due to small, normally distributed random errors in the output values; as a majority of test images are zero-valued backgrounds, many of these errors produce small negative values. The output was therefore post-processed to truncate negative values, which is effectively equivalent to appending a ReLU layer to the network, but without causing a loss of learning gradient during training.
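The truncation step amounts to a clamp at zero applied after inference, for example:

```python
import numpy as np

def truncate_negatives(reconstruction):
    # clamp small negative pixel values introduced by the network's
    # output noise; applied post hoc, so no ReLU is needed in the
    # network itself and training gradients are unaffected
    return np.maximum(reconstruction, 0.0)

out = np.array([-0.003, 0.0, 0.42, -1e-5])
res = truncate_negatives(out)   # negative values become 0.0
```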
(iv) Training methodology
All deep learning models were implemented using the PyTorch library (v. 1.5) [22] in Python 3.7. The starting learning rate was set to 10−6 and was adjusted if the validation loss plateaued during training through the use of the ReduceLROnPlateau method provided by PyTorch; a further 1% decrease was applied every 5 epochs to ensure fine tuning in later epochs. Network gradients were clipped at ±0.1 to avoid exploding gradients. Training/validation sets were drawn from all generated data except for withheld test cases, which were randomly selected prior to training. Training took approximately 8 h on a set of 2 Nvidia RTX 2080 Ti GPUs. Over-fitting does not appear to have occurred in our training scheme, as demonstrated by the continual decrease in mean MSE values for the validation dataset (see electronic supplementary material, A6).
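A condensed sketch of this training configuration in PyTorch, with a placeholder model and loss standing in for the modified RCAN and its data:

```python
import torch

# placeholder model and data; the real network is the modified RCAN
model = torch.nn.Linear(8, 8)
opt = torch.optim.Adam(model.parameters(), lr=1e-6)
# reduce the learning rate when the validation loss plateaus
plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(opt)
# additional 1% decrease every 5 epochs for late-stage fine tuning
decay = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.99)

for epoch in range(20):
    x = torch.randn(4, 8)
    loss = torch.nn.functional.mse_loss(model(x), x)
    opt.zero_grad()
    loss.backward()
    # clip gradients at +/-0.1 to avoid exploding gradients
    torch.nn.utils.clip_grad_value_(model.parameters(), 0.1)
    opt.step()
    val_loss = loss.item()      # stand-in for a proper validation pass
    plateau.step(val_loss)
    decay.step()
```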

(b) Data simulation
The model was trained using simulated SIM image stacks as input, and an up-sampled confocal image as ground truth. The methods used to generate these and benchmark SIM reconstructions are described below.

(i) SIM image stacks
Raw SIM images were generated from 3D point cloud data; these simulated either a biologically realistic chromatin structure [23] (bounded approximately in a 16 × 16 µm square) or a point cloud encompassed in a sphere (radius of 5 µm) with varying numbers of point emitters. The simulated microscope used a 7-frame set-up based on the interference of 3 coherent beams to produce a 2D hexagonal illumination pattern on the sample; this is well illustrated in figure 2D-F of Ingerman et al. [3]. The structured illumination is simulated as a thin light-sheet projected onto the specimen from the side [24], thereby avoiding the reduced resolution enhancement usually found with this SIM geometry in an epi-illumination set-up. The simulated microscope set-up and image simulation approach are further described by Gong et al. [25], and a mathematical description can be found in the electronic supplementary material of that work. The system was simulated with a 256 × 256 pixel camera with 6.5 µm square pixels, an objective magnification of 60, a numerical aperture of 1.1 and an illumination/emission wavelength of 525 nm. The illumination and emission wavelengths are not required to be identical and were chosen for convenience in this experiment; no adverse effects are expected from the use of wavelengths differing by up to 10%. Photobleaching and other image quality degradation were not simulated as part of the experiment, though low-lighting conditions were simulated using Poisson statistics.
(ii) High-resolution confocal microscope simulation
Target output stacks were generated using a simulation of a confocal microscope on the same point cloud data. The parameters of the simulated confocal microscope were set to out-perform the simulated SIM microscope (halved pixel size, illumination wavelength divided by √2, axial and lateral Nyquist sampling rates doubled, squared PSF). These parameters ensure a doubling in both axial and lateral resolution.
The number of output frames was also increased from 40 to 120 frames in order to improve the precision of measurements of axial resolution.
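The relationship between the SIM and HR confocal simulation parameters can be summarized in code (dictionary keys and the relative baseline values are illustrative assumptions):

```python
import math

SIM_SETUP = {                        # illustrative baseline values
    "pixel_size_um": 6.5 / 60,       # camera pixel projected to the sample
    "wavelength_nm": 525,
    "axial_nyquist": 1.0,            # relative sampling rates
    "lateral_nyquist": 1.0,
}

def hr_confocal_params(sim):
    """Derive the HR confocal target parameters from the SIM set-up."""
    return {
        "pixel_size_um": sim["pixel_size_um"] / 2,             # halved pixel size
        "wavelength_nm": sim["wavelength_nm"] / math.sqrt(2),  # lambda / sqrt(2)
        "axial_nyquist": sim["axial_nyquist"] * 2,             # doubled sampling
        "lateral_nyquist": sim["lateral_nyquist"] * 2,
        "psf_exponent": 2,                                     # squared PSF
    }

params = hr_confocal_params(SIM_SETUP)
```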

(iii) Structured illumination microscopy reconstructions
While complete HR images can be reconstructed with a set of just 7 images, this hexagonal patterned SIM illumination reconstructs according to the same carrier spatial frequencies as are found in a 2-beam, 3-angle set-up using 9 frames [1,2]. As a consequence, the processing of the 7-frame data to a super-resolved image with enhanced lateral resolution [25] was achieved in a similar way to the standard SIM reconstruction algorithm [2]. The estimation of reconstruction parameters was found to be imprecise due to the sub-sampling of image stacks in which the number of point emitters was limited. Constant reconstruction parameters matching those used to simulate the images were therefore used to reduce any source of variability in the reconstructed SIM images. Reconstructed SIM images were calculated to provide a comparative baseline for our RCAN-reconstructed images. This was implemented in Python (see §5), and was modified to output 120 frames to match the number of output frames in our HR confocal simulations and network outputs.

(c) Measuring resolution
The full width at half maximum (FWHM) metric has commonly been used to evaluate the resolution of microscopy images [7,26]. It is defined as the width of a point spread function (PSF) in a given direction at half the peak intensity; a large FWHM indicates a wider PSF resulting in a less detailed image, while a smaller FWHM reflects a narrower PSF and a sharper image. Alternative methods such as Fourier-ring correlation [27] could be viable for measuring resolution; we instead elected to use the FWHM as it provides a high-throughput measure of resolution for images of point clouds. The FWHM was estimated using a method based on auto-correlation functions [28], thereby evaluating the FWHM over every point light source in an image rather than over individually and manually selected data points. The auto-correlation function (ACF) of an image was calculated using the inverse Fourier transform of the power spectrum on individual 3D datasets. High densities of data points can cause significant cross-correlation of signals from neighbouring PSFs; these are removed by modelling the image auto-correlation as the sum of two Gaussians. The first Gaussian models the auto-correlation of individual PSFs in the image, while the second models the cross-correlation between neighbouring, randomly distributed PSFs. These are estimated using a bounded least-squares regression, and the FWHM is calculated from the Gaussian with the smallest standard deviation. An example of the modelling process is illustrated in electronic supplementary material, Section A4. The Python & SciPy implementation of this methodology is available via GitHub, and an equivalent mathematical notation is available in the electronic supplementary material, Section A1.
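A 1D sketch of the ACF-based FWHM estimate, assuming Gaussian PSFs; the initial guesses and bounds for the least-squares fit are illustrative, and the full method operates on 3D datasets:

```python
import numpy as np
from scipy.optimize import curve_fit

def autocorrelation(profile):
    """1D auto-correlation via the inverse FFT of the power spectrum."""
    f = np.fft.fft(profile)
    acf = np.fft.ifft(np.abs(f) ** 2).real
    acf = np.fft.fftshift(acf)
    return acf / acf.max()

def two_gaussians(x, a1, s1, a2, s2):
    # narrow term: auto-correlation of individual PSFs;
    # wide term: cross-correlation between neighbouring emitters
    return (a1 * np.exp(-x ** 2 / (2 * s1 ** 2))
            + a2 * np.exp(-x ** 2 / (2 * s2 ** 2)))

def estimate_fwhm(profile, pixel_size):
    """Estimate the PSF FWHM from a 1D intensity profile."""
    x = np.arange(len(profile)) - len(profile) // 2
    acf = autocorrelation(profile)
    popt, _ = curve_fit(two_gaussians, x, acf,
                        p0=[1.0, 3.0, 0.2, 30.0],
                        bounds=([0, 0.5, 0, 0.5], [2, 100, 2, 500]))
    s_psf = min(popt[1], popt[3])     # Gaussian with smallest std. dev.
    # a Gaussian's ACF is sqrt(2) wider than the Gaussian itself
    return 2 * np.sqrt(2 * np.log(2)) * s_psf / np.sqrt(2) * pixel_size
```

Applied to a synthetic Gaussian spot, the estimate recovers a FWHM proportional to the spot's width without any manual selection of data points.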

Results
(a) Doubling axial resolution using RCAN networks
The network's ability to double the axial resolution was tested using 24 unseen simulated chromatin structures; this is visible in figure 2a. The mean axial FWHM for RCAN reconstructions (mean = 436.8 nm, s.d. = 21.8) was found to be significantly smaller than for those produced using SIM (mean = 669.67 nm, s.d. = 207.2) (T-Test, t(10) = −3.5, p < 0.05). We further found the model to be robust to increasing point density when compared to standard SIM reconstructions (see figure 2b). This is achieved without significantly affecting the lateral resolution (X and Y axes, perpendicular to the line of sight of the microscope), as seen in figure 2c, while maintaining a strong conservation of structure, as seen in figure 2d. This is illustrated by the orthogonal Y/Z view in the electronic supplementary material, Section A2, figure 2, where the high structural fidelity and low background noise of the RCAN output contrast with the SIM reconstruction.

(b) Robustness in low-light imaging conditions
An additional dataset was generated using four other simulated chromatin structures corrupted with Poisson noise. The signal-to-noise ratio was decreased by reducing the expected signal photons per emitter over the 7 SIM frames, thereby simulating low-lighting conditions. The RCAN model demonstrated robustness against low-lighting conditions by maintaining a consistent axial resolution (figure 3a), MSE (figure 3c) and SSIM [21] (figure 3d), in contrast to the reconstructed SIM images, which diverge from the reference HR images in SSIM and MSE as lighting conditions worsen. The notable decrease in axial resolution for SIM reconstructions at lower lighting levels is due to a breakdown in the quality of the reconstruction, as reflected by the large increase in MSE and decrease in SSIM, while these remain more consistent for RCAN reconstructions (figure 4).
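The noise model can be sketched as Poisson resampling of the simulated frames, where the exact photon-budget scaling is an assumption on our part:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_shot_noise(clean_frames, photons_per_emitter):
    """Simulate low-light conditions with Poisson (shot) noise.
    `photons_per_emitter` is the expected signal budget per emitter
    spread over the 7 SIM frames (illustrative scaling)."""
    scale = photons_per_emitter / max(clean_frames.max(), 1e-12)
    # draw integer photon counts, then rescale back to image units
    noisy_counts = rng.poisson(clean_frames * scale)
    return noisy_counts.astype(np.float32) / scale

frames = np.random.rand(7, 64, 64)        # one noise-free raw SIM stack
low_light = add_shot_noise(frames, photons_per_emitter=50)
```

Lowering `photons_per_emitter` increases the relative shot noise, mimicking progressively worse lighting conditions.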

(c) Evaluation against third party data
The training data examined above differs from the biological structures observed in other SIM experiments, such as actin filaments [14], membranes [13] or other biological structures which form wider continuous illuminations. To evaluate our network's performance on data containing these structures, point cloud data for Single Molecule Localization Microscopy (SMLM) was taken from an SMLM competition's training dataset [29] (training sets MT1.N1.LD (≈8000 points, low-density) and MT1.N1.HD (≈13 000 points, high-density)). These contain shallow and narrow filament structures with high point densities, allowing us to evaluate the generalizability of our method to biological structures observed in other studies. SIM image stacks and HR confocal images were simulated from this data and processed by the RCAN network. As visualized in figure 5, the RCAN (bottom-right) was able to produce the expected microtubule without the side-lobe artefacts seen in the SIM image; these are visible as faint duplicate microtubules on either side of a well-illuminated strand, and are more distinguishable to non-expert viewers through the video provided in the supplementary material. The difference in axial resolution (RCAN mean: 331.4 nm, SIM mean: 605.1 nm, reference HR confocal mean: 372.9 nm) reflected the previous test on chromatin data, as did the similarity in lateral resolution (RCAN: 170.8 […]) […] HR confocal, and in electronic supplementary material, A7, figure 10b, where a uniformly dense grating is reconstructed inconsistently.

[Figure caption] The drop in resolution as noise levels increase represents a shortfall of our method for resolution estimation, as reconstructed point artefacts due to noise are not filtered when estimating the resolution. We therefore consider the measured resolution as a comparative measure of reconstruction quality rather than an attainable resolution in experimentally noisy conditions. (Online version in colour.)
This effect is likely due to the use of sparse point clouds as training data, and could be remedied in future work by simulating artificial specimens containing uniformly dense light sources (figure 6).

Discussion
As demonstrated by the results in §4, the RCAN architecture described in §3 is capable of producing SIM reconstructions with a higher axial resolution than current 2D SIM reconstructions, without a loss of lateral resolution or structural fidelity. This was demonstrated to be robust at low light levels, which is typically a challenge for SIM and had previously been addressed by other deep-learning solutions [14]. The use of two simulated structure types in our training data (biological structures and sphere-bounded point clouds) demonstrates some capacity for the network to generalize to varying structures in observed specimens; this was reflected by the performance of the network on simulated SMLM data in figure 5. The network was found to approximate reconstructions of continuous objects (microtubules) by reproducing the point-source patterns found in our training data. The task of generalizing to other sources of data has been addressed with varying approaches, such as training with real-world images [13]. This mitigates the likelihood of over-fitting a model to a distinct biological structure, as such structures are not present across all training data. By contrast, a U-Net approach required re-training models to mitigate the occurrence of artefacts in reconstructed images [14]. Therefore, it may be necessary to train on a wider variety of real or synthetic datasets to optimize the reconstruction of other biological structures using this method.
An issue which has yet to be addressed by image up-scaling networks is the limit of the up-scaling factor with regard to image fidelity. The original implementation of the RCAN architecture [18] demonstrates 4× up-scaling but with a severe degradation of SSIM; this brings into question the confidence that can be placed in reconstructed images, especially if the ratio of up-scaling is further increased. The modified RCAN described here may be more effective when attempting higher ratios of up-scaling, as the initial set of information contains many more frames (i.e. channels) per output image, and therefore has the capacity to contain more information to be reconstructed into the HR image. A brief investigation of the effect of removing one of these channels (see electronic supplementary material, A3) demonstrates that all channels are used in the reconstruction of the image, as the removal of any single image degrades the MSE and SSIM statistics in comparison to the reconstruction without masked layers. Ultimately, as the ratio of up-scaling goes beyond the capability of human validation, further imaging of structures via destructive methods (such as electron microscopy) could also validate final time points in reconstructed time-lapse SIM images. A cross-validation could be performed using training groups formed of distinct biological structures; the sensitivity of the network to its training data would then provide better insight into potential reconstruction biases present in the network. This could further be extended by generating ensembles of trained models, as has previously been demonstrated for image restoration in fluorescence microscopy [30]. In this scheme, each model would be trained on a random subset of the training data, and the outputs of all trained models for any given novel image reconstruction would together form a statistical model for each output pixel.
This could allow a degree of statistical confidence, which may be increasingly required to assert the certainty of reconstructions especially at larger ratios of image up-scaling.