Modified kernel MLAA using autoencoder for PET-enabled dual-energy CT

Combined use of PET and dual-energy CT provides complementary information for multi-parametric imaging. PET-enabled dual-energy CT combines a low-energy X-ray CT image with a high-energy γ-ray CT (GCT) image reconstructed from time-of-flight PET emission data to enable dual-energy CT material decomposition on a PET/CT scanner. The maximum-likelihood attenuation and activity (MLAA) algorithm has been used for GCT reconstruction but suffers from noise. Kernel MLAA exploits an X-ray CT image prior through the kernel framework to guide GCT reconstruction and has demonstrated substantial improvements in noise suppression. However, similar to other kernel methods for image reconstruction, the existing kernel MLAA uses image intensity-based features to construct the kernel representation, which is not always robust and may lead to suboptimal reconstruction with artefacts. In this paper, we propose a modified kernel method by using an autoencoder convolutional neural network (CNN) to extract an intrinsic feature set from the X-ray CT image prior. A computer simulation study was conducted to compare the autoencoder CNN-derived feature representation with raw image patches for evaluation of kernel MLAA for GCT image reconstruction and dual-energy multi-material decomposition. The results show that the autoencoder kernel MLAA method can achieve a significant image quality improvement for GCT and material decomposition as compared to the existing kernel MLAA algorithm. A weakness of the proposed method is its potential over-smoothness in a bone region, indicating the importance of further optimization in future work. This article is part of the theme issue ‘Synergistic tomographic image reconstruction: part 2’.


Introduction
Positron emission tomography (PET) integrated with computed tomography (CT) is a molecular imaging modality that is widely used in clinical oncology, neurology and cardiology. In parallel, dual-energy (DE) CT imaging has the unique capability of using energy-dependent tissue attenuation information to perform quantitative multi-material decomposition [1]. Combined use of PET/CT and DECT provides a multi-parametric characterization of disease states in cancer and other diseases [2]. Nevertheless, the integration of DECT with existing PET/CT would not be trivial, either requiring costly CT hardware upgrade or significantly increasing CT radiation dose.
We have proposed a new dual-energy CT imaging method that is enabled using a standard time-of-flight PET/CT scan without change of scanner hardware or adding additional radiation dose or scan time [3,4]. Instead of using two different X-ray energies as commonly used by conventional DECT, the PET-enabled dual-energy CT method combines a radiotracer annihilation-generated high-energy 'γ -ray CT (GCT)' at 511 keV with the already-available lowenergy X-ray CT (usually ≤ 140 keV) to produce a pair of dual-energy CT images on PET/CT for multi-material decomposition.
The reconstruction of the GCT image from the PET emission scan can be achieved by the maximum-likelihood attenuation and activity (MLAA) method [5]. However, standard MLAA reconstruction is commonly noisy because the counting statistics of PET emission data are limited. While the noise would not compromise the performance of MLAA for PET attenuation correction, it may affect the quantitative accuracy of GCT for multi-material decomposition. To suppress noise, the kernel MLAA approach [4] has been developed by use of X-ray CT as image prior through a kernel framework and has demonstrated substantial improvements over standard MLAA.
In the kernel methods for image reconstruction (e.g. [6][7][8][9][10]), a set of features need to be defined for constructing the kernel representation of the image to be estimated. Existing kernel methods have mainly used image pixel intensities of a small patch (e.g. for MR-guided PET reconstruction [7][8][9][10]) or temporal sequence (e.g. for dynamic PET reconstruction [6,9]). However, the intensitybased features do not always provide satisfactory results. As shown in [4] and later in this paper, the reconstructed GCT image by such a method suffers from artefacts.
In this paper, we propose to use a convolutional neural network (CNN) feature set that is adaptively learned on the prior image to build the kernel representation for MLAA reconstruction. It has been demonstrated that deep learning with CNN has a strong ability to derive a latent feature representation in different tasks [11]. While it is often impractical to collect a large amount of training data for supervised learning, here we utilize the concept of autoencoder CNN [12], an unsupervised representation learning technique, for intrinsic feature extraction from the X-ray CT image prior for the kernel construction. The autoencoder CNN-derived feature set of X-ray CT image is expected to provide a more robust kernel representation for the GCT image reconstruction than the conventional intensity-based features.

PET-enabled dual-energy CT (a) Statistical model of PET emission data
In time-of-flight (TOF) PET, the measured data y can be well modelled as independent Poisson random variables using the log-likelihood function, where i denotes the index of PET detector pair and N d is the total number of detector pairs. m denotes the mth TOF bin and N t is the number of TOF bins. The expectation of the PET projection data y is related to the radiotracer activity image λ ∈ R N p ×1 and object attenuation image μ ∈ R N p ×1 at 511 keV via where G m ∈ R N d ×N p is the PET detection probability matrix. N p is the total number of image pixels. r m ∈ R N d ×1 accounts for the expectation of random and scattered events. n m (μ)∈ R N d ×1 is the normalization factor with the ith element being where c i,m represents the multiplicative factor excluding the attenuation correction factor and A∈ R N d ×N p is the system matrix for transmission imaging.

(b) Maximum-likelihood attenuation and activity reconstruction
The MLAA ( [5]) reconstruction jointly estimates the attenuation image μ and the activity image λ from the projection data y by maximizing the Poisson log-likelihood, An iterative interleaved updating strategy is commonly used to seek the solution [5].

(c) Kernel MLAA
The GCT estimate by standard MLAA is commonly noisy due to the limited counting statistics of PET emission data. To suppress noise, the kernel MLAA approach [4] incorporates the X-ray CT image as a priori information to guide the GCT reconstruction in the MLAA. It describes the intensity of the GCT μ j in pixel j as a linear representation in a transformed feature space where f j is the data point of pixel j that is extracted from x and φ( f j ) is a mapping function that transforms the low-dimensional data point f j to a high-dimensional feature vector. w is a weight vector which also sits in the transformed space, w = l α l φ( f l ), with α being the coefficient. Then, we can obtain the following kernel representation for μ j , where κ(·, ·) is the kernel function (e.g. radial Gaussian) that is equal to the inner product of the two transformed feature vectors φ( f j ) and φ( f l ). The equivalent matrix-vector form for the GCT image is μ = Kα, (2.9) where K is the kernel matrix and α denotes the corresponding kernel coefficient image. Substituting equation (2.9) into the MLAA formulation in equation (2.4) gives the following kernel MLAA optimization formulation, λ,α = arg max λ≥ 0,α≥ 0 L y|λ, Kα . (2.10) The detail of the kernel MLAA algorithm is provided in [4]. Onceα is obtained, the final estimate of the GCT image is obtained byμ = Kα.
(d) Material decomposition using PET-enabled dual-energy CT For each image pixel j, the GCT attenuation value μ j and X-ray CT attenuation value x j jointly form a pair of dual-energy measurements u j [x j , μ j ] T , which can be modelled by a set of material bases, such as air (A), soft tissue (S) or equivalently water, and bone (B): subject to k ρ j,k = 1. The coefficients ρ j,k with k = A, S, B are the fraction of each basis material in pixel j. The material basis matrix U consists of the linear attenuation coefficients of each basis material measured at the low and high energies. Finally, ρ j is estimated using the following leastsquare optimization for each image pixel, (2.12)

Proposed autoencoder kernel method (a) Building kernels using convolutional neural network features
In the kernel MLAA [4] and other kernel methods for image reconstruction (e.g. [6][7][8][9][10]), the formation of the pixel-wise feature vector f j is a key factor to build the kernel matrix K and directly impacts the reconstruction result. Conventionally, the feature vector f j is defined by the intensity value of a pixel or its surrounding small patch (therefore denoted by f Patch j ). However, such an approach may lead to suboptimal feature representation because of the simplification of feature attributes and the small size of the receptive field. Artefacts were observed in the reconstructed GCT images of the kernel MLAA [4]. While it is possible to design more complex features (e.g. texture) to make the features more robust and differential, it would require a significant amount of handcrafting.
To alleviate the issues, we propose to exploit deep learning with CNN [11] for intrinsic feature extraction from the X-ray CT image instead of using conventional intensity-based features in the kernel MLAA. The latent feature representation of a trained CNN model may benefit from the larger receptive field of deep convolutions to build a deeper and more robust feature set for kernel representation. A natural choice is by use of supervised deep learning which has shown strong potential for feature extraction in image recognition tasks. The optimization problem can where i denotes the index of training pairs with z i the input and o i the output. φ(θ; ·) is a neuralnetwork model with θ the unknown weights. Once the neural-network model is pre-trained, we can apply it to the X-ray CT image prior x to extract intermediate CNN feature sets for each image pixel. One major challenge with supervised deep learning is it commonly requires a large number of training datasets, which are not always available or the data acquisition is costly. Therefore, here we explore the feasibility of feature extraction using unsupervised deep learning for the kernel methods.

(b) The autoencoder kernel
An autoencoder is an unsupervised technique for representation learning using neural networks [12] based on a signal reconstruction problem. The output of the neural network is trained to be an approximation of the input, as restricted by the network model itself. The corresponding optimization problem for applying an autoencoder for our kernel method is defined by, where both the input z and output o in equation (3.1) are now set to the X-ray CT image x. The optimization essentially seeks an adaptive CNN representation of the image x, without requiring a large training database. We specifically consider two types of CNN for φ, as shown in figure 1. One is the residual encoder-decoder convolutional neural network (RED-CNN) that has been used for low-dose CT  denoising [21]. The other one is a Unet model that is widely used for image segmentation and reconstruction (e.g. [22]). In the RED-CNN, the encoders consist of a series of 5 × 5 convolutions followed by rectified linear unit (ReLU) activation step by step. A series of 5 × 5 deconvolution layers are used with residual mapping to recover the structural details. For the Unet, a modified structure [22] is used, which consists of a left-side encoder path and a right-side decoder path with residual connection from the left to the right. All convolutional filters have a size of 3 × 3. Stride 2 × 2 is used for downsampling and 2 × 2 bilinear interpolation is used for upsampling. This results in approximately 7 × 10 5 parameters in the RED-CNN model and 3.4 × 10 5 parameters in the Unet model, when compared with the input image of 32400 pixels in our study.
Once the model is trained using equation (3.2), the feature set of pixel j is obtained by, where F denotes the output of the th layer of the CNN model, which corresponds to the th three-dimensional block in figure 1. Note that can only be set to a layer that provides multichannel feature maps of the same size as the input image. Thus, a general choice is the penultimate layer which commonly provides an intermediate output that has the same image size as the final output and also consists of a number of channels. Compared to the intensity-based features, the CNN-derived features form a different nonlinearly transformed space based on which the built kernel method may have a better performance, as illustrated later in the Results section. For the kernel methods, the extracted CNN feature set is fed into the kernel calculation, for instance, using a radial Gaussian kernel, where σ is a hyper-parameter which can be set to 1 if the feature set is normalized [6]. Note that the full consideration of all (j, l) pairs for kernel representation would result in a full matrix K, which is impractical for efficient implementation because of its large size. Similar to the previous kernel methods [6], K is built to be sparse using the k-nearest neighbours (kNN) strategy [23].

Computer simulation studies (a) Simulation set-up
We simulated a GE Discovery 690 PET/CT scanner in two dimensions. The TOF timing resolution of this PET scanner is about 550 ps. The simulation was conducted using one chest slice of the XCAT phantom. The true PET activity image and 511 keV attenuation image are shown in figure 2a,b, respectively. The images were first forward projected to generate a noise-free sinogram of 11 TOF bins. A 40% uniform background was included to simulate random and scattered events. Poisson noise was then generated using 5 million expected events. The X-ray CT image at a low-energy 80 keV was also simulated from XCAT and is shown in figure 2c.   [4], we used the 511 keV attenuation map converted from the X-ray CT image as the initial estimate of μ in the MLAA reconstructions for accelerated convergence. All different kernel matrices were built using kNN [23] with k = 50 based on the Euclidean distance between f l and f j in the same way as used in [6]. All MLAA reconstructions were run for 3000 iterations for the purpose of studying the convergence of different MLAA algorithms. Within each outer iteration, one inner iteration was used for the λ-estimation step and one inner iteration was used for the μ-estimation step. These parameters were chosen for approximately optimal image quality.

(c) Autoencoder convolutional neural network learning and feature visualization
We used the Adam optimization algorithm to train the RED-CNN and Unet of the X-ray CT image with 300 epochs on a workstation with a Nvidia GeForce RTX 2080 Ti GPU. The learning rates of RED-CNN and Unet were chosen to be 10 −4 and 10 −2 , respectively, for approximately optimal performance. The training time was approximately 20 s for both RED-CNN and Unet in this two-dimensional simulation study.
To demonstrate the differences between the autoencoder kernel and standard kernel, figure 3 visualizes the maps of the feature set used by the two types of kernels. Each subimage corresponds to one element of the feature vector f j or f CNN j at all different pixels. The standard intensity-based kernel was formed from 3 × 3 neighbouring patches, which explains why the feature maps look similar. The Unet-derived CNN features were learned using an autoencoder as the output of the penultimate layer (12 channels) and extracted intrinsic, differential features from the X-ray CT image. Figure 4 shows the two-dimensional manifold visualization of these high-dimensional (9 and 12, respectively) feature vectors using the t-distributed stochastic neighbour embedding (t-SNE) algorithm [24]. The t-SNE is a nonlinear dimension reduction technique that locates each high-dimensional data point in a low-dimensional embedding space with intrinsic structures preserved. Image pixels (equivalently points in the figure 4) are clustered but less organized in the embedding space of the intensity-based features. In comparison, pixels are more organized

(d) Evaluation metrics
Different MLAA methods were first compared for the image quality of GCT using the mean squared error (MSE) defined by MSE(μ) = 10 log 10 ||μ − μ true || 2 /||μ true || 2 (dB), (4.1) whereμ represents the reconstructed GCT image by each MLAA method and μ true denotes the ground truth. The ensemble bias and standard deviation (SD) of the mean intensity in regions of interest (ROIs) were also calculated to evaluate ROI quantification in a liver region and a bone region shown in figure 2d, where c true is the noise-free intensity and c = 1 N r N r i=1 c i denotes the mean of N r realizations. N r = 10 in this study. In addition to the evaluation of bias and SD for ROI quantification, pixelbased ensemble bias and SD in percentage were also calculated and reported in average for specific regions in the same way as used in [6].
Different MLAA algorithms were further compared for dual-energy CT multi-material decomposition. Similarly, image MSE, ROI-based bias and SD, as well as pixel-based bias and SD were calculated for each of the material basis fraction images. Because our focus is on dual-energy CT imaging, we did not intend to evaluate PET activity image reconstruction in this study.
(e) Comparison results for GCT image quality    The comparison of ensemble bias versus SD for GCT ROI quantification is shown in figure 7 by varying the iteration number. As iteration number increases, the bias of ROI quantification is reduced while the SD is increased. All the kernel MLAA algorithms outperformed the standard MLAA. While the RED kernel MLAA did not outperform the standard kernel MLAA, the Unet kernel MLAA achieved the best performance among different algorithms for both liver and bone ROIs. At a fixed bias level, the Unet kernel MLAA has lower SD than the other two kernel-based approaches. Figure 8 shows the comparison using pixel-based bias versus SD trade-off in the two regions (liver and bone). Again, all the kernel MLAA algorithms outperformed the standard MLAA. The SD level in the pixel-based evaluation is much higher than in the ROI-based evaluation (figure 7) because spatial correlation is taken into account in the latter. The Unet kernel MLAA achieved a better result than the standard kernel MLAA in the liver region for early iterations but became worse for late iterations. In the bone region, the Unet kernel demonstrated a slightly worse performance than the standard kernel but overall the two algorithms were comparable to each other for the pixel-based bias-SD evaluation in this region.
(f) Comparison results for material decomposition Figure 9 shows the fractional basis images of soft tissue and bone from multi-material decomposition of the PET-enabled dual-energy CT images. The results were obtained from the MLAA reconstructions with 600 iterations. The ground truth of the soft tissue and bone bases was generated using the noise-free data. Compared to the standard MLAA and RED kernel MLAA, the Unet kernel MLAA reconstruction led to better visual quality and decreased image MSE. Figure 10 shows image MSE as a function of iteration number for each basis fractional image. All the kernel MLAAs were superior to the standard MLAA, with the best MSE performance from the Unet kernel MLAA.  To demonstrate the performance of different reconstruction algorithms for ROI quantification on basis fractional images, figure 11 shows the bias versus SD trade-off plot for ROI quantification on the soft tissue and bone fractional images. The Unet kernel MLAA outperformed other algorithms for ROI quantification in the liver and bone regions. For bone ROI quantification, all kernel MLAAs demonstrated a noticeable bias when compared with the standard kernel MLAA. The bias in the bone ROI quantification of the fractional basis image was supposed to be propagated from the GCT image reconstruction (as demonstrated in figure 7b).
The result from pixel-based evaluations is shown in figure 12. Overall, the Unet kernel MLAA demonstrated a better or at least comparable performance as compared to the standard kernel MLAA and other algorithms.

Discussions
This paper combines autoencoder-based representation learning with kernel MLAA to improve PET-enabled dual-energy CT imaging. This work falls into the scope of applying deep learning to MLAA. Compared to direct application of CNN for MLAA (e.g. [25,26]), the proposed approach has the advantages of not requiring a large training dataset and being patient specific. The training is also computationally efficient as it is based on a single image. The autoencoder-derived feature maps in figure 3b indicate a smoothing effect exists as compared to the intensity-based feature maps in figure 3a.  downsampling and upsampling in the Unet. In our investigation, applying smoothing to the X-ray CT led to reduced GCT image quality (results not shown), which can be explained by the loss of spatial resolution. This suggests the smoothing effect in the feature maps may have compromised the performance of the proposed approach and possibly caused the bias in the bone region. This problem can be potentially overcome by using a modified Unet architecture (e.g. [27]) and/or by our other ongoing effort [28] that combines deep image prior [22,29] as an implicit regularization on α in the kernel MLAA method.
In addition, there are multiple options of layer location to extract the CNN features after the autoencoder learning. This paper does not intend to identify the optimal location but introduces and demonstrates the feasibility for kernel MLAA. It is possible to combine the feature extraction process and kernel optimization into an integrated deep learning framework to further improve the autoencoder kernel MLAA approach.
Based on the quantitative MSE and bias-SD curves shown in this paper, 600 iterations of reconstruction seem a reasonable choice for the kernel MLAA reconstructions. Here, the number of iterations is based on not using any ordered subsets. If 30 subsets were used, the needed number of iterations would be approximately 20, which is computationally feasible for clinical practice.
TOF PET data can determine the GCT image but the solution is non-unique due to a scaling constant in the sinogram domain [30]. The scaling problem was less severe in our application partly because the available X-ray CT provides a very good initial estimate and also the anatomical prior information from X-ray CT could further mitigate the problem in the kernel MLAA reconstructions. Dedicated approaches for the scaling problem are being developed in the field (e.g. [31]), which may be combined with the kernel MLAA method to improve GCT quantification.
Another limitation of this work is the simulation study used the XCAT phantom that was generated mono-energetically without artefacts. The resulting X-ray CT prior could be very strong and may be less realistic in practice. Our future work will test the method directly using physical phantom and patient data.

Conclusion
We have developed an autoencoder kernel MLAA reconstruction method for PET-enabled dualenergy CT. The autoencoder-derived feature set can provide an improved kernel representation to incorporate X-ray CT as the image prior for GCT image reconstruction from PET emission data. Computer simulation results have demonstrated the improvement of the autoencoder kernel MLAA over existing MLAA algorithms for GCT image quality and dual-energy CT material decomposition. The proposed method can suppress noise, reduce image artefacts, and improve ROI quantification. A weakness of the kernel methods is they may potentially over-smooth a bone region and induce a bias in bone ROI quantification. Our future work will further optimize the method and identify a solution to overcome the limitation.