An energy efficient time-mode digit classification neural network implementation
This paper presents the design of an ultra-low energy neural network that uses time-mode signal processing. Handwritten digit classification using a single-layer artificial neural network (ANN) with a Softmin-based activation function is described as an implementation example. To realize time-mode operation, the presented design makes use of monostable multivibrator-based multiplying analogue-to-time converters, fixed-width pulse generators and basic digital gates. The time-mode digit classification ANN was designed in a standard CMOS 0.18 μm IC process and operates from a supply voltage of 0.6 V. The system operates on the MNIST database of handwritten digits with quantized neuron weights and has a classification accuracy of 88%, which is typical for single-layer ANNs, while dissipating 65.74 pJ per classification at a speed of 2.37 k classifications per second.
This article is part of the theme issue ‘Harmonizing energy-autonomous computing and intelligence’.
Machine learning is the study of models and algorithms that give rise to generalizable understanding of data and task completion without explicitly programmed instructions. As one of the many approaches in machine learning, artificial neural networks (ANNs) are partly inspired by connectivity and property of biological neurons and have proven to achieve considerable performance in a number of application areas.
These areas include machine translation [1,2], computer vision [3,4], pattern recognition [5–7], game-playing [8,9] and medical diagnosis [10,11]. In applications that require real-time operation, e.g. speech  and human action  recognition, and physical activity and patient monitoring , there is a need for always-on sensing. However, one of the challenges of modern machine learning algorithms is their energy dissipation . Most machine learning hardware development is done using either standard cell digital design methods [14,15] or mixed-signal methods  employing analogue processing techniques in CMOS technologies.
The advancement and scaling of CMOS technologies have always been based on improving the performance of digital systems. With each new technology node, the threshold voltages of the available MOS transistors and the supply voltage of the process node are scaled as well. Scaling of the supply voltage reduces the headroom that is available to the transistors for operating in the saturation region. Without transistors operating in the saturation region, it is very hard to realize signal processing and amplification functions in the analogue domain. One solution to this problem is using time-mode signal processing (TMSP) techniques [17–19]. Time-mode (TM) circuits represent an analogue signal by the time difference between two binary switching events. Furthermore, when compared to standard digital design practices, TM operation is inherently lower power. For example, in standard CMOS digital circuit operation, transferring N bits of data in parallel requires anywhere from 0 to N switching events on the data lines, whereas in a TM circuit the transfer always takes two switching events if the rising and falling edges of a pulse are used for information transmission. There are other advantages of TM operation, especially for machine learning hardware implementations: (i) TM operation allows the designer to reduce the supply voltage and still realize analogue-like functions, as will be shown in this paper, and (ii) using single wires for data transmission instead of data buses allows a hardware designer to realize densely connected ANNs on chip more easily. Based on these observations, it is arguable that more low-power signal processing and machine learning systems will be implemented using TMSP techniques in the future.
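The switching-count argument above can be illustrated with a short sketch. It assumes a parallel bus whose lines toggle only where consecutive words differ, and a TM link that always costs one rising and one falling edge; the function and constant names are hypothetical and are not taken from the paper.

```python
def parallel_transitions(previous_word: int, next_word: int) -> int:
    """Count the bus lines that toggle when a parallel bus changes words."""
    # XOR marks the bit positions that differ; each one is a switching event.
    return bin(previous_word ^ next_word).count("1")


# Assumption: a time-mode transfer encodes the value in a pulse width,
# so it always costs one rising and one falling edge on a single wire.
TIME_MODE_TRANSITIONS = 2

# A 4-bit bus can see anywhere from 0 to 4 transitions per transfer.
best_case = parallel_transitions(0b1010, 0b1010)
worst_case = parallel_transitions(0b0000, 0b1111)
print(best_case, worst_case, TIME_MODE_TRANSITIONS)
```

For wide words with high switching activity, the fixed cost of two transitions is the smaller one; for narrow or rarely changing words, the parallel bus can toggle less.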
The research work presented here focuses on developing a TM digit-classification single-layer neural network for ultra-low-energy operation. The proposed system is shown in figure 1. A digit classification ANN was chosen for its simplicity and its well-studied and well-understood behaviour. During the design and training of the ANN, image data from a widely available dataset, MNIST , was used. Image data of n × n pixels were converted into analogue values and applied to the TM ANN. The TM ANN processes the applied image data by accumulating weighted delay values and generates a classification signal for the input image. As will be shown in the paper, TMSP allows such an ANN to work with extremely low energy dissipation and with a classification accuracy that is typical for single-layer ANNs.
Contributions of this paper are as follows. A TM implementation of a handwritten digit classification ANN is presented. Optimization steps for both system level and hardware level design are given, followed by the details of sub-block designs. The designed ANN is verified by both system-level mathematical simulations as well as with transistor-level SPICE simulations. The design is characterized for classification accuracy, energy dissipation and classification speed. The organization of the paper is as follows. Section 2 presents the high-level details and implementation steps of the ANN in software. Section 3 describes the TMSP ANN implementation with sub-block design and performance improvement steps. Transistor-level simulation results are presented in §4 together with performance metrics, and, finally, the conclusion is drawn in §5.
2. Artificial neural network
In this study, we implemented a hardware version of a TM, fully connected, single-layer neural network to recognize handwritten digits (figure 2), using the MNIST database of handwritten digits . The MNIST database contains a training set of 60 000 images and a test set of 10 000 images, with all images of size 28 × 28 pixels.
As presented in figure 2, a single layer of neurons applying a linear transformation to the input data (i.e. MNIST handwritten digits) was constructed. The size of input sample was set to 784 (MNIST handwritten digits of 28 × 28 pixels) with an input range [0, 1] and the size of output sample was set to 10 (10 possible output digits ranging from 0 to 9).
An artificial neuron in the implemented ANN receives pixel data from input units (figure 3). Each input x1…xn is multiplied by its respective weight w1…wn, and the artificial neuron receives and sums all weighted inputs according to
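The weighted sum referred to later in the text as (2.1) can be sketched as follows. This is the standard form, written under the assumption that no bias term is used; the output symbol y_j and the neuron index j are notation introduced here for illustration only.

```latex
y_j = \sum_{i=1}^{n} w_{j,i}\, x_i , \qquad j = 0, \dots, 9
```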
After the weighted sums of the inputs are transformed by the activation function, final classification results are supplied by the ANN, i.e. if the nth value is the highest in the output vector of the Softmin function, the weighted sum output of the nth neuron is the smallest and the input is most likely to be digit n. Unlike most implementations, which use Softmax as the activation function, the present study uses Softmin. The reason is that with Softmax, an artificial neuron with a greater weighted sum (i.e. greater accumulated delay in the hardware implementation) wins. This does not correspond to the targeted hardware implementation, in which the fastest neuron is favoured; using a Softmax activation function would therefore decrease the operating speed of the ANN. Hence, we have chosen the Softmin activation function, with which the fastest neuron wins and the classification speed of the ANN is higher.
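The relation between Softmin and the 'smallest sum wins' rule can be checked with a minimal sketch. It uses the usual definition of Softmin as Softmax applied to the negated inputs; the weighted sums below are made-up values, not results from the paper.

```python
import math


def softmin(values):
    """Softmin: a softmax applied to the negated inputs."""
    # Shifting by the minimum keeps the exponentials numerically stable
    # without changing the result.
    lowest = min(values)
    exps = [math.exp(-(v - lowest)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weighted sums (accumulated delays) for the ten neurons.
weighted_sums = [5.1, 4.8, 2.3, 6.0, 5.5, 4.9, 5.7, 3.9, 5.2, 4.4]
probabilities = softmin(weighted_sums)

# The neuron with the highest Softmin output is the one with the smallest sum.
predicted = max(range(len(probabilities)), key=probabilities.__getitem__)
fastest = min(range(len(weighted_sums)), key=weighted_sums.__getitem__)
print(predicted, fastest)
```

Because Softmin is monotonically decreasing in each input, taking the arg-max of its output is equivalent to taking the arg-min of the weighted sums, which is what a first-edge-wins circuit computes.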
The described high-level ANN was implemented, trained and tested on the PyTorch framework . The training of the ANN was done using batches of 100 images per batch and for 10 epochs. During the off-line training of the ANN, floating point values were used; however, during hardware implementation and high-level verification simulations, the weight values were scaled to a range and quantized.
The adaptive moment estimation method (Adam)  was used as the optimization method during the training of the ANN. Adam optimization combines the advantages of both the Adaptive Gradient Algorithm (AdaGrad ), which works well with sparse gradients, and Root Mean Square Propagation (RMSProp ), which works well in online and non-stationary settings. Kingma & Ba  suggested that instead of generating its parameter updates using a momentum (as RMSProp with momentum does), the updates of Adam may be directly estimated using running averages of the first and second moments of the gradient. As a result, Adam performed as well as or better than RMSProp regardless of hyperparameter setting. In L2-regularized multi-class logistic regression, Adam converged faster than AdaGrad. In a dataset with sparse features, Adam converged as fast as AdaGrad while dealing with sparse features efficiently. In an experiment with convolutional neural networks, Adam converged considerably faster than AdaGrad. For a more detailed discussion, see [22,25]. During training, the learning rate (α) was set to 0.01 while all other parameters were configured based on the default settings recommended in  (β1 = 0.9, β2 = 0.999, ϵ = 10−8) and we aimed to minimize the cross-entropy loss, which is given by
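As an illustration of the update rule described above, the following is a minimal pure-Python sketch of one Adam step with the hyperparameters quoted in the text (α = 0.01, β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁸). It is not the PyTorch implementation used for training, and the toy quadratic objective is an assumption chosen only to make the example self-contained.

```python
import math


def adam_step(theta, grad, m, v, t,
              alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a list of parameters (t is the 1-based step)."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        # Exponential moving averages of the gradient and its square.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        # Bias correction compensates for the zero initialization.
        m_hat = m_i / (1.0 - beta1 ** t)
        v_hat = v_i / (1.0 - beta2 ** t)
        new_theta.append(p - alpha * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v


# Toy objective (an assumption for illustration): f(w) = (w - 3)^2.
theta, m, v = [0.0], [0.0], [0.0]
first_step = None
for t in range(1, 2001):
    grad = [2.0 * (theta[0] - 3.0)]
    theta, m, v = adam_step(theta, grad, m, v, t)
    if t == 1:
        first_step = theta[0]
print(first_step, theta[0])
```

On the first step the bias-corrected ratio reduces to the sign of the gradient, so the parameter moves by approximately α regardless of the gradient magnitude.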
To implement the ANN hardware efficiently, we first investigated the effects of image size reduction (downsampling) and quantization of neuron weights. Multiple ANNs with varying image sizes (from 1 pixel/side to 28 pixels/side) and varying weight quantization bits (1–8 bits) were created, trained and tested. To ease the hardware implementation and the subsequent implementation steps that will be explained in the next sections, in addition to standard training settings, we constrained the minimum value of neuron weights to be 0. Therefore, all the weights obtained from the training were either 0 or positive numbers. The results of our simulations are presented in figure 4. During our tests, the maximum accuracy achievable from the presented single-layer shallow ANN was 92.95% (for an image size of 26 × 26 and 8-bit quantization), which is in line with the classification accuracy of single-layer ANNs in the literature .
After the successful software implementation, training and testing of the ANN, multiple steps were taken to prepare the design for TM hardware implementation. First, image size and number of quantization bits were chosen for the required accuracy. In this implementation, we opted for a 9 pixels/side input image (81 input pixels) in order to (i) reduce the energy dissipation without significant loss of accuracy, (ii) to have an ANN implementation that is directly comparable to an implementation in the literature , and (iii) to reduce the transistor-level transient simulation time significantly. With input image scaling, the maximum digit classification accuracy reduced from 92.95 to 89.65% (for 8-bit quantization).
After the input image size was chosen, the weights were scaled to the range [0, 1] and were later quantized. From our simulations (figure 4), 4-bit quantized weights were a good compromise between the expected energy dissipation and accuracy. In the implemented system, which will be explained in the next section, most of the energy is dissipated in the switched capacitances. For a fixed total switched capacitance, the number of quantization bits has a negligible effect on the total energy dissipation, and employing a higher number of quantization bits only results in added implementation complexity. However, if smaller capacitance values can be tolerated, i.e. with less stringent noise and mismatch considerations in the system, then the smallest number of quantization bits for a given image size, and hence the smallest total capacitance in the implementation, should be chosen, as the energy dissipation scales linearly with the number of quantization bits. Furthermore, energy dissipation increases quadratically with the image size per side, making the image size a more important parameter for energy reduction. Therefore, the smallest possible image size that satisfies the accuracy requirements should be chosen to minimize the energy dissipation. For the presented ANN implementation, we assumed 9 × 9 input images, and chose 4-bit quantized weights to represent an average case for the number of quantization bits. The classification accuracy loss due to quantization was minimal, i.e. from 89.65 to 89.35%.
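The scale-then-quantize step can be sketched as below. This is a sketch under assumptions: it uses uniform rounding to 2^bits − 1 levels after dividing by the largest weight, which may differ from the exact scheme used by the authors, and the sample weights are invented.

```python
def quantize_weights(weights, bits=4):
    """Scale non-negative weights to [0, 1] and round to integer codes.

    Returns codes in the range 0 .. 2**bits - 1 (assumed uniform quantizer).
    """
    levels = (1 << bits) - 1
    w_max = max(weights)
    if w_max == 0:
        return [0] * len(weights)
    return [round(w / w_max * levels) for w in weights]


# Hypothetical trained weights (all >= 0, as constrained during training).
trained = [0.0, 0.02, 0.5, 1.2, 2.6]
codes = quantize_weights(trained, bits=4)
print(codes)
```

Small weights collapse to code 0 under this scheme, which is the effect that lets zero-weight connections be dropped from the hardware.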
Results of quantization on the weights of the neuron (used to classify handwritten digit 9) are shown in figure 5. The leftmost figure shows the intensity of the weights, darker pixels representing smaller values. The middle figure shows the floating point scaled weights and the rightmost figure shows the weights after quantization. When these two figures are compared, it is observed that due to quantization, many weights were reduced to zero. These zero weights have no effect on the weighted sum given in (2.1), therefore can be removed to both simplify the hardware implementation and reduce energy dissipation. Non-zero weights for all the neurons are given in table 1.
|neuron no.||non-zero weights|
Owing to the nature of the designed TMSP circuits which will be explained in the next section, each circuit that realizes the multiply accumulate (MAC) operation has an inherent non-zero fixed delay. Therefore, not to penalize the neurons which have more non-zero weights and for the correct operation of the designed system, we designed the circuit implementation of the ANN such that each neuron has an equal number of MAC elements, which is equal to the maximum number of non-zero weights given in table 1, i.e. 64 for neuron 8. Such an implementation allowed us to reduce the number of MAC units for the 9 × 9 pixel design from 810 to 640, effectively reducing the expected average energy dissipation by 21% by high-level design choices.
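The MAC-count saving quoted above follows from simple arithmetic, sketched here. The per-neuron non-zero weight counts are placeholders (only the maximum of 64 is taken from the text), since the contents of table 1 are not reproduced in this sketch.

```python
NEURONS = 10
PIXELS = 9 * 9  # 81 inputs per neuron

# Hypothetical non-zero weight counts per neuron; only the maximum (64)
# comes from the text, the other values are placeholders.
nonzero_counts = [55, 48, 60, 58, 57, 61, 59, 52, 64, 56]

# Every chain is padded to the longest one so that the fixed per-stage
# delay is identical for all neurons.
chain_length = max(nonzero_counts)

macs_fully_connected = NEURONS * PIXELS
macs_equalized = NEURONS * chain_length
saving = (macs_fully_connected - macs_equalized) / macs_fully_connected
print(macs_fully_connected, macs_equalized, round(100 * saving))
```

Equalizing the chain lengths costs some unused MAC stages in the shorter chains, but it keeps the fixed delay from biasing the race between neurons.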
3. A time-mode MNIST digit classifier ANN implementation
Following the mathematical modelling, training, verification and quantization of the ANN, we applied TM operation and TMSP methods to the design of a digit classification ANN in a standard 0.18 μm IC process. Each neuron defined by (2.1) is mapped to a TMSP implementation, as shown in figure 6. As in (2.1), a chain of multiplying analogue-to-time converters (mATC) converts a voltage input value into a pulse whose width is proportional to both the input signal value and the assigned weight. The signal propagates through the chain of mATCs and fixed-width pulse generators (FWPGs). FWPGs, represented by the pulse blocks in figure 6, are required to be able to trigger the next mATC in the chain with the falling edge of the previous mATC pulse. The structures and operation principles of both the mATC and the negative-edge triggered fixed-width pulse generator are explained in the following paragraphs. In this specific implementation, we created a chain of 64 mATCs for each neuron. Owing to the resulting zero weights after quantization, not all the pixels are connected to each neuron, further simplifying the hardware implementation and future on-chip routing.
The operation of the TM ANN neuron is as follows: once the neuron has been triggered with the Begin Classification signal, the chain of mATCs and FWPGs operates sequentially to accumulate the delay information from each mATC, each of which represents the weighted input pixel data. As explained in the previous section, the ANN has been trained with a Softmin activation function, meaning that the neuron with the smallest weighted sum output value, i.e. in TMSP terms, the fastest response (earliest falling edge at the output of the last (Nth) mATC), will get to classify the input image first. Therefore, we placed negative-edge triggered flip-flops at the output of the neurons to capture the final falling edge of the signal generated by the chain of mATCs. This ‘faster response wins’ approach directly mimics the Softmin function explained in the previous section and is also similar to the behaviour of some repeatedly trained biological neural networks.
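The race between neuron chains can be modelled behaviourally as follows. This is an idealized sketch, not the circuit: it assumes each stage contributes a fixed delay plus a delay proportional to weight × input, and all names, units and values are hypothetical.

```python
def neuron_finish_time(pixels, weights, unit_delay, fixed_delay):
    """Accumulated delay of one chain of stages.

    Assumed model: each stage adds a fixed offset plus a delay that is
    proportional to weight * input.
    """
    return sum(fixed_delay + unit_delay * w * x
               for w, x in zip(weights, pixels))


def classify(pixels, weight_rows, unit_delay=1.0, fixed_delay=0.1):
    """Return the index of the neuron whose final edge arrives first."""
    times = [neuron_finish_time(pixels, row, unit_delay, fixed_delay)
             for row in weight_rows]
    winner = min(range(len(times)), key=times.__getitem__)
    return winner, times


# Three toy neurons with three inputs each (arbitrary time units).
pixels = [1.0, 0.0, 1.0]
weight_rows = [[3, 1, 2], [0, 5, 1], [2, 2, 2]]
winner, times = classify(pixels, weight_rows)
print(winner, times)
```

Since every chain has the same number of stages, the fixed offsets cancel in the comparison and the winner is decided by the weighted sums alone.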
During the design of the TM ANN, we employed a modified version of the basic monostable multivibrator (MSMV)  to work as an mATC in the system, as shown in figure 7. In this implementation, a pMOS transistor (M1) acts as a variable resistor whose resistance is modulated by the current input voltage signal. When the MSMV is triggered by an input pulse, nodes n1 and n2 are pulled to logic-low and M1 starts charging node n2. The gate of M1 is driven by the input signal that is to be converted into time, and sampling is realized by modulating the instantaneous resistance of M1. Thus, the RC time constant of the multivibrator is modulated as well, resulting in a pulse whose width is proportional to the amplitude of the input signal. The pulse width generated by the ATC is given in  by
In the first iteration of the design, the minimum unit capacitor that satisfies this requirement was found to be 20 fF. In this iteration, we used only switchable capacitors as the charged capacitor to reduce the total switched capacitance, hence the total energy dissipation. However, during our transistor-level simulations, we saw that due to the parasitic capacitances at node n2 and the non-idealities of the switches, the pulse-width ratio between the successive weights degraded, especially for the smaller values, i.e. for
Transistor-level simulations using the HSPICE simulator were run to characterize the mATC. Simulations were run for a supply voltage of 0.6 V VDD, while sweeping the input signal voltage from 300 to 400 mV, to represent the expected input signal values from an imager. In all the transistor-level simulations, the black and white pixels are represented by 300 mV and 400 mV input voltage values to the mATCs, respectively.
The advantages of the placement of Cx in the second iteration of the mATC design are shown in figure 8. The range of pulses generated by both versions of the mATC for different weights as well as their mean are presented in the figure. There are multiple points that should be noted from transistor-level simulation results: (i) slope of the time response of the mATC has been reduced, effectively making the mean pulse-width values more fitting to a binary progression, hence the name time-linearizing (TL), (ii) due to the better fitting of the mean to binary progression, the error between the multiplication steps has been reduced (the root mean squared error (RMSE) is reduced from 6.61 to 1.59%, see table 2 for more details), and (iii) due to the reduced total capacitance, the system response is faster (average pulse width is reduced from 81.16 to 43.72 μs) and the average energy dissipation is reduced (from 254 to 157 fJ).
|mATC multiplication weight ratio||expected ratio||% error - base mATC||% error - TL mATC|
A negative-edge triggered FWPG, shown in figure 9, is used between the mATC blocks as we require the triggering of the next mATC in the chain to occur during the falling edge of the pulse generated by the previous mATC. By triggering the next mATC with the falling edge of the previous mATC output, time addition operation is realized. In this implementation, we used an FWPG which generates pulses with a pulse-width of 50 ns. This minimum value of the pulse-width can be chosen to be any value that satisfies the following requirements during triggering: (i) both nodes n1 and n2 are completely driven to ground during the pulse, and (ii) other input of the NOR gate is completely driven to VDD with sufficient timing margin to account for process mismatch before the output of the FWPG goes low. The maximum value of the pulse-width of the FWPG is limited by the minimum pulse value that is generated by the mATC, i.e. 1.94 μs for a
4. Simulation results
After characterization and verification of the sub-blocks of the hardware ANN, extensive transistor-level SPICE simulations using the HSPICE simulator were run to verify the correct time-mode operation of the designed system. As in the characterization of the sub-blocks, a supply of 0.6 V is used. Separate testbenches were programmatically created to simulate 100 samples from the test dataset and transient simulations were run. The results of one such simulation run, for a classification of digit 2, are shown in figure 10. As can be seen, the correct classifier neuron generates a faster output response than the other neurons, successfully classifying the input digit. For this specific case, the fastest neuron, i.e. neuron 2, responded 62.1 μs faster than the next fastest neuron.
The average energy dissipation per classification while working at 0.6 V VDD is 65.74 pJ. The average classification response time for the test dataset is 421.8 μs, resulting in 2.37 k classifications per second for a classification accuracy of 88%. As the focus of the present study is on an energy-efficient design with low dissipation rather than state-of-the-art classification accuracy, the accuracy of 88%, which is typical in 1-layer neural networks , is acceptable at the current stage. Moreover, the classification accuracy is still far higher than that of random guessing on the MNIST dataset (10% for ten classes).
We also investigated the effects of process mismatch on the performance of, first, a chain of mATCs, and, later, on the ANN. We first simulated a chain of mATCs with varying number of elements for the effects of local mismatch. For these simulations, to represent an average case of operation, all the analogue input voltages to the mATCs and the multiplication coefficients were set to 350 mV and
As is apparent from the mATC chain mismatch simulations, increasing the number of elements in the chain reduces the relative variability of the ANN. Because the implemented ANN is trained off-line, there is no provision for, or straightforward way of, addressing variability during training. Nevertheless, reliability issues due to process mismatch may be addressed in two ways: (i) algorithmically testing each neuron for the variability of the elements by applying multiple analogue input and digital control combinations and extracting the linear transfer curve, and (ii) increasing the number of mATCs in the chain to average out and reduce the effects of variation, as shown in figure 11.
To test the performance of the ANN under process mismatch, we ran 100-point Monte Carlo mismatch simulations for each of the 100 image samples used to simulate and characterize the system (100 × 100 transient simulations in total). The average standard deviation of the neuron response time due to process mismatch was 9.2 μs. The results of one such simulation run, for the classification of handwritten digit 3, are presented in figure 12 as an example. The figure shows the response time variation distribution of each neuron in the designed ANN due to process mismatch.
For a misclassification to occur due to mismatch, the second fastest neuron should respond faster than the fastest neuron of the nominal conditions. For the case shown in figure 12, this is possible between neurons 3 and 7. This probability can be modelled as the half of the area of overlap of two Gaussian distributions with the same standard deviation and differing means (figure 13), and the intersection point of the distributions depends on the distance between the means. For mean differences less than 1.2σ, we saw that the errors due to the training (88% accuracy) occurred. For mean differences greater than 1.2σ, we calculated the added misclassification probability for each neuron and found the total added possibility of error due to process mismatch to be 1.17%, reducing the expected minimum accuracy to 86.63%.
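The half-overlap model described above has a closed form for two Gaussians with equal standard deviation: the curves intersect midway between the means, so half of the overlap area equals one tail, Φ(−Δ/(2σ)), where Δ is the distance between the means. A minimal sketch follows; the function names are hypothetical, and pairing the 62.1 μs gap quoted for the digit 2 example with the 9.2 μs average standard deviation is purely illustrative.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def half_overlap(mean_gap, sigma):
    """Half of the overlap area of two equal-sigma Gaussians.

    The densities cross at the midpoint between the means, so the overlap
    consists of two equal tails of area Phi(-gap / (2 * sigma)) each.
    """
    return normal_cdf(-abs(mean_gap) / (2.0 * sigma))


# At the 1.2-sigma threshold mentioned in the text.
at_threshold = half_overlap(1.2, 1.0)

# Illustrative only: gap and sigma taken from different parts of the text.
example = half_overlap(62.1, 9.2)
print(at_threshold, example)
```

The value falls quickly as the mean separation grows, which is why only closely spaced neuron pairs contribute noticeably to the added error.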
When compared to the state-of-the-art hardware ANN implementations, the design presented in this work compares favourably in terms of reduced energy dissipation, which is the main aim of this design exercise. A comparison of results with a recent and directly comparable hardware 9 × 9 pixel MNIST classification ANN in  is given in table 3. Even though the presented ANN is designed using an older technology, i.e. a 0.18 μm process, it compares favourably in terms of energy dissipation. One metric where the design in  performs better than the presented implementation is the classification speed. However, due to the design constraints in , its operating voltage cannot be lowered further; hence, energy dissipation, which is proportional to the square of the supply voltage in digital circuits, cannot be further reduced. Furthermore, it is expected that our implementation will achieve much better average operating speed and energy dissipation numbers when this design is migrated to more advanced technologies. The energy dissipation per classification is reduced by a factor of 9.58, from 630 pJ down to 65.74 pJ, when compared to . It should also be noted that the ANN implementation presented in this paper works with analogue signal inputs, without requiring the input data to be converted to digital for further processing. If the analogue-to-digital conversion energy cost per image is added to the classification energy numbers reported in  in table 3, the presented ANN implementation is even more energy efficient.
| ||SRAM classifier ||this work|
|supply voltage (V)||1.2||0.6|
|classification accuracy (%)||90||88|
|classification speed (Hz)||50 × 10⁶||2370|
|analogue-to-digital conversion energy included||no||yes|
|energy dissipation (pJ)||630||66|
Extending the single-layer ANN presented in this study to a multi-layer version is an on-going work. However, from our preliminary results, it has been observed that, once a value/variable is converted to a TM signal, in order to operate in the most energy efficient way, processing should continue in TM without conversion between the TM and analogue/digital domains. For example, an asynchronous time-to-digital converter (TDC) in 0.18 μm process dissipates 1.48 pJ , and a similar TDC in a 65 nm process dissipates 0.97 pJ  per conversion. When compared to the average energy dissipation of each neuron (6.6 pJ), it can be observed that conversion between different operating domains incurs energy dissipation overhead values which are comparable to the energy dissipation of the data processing circuitry.
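The headline figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the assumption that the per-neuron energy is the total divided evenly over ten neurons is ours, and the variable names are hypothetical.

```python
ENERGY_PER_CLASSIFICATION_PJ = 65.74   # this work, at 0.6 V
RESPONSE_TIME_S = 421.8e-6             # average classification response time
REFERENCE_ENERGY_PJ = 630.0            # SRAM classifier used for comparison
NEURONS = 10
TDC_ENERGY_PJ = {"0.18 um": 1.48, "65 nm": 0.97}  # per conversion

# Classification rate implied by the average response time.
rate_hz = 1.0 / RESPONSE_TIME_S

# Energy advantage over the reference design.
energy_ratio = REFERENCE_ENERGY_PJ / ENERGY_PER_CLASSIFICATION_PJ

# Assumption: energy is split evenly over the ten neuron chains.
energy_per_neuron_pj = ENERGY_PER_CLASSIFICATION_PJ / NEURONS

# Domain-conversion cost relative to the energy of one neuron.
tdc_overhead = {node: energy / energy_per_neuron_pj
                for node, energy in TDC_ENERGY_PJ.items()}

print(round(rate_hz), round(energy_ratio, 2), round(energy_per_neuron_pj, 1))
print({node: round(frac, 2) for node, frac in tdc_overhead.items()})
```

Under these assumptions, a single time-to-digital conversion costs a sizeable fraction of the energy of one neuron, which supports keeping multi-layer processing in the time domain.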
This paper presents the hardware design and the simulation results of a TM, single-layer ANN with Softmin activation function for handwritten digit classification. TMSP techniques have been applied for accumulating weighted image signal values using energy-efficient time-mode circuitry. Optimization steps for both system level and hardware level design are given. The system was designed and simulated in a standard 0.18 μm process and operates from a supply voltage of 0.6 V. By applying the presented design guidelines, an energy-optimal 9 × 9 handwritten digit image classification ANN with 4-bit quantized weights was designed. The energy dissipation of the design for each classification is 65.74 pJ while operating at a speed of 2.37 k classifications per second, with a classification accuracy of 88%.
This article has no additional data.
O.C.A. conceived, designed, simulated and verified the transistor-level, time-mode ANN implementation. O.C.A. and J.M. created and optimized Python level ANN for time-mode implementation. Both authors drafted, read and approved the manuscript.
We declare we have no competing interests.
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 752819 for the MSCA IF Project ATiNaRI.
The authors thank the anonymous reviewers for their constructive comments and helpful suggestions.
Published by the Royal Society. All rights reserved.
Sutskever I, Vinyals O, Le QV. 2014 Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. San Diego, CA: NIPS.
Deng L, Hinton G, Kingsbury B. 2013 New types of deep neural network learning for speech recognition and related applications: an overview. In 2013 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, 26–31 May, pp. 8599–8603. Piscataway, NJ: IEEE.
Liang M, Hu X. 2015 Recurrent convolutional neural network for object recognition. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Boston, MA, 7–12 June, pp. 3367–3375.
Al-Shayea QK. 2011 Artificial neural networks in medical diagnosis. Int. J. Comput. Sci. Issues 8, 150–154.
Kodali S, Hansen P, Mulholland N, Whatmough P, Brooks D, Wei G-Y. 2017 Applications of deep neural networks for ultra low power IoT. In 2017 IEEE Int. Conf. on Computer Design (ICCD), pp. 589–592. Piscataway, NJ: IEEE.
Bankman D, Yang L, Moons B, Verhelst M, Murmann B. 2018 An always-on 3.8 μJ/86% CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28 nm CMOS. In 2018 IEEE Int. Solid-State Circuits Conf. (ISSCC), Boston, MA, 11–15 February, pp. 222–224. Piscataway, NJ: IEEE.
Chen Z, Gu J. 2016 Analysis and design of energy efficient time domain signal processing. In Proc. of the 2016 Int. Symp. on Low Power Electronics and Design, San Francisco, CA, 8–10 August, pp. 100–105. New York, NY: ACM.
Akgun OC, Mangia M, Pareschi F, Rovatti R, Setti G, Serdijn WA. 2019 An energy-efficient multi-sensor compressed sensing system employing time-mode signal processing techniques. In Proc. of IEEE Int. Symp. on Circuits and Systems (ISCAS), Sapporo, Japan, 26–29 May, pp. 1–5. Piscataway, NJ: IEEE.
Paszke A et al. 2017 Automatic differentiation in PyTorch. In NIPS-W. San Diego, CA: NIPS.
Tieleman T, Hinton G. 2012 Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw. Mach. Learn. 4, 26–31.
Ruder S. 2016 An overview of gradient descent optimization algorithms.
Sedra AS, Smith KC. 1998 Microelectronic circuits, 4th edn. Oxford, UK: Oxford University Press.
Akgun OC. 2018 An asynchronous pipelined time-to-digital converter using time-domain subtraction. In Proc. of IEEE Int. Symp. on Circuits and Systems (ISCAS), Florence, Italy, 27–30 May, pp. 1–5. Piscataway, NJ: IEEE.