Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences

An energy efficient time-mode digit classification neural network implementation

Published: https://doi.org/10.1098/rsta.2019.0163

    Abstract

    This paper presents the design of an ultra-low energy neural network that uses time-mode signal processing. Handwritten digit classification using a single-layer artificial neural network (ANN) with a Softmin-based activation function is described as an implementation example. To realize time-mode operation, the presented design makes use of monostable multivibrator-based multiplying analogue-to-time converters, fixed-width pulse generators and basic digital gates. The time-mode digit classification ANN was designed in a standard CMOS 0.18 μm IC process and operates from a supply voltage of 0.6 V. The system operates on the MNIST database of handwritten digits with quantized neuron weights and achieves a classification accuracy of 88%, which is typical for single-layer ANNs, while dissipating 65.74 pJ per classification at a speed of 2.37 k classifications per second.

    This article is part of the theme issue ‘Harmonizing energy-autonomous computing and intelligence’.

    1. Introduction

    Machine learning is the study of models and algorithms that build generalizable understanding of data and complete tasks without explicitly programmed instructions. As one of the many approaches in machine learning, artificial neural networks (ANNs) are partly inspired by the connectivity and properties of biological neurons and have achieved considerable performance in a number of application areas.

    These areas include machine translation [1,2], computer vision [3,4], pattern recognition [5–7], game-playing [8,9] and medical diagnosis [10,11]. In the applications that require real-time operation, e.g. speech [5] and human action [7] recognition, and physical activity and patient monitoring [12], there is a need for always-on sensing. However, one of the challenges of the modern machine learning algorithms is their energy dissipation [13]. Most of the machine learning hardware development is done using either standard cell digital design methods [14,15] or mixed-signal methods [16] employing analogue processing techniques in CMOS technologies.

    The advancement and scaling of CMOS technologies have always been driven by improving the performance of digital systems. With each new technology node, the threshold voltages of the available MOS transistors and the supply voltage of the process node are scaled as well. Scaling of the supply voltage reduces the headroom that is available to the transistors for operating in the saturation region. Without transistors operating in the saturation region, it is very hard to realize signal processing and amplification functions in the analogue domain. One solution to this problem is using time-mode signal processing (TMSP) techniques [17–19]. Time-mode (TM) circuits represent an analogue signal by the time difference between two binary switching events. Furthermore, when compared to standard digital design practices, TM operation is inherently lower power. For example, to transfer N bits of data in a standard CMOS digital circuit, the number of switchings required on the data lines may range from 0 to N if the data are transmitted in parallel, whereas in a TM circuit the transfer always takes two switchings if the rising and falling edges of a pulse are used for information transmission. There are other advantages of TM operation, especially for machine learning hardware implementations: (i) TM operation allows the designer to reduce the supply voltage and still realize analogue-like functions, as will be shown in this paper, and (ii) using single wires for data transmission instead of data buses allows a hardware designer to realize densely connected ANNs on chip more easily. Based on these observations, it is arguable that more low-power signal processing and machine learning systems will be implemented using TMSP techniques in the future.

    The research work presented here focuses on developing a TM digit-classification single-layer neural network for ultra-low-energy operation. The proposed system is shown in figure 1. A digit classification ANN was chosen for its simplicity, and its well-studied and understood behaviour. During the design and training of the ANN, image data from a widely available dataset, MNIST [20], were used. n by n image data were converted into analogue values and applied to the TM ANN. The applied image data are processed by the TM ANN by accumulating weighted delay values, and a classification signal for the input image is generated. As will be presented in this paper, TMSP allows such an ANN to work with extremely low energy dissipation and with classification accuracy that is typical for single-layer ANNs.


    Figure 1. Proposed TMSP ANN high-level block diagram. (Online version in colour.)

    Contributions of this paper are as follows. A TM implementation of a handwritten digit classification ANN is presented. Optimization steps for both system-level and hardware-level design are given, followed by the details of the sub-block designs. The designed ANN is verified by both system-level mathematical simulations and transistor-level SPICE simulations. The design is characterized for classification accuracy, energy dissipation and classification speed. The organization of the paper is as follows. Section 2 presents the high-level details and implementation steps of the ANN in software. Section 3 describes the TMSP ANN implementation with sub-block design and performance improvement steps. Transistor-level simulation results are presented in §4 together with performance metrics, and, finally, the conclusion is drawn in §5.

    2. Artificial neural network

    In this study, we implemented a hardware version of a TM, fully connected, single-layer neural network to recognize handwritten digits (figure 2), using the MNIST database of handwritten digits [20]. The MNIST database contains a training set of 60 000 images and a test set of 10 000 images, with all images of size 28 × 28 pixels.


    Figure 2. A fully connected, single-layer neural network receiving inputs from an image of handwritten digit 3.

    As presented in figure 2, a single layer of neurons applying a linear transformation to the input data (i.e. MNIST handwritten digits) was constructed. The size of the input sample was set to 784 (MNIST handwritten digits of 28 × 28 pixels) with an input range [0, 1], and the size of the output sample was set to 10 (possible output digits ranging from 0 to 9).

    An artificial neuron in the implemented ANN receives pixel data from input units (figure 3). Each input $x_1, \ldots, x_n$ is multiplied by its respective weight $w_1, \ldots, w_n$, and the artificial neuron receives and sums all weighted inputs according to

    $$ f(X) = w_1 x_1 + w_2 x_2 + \cdots + w_i x_i + \cdots + w_n x_n = \sum_{i=1}^{n} w_i x_i. \tag{2.1} $$
    Afterwards, an activation function is used to process the weighted sums. In the presented implementation, the Softmin activation function, which takes all the weighted sums as input, is used (figure 2). By processing the weighted sums, the Softmin function rescales them and assigns probabilities to the classified digit outputs. As a result, each output is squashed to a value in the range (0, 1) and the sum of all outputs adds up to 1. The Softmin function is defined as
    $$ \mathrm{Softmin}(x_i) = \frac{\exp(-x_i)}{\sum_j \exp(-x_j)}. \tag{2.2} $$
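
    As a quick numerical illustration of these properties (a minimal sketch using PyTorch, the framework also employed for training later in this section; the input values are hypothetical):

    ```python
    import torch
    import torch.nn.functional as F

    # Hypothetical weighted sums produced by four neurons, as in (2.1).
    weighted_sums = torch.tensor([3.2, 1.1, 4.8, 0.7])

    # Softmin(x_i) = exp(-x_i) / sum_j exp(-x_j), as defined in (2.2).
    probs = F.softmin(weighted_sums, dim=0)

    print(probs)           # every output lies in (0, 1)
    print(probs.sum())     # the outputs add up to 1
    print(probs.argmax())  # index 3: the smallest weighted sum wins
    ```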

    Figure 3. An artificial neuron with n × n inputs. Each input is associated with a weight. The weighted sum of all inputs is transformed by the activation function to produce the output.

    After the weighted sums of the inputs are transformed by the activation function, the final classification results are supplied by the ANN: if the nth value is the highest in the output vector of the Softmin function, the weighted sum output of the nth neuron is the smallest, and the nth neuron has the highest probability of correctly classifying the input as digit n. Unlike most implementations, which use Softmax as the activation function, the present study uses Softmin. The reason is that with Softmax, an artificial neuron with a greater weighted sum (i.e. a greater accumulated delay in the hardware implementation) wins. This does not correspond to the targeted hardware implementation, in which the fastest neuron is favoured, and using a Softmax activation function would therefore decrease the operating speed of the ANN. Hence, we chose the Softmin activation function, with which the fastest neuron wins and the classification speed of the ANN is higher.

    The described high-level ANN was implemented, trained and tested in the PyTorch framework [21]. The training of the ANN was done using batches of 100 images, for 10 epochs. During the off-line training of the ANN, floating-point values were used; however, for the hardware implementation and high-level verification simulations, the weight values were scaled to a range and quantized.

    The adaptive moment estimation method (Adam) [22] was used as the optimization method during the training of the ANN. Adam combines the advantages of both the Adaptive Gradient Algorithm (AdaGrad [23]), which works well with sparse gradients, and Root Mean Square Propagation (RMSProp [24]), which works well in online and non-stationary settings. Kingma & Ba [22] suggested that, instead of generating parameter updates using a momentum term (as RMSProp with momentum does), the updates of Adam may be directly estimated using averages of the first and second moments of the gradient. As a result, Adam performed equally well as or better than RMSProp regardless of the hyperparameter setting. In L2-regularized multi-class logistic regression, Adam converged faster than AdaGrad. On a dataset with sparse features, Adam converged as fast as AdaGrad while dealing with the sparse features efficiently. In an experiment with convolutional neural networks, Adam converged considerably faster than AdaGrad. For a more detailed discussion, see [22,25]. During training, the learning rate (α) was set to 0.01 while all other parameters were kept at the default settings recommended in [22] (β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁸), and we aimed to minimize the cross-entropy loss, which is given by

    $$ \mathrm{loss}(x, \mathit{class}) = -\log\!\left(\frac{\exp(x[\mathit{class}])}{\sum_j \exp(x[j])}\right) = -x[\mathit{class}] + \log\!\left(\sum_j \exp(x[j])\right), \tag{2.3} $$
    where class is the digit to be classified.
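
    The described setup can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' code: the synthetic `train_loader` stands in for the MNIST batches, and the sign flip that turns the standard softmax cross-entropy into a Softmin objective is our reading of the training described above.

    ```python
    import torch
    import torch.nn as nn

    # Single fully connected layer (784 inputs, 10 outputs), trained with
    # Adam (lr = 0.01, betas = (0.9, 0.999), eps = 1e-8) and the
    # cross-entropy loss of (2.3), as described in the text.
    model = nn.Linear(784, 10)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(),
                                 lr=0.01, betas=(0.9, 0.999), eps=1e-8)

    # Stand-in for the MNIST loader: batches of 100 images per batch.
    train_loader = [(torch.rand(100, 784), torch.randint(0, 10, (100,)))
                    for _ in range(5)]

    for epoch in range(10):                    # 10 epochs, as in the text
        for images, labels in train_loader:
            logits = model(images.view(-1, 784))
            # Softmin(x) = Softmax(-x): negating the weighted sums makes
            # the smallest sum (the fastest neuron in hardware) win.
            loss = criterion(-logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    ```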

    To be able to implement the ANN hardware efficiently, we first investigated the effects of image size reduction (downsampling) and of quantization of the neuron weights. Multiple ANNs with varying image sizes (from 1 pixel/side to 28 pixels/side) and varying weight quantization widths (1–8 bits) were created, trained and tested. To ease the hardware implementation and the subsequent implementation steps explained in the next sections, in addition to the standard training settings, we constrained the minimum value of the neuron weights to 0. Therefore, all the weights obtained from training were either 0 or positive. The results of our simulations are presented in figure 4. During our tests, the maximum accuracy achievable with the presented single-layer shallow ANN was 92.95% (for an image size of 26 × 26 and 8-bit quantization), which is in line with the classification accuracy of single-layer ANNs in the literature [26].


    Figure 4. Classification accuracy of the designed ANN for varying image length/width sizes and weight quantization bits. Accuracy contour lines are also drawn for easier reference. Maximum accuracy and chosen implementations are marked.

    After the successful software implementation, training and testing of the ANN, multiple steps were taken to prepare the design for TM hardware implementation. First, the image size and the number of quantization bits were chosen for the required accuracy. In this implementation, we opted for a 9 pixels/side input image (81 input pixels) in order to (i) reduce the energy dissipation without significant loss of accuracy, (ii) have an ANN implementation that is directly comparable to an implementation in the literature [13], and (iii) reduce the transistor-level transient simulation time significantly. With this input image scaling, the maximum digit classification accuracy was reduced from 92.95% to 89.65% (for 8-bit quantization).

    After the input image size was chosen, the weights were scaled to the range [0, 1] and then quantized. From our simulations (figure 4), 4-bit quantized weights were a good compromise between the expected energy dissipation and accuracy. In the implemented system, which will be explained in the next section, most of the energy is dissipated in the switched capacitances. For a fixed total switched capacitance, the number of quantization bits has a negligible effect on the total energy dissipation, and employing a higher number of quantization bits only results in added implementation complexity. However, if smaller capacitance values can be tolerated, i.e. less stringent noise and mismatch requirements in the system, then the smallest number of quantization bits for a given image size, and hence the smallest total capacitance in the implementation, should be chosen, as the energy dissipation scales linearly with the number of quantization bits. Furthermore, the energy dissipation increases with the square of the image side length, making the image size the more important parameter for energy reduction. Therefore, the smallest possible image size that satisfies the accuracy requirements should be chosen to minimize the energy dissipation. For the presented ANN implementation, we assumed 9 × 9 input images and chose 4-bit weight quantization to represent an average case for the number of quantization bits. The classification accuracy loss due to quantization was minimal, i.e. from 89.65% to 89.35%.
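
    The scaling and quantization step can be sketched as follows (a minimal illustration: the weight tensor is a random stand-in for the trained weights, and the uniform rounding scheme is an assumption):

    ```python
    import torch

    def quantize_weights(W: torch.Tensor, bits: int = 4) -> torch.Tensor:
        """Scale non-negative weights to [0, 1] and quantize them to
        2**bits - 1 uniform levels."""
        W_scaled = W / W.max()      # weights are >= 0 by the training constraint
        levels = 2 ** bits - 1      # 15 levels for 4-bit weights
        return torch.round(W_scaled * levels) / levels

    W = torch.rand(10, 81)          # stand-in for the trained 9 x 9 weights
    Wq = quantize_weights(W)
    # Many small weights collapse to zero; zero weights need no MAC element.
    print((Wq == 0).sum().item(), "weights became zero")
    ```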

    Results of quantization on the weights of the neuron used to classify handwritten digit 9 are shown in figure 5. The leftmost figure shows the intensity of the weights, with darker pixels representing smaller values. The middle figure shows the floating-point scaled weights and the rightmost figure shows the weights after quantization. When the last two figures are compared, it can be observed that, due to quantization, many weights were reduced to zero. These zero weights have no effect on the weighted sum given in (2.1) and can therefore be removed to both simplify the hardware implementation and reduce energy dissipation. Non-zero weight counts for all the neurons are given in table 1.


    Figure 5. Weights of a trained neuron, visual representation and values before and after quantization.

    Table 1. Number of non-zero weights after quantization for all the neurons in the designed ANN.

    neuron no.    non-zero weights
    0             63
    1             51
    2             47
    3             39
    4             61
    5             50
    6             54
    7             45
    8             64
    9             53

    Owing to the nature of the designed TMSP circuits, which will be explained in the next section, each circuit that realizes the multiply-accumulate (MAC) operation has an inherent non-zero fixed delay. Therefore, so as not to penalize the neurons that have more non-zero weights, and for the correct operation of the designed system, we designed the circuit implementation of the ANN such that each neuron has an equal number of MAC elements, equal to the maximum number of non-zero weights given in table 1, i.e. 64 for neuron 8. Such an implementation allowed us to reduce the number of MAC units for the 9 × 9 pixel design from 810 to 640, effectively reducing the expected average energy dissipation by 21% through high-level design choices.
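
    The savings from equalizing the chains can be verified directly from table 1 (a small worked example):

    ```python
    # Non-zero weight counts per neuron, from table 1.
    nonzero = [63, 51, 47, 39, 61, 50, 54, 45, 64, 53]

    full_design = 10 * 81           # one MAC per pixel per neuron: 810
    equalized = 10 * max(nonzero)   # 64 MAC elements per neuron: 640

    saving = 1 - equalized / full_design
    print(f"{full_design} -> {equalized} MACs ({saving:.0%} fewer)")  # ~21% fewer
    ```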

    3. A time-mode MNIST digit classifier ANN implementation

    Following the mathematical modelling, training, verification and quantization of the ANN, we applied TM operation and TMSP methods to the design of a digit classification ANN in a standard 0.18 μm IC process. Each neuron defined by (2.1) is mapped to a TMSP implementation, as shown in figure 6. Implementing (2.1), each multiplying analogue-to-time converter (mATC) in a chain converts a voltage input value into a pulse whose width is proportional to both the input signal value and the assigned weight. The signal propagates through the chain of mATCs and fixed-width pulse generators (FWPGs). The FWPGs, represented by the pulse blocks in figure 6, are required to trigger the next mATC in the chain with the falling edge of the previous mATC pulse. The structures and operation principles of both the mATC and the negative-edge triggered FWPG are explained in the following paragraphs. In this specific implementation, we created a chain of 64 mATCs for each neuron. Owing to the zero weights resulting from quantization, not all the pixels are connected to each neuron, further simplifying the hardware implementation and future on-chip routing.


    Figure 6. A classifier neuron using TMSP. (Online version in colour.)

    The operation of the TM ANN neuron is as follows: once the neuron has been triggered with the Begin Classification signal, the chain of mATCs and FWPGs operates sequentially to accumulate the delay contribution of each mATC, each of which represents a weighted input pixel. As explained in the previous section, the ANN has been trained with a Softmin activation function, meaning that the neuron with the smallest weighted sum output value, i.e. in TMSP terms, the fastest response (earliest falling edge at the output of the last (Nth) mATC), gets to classify the input image first. Therefore, we placed negative-edge triggered flip-flops at the outputs of the neurons to capture the final falling edge of the signal generated by the chain of mATCs. This ‘faster response wins’ approach directly mimics the Softmin function explained in the previous section and is also similar to the behaviour of some repeatedly trained biological neural networks.
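
    A behavioural sketch of this ‘faster response wins’ operation is given below; the linear delay model and its constants are hypothetical stand-ins for the mATC pulse widths and fixed per-stage delays, not circuit-level values:

    ```python
    import random

    def neuron_response_time(weights, pixels, fixed_delay=50e-9):
        """Total delay of one neuron: each mATC contributes a pulse width
        proportional to weight * pixel, plus a fixed per-stage delay."""
        k = 1e-6  # hypothetical seconds per unit of weight * pixel
        return sum(k * w * x + fixed_delay for w, x in zip(weights, pixels))

    # Ten neurons with random stand-in weights race on one input image.
    pixels = [random.random() for _ in range(81)]
    weights = [[random.random() for _ in range(81)] for _ in range(10)]
    times = [neuron_response_time(w, pixels) for w in weights]

    # The earliest final falling edge (smallest accumulated delay) wins.
    print("classified as digit", min(range(10), key=times.__getitem__))
    ```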

    During the design of the TM ANN, we employed a modified version of the basic monostable multivibrator (MSMV) [27] to work as the mATC in the system, as shown in figure 7. In this implementation, a pMOS transistor (M1) acts as a variable resistor whose resistance is modulated by the input voltage signal. When the MSMV is triggered by an input pulse, nodes n1 and n2 are pulled to logic-low and M1 starts charging node n2. The gate of M1 is driven by the input signal that is to be converted into time, and sampling is realized by modulating the instantaneous resistance of M1. Thus, the RC time constant of the multivibrator is modulated as well, resulting in a pulse whose width is proportional to the amplitude of the input signal. The pulse width generated by the ATC is given in [28] by

    $$ T = C\,(R + R_{\mathrm{on}})\,\ln\!\left[\frac{R}{R + R_{\mathrm{on}}}\,\frac{V_{DD}}{V_{DD} - V_{\mathrm{th}}}\right], \tag{3.1} $$
    where R is the average resistance of the pMOS transistor during pulse generation, R_on the on-resistance of the NOR gate, and V_th the switching threshold of the inverter. Assuming R_on ≪ R and V_th = V_DD/2, (3.1) simplifies to T = 0.69RC. Furthermore, this mATC implementation has an inherent timeout feature and will always generate a pulse event at node n1 regardless of the input signal value at V_in, avoiding stalling of the chain. Transistor M1 was made larger than the minimum size required for correct operation to mitigate process variation effects. The ATC given in [28] was modified with the inclusion of extra switchable capacitors C0–C3 to allow the ATC to realize the time-multiplication operation. The capacitors C0–C3 increase in a binary-weighted fashion, C0 being the unit least-significant bit (LSB) capacitance and C3 = 8·C0 the most-significant bit (MSB) capacitance. The unit capacitor was sized such that, for the smallest multiplication coefficient, i.e. 0001, the mATC still generates a pulse response that is proportional to the input signal value. The switches were implemented as transmission gates using minimum-size (0.22 μm/0.18 μm) MOS transistors.
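
    Equation (3.1) and its simplification can be checked numerically (a sketch; the component values below are hypothetical, chosen only to illustrate the 0.69RC approximation):

    ```python
    import math

    def pulse_width(C, R, Ron, Vdd, Vth):
        """Pulse width of the MSMV-based ATC, equation (3.1)."""
        return C * (R + Ron) * math.log((R / (R + Ron)) * (Vdd / (Vdd - Vth)))

    # Hypothetical values: 20 fF charged capacitance, 1 GOhm average pMOS
    # resistance, small NOR on-resistance, 0.6 V supply, Vth = VDD / 2.
    C, R, Ron, Vdd = 20e-15, 1e9, 1e3, 0.6
    print(pulse_width(C, R, Ron, Vdd, Vdd / 2))  # ~13.86 us
    print(0.69 * R * C)                          # ~13.80 us: T ~= 0.69*R*C
    ```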

    Figure 7. Monostable multivibrator based multiplying ATC with time linearizing capacitor Cx. (Online version in colour.)

    In the first iteration of the design, the minimum unit capacitor that satisfies this requirement was found to be 20 fF. In this iteration, we used only the switchable capacitors as the charged capacitance to reduce the total switched capacitance, and hence the total energy dissipation. However, during our transistor-level simulations, we saw that, due to the parasitic capacitances at node n2 and the non-idealities of the switches, the pulse-width ratio between successive weights degraded, especially for the smaller values, i.e. for 0001 and 0010. Therefore, we placed a fixed time-linearizing 10 fF capacitor Cx in parallel with the switched capacitors. The addition of Cx also allowed us to reduce the value of the unit switched capacitor from 20 fF to 10 fF, as for the smallest weight setting the charged capacitance at n2 is still 20 fF.

    Transistor-level simulations using the HSPICE simulator were run to characterize the mATC. Simulations were run at a supply voltage (VDD) of 0.6 V while sweeping the input signal voltage from 300 to 400 mV, representing the expected input signal values from an imager. In all the transistor-level simulations, black and white pixels are represented by 300 mV and 400 mV input voltages to the mATCs, respectively.

    The advantages of the placement of Cx in the second iteration of the mATC design are shown in figure 8. The ranges of pulses generated by both versions of the mATC for different weights, as well as their means, are presented in the figure. Multiple points should be noted from the transistor-level simulation results: (i) the slope of the time response of the mATC has been reduced, making the mean pulse-width values fit a binary progression more closely, hence the name time-linearizing (TL); (ii) owing to this better fit, the error between the multiplication steps has been reduced (the root mean squared error (RMSE) is reduced from 6.61% to 1.59%; see table 2 for more details); and (iii) owing to the reduced total capacitance, the system response is faster (the average pulse width is reduced from 81.16 to 43.72 μs) and the average energy dissipation is lower (reduced from 254 to 157 fJ).


    Figure 8. Pulse-width ranges generated by the mATC for different configuration weights.

    Table 2. mATC expected multiplication weight ratios and % errors for different designs.

    weight ratio    expected ratio    % error, base mATC    % error, TL mATC
    w2/w1           2.00              91.09                 20.91
    w3/w2           1.50              15.03                 7.17
    w4/w3           1.33              4.27                  0.22
    w5/w4           1.25              2.87                  1.62
    w6/w5           1.20              1.96                  0.41
    w7/w6           1.17              1.50                  0.41
    w8/w7           1.14              -0.24                 -0.87
    w9/w8           1.12              1.11                  0.26
    w10/w9          1.11              1.36                  0.45
    w11/w10         1.10              1.08                  0.27
    w12/w11         1.09              0.99                  0.45
    w13/w12         1.08              1.07                  0.37
    w14/w13         1.08              1.34                  0.74
    w15/w14         1.07              1.08                  0.48

    A negative-edge triggered FWPG, shown in figure 9, is used between the mATC blocks, as the triggering of the next mATC in the chain must occur at the falling edge of the pulse generated by the previous mATC. By triggering the next mATC with the falling edge of the previous mATC output, the time-addition operation is realized. In this implementation, we used an FWPG that generates pulses with a width of 50 ns. The minimum value of this pulse width can be any value that satisfies the following requirements during triggering: (i) both nodes n1 and n2 are completely driven to ground during the pulse, and (ii) the other input of the NOR gate is completely driven to VDD, with sufficient timing margin to account for process mismatch, before the output of the FWPG goes low. The maximum value of the FWPG pulse width is limited by the minimum pulse width generated by the mATC, i.e. 1.94 μs for a 0001 input. In our simulations, the same pulse width was also used for the Begin Classification signal.


    Figure 9. Negative-edge triggered fixed-width pulse generator. (Online version in colour.)

    4. Simulation results

    After characterization and verification of the sub-blocks of the hardware ANN, extensive transistor-level SPICE simulations using the HSPICE simulator were run to verify the correct time-mode operation of the designed system. As in the characterization of the sub-blocks, a 0.6 V supply was used. Separate testbenches were created programmatically to simulate 100 samples from the test dataset, and transient simulations were run. The results of one such simulation run, for the classification of a handwritten digit 2, are shown in figure 10. As can be seen, the correct classifier neuron generates a faster output response than the other neurons, successfully classifying the input digit. For this specific case, the fastest neuron, i.e. neuron 2, responded 62.1 μs faster than the next fastest neuron.


    Figure 10. Transistor-level transient simulation of an example correct classification of a handwritten digit 2 by neuron 2. The figure shows the output signals of the neurons with timing information. (Online version in colour.)

    The average energy dissipation per classification while operating at 0.6 V VDD is 65.74 pJ. The average classification response time for the test dataset is 421.8 μs, resulting in 2.37 k classifications per second at a classification accuracy of 88%. As the focus of the present study is an energy-efficient, low-dissipation design rather than state-of-the-art classification accuracy, the accuracy of 88%, which is typical for single-layer neural networks [26], is acceptable at the current stage. Moreover, the classification accuracy is still significantly higher than a random guess (10%) on the MNIST dataset.

    We also investigated the effects of process mismatch on the performance of, first, a chain of mATCs and, later, the ANN. We first simulated chains of mATCs with varying numbers of elements for the effects of local mismatch. For these simulations, to represent an average case of operation, all the analogue input voltages to the mATCs and the multiplication coefficients were set to 350 mV and 1000, respectively. One hundred-point Monte Carlo simulations were run, and the results are presented in figure 11. The figure shows the curve fits of the normalized probability density functions over a varying number of elements in an mATC chain and the decrease in the coefficient of variation (CV = σ/μ) with an increasing number of elements in the chain. The CV improves by √2 for every doubling of the number of mATC elements in the chain.
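
    The 1/√N averaging behind this trend can be reproduced with a simple Monte Carlo sketch (the Gaussian per-element mismatch model and its parameters are assumptions, not extracted process data):

    ```python
    import random
    import statistics

    def chain_cv(n_elements, n_trials=10_000, mu=1.0, sigma=0.05):
        """CV of the total delay of a chain of n mATCs whose individual
        delays vary independently (mismatch modelled as Gaussian)."""
        totals = [sum(random.gauss(mu, sigma) for _ in range(n_elements))
                  for _ in range(n_trials)]
        return statistics.stdev(totals) / statistics.mean(totals)

    for n in (1, 2, 4, 8, 16, 32, 64):
        print(n, round(chain_cv(n), 5))  # CV shrinks by ~sqrt(2) per doubling
    ```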


    Figure 11. One hundred-point Monte Carlo simulations showing the improvement in the coefficient of variation with increasing number of mATCs inside the neuron. (Online version in colour.)

    As is apparent from the mATC chain mismatch simulations, increasing the number of elements in the chain reduces the relative variability of the ANN. Even though the implemented ANN is trained off-line and there is no straightforward way to address variability during training, reliability issues due to process mismatch may be addressed in two ways: (i) by algorithmically testing each neuron for the variability of its elements, applying multiple analogue input and digital control combinations and extracting the linear transfer curve, and (ii) by increasing the number of mATCs in the chain to average out and reduce the effects of variation, as shown in figure 11.

    To test the performance of the ANN under process mismatch, for each of the 100 image samples that we used to simulate and characterize the system, we ran 100-point Monte Carlo mismatch simulations (100 × 100 transient simulations in total); the average standard deviation in neuron response time due to process mismatch was 9.2 μs. The results of one such run, for the classification of a handwritten digit 3, are presented in figure 12 as an example. The figure shows the response time variation distribution of each neuron in the designed ANN due to process mismatch.


    Figure 12. One hundred-point Monte Carlo simulation (classification of a handwritten digit 3) response time variation of the neurons of the designed ANN due to process mismatch. (Online version in colour.)

    For a misclassification to occur due to mismatch, the second fastest neuron must respond faster than the fastest neuron of the nominal conditions. For the case shown in figure 12, this is possible between neurons 3 and 7. This probability can be modelled as half of the area of overlap of two Gaussian distributions with the same standard deviation and differing means (figure 13), where the intersection point of the distributions depends on the distance between the means. For mean differences of less than 1.2σ, we observed that the errors were those already caused by the training (the 88% accuracy). For mean differences greater than 1.2σ, we calculated the added misclassification probability for each neuron and found the total added probability of error due to process mismatch to be 1.17%, reducing the expected minimum accuracy to 86.63%.
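
    Following this model, the added misclassification probability for one pair of competing neurons can be computed as below (a sketch; for equal standard deviations the two curves intersect midway between the means, and the example numbers reuse values quoted earlier in this section):

    ```python
    import math

    def added_error_probability(mu_fast, mu_next, sigma):
        """Half of the overlap area of two equal-sigma Gaussians: each
        distribution contributes one tail beyond the midpoint of the means,
        so half the overlap equals a single tail, Phi(-d / (2 * sigma))."""
        d = abs(mu_next - mu_fast)
        return 0.5 * (1 + math.erf(-d / (2 * sigma) / math.sqrt(2)))

    # Example: neuron means 62.1 us apart, sigma = 9.2 us (values above).
    print(added_error_probability(0.0, 62.1e-6, 9.2e-6))
    ```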


    Figure 13. Response variation of two competing neurons due to process mismatch. Half of the value of the green shaded area represents the probability of misclassification.

    When compared with state-of-the-art hardware ANN implementations, the design presented in this work compares favourably in terms of energy dissipation, which is the main aim of this design exercise. A comparison with a recent and directly comparable hardware 9 × 9 pixel MNIST classification ANN [13] is given in table 3. Even though the presented ANN is designed in an older technology, i.e. a 0.18 μm process, it compares favourably in terms of energy dissipation. One metric where the design in [13] performs better than the presented implementation is classification speed. However, due to the design constraints in [13], its operating voltage cannot be lowered further; hence its energy dissipation, which is proportional to the square of the supply voltage in digital circuits, cannot be reduced further. Furthermore, it is expected that our implementation will achieve much better average operating speed and energy dissipation numbers when the design is migrated to more advanced technologies. The energy dissipation per classification is reduced by a factor of 9.58×, from 630 pJ down to 65.74 pJ, when compared to [13]. It should also be noted that the ANN implementation presented in this paper works with analogue signal inputs, without requiring the input data to be converted to digital for further processing. If the analogue-to-digital conversion energy cost per image were added to the classification energy numbers reported for [13] in table 3, the presented ANN implementation would be even more energy efficient.

    Table 3. Comparison of the implemented TM ANN with an implementation in the literature.

                                                      SRAM classifier [13]    this work
    technology (nm)                                   130                     180
    supply voltage (V)                                1.2                     0.6
    classification accuracy (%)                       90                      88
    classification speed (Hz)                         50 × 10⁶                2370
    analogue-to-digital conversion energy included    no                      yes
    energy dissipation (pJ)                           630                     66

    Extending the single-layer ANN presented in this study to a multi-layer version is on-going work. However, from our preliminary results, it has been observed that, once a value is converted to a TM signal, processing should continue in TM, without conversion between the TM and analogue/digital domains, in order to operate in the most energy-efficient way. For example, an asynchronous time-to-digital converter (TDC) in a 0.18 μm process dissipates 1.48 pJ per conversion [19], and a similar TDC in a 65 nm process dissipates 0.97 pJ [29]. When compared with the average energy dissipation of each neuron (6.6 pJ), it can be seen that conversion between operating domains incurs an energy overhead comparable to the energy dissipation of the data-processing circuitry itself.
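
    This overhead can be put into perspective with a quick calculation using the figures quoted above (assuming, for illustration, that the classification energy splits evenly over the ten neurons):

    ```python
    # Energy figures from the text (all in pJ).
    energy_per_classification = 65.74
    energy_per_neuron = energy_per_classification / 10   # ten neurons assumed equal
    tdc_180nm, tdc_65nm = 1.48, 0.97                     # per conversion [19,29]

    print(round(energy_per_neuron, 2))                 # ~6.57 pJ per neuron
    print(round(tdc_180nm / energy_per_neuron, 2))     # ~0.23: ~23% overhead
    print(round(630 / energy_per_classification, 2))   # ~9.58x vs the SRAM classifier [13]
    ```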

    5. Conclusion

    This paper presents the hardware design and the simulation results of a TM, single-layer ANN with Softmin activation function for handwritten digit classification. TMSP techniques have been applied for accumulating weighted image signal values using energy-efficient time-mode circuitry. Optimization steps for both system level and hardware level design are given. The system was designed and simulated in a standard 0.18 μm process and operates from a supply voltage of 0.6 V. By applying the presented design guidelines, an energy-optimal 9 × 9 handwritten digit image classification ANN with 4-bit quantized weights was designed. The energy dissipation of the design for each classification is 65.74 pJ while operating at a speed of 2.37 k classifications per second, with a classification accuracy of 88%.

    Data accessibility

    This article has no additional data.

    Authors' contributions

    O.C.A. conceived, designed, simulated and verified the transistor-level, time-mode ANN implementation. O.C.A. and J.M. created and optimized Python level ANN for time-mode implementation. Both authors drafted, read and approved the manuscript.

    Competing interests

    We declare we have no competing interests.

    Funding

    This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement no. 752819 for the MSCA IF Project ATiNaRI.

    Acknowledgements

    The authors thank the anonymous reviewers for their constructive comments and helpful suggestions.

    Footnotes

    Present address: Department of Anatomy, Université du Québec à Trois-Rivières (UQTR), Trois-Rivières, Canada.

    One contribution of 13 to a theme issue ‘Harmonizing energy-autonomous computing and intelligence’.

    Published by the Royal Society. All rights reserved.

    References

    • 1.
      Sutskever I, Vinyals O, Le QV. 2014 Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. San Diego, CA: NIPS.
    • 2.
      Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y. 2014 Learning phrase representations using RNN encoder-decoder for statistical machine translation. (http://arxiv.org/1406.1078)
    • 3.
      Cireşan D, Meier U, Schmidhuber J. 2012 Multi-column deep neural networks for image classification. (http://arxiv.org/1202.2745)
    • 4.
      Ba J, Mnih V, Kavukcuoglu K. 2014 Multiple object recognition with visual attention. (http://arxiv.org/1412.7755)
    • 5.
      Deng L, Hinton G, Kingsbury B. 2013 New types of deep neural network learning for speech recognition and related applications: an overview. In 2013 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, 26–31 May, pp. 8599–8603. Piscataway, NJ: IEEE.
    • 6.
      Ji S, Xu W, Yang M, Yu K. 2013 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35, 221–231. (doi:10.1109/TPAMI.2012.59)
    • 7.
      Liang M, Hu X. 2015 Recurrent convolutional neural network for object recognition. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Boston, MA, 7–12 June, pp. 3367–3375.
    • 8.
      Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M. 2013 Playing Atari with deep reinforcement learning. (http://arxiv.org/1312.5602)
    • 9.
      Silver D et al. 2016 Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489. (doi:10.1038/nature16961)
    • 10.
      Khan J et al. 2001 Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. 7, 673–679. (doi:10.1038/89044)
    • 11.
      Al-Shayea QK. 2011 Artificial neural networks in medical diagnosis. Int. J. Comput. Sci. Issues 8, 150–154.
    • 12.
      Kodali S, Hansen P, Mulholland N, Whatmough P, Brooks D, Wei G-Y. 2017 Applications of deep neural networks for ultra low power IoT. In 2017 IEEE Int. Conf. on Computer Design (ICCD), pp. 589–592. Piscataway, NJ: IEEE.
    • 13.
      Zhang J, Wang Z, Verma N. 2017 In-memory computation of a machine-learning classifier in a standard 6T SRAM array. IEEE J. Solid-State Circuits 52, 915–924. (doi:10.1109/JSSC.2016.2642198)
    • 14.
      Chen Y-H, Krishna T, Emer JS, Sze V. 2017 Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid-State Circuits 52, 127–138. (doi:10.1109/JSSC.2016.2616357)
    • 15.
      Lee J, Kim C, Kang S, Shin D, Kim S, Yoo H-J. 2018 UNPU: an energy-efficient deep neural network accelerator with fully variable weight bit precision. IEEE J. Solid-State Circuits 54, 173–185.
    • 16.
      Bankman D, Yang L, Moons B, Verhelst M, Murmann B. 2018 An always-on 3.8 μJ/86% CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28 nm CMOS. In 2018 IEEE Int. Solid-State Circuits Conf. (ISSCC), Boston, MA, 11–15 February, pp. 222–224. Piscataway, NJ: IEEE.
    • 17.
      Yuan F. 2014 CMOS time-to-digital converters for mixed-mode signal processing. J. Eng. 2014, 140–154. (doi:10.1049/joe.2014.0044)
    • 18.
      Chen Z, Gu J. 2016 Analysis and design of energy efficient time domain signal processing. In Proc. of the 2016 Int. Symp. on Low Power Electronics and Design, San Francisco, CA, 8–10 August, pp. 100–105. New York, NY: ACM.
    • 19.
      Akgun OC, Mangia M, Pareschi F, Rovatti R, Setti G, Serdijn WA. 2019 An energy-efficient multi-sensor compressed sensing system employing time-mode signal processing techniques. In Proc. of IEEE Int. Symp. on Circuits and Systems (ISCAS), Sapporo, Japan, 26–29 May, pp. 1–5. Piscataway, NJ: IEEE.
    • 20.
      LeCun Y. 1998 The MNIST database of handwritten digits. See http://yann.lecun.com/exdb/mnist/.
    • 21.
      Paszke A et al. 2017 Automatic differentiation in PyTorch. In NIPS-W. San Diego, CA: NIPS.
    • 22.
      Kingma DP, Ba J. 2014 Adam: a method for stochastic optimization. (http://arxiv.org/1412.6980)
    • 23.
      Duchi J, Hazan E, Singer Y. 2011 Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159.
    • 24.
      Tieleman T, Hinton G. 2012 Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw. Mach. Learn. 4, 26–31.
    • 25.
      Ruder S. 2016 An overview of gradient descent optimization algorithms.
    • 26.
      LeCun Y, Bottou L, Bengio Y, Haffner P. 1998 Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324. (doi:10.1109/5.726791)
    • 27.
      Akgun OC, Gurkaynak FK, Leblebici Y. 2009 A current sensing completion detection method for asynchronous pipelines operating in the sub-threshold regime. Int. J. Circuit Theory Appl. 37, 203–220. (doi:10.1002/cta.540)
    • 28.
      Sedra AS, Smith KC. 1998 Microelectronic circuits, 4th edn. Oxford, UK: Oxford University Press.
    • 29.
      Akgun OC. 2018 An asynchronous pipelined time-to-digital converter using time-domain subtraction. In Proc. of IEEE Int. Symp. on Circuits and Systems (ISCAS), Florence, Italy, 27–30 May, pp. 1–5. Piscataway, NJ: IEEE.