Rethinking arithmetic for deep neural networks

We consider efficiency in the implementation of deep neural networks. Hardware accelerators are gaining interest as machine learning becomes one of the drivers of high-performance computing. In these accelerators, the directed graph describing a neural network can be implemented as a directed graph describing a Boolean circuit. We make this observation precise, leading naturally to an understanding of practical neural networks as discrete functions, and show that the so-called binarized neural networks are functionally complete. In general, our results suggest that it is valuable to consider Boolean circuits as neural networks, leading to the question of which circuit topologies are promising. We argue that continuity is central to generalization in learning, explore the interaction between data coding, network topology, and node functionality for continuity and pose some open questions for future research. As a first step to bridging the gap between continuous and Boolean views of neural network accelerators, we present some recent results from our work on LUTNet, a novel Field-Programmable Gate Array inference approach. Finally, we conclude with additional possible fruitful avenues for research bridging the continuous and discrete views of neural networks. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.


Introduction
This paper considers the development of deep neural networks in the supervised learning setting [1]. Inspired by the recent rise of interest in specialized hardware accelerators for deep neural networks [2], we shall take a fresh look at the question of suitable network topologies and basic node functionalities for such accelerators. We shall begin by defining the supervised learning problem. Let $X$ denote the set of possible inputs to a machine learning inference function and $Y$ denote the set of possible outputs. Imagine that we have an oracle function $r : X \to Y$, mapping every possible input to the corresponding ideal output $y = r(x)$. Generally, we will be interested in inference via a family of parametrically defined functions $f(p; x)$, with parameters drawn from some set $P$. We will often write $f_p(x)$ when we wish to consider the case where the parameter value $p$ has been fixed. These functions will not, in general, produce the ideal output for all possible inputs, and therefore we need to consider some notion of inaccuracy, or 'loss', $\ell$, which measures the difference between the ideal output and the actually computed output as $\ell(f_p(x), r(x))$. For simplicity, we assume in this article that $\ell$ is a metric [3] defined on $Y$. We are generally interested in average-case behaviour of these parametric functions 'in the wild', on any data that may frequently appear as input in real usage. Lifting the metric $\ell$ on $Y$ to the following metric defined on functions $X \to Y$:

$$m(f, g) = \mathbb{E}\left[\ell(f(x), g(x))\right], \tag{1.1}$$

where the expectation is over the input space, we can then pose the question of supervised training as the following optimization problem of selecting parameters to minimize the distance to an oracle function:

$$\operatorname*{argmin}_{p \in P} \; m(f_p, r). \tag{1.2}$$

There are some practical problems, however. Firstly, it is unlikely that we have access to or knowledge of the distribution of $X$ or to an oracle function $r$, except through a finite set of samples, known as the training set.
Secondly, as Scheinberg notes [4], the loss function desired in practice (e.g. an indicator function) may give rise to a computationally intractable optimization problem. As a result, it is common to aim instead to solve the training problem,

$$p^* = \operatorname*{argmin}_{p \in P} \sum_i \ell'(f_p(x_i), y_i), \tag{1.3}$$

where $(x_i, y_i)$ are the training data (inputs for which the ideal output is known) and $\ell'$ is some suitable, often convex, loss function. The actual accuracy of the resulting function $f_{p^*}$ can then be evaluated on some other set of data $(x_i, y_i)$, the test data, as a proxy for $m(f_{p^*}, r)$, to obtain the test error:

$$\sum_i \ell(f_{p^*}(x_i), y_i). \tag{1.4}$$

It turns out that this setting, therefore, imposes particular restrictions on the family of parametrized functions $f$, because we wish $p^*$, which was selected based only on the training data, to also work well for the test data, as well as ensuring several other properties to be discussed. This fundamental problem, the design of families of parametrized functions for this purpose, is the key subject of study of this paper. In particular, we address here the case where the functions $f_p$ map from one finite set to another, which is always the practical setting in a finite-precision computer. By considering the discrete problem explicitly, several new insights are developed, which may be of value to those researching highly efficient machine inference.
The structure of this paper is as follows: Section 2 introduces a model of computation defined by typed graphs. We use this model to develop a deeper understanding of the computation of inference functions in deep neural networks, discussing suitable choices for such functions. The model also lets us reason about their approximation by discrete functions, and hence the potential for hardware implementations of such computations. In §3, we present an abstract view of the typical digital design process for hardware accelerators of numerical functions. We then show that a known family of extremely quantized neural networks is functionally complete. This result runs counter to standard thinking in hardware-accelerated neural networks, and we shall consider the reason for this apparent contradiction. Section 4 revisits the question of appropriate inference functions from §2, but now in the discrete setting. We argue for a trinity of topology, node functionality and metrics as interacting to determine the efficient inference computation and pose some open questions regarding the extent to which these factors can be decoupled. In §5, we consider an approach for efficient field-programmable gate array (FPGA) inference, known as LUTNet, recently published by my research group as an example of the initial work bridging the continuous and discrete setting. Finally, §6 draws conclusions and points to several fruitful avenues for further research. This paper makes use of a variety of notations, summarized at the end of the paper.

Networks and inference functions

(a) A graphical approach
A graphical approach is universally used to describe-formally or informally-the computations performed by deep neural networks. In this section, we shall develop a slightly unorthodox but very general formalism, which will be of use throughout the paper. Our aim here is to distinguish the syntactic description of neural networks as graphs from the semantic interpretation as functions. This distinction will be important because the transformations applied to develop a realization of a neural network as a programme or a piece of digital hardware are primarily based on the syntactic representation.

Definition 2.1. An edge $e$ is simply a unique label, together with a set such as $\mathbb{R}$ or $\mathbb{F}$, which can be interpreted as the type of data carried by the edge in a network.

Example 2.2. In figure 1, $x$, $y$, $w_1$, $w_2$, $c$ and $d$ are all edges, which we will take to be of real type.

Definition 2.3. A vertex is a tuple (PARAM, IN, OUT, FUNC), where PARAM, IN and OUT are lists of edges and FUNC is a function from the types of the PARAM and IN edges to the types of the OUT edges.

Example 2.4. In figure 1, there is a vertex $((w_1, w_2), (x, y), c, (w_1, w_2; x, y) \mapsto w_1 x + w_2 y)$.

The purpose of distinguishing PARAM from IN is to identify those values (parameters) that are intended to be determined once, offline, versus those values (activations) that are intended to be changed each time the graph is used in inference. We use a semicolon to separate parameters from activations, for readability purposes. In this example, $w_1$ and $w_2$ are parameters, commonly called weights, while $x$ and $y$ are not. There is one other vertex $((), c, d, \mathrm{RELU})$ shown in the figure; this vertex has an empty parameter list.

Definition 2.5. A network is a set of vertices together with a distinguished set of edges PRIOUT such that:
- No edge appearing in the PARAM list of any vertex also appears in the OUT list of any vertex.
- No edge appears in the OUT list of more than one vertex.
- All edges in PRIOUT appear in the OUT list of exactly one vertex.
We will refer to parameters of the network to mean the list of all edges appearing in the PARAM list of any vertex; inputs to the network to mean the list of all edges appearing in the IN list of some vertex but not in the OUT list of any vertex; and outputs of the network to be the set PRIOUT. We will often refer to the 'leaf functions' of a network, meaning the collection of functions FUNC of all the vertices in the network.
Example 2.6. The network shown in figure 1 consists of the two vertices previously described, together with a set of primary outputs. One possibility for such a set is PRIOUT = {d}, but there are other choices, depending on which edges are required to be observable at the network output. The parameters of this network are w 1 and w 2 , and the inputs to the network are x and y.
royalsocietypublishing.org/journal/rsta Phil. Trans. R. Soc. A 378:

Figure 1. A simple network consisting of two vertices. One vertex has two parameters and two activation inputs and has function $(w_1, w_2; x, y) \mapsto w_1 x + w_2 y$. The other vertex has no parameters and one activation input and has function $c \mapsto \mathrm{RELU}(c)$.
Definition 2.7. We say that a network $N$ implements a function $\llbracket N \rrbracket$ defined through the natural function composition of the individual vertex functions, i.e. $\llbracket N \rrbracket$ is a function from the Cartesian product of the parameters and inputs of the network to the Cartesian product of the outputs of the network, defined inductively, with vertex functions as the leaf functions.

For simplicity, we will consider computations corresponding to acyclic networks, including the very significant class of Convolutional Neural Networks [5]; however, the formalism can easily be extended to cyclic networks (e.g. LSTMs [6]) by lifting computation over the types illustrated above to computations over streams of those types [7]. This generalization does not affect the following material. Equally, it is trivial to make networks hierarchical by generalizing the functions computed to also allow sub-networks, but this will not be required in the sequel.
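To make the syntactic/semantic distinction concrete, the following is a minimal sketch of the network of figure 1 as data (vertices as tuples of PARAM, IN, OUT and FUNC) together with an evaluator that computes the function the network implements. The names `Vertex`, `VERTICES` and `evaluate`, and the firing-order strategy, are illustrative assumptions and not part of the formalism itself.

```python
from typing import Callable, NamedTuple, Tuple

class Vertex(NamedTuple):
    param: Tuple[str, ...]   # PARAM: edges determined once, offline
    inp: Tuple[str, ...]     # IN: activation edges
    out: Tuple[str, ...]     # OUT: edges driven by this vertex
    func: Callable[..., tuple]  # FUNC: (params..., inputs...) -> outputs

def relu(c: float) -> float:
    return max(0.0, c)

# The two vertices of figure 1 (hypothetical encoding of the tuples).
VERTICES = [
    Vertex(("w1", "w2"), ("x", "y"), ("c",),
           lambda w1, w2, x, y: (w1 * x + w2 * y,)),
    Vertex((), ("c",), ("d",), lambda c: (relu(c),)),
]
PRIOUT = ("d",)

def evaluate(vertices, priout, params, inputs):
    """Evaluate an acyclic network by firing vertices whose operands are known."""
    values = {**params, **inputs}
    pending = list(vertices)
    while pending:
        ready = [v for v in pending
                 if all(e in values for e in v.param + v.inp)]
        if not ready:
            raise ValueError("network is cyclic or has undriven edges")
        for v in ready:
            results = v.func(*(values[e] for e in v.param + v.inp))
            values.update(zip(v.out, results))
            pending.remove(v)
    return tuple(values[e] for e in priout)

# w1*x + w2*y = 2*3 - 1*1 = 5, and RELU(5) = 5
print(evaluate(VERTICES, PRIOUT, {"w1": 2.0, "w2": -1.0}, {"x": 3.0, "y": 1.0}))
```

Keeping the graph as plain data is what allows transformations such as quantization or synthesis to be expressed as rewrites of `VERTICES`, independently of the evaluator.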

(b) Functions for inference
What kind of functions $f = \llbracket N \rrbracket$ form good candidates for machine learning? And what basic functionality should be implemented by nodes in a network $N$ for this purpose? In practical terms, for deep learning today, the most common leaf functions are the inner products, RELU, sigmoid and softmax [1]. However, it is worth considering the various factors that determine this choice now and in the future. Informally, functions should:
❶ Generalize well: once the parameter $p$ is selected based on the training data, $f_p(x)$ should also tend to perform well over unseen test data.
❷ Be cheap to compute: the cost (speed, energy) of evaluating the function at inference time should be low.
❸ Be sufficiently general/expressive: the functions should be capable of approximating a wide variety of oracle functions $r$.
❹ Be easy to learn: optimization algorithms used to address the training problem described in §1 should be cheap to execute and should rarely give rise to parameter values that are grossly sub-optimal with respect to the training set.
Strang [8] argues that continuous piecewise linear (CPL) functions have tended to perform well, explaining the importance of the inner product and RELU functions in today's networks, as CPL functions are precisely those that are implemented by networks with these vertices. Strang argues that continuity is the key to generalization, which intuitively makes sense: if an untrained input is very close to a trained one, it seems reasonable to expect the corresponding outputs of the network to be very close in turn. To make this intuition precise requires us to equip the input and output sets with metrics, d and e, respectively, allowing us to define what it means for inputs and outputs to be 'close'. We can then consider the inference function f p as a function from an input metric space (X, d) to an output metric space (Y, e). We have a choice of options to define continuity; we shall use Lipschitz continuity [3], for reasons that will become apparent in the next section.
Example 2.11. For computation over R n with metrics determined by a suitable norm in that space, the RELU function is Lipschitz and the inner products are Lipschitz, and thus by composition, networks constructed from these two functions are Lipschitz [3] and, therefore, good candidates for generalizing beyond training data.
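For reference, the standard facts behind this example can be written out as follows. The symbols $k$, $k_f$ and $k_g$ are notation introduced here for illustration, and the inner product is treated as a function of the activations $x$ for a fixed weight vector $w$, with the Euclidean norm as one suitable choice of norm.

```latex
% f : (X, d) -> (Y, e) is k-Lipschitz when
e\bigl(f(x), f(x')\bigr) \le k \, d(x, x') \quad \text{for all } x, x' \in X.

% RELU is 1-Lipschitz; the inner product is \|w\|_2-Lipschitz (Cauchy--Schwarz):
|\mathrm{RELU}(a) - \mathrm{RELU}(b)| \le |a - b|, \qquad
|w^T x - w^T x'| \le \|w\|_2 \, \|x - x'\|_2.

% Composition multiplies constants: if f is k_f-Lipschitz and g is k_g-Lipschitz,
% then g \circ f is (k_g k_f)-Lipschitz.
```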
Whether the inner products and RELU functions are cheap to compute (property ❷) depends upon our model of computation; in the abstract Blum-Shub-Smale model for real computation, this is certainly the case [9]. It is now well known that a wide variety of neural networks, including those implementing CPL functions are universal approximators, and hence sufficiently general [10,11] (property ❸). This leaves the question of whether such functions are 'easy to learn' (property ❹). This is still an active area of research; however, theoretical insights, such as [12] combined with practical experience, suggest that this is indeed the case.
So, while CPL functions over the reals appear to be very promising, practical computers do not compute over the reals. In practice, finite-precision datatypes are (almost) always used to approximate computation over the reals, and the picture of appropriate inference functions has the potential to change considerably in this setting. We examine this question in the next section.

Discrete inference
We shall refer to a network where the types of all activations are $\mathbb{R}$ as a real network, where the types of all activations are $\mathbb{F} \subset \mathbb{R}$ for finite $\mathbb{F}$ as a finite-precision network, and where the types of all activations are $\mathbb{B}$ as a Boolean network. Boolean networks correspond exactly to combinational digital circuits, and so hold a special place from an implementation perspective. Figure 2 illustrates the standard digital design process for the development of a Boolean network approximating a given real network $G_1$.

The first step is that of quantization. Here, real data types associated with edges in $G_1$ are replaced by finite-precision data types $\mathbb{F}$. Typical examples are single-precision IEEE floating-point arithmetic [14] as well as various fixed-point arithmetics. Consequently, the functions FUNC performed by each node in the network must also be quantized; hence it is common to require the node functions of $G_1$ to be drawn from a basic set of operators for which this function quantization process can be performed automatically or is defined by some standard, as $\{*, -, +, /\}$ are for IEEE floating-point arithmetic. The quantization process induces a change in function, $\iota_1^m \circ \llbracket G_2 \rrbracket \neq \llbracket G_1 \rrbracket \circ \iota_1^n$ in general, and so has been the subject of a considerable amount of work in the DNN literature, with modern machine inference architectures often offering choices of precision that trade performance for accuracy of computation [2], e.g. [15]. The main distinguishing features of this setting compared to classical finite-precision quantization results [16] are due to the metric $m$ introduced in §1: both its inherently stochastic nature and the fact that distance to an oracle $r$, rather than distance to the underlying real function, is the primary concern, i.e. the ideal quantization is one that, by selecting the quantized parameter $\tilde{p}$, minimizes $m(\llbracket G_2 \rrbracket_{\tilde{p}}, r)$ rather than $m(\llbracket G_2 \rrbracket_{\tilde{p}}, \llbracket G_1 \rrbracket_p)$.
In practice, however, it is typical to initially select each element of the quantized parameter independently, effectively relying on the repeated application of the triangle inequality, applied syntactically to the graph, to ensure that $m(\llbracket G_2 \rrbracket_{\tilde{p}}, \llbracket G_1 \rrbracket_p)$ remains small, and further relying on the triangle inequality property of $m$ to ensure the distance to the oracle does not grow considerably. Sometimes, this initial choice is refined through a process known as re-training [2].

Figure 2. Starting from a specification graph $G_1$, the designer constructs a network $G_2$ operating on finite-precision datatypes, typically fixed or floating point, as described in the text. A 'synthesis tool' then automatically creates a Boolean network $G_3$, known as a 'netlist'. The netlist implements the function $\llbracket G_2 \rrbracket$ in the sense that $\varphi_o \circ \llbracket G_3 \rrbracket|_X = \llbracket G_2 \rrbracket \circ \varphi_i$. Here, we distinguish $X$ and $Y$ from $\mathbb{B}^{kn}$ and $\mathbb{B}^{km}$ because the inclusions are often not surjective, giving rise to the well-studied problem of 'Boolean don't-cares' [13]. The lower two sections of this diagram, therefore, commute, while the top section 'approximately commutes'.
The second step of the process is to convert the finite-precision network to a Boolean network for implementation. This process is fully automated in modern digital design tools. First, each vertex in the finite-precision graph is replaced by a Boolean network defined for that particular node's function, for a predefined encoding of the elements of $\mathbb{F}$ into elements of $\mathbb{B}^k$, e.g. the IEEE floating-point storage standard [14], which encodes each single-precision floating-point number as a vector of $k = 32$ Boolean values; this part of the process is known in digital design as 'core generation'. Second, logic synthesis tools [13] are applied to rewrite the graph to reduce its implementation cost as a circuit. The result of this process is a Boolean network $G_3$ which can be directly implemented as a digital logic circuit. The computation implemented by $G_3$ corresponds exactly to that implemented by $G_2$ in the sense that $\varphi_o \circ \llbracket G_3 \rrbracket|_X = \llbracket G_2 \rrbracket \circ \varphi_i$, where $|_X$ denotes the restriction of the function to the domain of $\varphi_i$, i.e. the middle section of the diagram commutes.
It can, therefore, be seen that in a standard digital design process, the only part of the process where an approximation is induced (the upper section of figure 2) is not associated with topological changes to the network, while the only part of the process where topological changes are induced (the middle section of figure 2) is not associated with approximation. This observation will be of importance in the sequel.
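The approximation introduced by the quantization step can be seen in a small sketch. The fixed-point format below (4 bits in total, 2 fractional bits, round-to-nearest with saturation) and the names `quantize`, `g1` and `g2` are assumptions chosen for illustration; the source does not specify a particular format or rounding mode.

```python
def quantize(v: float, frac_bits: int = 2, total_bits: int = 4) -> float:
    """Round to a signed fixed-point grid with saturation (assumed format)."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    code = max(lo, min(hi, round(v * scale)))
    return code / scale

def g1(w1, w2, x1, x2):
    """Real-valued node: RELU of an inner product."""
    return max(0.0, w1 * x1 + w2 * x2)

def g2(w1, w2, x1, x2):
    """The same node with every intermediate value quantized."""
    q = quantize
    acc = q(q(q(w1) * q(x1)) + q(q(w2) * q(x2)))
    return q(max(0.0, acc))

# All four operands already lie on the fixed-point grid, yet the two
# functions disagree because the intermediate products do not.
args = (0.75, -0.5, 0.75, 0.25)
print(g1(*args), g2(*args))  # 0.4375 0.5
```

Here the inputs are exactly representable, so the discrepancy arises purely from the quantized node functions, which is the sense in which the top section of figure 2 only 'approximately commutes'.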
The abstract process described in figure 2 is illustrated for a concrete example in figure 3. The small inset figure corresponds to the topology of $G_1$, the original specification graph, where each vertex is associated with a function $\mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}$ given by $(w_1, w_2; x_1, x_2) \mapsto \mathrm{RELU}(w_1 x_1 + w_2 x_2)$. Fixing $w_1$ and $w_2$ to specific values, quantizing the computation to a 4-bit fixed-point arithmetic and synthesizing the result produces the large main figure, corresponding to $G_3$, where each vertex is associated with a 1- or 2-input Boolean function. Clearly, there are some key differences between these networks apart from their datatypes: $G_3$ has an irregular structure compared to $G_1$ and has clusters of tightly interconnected 'neighbourhoods', roughly corresponding to the Boolean networks introduced for each fixed-point arithmetic function in $G_2$. However, by maintaining the entire design process within the same graph formalism, we can exploit the similarities: both are directed graphs operating on typed data, with nodes which can be considered as parametric functions. For the real network, the parametric functions are dot products with parameters given by weights; for the Boolean network, they are Boolean functions with the parameter indicating which function from $B_1$ or $B_2$ has been selected by the logic synthesis tool.

Figure 3. In the real network, nodes with in-edges all correspond to a function $\mathbb{R}^2 \to \mathbb{R}$ given by $(x_1, x_2) \mapsto \mathrm{RELU}(w_1 x_1 + w_2 x_2)$ for some (possibly distinct) parameter $w$. In the Boolean network, nodes correspond to simple logic functions from $B_1$ or $B_2$ produced by a synthesis tool [17], implementing a 4-bit fixed-point quantization of the real network. Tightly interconnected regions can be seen, corresponding to the Boolean implementation of individual arithmetic operations. Rendering of both graphs is via Gephi [18], with colouring by 'community'.

(a) Binarized neural networks
Driven by the desire to reduce energy consumption and improve performance as much as possible, an extreme form of fixed-point arithmetic has been used in the so-called binarized neural networks (BNNs) [19]. In these neural networks, both the weights and the activation signals are constrained to be drawn from {−1, +1}, resulting in extremely efficient implementations [20].
A classical function $f : \mathbb{R}^{n+1} \times \mathbb{R}^n \to \mathbb{R}$, given by $(w, c; x) \mapsto \sigma(w^T x - c)$ and implemented by a component of a deep neural network, is aggressively quantized to a function $f_{\mathrm{BNN}}$ taking values in $\{-1, +1\}$, given by $(w, c; x) \mapsto +1$ for $w^T x \ge c$, and $(w, c; x) \mapsto -1$ otherwise. The key to the implementation efficiency of such functions comes from the near-elimination of hardware-expensive multiplication operations: multiplication in a vector scalar product is reduced to a Boolean exclusive-NOR (XNOR) function. Meanwhile, the addition in the scalar product is reduced to a calculation of Hamming weight (population count), which admits efficient implementations [21].
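The reduction of the scalar product to XNOR and population count can be sketched as follows. The bit-packing convention (+1 stored as 1, −1 as 0) and the function names are assumptions for illustration; hardware implementations such as those in [20,21] will differ in detail.

```python
from itertools import product

def to_bits(v):
    """Pack a vector over {-1, +1} into an integer, with +1 -> 1 and -1 -> 0."""
    return sum(1 << i for i, s in enumerate(v) if s == +1)

def bnn_dot(w_bits: int, x_bits: int, n: int) -> int:
    """Inner product over {-1, +1}^n via XNOR and population count."""
    mask = (1 << n) - 1
    agree = ~(w_bits ^ x_bits) & mask      # XNOR: 1 where the signs agree
    return 2 * bin(agree).count("1") - n   # agreements minus disagreements

def bnn_node(w, c, x):
    """Binarized node: +1 if w^T x >= c, and -1 otherwise."""
    n = len(w)
    return +1 if bnn_dot(to_bits(w), to_bits(x), n) >= c else -1

# Exhaustive check against the arithmetic definition for n = 4.
for w in product((-1, +1), repeat=4):
    for x in product((-1, +1), repeat=4):
        expected = sum(wi * xi for wi, xi in zip(w, x))
        assert bnn_dot(to_bits(w), to_bits(x), 4) == expected
```

The identity used is that a product of two signs is +1 exactly when they agree, so with $a$ agreements out of $n$ the inner product equals $a - (n - a) = 2a - n$.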
Although BNNs have received a lot of attention, the general view in the implementation community is that neural networks constructed in this way cannot always achieve classification quality on complex datasets comparable to that of more precise data representations. This observation has led to manufacturers including configurable finite-precision datapaths, typically down to 4-bit [15] or 8-bit [22]. It is instructive to pursue an alternative view, which we shall now develop.

[...] functionally complete set of Boolean functions at its vertices, similar to figure 2, i.e. $f = \phi_2 \circ \llbracket G \rrbracket \circ \phi_1$ for some Boolean network $G$.

Theorem 3.2. The set of node functions in a Boolean implementation of BNNs is functionally complete.
Proof. We shall use the bijection $\varphi : \mathbb{B} \to \{-1, +1\}$ defined by $\bot \mapsto -1$, $\top \mapsto +1$. Clote & Kranakis [23] provide necessary and sufficient conditions for a set of Boolean functions to be functionally complete; one well-known such set is $\{\wedge, \vee, \neg\}$, together with the constants $\bot$ and $\top$. The equivalences expressing each of these functions in terms of BNN node functions can easily be shown through enumeration.

Note that it is, therefore, always possible to construct a real-valued DNN which, when quantized to produce a BNN, implements any Boolean function, including those Boolean functions that would have been derived via traditional design techniques (figure 2) using any finite-precision datatype $\mathbb{F}$, i.e. BNNs easily satisfy our property ❸. The theorem, therefore, challenges the received wisdom that BNNs are not always able to produce the required accuracy on a classification task. So, why this apparent discrepancy in practice? The issue is not with the computational generality of BNNs, but rather with the traditional design technique, which is unable to adapt the topology of the network to the requirements of the underlying datatype.
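The kind of enumeration referred to in the proof of Theorem 3.2 can be sketched as below. The particular weights and thresholds are one possible choice made here for illustration, under the assumption that the threshold $c$ may be any integer; they are not necessarily the equivalences used in the original proof.

```python
from itertools import product

PHI = {False: -1, True: +1}  # the bijection from the proof: bottom -> -1, top -> +1

def bnn(w, c, x):
    """BNN node function: +1 if w^T x >= c, and -1 otherwise."""
    return +1 if sum(wi * xi for wi, xi in zip(w, x)) >= c else -1

# Two-input connectives: the sum of two signs is 2 only when both are +1,
# and is at least 0 whenever at least one of them is +1.
for a, b in product([False, True], repeat=2):
    x = (PHI[a], PHI[b])
    assert bnn((+1, +1), 2, x) == PHI[a and b]   # conjunction
    assert bnn((+1, +1), 0, x) == PHI[a or b]    # disjunction

# Negation and the two constants, using a single input.
for a in [False, True]:
    x = (PHI[a],)
    assert bnn((-1,), 0, x) == PHI[not a]        # negation
    assert bnn((+1,), -1, x) == PHI[True]        # constant top
    assert bnn((+1,), 2, x) == PHI[False]        # constant bottom
```

Since every assertion passes over all input assignments, each member of the functionally complete set is realized by a single node under this parameter choice.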

Corollary 3.3. Accuracy-optimal network topology depends on the finite-precision datatype.
This corollary leads to a conjecture on future design methods for efficient neural networks, which generalizes some empirical observations, e.g. that reducing precision can be compensated by increasing network depth [24] or width [25]. Today, digital circuits are universally implemented using CMOS technology [26], whether in a microprocessor or a custom circuit design. CMOS circuits form extremely efficient implementations of nonlinear operations with a single output bit. This contrasts sharply with the standard nodes of real-valued DNNs, the inner product and the RELU, which are piecewise linear but arbitrarily precise. The usual approach to this dichotomy is to use wide enough finite-precision datatypes to make the hardware emulate the real-valued model: but at what cost?

Conjecture 3.4. In the future, efficient neural network topologies will be driven by both the topology of the data and by the nature of the discrete representation of the activations. The current separation between approximation (without topological changes) and topological changes (without approximation) will not survive the drive for efficient computation.

Boolean networks for Lipschitz functions
Since we have demonstrated the link between topology and data representation in deep neural networks, a natural question arises: which topologies may form good choices for learning Boolean functions? Perhaps, one may even remove the F level of abstraction in figure 2, which would then become equivalent to learning the arithmetic.
In §2, we discussed the properties of inference functions in a continuous setting; we shall now extend this discussion to Boolean networks. The aim of this section is to focus on property ❶: how can we develop Boolean networks exhibiting good generalization?
We explained, following Strang, the centrality of continuity to generalization in §2. The advantage of working with Lipschitz continuity is that we can directly transfer this idea to the Boolean setting. Here, every function $f : (\mathbb{B}^n, d) \to (\mathbb{B}^m, e)$ is Lipschitz, since the spaces are finite: when distinct inputs are at distance at least 1, as for the integer encodings considered below, we may take the Lipschitz constant $k = \max_{(a,b) \in \mathbb{B}^m \times \mathbb{B}^m} e(a, b)$. It is therefore not meaningful to talk about continuity in absolute terms, but rather about the value of the Lipschitz constant. We shall, therefore, study the question: which Boolean networks give rise to $k$-Lipschitz functions? The intuition here is that the lower the Lipschitz constant, the better the function meets the desirable property that small input perturbations cause at most small output perturbations.
Before investigating a concrete example of a simple Boolean circuit in this context, let us consider typical ways to define a metric on the Boolean vectors forming the inputs and outputs of a circuit. It will be helpful to define $\phi : \mathbb{B} \to \{0, 1\}$ as $\bot \mapsto 0$, $\top \mapsto 1$. Although not strictly necessary, it is typical to consider the metrics induced by the norms of encoded data, e.g.

$$d(a, a') = \lVert \varphi_i(a) - \varphi_i(a') \rVert \quad \text{and} \quad e(b, b') = \lVert \varphi_o(b) - \varphi_o(b') \rVert.$$

Here, we may interpret $\varphi_i$ as denoting a real vector represented by the Boolean inputs. A trivial example would be $\varphi_i(a_1, a_0) = 2\phi(a_1) + \phi(a_0)$, a representation of a two-bit scalar integer in standard binary arithmetic. A more complex scalar encoding corresponds to IEEE single- or double-precision floating point, as explicitly given in the standard [14].
It is instructive to consider the most basic typical arithmetic circuit, known as a ripple-carry adder, shown in figure 4 [27]. Each leaf node implements a Boolean function known as a full adder:

$$(a, b, c) \mapsto \bigl(a \oplus b \oplus c,\; (a \wedge b) \vee (c \wedge (a \oplus b))\bigr),$$

where $\oplus$ denotes Boolean XOR. We can consider this circuit as implementing a function $f : \mathbb{B}^n \times \mathbb{B}^n \times \mathbb{B} \to \mathbb{B}^{n+1}$. If we define $w_k : \mathbb{B}^k \to \mathbb{Z}$ as the function mapping vectors of Boolean values to the number they represent in a standard binary integer encoding:

$$w_k(a_{k-1}, \ldots, a_0) = \sum_{i=0}^{k-1} 2^i \phi(a_i),$$

then it can be seen why the Boolean network is referred to as an adder: $+ \circ (w_n, w_n, \phi) = w_{n+1} \circ f$, where $+$ denotes the standard integer addition. In the formalism of figure 2, $\varphi_i = (w_n, w_n, \phi)$, $\varphi_o = w_{n+1}$. Let us equip the input and output spaces with suitable metrics, e.g. those induced by the 1-norm of the difference in their word-level representation:

$$d((a, b, c), (a', b', c')) = |w_n(a) - w_n(a')| + |w_n(b) - w_n(b')| + |\phi(c) - \phi(c')|, \qquad e(s, s') = |w_{n+1}(s) - w_{n+1}(s')|.$$

With these metrics, the adder implements a 1-Lipschitz function, since the change in the sum is bounded by the total change in its operands. By contrast, replacing the full-adder leaf functions by the function $(a, b, c) \mapsto (\bot, c)$ results in a minimal Lipschitz constant of $2^n$ rather than 1, because a change in the carry input alone then propagates to the most significant output bit. Changing the metrics, which in the norm-induced case is equivalent to encoding the input or output with a different number system, could equally impact the Lipschitz properties. Finally, a different network topology based on the same full-adder leaf nodes could clearly lead to a different minimal Lipschitz constant. Thus, the minimal Lipschitz constant exhibited by a function implemented by a network will generally depend on three things: the topology of the network, the leaf node functionality and the encoding/metrics associated with the inputs and outputs of the network. Even if we assume the latter to be fixed, the interaction between the former two features is not ideal if we wish to learn the functionality of nodes in the network: local decisions on Boolean functionality can potentially have a global impact on the generalization behaviour of a network.
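The dependence of the minimal Lipschitz constant on leaf functionality can be checked by brute force for small $n$. The sketch below assumes the 1-norm word-level metrics described above and assumes that every leaf of the ripple-carry chain is replaced by the same function; the helper names are hypothetical.

```python
from itertools import product

def full_adder(a, b, c):
    """Standard full adder: returns (sum, carry-out)."""
    return a ^ b ^ c, (a & b) | (c & (a ^ b))

def pass_carry(a, b, c):
    """The alternative leaf function (a, b, c) -> (0, c)."""
    return 0, c

def ripple(leaf, a_bits, b_bits, c0):
    """Chain n leaves; operands are little-endian bit tuples, output has n+1 bits."""
    out, c = [], c0
    for a, b in zip(a_bits, b_bits):
        s, c = leaf(a, b, c)
        out.append(s)
    return tuple(out) + (c,)

def word(bits):
    """Standard binary integer value of a little-endian bit tuple."""
    return sum(bit << i for i, bit in enumerate(bits))

def min_lipschitz(leaf, n):
    """Largest ratio of output distance to input distance over all input pairs."""
    words = list(product((0, 1), repeat=n))
    inputs = [(a, b, c) for a in words for b in words for c in (0, 1)]
    best = 0.0
    for p in inputs:
        for q in inputs:
            d = (abs(word(p[0]) - word(q[0])) + abs(word(p[1]) - word(q[1]))
                 + abs(p[2] - q[2]))
            if d:
                e = abs(word(ripple(leaf, *p)) - word(ripple(leaf, *q)))
                best = max(best, e / d)
    return best

print(min_lipschitz(full_adder, 3))  # 1.0
print(min_lipschitz(pass_carry, 3))  # 8.0, i.e. 2**n
```

Exhaustive search is only feasible for very small networks, which is one reason a structural characterization of low-Lipschitz topologies is of interest.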
Learning from the $n$-bit adder example, one natural approach to generating functions with low Lipschitz constant appears to be to reverse the direction of the 'carry' edges $c_i$. If these edges are reversed, then no path exists between $a_i$, $b_i$, $c_i$ and $s_j$ or $c_j$ for any $j > i$, meaning that changes in low-significance input bits cannot impact high-significance output bits. This topology is appealing because it corresponds directly to most-significant-digit-first arithmetic, a universal approach to computation pioneered by Ercegovac [28] in the 1970s for computer arithmetic: through a suitable change in the encoding $w_k$, this topology can be used to implement all the basic arithmetic operators [29]. However, such a topology does not guarantee a particular Lipschitz constant for the metrics defined in (4.3), because small changes in the input metric can still correspond to large changes in the most significant digit: one sees this, for example, with the transition 011111 → 100000, a change of one but with a most-significant-digit bit flip. To avoid this issue, one must either change the encoding of the network inputs and outputs or place restrictions on the Boolean functionality of the nodes. The former approach, selecting an optimal encoding of the input space as Booleans, is an open problem. A trivial but inefficient solution would be to use a unary encoding. More efficient solutions could potentially draw deeply from the area of combinatorial Gray codes [30], i.e. methods for generating combinatorial objects (such as inputs of a discrete-valued neural network, drawn from $X$), so that successive objects differ by a small degree. As noted by Savage [30], Gray codes are not preserved under bijection, and it is exactly this property that could suggest implementation-appropriate coding.
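The contrast between standard binary and a Gray encoding at the 011111 → 100000 transition can be illustrated with the familiar binary-reflected Gray code. This is only one simple instance, used here as an assumption for illustration; it is not a proposal for the optimal encoding, which the text identifies as open.

```python
def gray(i: int) -> int:
    """Binary-reflected Gray code of the non-negative integer i."""
    return i ^ (i >> 1)

def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

# Standard binary: stepping from 31 to 32 flips all six bits.
assert hamming(0b011111, 0b100000) == 6

# Gray code: successive integers always differ in exactly one bit.
assert all(hamming(gray(i), gray(i + 1)) == 1 for i in range(63))

print(format(gray(31), "06b"), format(gray(32), "06b"))  # 010000 110000
```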

Open Problem 4.2.
For future deep neural networks, what input and output codings are commensurate with the properties of good inference functions identified in §2, and how do they depend on the input probability space and oracle function?
The author performed the following simple experiment to investigate the latter approach, i.e. restricting Boolean functionality to ensure a certain Lipschitz constant for fixed topology and metrics. Consider the simple topology shown in figure 4b with associated metrics $d((c, a_1, a_0), (c', a_1', a_0')) = |w_3(c, a_1, a_0) - w_3(c', a_1', a_0')|$ and $e((s_1, s_0, q), (s_1', s_0', q')) = |w_3(s_1, s_0, q) - w_3(s_1', s_0', q')|$. There are $(2^4 \times 2^4)^2$ choices for the Boolean functionality of $(f_1, f_0)$. If we assume that neither constant functions nor those in $B_1$ are of interest, then there are 100 choices for each of $f_1$ and $f_0$. A complete enumeration identifies 376 pairs of Boolean functions $f_1$, $f_0$ for which the network implements a 2-Lipschitz function.

One may go further and ask whether we can identify a set of choices for the function of $f_1$ and the function of $f_0$ such that we may arbitrarily choose functions from these two sets while maintaining the 2-Lipschitz property, effectively decoupling the choice of leaf functionality from topology. We shall refer to such sets as a 'functional decoupling' for the given topology, value of $k$ and metrics. Consider the bipartite graph whose two vertex classes are the candidate functions for $f_1$ and for $f_0$, and whose edges correspond to the pairs of functions resulting in a network implementing a 2-Lipschitz function. A biclique [31] of this graph corresponds to a decoupled set. Using the algorithm of Gillis & Glineur [32] reveals such a biclique of size (6, 10) for this topology, i.e. any combination of these choices of node function results in a 2-Lipschitz network function.
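The count of 100 candidate functions per node can be reproduced with a short enumeration, under the assumption (inferred from the $(2^4 \times 2^4)^2$ total) that each of $f_1$ and $f_0$ has two Boolean inputs and two Boolean outputs. The sketch counts only the candidate leaf functions; it does not reproduce the 376-pair result, which depends on the topology of figure 4b.

```python
from itertools import product

INPUTS = list(product((0, 1), repeat=2))   # assignments (x, y) in a fixed order
TABLES = list(product((0, 1), repeat=4))   # truth tables of all 16 two-input functions

def depends_on(table, var):
    """True if flipping input `var` changes the output for some assignment."""
    f = dict(zip(INPUTS, table))
    for x, y in INPUTS:
        flipped = (1 - x, y) if var == 0 else (x, 1 - y)
        if f[(x, y)] != f[flipped]:
            return True
    return False

# Exclude constants and functions of a single input: keep those that
# genuinely depend on both inputs.
genuine = [t for t in TABLES if depends_on(t, 0) and depends_on(t, 1)]

print(len(genuine))        # 10 two-input functions per output
print(len(genuine) ** 2)   # 100 choices for a two-output node
```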

Open Problem 4.4.
Given metrics on input and output, a Lipschitz constant k, and a network topology, is there a useful characterization of exactly which functions can be implemented by a network with this topology using only leaf functions drawn from functional decouplings?
The significance of this problem is that it would help us to characterize the extent to which it is useful to consider promising network topologies separately from leaf functions.

The discrete-continuous divide: preliminary work
One of today's most promising platforms for the practical realization of very high-performance deep neural networks is the FPGA [33]. These architectures provide an interesting case study for exploring some of the ideas presented in this paper, because there is a natural choice for the set of leaf functions implemented in a network: the set $B_K$ [23] of K-input Boolean functions, where K is a device-specific parameter. This is a natural choice because the underlying architecture is built of small physical Boolean lookup tables, each programmable to implement any one of the functions in $B_K$, together with programmable interconnect able to connect these lookup tables in an effectively arbitrary topology (K = 6 is common).
Wang et al. [34] have recently begun to explore the potential for using the additional flexibility provided by these lookup tables. In this initial work, which we call LUTNet, we begin by taking a reasonably traditional approach, following [35]: some standard DNN benchmarks from the literature are quantized to use single-bit weights from {−1, +1}, and retrained to improve classification accuracy. In the resulting network, many of the vertices have function $(w; x) \mapsto wx$, usually as part of the standard inner product common in DNNs. We observe that such computation is inefficient, because the basic lookup tables are not being used to their full potential: in the extreme, we have hardware capable of implementing any function from $B_6$ used solely to implement 2-input XNOR gates. We therefore modify the network in the following way. First, we replace the vertex functions $\{-1, +1\} \times \{-1, +1\} \to \{-1, +1\}$ given by $(w; x) \mapsto wx$ by the strictly more general class of functions $B^{2^K} \times \{-1, +1\}^K \to \{-1, +1\}$ consisting of all functions (isomorphic to) $B_K$, where the parameter selects the particular function. To use the additional support of these functions (K two-valued activations rather than just one), we heuristically connect the additional inputs to other nodes in the network whose weights had low magnitude before quantization. After initially setting the new functions to reproduce the original, i.e. selecting the parameters from $B^{2^K}$ to be precisely those regenerating the function $(w; x) \mapsto wx$, we then retrain the network using standard stochastic gradient descent (SGD) methods. Finally, we simplify the network topology through a standard 'pruning' technique [36]. The intuition behind this process is that the nonlinear generality of the class $B_K$ may compensate for the pruning, resulting in a higher accuracy for a given number of Boolean network nodes.
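The initialization step can be made concrete with a small sketch. It is illustrative only and is not taken from the LUTNet implementation [34]: the function names and the dictionary representation are assumptions. A K-input lookup table over {−1, +1} is stored as a table with $2^K$ entries, and is initialized so that it reproduces $(w; x) \mapsto wx$ on its first input while ignoring the remaining K − 1 inputs.

```python
from itertools import product

def init_lut_from_weight(w, K):
    """Build a K-input LUT that reproduces (w; x) -> w * x on input 0.

    The table maps each point of {-1, +1}^K to an output in {-1, +1}.
    The other K - 1 inputs are ignored at initialization; retraining is
    then free to change any of the 2**K entries independently.
    """
    assert w in (-1, +1)
    return {x: w * x[0] for x in product((-1, +1), repeat=K)}

def lut_eval(table, x):
    """Evaluate the LUT at a point x of {-1, +1}^K."""
    return table[tuple(x)]

K = 6
table = init_lut_from_weight(-1, K)
print(len(table))                                   # number of parameters
print(lut_eval(table, (+1, -1, -1, +1, +1, -1)))    # equals -1 * (+1)
```

With K = 6 the table holds 64 free entries, so retraining searches over $2^{64}$ candidate node functions rather than the two available to a single-bit weight.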
This is indeed what we observe in figure 5, which plots the classification error rate on the test set against network area (in LUTs) for classification of the CIFAR-10 [37] dataset, containing 60 000 32 × 32 colour images of 10 different classes. The CNV neural network model [20] is used as the baseline topology, and the modifications described above are made to its largest layer; in the case of CNV, this is a sizeable convolutional layer with 256 outputs, operating with 3 × 3 kernels [34]. For this network, we see a reduction in area consumption of approximately 50% compared to the baseline implementation operating at the same classification accuracy. Using SGD in this discrete setting requires a lifting to a continuous interpolation, as described in [34]. LUTNet is, thus, a representative of one direction in which to cross the discrete-continuous divide; some possible approaches to crossing in the opposite direction are explored in §6.
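One way to picture such a lifting is multilinear interpolation of the lookup table, sketched below. Whether this coincides in detail with the formulation of [34] is an assumption, and the names are hypothetical. The interpolant $\sum_{d} c_d \prod_k (1 + d_k x_k)/2$, with d ranging over $\{-1, +1\}^K$, agrees with the table entry $c_d$ at every corner x = d, is defined for real-valued x and $c_d$, and is differentiable in both.

```python
from itertools import product

def lut_interp(coeffs, x):
    """Multilinear interpolation of a LUT over the cube [-1, +1]^K.

    coeffs maps each corner d in {-1, +1}^K to a (possibly real) value c_d.
    Each factor (1 + d_k * x_k) / 2 equals 1 when x_k == d_k and 0 when
    x_k == -d_k, so only the matching corner contributes at a corner.
    """
    total = 0.0
    for d, c in coeffs.items():
        weight = 1.0
        for dk, xk in zip(d, x):
            weight *= (1.0 + dk * xk) / 2.0
        total += c * weight
    return total

# Toy 3-input table: the product of the first two inputs (XNOR in +/-1 coding).
coeffs = {d: d[0] * d[1] for d in product((-1, +1), repeat=3)}

print(lut_interp(coeffs, (+1, -1, +1)))      # a corner: reproduces the table
print(lut_interp(coeffs, (0.5, -0.5, 0.3)))  # an interior point
```

For this toy table the interpolant reduces to $x_0 x_1$, so the interior point evaluates to −0.25. The derivative of the interpolant with respect to each $c_d$ is simply the corresponding product weight, which is what makes gradient-based retraining possible in this picture.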

Future directions
It is the central thesis of this paper that there is much to learn by viewing neural networks and digital circuits as two embodiments of typed operations on graphs.
The topic of determining a good neural network topology is still in its infancy [1]. We have shown that there are additional dimensions to this problem: finite-precision data representation and the metrics determining 'closeness' of input and output also have a direct impact on efficient network topologies. Coupling these two concerns would seem to be a significant avenue for fruitful research in deep learning.
While the literature on learning the appropriate parameters for predefined neural network topologies has developed rapidly in recent years [1], systematic algorithmic approaches to learning neural network topologies from data are only recently appearing [38] and the underlying theory is limited. This mirrors the situation in the automated synthesis of digital circuits before the 1990s: the automated synthesis of logic circuits consisting of two layers (one of AND gates with optional input inversion, followed by one of OR gates) had been understood theoretically [39] and practically [40] before the 1990s, but only during that decade did the technology to optimize multilevel ('deep') Boolean networks emerge [13]. There may be considerable scope for crossover between the electronic design automation community and the deep learning community based on this work. Recently, there has been a resurgence of interest in the problem of exact (i.e. optimal) logic synthesis [41], which, albeit in a different setting, also needs to explore topology and node functionality simultaneously, and is stymied by the resulting computational complexity. This suggests that a possible avenue for future development is to lift the progress being made in this area to richer data types.
The problem of placing bounds on the number of graph nodes drawn from a certain basis set required to meet a given quality of classification, e.g. via metric (1.1), could potentially be a very interesting topic for further theoretical study. There is a rich literature on circuit complexity bounds [23], and it may be possible to combine these ideas with probabilistic notions from Minimum Length Descriptions [42] to bound minimal circuit sizes. The k-Lipschitz property used in this paper is a global property, yet it may seem more natural to consider local properties. Extending the approach to networks that implement some form of locally k-Lipschitz functions with high probability, when the input is viewed as a random variable, may be a fruitful way forward. In addition to reasoning about the generalization behaviour of neural networks, prior work has shown that Lipschitz continuity can play a role in the regularization of neural network models [43] and that minimal Lipschitz constants are hard to compute [44] a posteriori. These results suggest that a holistic approach to topology and node functionality is appropriate, as argued in this paper.
The path seems open to investigate a variety of coding techniques for network inputs and outputs that give rise to desirable properties regarding generalization as well as the efficiency of implementation. In a different context, Dietterich & Bakiri [45] consider distributed output coding for classification, and it may be the case that coding theory and combinatorial enumeration approaches [30] have the potential to shed significant light on the key elements of an efficient inference function discussed in this article.
In addition to exploring suitable classes of Boolean function, for example, by attacking Open Problem 2 described in §4, there may be value in generalizing nodes to exhibit nondeterministic behaviour. In particular, stochastic rounding has recently appeared as a promising avenue both in the training of deep neural networks [46] and in the simulation of biologically plausible neural models [47].
Finally, we have focused entirely on deep neural networks in this article. There are, of course, many other classical machine learning techniques [48]. We should note that once an inference algorithm for one of these classical methods has been decided upon, the algorithm can typically be expressed as a network (in the sense of §2) corresponding to the data-flow graph [49] of the algorithm. Just as LUTNet, described in §5, uses BNNs as a starting point for re-training, it is equally possible to use this network as a starting point for re-training or topological exploration.

Notation
R denotes the reals, and B = {⊥, ⊤} the set of Boolean truth values, where ⊥ denotes false and ⊤ denotes true. RELU : R → R is used to denote the rectified linear unit function $x \mapsto \max(0, x)$. σ : R → R denotes the sigmoid function $x \mapsto 2/(1 + \exp(-x)) - 1$. We denote function composition by ∘. $B_K$ denotes the set of all functions from $B^K$ to B. The set of integers is denoted by Z, and the set of integers bounded in absolute value by n is denoted by $Z_n = \{i \in Z \mid -n \le i \le n\}$. The following Boolean connectives are used: ¬ denotes negation, ∧ denotes conjunction, ∨ denotes disjunction and ⊕ denotes exclusive or (XOR).
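As a quick sanity check on these definitions, the following sketch evaluates them numerically. It reads the sigmoid with denominator 1 + exp(−x), under which σ(x) coincides with tanh(x/2) and takes values in (−1, 1); the Python function names are chosen here for illustration only.

```python
import math

def relu(x):
    """Rectified linear unit: x -> max(0, x)."""
    return max(0.0, x)

def sigma(x):
    """Sigmoid as read here: x -> 2 / (1 + exp(-x)) - 1, with range (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def Z(n):
    """Integers bounded in absolute value by n, i.e. {-n, ..., n}."""
    return list(range(-n, n + 1))

print(relu(-2.0), relu(3.0))
print(sigma(0.0), abs(sigma(1.7) - math.tanh(1.7 / 2)) < 1e-12)
print(Z(2))
```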
Data accessibility. Code to support §5 is available at https://github.com/constantinides/rethinking
Competing interests. My research chair is part-funded by Imagination Technologies Ltd, and I have cited a blog post from the company in this work.
Funding. This work was financially supported by the Engineering and Physical Sciences Research Council (EP/P010040/1), Imagination Technologies and the Royal Academy of Engineering.