ANALOG PROGRAMMABLE SPARSE APPROXIMATION SYSTEM
CROSS REFERENCE TO RELATED APPLICATIONS
This application claims benefit under 35 U.S.C. §119(e) of U.S. Provisional Patent Application Ser. No. 61/555,171, of the same title, filed Nov. 3, 2011, which is herein incorporated by reference as if fully set forth below in its entirety.
GOVERNMENT LICENSE RIGHTS
This invention was made with Government support under Agreement/Contract Number CCF-0905346, awarded by the National Science Foundation. The Government has certain rights in the invention.
BACKGROUND
1. Technical Field
Embodiments of the present invention relate generally to sparse approximation and specifically to accurate, energy-efficient analog sparse approximation with reduced energy consumption and reduced computational expense.
2. Background of Related Art
One specific example of sparse approximation is compressed sensing (CS) (e.g., attempting to create a high-resolution image from relatively few measurements). CS tends to provide results for inverse problems when the signals are highly undersampled (M<<N, where M measurements are taken of a length-N signal) and the signal is assumed to be sparse (i.e., having very few non-zeros in the signal). CS results indicate that for certain sensing matrices Φ (generally taken to be random), S-sparse signals can be recovered (up to the noise level) by solving an l1-regularized least-squares optimization problem as long as M ~ O(S log(N/S)). In other words, in a situation where each measurement is costly, a signal can be undersampled during acquisition in exchange for using more computational resources to later recover the signal. This technique can be used, for example, for coded aperture sensing systems that spend fewer resources to collect data at a specified resolution, relying instead on computational post-processing to reconstruct the signal.
Unfortunately, the optimization problems used for signal recovery are computationally expensive, preventing practical deployment of digital solutions for portable, low-power applications (e.g., handheld medical imagers or scanners). Despite the long history of optimization in the field of signal processing, the recent advent of applications that utilize optimization directly to perform CS, for example, identifies a specific need for solvers that can operate in real time and/or under real-world power constraints. This type of signal processing can be useful, for example and not limitation, for medical imaging and channel estimation for wireless communications. Given the importance of solving sparse approximation problems in state-of-the-art algorithms, therefore, recent research has focused on dramatically reducing their solution times. These optimization programs are particularly challenging due to the presence of the non-smooth l1-norm in the objective. Thus, despite recent progress in developing convex optimization solvers, this non-smoothness presents significant challenges for obtaining real-time results for moderate to large-sized problems.
What is needed, therefore, is a system for recovering sparse signals, for example, with reduced computational time and expense. What is needed is a system for solving widely used sparse approximation problems using commonly available and efficient analog circuitry. It is to such a system that embodiments of the present invention are primarily directed.
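For illustration only, the following Python sketch (not part of the original disclosure; the constant factor c and the example sizes are assumptions) makes the M ~ O(S log(N/S)) measurement-count rule of thumb concrete:

```python
# Illustrative sketch of the compressed-sensing measurement rule of thumb,
# M ~ O(S log(N/S)). The constant c is an assumed placeholder.
import math

def min_measurements(N, S, c=2.0):
    """Rule-of-thumb number of random measurements needed to recover an
    S-sparse, length-N signal; c is an assumption, not a derived constant."""
    return math.ceil(c * S * math.log(N / S))

for N, S in [(1024, 10), (1024, 50), (4096, 20)]:
    print(f"N={N:5d}, S={S:3d} -> M ~ {min_measurements(N, S)} (vs. N={N})")
```

Even rough constants show the point of the technique: the number of measurements M can be far smaller than the signal length N when the signal is sparse.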
SUMMARY
Embodiments of the present invention relate generally to optimization problems utilizing Hopfield networks and specifically to an analog hardware implementation of a Hopfield network. In some embodiments, the system can be used for sparse approximation using accurate, energy-efficient, analog sparse approximation. This system can provide sparse approximation with reduced energy consumption and reduced computational expense. In some embodiments, the system can comprise sub-threshold current mode circuits on a Field Programmable Analog Array (FPAA) or on a custom analog chip.
Embodiments of the present invention can comprise a method comprising applying each of a plurality of input signals to each of a plurality of feedforward excitation signals to generate a plurality of first output signals, applying each of a plurality of second output signals to each of a plurality of lateral inhibition signals to generate a plurality of recurrent feedback signals, subtracting each of the plurality of recurrent feedback signals from each of the plurality of first output signals to generate a plurality of intermediate signals, and applying each of the plurality of intermediate signals to a non-linear computation to generate the plurality of second output signals. In some embodiments, the method can further comprise converting a first sparse vector of a plurality of sparse vectors to a plurality of input signals. In some embodiments, the plurality of feedforward excitation signals can be applied by a first plurality of transistors that comprise a first analog vector matrix multiplier (VMM). In other embodiments, the plurality of lateral inhibition signals can be applied by a second plurality of transistors that comprise a second analog VMM. In still other embodiments, the subtraction step can be performed by a plurality of current mirrors. In some embodiments, each step can be performed in parallel in continuous time for each input signal of the plurality of input signals. In other embodiments, the plurality of first output signals and the plurality of recurrent feedback signals can be analog, while the plurality of second output signals can be digital. In some embodiments, one or more of the plurality of first output signals and the recurrent feedback signals can change in response to a change in one or more of the plurality of second output signals. In some embodiments, a change in one or more of the plurality of first output signals or the recurrent feedback signals can act as a low-pass filter.
Embodiments of the present invention also comprise a device for implementing a Hopfield network. In some embodiments, the device can comprise a plurality of first parallel linear computational devices for applying each of a plurality of input signals to each of a plurality of feedforward excitation signals to generate a plurality of first output signals, a plurality of second parallel linear computational devices for applying each of a plurality of second output signals to each of a plurality of lateral inhibition signals to generate a plurality of recurrent feedback signals, and a plurality of non-linear parallel computational devices for subtracting each of the plurality of recurrent feedback signals from each of the plurality of first output signals to generate a plurality of intermediate signals and applying each of the plurality of intermediate signals to generate a plurality of second output signals. In some embodiments, the device can further comprise a plurality of digital-to-analog converters for converting a plurality of digital signals into the plurality of input signals.
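For illustration only, the four method steps above (feedforward excitation, lateral inhibition, subtraction, and the non-linear computation) can be summarized as a discrete-time numerical sketch. The Euler step size, dictionary, and loop structure below are assumptions and stand in for the continuous-time analog hardware; this is not code from the disclosure:

```python
# Minimal discrete-time sketch of the four method steps. The analog system
# performs these operations in parallel, continuous time; here they are
# emulated with an assumed Euler integration step.
import numpy as np

def run_network(Phi, y, lam=0.1, dt=0.05, steps=2000):
    M, N = Phi.shape
    b = Phi.T @ y                 # feedforward excitation -> first outputs
    H = Phi.T @ Phi - np.eye(N)   # lateral inhibition weights
    u = np.zeros(N)               # internal state variables
    a = np.zeros(N)               # second output signals
    for _ in range(steps):
        fb = H @ a                # recurrent feedback signals
        du = b - fb - u           # subtract feedback from first outputs
        u += dt * du              # continuous-time dynamics, Euler step
        a = np.maximum(u - lam, 0.0)  # non-linear (soft-threshold) step
    return a
```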
In other embodiments, the plurality of first parallel linear computational devices can comprise a first plurality of transistors forming a first analog vector matrix multiplier (VMM). In some embodiments, each scalar multiplication in the first analog VMM can require only one of the first plurality of transistors. In some embodiments, one or more of the first plurality of transistors can be programmable. In some embodiments, the plurality of second parallel linear computational devices can comprise a second plurality of transistors forming a second analog VMM. In other embodiments, each scalar multiplication in the second analog VMM can require only one of the second plurality of transistors. In still other embodiments, one or more of the second plurality of transistors can be programmable. In yet other embodiments, the plurality of non-linear parallel computational devices can comprise a plurality of n-channel field effect transistors (nFETs). In some embodiments, the device can further comprise one or more low-pass filters. In some embodiments, each of the plurality of non-linear parallel computational devices can comprise an individually tunable negative offset and an integrate and fire neuron. In some embodiments, these integrate and fire neurons can comprise non-leaky integrate and fire neurons.
Embodiments of the present invention can also comprise a system for implementing a Hopfield network. In some embodiments, the system can comprise a field programmable analog array (FPAA). The FPAA can comprise a first plurality of transistors forming a first vector multiplication matrix (VMM) for applying each of a plurality of input signals to each of a plurality of feedforward excitation signals to generate a plurality of first output signals, a second plurality of transistors forming a second vector multiplication matrix (VMM) for applying each of a plurality of second output signals to each of a plurality of lateral inhibition signals to generate a plurality of recurrent feedback signals, and a plurality of modified current mirrors for subtracting each of the plurality of recurrent feedback signals from each of the plurality of first output signals to generate a plurality of intermediate signals and applying each of the plurality of intermediate signals to a non-linear computation to generate a plurality of second output signals. In some embodiments, the system can further comprise a plurality of digital-to-analog converters for converting a first sparse vector from a plurality of sparse vectors into a plurality of input signals. In other embodiments, each scalar multiplication in the first VMM or the second VMM can require only one transistor of the first or second plurality of transistors. In still other embodiments, each of the first and second plurality of transistors can comprise one or more floating gates, and the charge on each of the one or more floating gates can determine the weight of the scalar multiplication produced by that transistor. In some embodiments, each of the modified current mirrors can comprise a negative offset current. In other embodiments, the negative offset current can be individually tunable for each of the modified current mirrors. In still other embodiments, the negative offset current can be provided by a floating gate transistor, and the charge of the floating gate transistor can determine the magnitude of the negative current offset. In still other embodiments, the non-linear computations can comprise integrate and fire neurons.
These and other objects, features and advantages of the present invention will become more apparent upon reading the following specification in conjunction with the accompanying drawing figures.
DETAILED DESCRIPTION
Embodiments of the present invention relate generally to sparse approximation and specifically to a system for sparse approximation using accurate, energy-efficient, analog sparse approximation. This system can provide sparse approximation with reduced energy consumption and reduced computational expense. In some embodiments, the system can comprise sub-threshold current mode circuits on a Field Programmable Analog Array (FPAA) or on a custom analog chip. To simplify and clarify explanation, the system is described below as a system for solving sparse problems using an FPAA. One skilled in the art will recognize, however, that the invention is not so limited and that, for example, other analog or digital circuitry can be used. In addition, while explained below in the context of solving sparse approximations, one of skill in the art will recognize that the system and method are more generally a hardware implementation of a Hopfield network. As such, the system and method could also be used to solve other optimization problems such as, for example and not limitation, quadratic programs/linear programs (QPs/LPs). For ease of explanation, specific components (e.g., CMOS chips) are described below; however, one skilled in the art will recognize that existing and future components and algorithms can be used.
The materials described hereinafter as making up the various elements of the present invention are intended to be illustrative and not restrictive. Many suitable materials that would perform the same or a similar function as the materials described herein are intended to be embraced within the scope of the invention. Such other materials not described herein can include, but are not limited to, materials that are developed after the time of the development of the invention, for example. Any dimensions listed in the various drawings are for illustrative purposes only and are not intended to be limiting. Other dimensions and proportions are contemplated and intended to be included within the scope of the invention.
As discussed above, a problem with current Hopfield networks, in general, and sparse signal recovery algorithms, in particular, is that they are computationally expensive and time consuming, preventing practical deployment of digital solutions for portable, low-power applications. Recent work in computational neuroscience, however, has demonstrated that a continuous-time dynamical system where (1) the steady-state response is the solution to a regularized least-squares optimization and (2) the architecture of the system is designed to efficiently deal with sparsity-inducing non-smoothness conditions can be effective and efficient. This Hopfield-Neural-Network-like architecture can enable the use of analog circuitry, which can provide several benefits. Even the most efficient iterative digital algorithms currently available require O(N2) floating point operations per iteration. In contrast, the solution time in a parallel analog architecture is proportional to the RC time constant, which scales as O(N). In other words, the analog solution scales better by a factor of N.
In addition, total energy consumption can also be reduced by using analog vector matrix multipliers (VMMs) that require only one transistor per multiplication (i.e., instead of using the large multipliers required for digital processing). Using a programmable analog device like, for example and not limitation, an FPAA enables the implementation and testing of circuits without the time and cost of chip fabrication and also enables compensation for errors caused by the inherent mismatch in transistor sizes. Embodiments of the present invention, therefore, can comprise an analog approach to implementing a Hopfield network that can provide solutions with lower power, greater speed, and better scaling properties than is possible in conventional digital solutions. The system can enable a significant number of practical applications that would otherwise not be possible, even with substantial improvements in digital algorithms, due to time and/or power constraints. In CS applications, as discussed above, an analog system can be especially powerful, enabling signals to be acquired (e.g., with coded apertures) and recovered very quickly. This can eliminate the post-processing, for example, that has become the "accepted" bottleneck with CS systems.
Optimization Problem Formulation
As discussed above, sparse approximation methods achieve efficient signal representations by using only a small subset of dictionary elements and taking advantage of the known statistical structure of the signal. The linear generative model is:
y = Φa + v (Eq. 1)
where a vector input y ∈ ℝM is represented with an overcomplete dictionary Φ = [φ1, . . . , φN] using coefficients a ∈ ℝN, with additive Gaussian white noise v. Given these definitions, the desired Maximum A-Posteriori (MAP) estimate of the linear generative model, assuming a sparse prior distribution on coefficients a, is:
â = arg mina ½∥y − Φa∥22 + λC(a) (Eq. 2)
where C(•) is a sparsity-inducing cost function (e.g., the l0-norm). Unfortunately, direct optimization of the problem (i.e., where C(•) counts non-zeros) is intractable. Therefore, in some embodiments, Basis Pursuit De-Noising (BPDN) can be used. In this technique, a common surrogate for l0-norm optimization sets C(•) to the l1-norm. In this configuration, Eq. 2 is equivalent to the convex optimization:
â = arg mina ½∥y − Φa∥22 + λ∥a∥1 (Eq. 3)
where the first term in the objective function represents the mean squared error of the approximation, the second term represents the sparsity of the solution via the l1-norm, ∥a∥1 = Σi|ai|, and λ is a tradeoff parameter (e.g., balancing data fidelity against solution sparsity).
LCA Architecture
A locally competitive algorithm (LCA) can be described as a system of nonlinear ordinary differential equations (ODEs). Fortunately, as discussed below, these equations translate readily into a Hopfield-Network-like system architecture.
System of Differential Equations
In some embodiments, the LCA can be a continuous time algorithm which acts on a set of internal state variables, un(t) for n=1, . . . , N. Fortunately, these internal states are guaranteed to exponentially converge to the equilibrium state, which is the solution to the objective function in Eq. 3. Restricting a(t)>0, the dynamics of the nodes can be described by the following set of ODEs:
τ du(t)/dt = b − u(t) − Ha(t), with a(t) = Tλ(u(t)) (Eq. 4)
where τ is the time constant of the system, and b = ΦTy is the vector of driving inputs. The feedback between the nodes can be computed by H = ΦTΦ − I. The sparsity constraint and the nonlinearity can be introduced by the threshold operator Tλ(•), which decreases the absolute value of u(t) by λ.
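For comparison with the analog approach, the following is a minimal sketch of a conventional digital solver for Eq. 3 (an ISTA-style proximal-gradient iteration; the random dictionary, step rule, and parameters below are assumptions, and this is not code from the disclosure):

```python
# Hedged sketch of a standard digital baseline for Eq. 3 (ISTA-style):
# minimize 0.5*||y - Phi a||_2^2 + lam*||a||_1.
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(Phi, y, lam, iters=500):
    L = np.linalg.norm(Phi, 2) ** 2      # Lipschitz constant of the gradient
    a = np.zeros(Phi.shape[1])
    for _ in range(iters):
        grad = Phi.T @ (Phi @ a - y)     # gradient of the quadratic term
        a = soft_threshold(a - grad / L, lam / L)
    return a

rng = np.random.default_rng(0)
Phi = rng.standard_normal((8, 32)) / np.sqrt(8)   # assumed random dictionary
a_true = np.zeros(32); a_true[[3, 17]] = [1.0, -0.5]
y = Phi @ a_true + 0.01 * rng.standard_normal(8)
print(np.round(ista(Phi, y, lam=0.05), 3))
```

Each iteration of such a solver costs O(N2) multiply-accumulates, which is the computational burden the analog LCA architecture avoids.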
Once the state variables u(t) have reached equilibrium, the output vector a(t) is the solution to the objective function.
System Architecture for Hardware
As in most neural networks, the internal state variables in Eq. 4 evolve in a parallel fashion. The architecture of the LCA can be implemented as an analog hardware system, an example of which is presented in the accompanying figures. The first VMM represents a feedforward multiplier. It accepts the input vector y from the current digital-to-analog converters (DACs) (after they are mirrored) and performs the operation b = ΦTy to compute the driving inputs. The second block, or recurrent VMM, performs the operation h(t) = Ha(t) and computes the recurrent feedback. It should be noted that the feedback is similar to a stable, convergent Hopfield network. In other words, nodes do not inhibit themselves (i.e., Hm,m=0) and the inhibition between nodes is symmetric (i.e., Hm,n=Hn,m).
In a preferred embodiment, for scalar multiplication accuracy, the input and output devices should have matching drain voltages. To this end, the input drain voltage can be regulated with an operational transconductance amplifier (OTA) that provides a power source to both the input and output currents. In this configuration, the OTA can scale the input power with the number and strength of the outputs.
In a preferred embodiment, for improved accuracy of the current mirror, the nFETs can be well matched and have substantially identical drain voltages. To this end, mismatch can be minimized by simply enlarging the devices. Fortunately, this enlargement is not a major factor with regard to system density because there are O(N) mirrors, but O(N2) VMM elements. In other words, the number of mirrors is smaller than the number of VMM elements by a factor of N and, thus, does not have a significant effect on system size. As with the VMMs, OTAs can be used to regulate the input drain voltage. In addition, because the mirror outputs are the same as the VMM inputs (which also have a regulated voltage), the drain voltages are matched. Similarly, the current mirror OTAs also allow matching of the drain voltages in the VMM. The transfer function of the double current mirror is then the soft threshold:
a(t) = Tλ(u(t)) = max(u(t) − λ, 0) (Eq. 5)
From the VMMs, we get b = ΦTy and h = (ΦTΦ − I)a(t). Thus, combining these relationships yields the original Eq. 4.
Example 1
LCA Circuitry on Reconfigurable Analog Hardware
In some embodiments, the LCA can be implemented on a reconfigurable analog chip such as the RASP 2.9v, which can comprise several design innovations that make it particularly well suited for implementing and testing the LCA. The majority of the FGEs are directly programmed devices, for example, meaning that the programmed device is directly in the final circuit. Thus, while this adds a selection register to the signal path, it also eliminates mismatch issues seen in earlier FPAAs. The direct devices allow the programming of current sources (e.g., those needed for the threshold current and inputs) to 7 bits of accuracy. This represents less than 1% error. An automated calibration routine can be used and can employ the Enz-Krummenacher-Vittoz (EKV) model to determine the relationship between the floating gate programming targets and the multiplier weight. On the RASP 2.9v, this routine improves the programming of current-mode VMMs to 6 bits of accuracy. Control of and communication with the RASP 2.9v can be provided by a USB connection to an AT91SAM7S microcontroller. Of course, the microcontroller can also communicate with onboard ADCs and DACs, which enables analog voltages on the FPAA to be set and read.
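A hedged numerical model of the floating-gate weighting described above follows; the values of κ and UT are typical sub-threshold assumptions, not measured parameters of the RASP 2.9v:

```python
# Hedged model of a single-transistor sub-threshold multiply: the programmed
# floating-gate charge offsets the gate voltage, which scales the mirrored
# current exponentially. Parameter values are assumptions.
import math

UT = 0.0258      # thermal voltage at room temperature, volts
kappa = 0.7      # sub-threshold gate coupling coefficient (assumed)

def vmm_weight(delta_vfg):
    """Current gain of one floating-gate transistor given the programmed
    gate-voltage offset delta_vfg (volts): Iout = w * Iin."""
    return math.exp(kappa * delta_vfg / UT)

for dv in [-0.05, 0.0, 0.05]:
    print(f"dVfg = {dv:+.2f} V -> weight w = {vmm_weight(dv):.3f}")
```

The exponential dependence is why a small programmed charge can cover a wide range of multiplier weights with a single transistor.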
In some embodiments, the interface with the microcontroller can be provided by a suite of commands written in, for example, Mathworks MATLAB®. Similar scripts written in MATLAB® can also enable the programming and testing to be automated. In other embodiments, a chain of tools for the RASP chips can enable the user to convert an entire library of functions into circuits, for example, and then to place and route these circuits on the RASP 2.9v.
Multiple LCA systems were implemented on the RASP 2.9v and are discussed below. The smaller of these was a single-ended 2×3 system (two inputs, three outputs), built for illustrative purposes. Because the input vector preferably lies on the unit circle, the input in practice had only one degree of freedom, making the results easier to display. Its dictionary was a fixed 2×3 matrix. A larger single-ended 4×6 system, with a fixed 4×6 dictionary, was also implemented to demonstrate the scalability of the system architecture. The six dictionary elements were chosen to fully span the input domain and to observe the restricted isometry property (RIP), i.e., where the eigenvalues of the matrix are restricted to a certain range. While a matrix of random Gaussian variables is typically used to satisfy the RIP, the dimensions were small enough here that a set matrix could do so more easily.
In addition to the necessary VMMs and current mirrors, on-chip 8-bit current DACs can be programmed to allow control of the input currents. These inputs can be normalized to a ratio of 60 nA:1. The threshold current Iλ was programmed to 6 nA, which results in a tradeoff parameter of λ=0.1. Each soft threshold node can be implemented with multiple output transistors. A first can be used to drive the rest of the circuit. A second can be used as a system output. Using the dynamic switches, both individual components and the complete system can be easily tested. In some embodiments, on-chip current DACs can be used to inject currents with a constant l2-norm into the circuit. For the 2×3 network, the input can be swept on the unit circle. For the 4×6 network, 100 randomly generated inputs can be used. For both systems, the input currents, the outputs of the feedforward VMMs (with and without thresholding), the outputs of the recurrent VMMs, and the system outputs can be separately measured.
Accuracy of Results
In order to verify the accuracy of the analog LCA, the inputs can be run through l1-ls, a known digital sparse approximation algorithm. For both the 2×3 and 4×6 systems, the solution produced by the hardware network was very similar to that produced by the digital solver.
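As a side check, the RIP-style conditioning discussed above can be examined numerically. In the sketch below, the random 4×6 dictionary is a stand-in, not the fixed dictionary actually programmed on the chip:

```python
# Illustrative check of RIP-style conditioning: the eigenvalues of Phi^T Phi,
# restricted to sparse supports, should lie in a bounded range. The random
# dictionary here is an assumed placeholder.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
Phi = rng.standard_normal((4, 6))
Phi /= np.linalg.norm(Phi, axis=0)   # unit-norm dictionary columns

S = 2                                # sparsity level to test
extremes = [np.linalg.eigvalsh(Phi[:, list(idx)].T @ Phi[:, list(idx)])
            for idx in combinations(range(6), S)]
lo = min(e[0] for e in extremes)
hi = max(e[-1] for e in extremes)
print(f"eigenvalues of S={S} sub-Grams lie in [{lo:.2f}, {hi:.2f}]")
```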
Power and Scaling
The power used by the RASP 2.9v implementation of the LCA is dominated by two terms: (1) overhead (703 μA is used by the FPAA even without programming) and (2) 20 μA for the high-speed current-to-voltage converter. The remainder of the current flow can be accounted for with the OTAs, since every source-to-sink path in the LCA passes through at least one OTA. The OTAs are differential pairs with a double current mirror, so they naturally use twice their bias current regardless of whether they source or sink any current. Because every signal in the LCA chip sinks to an OTA, however, all the active currents in the chip can simply be summed to find the total additional power used. Each VMM input requires an OTA. In addition, each current mirror for the inputs requires an OTA (and they sink twice the input current). The soft thresholder requires two OTAs, and sinks twice the lateral inhibition Ha, twice the threshold current λ, and twice the output a. The total current used by the system is therefore given by Eq. 6, where IV is the bias current of the VMM OTAs, and IM is the bias current of the mirror OTAs.
In both the 2×3 and 4×6 networks, IM was set to 500 nA. This current was sufficient to sink three 60 nA currents (the third being used only when the node is directly measured) while maintaining a high OTA transconductance. IV was also set to 500 nA in the 2×3 network, and to 800 nA in the 4×6 network. Excluding overhead, therefore, the active circuits of the 2×3 LCA had a total current of 11.8 μA, with small variations depending on the signals being passed. This is actually slightly less than the 13 μA that would be expected from Eq. 6. Similarly, the total current used by the 4×6 system was only 31.1 μA, which is also somewhat less than the 32 μA predicted by Eq. 6. These discrepancies are most likely due to small inaccuracies in the bias current programming. As discussed below, the OTAs must have a bias current large enough to sink or source all the appropriate currents while maintaining a high transconductance.
Temporal Evolution of the System
The convergence curves varied considerably from predicted LCA dynamics. Theoretical analysis and simulations of the LCA's temporal evolution show exponential convergence for active nodes in less than 10τ. The theoretical upper bound on convergence time, on the other hand, is proportional to τ/γ, where τ is the RC time constant, and γ is the smallest eigenvalue of the active subspace of the matrix Φ (i.e., the same term that determines error amplification above). The input resistance R can be derived from the small-signal model (Eq. 7). For small IIN, the first term of Eq. 7 dominates the dynamics. For IIN > IAσ/2 ≈ 3 nA, the second term in Eq. 7 dominates, and the system acts as a low-pass filter with RC time constant τ = 2CLUT/(κIA). To make this the dominant pole in the LCA system, the load capacitance CL can be made extremely large (e.g., higher than 50 pF) by shorting it to a chip pad, for example. The capacitance, on the other hand, could be reduced to approximately 2-3 pF (i.e., the capacitance of a vertical routing wire) at the cost of slightly altering the LCA dynamics. In some embodiments, this could speed up convergence times by a factor of 10 or more. In addition to the approximately 240 μs required for convergence, each 8-bit input DAC takes approximately 5.8 μs to load, and reading an output node requires 520 ns, adding about 26 μs for interfacing. These costs are imposed by the microcontroller, however, and are not inherent to the RASP 2.9v. As the system scales, the convergence time is expected to scale with the time constant τ = 2CLUT/(κIA). In this equation, only the load capacitance CL will increase with scale, at roughly O(N). Because CL is already much larger than necessary, however, a custom-built large-N implementation would actually be expected to converge more quickly.
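The interfacing costs above can be checked with simple arithmetic. The sketch below (illustrative only, not code from the disclosure) reproduces the approximately 26 μs figure for the 4×6 system and, as discussed in the conclusion, approximately 4.4 ms for a hypothetical 666×1000 system:

```python
# Consistency check of the interfacing costs quoted above: each 8-bit input
# DAC takes ~5.8 us to load and each output node read takes ~520 ns.
T_DAC_LOAD = 5.8e-6   # seconds per input DAC load
T_READ = 520e-9       # seconds per output node read

def interface_time(M, N):
    return M * T_DAC_LOAD + N * T_READ

print(f"4x6 system: {interface_time(4, 6) * 1e6:.1f} us")       # ~26 us
print(f"666x1000:   {interface_time(666, 1000) * 1e3:.2f} ms")  # ~4.4 ms
```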
Spiking Solutions for Sparse Computing
Sparse approximation has recently been suggested as a model for sensory coding in the human brain, according to the hypothesis that the brain attempts to make efficient use of computational resources.2 In the sparse coding model for the primary visual cortex, for example, a small subset of learned dictionary elements can encode most natural images.3 This sparse basis is mapped directly to neural activity, so only a small subset of the cortical neurons need be active to represent the high dimensional visual inputs. As discussed above, sparse approximation can also be used to recover linearly compressed signals for compressed sensing applications.
2 See, e.g., H. B. Barlow, Possible principles underlying the transformation of sensory messages, in: W. Rosenblith (Ed.), Sensory Communication, M.I.T. Press, Cambridge, Mass., 1961.
3 See B. Olshausen, D. Field, Emergence of simple-cell receptive field properties by learning a sparse code for natural images, Nature 381 (1996); B. Olshausen, M. Lewicki, et al., Sparse codes and spikes, Probabilistic Models of the Brain: Perception and Neural Function (2001).
Impact of a Spiking Implementation
Inspired in part by the brain's extreme computational efficiency and by recent advances in implementing neurons in silicon, in some embodiments, a spiking neural network can be used for solving sparse approximation problems. This approach has several benefits relative to both the digital and the non-spiking analog methods described above. As discussed above, an analog system offers considerable power savings relative to digital solutions for the portions of the optimization that rely on linear computation. Analog vector-matrix multipliers (VMMs), for example, are several orders of magnitude more computationally efficient than comparable digital multipliers. The power in these systems is proportional to the maximum possible signal, however, and in a system where the output is expected to be sparse this can be wasteful, since few signals will be nonzero. As a result, in some embodiments, a rate-based spiking system can be used to leverage the sparsity of the signals. In other words, by following the lead of sparse neural coding activity and mapping the sparsity of the input to neural activity, both the number of spiking neurons and their spike rates can be minimized. This, in turn, minimizes total power consumption, since synapses only consume power when they spike.
A spiking system could be used, for example and not limitation, for compressed sensing applications. Compressed sensing has led to the design of new coded aperture sensing systems that, for example, require many fewer measurements to collect data at a specified resolution. A spiking system could be used to recover the compressed signal very quickly, however, virtually eliminating the post-processing that has become the accepted bottleneck (e.g., in medical imaging). Alternatively, the spiking system could be optimized for low power, allowing compressed sensing techniques to be used for channel sensing in portable devices where power concerns outweigh processing speed.
Description of the Neuronal Architecture
As discussed above, the LCA is described by a system of nonlinear ordinary differential equations (ODEs). In order to convert it to a spiking system, each component of the system can be analyzed to find neuronal equivalents.
Converting LCA to a Spiking Architecture
To create a spiking network, ideal Integrate and Fire (IF) neurons can be used to compute the nonlinear portions of Eq. 4, above.
A stochastic rate model of the neurons can be used, for example, and spikes can be generated using an instantaneous spike rate (or intensity) depending on the time-varying input to the neuron. The intensity of the entire population of neurons â(t) can be used to encode the system output. The gain function of the IF neurons, for example, can be derived by analyzing the normalized neural potential v(t) as a function of the normalized current input u(t):
dv(t)/dt = u(t), where v(t) = (V(t) − V0)/(VTH − V0) and u(t) = IIN(t)/(C(VTH − V0)) (Eq. 10)
where VTH and V0 are the threshold and reset potentials of the neuron, and C is its capacitance. When the voltage reaches the threshold (v(t)=1), therefore, the neuron emits a spike at that time and resets the voltage. Of course, the neuron will only spike if the input is positive, which means that the neuron conveniently acts as a natural rectifier. The inter-spike interval (ISI) at steady state will be approximately 1/u, and the intensity â(t) = max(u(t), 0). By adding a small negative offset λ to the input current, Eq. 10 becomes:
dv(t)/dt = u(t) − λ (Eq. 11)
and the firing rate becomes â(t) = max(u(t) − λ, 0), which is the soft threshold operator used in the LCA.
The linear portion of the network can be generated with synaptic connections. These synapses have a linear response to each incoming spike, with kernel α(t) = U(t)e−t/τ, where U(t) is the Heaviside step function, and τ is the synaptic time constant. The synapses can be arbitrarily weighted and their outputs can be shorted together (i.e., their currents can be summed via Kirchhoff's Current Law), enabling the recurrent matrix ΦTΦ − I to be created. The normalized input to neuron i can then be set to:
ui(t) = bi − Σj Hi,j Σk α(t − tj,kFB) (Eq. 12)
where b = ΦTy is the driving input, H = ΦTΦ − I, and tj,kFB is the kth spike time of neuron j. By randomizing the initial states of the neurons, the normalized expectation of neuron j producing a spike at time t can be defined as the instantaneous rate âj(t), or intensity, and the expectation E[ui(t)] can be defined accordingly. A Laplace transform of both sides can be performed, then divided by the filter α(s) = 1/(1 + sτ). Performing the inverse Laplace transform and converting to matrix form then yields:
τ du(t)/dt + u(t) = b − Hâ(t) (Eq. 15)
Eqs. 11 and 15 can then be combined to create the spiking LCA system:
τ du(t)/dt + u(t) = b − Hâ(t), with âi(t) = max(ui(t) − λ, 0) (Eq. 16)
An example of a particular spiking LCA system is shown in the accompanying figures.
Hardware Implementation
System Components
In order to implement a dense spiking LCA on the same RASP 2.9v used above, the design can be limited to using the available CAB elements of the chip. The most complex system component is the IF neuron, which is based on the Axon-Hillock circuit. The feedback produces a change of roughly VDD on VIN. In order to ramp back up to VTH and produce a spike, therefore, IIN needs to produce a charge of QRAMP ≈ C·VDD. On the RASP 2.9v, for example, the smallest explicit capacitors are 500 fF, and VDD = 2.4 V. This leads to a ramp time of tRAMP = QRAMP/IIN = 1.2 pC/IIN. In a preferred embodiment, the neurons would have a fixed refractory period while the voltage is reset. In this case, however, the RASP 2.9v has insufficient transmission gates to cut off the incoming current during reset at the desired density, so a modified circuit can be used instead. The IF neuron can utilize four explicit nFETs. On the RASP 2.9v, for example, there are 18 CABs that contain at least four, which effectively limits the density of the system in this configuration to 18 neurons.
The synaptic grids can perform the operation τ dh(t)/dt + h(t) = Hâ(t) to compute the recurrent inhibition. As before, the feedback mechanism is similar to that of a Hopfield network. In other words, there is no feedback from one node to itself (Hi,i=0) and the feedback between nodes is symmetric (Hi,j=Hj,i).
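A minimal simulation sketch follows (normalized units; the time step, simulation window, and λ value are assumptions) showing that an ideal non-leaky IF neuron with a negative input offset realizes the soft threshold described above:

```python
# Sketch of the IF gain function in normalized units: with a negative offset
# lam on the input, the steady-state firing rate approximates the LCA's
# soft threshold max(u - lam, 0). Step size and duration are assumptions.
def if_rate(u, lam=0.1, dt=1e-3, T=20.0):
    """Simulate a normalized non-leaky IF neuron; return its spike rate."""
    v, spikes = 0.0, 0
    for _ in range(int(T / dt)):
        v += (u - lam) * dt      # dv/dt = u - lam (normalized Eq. 11)
        if v >= 1.0:             # threshold v(t) = 1: spike and reset
            spikes += 1
            v = 0.0
        v = max(v, 0.0)          # the neuron acts as a natural rectifier
    return spikes / T

for u in [0.05, 0.1, 0.5, 1.0]:
    print(f"u = {u:.2f} -> rate ~ {if_rate(u):.3f}, "
          f"soft threshold = {max(u - 0.1, 0.0):.3f}")
```

Inputs below the offset produce no spikes at all, which is exactly the behavior that lets a sparse output consume almost no power.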
Because the gates of the FGEs are not locally accessible, however, a modified synapse topology can be used. A VMM can act as the feedforward multiplier, performing the linear operation b = ΦTy. Experimentally, this multiplication was performed digitally, and the current I+ was directly projected from the 18 8-bit current DACs to the positive input terminals of the neurons. There are several circuits on-chip, however, that could be used to perform the multiplication. One example is a current-mode VMM structure.4 This configuration has a small area, low power consumption, and an easily scalable design while operating in the sub-threshold region, and it fits easily on the RASP 2.9v.
4 See, e.g., C. Schlottmann, P. Hasler, A highly dense, low power, programmable analog vector-matrix multiplier: The FPAA implementation, 1 IEEE J. on Emerging and Selected Topics in Circuits and Systems 3, 403-411 (2011); S. Shapero, P. Hasler, Precise programming and mismatch compensation for low power analog computation on an FPAA, IEEE Trans. Circuits and Systems I, in press (both of which are incorporated herein by reference).
In some embodiments, the charge on each FGE can be programmed to set the weight of each scalar multiplication. A current-mode VMM, for example, can accept the input vector y from the current DACs and perform the operation b = ΦTy to compute the driving inputs to the neurons. Power consumption with this configuration is proportional to O(N√N) with the number of output nodes, however, which is not ideal. In an alternative embodiment, the VMM component can be implemented as a synaptic matrix like the recurrent multiplier. In this configuration, the spikes to drive the synapses are preferably generated on-chip. This could be accomplished, for example and not limitation, either by another bank of neurons or by spike generation circuits.5
5 See, e.g., J. Schemmel, D. Brüderle, K. Meier, B. Ostendorf, Modeling synaptic plasticity within networks of highly accelerated I&F neurons, IEEE International Symposium on Circuits and Systems (2007); S. Brink, S. Nease, S. Ramakrishnan, R. Wunderlich, P. Hasler, A. Basu, B. Degnan, A learning-enabled neuron array IC based upon transistor channel models of biological phenomena, accepted to IEEE Trans. in Biomedical Circuits and Systems (both of which are incorporated herein by reference).
Example 2
To test the configuration described above, a network of 18 neurons, with 12 driving inputs, can be implemented on the RASP 2.9v. This network enables the solution of BPDN for arbitrary 12×18 dictionaries of non-negative elements. In addition to the components discussed above, on-chip 8-bit current DACs can be used to inject vectors of currents onto the chip. Many of the nodes in the neurons required calibration. This is easily accomplished, however, using volatile switch lines to pass outputs to onboard ADCs and a picoammeter to measure voltages and currents, respectively. Conveniently, the volatile switch lines can also be used to process the system output. In other words, the spikes can be passed to a rapid ADC and the number of spikes in a 1 ms window, for example, can be counted and used to calculate a spike rate for each neuron. The solution to the sparse approximation problem was proportional to these spike rates (43 kHz:50 nA). The system can be validated by running trials for each sparsity level. For k=1 and k=2, 18 and 135 trials, respectively, were sufficient to exhaust the possible basis sets and vary the relative magnitudes of the components. For k=3 and k=4, 200 randomly chosen basis vectors with random values for each component can be used.
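The spike-rate readout described above can be sketched as follows. The spike train below is a made-up placeholder, and the rate-to-current constant is taken from the 43 kHz:50 nA proportionality reported above:

```python
# Illustrative decoding sketch: count spikes in a 1 ms window to estimate a
# neuron's rate, then map the rate to an output current. The spike times are
# placeholders, not measured data.
import numpy as np

RATE_TO_CURRENT = 50e-9 / 43e3   # amps per Hz, from the 43 kHz : 50 nA ratio

def decode(spike_times, window=1e-3):
    """Estimate a spike rate (Hz) and equivalent output current (A)."""
    rate = np.sum(np.asarray(spike_times) < window) / window
    return rate, rate * RATE_TO_CURRENT

spikes = np.cumsum(np.full(30, 1 / 21500.0))   # placeholder ~21.5 kHz train
rate, current = decode(spikes)
print(f"rate ~ {rate/1e3:.1f} kHz -> a ~ {current*1e9:.1f} nA")
```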
To assess the accuracy of the spiking solutions, they can be compared to a digital solution derived via an l1-ls algorithm.6
6 S.-J. Kim, K. Koh, M. Lustig, S. Boyd, D. Gorinevsky, A method for large-scale l1-regularized least squares, 1 IEEE Journal on Selected Topics in Signal Processing 4 (2007).
Results of the Fully Implemented System
Analysis of Results
As before, there are several ways of quantifying the performance of a sparse approximation system. For compressed sensing systems, for example, the goal is to recapture the sparse vector that was used to generate the input. It should be noted, however, that at k=4 the digital algorithm was also identifying the correct basis set in less than 60% of the trials. In addition, with a 12-dimensional input generated from 4 dictionary elements, it is debatable as to whether the input could still be considered sparse.
System Dynamics and Performance
Using the experimental setup herein, only one neuron can be recorded at a time. As a result, the same trial was run multiple times to measure the dynamics of the entire system. The results of one such exemplary trial demonstrate rapid convergence, indicating that the actual speed of computation is dominated by the measurement time. The spike rate could be measurably improved by moving to a more custom architecture, instead of the RASP 2.9v. The chip, while convenient for experimental purposes, has significant interconnect capacitances (about 1 pF) that could be eliminated with a custom chip. The custom 180 nm Spikey chip, for example, includes an array of over 300 neurons capable of firing at over 5 MHz.7 Leveraging these firing rates, a measurement window of 10 μs would give approximately the same relative quantization error seen here.
7 J. Schemmel, D. Brüderle, K. Meier, B. Ostendorf, Modeling synaptic plasticity within networks of highly accelerated I&F neurons, in: IEEE International Symposium on Circuits and Systems, 2007.
Power
The spiking LCA uses approximately 3 mW of power, or 1.26 mA at 2.4 V. Up to an additional 10 μA of current draw was observed depending on the output. As before, the majority of this power comes from chip overhead (the RASP 2.9v drains approximately 703 μA of current even when nothing is programmed). When none of the neurons spike, the rest of the power is consumed by the OTAs in the neurons. Of the 559 μA used, the vast majority, or 502 μA, is consumed by the second OTA in the comparator of the IF neurons. The remaining 57 μA, or 3.2 μA per neuron, is divided evenly between the other OTAs. The first OTA is part of the active current mirror.
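A quick arithmetic check (illustrative only) of the power budget quoted above:

```python
# Consistency check: overhead plus neuron OTA currents at a 2.4 V supply
# should reproduce the ~3 mW / ~1.26 mA figures quoted in the text.
V_DD = 2.4
I_OVERHEAD = 703e-6     # FPAA draw with nothing programmed
I_COMPARATOR = 502e-6   # second OTA in the IF-neuron comparators
I_OTHER_OTAS = 57e-6    # remaining OTAs (~3.2 uA per neuron x 18)

i_total = I_OVERHEAD + I_COMPARATOR + I_OTHER_OTAS
print(f"total current ~ {i_total*1e3:.2f} mA")        # ~1.26 mA
print(f"total power   ~ {i_total*V_DD*1e3:.2f} mW")   # ~3.0 mW
```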
Scaling
With 18 neurons, the spiking LCA is scaled to the maximum extent on the RASP 2.9v (i.e., the chip has 36 regular CABs, and each neuron requires 2). Indeed, the system described herein uses over 1,400 of the floating gates, representing the largest system synthesized on a RASP chip to date. Further scaling is required, however, to meet even the most basic requirements for sparse coding applications. This can be achieved with a more customized chip with dedicated synaptic and neural circuitry. In order to reach a size of 1000 neurons, for example, the chip would contain over one million synapses. Fortunately, as discussed above, this is easily implementable using current technology. The power benefits of the LCA increase as the system becomes larger. Replacing the second stage of the comparator with an inverter, for example, could substantially reduce the current consumed by each neuron. In addition, accuracy is expected to remain substantially the same as system size increases. Relative quantization error, for example, is independent of size, and synchronization errors should actually decrease as the number of active neurons increases. Similarly, gain error should marginally decrease as a larger dictionary better respects the RIP.
Up to N=1000, the convergence time is not expected to increase. This is because the convergence time scales with the LCA time constant, which here is equivalent to the synaptic time constant. The time constant would not be expected to meaningfully increase because the load capacitor of the synapses already spans the length of the chip, and this distance cannot become larger. Similarly, the measurement time is not expected to noticeably increase because, at a constant accuracy, the measurement window scales with the spiking frequencies of the neurons. At a larger system size, however, spiking would likely become faster because the interconnect capacitances in the neuron would be substantially eliminated. Using 130 nm technology, for example, if N becomes significantly larger than 1000, the chip could be increased in size or the spiking LCA could use multiple chips. In either case, capacitances would increase, increasing the synaptic time constant and, in turn, proportionately increasing convergence time.
For the measurement to be useful, the spike data is preferably retrieved from the chip in real time. An address-event representation (AER) system, for example, can be used to accomplish this task, but at the cost of significant extra power. Generally, this power is dominated by the cost of sending the address of each spike off-chip. Assuming a 10-bit address, a 50 pF load capacitance, and a 2.4 V supply, for example, each spike would use about 1.5 nJ. The total number of spikes scales no faster than the root of the number of active neurons k, O(√k). With a maximum input sparsity of approximately 60, this gives a maximum of 280,000 spikes per second, using 422 μW of power at peak activity. Even with the AER power added, therefore, the hypothetical thousand-neuron system compares extremely well with state-of-the-art digital BPDN implementations. Conventional BPDN can solve for N=1024 in 46 ms using an Intel i7 CPU. Estimating that this calculation required 1.2 GMACs over 46 ms, and that the i7 CPU calculates 7 GMAC/s/W, the estimated active power requirements for the calculation are approximately 3.8 W.8 This is approximately 500 times the active power used by the spiking LCA.
8 A. Borghi, J. Darbon, S. Peyronnet, T. F. Chan, S. Osher, A simple compressive sensing algorithm for parallel many-core architectures, Tech. rep., UCLA Computational and Applied Mathematics Technical Report (September 2008).
From the above discussion, and from Table 1, it is clear that a spiking LCA has advantages over the non-spiking system.
TABLE 1
Performance Comparison
System            12 × 18        666 × 1k         666 × 1k      1k
                  Spiking LCA    Spiking LCA      Analog LCA    CPU
                                 (Hypothetical)   (Hypoth.)
Power (Active)    1.34 mW        7.68 mW          149 mW        ≈3.8 W
Power (Total)     3.02 mW        9.79 mW          151 mW        ≈100 W
Time (Converge)   25 μs          ≈25 μs           ≈240 μs       46 ms
Time (Total)      1.03 ms        1.03 ms          4.62 ms       46 ms
Error (RMS)       4.8% (@ K=3)   ≈4.8%            ≈5%           —
Extra Cost (Avg)  1.7% (@ K=3)   ≈1.7%            ≈1%           —
While exhibiting similar scaling properties for accuracy and convergence time, for example, the spiking LCA exhibits superior power scaling. A scaled implementation of the spiking LCA would be an advantageous platform for quickly solving large sparse approximation problems or any other neural network application that can benefit from precise synaptic programming.
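The AER and CPU power estimates above can be reproduced with simple arithmetic. In the sketch below, the ½CV² per-bit energy model is an assumption used to match the quoted ~1.5 nJ figure:

```python
# Sketch reproducing the scaling estimates above: AER readout power for the
# hypothetical 1000-neuron system, and the digital CPU power estimate.
ADDR_BITS = 10
C_LOAD = 50e-12          # assumed load capacitance per address bit, farads
V_DD = 2.4
E_SPIKE = ADDR_BITS * 0.5 * C_LOAD * V_DD**2   # ~1.5 nJ per 10-bit address
SPIKES_PER_S = 280e3
print(f"energy/spike ~ {E_SPIKE*1e9:.2f} nJ")
print(f"AER power    ~ {E_SPIKE*SPIKES_PER_S*1e6:.0f} uW")  # ~400 uW
# (the application quotes 422 uW at peak activity)

MACS = 1.2e9             # estimated MACs for the N=1024 digital solve
T_SOLVE = 46e-3          # seconds
EFFICIENCY = 7e9         # MAC/s per watt for the CPU, per the text
print(f"CPU power    ~ {MACS/T_SOLVE/EFFICIENCY:.1f} W")    # ~3.8 W quoted
```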
CONCLUSION
Embodiments of the present invention, including the LCA analog circuit disclosed herein, can provide an efficient Hopfield network implementation, including solving sparse approximations with substantially identical accuracy to known digital solutions, but with vastly improved processing times. A pair of example circuits was implemented on the RASP 2.9v and successfully converged on results that were substantially identical to those of known digital solvers. This analog solution can be particularly useful for, for example and not limitation, low-powered applications, such as channel sensing for portable devices. Successful operation of the system at small sizes (N=6) has been demonstrated, and simulations demonstrate the potential value of the LCA for larger system sizes.
The RASP 2.9v described herein will allow moderate scaling of the LCA. The chip contains 18 8-bit DACs and enough stand-alone nFETs for 36 current mirrors. The thresholder nodes require two current mirrors, thus limiting the number of inputs M and outputs N to M + 2N ≤ 36. As a result, a practical maximum size of the current configuration is approximately 8×14, as shown in the sketch following this section. Scaling to this maximum size would not significantly impact total power output (which would still be dominated by overhead costs), and would only meaningfully impact the interface time to load and retrieve data (since convergence time is relatively fixed).
Scaling to much larger sizes (N≈1000) is easily achieved with multiple chips and/or application-specific chips. This hypothetical chip would require approximately one million FGEs, which is implementable given the technology disclosed herein. The RASP 2.9a is a 5 mm×5 mm, 350 nm process chip, for example, and contains 133,000 FGEs. Using a 130 nm process, on the other hand, would allow over one million FGEs on a chip the same size as the RASP 2.9v. At this scale, the convergence time would still not be expected to change markedly (since the capacitive load would not exceed that of a chip pad), as shown in the simulations. The total processing time would be dominated by interfacing costs, which would scale to approximately 4.4 ms. Improvements could be achieved by implementing some parallelization. Power consumption would be dominated by the O(N√N) scaling of the VMM OTAs, to about 149 mW. Accuracy would remain relatively constant, since the average error and average eigenvalue do not scale with problem size. These results are summarized in Table 2.
TABLE 2
Performance Comparison
System Size     2 × 3      4 × 6      666 × 1k      1k
                LCA        LCA        LCA (Hyp.)    CPU
Power (Active)  28.3 μW    74.6 μW    149 mW        ≈3.8 W
Power (Total)   1.76 mW    1.81 mW    151 mW        ≈100 W
Time (Cvg.)     240 μs                <240 μs       46 ms
Time (Total)    266 μs                4.62 ms       46 ms
Error (RMS)     2%         5%         ≈5%           —
Extra Cost      0.2%       1%         ≈1%           —
These hypothetical results compare extremely well with state-of-the-art digital BPDN implementations. Conventional BPDN solutions, for example, consume more than 25 times more power than the LCA. The LCA could also be increased to 4000 nodes by using a full 2 cm×2 cm reticle. In this configuration, the LCA would have more than 16 million devices. Further scaling could be achieved in several ways including, for example and not limitation, multiple chips or a denser chip process. Although the circuit shown here had only single-sided inputs and outputs, multiple inputs/outputs (e.g., four-quadrant behavior) could also be easily implemented. Extra nodes can be added to represent negative outputs, for example, while negative multiplication can be induced by simply connecting the driving VMM outputs to the negative input of the thresholding device (or, for the recurrent VMM, connecting them to the positive input terminal). Of course, other configurations are possible and are contemplated herein.
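A trivial sketch (illustrative only) of the M + 2N ≤ 36 resource constraint described above:

```python
# Resource-constraint check: each thresholder node needs two current mirrors,
# so inputs M and outputs N must satisfy M + 2N <= 36 on this chip.
def fits(M, N, mirrors=36):
    return M + 2 * N <= mirrors

print(fits(8, 14))    # True  -> the ~8x14 practical maximum
print(fits(8, 15))    # False -> exceeds the available nFET mirrors
print(fits(12, 12))   # True  -> alternative shapes are possible
```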
The hardware LCA is easy to integrate into CS systems, for example, since it inherently contains mechanisms for rapid data interface. In addition, because the multipliers used here are reprogrammable, the system can also be used to recover arbitrary linear compressions of sparse signals (e.g., using a number of recovery methods). The multiplier weights can also be made to adapt to structure in the input and to learn more efficient dictionaries, enabling the system to be used even when the sparsity basis is unknown. In some embodiments, the system can be used in multiple applications including, but not limited to, CS recovery with ultra-low power and/or real-time processing.
Embodiments of the present invention can also comprise a Hopfield network comprising a spiking LCA network. The network can comprise, for example and not limitation, a network of 18 integrate and fire neurons and reconfigurable synapses, programmed on the RASP 2.9v. This spiking network can be configured to be computationally equivalent to the LCA. The fully implemented spiking LCA was able to converge on results in less than 25 μs. In addition, the system has superior power scaling properties relative to digital BPDN solutions and to non-spiking LCA implementations. Due to the extremely low power consumption, among other things, the spiking system can be advantageously used for high speed, low power applications, such as, for example and not limitation, channel sensing for portable devices.
While several possible embodiments are disclosed above, embodiments of the present invention are not so limited. For instance, while several possible configurations for the RASP 2.9v have been disclosed, other suitable reconfigurable or custom chips could be selected without departing from the spirit of embodiments of the invention. The system and method are described above as a system for solving sparse approximation problems. One skilled in the art will recognize, however, that embodiments of the present invention are equally applicable to other optimization problems such as, for example, QPs/LPs. In addition, the location and configuration of components used for various embodiments of the present invention can be varied according to a particular application or installation that requires a slight variation due to, for example, the materials used and/or space or power constraints. Such changes are intended to be embraced within the scope of the invention. The specific configurations, choice of materials, and the size and shape of various elements can be varied according to particular design specifications or constraints requiring a device, system, or method constructed according to the principles of the invention. Such changes are intended to be embraced within the scope of the invention. The presently disclosed embodiments, therefore, are considered in all respects to be illustrative and not restrictive. The scope of the invention is indicated by the appended claims, rather than the foregoing description, and all changes that come within the meaning and range of equivalents thereof are intended to be embraced therein.
ABSTRACT
A system and device for solving sparse algorithms using hardware solutions is described. The hardware solution can comprise one or more analog devices for providing fast, energy-efficient solutions to small, medium, and large sparse approximation problems. The system can comprise sub-threshold current mode circuits on a Field Programmable Analog Array (FPAA) or on a custom analog chip.
The system can comprise a plurality of floating gates for solving linear portions of a sparse signal. The system can also comprise one or more analog devices for solving non-linear portions of a sparse signal.
CLAIMS
1. A method comprising:
applying each of a plurality of input signals to each of a plurality of feedforward excitation signals to generate a plurality of first output signals;
applying each of a plurality of second output signals to each of a plurality of lateral inhibition signals to generate a plurality of recurrent feedback signals;
subtracting each of the plurality of recurrent feedback signals from each of the plurality of first output signals to generate a plurality of intermediate signals; and
applying each of the plurality of intermediate signals to a non-linear computation to generate the plurality of second output signals.
2. The method of claim 1, further comprising converting a first sparse vector of a plurality of sparse vectors to the plurality of input signals.
3. The method of claim 1, wherein the plurality of feedforward excitation signals are applied by a first plurality of transistors that comprise a first analog vector matrix multiplier (VMM).
4. The method of claim 1, wherein the plurality of lateral inhibition signals are applied by a second plurality of transistors that comprise a second analog vector matrix multiplier (VMM).
5. The method of claim 1, wherein the subtracting is performed by a plurality of current mirrors.
6. The method of claim 1, wherein each step is performed in parallel in continuous time for each input signal of the plurality of input signals.
7. The method of claim 1, wherein: the plurality of first output signals and the plurality of recurrent feedback signals are analog; and the plurality of second output signals are digital.
8. The method of claim 1, wherein one or more of the plurality of first output signals and the recurrent feedback signals change in response to a change in one or more of the plurality of second output signals.
9. The method of claim 1, wherein a change in one or more of the plurality of first output signals or the recurrent feedback signals acts as a low-pass filter.
10. An analog device comprising:
a plurality of first parallel linear computational devices for applying each of a plurality of input signals to each of a plurality of feedforward excitation signals to generate a plurality of first output signals;
a plurality of second parallel linear computational devices for applying each of a plurality of second output signals to each of a plurality of lateral inhibition signals to generate a plurality of recurrent feedback signals; and
a plurality of non-linear parallel computational devices for subtracting each of the plurality of recurrent feedback signals from each of the plurality of first output signals to generate a plurality of intermediate signals and applying each of the plurality of intermediate signals to generate the plurality of second output signals.
11. The device of claim 10, further comprising a plurality of digital-to-analog converters for converting a plurality of digital signals into the plurality of input signals.
12. The device of claim 10, wherein the plurality of first parallel linear computational devices comprises a first plurality of transistors forming a first analog vector matrix multiplier (VMM).
13. The device of claim 12, wherein each scalar multiplication in the first analog VMM requires only one of the first plurality of transistors.
14. The device of claim 12, wherein one or more of the first plurality of transistors are programmable.
15. The device of claim 10, wherein the plurality of second parallel linear computational devices comprises a second plurality of transistors forming a second analog VMM.
16. The device of claim 15, wherein each scalar multiplication in the second analog VMM requires only one of the second plurality of transistors.
17. The device of claim 15, wherein one or more of the second plurality of transistors are programmable.
18. The device of claim 10, wherein the plurality of non-linear parallel computational devices comprises a plurality of n-channel field effect transistors (nFETs).
19. The device of claim 10, further comprising one or more low-pass filters.
20. The device of claim 10, wherein each of the plurality of non-linear parallel computational devices comprises an individually tunable negative offset and an integrate and fire neuron.
21. The device of claim 20, wherein the integrate and fire neurons comprise non-leaky integrate and fire neurons.
22. A system comprising:
a field programmable analog array (FPAA) comprising:
a first plurality of transistors forming a first vector multiplication matrix (VMM) for applying each of a plurality of input signals to each of a plurality of feedforward excitation signals to generate a plurality of first output signals;
a second plurality of transistors forming a second vector multiplication matrix (VMM) for applying each of a plurality of second output signals to each of a plurality of lateral inhibition signals to generate a plurality of recurrent feedback signals; and
a plurality of modified current mirrors for subtracting each of the plurality of recurrent feedback signals from each of the plurality of first output signals to generate a plurality of intermediate signals and applying each of the plurality of intermediate signals to a non-linear computation to generate the plurality of second output signals.
23. The system of claim 22, further comprising a plurality of digital-to-analog converters for converting a first sparse vector from a plurality of sparse vectors into the plurality of input signals.
24. The system of claim 22, wherein each scalar multiplication in the first VMM or the second VMM requires only one transistor of the first or second plurality of transistors.
25. The system of claim 22, wherein each of the first and second plurality of transistors comprises one or more floating gates; and wherein the charge on each of the one or more floating gates determines the weight of the scalar multiplication produced by that transistor.
26. The system of claim 22, wherein each of the modified current mirrors comprises a negative offset current.
27. The system of claim 26, wherein the negative offset current is individually tunable for each of the modified current mirrors.
28. The system of claim 26, wherein the negative offset current is provided by a floating gate transistor; and wherein the charge of the floating gate transistor determines the magnitude of the negative current offset.
29. The system of claim 22, wherein the non-linear computations comprise integrate and fire neurons.
30. The system of