Existing general-purpose computing devices such as CPUs and GPUs are limited both by thermal dissipation per unit area and by the yield associated with large chips.13, 14 The design of Application-Specific Integrated Circuits (ASICs) has substantially decreased the energy consumption per workload by limiting the operations supported on chip. An example of this is the first-generation tensor processing unit (TPU),15 which can perform inference for large convolutional neural networks in a datacenter in <10 ms with an idle power of 28 W and a workload power of 40 W. It may seem counterintuitive, then, that the limiting factor for the implementation of DNNs is not computation, but rather the energy and bandwidth associated with reading and writing data from memory, together with the energy cost of moving data inside the ASIC.15, 16 Several emerging technologies, such as in-memory computing17 and memristive crossbar arrays,18 promise increased performance, but these emerging architectures suffer from calibration issues and limited accuracy.19
Photonics as a field has had tremendous success in improving the energy efficiency of data interconnects.20 This has motivated the creation of optical neural networks (ONNs) based on 3D-printed diffractive elements,21 spiking neural networks utilizing ring resonators,22 reservoir computing,23 and nanophotonic circuits.24 However, these architectures have several issues. 3D-printed diffractive networks and schemes requiring spatial light modulators are non-programmable, meaning that they are unable to perform the task of training. Nanophotonic circuits allow an O(N²) array of interferometers to be programmed, providing passive matrix-vector multiplication. However, the large (≈1 mm²) size of on-chip electro-optic interferometers means that scaling to a 100×100 array would require 10,000 mm² of silicon, demonstrating the limitations of scaling this architecture. To date, no architecture has demonstrated high-speed (GHz) computation with N ≥ 10,000 neurons.
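To make this scaling limitation concrete, the short sketch below tabulates the silicon area an N×N interferometer mesh would require, using the ≈1 mm² per-device footprint and the N² device count quoted above (real meshes may use closer to N(N−1)/2 interferometers, so this is an illustration rather than an exact layout estimate).

```python
# Rough area-scaling estimate for a programmable interferometer mesh,
# assuming ~1 mm^2 per electro-optic interferometer and N^2 devices
# for an N x N weight matrix, as described in the text above.

AREA_PER_INTERFEROMETER_MM2 = 1.0  # approximate on-chip footprint

def mesh_area_mm2(n: int) -> float:
    """Silicon area (mm^2) needed for an n x n interferometer array."""
    return n * n * AREA_PER_INTERFEROMETER_MM2

for n in (10, 100, 1000):
    print(f"N = {n:>4}: ~{mesh_area_mm2(n):,.0f} mm^2 of silicon")
# N =  100 already requires ~10,000 mm^2, far larger than a single die.
```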
Here we present an architecture that is scalable to N ≥ 10⁶ neurons. The key mechanism of this architecture is balanced homodyne detection. We show that scaling the architecture to such a large size drastically reduces the energy cost per operation associated with its optical component, down to a bound set by shot noise on the receiving photodetectors, which leads to classification error. We call this bound a standard quantum limit (SQL); it reaches 100 zJ/MAC on problems such as MNIST. We also analyze the energy consumption using existing technologies and show that sub-fJ/MAC energy consumption should be possible.
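To put the 100 zJ/MAC figure in physical perspective, the sketch below converts that energy budget into an average photon count per MAC. The 1550 nm operating wavelength is an assumption made here for illustration, not a value stated in the text.

```python
# Illustrative photon-budget estimate for the ~100 zJ/MAC standard
# quantum limit quoted above. The 1550 nm wavelength is an assumption
# used only to convert energy into photon number.

H = 6.626e-34          # Planck constant, J*s
C = 2.998e8            # speed of light, m/s
WAVELENGTH = 1550e-9   # assumed operating wavelength, m

photon_energy_J = H * C / WAVELENGTH   # ~1.28e-19 J (~128 zJ) per photon
sql_energy_per_mac_J = 100e-21         # 100 zJ/MAC from the text

photons_per_mac = sql_energy_per_mac_J / photon_energy_J
print(f"photon energy  : {photon_energy_J * 1e21:.0f} zJ")
print(f"photons per MAC: {photons_per_mac:.2f}")
# Fewer than one photon per MAC on average, which is why shot noise on
# the homodyne detectors ultimately limits the classification accuracy.
```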
This paper is organized as follows: In Section 1 we discuss the function of this architecture as a matrix-matrix processor. In Section 2 we analyze the energy consumption of the architecture. In Section 3 we discuss methods for training and for extending the accelerator to a broader scope of problems, namely convolutional neural networks (CNNs).