Proceedings Article | 30 September 2024
KEYWORDS: Field programmable gate arrays, Image processing, Computer hardware, Energy efficiency, Computer architecture, Computer vision technology, Power consumption
Deep Neural Network (DNN) algorithms have become ubiquitous within the vision domain, encompassing tasks such as object detection, segmentation, and classification. However, executing complex DNNs in real-time systems demands improved energy efficiency, runtime, and accuracy. Traditional embedded imaging designs, typically implemented on homogeneous architectures, face hardware limitations, prompting the need for heterogeneous computing architectures. These architectures combine CPUs, GPUs, FPGAs, and other accelerators, enabling applications to use the most efficient processor for a given algorithm. The challenge lies in scheduling and partitioning algorithms across accelerators with different computing paradigms and toolsets, which requires balancing computational power, memory bandwidth, and communication overhead. Effective scheduling must account for task dependencies, resource availability, and synchronisation. Current deep learning libraries often target a single architecture and lack mechanisms to intelligently partition sub-operations across the most suitable processors.
This paper introduces a scheduler for heterogeneous vision systems that finely partitions and maps sub-operations of convolutional neural networks and image processing algorithms. Leveraging state-of-the-art compiler frameworks such as PyTorch, TVM, and ONNX, the proposed scheduler optimally distributes tasks across heterogeneous components. Experimental results show that the heterogeneous platform achieves average improvements of 1.12× in kernel runtime and 1.08× in energy consumption over the best-performing discrete hardware counterparts, the GPU and FPGA. The study demonstrates that partitioning algorithms according to their runtime and energy characteristics and scheduling them optimally improves energy and runtime efficiency compared to a homogeneous component executing the complete algorithm.
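To make the scheduling idea concrete, the following Python sketch (an illustrative example, not the paper's published implementation) greedily assigns each sub-operation to the accelerator with the lowest weighted runtime-energy cost. The operator names, device set, and cost figures are hypothetical placeholders; in practice such values would come from profiling the kernels produced by backends such as PyTorch, TVM, or ONNX Runtime.

# Illustrative sketch: assign each sub-operation of a CNN to the device
# minimising a weighted combination of runtime and energy.
# All profile numbers below are hypothetical (runtime in ms, energy in mJ).
PROFILES = {
    "conv2d":    {"CPU": (9.0, 45.0), "GPU": (1.2, 18.0), "FPGA": (1.5, 9.0)},
    "batchnorm": {"CPU": (0.8,  3.0), "GPU": (0.3,  2.5), "FPGA": (0.4, 1.2)},
    "relu":      {"CPU": (0.2,  0.8), "GPU": (0.1,  0.9), "FPGA": (0.1, 0.4)},
    "gemm":      {"CPU": (6.0, 30.0), "GPU": (0.9, 14.0), "FPGA": (1.8, 8.0)},
}

def schedule(ops, alpha=0.5):
    """Map each op to the device minimising alpha*runtime + (1 - alpha)*energy."""
    mapping = {}
    for op in ops:
        costs = PROFILES[op]
        mapping[op] = min(
            costs,
            key=lambda dev: alpha * costs[dev][0] + (1 - alpha) * costs[dev][1],
        )
    return mapping

if __name__ == "__main__":
    # Sub-operations of a small convolutional block, scheduled with equal
    # weight on runtime and energy.
    print(schedule(["conv2d", "batchnorm", "relu", "gemm"]))

A production scheduler would additionally model task dependencies and the communication overhead of moving intermediate tensors between devices, as discussed above; this sketch only captures the per-operator cost-based partitioning step.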