A Runtime Reconfigurable Design of Compute-in-Memory–Based Hardware Accelerator for Deep Learning Inference

Anni Lu; Xiaochen Peng; Yandong Luo; Shanshi Huang; Shimeng Yu

doi:10.1145/3460436

ACM Transactions on Design Automation of Electronic Systems ◽

10.1145/3460436 ◽

2021 ◽

Vol 26 (6) ◽

pp. 1-18

Author(s):

Anni Lu ◽

Xiaochen Peng ◽

Yandong Luo ◽

Shanshi Huang ◽

Shimeng Yu

Keyword(s):

Deep Learning ◽

Mapping Method ◽

System Level ◽

Chip Area ◽

Reconfigurable Design ◽

Trade Offs ◽

Input Side ◽

Performance Benchmark ◽

Extensive Computation ◽

High Flexibility

Compute-in-memory (CIM) is an attractive solution to the “memory wall” challenge posed by the extensive computation in deep learning hardware accelerators. In a custom ASIC design, a specific chip instance is restricted to a specific network at runtime. However, the hardware development cycle normally lags far behind the emergence of new algorithms. Although some of the reported CIM-based architectures can adapt to different deep neural network (DNN) models, few details about the dataflow or control have been disclosed to support that claim. An instruction set architecture (ISA) could provide high flexibility, but its complexity would be an obstacle to efficiency. In this article, a runtime reconfigurable design methodology for CIM-based accelerators is proposed to support a class of convolutional neural networks running on one prefabricated chip instance with ASIC-like efficiency. First, several design aspects are investigated: (1) the reconfigurable weight mapping method; (2) the input side of data transmission, mainly weight reloading; and (3) the output side of data processing, mainly reconfigurable accumulation. Then, a system-level performance benchmark is performed for the inference of different DNN models, such as VGG-8 on the CIFAR-10 dataset and AlexNet, GoogLeNet, ResNet-18, and DenseNet-121 on the ImageNet dataset, to measure the trade-offs between runtime reconfigurability, chip area, memory utilization, throughput, and energy efficiency.


EMC Component Modeling and System-Level Simulations of Power Converters: AC Motor Drives

Energies ◽

10.3390/en14061568 ◽

2021 ◽

Vol 14 (6) ◽

pp. 1568

Author(s):

Bernhard Wunsch ◽

Stanislav Skibin ◽

Ville Forsström ◽

Ivica Stevanovic

Keyword(s):

Power Converters ◽

Filter Design ◽

Magnetic Coupling ◽

Power Converter ◽

System Level ◽

Noise Propagation ◽

AC Motor ◽

Input Side ◽

Indispensable Tool ◽

Set Up

EMC simulations are an indispensable tool for analyzing EMC noise propagation in power converters and for assessing the best filtering options. In this paper, we first show how to set up EMC simulations of power converters and then demonstrate their use on an industrial AC motor drive. Broadband models of key power converter components are reviewed and combined into a circuit model of the complete power converter setup, enabling detailed EMC analysis. The approach is demonstrated by analyzing the conducted noise emissions of a 75 kW power converter driving a 45 kW motor. Based on the simulations, the critical impedances, the dominant noise propagation paths, and the most efficient filter component and its location within the system are identified. For the analyzed system, maxima of EMC noise are caused by resonances of the long motor cable and can be accurately predicted as functions of the type, length, and layout of the motor cable. The common-mode noise at the LISN is shown to have a dominant contribution caused by magnetic coupling between the noisy motor side and the AC input side of the drive. All the predictions are validated by measurements and highlight the benefit of simulation-based EMC analysis and filter design.


System-level power-performance trade-offs in task scheduling for dynamically reconfigurable architectures

Proceedings of the international conference on Compilers, architectures and synthesis for embedded systems - CASES '03 ◽

10.1145/951710.951722 ◽

2003 ◽

Cited By ~ 16

Author(s):

Juanjo Noguera ◽

Rosa M. Badia

Keyword(s):

Task Scheduling ◽

System Level ◽

Reconfigurable Architectures ◽

Power Performance ◽

Dynamically Reconfigurable ◽

Trade Offs


RISC-V Virtual Platform-Based Convolutional Neural Network Accelerator Implemented in SystemC

Electronics ◽

10.3390/electronics10131514 ◽

2021 ◽

Vol 10 (13) ◽

pp. 1514

Author(s):

Seung-Ho Lim ◽

WoonSik William Suh ◽

Jin-Young Kim ◽

Sang-Young Cho

Keyword(s):

Neural Network ◽

Deep Learning ◽

Network Model ◽

Neural Network Model ◽

Deep Neural Network ◽

System Level ◽

Neural Network Models ◽

Data Set ◽

Embedded Device ◽

Virtual Platform

Optimizing hardware processors and systems for deep learning operations such as Convolutional Neural Networks (CNNs) on resource-limited embedded devices is an active research area. To run an optimized deep neural network model using the limited computational units and memory of an embedded device, it is necessary to quickly apply various configurations of hardware modules to various deep neural network models and find the optimal combination. An Electronic System Level (ESL) simulator based on SystemC is very useful for rapid hardware modeling and verification. In this paper, we design and implement a Deep Learning Accelerator (DLA) that performs Deep Neural Network (DNN) operations on a RISC-V Virtual Platform implemented in SystemC, in order to enable rapid and diverse analysis of deep learning operations on embedded devices based on the RISC-V processor, a recently emerging embedded processor. The developed RISC-V based DLA prototype can analyze hardware requirements according to the CNN data set through the configuration of the CNN DLA architecture, and because RISC-V compiled software can run on the platform, it can execute a real neural network model such as Darknet. We ran the Darknet CNN model on the developed DLA prototype and confirmed that computational overhead and inference errors can be analyzed by evaluating the DLA architecture for various data sets.


DEEP LEARNING BASED FEATURE MATCHING AND ITS APPLICATION IN IMAGE ORIENTATION

ISPRS Annals of Photogrammetry Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-annals-v-2-2020-25-2020 ◽

2020 ◽

Vol V-2-2020 ◽

pp. 25-33 ◽

Cited By ~ 1

Author(s):

L. Chen ◽

F. Rottensteiner ◽

C. Heipke

Keyword(s):

Deep Learning ◽

Feature Matching ◽

Shape Estimation ◽

Feature Description ◽

Performance Benchmark ◽

Matching Performance ◽

Affine Shape ◽

Image Orientation ◽

Learned Features ◽

Better Than

Abstract. Matching images that contain large changes in viewpoint and viewing direction, and therefore large perspective differences, is still a very challenging problem. Affine shape estimation, orientation assignment and feature description algorithms based on detected hand-crafted features have been shown to be error prone. In this paper, affine shape estimation, orientation assignment and description of local features are achieved through deep learning. These three modules are trained with loss functions that optimize the matching performance of input patch pairs. The trained descriptors are first evaluated on the Brown dataset (Brown et al., 2011), a standard descriptor performance benchmark. The whole pipeline is then tested on images of small blocks acquired with an aerial penta camera, to compute image orientation. The results show that learned features perform significantly better than alternatives based on hand-crafted features.


An Enhanced System Level to Link Level Mapping Method for 3GPP LTE System Level Simulation

Communications in Computer and Information Science - Advanced Research on Computer Science and Information Engineering ◽

10.1007/978-3-642-21411-0_61 ◽

2011 ◽

pp. 371-377 ◽

Cited By ~ 6

Author(s):

Yuan Gao ◽

HongYi Yu

Keyword(s):

Mapping Method ◽

System Level ◽

3GPP LTE ◽

System Level Simulation


DLAG-TA: Deep Learning-Based Adaptive Grid Builder for System-Level Thermal Analysis

2021 20th IEEE Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (iTherm) ◽

10.1109/itherm51669.2021.9503154 ◽

2021 ◽

Author(s):

Wen-Sheng Lo ◽

Hong-Wen Chiou ◽

Shih-Chieh Hsu ◽

Yu-Min Lee

Keyword(s):

Thermal Analysis ◽

Deep Learning ◽

Adaptive Grid ◽

System Level


System-Level and Architectural Trade-offs

Architectures and Synthesizers for Ultra-low Power Fast Frequency-Hopping WSN Radios ◽

10.1007/978-94-007-0183-0_2 ◽

2011 ◽

pp. 19-43

Author(s):

Emanuele Lopelli ◽

Johan van der Tang ◽

Arthur van Roermund

Keyword(s):

System Level ◽

Trade Offs


Improved Parallel Legalization Schemes for Standard Cell Placement with Obstacles

Technologies ◽

10.3390/technologies7010003 ◽

2018 ◽

Vol 7 (1) ◽

pp. 3

Author(s):

Panagiotis Oikonomou ◽

Antonios Dadaliaris ◽

Kostas Kolomvatsos ◽

Thanasis Loukopoulos ◽

Athanasios Kakarountas ◽

...

Keyword(s):

Target Function ◽

Problem Formulation ◽

Search Space ◽

Standard Cell ◽

Chip Area ◽

Time Performance ◽

Trade Offs ◽

Cell Placement ◽

Placement Algorithm ◽

Algorithmic Approaches

In standard cell placement, a circuit is given consisting of cells with a standard height (but different widths), and the problem is to place the cells in the standard rows of a chip area so that no overlaps occur and some target function is optimized. The process is usually split into at least two phases. In the first phase, a global placement algorithm distributes the cells across the circuit area, while in the second, a legalization algorithm aligns the cells to the standard rows of the power grid and removes any overlaps. While a few legalization schemes have been proposed in the past for the basic problem formulation, few obstacle-aware extensions exist. Furthermore, they usually provide extreme trade-offs between time performance and optimization efficiency. In this paper, we focus on the legalization step in the presence of pre-allocated modules acting as obstacles. We extend two known algorithmic approaches, namely Tetris and Abacus, so that they become obstacle-aware. Furthermore, we propose a parallelization scheme to tackle the computational complexity. The experiments illustrate that the proposed parallelization method achieves good scalability, while it also efficiently prunes the search space, resulting in a superlinear speedup. Furthermore, this time performance comes at only a small cost (and sometimes even an improvement) in the typical optimization metrics.


Survey of Deep-Learning Approaches for Remote Sensing Observation Enhancement

Sensors ◽

10.3390/s19183929 ◽

2019 ◽

Vol 19 (18) ◽

pp. 3929 ◽

Cited By ~ 22

Author(s):

Grigorios Tsagkatakis ◽

Anastasia Aidini ◽

Konstantina Fotiadou ◽

Michalis Giannopoulos ◽

Anastasia Pentari ◽

...

Keyword(s):

Remote Sensing ◽

Deep Learning ◽

Large Body ◽

Super Resolution ◽

Imaging Systems ◽

Learning Approaches ◽

Language Understanding ◽

Learning Tasks ◽

Trade Offs ◽

Sensing Platforms

Deep Learning, and Deep Neural Networks in particular, have established themselves as the new norm in signal and data processing, achieving state-of-the-art performance in image, audio, and natural language understanding. In remote sensing, a large body of research has been devoted to the application of deep learning to typical supervised learning tasks such as classification. A smaller yet equally important effort has been devoted to addressing the challenges associated with the enhancement of low-quality observations from remote sensing platforms. Addressing such challenges is of paramount importance, both in itself, since high-altitude imaging, environmental conditions, and imaging system trade-offs lead to low-quality observations, and as a means to facilitate subsequent analysis, such as classification and detection. In this paper, we provide a comprehensive review of deep-learning methods for the enhancement of remote sensing observations, focusing on critical tasks including single and multi-band super-resolution, denoising, restoration, pan-sharpening, and fusion, among others. In addition to the detailed analysis and comparison of recently presented approaches, different research avenues which could be explored in the future are also discussed.


Minimizing the Effects of On-Chip Hotspots Using Multi-Objective Optimization of Flow Distribution in Water-Cooled Parallel Microchannel Heatsinks

Journal of Electronic Packaging ◽

10.1115/1.4048590 ◽

2020 ◽

Vol 143 (2) ◽

Author(s):

Yaser Hadad ◽

Vahideh Radmard ◽

Srikanth Rangarajan ◽

Mahdi Farahikia ◽

Gamal Refai-Ahmed ◽

...

Keyword(s):

Flow Distribution ◽

System Level ◽

Temperature Uniformity ◽

Liquid Cooling ◽

Two Phase ◽

Chip Area ◽

Chip Temperature ◽

Cooling Design ◽

On Chip ◽

Novel Concept

Abstract. The industry shift to multicore microprocessor architecture will likely cause higher temperature nonuniformity on chip surfaces, exacerbating the problem of chip reliability and lifespan. While advanced cooling technologies like two-phase embedded cooling exist, the technological risks of such solutions make conventional cooling technologies more desirable. One such solution is remote cooling with heatsinks with sequential conduction resistance from chip to module. The objective of this work is to numerically demonstrate a novel concept to remotely cool chips with hotspots and maximize chip temperature uniformity using an optimized flow distribution under constrained geometric parameters for the heatsink. The optimally distributed flow conditions presented here are intended to maximize the heat transfer from a nonuniform chip power map by actively directing flow to a hotspot region. The hotspot-targeted parallel microchannel liquid cooling design is evaluated against a baseline uniform-flow conventional liquid cooling design for the industry pressure drop limit of approximately 20 kPa. For an average steady-state heat flux of 145 W/cm² on core areas (hotspots) and 18 W/cm² on the remaining chip area (background), the chip temperature uniformity is improved by 10%. Moreover, the heatsink design improves chip temperature uniformity without any additional system-level complexity, which also reduces reliability risks.
