Deploying Multi-tenant FPGAs within Linux-based Cloud Infrastructure

Joel Mandebi Mbongue; Danielle Tchuinkou Kwadjo; Alex Shuping; Christophe Bobda

doi:10.1145/3474058

ACM Transactions on Reconfigurable Technology and Systems ◽

10.1145/3474058 ◽

2022 ◽

Vol 15 (2) ◽

pp. 1-31

Author(s):

Joel Mandebi Mbongue ◽

Danielle Tchuinkou Kwadjo ◽

Alex Shuping ◽

Christophe Bobda

Keyword(s):

Software Architecture ◽

Hardware Acceleration ◽

Maximum Frequency ◽

Cloud Infrastructure ◽

FPGA Design ◽

Data Movement ◽

Field Programmable ◽

Minimal Data ◽

On Chip ◽

Cloud Users

Cloud deployments now increasingly exploit Field-Programmable Gate Array (FPGA) accelerators as part of virtual instances. While cloud FPGAs are still essentially single-tenant, the growing demand for efficient hardware acceleration paves the way to FPGA multi-tenancy. It then becomes necessary to explore architectures, design flows, and resource management features that expose multi-tenant FPGAs to cloud users. In this article, we discuss a hardware/software architecture that supports provisioning space-shared FPGAs in Kernel-based Virtual Machine (KVM) clouds. The proposed hardware/software architecture introduces an FPGA organization that improves hardware consolidation and supports hardware elasticity with minimal data movement overhead. It also relies on VirtIO to decrease communication latency between hardware and software domains. Prototyping the proposed architecture with a Virtex UltraScale+ FPGA demonstrated near-specification maximum frequency for on-chip data movement and high throughput in virtual instance access to hardware accelerators. We demonstrate performance similar to single-tenant deployment while increasing FPGA utilization, which is one of the goals of virtualization. Overall, our FPGA design achieved about 2× higher maximum frequency than the state of the art and a bandwidth reaching up to 28 Gbps on a 32-bit data width.


An Efficient FPGA-Based Convolutional Neural Network for Classification: Ad-MobileNet

Electronics ◽

10.3390/electronics10182272 ◽

2021 ◽

Vol 10 (18) ◽

pp. 2272

Author(s):

Safa Bouguezzi ◽

Hana Ben Fredj ◽

Tarek Belabed ◽

Carlos Valderrama ◽

Hassene Faiedh ◽

...

Keyword(s):

Recognition Rate ◽

Hardware Acceleration ◽

Implementation Model ◽

Gate Arrays ◽

Proposed Model ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Computer Vision Applications ◽

On Chip ◽

Segmentation Image

Convolutional Neural Networks (CNNs) continue to dominate research in hardware acceleration using Field Programmable Gate Arrays (FPGAs), proving their effectiveness in a variety of computer vision applications such as object segmentation, image classification, face detection, and traffic sign recognition. However, deploying CNNs on FPGAs faces numerous constraints, including limited on-chip memory, CNN size, and configuration parameters. This paper introduces Ad-MobileNet, an advanced CNN model inspired by the baseline MobileNet model. The proposed model uses an Ad-depth engine, which is an improved version of the depth-wise separable convolution unit. Moreover, we propose an FPGA-based implementation model that supports the Mish, TanhExp, and ReLU activation functions. The experimental results using the CIFAR-10 dataset show that our Ad-MobileNet has a classification accuracy of 88.76% while requiring few computational hardware resources. Compared to state-of-the-art methods, our proposed method has a fairly high recognition rate while using fewer computational hardware resources. Indeed, the proposed model reduces hardware resources by more than 41% compared to the baseline model.
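For readers unfamiliar with the three activation functions named in the abstract, the following is a minimal floating-point sketch of their standard mathematical definitions. It is not the paper's FPGA implementation, which would more likely use fixed-point arithmetic or approximations; the function names and the saturation threshold of 20 are assumptions chosen for illustration.

```python
import math


def relu(x: float) -> float:
    """ReLU: max(0, x)."""
    return max(0.0, x)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def tanhexp(x: float) -> float:
    """TanhExp: x * tanh(exp(x)).

    For large x, tanh(exp(x)) is 1.0 in double precision, so x is returned
    directly to avoid overflow in exp (the threshold 20 is an illustrative choice).
    """
    if x > 20.0:
        return x
    return x * math.tanh(math.exp(x))
```

All three pass positive inputs through almost unchanged; Mish and TanhExp differ from ReLU mainly in allowing small negative outputs for negative inputs.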


ANALYSIS OF EFFECTS OF USING 9/7 WAVELET COEFFICIENTS IN MULTI-RESOLUTION ANALYSIS

SMART MOVES JOURNAL IJOSCIENCE ◽

10.24113/ijoscience.v2i1.68 ◽

2016 ◽

Vol 2 (1) ◽

Author(s):

Manish Sharma ◽

Prof. Sonu Lal

Keyword(s):

High Speed ◽

Utilization Efficiency ◽

Parallel Structure ◽

Discrete Wavelet ◽

Distributed Arithmetic ◽

FPGA Design ◽

Multi Resolution Analysis ◽

Field Programmable ◽

On Chip ◽

Area Efficient

Conventional distributed arithmetic (DA) is popular in field programmable gate array (FPGA) design, and it features on-chip ROM to achieve high speed and regularity. In this paper, we describe a high-speed, area-efficient 1-D discrete wavelet transform (DWT) using a 9/7 filter based on the new efficient distributed arithmetic (NEDA) technique. Being an area-efficient architecture free of ROM, multiplication, and subtraction, NEDA can also expose the redundancy existing in the adder array consisting of entries of 0 and 1. This architecture supports any size of image pixel value and any level of decomposition. The parallel structure has 100% hardware utilization efficiency.
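To make the ROM-based baseline concrete, here is a minimal sketch of conventional distributed arithmetic, the technique the abstract contrasts with NEDA. It is not the NEDA architecture itself. The sketch assumes integer coefficients and unsigned inputs for simplicity (real 9/7 filter coefficients would first be scaled to fixed point), and the names `build_da_rom` and `da_inner_product` are hypothetical.

```python
def build_da_rom(coeffs):
    """Precompute the lookup table: entry `addr` holds the sum of the
    coefficients whose index bit is set in `addr` (2**K entries for K taps)."""
    k = len(coeffs)
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << k)
    ]


def da_inner_product(rom, xs, bits):
    """Bit-serial inner product of unsigned `bits`-wide inputs `xs` with the
    coefficients baked into `rom`, using only lookups, shifts, and adds."""
    acc = 0
    for b in range(bits):
        # Gather bit b of every input into one ROM address.
        addr = 0
        for i, x in enumerate(xs):
            addr |= ((x >> b) & 1) << i
        # Weight the looked-up partial sum by the bit position.
        acc += rom[addr] << b
    return acc


if __name__ == "__main__":
    coeffs = [3, -1, 4, 2]   # illustrative integer taps, not 9/7 values
    samples = [5, 7, 1, 12]  # unsigned 4-bit inputs
    rom = build_da_rom(coeffs)
    print(da_inner_product(rom, samples, bits=4))
```

The ROM grows as 2^K with the number of taps K, which is the area cost that ROM-free schemes aim to avoid.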


xDNN: Inference for Deep Convolutional Neural Networks

ACM Transactions on Reconfigurable Technology and Systems ◽

10.1145/3473334 ◽

2022 ◽

Vol 15 (2) ◽

pp. 1-29

Author(s):

Paolo D'Alberto ◽

Victor Wu ◽

Aaron Ng ◽

Rahul Nimaiyar ◽

Elliott Delaye ◽

...

Keyword(s):

Neural Networks ◽

Power Efficiency ◽

Digital Signal ◽

FPGA Design ◽

Deep Convolutional Neural Networks ◽

Parametric Function ◽

Field Programmable ◽

Scale Down ◽

On Chip ◽

Numerical Precision

We present xDNN, an end-to-end system for deep-learning inference based on a family of specialized hardware processors synthesized on Field-Programmable Gate Arrays (FPGAs) and Convolutional Neural Networks (CNNs). We present a design optimized for low latency, high throughput, and high compute efficiency with no batching. The design is scalable and a parametric function of the number of multiply-accumulate units, on-chip memory hierarchy, and numerical precision. The design can be scaled down to produce a processor for embedded devices, replicated to produce more cores for larger devices, or resized to optimize efficiency. On a Xilinx Virtex UltraScale+ VU13P FPGA, we achieve 800 MHz, which is close to the Digital Signal Processing maximum frequency, and above 80% efficiency of on-chip compute resources. On top of our processor family, we present a runtime system enabling the execution of different networks for different input sizes (i.e., from 224×224 to 2048×1024). We present a compiler that reads CNNs from native frameworks (i.e., MXNet, Caffe, Keras, and TensorFlow), optimizes them, generates code, and provides performance estimates. The compiler combines quantization information from the native environment and optimizations to feed the runtime with code as efficient as any hardware expert could write. We present tools that partition a CNN into subgraphs to divide work between CPU cores and FPGAs. Notice that the software will not change when or if the FPGA design becomes an ASIC, making our work vertical and not just a proof-of-concept FPGA project. We show experimental results for accuracy, latency, and power for several networks. In summary, we can achieve up to 4 times higher throughput and 3 times better power efficiency than GPUs, and up to 20 times higher throughput than the latest CPUs. To our knowledge, we provide solutions faster than any previous FPGA-based solutions and comparable to any other top-of-the-shelf solutions.


A Heterogeneous Hardware Accelerator for Image Classification in Embedded Systems

Sensors ◽

10.3390/s21082637 ◽

2021 ◽

Vol 21 (8) ◽

pp. 2637

Author(s):

Ignacio Pérez ◽

Miguel Figueroa

Keyword(s):

Image Classification ◽

High Speed ◽

Hardware Acceleration ◽

Graphics Processors ◽

Embedded Processor ◽

Gate Arrays ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Computationally Intensive ◽

On Chip

Convolutional neural networks (CNNs) have been extensively employed for image classification due to their high accuracy. However, inference is a computationally intensive process that often requires hardware acceleration to operate in real time. For mobile devices, the power consumption of graphics processors (GPUs) is frequently prohibitive, and field-programmable gate arrays (FPGAs) become a solution to perform inference at high speed. Although previous works have implemented CNN inference on FPGAs, their high utilization of on-chip memory and arithmetic resources complicates their application on resource-constrained edge devices. In this paper, we present a scalable, low-power, low-resource-utilization accelerator architecture for inference on the MobileNet V2 CNN. The architecture uses a heterogeneous system with an embedded processor as the main controller, external memory to store network data, and dedicated hardware implemented on reconfigurable logic with a scalable number of processing elements (PEs). Implemented on an XCZU7EV FPGA running at 200 MHz and using four PEs, the accelerator infers with 87% top-5 accuracy and processes an image of 224×224 pixels in 220 ms. It consumes 7.35 W of power and uses less than 30% of the logic and arithmetic resources used by other MobileNet FPGA accelerators.
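The latency and power figures above can be turned into a frame rate and a per-image energy with simple arithmetic. The sketch below assumes images are processed back to back and that the reported 7.35 W is drawn for the whole 220 ms; neither assumption is stated in the abstract, and the helper name `per_image_metrics` is hypothetical.

```python
def per_image_metrics(latency_s: float, power_w: float):
    """Return (images per second, joules per image) for a given
    per-image latency and average power draw."""
    fps = 1.0 / latency_s
    energy_j = power_w * latency_s
    return fps, energy_j


if __name__ == "__main__":
    fps, energy = per_image_metrics(latency_s=0.220, power_w=7.35)
    print(f"{fps:.2f} images/s, {energy:.3f} J per image")
```

Under those assumptions the accelerator handles roughly 4.5 images per second at about 1.6 J per image.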


Future Field Programmable Gate Array (FPGA) Design Methodologies and Tool Flows

10.21236/ada492273 ◽

2008 ◽

Cited By ~ 3

Author(s):

Michael Wirthlin ◽

Brent Nelson ◽

Brad Hutchings ◽

Peter Athanas ◽

Shawn Bohner

Keyword(s):

Field Programmable Gate Array ◽

FPGA Design ◽

Design Methodologies ◽

Field Programmable ◽

Gate Array


Reconfigurable field‐programmable gate array‐based on‐chip learning neuromorphic digital implementation for nonlinear function approximation

International Journal of Circuit Theory and Applications ◽

10.1002/cta.3075 ◽

2021 ◽

Author(s):

Morteza Gholami ◽

Edris Zaman Farsa ◽

Gholamreza Karimi

Keyword(s):

Field Programmable Gate Array ◽

Function Approximation ◽

Nonlinear Function ◽

Digital Implementation ◽

Field Programmable ◽

Gate Array ◽

On Chip ◽

Nonlinear Function Approximation


Hardware-Accelerated Dual-Split Trees

Proceedings of the ACM on Computer Graphics and Interactive Techniques ◽

10.1145/3406185 ◽

2020 ◽

Vol 3 (2) ◽

pp. 1-21

Author(s):

Daqi Lin ◽

Elena Vasiou ◽

Cem Yuksel ◽

Daniel Kopta ◽

Erik Brunvand

Keyword(s):

Ray Tracing ◽

Hardware Acceleration ◽

Memory Storage ◽

Compact Representation ◽

Space Partitioning ◽

Data Movement ◽

Bounding Volume ◽

Bounding Boxes ◽

Split Trees ◽

Bounding Volume Hierarchies

Bounding volume hierarchies (BVHs) are the most widely used acceleration structures for ray tracing due to their high construction and traversal performance. However, the bounding planes shared between parent and children bounding boxes are an inherent storage redundancy that limits further improvement in performance due to the memory cost of reading these redundant planes. Dual-split trees can create the same space partitioning as BVHs, but in a compact form that uses less memory by eliminating the redundancies of the BVH structure representation. This reduction in memory storage and data movement translates to faster ray traversal and better energy efficiency. Yet, the performance benefits of dual-split trees are undermined by the processing required to extract the necessary information from their compact representation. This involves bit manipulations and branching instructions, which are inefficient in software. We introduce hardware acceleration for dual-split trees and show that the performance advantages over BVHs are emphasized in a hardware ray tracing context that can take advantage of such acceleration. We provide details on how the operations needed for decoding dual-split tree nodes can be implemented in hardware and present experiments in a number of scenes with different sizes using path tracing. In our experiments, we have observed up to 31% reduction in render time and 38% energy saving using dual-split trees as compared to binary BVHs representing identical space partitioning.


Comparative analysis of soft and hard on-chip interconnects for field-programmable gate arrays

IET Computers & Digital Techniques ◽

10.1049/iet-cdt.2011.0169 ◽

2012 ◽

Vol 6 (6) ◽

pp. 396-405 ◽

Cited By ~ 2

Author(s):

J.Y. Hur ◽

M.A. Wahlah ◽

L. Mhamdi ◽

K. Goossens

Keyword(s):

Comparative Analysis ◽

Field Programmable Gate Arrays ◽

Gate Arrays ◽

Field Programmable ◽

Programmable Gate Arrays ◽

On Chip


A System-On-Chip Approach in Designing a Dedicated RISC Microcontroller Unit Using the Field-Programmable Gate Array

2010 Fifth International Conference on Systems ◽

10.1109/icons.2010.40 ◽

2010 ◽

Author(s):

Elena Roxana Buhus ◽

Alexandru Lazar ◽

Adriano Tavares

Keyword(s):

Field Programmable Gate Array ◽

System On Chip ◽

Field Programmable ◽

Gate Array ◽

On Chip ◽

Microcontroller Unit


EFFICIENT QRS COMPLEX DETECTION ALGORITHM IMPLEMENTATION ON SOC-BASED EMBEDDED SYSTEM

Jurnal Teknologi ◽

10.11113/jt.v78.9450 ◽

2016 ◽

Vol 78 (7-5) ◽

Cited By ~ 1

Author(s):

Muhammad Amin Hashim ◽

Yuan Wen Hau ◽

Rabia Baktheri

Keyword(s):

Embedded System ◽

Detection Algorithm ◽

Detection Accuracy ◽

QRS Complex ◽

QRS Detection ◽

QRS Complex Detection ◽

Moving Windows ◽

Field Programmable ◽

Complex Detection ◽

On Chip

This paper studies two Electrocardiography (ECG) preprocessing algorithms, namely the Pan and Tompkins (PT) and Derivative Based (DB) algorithms, which are crucial for QRS complex detection in cardiovascular disease detection. Both algorithms are compared in terms of QRS detection accuracy and computation time, implemented as embedded software on a System-on-Chip (SoC) based embedded system prototyped on the Altera DE2-115 Field Programmable Gate Array (FPGA) platform. Both algorithms are tested with 30 minutes of ECG data from each of 48 different patient records obtained from the MIT-BIH arrhythmia database. Results show that the PT algorithm achieves 98.15% accuracy with 56.33 seconds of computation, while the DB algorithm achieves 96.74% with only 22.14 seconds of processing time. Based on the study, an optimized PT algorithm with an improved Moving Windows Integrator (MWI) is proposed to accelerate its computation. Results show that the proposed optimized Moving Windows Integrator algorithm achieves a 9.5 times speedup over the original MWI while retaining its QRS detection accuracy.
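The abstract does not say how the Moving Windows Integrator was optimized. As an illustration only, the sketch below contrasts a direct windowed sum with a running-sum formulation, one common way to speed up a moving-window integrator; it should not be read as the authors' method. The function names and the zero-padded start-up behavior are assumptions.

```python
def mwi_naive(samples, window):
    """Moving-window integration by re-summing the window for every
    output sample: O(window) work per sample."""
    out = []
    for n in range(len(samples)):
        start = max(0, n - window + 1)
        out.append(sum(samples[start:n + 1]) / window)
    return out


def mwi_running(samples, window):
    """Same output using a running sum: add the newest sample and drop
    the one that left the window, O(1) work per sample."""
    out = []
    total = 0
    for n, x in enumerate(samples):
        total += x
        if n >= window:
            total -= samples[n - window]
        out.append(total / window)
    return out


if __name__ == "__main__":
    data = [1, 4, 9, 16, 25, 36]  # illustrative integer samples
    assert mwi_naive(data, 3) == mwi_running(data, 3)
    print(mwi_running(data, 3))
```

With integer samples the two versions produce identical results, so the running-sum form changes only the cost per sample, not the detection output.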
