A High-throughput Parallel Viterbi Algorithm via Bitslicing

In this work, we present a novel bitsliced high-performance Viterbi algorithm suitable for high-throughput and data-intensive communication. A new column-major data representation scheme coupled with the bitsliced architecture is employed in our proposed Viterbi decoder that enables the maximum utilization of the parallel processing units in modern parallel accelerators. With the help of the proposed alteration of the data scheme, instead of the conventional bit-by-bit operations, 32-bit chunks of data are processed by each processing unit. This means that a single bitsliced parallel Viterbi decoder is capable of decoding 32 different chunks of data simultaneously. Here, the Viterbi’s Add-Compare-Select procedure is implemented with our proposed bitslicing technique, where it is shown that the bitsliced operations for the Viterbi internal functionalities are efficient in terms of their performance and complexity. We have achieved this level of high parallelism while keeping an acceptable bit error rate performance for our proposed methodology. Our suggested hard and soft-decision Viterbi decoder implementations on GPU platforms outperform the fastest previously proposed works by 4.3× and 2.3×, achieving 21.41 and 8.24 Gbps on Tesla V100, respectively.
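To make the bitslicing idea concrete, the following is a minimal sketch of a bitsliced Add-Compare-Select step in Python. It is not the authors' GPU implementation: the function names, the 8-bit metric width, and the tie-breaking rule are all assumptions made for illustration. Each 32-bit word holds one bit position for 32 independent lanes, so a single XOR or AND acts on all 32 lanes at once.

```python
LANES = 32                      # one lane per independent data chunk
MASK = (1 << LANES) - 1         # keeps Python integers within 32 bits


def to_bitplanes(values, width):
    """Transpose up to 32 small integers into `width` bit-plane words (LSB first)."""
    planes = []
    for j in range(width):
        word = 0
        for lane, v in enumerate(values):
            word |= ((v >> j) & 1) << lane
        planes.append(word)
    return planes


def from_bitplanes(planes, lanes=LANES):
    """Inverse of to_bitplanes: recover one integer per lane."""
    return [sum(((p >> lane) & 1) << j for j, p in enumerate(planes))
            for lane in range(lanes)]


def bs_add(a, b):
    """Ripple-carry addition on bit planes; wraps modulo 2**width in every lane."""
    carry, out = 0, []
    for x, y in zip(a, b):
        out.append(x ^ y ^ carry)
        carry = (x & y) | (carry & (x ^ y))
    return out


def bs_less_than(a, b):
    """Return a lane mask with bit i set where a < b (unsigned), via the borrow chain of a - b."""
    borrow = 0
    for x, y in zip(a, b):
        borrow = ((~x & y) | (~(x ^ y) & borrow)) & MASK
    return borrow


def bs_acs(pm0, bm0, pm1, bm1):
    """Add branch metrics to path metrics, compare, and select the smaller candidate.

    Returns (survivor bit planes, decision word). A decision bit of 1 means
    path 1 was chosen; ties go to path 1 (an arbitrary choice in this sketch).
    """
    cand0 = bs_add(pm0, bm0)
    cand1 = bs_add(pm1, bm1)
    took0 = bs_less_than(cand0, cand1)
    survivor = [(x & took0) | (y & ~took0 & MASK) for x, y in zip(cand0, cand1)]
    return survivor, ~took0 & MASK
```

The metric width must be chosen so that the sums do not overflow; a real decoder would also normalize path metrics periodically, which this sketch omits.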

Low-precision Floating-point Arithmetic for High-performance FPGA-based CNN Acceleration

ACM Transactions on Reconfigurable Technology and Systems ◽

10.1145/3474597 ◽

2022 ◽

Vol 15 (1) ◽

pp. 1-21

Author(s):

Chen Wu ◽

Mingyu Wang ◽

Xinyuan Chu ◽

Kun Wang ◽

Lei He

Keyword(s):

Fixed Point ◽

High Performance ◽

Good Accuracy ◽

Data Representation ◽

Floating Point ◽

Average Throughput ◽

Precision Data ◽

Content Type ◽

Point Arithmetic ◽

Better Than

Low-precision data representation is important to reduce storage size and memory access for convolutional neural networks (CNNs). Yet, existing methods have two major limitations: (1) requiring re-training to maintain accuracy for deep CNNs and (2) needing 16-bit floating-point or 8-bit fixed-point for good accuracy. In this article, we propose a low-precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration to overcome the above limitations. Without any re-training, LPFP finds an optimal 8-bit data representation with negligible top-1/top-5 accuracy loss (within 0.5%/0.3% in our experiments, respectively, and significantly better than existing methods for deep CNNs). Furthermore, we implement one 8-bit LPFP multiplication by one 4-bit multiply-adder and one 3-bit adder, and therefore implement four 8-bit LPFP multiplications using one DSP48E1 of Xilinx Kintex-7 family or DSP48E2 of Xilinx Ultrascale/Ultrascale+ family, whereas one DSP can implement only two 8-bit fixed-point multiplications. Experiments on six typical CNNs for inference show that on average, we improve throughput by over existing FPGA accelerators. Particularly for VGG16 and YOLO, compared to six recent FPGA accelerators, we improve average throughput by 3.5× and 27.5× and average throughput per DSP by 4.1× and 5×, respectively.
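As a rough illustration of what choosing an 8-bit floating-point representation can involve, the sketch below rounds values to a small float format and searches exponent/mantissa splits and exponent biases for the lowest squared error. The abstract does not describe the LPFP search procedure, so everything here is an assumption: the function names `quantize_minifloat` and `best_format`, the squared-error criterion, the bias range, and the choice to use all exponent codes for finite values.

```python
import math


def quantize_minifloat(x, exp_bits, man_bits, bias):
    """Round x to the nearest value of a sign/exponent/mantissa format.

    Assumptions of this sketch: subnormals are supported, there are no
    inf/NaN encodings, and out-of-range values saturate to the largest
    finite magnitude.
    """
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    e_min = 1 - bias                        # exponent of the smallest normal number
    e_max = (1 << exp_bits) - 1 - bias      # exponent of the largest number
    _, ex = math.frexp(mag)                 # mag = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, e_min)                  # subnormals share the e_min scale
    step = 2.0 ** (e - man_bits)            # spacing of representable values
    q = round(mag / step) * step
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** e_max
    return sign * min(q, max_val)


def best_format(values, total_bits=8):
    """Exhaustively pick (error, exp_bits, man_bits, bias) with the lowest squared error."""
    best = None
    for exp_bits in range(1, total_bits - 1):
        man_bits = total_bits - 1 - exp_bits        # one bit is reserved for the sign
        for bias in range(0, 2 ** exp_bits + 8):    # hypothetical search range
            err = sum((v - quantize_minifloat(v, exp_bits, man_bits, bias)) ** 2
                      for v in values)
            if best is None or err < best[0]:
                best = (err, exp_bits, man_bits, bias)
    return best
```

Calling `best_format` on a sample of weights or activations returns the split and bias that fit that sample best under this criterion; an accuracy-driven search would instead evaluate the network itself.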

A High Performance Soft Decision Viterbi Decoder for Wlan and Broadband Applications

2006 Canadian Conference on Electrical and Computer Engineering ◽

10.1109/ccece.2006.277834 ◽

2006 ◽

Cited By ~ 2

Author(s):

Abdul-rafeeq Abdul-Shakoor ◽

Valek Szwarc

Keyword(s):

High Performance ◽

Viterbi Decoder ◽

Soft Decision

AMIDE v2: High-Throughput Screening Based on AutoDock-GPU and Improved Workflow Leading to Better Performance and Reliability

International Journal of Molecular Sciences ◽

10.3390/ijms22147489 ◽

2021 ◽

Vol 22 (14) ◽

pp. 7489

Author(s):

Pierre Darme ◽

Manuel Dauchez ◽

Arnaud Renard ◽

Laurence Voutquenne-Nazabadioko ◽

Dominique Aubert ◽

...

Keyword(s):

High Throughput ◽

High Throughput Screening ◽

High Performance ◽

Graphics Processing Unit ◽

Target Identification ◽

Computation Time ◽

Processing Unit ◽

Biological Target ◽

Speed Up ◽

Graphics Processing

Molecular docking is widely used in computed drug discovery and biological target identification, but getting fast results can be tedious and often requires supercomputing solutions. AMIDE stands for AutoMated Inverse Docking Engine. It was initially developed in 2014 to perform inverse docking on high-performance computing (HPC) systems. AMIDE version 2 brings a substantial speed-up by using AutoDock-GPU and through a complete revision of the programming workflow, leading to better performance, easier use, bug corrections, parallelization improvements and PC/HPC compatibility. In addition to inverse docking, AMIDE is now an optimized tool capable of high-throughput inverse screening. For instance, AMIDE version 2 accelerates docking by up to 12.4 times for 100 runs of AutoDock compared to version 1, without significant changes in docking poses. The reverse docking of a ligand on 87 proteins takes only 23 min on 1 GPU (Graphics Processing Unit), while version 1 required 300 cores to reach the same execution time. Moreover, we have shown an exponential acceleration of the computation time as a function of the number of GPUs used, allowing a significant reduction of the duration of the inverse docking process on large datasets.

High-performance computer graphics technologies in engineering applications

World Journal of Engineering ◽

10.1108/wje-05-2018-0158 ◽

2019 ◽

Vol 16 (2) ◽

pp. 304-308

Author(s):

Chao Peng

Keyword(s):

Computer Graphics ◽

High Performance ◽

Graphics Processing Unit ◽

General Purpose ◽

Processing Unit ◽

Data Streaming ◽

Engineering Applications ◽

Content Type ◽

Central Processing ◽

Advantages And Disadvantages

Purpose: The purpose of this paper is to investigate possibilities to adopt state-of-the-art computer graphics technologies for big data visualization in engineering applications. Toward this purpose, a conceptual heterogeneous system is proposed for graphical rendering, which is established with multiple central processing unit (CPU) cores and multiple graphics processing units (GPUs). Design/methodology/approach: The design of the system supports both general-purpose computation and graphics-related computation. Three processing components are discussed to fulfill the execution requirements in load balancing, data streaming and display. This design fully uses computational and memory resources and enhances the performance with the support of GPU-based parallelization. Findings: The advantages and disadvantages of particular technical methods for each processing component are discussed. The possible ways to integrate them are analyzed. Originality/value: This work contributes to the use of computer graphics technologies in engineering applications.

The design of Viterbi decoder for low power consumption space time trellis code without adder architecture using RTL model

World Journal of Engineering ◽

10.1108/wje-09-2016-0088 ◽

2016 ◽

Vol 13 (6) ◽

pp. 540-546 ◽

Cited By ~ 1

Author(s):

Mohd Azlan Abu ◽

Harlisya Harun ◽

Mohammad Yazdi Harmin ◽

Noor Izzri Abdul Wahab ◽

Muhd Khairulzaman Abdul Kadir

Keyword(s):

Viterbi Algorithm ◽

Original Data ◽

Space Time ◽

Modulation Scheme ◽

Viterbi Decoder ◽

Content Type ◽

Trellis Code ◽

Shift Keying ◽

Programmable Logic Devices ◽

Space Time Trellis Code

Purpose: This paper aims to describe the real-time design and implementation of a Space Time Trellis Code decoder using Altera Complex Programmable Logic Devices (CPLD). Design/methodology/approach: The code uses a generator matrix designed for a four-state space time trellis code (STTC) that uses a quadrature phase shift keying (QPSK) modulation scheme. The decoding process has been carried out using maximum likelihood sequence estimation through the Viterbi algorithm. Findings: The results showed that the STTC decoder can successfully decipher the encoded symbols from the STTC encoder and can fully recover the original data. The data rate of the decoder is 50 Mbps. Originality/value: A 96 per cent improvement in the total logic elements used in the Max V CPLD is shown compared to the previous literature.

Adaptive In-Network Collaborative Caching for Enhanced Ensemble Deep Learning at Edge

Mathematical Problems in Engineering ◽

10.1155/2021/9285802 ◽

2021 ◽

Vol 2021 ◽

pp. 1-14

Author(s):

Yana Qin ◽

Danye Wu ◽

Zhiwei Xu ◽

Jie Tian ◽

Yujun Zhang

Keyword(s):

Ensemble Learning ◽

High Performance ◽

Data Representation ◽

Edge Computing ◽

Privacy And Security ◽

Collaborative Caching ◽

Data Intensive ◽

Caching Scheme ◽

Efficient Data ◽

Transmission Overhead

To enhance the quality and speed of data processing and protect the privacy and security of the data, edge computing has been extensively applied to support data-intensive intelligent processing services at the edge. Among these data-intensive services, ensemble learning-based services can naturally leverage the distributed computation and storage resources at edge devices to achieve efficient data collection, processing, and analysis. Collaborative caching has been applied in edge computing to support services close to the data source, so that the limited resources at edge devices can support high-performance ensemble learning solutions. To achieve this goal, we propose an adaptive in-network collaborative caching scheme for ensemble learning at the edge. First, an efficient data representation structure is proposed to record cached data among different nodes. In addition, we design a collaboration scheme that helps edge nodes cache valuable data for local ensemble learning, by scheduling local caching according to a summarization of data representations from different edge nodes. Our extensive simulations demonstrate the high performance of the proposed collaborative caching scheme, which significantly reduces the learning latency and the transmission overhead.

RDMA data transfer and GPU acceleration methods for high-throughput online processing of serial crystallography images

Journal of Synchrotron Radiation ◽

10.1107/s1600577520008140 ◽

2020 ◽

Vol 27 (5) ◽

pp. 1297-1306

Author(s):

Raphael Ponsard ◽

Nicolas Janvier ◽

Jerome Kieffer ◽

Dominique Houzet ◽

Vincent Fristot

Keyword(s):

High Throughput ◽

High Performance ◽

Data Transfer ◽

Direct Memory Access ◽

Large Data ◽

Processing Unit ◽

Image Correction ◽

Online Data ◽

Online Processing ◽

Serial Crystallography

The continual evolution of photon sources and high-performance detectors drives cutting-edge experiments that can produce very high throughput data streams and generate large data volumes that are challenging to manage and store. In these cases, efficient data transfer and processing architectures that allow online image correction, data reduction or compression become fundamental. This work investigates different technical options and methods for data placement from the detector head to the processing computing infrastructure, taking into account the particularities of modern modular high-performance detectors. In order to compare realistic figures, the future ESRF beamline dedicated to macromolecular X-ray crystallography, EBSL8, is taken as an example, which will use a PSI JUNGFRAU 4M detector generating up to 16 GB of data per second, operating continuously for several minutes. Although such an experiment seems possible at the target speed with the 100 Gb s−1 network cards that are currently available, the simulations generated highlight some potential bottlenecks when using a traditional software stack. An evaluation of solutions is presented that implements remote direct memory access (RDMA) over converged ethernet techniques. A synchronization mechanism is proposed between an RDMA network interface card (RNIC) and a graphics processing unit (GPU) accelerator in charge of the online data processing. The placement of the detector images onto the GPU is made to overlap with the computation carried out, potentially hiding the transfer latencies. As a proof of concept, a detector simulator and a backend GPU receiver with a rejection and compression algorithm suitable for a synchrotron serial crystallography (SSX) experiment are developed. It is concluded that the available transfer throughput from the RNIC to the GPU accelerator is at present the major bottleneck in online processing for SSX experiments.
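The overlap of image placement with computation can be pictured as a small producer/consumer pipeline with a fixed pool of buffers. The sketch below is only a control-flow illustration in plain Python threads, not the RDMA or GPU code of the paper: `receive_frames` stands in for the RNIC-to-GPU transfer, and the rejection rule (keep a frame when enough pixels exceed a threshold) and its parameters are hypothetical.

```python
import queue
import threading


def receive_frames(frames, buffers):
    """Stand-in for the transfer stage: fill the bounded buffer pool frame by frame."""
    for frame in frames:
        buffers.put(frame)      # blocks while every buffer is still being processed
    buffers.put(None)           # sentinel: no more frames


def count_hits(frame, threshold):
    """Count pixels above the threshold (a hypothetical stand-in for peak detection)."""
    return sum(1 for pixel in frame if pixel > threshold)


def process_stream(frames, threshold=10, min_hits=3, n_buffers=2):
    """Process frames while the next ones are being received; return indices of kept frames."""
    buffers = queue.Queue(maxsize=n_buffers)    # two buffers give classic double buffering
    receiver = threading.Thread(target=receive_frames, args=(frames, buffers))
    receiver.start()
    kept, index = [], 0
    while True:
        frame = buffers.get()
        if frame is None:
            break
        if count_hits(frame, threshold) >= min_hits:
            kept.append(index)                  # frame passes the rejection test
        index += 1
    receiver.join()
    return kept
```

With two buffers, the receiver can fill one while the consumer works on the other; the bounded queue provides the back-pressure that a real synchronization mechanism between network card and accelerator would have to supply.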

A High Performance Image Processing Unit for On-orbit Servicing

57th International Astronautical Congress ◽

10.2514/6.iac-06-d1.2.03 ◽

2006 ◽

Cited By ~ 1

Author(s):

Hiroshi Yamamoto ◽

Yasufumi Nagai ◽

Shinichi Kimura ◽

Hiroshi Takahashi ◽

Satoko Mizumoto ◽

...

Keyword(s):

Image Processing ◽

High Performance ◽

Processing Unit

Beyond MPI

ACM SIGMOD Record ◽

10.1145/3456859.3456862 ◽

2021 ◽

Vol 49 (4) ◽

pp. 12-17

Author(s):

Feilong Liu ◽

Claude Barthels ◽

Spyros Blanas ◽

Hideaki Kimura ◽

Garret Swart

Keyword(s):

High Performance ◽

Processing System ◽

Complex Interaction ◽

Remote Memory ◽

Interaction Patterns ◽

Round Trip ◽

Data Processing System ◽

Data Intensive ◽

Multiple Round ◽

Programming Interface

Networks with Remote Direct Memory Access (RDMA) support are becoming increasingly common. RDMA, however, offers a limited programming interface to remote memory that consists of read, write and atomic operations. With RDMA alone, completing the most basic operations on remote data structures often requires multiple round-trips over the network. Data-intensive systems strongly desire higher-level communication abstractions that support more complex interaction patterns. A natural candidate to consider is MPI, the de facto standard for developing high-performance applications in the HPC community. This paper critically evaluates the communication primitives of MPI and shows that using MPI in the context of a data processing system comes with its own set of insurmountable challenges. Based on this analysis, we propose a new communication abstraction named RDMO, or Remote Direct Memory Operation, that dispatches a short sequence of reads, writes and atomic operations to remote memory and executes them in a single round-trip.
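The round-trip argument can be shown with a toy model. The class below only counts round-trips in a simulated remote memory; it is not the RDMO interface, and every name in it (`ToyRemoteMemory`, `one_sided`, `batched`, the `write_indexed` operation) is hypothetical. The example appends a value to a remote log whose tail counter lives in cell 0.

```python
class ToyRemoteMemory:
    """Simulated remote memory that counts network round-trips."""

    def __init__(self, size):
        self.cells = [0] * size
        self.round_trips = 0

    def _apply(self, op, prev):
        kind = op[0]
        if kind == "read":
            return self.cells[op[1]]
        if kind == "write":
            self.cells[op[1]] = op[2]
            return None
        if kind == "fetch_add":                 # atomic in a real system
            old = self.cells[op[1]]
            self.cells[op[1]] = old + op[2]
            return old
        if kind == "write_indexed":             # address = base + previous result
            self.cells[op[1] + prev] = op[2]
            return None
        raise ValueError(f"unknown operation: {kind}")

    def one_sided(self, op):
        """Plain RDMA-style access: every operation costs one round-trip."""
        self.round_trips += 1
        return self._apply(op, None)

    def batched(self, ops):
        """Dispatch a short sequence of operations that runs remotely in one round-trip."""
        self.round_trips += 1
        results, prev = [], None
        for op in ops:
            prev = self._apply(op, prev)
            results.append(prev)
        return results


def append_one_sided(mem, value):
    slot = mem.one_sided(("fetch_add", 0, 1))       # round-trip 1: claim a slot
    mem.one_sided(("write", 1 + slot, value))       # round-trip 2: fill it


def append_batched(mem, value):
    mem.batched([("fetch_add", 0, 1), ("write_indexed", 1, value)])
```

The second operation depends on the result of the first, which is why the one-sided version cannot merge them: the client has to wait for the slot index before it can issue the write.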

Rough diamonds in emerging markets: legacy, competitiveness, and sustained high performance

Cross Cultural & Strategic Management ◽

10.1108/ccsm-03-2019-0057 ◽

2019 ◽

Vol 26 (3) ◽

pp. 363-386

Author(s):

Seung Ho Park ◽

Gerardo R. Ungson

Keyword(s):

Human Capital ◽

Emerging Markets ◽

High Performance ◽

Core Competencies ◽

Market Potential ◽

Government Support ◽

Screening Tests ◽

Institutional Voids ◽

Content Type ◽

Over Time

Purpose: The purpose of this paper is to uncover the underlying drivers of sustained high performing companies based on a field study of 127 companies in Brazilian, Russian, Indian and Chinese (BRIC) and Association of Southeast Asian Nations (ASEAN) emerging markets. Understanding these companies provides a complementary way of appraising the growth, development and transformation of emerging markets. The authors synthesize the findings in an overarching framework that covers six strategies for building and sustaining legacy that leads to the succession of intergenerational wealth over time: overcoming institutional voids, inclusive markets, deepening localization, nurturing government support, building core competencies and harnessing human capital. The authors relate these strategies to different levels of development using Prahalad and Hart’s BOP framework. Design/methodology/approach: This study examines the underlying drivers of sustained high-performance companies based on field studies from an initial set of 105,260 BRIC companies and close to 500 companies in ASEAN. The methods employed four screening tests to arrive at a selection of the highest-performing firms: 70 firms in the BRIC nations and 58 firms from ASEAN. Following the selection, the authors constructed cases using primary interviews and secondary data, with the assistance of Ernst & Young and with academic colleagues in Manila. These studies were originally conducted in two separate time periods and reported accordingly. This paper synthesizes the findings of these two studies to arrive at an extended integrative framework. Findings: From the cases, the authors examine six strategies for building and sustaining legacy that lead to high performance over time: overcoming institutional voids, creating inclusive markets, deepening localization, nurturing government support, building core competencies and harnessing human capital. To address the evolving state of institutional voids in these countries, the authors employ similar methods to hypothesize the placement of these strategies in the context of the world economic pyramid, initially formulated as the “bottom of the pyramid” framework. Originality/value: This paper synthesizes and extends the authors’ previous works by proposing the concept of legacy to describe the emergence and succession of local exemplary firms in emerging markets. This study aims to complement extant measures of nation-growth based primarily on GDP. The paper also extends the literature on institutional voids in shifting the focus from the mix of voids to their evolving state. Altogether, the paper provides a complementary narrative on assessing the market potential of emerging markets by adopting several categories of performance.
