A High-throughput Parallel Viterbi Algorithm via Bitslicing

2021 ◽  
Vol 8 (4) ◽  
pp. 1-25
Author(s):  
Saleh Khalaj Monfared ◽  
Omid Hajihassani ◽  
Vahid Mohsseni ◽  
Dara Rahmati ◽  
Saeid Gorgin

In this work, we present a novel bitsliced high-performance Viterbi algorithm suitable for high-throughput and data-intensive communication. A new column-major data representation scheme coupled with the bitsliced architecture is employed in our proposed Viterbi decoder that enables the maximum utilization of the parallel processing units in modern parallel accelerators. With the help of the proposed alteration of the data scheme, instead of the conventional bit-by-bit operations, 32-bit chunks of data are processed by each processing unit. This means that a single bitsliced parallel Viterbi decoder is capable of decoding 32 different chunks of data simultaneously. Here, the Viterbi’s Add-Compare-Select procedure is implemented with our proposed bitslicing technique, where it is shown that the bitsliced operations for the Viterbi internal functionalities are efficient in terms of their performance and complexity. We have achieved this level of high parallelism while keeping an acceptable bit error rate performance for our proposed methodology. Our suggested hard and soft-decision Viterbi decoder implementations on GPU platforms outperform the fastest previously proposed works by 4.3× and 2.3×, achieving 21.41 and 8.24 Gbps on Tesla V100, respectively.
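
The column-major (bitsliced) layout described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' GPU implementation: the helper names `to_bitsliced`, `from_bitsliced` and `branch_metric_bitsliced` are hypothetical, and only a hard-decision branch metric is shown, not the full Add-Compare-Select step. The sketch packs 32 independent bit streams so that bit t of every stream shares one 32-bit word, then computes a Hamming-distance metric for all 32 streams with a handful of bitwise operations.

```python
LANES = 32                 # number of independent streams per 32-bit word
MASK = (1 << LANES) - 1    # all-ones word: "this bit is 1 in every lane"


def to_bitsliced(streams):
    """Transpose LANES bit streams (lists of 0/1) into column-major words.

    Word t holds bit t of every stream; stream i occupies bit i of each word.
    """
    assert len(streams) == LANES
    words = []
    for t in range(len(streams[0])):
        w = 0
        for lane, stream in enumerate(streams):
            w |= (stream[t] & 1) << lane
        words.append(w)
    return words


def from_bitsliced(words):
    """Inverse of to_bitsliced: recover the LANES individual bit streams."""
    return [[(w >> lane) & 1 for w in words] for lane in range(LANES)]


def branch_metric_bitsliced(r0, r1, e0, e1):
    """Hamming distance between a received bit pair and an expected bit pair.

    Every argument is one word carrying one bit per lane. The 2-bit result
    is itself bitsliced: `lo` holds the low bit of each lane's distance and
    `hi` holds the high bit, so all 32 lanes are handled at once.
    """
    d0 = (r0 ^ e0) & MASK  # mismatch on the first coded bit, per lane
    d1 = (r1 ^ e1) & MASK  # mismatch on the second coded bit, per lane
    lo = d0 ^ d1           # half-adder sum
    hi = d0 & d1           # half-adder carry
    return lo, hi
```

Because the expected bits of a trellis branch are the same for every stream, they are passed as either `0` or `MASK`; for example, `branch_metric_bitsliced(words[0], words[1], MASK, 0)` scores the branch with expected output `10` in all 32 lanes.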

2022 ◽  
Vol 15 (1) ◽  
pp. 1-21
Author(s):  
Chen Wu ◽  
Mingyu Wang ◽  
Xinyuan Chu ◽  
Kun Wang ◽  
Lei He

Low-precision data representation is important to reduce storage size and memory access for convolutional neural networks (CNNs). Yet, existing methods have two major limitations: (1) requiring re-training to maintain accuracy for deep CNNs and (2) needing 16-bit floating-point or 8-bit fixed-point for good accuracy. In this article, we propose a low-precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration to overcome the above limitations. Without any re-training, LPFP finds an optimal 8-bit data representation with negligible top-1/top-5 accuracy loss (within 0.5%/0.3% in our experiments, respectively, and significantly better than existing methods for deep CNNs). Furthermore, we implement one 8-bit LPFP multiplication by one 4-bit multiply-adder and one 3-bit adder, and therefore implement four 8-bit LPFP multiplications using one DSP48E1 of Xilinx Kintex-7 family or DSP48E2 of Xilinx Ultrascale/Ultrascale+ family, whereas one DSP can implement only two 8-bit fixed-point multiplications. Experiments on six typical CNNs for inference show that, on average, we improve throughput over existing FPGA accelerators. Particularly for VGG16 and YOLO, compared to six recent FPGA accelerators, we improve average throughput by 3.5× and 27.5× and average throughput per DSP by 4.1× and 5×, respectively.
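
To make the idea of choosing an 8-bit floating-point representation concrete, the following Python sketch rounds values onto a reduced-precision grid and picks the exponent/mantissa split with the lowest mean squared error. This is an illustration under stated assumptions, not the LPFP algorithm from the paper: the functions `quantize_fp` and `best_format`, the simplified number grid (no infinities or NaNs, saturating overflow) and the mean-squared-error criterion are all hypothetical choices.

```python
import math


def quantize_fp(x, exp_bits, man_bits, bias):
    """Round x to the nearest point of a simplified low-precision float grid.

    Assumed format: 1 sign bit, `exp_bits` exponent bits, `man_bits`
    mantissa bits. Values above the largest magnitude saturate; values
    below the smallest exponent are rounded on that exponent's step size.
    """
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    e_min = -bias
    e_max = (1 << exp_bits) - 1 - bias
    _, ex = math.frexp(mag)              # mag = m * 2**ex with 0.5 <= m < 1
    e = max(e_min, min(ex - 1, e_max))   # clamp exponent to the format's range
    step = 2.0 ** (e - man_bits)         # spacing of grid points at exponent e
    q = round(mag / step) * step
    largest = (2.0 - 2.0 ** (-man_bits)) * 2.0 ** e_max
    return sign * min(q, largest)


def best_format(values, total_bits=8):
    """Try every exponent/mantissa split and return (mse, exp_bits, man_bits)."""
    best = None
    for exp_bits in range(1, total_bits - 1):
        man_bits = total_bits - 1 - exp_bits
        bias = (1 << (exp_bits - 1)) - 1  # conventional symmetric bias (assumed)
        mse = sum(
            (v - quantize_fp(v, exp_bits, man_bits, bias)) ** 2 for v in values
        ) / len(values)
        if best is None or mse < best[0]:
            best = (mse, exp_bits, man_bits)
    return best
```

Calling `best_format(weights)` on a list of layer weights returns the split with the smallest quantization error for that layer; a real flow would also have to account for activations and for the accuracy of the whole network.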


2021 ◽  
Vol 22 (14) ◽  
pp. 7489
Author(s):  
Pierre Darme ◽  
Manuel Dauchez ◽  
Arnaud Renard ◽  
Laurence Voutquenne-Nazabadioko ◽  
Dominique Aubert ◽  
...  

Molecular docking is widely used in computational drug discovery and biological target identification, but obtaining results quickly can be tedious and often requires supercomputing solutions. AMIDE stands for AutoMated Inverse Docking Engine. It was initially developed in 2014 to perform inverse docking on high-performance computing (HPC) systems. AMIDE version 2 brings a substantial speed-up by using AutoDock-GPU and through a complete revision of the programming workflow, leading to better performance, easier use, bug fixes, parallelization improvements and PC/HPC compatibility. In addition to inverse docking, AMIDE is now an optimized tool capable of high-throughput inverse screening. For instance, AMIDE version 2 accelerates docking by up to 12.4 times for 100 runs of AutoDock compared to version 1, without significant changes in docking poses. The reverse docking of a ligand on 87 proteins takes only 23 min on 1 GPU (Graphics Processing Unit), while version 1 required 300 cores to reach the same execution time. Moreover, we have shown an exponential acceleration of the computation time as a function of the number of GPUs used, allowing a significant reduction of the duration of the inverse docking process on large datasets.


2019 ◽  
Vol 16 (2) ◽  
pp. 304-308
Author(s):  
Chao Peng

Purpose: The purpose of this paper is to investigate possibilities to adopt state-of-the-art computer graphics technologies for big data visualization in engineering applications. Toward this purpose, a conceptual heterogeneous system is proposed for graphical rendering, which is established with multiple central processing unit (CPU) cores and multiple graphics processing units (GPUs). Design/methodology/approach: The design of the system supports both general-purpose computation and graphics-related computation. Three processing components are discussed to fulfill the execution requirements in load balancing, data streaming and display. This design fully uses computational and memory resources and enhances the performance with the support of GPU-based parallelization. Findings: The advantages and disadvantages of particular technical methods for each processing component are discussed. The possible ways to integrate them are analyzed. Originality/value: This work contributes to the use of computer graphics technologies in engineering applications.


2016 ◽  
Vol 13 (6) ◽  
pp. 540-546
Author(s):  
Mohd Azlan Abu ◽  
Harlisya Harun ◽  
Mohammad Yazdi Harmin ◽  
Noor Izzri Abdul Wahab ◽  
Muhd Khairulzaman Abdul Kadir

Purpose: This paper aims to describe the real-time design and implementation of a space time trellis code (STTC) decoder using Altera Complex Programmable Logic Devices (CPLD). Design/methodology/approach: The code uses a generator matrix designed for a four-state STTC that uses the quadrature phase shift keying (QPSK) modulation scheme. The decoding process has been carried out using maximum likelihood sequence estimation through the Viterbi algorithm. Findings: The results showed that the STTC decoder can successfully decode the encoded symbols from the STTC encoder and can fully recover the original data. The data rate of the decoder is 50 Mbps. Originality/value: A 96 per cent improvement in the total logic elements used in the Max V CPLD is reported compared to previous work in the literature.
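
Maximum likelihood sequence estimation with the Viterbi algorithm over a four-state QPSK trellis can be sketched in Python as follows. The abstract does not give the generator matrix, so this example substitutes a simple delay-diversity trellis (antenna 1 sends the previous symbol, antenna 2 sends the current one) and assumes one receive antenna with known channel gains `h1` and `h2`. The functions `sttc_encode` and `viterbi_decode` are hypothetical and are not a model of the CPLD design.

```python
import cmath

# QPSK constellation: symbol index 0..3 -> unit-magnitude complex point
QPSK = [cmath.exp(1j * cmath.pi / 2 * k) for k in range(4)]


def sttc_encode(symbols):
    """Assumed delay-diversity trellis: state = previous symbol index.

    Returns one (antenna 1, antenna 2) pair of constellation points per input.
    """
    state = 0
    out = []
    for s in symbols:
        out.append((QPSK[state], QPSK[s]))
        state = s
    return out


def viterbi_decode(received, h1, h2):
    """Return the symbol sequence with the smallest squared Euclidean distance."""
    inf = float("inf")
    metric = [0.0, inf, inf, inf]       # the encoder starts in state 0
    paths = [[] for _ in range(4)]
    for r in received:
        new_metric = [inf] * 4
        new_paths = [None] * 4
        for prev in range(4):
            if metric[prev] == inf:
                continue                # state not reachable yet
            for s in range(4):          # input symbol s leads to state s
                expected = h1 * QPSK[prev] + h2 * QPSK[s]
                m = metric[prev] + abs(r - expected) ** 2
                if m < new_metric[s]:   # compare-select: keep the survivor
                    new_metric[s] = m
                    new_paths[s] = paths[prev] + [s]
        metric, paths = new_metric, new_paths
    best = min(range(4), key=lambda st: metric[st])
    return paths[best]
```

Storing a full survivor path per state keeps the sketch short; a hardware decoder would normally store per-step decisions and trace back instead.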


2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Yana Qin ◽  
Danye Wu ◽  
Zhiwei Xu ◽  
Jie Tian ◽  
Yujun Zhang

To enhance the quality and speed of data processing and protect the privacy and security of the data, edge computing has been extensively applied to support data-intensive intelligent processing services at the edge. Among these data-intensive services, ensemble learning-based services can naturally leverage the distributed computation and storage resources at edge devices to achieve efficient data collection, processing, and analysis. Collaborative caching has been applied in edge computing to support services close to the data source, so that the limited resources at edge devices can support high-performance ensemble learning solutions. To achieve this goal, we propose an adaptive in-network collaborative caching scheme for ensemble learning at the edge. First, an efficient data representation structure is proposed to record cached data among different nodes. In addition, we design a collaboration scheme that helps edge nodes cache valuable data for local ensemble learning, by scheduling local caching according to a summarization of data representations from different edge nodes. Our extensive simulations demonstrate the high performance of the proposed collaborative caching scheme, which significantly reduces the learning latency and the transmission overhead.


2020 ◽  
Vol 27 (5) ◽  
pp. 1297-1306
Author(s):  
Raphael Ponsard ◽  
Nicolas Janvier ◽  
Jerome Kieffer ◽  
Dominique Houzet ◽  
Vincent Fristot

The continual evolution of photon sources and high-performance detectors drives cutting-edge experiments that can produce very high throughput data streams and generate large data volumes that are challenging to manage and store. In these cases, efficient data transfer and processing architectures that allow online image correction, data reduction or compression become fundamental. This work investigates different technical options and methods for data placement from the detector head to the processing computing infrastructure, taking into account the particularities of modern modular high-performance detectors. In order to compare realistic figures, the future ESRF beamline dedicated to macromolecular X-ray crystallography, EBSL8, is taken as an example, which will use a PSI JUNGFRAU 4M detector generating up to 16 GB of data per second, operating continuously for several minutes. Although such an experiment seems possible at the target speed with the 100 Gb s−1 network cards that are currently available, the simulations highlight some potential bottlenecks when using a traditional software stack. An evaluation of solutions is presented that implements remote direct memory access (RDMA) over Converged Ethernet (RoCE) techniques. A synchronization mechanism is proposed between an RDMA network interface card (RNIC) and a graphics processing unit (GPU) accelerator in charge of the online data processing. The placement of the detector images onto the GPU is made to overlap with the computation carried out, potentially hiding the transfer latencies. As a proof of concept, a detector simulator and a backend GPU receiver with a rejection and compression algorithm suitable for a synchrotron serial crystallography (SSX) experiment are developed. It is concluded that the available transfer throughput from the RNIC to the GPU accelerator is at present the major bottleneck in online processing for SSX experiments.


Author(s):  
Hiroshi Yamamoto ◽  
Yasufumi Nagai ◽  
Shinichi Kimura ◽  
Hiroshi Takahashi ◽  
Satoko Mizumoto ◽  
...  

2021 ◽  
Vol 49 (4) ◽  
pp. 12-17
Author(s):  
Feilong Liu ◽  
Claude Barthels ◽  
Spyros Blanas ◽  
Hideaki Kimura ◽  
Garret Swart

Networks with Remote Direct Memory Access (RDMA) support are becoming increasingly common. RDMA, however, offers a limited programming interface to remote memory that consists of read, write and atomic operations. With RDMA alone, completing the most basic operations on remote data structures often requires multiple round-trips over the network. Data-intensive systems strongly desire higher-level communication abstractions that support more complex interaction patterns. A natural candidate to consider is MPI, the de facto standard for developing high-performance applications in the HPC community. This paper critically evaluates the communication primitives of MPI and shows that using MPI in the context of a data processing system comes with its own set of insurmountable challenges. Based on this analysis, we propose a new communication abstraction named RDMO, or Remote Direct Memory Operation, that dispatches a short sequence of reads, writes and atomic operations to remote memory and executes them in a single round-trip.
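
The round-trip argument in this abstract can be illustrated with a toy Python model. It is a sketch under stated assumptions rather than the RDMO interface from the paper: the `RemoteMemory` class, its `rdmo` method and the append-to-remote-array example are hypothetical, and a Python callable stands in for the "short sequence of reads, writes and atomic operations" that would be shipped to the remote side.

```python
class RemoteMemory:
    """Toy model of remote memory that counts network round trips."""

    def __init__(self, size):
        self.cells = [0] * size
        self.round_trips = 0

    # Operations as executed on the remote side (no network cost).
    def _read(self, addr):
        return self.cells[addr]

    def _write(self, addr, value):
        self.cells[addr] = value

    def _fetch_add(self, addr, delta):
        old = self.cells[addr]
        self.cells[addr] = old + delta
        return old

    # Plain RDMA-style verbs: every call costs one round trip.
    def read(self, addr):
        self.round_trips += 1
        return self._read(addr)

    def write(self, addr, value):
        self.round_trips += 1
        self._write(addr, value)

    def fetch_add(self, addr, delta):
        self.round_trips += 1
        return self._fetch_add(addr, delta)

    # RDMO-style call: the whole sequence costs one round trip.
    def rdmo(self, ops):
        self.round_trips += 1
        return ops(self._read, self._write, self._fetch_add)


TAIL = 0  # cell 0 stores the index of the next free slot; data starts at cell 1


def append_rdma(mem, value):
    slot = mem.fetch_add(TAIL, 1)   # round trip 1: reserve a slot
    mem.write(1 + slot, value)      # round trip 2: store the value
    return slot


def append_rdmo(mem, value):
    def ops(read, write, fetch_add):
        slot = fetch_add(TAIL, 1)
        write(1 + slot, value)
        return slot

    return mem.rdmo(ops)            # both steps in a single round trip
```

In this model an append costs two round trips with the plain verbs and one with the batched call; the saving grows with the number of dependent steps in the remote operation.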


2019 ◽  
Vol 26 (3) ◽  
pp. 363-386
Author(s):  
Seung Ho Park ◽  
Gerardo R. Ungson

Purpose: The purpose of this paper is to uncover the underlying drivers of sustained high performing companies based on a field study of 127 companies in Brazil, Russia, India and China (BRIC) and Association of Southeast Asian Nations (ASEAN) emerging markets. Understanding these companies provides a complementary way of appraising the growth, development and transformation of emerging markets. The authors synthesize the findings in an overarching framework that covers six strategies for building and sustaining legacy that leads to the succession of intergenerational wealth over time: overcoming institutional voids, inclusive markets, deepening localization, nurturing government support, building core competencies and harnessing human capital. The authors relate these strategies to different levels of development using Prahalad and Hart’s BOP framework. Design/methodology/approach: This study examines the underlying drivers of sustained high-performance companies based on field studies from an initial set of 105,260 BRIC companies and close to 500 companies in ASEAN. The methods employed four screening tests to arrive at a selection of the highest-performing firms: 70 firms in the BRIC nations and 58 firms from ASEAN. Following the selection, the authors constructed cases using primary interviews and secondary data, with the assistance of Ernst & Young and with academic colleagues in Manila. These studies were originally conducted in two separate time periods and reported accordingly. This paper synthesizes the findings of these two studies to arrive at an extended integrative framework. Findings: From the cases, the authors examine six strategies for building and sustaining legacy that lead to high performance over time: overcoming institutional voids, creating inclusive markets, deepening localization, nurturing government support, building core competencies and harnessing human capital. To address the evolving state of institutional voids in these countries, the authors employ similar methods to hypothesize the placement of these strategies in the context of the world economic pyramid, initially formulated as the “bottom of the pyramid” framework. Originality/value: This paper synthesizes and extends the authors’ previous works by proposing the concept of legacy to describe the emergence and succession of local exemplary firms in emerging markets. This study aims to complement extant measures of nation-growth based primarily on GDP. The paper also extends the literature on institutional voids by shifting the focus from the mix of voids to their evolving state. Altogether, the paper provides a complementary narrative on assessing the market potential of emerging markets by adopting several categories of performance.

