Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software

Goran Flegar; Hartwig Anzt; Terry Cojean; Enrique S. Quintana-Ortí

doi:10.1145/3441850

Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software

ACM Transactions on Mathematical Software ◽

10.1145/3441850 ◽

2021 ◽

Vol 47 (2) ◽

pp. 1-28

Author(s):

Goran Flegar ◽

Hartwig Anzt ◽

Terry Cojean ◽

Enrique S. Quintana-Ortí

Keyword(s):

Linear Algebra ◽

Graphics Processing Units ◽

High Performance ◽

Numerical Algorithms ◽

Mixed Precision ◽

Before And After ◽

Memory Accesses ◽

Specialized Hardware ◽

The Individual ◽

Graphics Processing

The use of mixed precision in numerical algorithms is a promising strategy for accelerating scientific applications. In particular, the adoption of specialized hardware and data formats for low-precision arithmetic in high-end GPUs (graphics processing units) has motivated numerous efforts to carefully reduce the working precision in order to speed up computations. For algorithms whose performance is bound by memory bandwidth, the idea of compressing their data before (and after) memory accesses has received considerable attention. One idea is to store an approximate operator, such as a preconditioner, in lower than working precision, ideally without affecting the algorithm's output. We realize the first high-performance implementation of an adaptive precision block-Jacobi preconditioner that selects the precision format used to store the preconditioner data on the fly, taking into account the numerical properties of the individual preconditioner blocks. We implement the adaptive block-Jacobi preconditioner as production-ready functionality in the Ginkgo linear algebra library, considering not only the precision formats that are part of the IEEE standard, but also customized formats that tailor the length of the exponent and significand to the characteristics of the preconditioner blocks. Experiments run on a state-of-the-art GPU accelerator show that our implementation offers attractive runtime savings.
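The per-block precision selection described in this abstract can be illustrated with a minimal sketch. This is not Ginkgo's implementation: the selection rule (1-norm condition number times unit roundoff below a tolerance), the tolerance value `tol`, and every function name are assumptions made for illustration. The sketch also ignores overflow and underflow, which a real low-precision format would have to guard against.

```python
from typing import List

Matrix = List[List[float]]

# Unit roundoff of IEEE 754 binary16, binary32 and binary64.
UNIT_ROUNDOFF = {"half": 2.0 ** -11, "single": 2.0 ** -24, "double": 2.0 ** -53}


def norm1(a: Matrix) -> float:
    """Matrix 1-norm: the maximum absolute column sum."""
    n = len(a)
    return max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))


def invert(a: Matrix) -> Matrix:
    """Invert a small dense block by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    # Augment the block with the identity matrix: [A | I].
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("singular block")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    # The right half now holds the inverse.
    return [row[n:] for row in aug]


def choose_precision(block: Matrix, tol: float = 1e-2) -> str:
    """Pick the cheapest format for which cond_1(block) * unit_roundoff < tol.

    Hypothetical rule: tol is an assumed tuning parameter, not a value from the paper.
    """
    cond = norm1(block) * norm1(invert(block))
    for name in ("half", "single", "double"):
        if cond * UNIT_ROUNDOFF[name] < tol:
            return name
    return "double"  # fall back to working precision
```

Under this assumed rule, a well-conditioned block is stored in half precision, while a block with a condition number near 1e4 is kept in single precision and one near 1e8 in double precision.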

High performance computing on graphics processing units

Pollack Periodica ◽

10.1556/pollack.3.2008.2.3 ◽

2008 ◽

Vol 3 (2) ◽

pp. 27-34 ◽

Cited By ~ 2

Author(s):

Balázs Tukora ◽

Tibor Szalay

Keyword(s):

High Performance Computing ◽

Graphics Processing Units ◽

High Performance ◽

Graphics Processing ◽

Performance Computing

DSPSR: Digital Signal Processing Software for Pulsar Astronomy

Publications of the Astronomical Society of Australia ◽

10.1071/as10021 ◽

2011 ◽

Vol 28 (1) ◽

pp. 1-14 ◽

Cited By ~ 172

Author(s):

W. van Straten ◽

M. Bailes

Keyword(s):

Signal Processing ◽

Digital Signal Processing ◽

Graphics Processing Units ◽

High Performance ◽

Digital Signal ◽

General Purpose ◽

Design Decisions ◽

Extensive Range ◽

Processing Software ◽

Graphics Processing

Abstract. dspsr is a high-performance, open-source, object-oriented, digital signal processing software library and application suite for use in radio pulsar astronomy. Written primarily in C++, the library implements an extensive range of modular algorithms that can optionally exploit both multiple-core processors and general-purpose graphics processing units. After over a decade of research and development, dspsr is now stable and in widespread use in the community. This paper presents a detailed description of its functionality, justification of major design decisions, analysis of phase-coherent dispersion removal algorithms, and demonstration of performance on some contemporary microprocessor architectures.

A lightweight approach to performance portability with targetDP

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016682071 ◽

2016 ◽

Vol 32 (2) ◽

pp. 288-301

Author(s):

Alan Gray ◽

Kevin Stratford

Keyword(s):

Particle Physics ◽

Message Passing ◽

Graphics Processing Units ◽

High Performance ◽

Large Scale ◽

Message Passing Interface ◽

Graphics Processing Unit ◽

Processing Unit ◽

Performance Portability ◽

Graphics Processing

Leading high performance computing systems achieve their status through the use of highly parallel devices such as NVIDIA graphics processing units or Intel Xeon Phi many-core CPUs. The concept of performance portability across such architectures, as well as traditional CPUs, is vital for the application programmer. In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data parallel hardware in a platform-agnostic manner. We demonstrate the effectiveness of our pragmatic approach by presenting performance results for a complex fluid application (with which the model was co-designed), plus a separate lattice quantum chromodynamics particle physics code. For each application, a single source code base is seen to achieve portable performance, as assessed within the context of the Roofline model. TargetDP can be combined with Message Passing Interface (MPI) to allow use on systems containing multiple nodes: we demonstrate this through provision of scaling results on traditional and graphics processing unit-accelerated large scale supercomputers.
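For readers unfamiliar with the Roofline model mentioned in this abstract, the bound it places on a kernel can be written in a few lines. This is a generic sketch of the standard model, not code from targetDP; the function names and the hardware numbers in the usage note are hypothetical.

```python
def roofline(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Attainable GFLOP/s for a kernel with the given arithmetic intensity (FLOP/byte).

    The kernel is limited either by peak compute or by memory traffic,
    whichever bound is lower.
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


def fraction_of_roofline(measured_gflops: float, peak_gflops: float,
                         bandwidth_gbs: float, intensity: float) -> float:
    """Measured performance as a fraction of the Roofline bound."""
    return measured_gflops / roofline(peak_gflops, bandwidth_gbs, intensity)
```

On a hypothetical device with 1000 GFLOP/s peak and 100 GB/s bandwidth, a kernel at 2 FLOP/byte is bandwidth-bound at 200 GFLOP/s, so comparing its measured rate against 200 rather than 1000 gives a fairer picture of how well the code uses the hardware.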

Big Data and IT Network Data Visualization

International Journal of Mathematical Engineering and Management Sciences ◽

10.33889/ijmems.2018.3.1-002 ◽

2018 ◽

Vol 3 (1) ◽

pp. 9-16 ◽

Cited By ~ 3

Author(s):

Lidong Wang

Keyword(s):

Big Data ◽

Network Analysis ◽

Graphics Processing Units ◽

Data Analytics ◽

High Performance ◽

Big Data Analytics ◽

Network Visualization ◽

Network Data ◽

Graphics Processing ◽

Performance Computing

Visualization with graphs is popular in the data analysis of Information Technology (IT) networks or computer networks. An IT network is often modelled as a graph with hosts being nodes and traffic being flows on many edges. General visualization methods are introduced in this paper. Applications and technology progress of visualization in IT network analysis and big data in IT network visualization are presented. The challenges of visualization and Big Data analytics in IT network visualization are also discussed. Big Data analytics with High Performance Computing (HPC) techniques, especially Graphics Processing Units (GPUs), helps accelerate IT network analysis and visualization.

The VOLNA-OP2 tsunami code (version 1.5)

Geoscientific Model Development ◽

10.5194/gmd-11-4621-2018 ◽

2018 ◽

Vol 11 (11) ◽

pp. 4621-4635 ◽

Cited By ~ 7

Author(s):

Istvan Z. Reguly ◽

Daniel Giles ◽

Devaraj Gopinathan ◽

Laure Quivy ◽

Joakim H. Beck ◽

...

Keyword(s):

Graphics Processing Units ◽

High Performance ◽

Shallow Water Equation ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Central Processing ◽

Domain Specific ◽

Computing Platforms ◽

Graphics Processing ◽

Intel Xeon

Abstract. In this paper, we present the VOLNA-OP2 tsunami model and implementation; a finite-volume non-linear shallow-water equation (NSWE) solver built on the OP2 domain-specific language (DSL) for unstructured mesh computations. VOLNA-OP2 is unique among tsunami solvers in its support for several high-performance computing platforms: central processing units (CPUs), the Intel Xeon Phi, and graphics processing units (GPUs). This is achieved in a way that the scientific code is kept separate from various parallel implementations, enabling easy maintainability. It has already been used in production for several years; here we discuss how it can be integrated into various workflows, such as a statistical emulator. The scalability of the code is demonstrated on three supercomputers, built with classical Xeon CPUs, the Intel Xeon Phi, and NVIDIA P100 GPUs. VOLNA-OP2 shows an ability to deliver productivity as well as performance and portability to its users across a number of platforms.

Accelerated FDPS: Algorithms to use accelerators with FDPS

Publications of the Astronomical Society of Japan ◽

10.1093/pasj/psz133 ◽

2020 ◽

Vol 72 (1) ◽

Cited By ~ 2

Author(s):

Masaki Iwasawa ◽

Daisuke Namekata ◽

Keigo Nitadori ◽

Kentaro Nomura ◽

Long Wang ◽

...

Keyword(s):

Graphics Processing Units ◽

High Performance ◽

General Purpose ◽

Performance Model ◽

Performance Tuning ◽

Data Types ◽

Interaction Function ◽

Current Implementation ◽

And Performance ◽

Graphics Processing

Abstract. We describe algorithms implemented in FDPS (Framework for Developing Particle Simulators) to make efficient use of accelerator hardware such as GPGPUs (general-purpose computing on graphics processing units). We have developed FDPS to make it possible for researchers to develop their own high-performance parallel particle-based simulation programs without spending large amounts of time on parallelization and performance tuning. FDPS provides a high-performance implementation of parallel algorithms for particle-based simulations in a “generic” form, so that researchers can define their own particle data structure and interparticle interaction functions. FDPS compiled with user-supplied data types and interaction functions provides all the necessary functions for parallelization, and researchers can thus write their programs as though they are writing simple non-parallel code. It has previously been possible to use accelerators with FDPS by writing an interaction function that uses the accelerator. However, the efficiency was limited by the latency and bandwidth of communication between the CPU and the accelerator, and also by the mismatch between the available degree of parallelism of the interaction function and that of the hardware parallelism. We have modified the interface of the user-provided interaction functions so that accelerators are more efficiently used. We also implemented new techniques which reduce the amount of work on the CPU side and the amount of communication between CPU and accelerators. We have measured the performance of N-body simulations on a system with an NVIDIA Volta GPGPU using FDPS and the achieved performance is around 27% of the theoretical peak limit. We have constructed a detailed performance model, and found that the current implementation can achieve good performance on systems with much smaller memory and communication bandwidth. Thus, our implementation will be applicable to future generations of accelerator systems.

State-of-the-art in Heterogeneous Computing

Scientific Programming ◽

10.1155/2010/540159 ◽

2010 ◽

Vol 18 (1) ◽

pp. 1-33 ◽

Cited By ~ 96

Author(s):

Andre R. Brodtkorb ◽

Christopher Dyken ◽

Trond R. Hagen ◽

Jon M. Hjelmervik ◽

Olaf O. Storaasli

Keyword(s):

Graphics Processing Units ◽

High Performance ◽

Heterogeneous Computing ◽

State Of The Art ◽

Peak Performance ◽

Fine Grained ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Cost Efficient ◽

Graphics Processing

Node-level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or cost efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state-of-the-art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.

Multibeam GPU Transient Pipeline for the Medicina BEST-2 Array

Journal of Astronomical Instrumentation ◽

10.1142/s2251171713500086 ◽

2013 ◽

Vol 02 (01) ◽

pp. 1350008 ◽

Cited By ~ 6

Author(s):

A. Magro ◽

J. Hickish ◽

K. Z. Adami

Keyword(s):

Event Detection ◽

Graphics Processing Units ◽

High Performance ◽

Data Transfer ◽

Digital Signal ◽

Dispersion Measure ◽

Radio Telescopes ◽

Arrival Times ◽

Detection Techniques ◽

Graphics Processing

Radio transient discovery using next generation radio telescopes will pose several digital signal processing and data transfer challenges, requiring specialized high-performance backends. Several accelerator technologies are being considered as prototyping platforms, including Graphics Processing Units (GPUs). In this paper we present a real-time pipeline prototype capable of processing multiple beams concurrently, performing Radio Frequency Interference (RFI) rejection through thresholding, correcting for the delay in signal arrival times across the frequency band using brute-force dedispersion, event detection and clustering, and finally candidate filtering, with the capability of persisting data buffers containing interesting signals to disk. This setup was deployed at the BEST-2 SKA pathfinder in Medicina, Italy, where several benchmarks and test observations of astrophysical transients were conducted. These tests show that on the deployed hardware eight 20 MHz beams can be processed simultaneously for ~640 Dispersion Measure (DM) values. Furthermore, the clustering and candidate filtering algorithms employed prove to be good candidates for online event detection techniques. The number of beams which can be processed increases proportionally to the number of servers deployed and number of GPUs, making it a viable architecture for current and future radio telescopes.
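The brute-force dedispersion step named in this abstract can be sketched on the CPU as follows. This is an illustrative pure-Python version, not the GPU pipeline from the paper: the function names, the choice of the highest channel as the reference frequency, and the peak-based DM selection are all assumptions. The only physics used is the standard cold-plasma dispersion delay, which scales with DM and the inverse square of the observing frequency.

```python
from typing import List, Sequence

# Dispersion constant in s MHz^2 pc^-1 cm^3 (standard value for the delay formula).
K_DM = 4.148808e3


def delay_samples(dm: float, freq_mhz: float, ref_mhz: float, dt: float) -> int:
    """Arrival delay at freq_mhz relative to ref_mhz, rounded to whole samples."""
    return round(K_DM * dm * (freq_mhz ** -2 - ref_mhz ** -2) / dt)


def dedisperse(data: Sequence[Sequence[float]], freqs_mhz: Sequence[float],
               dm: float, dt: float) -> List[float]:
    """Shift each frequency channel back by its delay and sum over channels."""
    ref = max(freqs_mhz)  # assumed reference: the highest-frequency channel
    n = len(data[0])
    out = [0.0] * n
    for chan, f in zip(data, freqs_mhz):
        shift = delay_samples(dm, f, ref, dt)
        for t in range(n - shift):
            out[t] += chan[t + shift]
    return out


def best_dm(data: Sequence[Sequence[float]], freqs_mhz: Sequence[float],
            trial_dms: Sequence[float], dt: float) -> float:
    """Brute force: dedisperse at every trial DM and keep the strongest peak."""
    return max(trial_dms, key=lambda dm: max(dedisperse(data, freqs_mhz, dm, dt)))
```

The cost grows with the product of channels, samples, and trial DM values, and each trial is independent of the others, which is why this step maps naturally onto GPUs.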

Transmission line matrix algorithms for high performance computing hardware with graphics processing units

2011 International Conference on Electromagnetics in Advanced Applications ◽

10.1109/iceaa.2011.6046463 ◽

2011 ◽

Author(s):

Poman So

Keyword(s):

Transmission Line ◽

High Performance Computing ◽

Graphics Processing Units ◽

High Performance ◽

Transmission Line Matrix ◽

Matrix Algorithms ◽

Graphics Processing ◽

Performance Computing

High performance image processing of satellite images using graphics processing units

2011 IEEE International Geoscience and Remote Sensing Symposium ◽

10.1109/igarss.2011.6049189 ◽

2011 ◽

Cited By ~ 2

Author(s):

Michal Rumanek ◽

Tomasz Danek ◽

Andrzej Lesniak

Keyword(s):

Image Processing ◽

Graphics Processing Units ◽

High Performance ◽

Satellite Images ◽

Graphics Processing
