Efficient parallelization of SPH algorithm on modern multi-core CPUs and massively parallel GPUs

International Journal of Modeling Simulation and Scientific Computing ◽

10.1142/s1793962321500549 ◽

2021 ◽

pp. 2150054

Author(s):

Pravin Jagtap ◽

Rupesh Nasre ◽

V. S. Sanapala ◽

B. S. V. Patnaik

Keyword(s):

High Performance ◽

Performance Metrics ◽

Computational Simulation ◽

Massively Parallel ◽

Benchmark Problems ◽

Processing Unit ◽

Central Processing ◽

Neighbor Search ◽

Computational Performance ◽

Sph Algorithm

Smoothed Particle Hydrodynamics (SPH) is fast emerging as a practically useful computational simulation tool for a wide variety of engineering problems. SPH is also gaining popularity as the back bone for fast and realistic animations in graphics and video games. The Lagrangian and mesh-free nature of the method facilitates fast and accurate simulation of material deformation, interface capture, etc. Typically, particle-based methods would necessitate particle search and locate algorithms to be implemented efficiently, as continuous creation of neighbor particle lists is a computationally expensive step. Hence, it is advantageous to implement SPH, on modern multi-core platforms with the help of High-Performance Computing (HPC) tools. In this work, the computational performance of an SPH algorithm is assessed on multi-core Central Processing Unit (CPU) as well as massively parallel General Purpose Graphical Processing Units (GP-GPU). Parallelizing SPH faces several challenges such as, scalability of the neighbor search process, force calculations, minimizing thread divergence, achieving coalesced memory access patterns, balancing workload, ensuring optimum use of computational resources, etc. While addressing some of these challenges, detailed analysis of performance metrics such as speedup, global load efficiency, global store efficiency, warp execution efficiency, occupancy, etc. is evaluated. The OpenMP and Compute Unified Device Architecture[Formula: see text] parallel programming models have been used for parallel computing on Intel Xeon[Formula: see text] E5-[Formula: see text] multi-core CPU and NVIDIA Quadro M[Formula: see text] and NVIDIA Tesla p[Formula: see text] massively parallel GPU architectures. Standard benchmark problems from the Computational Fluid Dynamics (CFD) literature are chosen for the validation. The key concern of how to identify a suitable architecture for mesh-less methods which essentially require heavy workload of neighbor search and evaluation of local force fields from neighbor interactions is addressed.

Download Full-text

Comparative study of the implementation of the Lagrange interpolation algorithm on GPU and CPU using CUDA to compute the density of a material at different temperatures

SHS Web of Conferences ◽

10.1051/shsconf/202111907002 ◽

2021 ◽

Vol 119 ◽

pp. 07002

Author(s):

Youness Rtal ◽

Abdelkader Hadjoudja

Keyword(s):

Parallel Computing ◽

Graphics Processing Units ◽

Lagrange Interpolation ◽

Polynomial Interpolation ◽

Programming Model ◽

Interpolation Method ◽

Processing Unit ◽

Central Processing ◽

Computational Performance ◽

Different Temperatures

Graphics Processing Units (GPUs) are microprocessors attached to graphics cards, which are dedicated to the operation of displaying and manipulating graphics data. Currently, such graphics cards (GPUs) occupy all modern graphics cards. In a few years, these microprocessors have become potent tools for massively parallel computing. Such processors are practical instruments that serve in developing several fields like image processing, video and audio encoding and decoding, the resolution of a physical system with one or more unknowns. Their advantages: faster processing and consumption of less energy than the power of the central processing unit (CPU). In this paper, we will define and implement the Lagrange polynomial interpolation method on GPU and CPU to calculate the sodium density at different temperatures Ti using the NVIDIA CUDA C parallel programming model. It can increase computational performance by harnessing the power of the GPU. The objective of this study is to compare the performance of the implementation of the Lagrange interpolation method on CPU and GPU processors and to deduce the efficiency of the use of GPUs for parallel computing.

Download Full-text

SeisNoise.jl: Ambient Seismic Noise Cross Correlation on the CPU and GPU in Julia

Seismological Research Letters ◽

10.1785/0220200192 ◽

2020 ◽

Vol 92 (1) ◽

pp. 517-527

Author(s):

Timothy Clements ◽

Marine A. Denolle

Keyword(s):

Seismic Noise ◽

High Performance ◽

Cross Correlation ◽

Graphic Processing Unit ◽

Ambient Seismic Noise ◽

Processing Unit ◽

Central Processing ◽

And Performance ◽

Noise Cross Correlation ◽

Performance Computing

Abstract We introduce SeisNoise.jl, a library for high-performance ambient seismic noise cross correlation, written entirely in the computing language Julia. Julia is a new language, with syntax and a learning curve similar to MATLAB (see Data and Resources), R, or Python and performance close to Fortran or C. SeisNoise.jl is compatible with high-performance computing resources, using both the central processing unit and the graphic processing unit. SeisNoise.jl is a modular toolbox, giving researchers common tools and data structures to design custom ambient seismic cross-correlation workflows in Julia.

Download Full-text

SPUX - a Scalable Package for Bayesian Uncertainty Quantification

10.5194/egusphere-egu2020-9834 ◽

2020 ◽

Author(s):

Jonas Sukys ◽

Marco Bacci

Keyword(s):

Bayesian Inference ◽

Uncertainty Quantification ◽

High Performance ◽

Performance Metrics ◽

Expert Knowledge ◽

Simple Random Walk ◽

Scientific Workflow ◽

Model Parameters ◽

Parameter Uncertainties ◽

Computational Performance

<div> <div>SPUX (Scalable Package for Uncertainty Quantification in "X") is a modular framework for Bayesian inference and uncertainty quantification. The SPUX framework aims at harnessing high performance scientific computing to&#160;tackle&#160;complex&#160;aquatic&#160;dynamical&#160;systems&#160;rich&#160;in&#160;intrinsic&#160;uncertainties,</div> <div>such as ecological ecosystems, hydrological catchments, lake dynamics, subsurface flows, urban floods, etc. The challenging task of quantifying input, output and/or parameter uncertainties in such stochastic models is tackled using Bayesian inference techniques, where numerical sampling and filtering algorithms assimilate prior expert knowledge and available experimental data. The SPUX framework greatly simplifies uncertainty quantification for realistic computationally costly models and provides an accessible, modular, portable, scalable, interpretable and reproducible scientific workflow. To achieve this, SPUX can be coupled to any serial or parallel model written in any programming language (e.g. Python, R, C/C++, Fortran, Java), can be installed either on a laptop or on a parallel cluster, and has built-in support for automatic reports, including algorithmic and computational performance metrics. I will present key SPUX concepts using a simple random walk example, and showcase recent realistic applications for catchment and lake models. In particular, uncertainties in model parameters, meteorological inputs, and data observation processes are inferred by assimilating available in-situ and remotely sensed datasets.</div> </div>

Download Full-text

Controllers: An abstraction to ease the use of hardware accelerators

The International Journal of High Performance Computing Applications ◽

10.1177/1094342017702962 ◽

2017 ◽

Vol 32 (6) ◽

pp. 838-853 ◽

Cited By ~ 4

Author(s):

Ana Moreton–Fernandez ◽

Hector Ortega–Arranz ◽

Arturo Gonzalez–Escribano

Keyword(s):

Graphics Processing Units ◽

High Performance ◽

Abstract Entity ◽

Hardware Accelerators ◽

Processing Unit ◽

Central Processing ◽

Computing Platforms ◽

Graphics Processing ◽

Performance Computing ◽

Selection Of

Nowadays the use of hardware accelerators, such as the graphics processing units or XeonPhi coprocessors, is key in solving computationally costly problems that require high performance computing. However, programming solutions for an efficient deployment for these kind of devices is a very complex task that relies on the manual management of memory transfers and configuration parameters. The programmer has to carry out a deep study of the particular data that needs to be computed at each moment, across different computing platforms, also considering architectural details. We introduce the controller concept as an abstract entity that allows the programmer to easily manage the communications and kernel launching details on hardware accelerators in a transparent way. This model also provides the possibility of defining and launching central processing unit kernels in multi-core processors with the same abstraction and methodology used for the accelerators. It internally combines different native programming models and technologies to exploit the potential of each kind of device. Additionally, the model also allows the programmer to simplify the proper selection of values for several configuration parameters that can be selected when a kernel is launched. This is done through a qualitative characterization process of the kernel code to be executed. Finally, we present the implementation of the controller model in a prototype library, together with its application in several case studies. Its use has led to reductions in the development and porting costs, with significantly low overheads in the execution times when compared to manually programmed and optimized solutions which directly use CUDA and OpenMP.

Download Full-text

Efficient, high-performance semantic segmentation using multi-scale feature extraction

PLoS ONE ◽

10.1371/journal.pone.0255397 ◽

2021 ◽

Vol 16 (8) ◽

pp. e0255397

Author(s):

Moritz Knolle ◽

Georgios Kaissis ◽

Friederike Jungmann ◽

Sebastian Ziegelmayer ◽

Daniel Sasse ◽

...

Keyword(s):

Deep Learning ◽

Graphics Processing Units ◽

Substantial Reduction ◽

Image Features ◽

Tumor Segmentation ◽

Processing Unit ◽

Central Processing ◽

Multi Scale ◽

Computational Performance ◽

Wide Range

The success of deep learning in recent years has arguably been driven by the availability of large datasets for training powerful predictive algorithms. In medical applications however, the sensitive nature of the data limits the collection and exchange of large-scale datasets. Privacy-preserving and collaborative learning systems can enable the successful application of machine learning in medicine. However, collaborative protocols such as federated learning require the frequent transfer of parameter updates over a network. To enable the deployment of such protocols to a wide range of systems with varying computational performance, efficient deep learning architectures for resource-constrained environments are required. Here we present MoNet, a small, highly optimized neural-network-based segmentation algorithm leveraging efficient multi-scale image features. MoNet is a shallow, U-Net-like architecture based on repeated, dilated convolutions with decreasing dilation rates. We apply and test our architecture on the challenging clinical tasks of pancreatic segmentation in computed tomography (CT) images as well as brain tumor segmentation in magnetic resonance imaging (MRI) data. We assess our model’s segmentation performance and demonstrate that it provides performance on par with compared architectures while providing superior out-of-sample generalization performance, outperforming larger architectures on an independent validation set, while utilizing significantly fewer parameters. We furthermore confirm the suitability of our architecture for federated learning applications by demonstrating a substantial reduction in serialized model storage requirement as a surrogate for network data transfer. Finally, we evaluate MoNet’s inference latency on the central processing unit (CPU) to determine its utility in environments without access to graphics processing units. Our implementation is publicly available as free and open-source software.

Download Full-text

Comparative evaluation of performance and scalability of convolutional neural network implementations on a multisystem HPC architecture

Journal of Physics Conference Series ◽

10.1088/1742-6596/2062/1/012008 ◽

2021 ◽

Vol 2062 (1) ◽

pp. 012008

Author(s):

Sunil Pandey ◽

Naresh Kumar Nagwani ◽

Shrish Verma

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

High Performance ◽

Processing Unit ◽

Neural Network Training ◽

Central Processing ◽

Network Training ◽

Hot Rolled ◽

Hot Rolled Steel ◽

Performance Computing

Abstract The convolutional neural network training algorithm has been implemented for a central processing unit based high performance multisystem architecture machine. The multisystem or the multicomputer is a parallel machine model which is essentially an abstraction of distributed memory parallel machines. In actual practice, this model corresponds to high performance computing clusters. The proposed implementation of the convolutional neural network training algorithm is based on modeling the convolutional neural network as a computational pipeline. The various functions or tasks of the convolutional neural network pipeline have been mapped onto the multiple nodes of a central processing unit based high performance computing cluster for task parallelism. The pipeline implementation provides a first level performance gain through pipeline parallelism. Further performance gains are obtained by distributing the convolutional neural network training onto the different nodes of the compute cluster. The two gains are multiplicative. In this work, the authors have carried out a comparative evaluation of the computational performance and scalability of this pipeline implementation of the convolutional neural network training with a distributed neural network software program which is based on conventional multi-model training and makes use of a centralized server. The dataset considered for this work is the North Eastern University’s hot rolled steel strip surface defects imaging dataset. In both the cases, the convolutional neural networks have been trained to classify the different defects on hot rolled steel strips on the basis of the input image. One hundred images corresponding to each class of defects have been used for the training in order to keep the training times manageable. The hyperparameters of both the convolutional neural networks were kept identical and the programs were run on the same computational cluster to enable fair comparison. Both the convolutional neural network implementations have been observed to train to nearly 80% training accuracy in 200 epochs. In effect, therefore, the comparison is on the time taken to complete the training epochs.

Download Full-text

BLVector: Fast BLAST-Like Algorithm for Manycore CPU With Vectorization

Frontiers in Genetics ◽

10.3389/fgene.2021.618659 ◽

2021 ◽

Vol 12 ◽

Author(s):

Sergio Gálvez ◽

Federico Agostini ◽

Javier Caselli ◽

Pilar Hernandez ◽

Gabriel Dorado

Keyword(s):

Amino Acids ◽

Execution Time ◽

High Performance ◽

Protein Sequences ◽

Processing Unit ◽

Central Processing ◽

Real Scenario ◽

High Level ◽

Performance Computing ◽

Comprehensive Study

New High-Performance Computing architectures have been recently developed for commercial central processing unit (CPU). Yet, that has not improved the execution time of widely used bioinformatics applications, like BLAST+. This is due to a lack of optimization between the bases of the existing algorithms and the internals of the hardware that allows taking full advantage of the available CPU cores. To optimize the new architectures, algorithms must be revised and redesigned; usually rewritten from scratch. BLVector adapts the high-level concepts of BLAST+ to the x86 architectures with AVX-512, to harness their capabilities. A deep comprehensive study has been carried out to optimize the approach, with a significant reduction in time execution. BLVector reduces the execution time of BLAST+ when aligning up to mid-size protein sequences (∼750 amino acids). The gain in real scenario cases is 3.2-fold. When applied to longer proteins, BLVector consumes more time than BLAST+, but retrieves a much larger set of results. BLVector and BLAST+ are fine-tuned heuristics. Therefore, the relevant results returned by both are the same, although they behave differently specially when performing alignments with low scores. Hence, they can be considered complementary bioinformatics tools.

Download Full-text

Effective Implementation of Edge-Preserving Filtering on CPU Microarchitectures

Applied Sciences ◽

10.3390/app8101985 ◽

2018 ◽

Vol 8 (10) ◽

pp. 1985 ◽

Cited By ~ 5

Author(s):

Yoshihiro Maeda ◽

Norishige Fukushima ◽

Hiroshi Matsuo

Keyword(s):

Computational Cost ◽

Bilateral Filter ◽

Processing Unit ◽

Normal Numbers ◽

Edge Preserving ◽

Central Processing ◽

Local Means ◽

Kernel Weights ◽

Computational Performance ◽

Non Local

In this paper, we propose acceleration methods for edge-preserving filtering. The filters natively include denormalized numbers, which are defined in IEEE Standard 754. The processing of the denormalized numbers has a higher computational cost than normal numbers; thus, the computational performance of edge-preserving filtering is severely diminished. We propose approaches to prevent the occurrence of the denormalized numbers for acceleration. Moreover, we verify an effective vectorization of the edge-preserving filtering based on changes in microarchitectures of central processing units by carefully treating kernel weights. The experimental results show that the proposed methods are up to five-times faster than the straightforward implementation of bilateral filtering and non-local means filtering, while the filters maintain the high accuracy. In addition, we showed effective vectorization for each central processing unit microarchitecture. The implementation of the bilateral filter is up to 14-times faster than that of OpenCV. The proposed methods and the vectorization are practical for real-time tasks such as image editing.

Download Full-text

High-performance computing in water resources hydrodynamics

Journal of Hydroinformatics ◽

10.2166/hydro.2020.163 ◽

2020 ◽

Vol 22 (5) ◽

pp. 1217-1235 ◽

Cited By ~ 3

Author(s):

M. Morales-Hernández ◽

M. B. Sharif ◽

S. Gangrade ◽

T. T. Dullo ◽

S.-C. Kao ◽

...

Keyword(s):

Water Resources ◽

High Performance Computing ◽

Graphics Processing Units ◽

High Performance ◽

Large Scale ◽

Test Case ◽

Processing Unit ◽

Central Processing ◽

Graphics Processing ◽

Performance Computing

Abstract This work presents a vision of future water resources hydrodynamics codes that can fully utilize the strengths of modern high-performance computing (HPC). The advances to computing power, formerly driven by the improvement of central processing unit processors, now focus on parallel computing and, in particular, the use of graphics processing units (GPUs). However, this shift to a parallel framework requires refactoring the code to make efficient use of the data as well as changing even the nature of the algorithm that solves the system of equations. These concepts along with other features such as the precision for the computations, dry regions management, and input/output data are analyzed in this paper. A 2D multi-GPU flood code applied to a large-scale test case is used to corroborate our statements and ascertain the new challenges for the next-generation parallel water resources codes.

Download Full-text

BEAGLE 3: Improved Performance, Scaling, and Usability for a High-Performance Computing Library for Statistical Phylogenetics

Systematic Biology ◽

10.1093/sysbio/syz020 ◽

2019 ◽

Vol 68 (6) ◽

pp. 1052-1061 ◽

Cited By ~ 41

Author(s):

Daniel L Ayres ◽

Michael P Cummings ◽

Guy Baele ◽

Aaron E Darling ◽

Paul O Lewis ◽

...

Keyword(s):

High Performance ◽

Evolutionary Model ◽

Data Partitioning ◽

Partial Likelihood ◽

Processing Unit ◽

Data Set ◽

Diverse Range ◽

Central Processing ◽

Automated Method ◽

Software Packages

Abstract BEAGLE is a high-performance likelihood-calculation library for phylogenetic inference. The BEAGLE library defines a simple, but flexible, application programming interface (API), and includes a collection of efficient implementations for calculation under a variety of evolutionary models on different hardware devices. The library has been integrated into recent versions of popular phylogenetics software packages including BEAST and MrBayes and has been widely used across a diverse range of evolutionary studies. Here, we present BEAGLE 3 with new parallel implementations, increased performance for challenging data sets, improved scalability, and better usability. We have added new OpenCL and central processing unit-threaded implementations to the library, allowing the effective utilization of a wider range of modern hardware. Further, we have extended the API and library to support concurrent computation of independent partial likelihood arrays, for increased performance of nucleotide-model analyses with greater flexibility of data partitioning. For better scalability and usability, we have improved how phylogenetic software packages use BEAGLE in multi-GPU (graphics processing unit) and cluster environments, and introduced an automated method to select the fastest device given the data set, evolutionary model, and hardware. For application developers who wish to integrate the library, we also have developed an online tutorial. To evaluate the effect of the improvements, we ran a variety of benchmarks on state-of-the-art hardware. For a partitioned exemplar analysis, we observe run-time performance improvements as high as 5.9-fold over our previous GPU implementation. BEAGLE 3 is free, open-source software licensed under the Lesser GPL and available at https://beagle-dev.github.io.

Download Full-text