NUMERICAL IMPLEMENTATION OF A PARALLEL ALGORITHM FOR SOLVING THE PROBLEM OF POLLUTANT TRANSPORT IN A RESERVOIR ON A HIGH-PERFORMANCE COMPUTER SYSTEM

Author(s):  
A. V. Nikitina ◽  
A. E. Chistyakov ◽  
A. M. Atayan

The purpose of this work is to create a software package for the distributed solution of the problem of pollutant transport in a reservoir with complex bathymetry and the presence of technological structures. An algorithm has been developed for the parallel solution of the pollutant transport problem on a graphics accelerator controlled by CUDA (Compute Unified Device Architecture); a comparative analysis of the algorithms running on a CPU (Central Processing Unit) and on a GPU (Graphics Processing Unit) graphics accelerator made it possible to evaluate their performance. The software implementation of the modules included in the package is described, and the main classes and implemented methods are documented. The results of numerical experiments showed that solving the pollutant transport problem with CUDA technology is ineffective for small grids (up to 100 × 100 computational nodes). In the case of large grids (1000 × 1000 computational nodes), the use of CUDA technology reduces the computation time by an order of magnitude. An analysis of the experiments carried out with the developed software components showed that the maximum speedup of the GPU implementation of the shallow-water matter transport problem over the equivalent CPU algorithm was 24.92, achieved on a grid of 1000 × 1000 computational nodes. Methods for decomposition of grid regions are proposed for solving computationally laborious diffusion-convection problems, including the problem of pollutant transport in a reservoir with complex bathymetry and technological objects, taking into account the architecture and parameters of the MSC (Multiprocessor Computing System) deployed at the infrastructure facility of the STU (Scientific and Technological University) “Sirius” (Sochi, Russia). The time required to transmit and receive floating-point data within the computing system was also taken into account. An algorithm for the parallel solution of the problem under MPI (Message Passing Interface) technology has been developed, and its efficiency has been assessed. Speedup values of the proposed algorithm are obtained as a function of the number of computers (processors) involved and the size of the computational grid. The maximum number of computers used is 24, and the maximum size of the computational grid was 10 000 × 10 000 computational nodes. The developed algorithm showed low efficiency for small computational grids (up to 100 × 100 computational nodes). For large computational grids (from 1000 × 1000 computational nodes), the use of MPI reduces the computation time several times.
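The abstract does not include source code, but as an illustration of the kind of per-node update such a GPU solver parallelizes, here is a minimal CUDA sketch of one explicit time step of a 2D convection-diffusion equation on a uniform grid. The coefficient names (u, v, D, dt, h), grid layout, and first-order upwind discretization are assumptions for illustration, not the authors' scheme.

```cuda
#include <cuda_runtime.h>

// Minimal sketch: one explicit step of 2D convection-diffusion
//   dC/dt + u dC/dx + v dC/dy = D (d2C/dx2 + d2C/dy2)
// The names (u, v, D, dt, h) and the first-order upwind scheme are
// illustrative assumptions, not the scheme used in the paper.
__global__ void convectionDiffusionStep(const double* c, double* cNew,
                                        int nx, int ny,
                                        double u, double v, double D,
                                        double dt, double h)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return; // skip boundary nodes

    int idx = j * nx + i;
    double cC = c[idx];
    double cW = c[idx - 1],  cE = c[idx + 1];
    double cS = c[idx - nx], cN = c[idx + nx];

    // Upwind convection + central diffusion.
    double conv = u * (u > 0 ? (cC - cW) : (cE - cC)) / h
                + v * (v > 0 ? (cC - cS) : (cN - cC)) / h;
    double diff = D * (cE + cW + cN + cS - 4.0 * cC) / (h * h);

    cNew[idx] = cC + dt * (diff - conv);
}

// Host-side launch (illustrative): one step on an nx-by-ny grid of device arrays.
void step(const double* dC, double* dCNew, int nx, int ny,
          double u, double v, double D, double dt, double h)
{
    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    convectionDiffusionStep<<<grid, block>>>(dC, dCNew, nx, ny, u, v, D, dt, h);
    cudaDeviceSynchronize();
}
```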

Author(s):  
M. B. Kurmanseiit ◽  
M. S. Tungatarova ◽  
K. A. Alibayeva ◽  
...  

In-Situ Leaching is a method of extracting minerals by selectively dissolving them with a leaching solution directly in the place where the mineral occurs. In practice, during the development of deposits by In-Situ Leaching, situations arise in which the solution tends to sink below the active thickness of the stratum. This may be due to geological heterogeneity of the rock or to gravitational settling of the solution caused by the difference in density between the solution and the groundwater. As the solution settles with depth, recovery of the metal located in the upper geological layers decreases. This article examines the effect of gravity on the flow regime during filtration of the solution through the rock. The gravitational effect on the flow is studied for different ratios of solution and groundwater densities, without taking into account the interaction of the solution with the rock. CUDA technology is used to improve the performance of the calculations. The results show that the use of CUDA technology increases computational performance by a factor of 40–80 compared with calculations on a central processing unit (CPU) for the computational grids considered.
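The 40–80× figures above are ratios of wall-clock times for the same computation on the CPU and the GPU. As a hedged illustration of how such a comparison is typically instrumented (the kernel, problem size, and names below are placeholders, not the authors' code), CUDA events can time the device side while std::chrono times the host side:

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <vector>

// Placeholder kernel standing in for one filtration/transport step.
__global__ void dummyStep(float* field, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i] = 0.25f * field[i] + 1.0f;
}

int main() {
    const int n = 1 << 22;                       // illustrative problem size
    std::vector<float> host(n, 1.0f);

    // --- CPU timing with std::chrono ---
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n; ++i) host[i] = 0.25f * host[i] + 1.0f;
    auto t1 = std::chrono::high_resolution_clock::now();
    double cpuMs = std::chrono::duration<double, std::milli>(t1 - t0).count();

    // --- GPU timing with CUDA events ---
    float* dev = nullptr;
    cudaMalloc((void**)&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    dummyStep<<<(n + 255) / 256, 256>>>(dev, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, start, stop);

    printf("CPU: %.3f ms, GPU: %.3f ms, speedup: %.1fx\n", cpuMs, gpuMs, cpuMs / gpuMs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return 0;
}
```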


Water ◽  
2021 ◽  
Vol 13 (21) ◽  
pp. 3122
Author(s):  
Leonardo Primavera ◽  
Emilia Florio

The possibility to create a flood wave in a river network depends on the geometric properties of the river basin. Among the models that try to forecast the Instantaneous Unit Hydrograph (IUH) of rainfall precipitation, the so-called Multifractal Instantaneous Unit Hydrograph (MIUH) by De Bartolo et al. (2003) rather successfully connects the multifractal properties of the river basin to the observed IUH. Such properties can be assessed through different types of analysis (fixed-size algorithm, correlation integral, fixed-mass algorithm, sandbox algorithm, and so on). The fixed-mass algorithm is the one that produces the most precise estimate of the properties of the multifractal spectrum that are relevant for the MIUH model. However, a disadvantage of this method is that it requires very long computational times to produce the best possible results. In a previous work, we proposed a parallel version of the fixed-mass algorithm, which drastically reduced the computational times almost proportionally to the number of Central Processing Unit (CPU) cores available on the computational machine by using the Message Passing Interface (MPI), which is a standard for distributed-memory clusters. In the present work, we further improved the code in order to include the use of the Open Multi-Processing (OpenMP) paradigm to facilitate the execution and improve the computational speed-up on single-processor, multi-core workstations, which are much more common than multi-node clusters. Moreover, the assessment of the multifractal spectrum has also been improved through a direct computation method. Currently, to the best of our knowledge, this code represents the state-of-the-art for a fast evaluation of the multifractal properties of a river basin, and it opens up a new scenario for an effective flood forecast in reasonable computational times.
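Conceptually, the hybrid scheme described above distributes independent work units over MPI ranks across cluster nodes and then threads each rank's local loop with OpenMP across the cores of a workstation. The skeleton below is a generic, hedged sketch of that pattern with a placeholder work function; it is not the fixed-mass algorithm itself.

```cuda
#include <mpi.h>
#include <omp.h>
#include <cstdio>

// Placeholder for one unit of work (e.g. one mass fraction in a fixed-mass-type sweep).
static double processItem(int item) { return item * 0.5; }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nItems = 1000;                // illustrative number of work items
    double localSum = 0.0;

    // MPI: round-robin distribution of items across ranks.
    // OpenMP: threads share each rank's portion of the loop.
    #pragma omp parallel for reduction(+:localSum)
    for (int item = rank; item < nItems; item += size)
        localSum += processItem(item);

    double globalSum = 0.0;
    MPI_Reduce(&localSum, &globalSum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("result = %f\n", globalSum);

    MPI_Finalize();
    return 0;
}
```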


Sensors ◽  
2018 ◽  
Vol 18 (11) ◽  
pp. 3988
Author(s):  
Wei Luo ◽  
Yang Yuan ◽  
Yi Wang ◽  
Qiuyun Fu ◽  
Hui Xia ◽  
...  

An accurate and fast simulation tool plays an important role in the design of wireless passive impedance-loaded surface acoustic wave (SAW) sensors, which have received much attention recently. This paper presents a finite transducer analysis method for wireless passive impedance-loaded SAW sensors. The method uses a numerically combined finite element method/boundary element method (FEM/BEM) model to analyze non-periodic transducers. For non-periodic transducers, FEM/BEM has been the most accurate analysis method to date; however, it consumes a great deal of central processing unit (CPU) time. This paper presents a faster algorithm for calculating the bulk-wave part of the equation coefficients, which usually requires a long computation time. A complete non-periodic FEM/BEM model of the impedance sensors was constructed. Modifications were made to the final equations in the FEM/BEM model to account for the impedance variation of the sensors. Compared with the conventional method, the proposed method reduces the computation time considerably while maintaining the same high degree of accuracy. Simulations and comparisons with experimental results for test devices demonstrate the effectiveness of the analysis method.


2016 ◽  
Vol 6 (1) ◽  
pp. 79-90
Author(s):  
Łukasz Syrocki ◽  
Grzegorz Pestka

A ready-to-use set of functions for solving the generalized eigenvalue problem for symmetric matrices, allowing efficient calculation of eigenvalues and eigenvectors using Compute Unified Device Architecture (CUDA) technology from NVIDIA, is provided. An integral part of CUDA is a high-level programming environment that makes it possible to trace code executed both on the Central Processing Unit (CPU) and on the Graphics Processing Unit (GPU). The matrix structures presented allow an analysis of the advantages of using graphics processors in such calculations.
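For orientation, NVIDIA's cuSOLVER library exposes a dense generalized symmetric eigensolver covering the same problem class (A x = λ B x with symmetric A and symmetric positive definite B). The sketch below is a hedged illustration of that API for a small double-precision system; it is not the authors' implementation, and the matrices are toy data.

```cuda
#include <cuda_runtime.h>
#include <cusolverDn.h>
#include <cstdio>

// Hedged sketch: solve A x = lambda B x for small symmetric A and SPD B
// with cuSOLVER's dense generalized eigensolver (not the authors' code).
int main() {
    const int n = 3, lda = 3, ldb = 3;
    // Column-major symmetric A and symmetric positive definite B (identity).
    double A[9] = {2, 1, 0,  1, 2, 1,  0, 1, 2};
    double B[9] = {1, 0, 0,  0, 1, 0,  0, 0, 1};
    double W[3];                                   // eigenvalues

    double *dA, *dB, *dW;
    int *dInfo;
    cudaMalloc((void**)&dA, sizeof(A));
    cudaMalloc((void**)&dB, sizeof(B));
    cudaMalloc((void**)&dW, sizeof(W));
    cudaMalloc((void**)&dInfo, sizeof(int));
    cudaMemcpy(dA, A, sizeof(A), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(B), cudaMemcpyHostToDevice);

    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    int lwork = 0;
    cusolverDnDsygvd_bufferSize(handle, CUSOLVER_EIG_TYPE_1, CUSOLVER_EIG_MODE_VECTOR,
                                CUBLAS_FILL_MODE_LOWER, n, dA, lda, dB, ldb, dW, &lwork);
    double *dWork;
    cudaMalloc((void**)&dWork, lwork * sizeof(double));

    // Eigenvalues are written to dW; eigenvectors overwrite dA.
    cusolverDnDsygvd(handle, CUSOLVER_EIG_TYPE_1, CUSOLVER_EIG_MODE_VECTOR,
                     CUBLAS_FILL_MODE_LOWER, n, dA, lda, dB, ldb, dW,
                     dWork, lwork, dInfo);

    cudaMemcpy(W, dW, sizeof(W), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("lambda[%d] = %f\n", i, W[i]);

    cudaFree(dA); cudaFree(dB); cudaFree(dW); cudaFree(dWork); cudaFree(dInfo);
    cusolverDnDestroy(handle);
    return 0;
}
```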


2014 ◽  
Vol 6 (2) ◽  
pp. 129-133
Author(s):  
Evaldas Borcovas ◽  
Gintautas Daunys

Image processing, computer vision, and other complicated optical-information processing algorithms require large computational resources, and it is often desired to execute them in real time. It is hard to fulfill such requirements with a single CPU. NVidia's CUDA technology enables the programmer to use the GPU resources in the computer. The current research was carried out with an Intel Pentium Dual-Core T4500 2.3 GHz processor with 4 GB DDR3 RAM (CPU I), an NVidia GeForce GT320M CUDA-compatible graphics card (GPU I), an Intel Core i5-2500K 3.3 GHz processor with 4 GB DDR3 RAM (CPU II), and an NVidia GeForce GTX 560 CUDA-compatible graphics card (GPU II). The OpenCV 2.1 and CUDA-compatible OpenCV 2.4.0 libraries were used for the testing. The main tests were made with the standard MatchTemplate function from the OpenCV libraries. The algorithm uses a main image and a template, and the influence of both factors was tested: the main image and the template were resized, and the algorithm's computing time and performance in Gtpix/s were measured. According to the results obtained, GPU computing on the hardware mentioned above is up to 24 times faster than the CPU when processing a large amount of information. When the images are small, the performance of the CPU and GPU does not differ significantly. The choice of template size influences the CPU computation. The difference in computing time between the two GPUs can be explained by the number of cores they have: in our case the faster GPU had 16 times more cores, and its computations ran about 16 times faster.
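As a rough illustration of the test setup described above (not the authors' benchmark code), OpenCV 2.4's gpu module exposes the same template-matching operation as the CPU API, so the two paths can be timed side by side; the image file names and the matching method below are assumptions.

```cuda
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <cstdio>

// Hedged sketch of a CPU vs GPU MatchTemplate comparison (OpenCV 2.4-era API).
int main() {
    cv::Mat image = cv::imread("scene.png", CV_LOAD_IMAGE_GRAYSCALE);   // assumed input files
    cv::Mat templ = cv::imread("template.png", CV_LOAD_IMAGE_GRAYSCALE);
    if (image.empty() || templ.empty()) return 1;

    // CPU path.
    cv::Mat resultCpu;
    double t0 = (double)cv::getTickCount();
    cv::matchTemplate(image, templ, resultCpu, CV_TM_CCORR_NORMED);
    double cpuMs = ((double)cv::getTickCount() - t0) / cv::getTickFrequency() * 1000.0;

    // GPU path (OpenCV 2.4 gpu module).
    cv::gpu::GpuMat dImage(image), dTempl(templ), dResult;
    t0 = (double)cv::getTickCount();
    cv::gpu::matchTemplate(dImage, dTempl, dResult, CV_TM_CCORR_NORMED);
    double gpuMs = ((double)cv::getTickCount() - t0) / cv::getTickFrequency() * 1000.0;

    printf("CPU: %.2f ms, GPU: %.2f ms\n", cpuMs, gpuMs);
    return 0;
}
```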


2021 ◽  
Vol 8 (2) ◽  
pp. 169-180
Author(s):  
Mark Lin ◽  
Periklis Papadopoulos

Computational methods such as Computational Fluid Dynamics (CFD) traditionally yield a single output, a single number much like the result one would get from a theoretical hand calculation. However, this paper shows that computational methods have inherent uncertainty, which can also be reported statistically. In numerical computation, because many factors affect the data collected, the data can be quoted in terms of standard deviations (error bars) along with a mean value to make data comparison meaningful. In cases where two data sets are obscured by uncertainty, the two data sets are said to be indistinguishable. A sample CFD problem pertaining to external aerodynamics was copied and run on 29 identical computers in a university computer lab. The expectation is that all 29 runs should return exactly the same result; unfortunately, in a few cases the result turns out to be different. This is attributed to the parallelization scheme, which partitions the mesh to run in parallel on multiple cores of the computer. The distribution of the computational load is hardware-driven, depending on the available resources of each computer at the time. Details such as load balancing among multiple Central Processing Unit (CPU) cores using the Message Passing Interface (MPI) are transparent to the user. Partitioning software such as METIS or JOSTLE is used to automatically divide the load between different processors. As such, the user has no control over the outcome of the CFD calculation even when the same problem is computed. Because of this, numerical uncertainty arises from parallel (multicore) computing. One way to resolve this issue is to compute problems using a single core, without mesh repartitioning. However, as this paper demonstrates, even this is not straightforward. Keywords: numerical uncertainty, parallelization, load-balancing, automotive aerodynamics
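The nondeterminism described above ultimately comes down to the fact that floating-point addition is not associative: a residual or force summed in a different order (because the mesh was partitioned differently) gives a slightly different value. A minimal, self-contained illustration:

```cuda
#include <cstdio>

// Floating-point addition is not associative: summing the same values in a
// different order (as happens when a mesh is partitioned differently across
// cores) can change the result in the last representable digits.
int main() {
    float big = 1.0e8f, small = 5.0f;

    float leftToRight = (big + small) + small;   // small terms absorbed one at a time
    float smallFirst  = big + (small + small);   // small terms combined first

    printf("left-to-right: %.1f\n", leftToRight);
    printf("small first:   %.1f\n", smallFirst);
    // The two printed values differ slightly even though the mathematical
    // sum is identical, because each partial sum is rounded to float precision.
    return 0;
}
```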


2014 ◽  
Vol 2014 ◽  
pp. 1-8 ◽  
Author(s):  
Ronglin Jiang ◽  
Shugang Jiang ◽  
Yu Zhang ◽  
Ying Xu ◽  
Lei Xu ◽  
...  

This paper introduces a finite-difference time-domain (FDTD) code written in Fortran and CUDA for realistic electromagnetic calculations, with parallelization by the Message Passing Interface (MPI) and Open Multiprocessing (OpenMP). Since both Central Processing Unit (CPU) and Graphics Processing Unit (GPU) resources are utilized, a faster execution speed can be reached compared with a traditional pure GPU code. In our experiments, 64 NVIDIA TESLA K20m GPUs and 64 INTEL XEON E5-2670 CPUs are used to carry out the pure CPU, pure GPU, and CPU + GPU tests. Relative to the pure CPU calculations for the same problems, the speedup ratio achieved by CPU + GPU calculations is around 14. Compared with the pure GPU calculations for the same problems, the CPU + GPU calculations show a 7.6%–13.2% performance improvement. Because of the small memory size of GPUs, the FDTD problem size is usually very limited; however, this code can enlarge the maximum problem size by 25% without reducing the performance of a traditional pure GPU code. Finally, using this code, a microstrip antenna array with 16 × 18 elements is calculated and the radiation patterns are compared with those obtained by the Method of Moments (MoM). The results show good agreement between them.
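For readers unfamiliar with the structure being parallelized, the core of an FDTD time step is a pair of stencil updates (E from H, then H from E). The paper's code is Fortran/CUDA; the block below is only a minimal CUDA C++ sketch of a 1D version with simplified, normalized update coefficients and no boundary treatment, not the code described above.

```cuda
#include <cuda_runtime.h>

// Minimal 1D FDTD sketch (normalized units): E and H live on staggered grids.
// Coefficients and boundary handling are simplified for illustration.
__global__ void updateE(float* ez, const float* hy, int n, float ce) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 1 && i < n)
        ez[i] += ce * (hy[i] - hy[i - 1]);
}

__global__ void updateH(float* hy, const float* ez, int n, float ch) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - 1)
        hy[i] += ch * (ez[i + 1] - ez[i]);
}

// One full time step: E update, then H update. A host loop would repeat this,
// exchanging halo cells over MPI when the domain is split across nodes.
void fdtdStep(float* dEz, float* dHy, int n, float ce, float ch) {
    int block = 256, grid = (n + block - 1) / block;
    updateE<<<grid, block>>>(dEz, dHy, n, ce);
    updateH<<<grid, block>>>(dHy, dEz, n, ch);
}
```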


2012 ◽  
Vol 2012 ◽  
pp. 1-11 ◽  
Author(s):  
J. Cabello ◽  
J. E. Gillam ◽  
M. Rafecas

Statistical iterative methods are a widely used approach to image reconstruction in emission tomography. Traditionally, the image space is modelled as a combination of cubic voxels as a matter of simplicity. After reconstruction, images are routinely filtered to reduce statistical noise at the cost of spatial resolution degradation. An alternative way to produce lower noise during reconstruction is to model the image space with spherical basis functions. These basis functions overlap in space, producing a very large number of non-zero elements in the system response matrix (SRM) to store, which additionally leads to long reconstruction times. These two problems are partly overcome by exploiting spherical symmetries, although computation time is still longer than with non-overlapping basis functions. In this work, we have implemented the reconstruction algorithm using Graphics Processing Unit (GPU) technology for speed and a precomputed Monte-Carlo-calculated SRM for accuracy. Reconstruction using spherical basis functions on a GPU was 4.3 times faster than on the Central Processing Unit (CPU) and 2.5 times faster than a multi-core parallel CPU implementation using eight cores. Overwriting hazards are minimized by combining a random line-of-response ordering with constrained atomic writing. Small differences in image quality were observed between implementations.
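The "constrained atomic writing" mentioned above addresses the classic hazard of back-projection with overlapping basis functions: threads processing different lines of response may update the same image coefficient at the same time. The following is a hedged CUDA sketch of the safe accumulation pattern; the data layout and names are assumptions, not the authors' reconstruction code.

```cuda
#include <cuda_runtime.h>

// Hedged sketch: back-projection style accumulation where several threads
// (one per line-of-response sample, LOR) may touch the same basis-function
// coefficient. atomicAdd serializes only the conflicting updates.
__global__ void backproject(const int* lorToBasis,   // basis index hit by each LOR sample
                            const float* lorValue,   // value to deposit for that sample
                            float* image,            // accumulated coefficients
                            int nSamples)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= nSamples) return;

    // Without atomics, two LORs hitting the same overlapping spherical basis
    // function could produce a lost update (read-modify-write race).
    atomicAdd(&image[lorToBasis[s]], lorValue[s]);
}
```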


Sensors ◽  
2018 ◽  
Vol 18 (12) ◽  
pp. 4159
Author(s):  
Weikang Zhang ◽  
Desheng Wen ◽  
Zongxi Song ◽  
Xin Wei ◽  
Gang Liu ◽  
...  

High-resolution spectrum estimation has continually attracted great attention in spectrum reconstruction based on Fourier transform imaging spectroscopy (FTIS). In this paper, a parallel solution for interference data processing using high-resolution spectrum estimation is proposed to reconstruct the spectrum quickly and with high resolution. For batch processing, we use high-performance parallel computing on the graphics processing unit (GPU) for higher efficiency and shorter run time. In addition, a parallel processing mechanism is designed for our algorithm to obtain higher performance. Other solving algorithms for the modern spectrum estimation model are also introduced for discussion and comparison. We compare traditional high-resolution algorithms running on the central processing unit (CPU) with the parallel algorithm on the GPU for processing the interferogram. The experimental results show that runtime is reduced by about 70% using our parallel solution, and that the GPU has a great advantage in processing large data volumes and accelerating applications.
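For context, the conventional FTIS pipeline recovers the spectrum from each interferogram with a Fourier transform, and on the GPU this baseline step is naturally batched. The sketch below uses cuFFT's batched real-to-complex transform; it is an illustrative assumption about that Fourier baseline, not the high-resolution estimator proposed in the paper.

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

// Hedged sketch: batched FFT of many interferograms (the Fourier baseline of
// FTIS processing, not the paper's high-resolution spectrum estimator).
// nPoints samples per interferogram, nBatch interferograms processed at once.
void batchedSpectra(const float* dInterferograms,  // nBatch * nPoints reals (device)
                    cufftComplex* dSpectra,        // nBatch * (nPoints/2 + 1) bins (device)
                    int nPoints, int nBatch)
{
    cufftHandle plan;
    // One plan covers the whole batch; cuFFT schedules the transforms together.
    cufftPlan1d(&plan, nPoints, CUFFT_R2C, nBatch);
    cufftExecR2C(plan, const_cast<cufftReal*>(dInterferograms), dSpectra);
    cudaDeviceSynchronize();
    cufftDestroy(plan);
}
```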


SPE Journal ◽  
2014 ◽  
Vol 19 (04) ◽  
pp. 716-725 ◽  
Author(s):  
Larry S.K. Fung ◽  
Mohammad O. Sindi ◽  
Ali H. Dogru

Summary With the advent of the multicore central-processing unit (CPU), today's commodity PC clusters are effectively a collection of interconnected parallel computers, each with multiple multicore CPUs and large shared random access memory (RAM), connected together by means of high-speed networks. Each computer, referred to as a compute node, is a powerful parallel computer on its own. Each compute node can be equipped further with acceleration devices such as the general-purpose graphical processing unit (GPGPU) to further speed up computation-intensive portions of the simulator. Reservoir-simulation methods that can exploit this heterogeneous hardware system can be used to solve very-large-scale reservoir-simulation models and run significantly faster than conventional simulators. Because typical PC clusters are essentially distributed shared-memory computers, this suggests that the use of mixed-paradigm parallelism (distributed-shared memory), such as message-passing interface and open multiprocessing (MPI-OMP), should work well for computational efficiency and memory use. In this work, we compare and contrast the single-paradigm programming models, MPI or OMP, with the mixed-paradigm, MPI-OMP, programming model for a class of solver methods that is suited for the different modes of parallelism. The results showed that the distributed-memory (MPI-only) model has superior multicompute-node scalability, whereas the shared-memory (OMP-only) model has superior parallel performance on a single compute node. The mixed MPI-OMP model and OMP-only model are more memory-efficient for the multicore architecture than the MPI-only model because they require less or no halo-cell storage for the subdomains. To exploit the fine-grain shared-memory parallelism available on the GPGPU architecture, algorithms should be suited to single-instruction multiple-data (SIMD) parallelism, and any recursive operations are serialized. In addition, solver methods and data storage need to be reworked to coalesce memory access and to avoid shared memory-bank conflicts. Wherever possible, the cost of data transfer through the peripheral component interconnect express (PCIe) bus between the CPU and GPGPU needs to be hidden by means of asynchronous communication. We applied multiparadigm parallelism to accelerate compositional reservoir simulation on a GPGPU-equipped PC cluster. On a dual-CPU-dual-GPGPU compute node, the parallelized solver running on the dual-GPGPU Fermi M2090Q achieved up to 19 times speedup over the serial CPU (1-core) results and up to 3.7 times speedup over the parallel dual-CPU X5675 results in a mixed MPI + OMP paradigm for a 1.728-million-cell compositional model. Parallel performance shows a strong dependency on the subdomain sizes. Parallel CPU solve has a higher performance for smaller domain partitions, whereas GPGPU solve requires large partitions for each chip for good parallel performance. This is related to improved cache efficiency on the CPU for small subdomains and the loading requirement for massive parallelism on the GPGPU. Therefore, for a given model, the multinode parallel performance decreases for the GPGPU relative to the CPU as the model is further subdivided into smaller subdomains to be solved on more compute nodes. To illustrate this, a modified SPE5 (Killough and Kossack 1987) model with various grid dimensions was run to generate comparative results.
Parallel performances for three field compositional models of various sizes and dimensions are included to further elucidate and contrast CPU-GPGPU single-node and multiple-node performances. A PC cluster with the Tesla M2070Q GPGPU and the 6-core Xeon X5675 Westmere was used to produce the majority of the reported results. Another PC cluster with the Tesla M2090Q GPGPU was available for some cases, and the results are reported for the modified SPE5 (Killough and Kossack 1987) problems for comparison.
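The transfer-hiding the authors describe relies on pinned host memory and asynchronous copies issued on CUDA streams, so that the PCIe transfer of one block of cells overlaps with the kernel working on another. Below is a minimal, hedged double-buffering sketch with a placeholder kernel and sizes; it is a generic pattern, not the simulator's code.

```cuda
#include <cuda_runtime.h>

__global__ void solveChunk(float* data, int n) {        // placeholder solver kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 0.5f;
}

// Hedged sketch: overlap PCIe transfers with computation via two streams and
// pinned host memory (double buffering), as needed to hide CPU-GPGPU copies.
void processChunks(float* hostData, int nChunks, int chunkSize) {
    float* pinned;
    cudaMallocHost((void**)&pinned, (size_t)nChunks * chunkSize * sizeof(float)); // pinned staging
    // In real code the data would already live in pinned memory; copied here for illustration.
    cudaMemcpy(pinned, hostData, (size_t)nChunks * chunkSize * sizeof(float),
               cudaMemcpyHostToHost);

    float* dBuf[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc((void**)&dBuf[b], chunkSize * sizeof(float));
        cudaStreamCreate(&stream[b]);
    }

    for (int c = 0; c < nChunks; ++c) {
        int b = c & 1;                                   // alternate buffers/streams
        float* hChunk = pinned + (size_t)c * chunkSize;
        cudaMemcpyAsync(dBuf[b], hChunk, chunkSize * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
        solveChunk<<<(chunkSize + 255) / 256, 256, 0, stream[b]>>>(dBuf[b], chunkSize);
        cudaMemcpyAsync(hChunk, dBuf[b], chunkSize * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[b]);
        // While stream b works on chunk c, the other stream can already be
        // transferring or computing the previous/next chunk.
    }
    for (int b = 0; b < 2; ++b) {
        cudaStreamSynchronize(stream[b]);
        cudaStreamDestroy(stream[b]);
        cudaFree(dBuf[b]);
    }
    cudaMemcpy(hostData, pinned, (size_t)nChunks * chunkSize * sizeof(float),
               cudaMemcpyHostToHost);
    cudaFreeHost(pinned);
}
```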

