Parallelization and sustainability of distributed genetic algorithms on many-core processors

Author(s):  
Yuji Sato ◽  
Mikiko Sato

Purpose – The purpose of this paper is to propose a fault-tolerant technology for increasing the durability of application programs when evolutionary computation is performed by fast parallel processing on many-core processors such as graphics processing units (GPUs) and multi-core processors (MCPs).
Design/methodology/approach – For distributed genetic algorithm (GA) models, the paper proposes a method in which an island's ID number is added to the header of the data transferred by that island, for use in fault detection.
Findings – The paper shows that the processing-time overhead of the proposed idea is practically negligible in applications, that an optimal solution can be obtained even in the presence of a single stuck-at fault or a transient fault, and that increasing the number of parallel threads makes the system less susceptible to faults.
Originality/value – The study described in this paper is a new approach to increasing the sustainability of application programs that use distributed GAs on GPUs and MCPs.
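As a rough illustration of the island-ID header idea, the following Python sketch tags each migrant with its sender's ID in a ring-migration distributed GA; the OneMax fitness, the emigrate/immigrate helpers and all parameters are illustrative assumptions, not the authors' implementation.

```python
import random

NUM_ISLANDS = 4
POP_SIZE = 20
GENOME_LEN = 16

def fitness(genome):
    # Toy objective (OneMax): maximize the number of 1-bits.
    return sum(genome)

def make_island(island_id):
    pop = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
           for _ in range(POP_SIZE)]
    return {"id": island_id, "pop": pop}

def emigrate(island):
    # Prepend the sender island's ID as a header on the migrated individual.
    best = max(island["pop"], key=fitness)
    return {"header": island["id"], "payload": list(best)}

def immigrate(island, message, expected_sender):
    # Fault detection: in a ring topology the receiver knows which island it
    # should hear from; a mismatched header means the transfer was corrupted
    # or misrouted, so the migrant is discarded and the population kept as-is.
    if message["header"] != expected_sender:
        return False
    worst = min(range(POP_SIZE), key=lambda i: fitness(island["pop"][i]))
    island["pop"][worst] = message["payload"]
    return True

islands = [make_island(i) for i in range(NUM_ISLANDS)]
for dst in range(NUM_ISLANDS):
    src = (dst - 1) % NUM_ISLANDS          # ring migration: island src -> dst
    ok = immigrate(islands[dst], emigrate(islands[src]), expected_sender=src)
    print(f"island {dst}: migrant from island {src} accepted={ok}")
```

Because the check is a single integer comparison per migration, its cost is negligible next to fitness evaluation, which is consistent with the overhead finding reported above.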

Energies ◽  
2020 ◽  
Vol 13 (8) ◽  
pp. 2083 ◽  
Author(s):  
Wangqi Xiong ◽  
Jiandong Wang

This paper proposes a parallel grid search algorithm to find an optimal operating point that minimizes the power consumption of an experimental heating, ventilating and air conditioning (HVAC) system. First, a multidimensional, nonlinear and non-convex constrained optimization problem is formulated based on a semi-physical model of the experimental HVAC system. Second, the optimization problem is parallelized on Graphics Processing Units to simultaneously compute the optimization loss function for different candidate solutions in a search grid, taking the optimal solution as the one with the minimum loss. The proposed algorithm has the advantage that the obtained solution is known, with supporting evidence, to be the best one at the current resolution of the search grid. Experimental studies are provided to support the proposed algorithm.
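A minimal CPU sketch of the same pattern, using a process pool in place of the GPU and a toy quadratic loss standing in for the HVAC power model (both are assumptions, not the paper's code):

```python
import itertools
import numpy as np
from multiprocessing import Pool

# Stand-in for the HVAC power-consumption loss: any function that can be
# evaluated independently per grid point parallelizes the same way.
def loss(point):
    x, y = point
    return (x - 1.3) ** 2 + (y + 0.7) ** 2

def parallel_grid_search(resolution=101, workers=4):
    axis = np.linspace(-5.0, 5.0, resolution)
    grid = list(itertools.product(axis, axis))
    with Pool(workers) as pool:
        losses = pool.map(loss, grid)      # one independent evaluation per point
    best = int(np.argmin(losses))
    # The winner is provably the best point *at this grid resolution*, which
    # is the "known with evidence" guarantee described in the abstract.
    return grid[best], losses[best]

if __name__ == "__main__":
    point, value = parallel_grid_search()
    print("optimum near", point, "loss", value)
```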


2012 ◽  
Vol 8 (1) ◽  
pp. 159-174 ◽  
Author(s):  
Sang-Pil Lee ◽  
Deok-Ho Kim ◽  
Jae-Young Yi ◽  
Won-Woo Ro

2017 ◽  
Vol 9 (7) ◽  
pp. 168781401770741 ◽  
Author(s):  
Cheng-Chieh Li ◽  
Chu-Hsing Lin ◽  
Jung-Chun Liu

To solve an NP-hard problem, we can adopt an approximation algorithm that finds a near-optimal solution in order to reduce the execution time. Although this approach produces solutions much faster than brute-force methods, its downside is that in most situations only approximate solutions are found. The genetic algorithm is a global search heuristic and optimization method. Early genetic algorithms had many shortcomings, such as premature convergence and a tendency to converge toward local optima; hence, many parallel genetic algorithms have been proposed to address these problems, and a large body of literature on them now exists, with a variety of parallel genetic algorithms derived. This study exploits the large number of cores on graphics processing units and identifies optimized algorithms suited to their single-instruction, multiple-data architecture. Furthermore, parallel simulated annealing and spheroidizing annealing are used to enhance the performance of the parallel genetic algorithm.
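A compact sketch of this combination, assuming a toy Rastrigin-style fitness: the population is evaluated in data-parallel fashion (the part that maps onto the SIMD architecture of GPUs), and mutation uses a simulated-annealing acceptance rule to resist premature convergence. All names, parameters and the annealing schedule are illustrative.

```python
import math
import random
from concurrent.futures import ProcessPoolExecutor

# Toy Rastrigin-style fitness (higher is better). Each individual is scored
# independently, which is what makes the evaluation step data-parallel.
def fitness(x):
    return -sum(xi * xi - 10 * math.cos(2 * math.pi * xi) + 10 for xi in x)

def mutate(x, temp):
    # SA-flavored mutation: step size shrinks with temperature, and worse
    # offspring are occasionally accepted to escape local optima.
    y = [xi + random.gauss(0, temp) for xi in x]
    delta = fitness(y) - fitness(x)
    if delta > 0 or random.random() < math.exp(delta / max(temp, 1e-9)):
        return y
    return x

def evolve(pop_size=64, dims=8, generations=50):
    pop = [[random.uniform(-5, 5) for _ in range(dims)] for _ in range(pop_size)]
    temp = 1.0
    with ProcessPoolExecutor() as ex:
        for _ in range(generations):
            scores = list(ex.map(fitness, pop))   # data-parallel evaluation
            ranked = [p for _, p in sorted(zip(scores, pop), reverse=True)]
            elite = ranked[: pop_size // 2]
            pop = elite + [mutate(random.choice(elite), temp) for _ in elite]
            temp *= 0.95                          # annealing schedule
    return max(pop, key=fitness)

if __name__ == "__main__":
    best = evolve()
    print("best fitness:", fitness(best))
```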


2018 ◽  
Vol 10 (10) ◽  
pp. 168781401880471 ◽
Author(s):  
Nenzi Wang ◽  
Hsin-Yi Chen ◽  
Yu-Wen Chen

Modern processors with many cores and large caches offer little computational advantage if only serial computing is employed. In this study, several parallel computing approaches, using devices with multiple or many processor cores as well as graphics processing units, are applied and compared to illustrate potential applications in fluid-film lubrication studies. Two Reynolds equations and an air-bearing optimum design are solved using three parallel computing paradigms, OpenMP, Compute Unified Device Architecture (CUDA), and OpenACC, on standalone shared-memory computers. OpenMP is also used to release the computing potential of the newly developed many-integrated-core processors. The results show that OpenACC computing can outperform OpenMP computing for the discretized Reynolds equation with a large gridwork, mainly because of the larger caches available in the tested graphics processing units. The bearing design benefits most when a system with a many-integrated-core processor is used, because such a system can parallelize at the optimization-algorithm level and use its many processor cores effectively. A proper combination of parallel computing devices and programming models can complement efficient numerical methods or optimization algorithms to accelerate many tribological simulations and engineering designs.
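The grid sweep that these paradigms parallelize can be illustrated with a vectorized NumPy sketch of a Jacobi iteration for a constant-film-thickness (Poisson-like) form of the Reynolds equation; the whole-array update is the same one-thread-per-grid-point pattern that OpenMP, CUDA and OpenACC exploit. The reduction to a Poisson problem, the toy source term and all parameters are simplifying assumptions.

```python
import numpy as np

def jacobi_reynolds(rhs, iterations=500, dx=1.0):
    # Jacobi sweep for a discretized Poisson-like pressure equation.
    # Every interior point updates independently from the previous iterate,
    # which is why the sweep maps directly onto GPU threads.
    p = np.zeros_like(rhs)
    for _ in range(iterations):
        p_new = p.copy()
        # Interior points: average of four neighbours minus the source term.
        p_new[1:-1, 1:-1] = 0.25 * (p[:-2, 1:-1] + p[2:, 1:-1] +
                                    p[1:-1, :-2] + p[1:-1, 2:] -
                                    dx * dx * rhs[1:-1, 1:-1])
        p = p_new   # boundaries stay at ambient (zero) pressure
    return p

if __name__ == "__main__":
    n = 128
    rhs = np.zeros((n, n))
    rhs[n // 2, n // 2] = -1.0   # toy source standing in for the wedge term
    p = jacobi_reynolds(rhs)
    print("peak pressure:", p.max())
```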


2011 ◽  
Vol 61 (12) ◽  
pp. 3628-3638 ◽  
Author(s):  
Christian Obrecht ◽  
Frédéric Kuznik ◽  
Bernard Tourancheau ◽  
Jean-Jacques Roux

2017 ◽  
Vol 34 (4) ◽  
pp. 1277-1292 ◽  
Author(s):  
Andre Luis Cavalcanti Bueno ◽  
Noemi de La Rocque Rodriguez ◽  
Elisa Dominguez Sotelino

Purpose – The purpose of this work is to present a methodology that harnesses the computational power of multiple graphics processing units (GPUs) and hides the complexities of tuning GPU parameters from users.
Design/methodology/approach – A methodology for auto-tuning OpenCL configuration parameters has been developed.
Findings – The described process helps simplify coding and yields a significant time gain for each method execution.
Originality/value – Most authors develop their GPU applications for specific hardware configurations. In this work, a solution is offered that makes the developed code portable to any GPU hardware.
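The auto-tuning loop itself can be sketched generically: time a kernel launcher over a set of candidate configurations and keep the fastest. `launch_kernel` below is a hypothetical stand-in for an OpenCL kernel enqueue with a given work-group size, not the authors' API.

```python
import time

def autotune(launch_kernel, candidates, repeats=5):
    # Measure the average wall-clock time of each candidate configuration
    # and return the fastest one plus the full timing table.
    timings = {}
    for config in candidates:
        start = time.perf_counter()
        for _ in range(repeats):
            launch_kernel(config)
        timings[config] = (time.perf_counter() - start) / repeats
    best = min(timings, key=timings.get)
    return best, timings

if __name__ == "__main__":
    # Toy "kernel" whose cost depends on the work-group size, for demo only.
    def launch_kernel(work_group_size):
        time.sleep(0.001 * abs(work_group_size - 64) / 64 + 0.001)

    best, timings = autotune(launch_kernel, candidates=(16, 32, 64, 128, 256))
    print("best work-group size:", best)
```

Caching the winning configuration per device is what makes the same code portable across GPU hardware without manual retuning.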


2014 ◽  
Vol 39 (4) ◽  
pp. 233-248 ◽  
Author(s):  
Milosz Ciznicki ◽  
Krzysztof Kurowski ◽  
Jan Węglarz

Abstract Heterogeneous many-core computing resources are increasingly popular among users due to their improved performance over homogeneous systems. Many developers have realized that heterogeneous systems, e.g. a combination of a shared-memory multi-core CPU machine with massively parallel Graphics Processing Units (GPUs), can provide significant performance opportunities to a wide range of applications. However, the best overall performance can only be achieved if application tasks are efficiently assigned over time to the different types of processing units, taking into account their specific resource requirements. Additionally, one should note that although available heterogeneous resources have been designed as general-purpose units, they come with many built-in features that accelerate specific application operations. In other words, the same algorithm or application functionality can be implemented as a different task for a CPU or a GPU, yet from the perspective of various evaluation criteria, e.g. the total execution time or energy consumption, we may observe completely different results. Therefore, as tasks can be scheduled and managed in many alternative ways on both many-core CPUs and GPUs, with a huge impact on overall resource performance, there is a need for new and improved resource management techniques. In this paper we discuss results from experimental performance studies of selected task scheduling methods in heterogeneous computing systems. Additionally, we present a new architecture for a resource allocation and task scheduling library which provides a generic application programming interface at the operating system level, improving scheduling policies by taking into account the diversity of tasks and the characteristics of heterogeneous computing resources.
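As a sketch of the kind of policy such a library can implement, the following earliest-completion-time heuristic assigns each task to whichever device is expected to finish it first, given per-device cost estimates; the device names, costs and task list are illustrative assumptions, not the paper's library.

```python
def schedule(tasks, devices=("cpu", "gpu")):
    # Earliest-completion-time rule: every task carries per-device cost
    # estimates (the same functionality implemented as a different task for
    # CPU or GPU) and is placed where it would finish soonest.
    free_at = {d: 0.0 for d in devices}     # when each device becomes idle
    plan = []
    for name, costs in tasks:               # costs: estimated time per device
        dev = min(devices, key=lambda d: free_at[d] + costs[d])
        free_at[dev] += costs[dev]
        plan.append((name, dev, free_at[dev]))
    return plan

if __name__ == "__main__":
    tasks = [("fft",      {"cpu": 4.0, "gpu": 1.0}),
             ("io_parse", {"cpu": 1.0, "gpu": 3.0}),
             ("matmul",   {"cpu": 8.0, "gpu": 2.0})]
    for name, dev, finish in schedule(tasks):
        print(f"{name:>8} -> {dev} (done at t={finish})")
```

Swapping the cost estimates for energy figures turns the same loop into an energy-aware policy, which is the kind of criterion diversity the abstract highlights.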


Author(s):  
Amitava Datta ◽  
Amardeep Kaur ◽  
Tobias Lauer ◽  
Sami Chabbouh

Abstract Finding clusters in high-dimensional data is a challenging research problem. Subspace clustering algorithms aim to find clusters in all possible subspaces of the dataset, where a subspace is a subset of the dimensions of the data. But the number of subspaces increases exponentially with the dimensionality of the data, rendering most of these algorithms inefficient as well as ineffective. Moreover, these algorithms have data dependencies ingrained in the clustering process, which makes parallelization difficult and inefficient. SUBSCALE is a recent subspace clustering algorithm which is scalable with the number of dimensions and contains independent processing steps that can be exploited through parallelism. In this paper, we aim to leverage the computational power of widely available multi-core processors to improve the runtime performance of the SUBSCALE algorithm. The experimental evaluation shows linear speedup. Moreover, we develop an approach using graphics processing units (GPUs) for fine-grained data parallelism to accelerate the computation further. First tests of the GPU implementation show very promising results.
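The structure being exploited can be sketched as follows: each dimension's processing step is independent of the others, so the per-dimension work fans out across a process pool. The toy density test below is a stand-in for illustration only, not SUBSCALE's actual signature computation.

```python
import numpy as np
from multiprocessing import Pool

def dense_points_in_dimension(args):
    # Toy 1-D density test: a point is "dense" in a dimension if at least
    # min_pts points (itself included) fall within epsilon of it.
    dim, column, epsilon, min_pts = args
    dense = [i for i in range(len(column))
             if np.sum(np.abs(column - column[i]) <= epsilon) >= min_pts]
    return dim, dense

def per_dimension_density(data, epsilon=0.1, min_pts=5, workers=4):
    # Dimensions are processed independently, so each one is a separate job.
    jobs = [(d, data[:, d], epsilon, min_pts) for d in range(data.shape[1])]
    with Pool(workers) as pool:
        return dict(pool.map(dense_points_in_dimension, jobs))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(500, 8))
    result = per_dimension_density(data)
    print({d: len(pts) for d, pts in result.items()})
```

Because the jobs share nothing, the speedup scales roughly with the number of workers, which matches the linear speedup reported above.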


Author(s):  
Chun-Yuan Lin ◽  
Wei Sheng Lee ◽  
Chuan Yi Tang

Sorting is a classic algorithmic problem, and its importance has led to the design and implementation of various sorting algorithms on many-core graphics processing units (GPUs). CUDPP Radix sort is the most efficient sort on GPUs, and GPU Sample sort is the best comparison-based sort. Although the implementations of these algorithms are efficient, they either need extra space for data rearrangement or rely on atomic operations for acceleration. Sorting applications usually deal with large amounts of data, so memory utilization is an important consideration. Furthermore, on GPUs without atomic operation support, such sorting algorithms can suffer performance degradation or fail to work at all. In this paper, an efficient implementation of a parallel shellsort algorithm, CUDA shellsort, is proposed for many-core GPUs with CUDA. Experimental results show that, on average, CUDA shellsort is nearly twice as fast as GPU quicksort and 37% faster than Thrust mergesort under a uniform distribution. Moreover, its performance matches GPU Sample sort for up to 32 million data elements while needing only constant space. CUDA shellsort is also robust over various data distributions and could be suitable for other many-core architectures.
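The gap-level parallelism behind such an implementation can be sketched on the CPU: for a gap h, the h interleaved chains a[k], a[k+h], a[k+2h], ... touch disjoint indices, so they can be insertion-sorted concurrently, in place and without an auxiliary buffer. On a GPU this would be one thread per chain; the thread pool below is only an illustrative analogue, not the CUDA shellsort code.

```python
from concurrent.futures import ThreadPoolExecutor

def sort_chain(a, start, gap):
    # In-place insertion sort over the chain start, start+gap, start+2*gap, ...
    for i in range(start + gap, len(a), gap):
        key, j = a[i], i - gap
        while j >= start and a[j] > key:
            a[j + gap] = a[j]
            j -= gap
        a[j + gap] = key

def parallel_shellsort(a):
    gap = len(a) // 2
    while gap > 0:
        # Chains for the same gap are disjoint, so sorting them in parallel
        # is race-free; the final gap-1 pass is a plain insertion sort.
        with ThreadPoolExecutor(max_workers=min(gap, 8)) as ex:
            list(ex.map(lambda k: sort_chain(a, k, gap), range(gap)))
        gap //= 2
    return a

if __name__ == "__main__":
    import random
    data = [random.randint(0, 999) for _ in range(100)]
    assert parallel_shellsort(data) == sorted(data)
    print("sorted OK")
```

The in-place chain sort is what gives the constant-space property noted in the abstract, in contrast to merge-based sorts that need a second buffer.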

