Parallelization and sustainability of distributed genetic algorithms on many-core processors

Author(s):  
Yuji Sato ◽  
Mikiko Sato

Purpose – The purpose of this paper is to propose a fault-tolerant technology for increasing the durability of application programs when evolutionary computation is performed by fast parallel processing on many-core processors such as graphics processing units (GPUs) and multi-core processors (MCPs).
Design/methodology/approach – For distributed genetic algorithm (GA) models, the paper proposes a method in which an island's ID number is added to the header of the data transferred by that island, for use in fault detection.
Findings – The paper shows that the processing-time overhead of the proposed idea is practically negligible in applications, that an optimal solution can be obtained even in the presence of a single stuck-at fault or a transient fault, and that increasing the number of parallel threads makes the system less susceptible to faults.
Originality/value – The study described in this paper is a new approach to increasing the sustainability of application programs that use distributed GAs on GPUs and MCPs.
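As a rough illustration of the island-ID header idea, the following Python sketch tags each migrant with its sender's ID in a ring-migration distributed GA; the OneMax fitness, the emigrate/immigrate helpers and all parameters are illustrative assumptions, not the authors' implementation.

```python
import random

NUM_ISLANDS = 4
POP_SIZE = 20
GENOME_LEN = 16

def fitness(genome):
    # Toy objective (OneMax): maximize the number of 1-bits.
    return sum(genome)

def make_island(island_id):
    pop = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
           for _ in range(POP_SIZE)]
    return {"id": island_id, "pop": pop}

def emigrate(island):
    # Prepend the sender island's ID as a header on the migrated individual.
    best = max(island["pop"], key=fitness)
    return {"header": island["id"], "payload": list(best)}

def immigrate(island, message, expected_sender):
    # Fault detection: in a ring topology the receiver knows which island it
    # should hear from; a mismatched header means the transfer was corrupted
    # or misrouted, so the migrant is discarded and the population kept as-is.
    if message["header"] != expected_sender:
        return False
    worst = min(range(POP_SIZE), key=lambda i: fitness(island["pop"][i]))
    island["pop"][worst] = message["payload"]
    return True

islands = [make_island(i) for i in range(NUM_ISLANDS)]
for dst in range(NUM_ISLANDS):
    src = (dst - 1) % NUM_ISLANDS          # ring migration: island src -> dst
    ok = immigrate(islands[dst], emigrate(islands[src]), expected_sender=src)
    print(f"island {dst}: migrant from island {src} accepted={ok}")
```

Because the check is a single integer comparison per migration, its cost is negligible next to fitness evaluation, which is consistent with the overhead finding reported above.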

Energies ◽  
2020 ◽  
Vol 13 (8) ◽  
pp. 2083 ◽  
Author(s):  
Wangqi Xiong ◽  
Jiandong Wang

This paper proposes a parallel grid search algorithm to find an optimal operating point that minimizes the power consumption of an experimental heating, ventilating and air conditioning (HVAC) system. First, a multidimensional, nonlinear and non-convex constrained optimization problem is formulated based on a semi-physical model of the experimental HVAC system. Second, the optimization problem is parallelized on Graphics Processing Units to simultaneously compute the optimization loss function for different candidate solutions in a search grid, taking the optimal solution as the one with the minimum loss. The proposed algorithm has the advantage that the obtained solution is known, with supporting evidence, to be the best one at the current resolution of the search grid. Experimental studies are provided to support the proposed algorithm.
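A minimal CPU sketch of the same pattern, using a process pool in place of the GPU and a toy quadratic loss standing in for the HVAC power model (both are assumptions, not the paper's code):

```python
import itertools
import numpy as np
from multiprocessing import Pool

# Stand-in for the HVAC power-consumption loss: any function that can be
# evaluated independently per grid point parallelizes the same way.
def loss(point):
    x, y = point
    return (x - 1.3) ** 2 + (y + 0.7) ** 2

def parallel_grid_search(resolution=101, workers=4):
    axis = np.linspace(-5.0, 5.0, resolution)
    grid = list(itertools.product(axis, axis))
    with Pool(workers) as pool:
        losses = pool.map(loss, grid)      # one independent evaluation per point
    best = int(np.argmin(losses))
    # The winner is provably the best point *at this grid resolution*, which
    # is the "known with evidence" guarantee described in the abstract.
    return grid[best], losses[best]

if __name__ == "__main__":
    point, value = parallel_grid_search()
    print("optimum near", point, "loss", value)
```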


2012 ◽  
Vol 8 (1) ◽  
pp. 159-174 ◽  
Author(s):  
Sang-Pil Lee ◽  
Deok-Ho Kim ◽  
Jae-Young Yi ◽  
Won-Woo Ro

2017 ◽  
Vol 9 (7) ◽  
pp. 168781401770741 ◽  
Author(s):  
Cheng-Chieh Li ◽  
Chu-Hsing Lin ◽  
Jung-Chun Liu

To solve an NP-hard problem, we can adopt an approximation algorithm that finds a near-optimal solution in order to reduce the execution time. Although this approach produces solutions much faster than brute-force methods, its downside is that in most situations only approximate solutions are found. The genetic algorithm is a global search heuristic and optimization method. Early genetic algorithms had many shortcomings, such as premature convergence and a tendency to converge toward local optima; hence, many parallel genetic algorithms have been proposed to address these problems, and a large body of literature on them now exists, with a variety of parallel genetic algorithms derived. This study exploits the large number of cores on graphics processing units and identifies optimized algorithms suited to their single-instruction, multiple-data architecture. Furthermore, parallel simulated annealing and spheroidizing annealing are used to enhance the performance of the parallel genetic algorithm.
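A compact sketch of this combination, assuming a toy Rastrigin-style fitness: the population is evaluated in data-parallel fashion (the part that maps onto the SIMD architecture of GPUs), and mutation uses a simulated-annealing acceptance rule to resist premature convergence. All names, parameters and the annealing schedule are illustrative.

```python
import math
import random
from concurrent.futures import ProcessPoolExecutor

# Toy Rastrigin-style fitness (higher is better). Each individual is scored
# independently, which is what makes the evaluation step data-parallel.
def fitness(x):
    return -sum(xi * xi - 10 * math.cos(2 * math.pi * xi) + 10 for xi in x)

def mutate(x, temp):
    # SA-flavored mutation: step size shrinks with temperature, and worse
    # offspring are occasionally accepted to escape local optima.
    y = [xi + random.gauss(0, temp) for xi in x]
    delta = fitness(y) - fitness(x)
    if delta > 0 or random.random() < math.exp(delta / max(temp, 1e-9)):
        return y
    return x

def evolve(pop_size=64, dims=8, generations=50):
    pop = [[random.uniform(-5, 5) for _ in range(dims)] for _ in range(pop_size)]
    temp = 1.0
    with ProcessPoolExecutor() as ex:
        for _ in range(generations):
            scores = list(ex.map(fitness, pop))   # data-parallel evaluation
            ranked = [p for _, p in sorted(zip(scores, pop), reverse=True)]
            elite = ranked[: pop_size // 2]
            pop = elite + [mutate(random.choice(elite), temp) for _ in elite]
            temp *= 0.95                          # annealing schedule
    return max(pop, key=fitness)

if __name__ == "__main__":
    best = evolve()
    print("best fitness:", fitness(best))
```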


2018 ◽  
Vol 10 (10) ◽  
pp. 168781401880471 ◽
Author(s):  
Nenzi Wang ◽  
Hsin-Yi Chen ◽  
Yu-Wen Chen

Modern processors with many cores and large caches offer little computational advantage if only serial computing is employed. In this study, several parallel computing approaches, using devices with multiple or many processor cores as well as graphics processing units, are applied and compared to illustrate potential applications in fluid-film lubrication studies. Two Reynolds equations and an air-bearing optimum design are solved using three parallel computing paradigms, OpenMP, Compute Unified Device Architecture (CUDA), and OpenACC, on standalone shared-memory computers. OpenMP is also used to release the computing potential of the newly developed many-integrated-core processors. The results show that OpenACC computing can outperform OpenMP computing for the discretized Reynolds equation with a large gridwork, mainly because of the larger caches available in the tested graphics processing units. The bearing design benefits most when a system with a many-integrated-core processor is used, because such a system can parallelize at the optimization-algorithm level and use its many processor cores effectively. A proper combination of parallel computing devices and programming models can complement efficient numerical methods or optimization algorithms to accelerate many tribological simulations and engineering designs.
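The grid sweep that these paradigms parallelize can be illustrated with a vectorized NumPy sketch of a Jacobi iteration for a constant-film-thickness (Poisson-like) form of the Reynolds equation; the whole-array update is the same one-thread-per-grid-point pattern that OpenMP, CUDA and OpenACC exploit. The reduction to a Poisson problem, the toy source term and all parameters are simplifying assumptions.

```python
import numpy as np

def jacobi_reynolds(rhs, iterations=500, dx=1.0):
    # Jacobi sweep for a discretized Poisson-like pressure equation.
    # Every interior point updates independently from the previous iterate,
    # which is why the sweep maps directly onto GPU threads.
    p = np.zeros_like(rhs)
    for _ in range(iterations):
        p_new = p.copy()
        # Interior points: average of four neighbours minus the source term.
        p_new[1:-1, 1:-1] = 0.25 * (p[:-2, 1:-1] + p[2:, 1:-1] +
                                    p[1:-1, :-2] + p[1:-1, 2:] -
                                    dx * dx * rhs[1:-1, 1:-1])
        p = p_new   # boundaries stay at ambient (zero) pressure
    return p

if __name__ == "__main__":
    n = 128
    rhs = np.zeros((n, n))
    rhs[n // 2, n // 2] = -1.0   # toy source standing in for the wedge term
    p = jacobi_reynolds(rhs)
    print("peak pressure:", p.max())
```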


2011 ◽  
Vol 61 (12) ◽  
pp. 3628-3638 ◽  
Author(s):  
Christian Obrecht ◽  
Frédéric Kuznik ◽  
Bernard Tourancheau ◽  
Jean-Jacques Roux

2017 ◽  
Vol 34 (4) ◽  
pp. 1277-1292 ◽  
Author(s):  
Andre Luis Cavalcanti Bueno ◽  
Noemi de La Rocque Rodriguez ◽  
Elisa Dominguez Sotelino

Purpose – The purpose of this work is to present a methodology that harnesses the computational power of multiple graphics processing units (GPUs) and hides the complexities of tuning GPU parameters from users.
Design/methodology/approach – A methodology for auto-tuning OpenCL configuration parameters has been developed.
Findings – The described process helps simplify coding and yields a significant time gain for each method execution.
Originality/value – Most authors develop their GPU applications for specific hardware configurations. In this work, a solution is offered that makes the developed code portable to any GPU hardware.
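The auto-tuning loop itself can be sketched generically: time a kernel launcher over a set of candidate configurations and keep the fastest. `launch_kernel` below is a hypothetical stand-in for an OpenCL kernel enqueue with a given work-group size, not the authors' API.

```python
import time

def autotune(launch_kernel, candidates, repeats=5):
    # Measure the average wall-clock time of each candidate configuration
    # and return the fastest one plus the full timing table.
    timings = {}
    for config in candidates:
        start = time.perf_counter()
        for _ in range(repeats):
            launch_kernel(config)
        timings[config] = (time.perf_counter() - start) / repeats
    best = min(timings, key=timings.get)
    return best, timings

if __name__ == "__main__":
    # Toy "kernel" whose cost depends on the work-group size, for demo only.
    def launch_kernel(work_group_size):
        time.sleep(0.001 * abs(work_group_size - 64) / 64 + 0.001)

    best, timings = autotune(launch_kernel, candidates=(16, 32, 64, 128, 256))
    print("best work-group size:", best)
```

Caching the winning configuration per device is what makes the same code portable across GPU hardware without manual retuning.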


2014 ◽  
Vol 39 (4) ◽  
pp. 233-248 ◽  
Author(s):  
Milosz Ciznicki ◽  
Krzysztof Kurowski ◽  
Jan Węglarz

Abstract Heterogeneous many-core computing resources are increasingly popular among users due to their improved performance over homogeneous systems. Many developers have realized that heterogeneous systems, e.g. a combination of a shared-memory multi-core CPU machine with massively parallel Graphics Processing Units (GPUs), can provide significant performance opportunities to a wide range of applications. However, the best overall performance can only be achieved if application tasks are efficiently assigned over time to the different types of processing units, taking into account their specific resource requirements. Additionally, one should note that although available heterogeneous resources have been designed as general-purpose units, they come with many built-in features that accelerate specific application operations. In other words, the same algorithm or application functionality can be implemented as a different task for a CPU or a GPU, yet from the perspective of various evaluation criteria, e.g. the total execution time or energy consumption, we may observe completely different results. Therefore, as tasks can be scheduled and managed in many alternative ways on both many-core CPUs and GPUs, with a huge impact on overall resource performance, there is a need for new and improved resource management techniques. In this paper we discuss results from experimental performance studies of selected task scheduling methods in heterogeneous computing systems. Additionally, we present a new architecture for a resource allocation and task scheduling library which provides a generic application programming interface at the operating system level, improving scheduling policies by taking into account the diversity of tasks and the characteristics of heterogeneous computing resources.
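As a sketch of the kind of policy such a library can implement, the following earliest-completion-time heuristic assigns each task to whichever device is expected to finish it first, given per-device cost estimates; the device names, costs and task list are illustrative assumptions, not the paper's library.

```python
def schedule(tasks, devices=("cpu", "gpu")):
    # Earliest-completion-time rule: every task carries per-device cost
    # estimates (the same functionality implemented as a different task for
    # CPU or GPU) and is placed where it would finish soonest.
    free_at = {d: 0.0 for d in devices}     # when each device becomes idle
    plan = []
    for name, costs in tasks:               # costs: estimated time per device
        dev = min(devices, key=lambda d: free_at[d] + costs[d])
        free_at[dev] += costs[dev]
        plan.append((name, dev, free_at[dev]))
    return plan

if __name__ == "__main__":
    tasks = [("fft",      {"cpu": 4.0, "gpu": 1.0}),
             ("io_parse", {"cpu": 1.0, "gpu": 3.0}),
             ("matmul",   {"cpu": 8.0, "gpu": 2.0})]
    for name, dev, finish in schedule(tasks):
        print(f"{name:>8} -> {dev} (done at t={finish})")
```

Swapping the cost estimates for energy figures turns the same loop into an energy-aware policy, which is the kind of criterion diversity the abstract highlights.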


Author(s):  
Amitava Datta ◽  
Amardeep Kaur ◽  
Tobias Lauer ◽  
Sami Chabbouh

Abstract Finding clusters in high-dimensional data is a challenging research problem. Subspace clustering algorithms aim to find clusters in all possible subspaces of the dataset, where a subspace is a subset of the dimensions of the data. But the number of subspaces increases exponentially with the dimensionality of the data, rendering most of these algorithms inefficient as well as ineffective. Moreover, these algorithms have data dependencies ingrained in the clustering process, which makes parallelization difficult and inefficient. SUBSCALE is a recent subspace clustering algorithm which is scalable with the number of dimensions and contains independent processing steps that can be exploited through parallelism. In this paper, we aim to leverage the computational power of widely available multi-core processors to improve the runtime performance of the SUBSCALE algorithm. The experimental evaluation shows linear speedup. Moreover, we develop an approach using graphics processing units (GPUs) for fine-grained data parallelism to accelerate the computation further. First tests of the GPU implementation show very promising results.
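The structure being exploited can be sketched as follows: each dimension's processing step is independent of the others, so the per-dimension work fans out across a process pool. The toy density test below is a stand-in for illustration only, not SUBSCALE's actual signature computation.

```python
import numpy as np
from multiprocessing import Pool

def dense_points_in_dimension(args):
    # Toy 1-D density test: a point is "dense" in a dimension if at least
    # min_pts points (itself included) fall within epsilon of it.
    dim, column, epsilon, min_pts = args
    dense = [i for i in range(len(column))
             if np.sum(np.abs(column - column[i]) <= epsilon) >= min_pts]
    return dim, dense

def per_dimension_density(data, epsilon=0.1, min_pts=5, workers=4):
    # Dimensions are processed independently, so each one is a separate job.
    jobs = [(d, data[:, d], epsilon, min_pts) for d in range(data.shape[1])]
    with Pool(workers) as pool:
        return dict(pool.map(dense_points_in_dimension, jobs))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(500, 8))
    result = per_dimension_density(data)
    print({d: len(pts) for d, pts in result.items()})
```

Because the jobs share nothing, the speedup scales roughly with the number of workers, which matches the linear speedup reported above.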


Author(s):  
Chun-Yuan Lin ◽  
Wei Sheng Lee ◽  
Chuan Yi Tang

Sorting is a classic algorithmic problem, and its importance has led to the design and implementation of various sorting algorithms on many-core graphics processing units (GPUs). CUDPP Radix sort is the most efficient sort on GPUs, and GPU Sample sort is the best comparison-based sort. Although the implementations of these algorithms are efficient, they either need extra space for data rearrangement or rely on atomic operations for acceleration. Sorting applications usually deal with large amounts of data, so memory utilization is an important consideration. Furthermore, on GPUs without atomic operation support, such sorting algorithms can suffer performance degradation or fail to work at all. In this paper, an efficient implementation of a parallel shellsort algorithm, CUDA shellsort, is proposed for many-core GPUs with CUDA. Experimental results show that, on average, CUDA shellsort is nearly twice as fast as GPU quicksort and 37% faster than Thrust mergesort under a uniform distribution. Moreover, its performance matches GPU Sample sort for up to 32 million data elements while needing only constant space. CUDA shellsort is also robust over various data distributions and could be suitable for other many-core architectures.
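The gap-level parallelism behind such an implementation can be sketched on the CPU: for a gap h, the h interleaved chains a[k], a[k+h], a[k+2h], ... touch disjoint indices, so they can be insertion-sorted concurrently, in place and without an auxiliary buffer. On a GPU this would be one thread per chain; the thread pool below is only an illustrative analogue, not the CUDA shellsort code.

```python
from concurrent.futures import ThreadPoolExecutor

def sort_chain(a, start, gap):
    # In-place insertion sort over the chain start, start+gap, start+2*gap, ...
    for i in range(start + gap, len(a), gap):
        key, j = a[i], i - gap
        while j >= start and a[j] > key:
            a[j + gap] = a[j]
            j -= gap
        a[j + gap] = key

def parallel_shellsort(a):
    gap = len(a) // 2
    while gap > 0:
        # Chains for the same gap are disjoint, so sorting them in parallel
        # is race-free; the final gap-1 pass is a plain insertion sort.
        with ThreadPoolExecutor(max_workers=min(gap, 8)) as ex:
            list(ex.map(lambda k: sort_chain(a, k, gap), range(gap)))
        gap //= 2
    return a

if __name__ == "__main__":
    import random
    data = [random.randint(0, 999) for _ in range(100)]
    assert parallel_shellsort(data) == sorted(data)
    print("sorted OK")
```

The in-place chain sort is what gives the constant-space property noted in the abstract, in contrast to merge-based sorts that need a second buffer.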

