Executing linear algebra kernels in heterogeneous distributed infrastructures with PyCOMPSs

Author(s):  
Ramon Amela ◽  
Cristian Ramon-Cortes ◽  
Jorge Ejarque ◽  
Javier Conejero ◽  
Rosa M. Badia

Python is a popular programming language due to the simplicity of its syntax, while still achieving good performance even though it is an interpreted language. Its adoption by multiple scientific communities has led to the emergence of a large number of libraries and modules, which has helped put Python at the top of the list of programming languages [1]. Task-based programming has been proposed in recent years as an alternative parallel programming model. PyCOMPSs follows such an approach for Python, and this paper presents its extensions to combine task-based parallelism and thread-level parallelism. We also present how PyCOMPSs has been adapted to support heterogeneous architectures, including Xeon Phi and GPUs. Results obtained with linear algebra benchmarks demonstrate that significant performance can be obtained with a few lines of Python.
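PyCOMPSs programs mark plain Python functions as tasks and let the runtime schedule them in parallel. As a minimal sketch of that task-based idea using only the standard library (this is not the actual PyCOMPSs `@task` API; a thread pool stands in for the COMPSs runtime), consider a blocked matrix multiply whose block products are independent tasks:

```python
# Hypothetical sketch of task-based parallelism for a linear algebra kernel:
# each block product A[i][k] * B[k][j] is an independent task submitted to a
# pool; the "runtime" (here ThreadPoolExecutor) schedules them concurrently.
from concurrent.futures import ThreadPoolExecutor

def block_multiply(a_block, b_block):
    """One task: multiply two square blocks stored as nested lists."""
    n = len(a_block)
    return [[sum(a_block[i][k] * b_block[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def add_blocks(x, y):
    """Element-wise sum of two blocks."""
    return [[xi + yi for xi, yi in zip(rx, ry)] for rx, ry in zip(x, y)]

def matmul_tasks(A, B, nb):
    """Blocked C = A * B, where A and B are nb x nb grids of blocks."""
    with ThreadPoolExecutor() as pool:
        # Submit every block product as its own task.
        futures = {(i, j): [pool.submit(block_multiply, A[i][k], B[k][j])
                            for k in range(nb)]
                   for i in range(nb) for j in range(nb)}
        # Reduce the partial products for each output block.
        C = {}
        for (i, j), fs in futures.items():
            acc = fs[0].result()
            for f in fs[1:]:
                acc = add_blocks(acc, f.result())
            C[(i, j)] = acc
    return C
```

In PyCOMPSs the same structure would be expressed by decorating `block_multiply` as a task and letting the runtime infer the dependency graph from the data accesses, which is what allows a few lines of Python to scale out.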

Author(s):  
А.А. Федоров ◽  
А.Н. Быков

A method of two-level parallelization of the Thomas algorithm for solving the tridiagonal linear systems that arise when modeling two-dimensional and three-dimensional physical processes is described (thread-level parallelism on shared memory using OpenMP and process-level parallelism on distributed memory using MPI). The features of its implementation both on parallel multiprocessor systems with general-purpose processors and on hybrid systems with multicore Intel Xeon Phi coprocessors are analyzed. The arithmetic complexity of the implemented method is estimated. Numerical results obtained when studying its scalability are discussed.
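The kernel being parallelized is the classical Thomas algorithm. A sequential reference version (a sketch of the standard algorithm, not the paper's two-level OpenMP/MPI implementation) can be written as:

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system by the Thomas algorithm.
    a: sub-diagonal (a[0] unused), b: main diagonal,
    c: super-diagonal (c[-1] unused), d: right-hand side.
    Assumes the system is diagonally dominant (no pivoting)."""
    n = len(b)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    # Forward sweep: eliminate the sub-diagonal.
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    # Back substitution.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The forward sweep carries a recurrence along the diagonal, which is exactly what makes the parallelization across MPI processes and OpenMP threads nontrivial and motivates the two-level scheme studied in the paper.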


2018 ◽  
Vol 18 (01) ◽  
pp. e09
Author(s):  
Adrian Pousa

Most chip multiprocessors (CMPs) are symmetric, i.e., they are composed of identical cores. These CMPs may consist of complex cores (e.g., Intel Haswell or IBM Power8) or simple, lower-power cores (e.g., ARM Cortex-A9 or Intel Xeon Phi). Cores in the former approach have advanced microarchitectural features, such as out-of-order superscalar pipelines, and are suitable for running sequential applications that can use them efficiently. Cores in the latter approach have a simple microarchitecture and are good for running applications with high thread-level parallelism (TLP).


2020 ◽  
Vol 23 (4) ◽  
pp. 866-886
Author(s):  
Vladimir Aleksandrovich Bakhtin ◽  
Dmitry Aleksandrovich Zakharov ◽  
Aleksandr Aleksandrovich Ermichev ◽  
Victor Alekseevich Krukov

DVM-system is designed for the development of parallel programs for scientific and technical calculations in the C-DVMH and Fortran-DVMH languages. These languages share the single DVMH parallel programming model and extend standard C and Fortran with parallelism specifications in the form of compiler directives. The DVMH model makes it possible to create efficient parallel programs for heterogeneous computing clusters whose nodes can use accelerators, graphics processors, or Intel Xeon Phi coprocessors as computing devices alongside general-purpose multi-core processors. The article describes the method of debugging parallel programs in DVM-system, as well as new features of the DVM-debugger.


Author(s):  
Chen Liu ◽  
Xiaobin Li ◽  
Shaoshan Liu ◽  
Jean-Luc Gaudiot

Due to the conventional sequential programming model, the Instruction-Level Parallelism (ILP) that modern superscalar processors can exploit is inherently limited. Hence, multithreading architectures have been proposed to exploit Thread-Level Parallelism (TLP) in addition to conventional ILP. By issuing and executing instructions from multiple threads at each clock cycle, Simultaneous MultiThreading (SMT) achieves some of the best possible system resource utilization and, accordingly, higher instruction throughput. In this chapter, the authors describe the origin of the SMT microarchitecture, comparing it with other multithreading microarchitectures. They identify several key aspects of high-performance SMT design: fetch policy, handling of long-latency instructions, resource sharing control, synchronization, and communication. They also describe some potential benefits of the SMT microarchitecture: SMT for fault tolerance and SMT for secure communications. Given the need to support sequential legacy code and the emergence of new parallel programming models, the authors believe the SMT microarchitecture will play a vital role as we enter the multi-threaded multi/many-core processor design era.


2015 ◽  
Vol 2015 ◽  
pp. 1-11 ◽  
Author(s):  
Jack Dongarra ◽  
Mark Gates ◽  
Azzam Haidar ◽  
Yulu Jia ◽  
Khairul Kabir ◽  
...  

This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore systems with Intel Xeon Phi coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open-source, high-performance library that incorporates the developments presented here and, more broadly, provides DLA functionality equivalent to that of the popular LAPACK library while targeting heterogeneous architectures that feature a mix of multicore CPUs and coprocessors. The LAPACK compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through the use of high-performance BLAS, hardware-specific tuning, and a hybridization methodology whereby we split the algorithm into computational tasks of various granularities. Execution of those tasks is scheduled over the heterogeneous hardware so as to minimize data movement and to map algorithmic requirements onto the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which shields the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.
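As a rough illustration of the kind of kernel involved (an unblocked pure-Python sketch, not MAGMA's blocked, hybrid-scheduled code), LU factorization with partial pivoting solving Ax = b looks like:

```python
def lu_solve(A, b):
    """Solve A x = b by in-place LU factorization with partial pivoting,
    followed by back substitution. Illustrative only: libraries such as
    MAGMA perform the same factorization in blocked form, scheduling
    panel factorizations and trailing-matrix updates as separate tasks
    across the CPU and the coprocessor."""
    n = len(A)
    A = [row[:] for row in A]  # work on copies
    b = b[:]
    for k in range(n):
        # Partial pivoting: bring the largest entry in column k to row k.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        # Eliminate column k below the diagonal (the "trailing update").
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x
```

The trailing-matrix update dominates the arithmetic, which is why the hybridization methodology described above maps it onto the coprocessor while the less parallel panel factorization stays on the CPU.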


2015 ◽  
Vol 2015 ◽  
pp. 1-20 ◽  
Author(s):  
Joo Hwan Lee ◽  
Nimit Nigania ◽  
Hyesoon Kim ◽  
Kaushik Patel ◽  
Hyojong Kim

Utilizing heterogeneous platforms for computation has become a general trend, making the portability issue important. OpenCL (Open Computing Language) serves this purpose by enabling portable execution on heterogeneous architectures. However, unpredictable performance variation on different platforms has become a burden for programmers who write OpenCL applications. This is especially true for conventional multicore CPUs, since the performance of general OpenCL applications on CPUs lags behind the performance of their counterparts written in the conventional parallel programming model for CPUs. In this paper, we evaluate the performance of OpenCL applications on out-of-order multicore CPUs from the architectural perspective. We evaluate OpenCL applications on various aspects, including API overhead, scheduling overhead, instruction-level parallelism, address space, data location, data locality, and vectorization, comparing OpenCL to conventional parallel programming models for CPUs. Our evaluation indicates unique performance characteristics of OpenCL applications and also provides insight into the optimization metrics for better performance on CPUs.


2020 ◽  
Vol 23 (3) ◽  
pp. 247-270
Author(s):  
Valery Fedorovich Aleksahin ◽  
Vladimir Aleksandrovich Bakhtin ◽  
Olga Fedorovna Zhukova ◽  
Dmitry Aleksandrovich Zakharov ◽  
Victor Alekseevich Krukov ◽  
...  

DVM-system is designed for the development of parallel programs for scientific and technical calculations in the C-DVMH and Fortran-DVMH languages. These languages share the single DVMH parallel programming model and extend standard C and Fortran with parallelism specifications in the form of compiler directives. The DVMH model makes it possible to create efficient parallel programs for heterogeneous computing clusters whose nodes can use accelerators, graphics processors, or Intel Xeon Phi coprocessors as computing devices alongside general-purpose multi-core processors. The article presents new features of DVM-system that have been developed recently.


2020 ◽  
Vol 23 (4) ◽  
pp. 594-614
Author(s):  
Vladimir Aleksandrovich Bakhtin ◽  
Dmitry Aleksandrovich Zakharov ◽  
Andrey Nikolaevich Kozlov ◽  
Veniamin Sergeevich Konovalov

DVM-system is designed for the development of parallel programs for scientific and technical calculations in the C-DVMH and Fortran-DVMH languages. These languages share the single DVMH parallel programming model and extend standard C and Fortran with parallelism specifications in the form of compiler directives. The DVMH model makes it possible to create efficient parallel programs for heterogeneous computing clusters whose nodes can use accelerators, graphics processors, or Intel Xeon Phi coprocessors as computing devices alongside general-purpose multi-core processors. The article describes the experience of successfully using DVM-system to develop a parallel software code for solving a problem of radiation magnetic gas dynamics and for studying plasma dynamics in the QSPA channel.


2010 ◽  
Vol 45 (5) ◽  
pp. 345-346 ◽  
Author(s):  
Aparna Chandramowlishwaran ◽  
Kathleen Knobe ◽  
Richard Vuduc
