Executing linear algebra kernels in heterogeneous distributed infrastructures with PyCOMPSs

Author(s):  
Ramon Amela ◽  
Cristian Ramon-Cortes ◽  
Jorge Ejarque ◽  
Javier Conejero ◽  
Rosa M. Badia

Python is a popular programming language due to the simplicity of its syntax, while still achieving good performance even though it is an interpreted language. Its adoption by multiple scientific communities has led to the emergence of a large number of libraries and modules, which has helped put Python at the top of the list of programming languages [1]. Task-based programming has been proposed in recent years as an alternative parallel programming model. PyCOMPSs follows such an approach for Python, and this paper presents its extensions to combine task-based parallelism and thread-level parallelism. We also present how PyCOMPSs has been adapted to support heterogeneous architectures, including Xeon Phi and GPUs. Results obtained with linear algebra benchmarks demonstrate that significant performance can be obtained with a few lines of Python.
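PyCOMPSs programs mark plain Python functions as tasks and let the runtime schedule them in parallel. As a minimal sketch of that task-based idea using only the standard library (this is not the actual PyCOMPSs `@task` API; a thread pool stands in for the COMPSs runtime), consider a blocked matrix multiply whose block products are independent tasks:

```python
# Hypothetical sketch of task-based parallelism for a linear algebra kernel:
# each block product A[i][k] * B[k][j] is an independent task submitted to a
# pool; the "runtime" (here ThreadPoolExecutor) schedules them concurrently.
from concurrent.futures import ThreadPoolExecutor

def block_multiply(a_block, b_block):
    """One task: multiply two square blocks stored as nested lists."""
    n = len(a_block)
    return [[sum(a_block[i][k] * b_block[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def add_blocks(x, y):
    """Element-wise sum of two blocks."""
    return [[xi + yi for xi, yi in zip(rx, ry)] for rx, ry in zip(x, y)]

def matmul_tasks(A, B, nb):
    """Blocked C = A * B, where A and B are nb x nb grids of blocks."""
    with ThreadPoolExecutor() as pool:
        # Submit every block product as its own task.
        futures = {(i, j): [pool.submit(block_multiply, A[i][k], B[k][j])
                            for k in range(nb)]
                   for i in range(nb) for j in range(nb)}
        # Reduce the partial products for each output block.
        C = {}
        for (i, j), fs in futures.items():
            acc = fs[0].result()
            for f in fs[1:]:
                acc = add_blocks(acc, f.result())
            C[(i, j)] = acc
    return C
```

In PyCOMPSs the same structure would be expressed by decorating `block_multiply` as a task and letting the runtime infer the dependency graph from the data accesses, which is what allows a few lines of Python to scale out.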

Author(s):  
А.А. Федоров ◽  
А.Н. Быков

A method of two-level parallelization of the Thomas algorithm for solving the tridiagonal linear systems that arise when modeling two-dimensional and three-dimensional physical processes is described (thread-level parallelism on shared memory using OpenMP and process-level parallelism on distributed memory using MPI). The features of its implementation both on parallel multiprocessor systems with general-purpose processors and on hybrid systems with multicore Intel Xeon Phi coprocessors are analyzed. The arithmetic complexity of the implemented method is estimated. Numerical results obtained when studying its scalability are discussed.
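The kernel being parallelized is the classical Thomas algorithm. A sequential reference version (a sketch of the standard algorithm, not the paper's two-level OpenMP/MPI implementation) can be written as:

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system by the Thomas algorithm.
    a: sub-diagonal (a[0] unused), b: main diagonal,
    c: super-diagonal (c[-1] unused), d: right-hand side.
    Assumes the system is diagonally dominant (no pivoting)."""
    n = len(b)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    # Forward sweep: eliminate the sub-diagonal.
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    # Back substitution.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The forward sweep carries a recurrence along the diagonal, which is exactly what makes the parallelization across MPI processes and OpenMP threads nontrivial and motivates the two-level scheme studied in the paper.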


2018 ◽  
Vol 18 (01) ◽  
pp. e09
Author(s):  
Adrian Pousa

Most chip multiprocessors (CMPs) are symmetric, i.e., they are composed of identical cores. These CMPs may consist of complex cores (e.g., Intel Haswell or IBM Power8) or simple, lower-power cores (e.g., ARM Cortex-A9 or Intel Xeon Phi). Cores in the former approach have advanced microarchitectural features, such as out-of-order superscalar pipelines, and are suitable for running sequential applications that can use them efficiently. Cores in the latter approach have a simple microarchitecture and are good for running applications with high thread-level parallelism (TLP).


2020 ◽  
Vol 23 (4) ◽  
pp. 866-886
Author(s):  
Vladimir Aleksandrovich Bakhtin ◽  
Dmitry Aleksandrovich Zakharov ◽  
Aleksandr Aleksandrovich Ermichev ◽  
Victor Alekseevich Krukov

DVM-system is designed for the development of parallel programs for scientific and technical calculations in the C-DVMH and Fortran-DVMH languages. These languages share the single DVMH parallel programming model and extend standard C and Fortran with parallelism specifications in the form of compiler directives. The DVMH model makes it possible to create efficient parallel programs for heterogeneous computing clusters whose nodes can use accelerators, graphics processors, or Intel Xeon Phi coprocessors as computing devices alongside general-purpose multi-core processors. The article describes the method of debugging parallel programs in DVM-system, as well as new features of the DVM-debugger.


Author(s):  
Chen Liu ◽  
Xiaobin Li ◽  
Shaoshan Liu ◽  
Jean-Luc Gaudiot

Due to the conventional sequential programming model, the Instruction-Level Parallelism (ILP) that modern superscalar processors can exploit is inherently limited. Hence, multithreading architectures have been proposed to exploit Thread-Level Parallelism (TLP) in addition to conventional ILP. By issuing and executing instructions from multiple threads at each clock cycle, Simultaneous MultiThreading (SMT) achieves some of the best possible system resource utilization and, accordingly, higher instruction throughput. In this chapter, the authors describe the origin of the SMT microarchitecture, comparing it with other multithreading microarchitectures. They identify several key aspects of high-performance SMT design: fetch policy, handling of long-latency instructions, resource sharing control, synchronization, and communication. They also describe some potential benefits of the SMT microarchitecture: SMT for fault tolerance and SMT for secure communications. Given the need to support sequential legacy code and the emergence of new parallel programming models, the authors believe the SMT microarchitecture will play a vital role as we enter the multi-threaded multi/many-core processor design era.


2015 ◽  
Vol 2015 ◽  
pp. 1-11 ◽  
Author(s):  
Jack Dongarra ◽  
Mark Gates ◽  
Azzam Haidar ◽  
Yulu Jia ◽  
Khairul Kabir ◽  
...  

This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore systems with Intel Xeon Phi coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open-source, high-performance library that incorporates the developments presented here and, more broadly, provides DLA functionality equivalent to that of the popular LAPACK library while targeting heterogeneous architectures that feature a mix of multicore CPUs and coprocessors. The LAPACK compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through the use of high-performance BLAS, hardware-specific tuning, and a hybridization methodology whereby we split the algorithm into computational tasks of various granularities. Execution of those tasks is scheduled over the heterogeneous hardware so as to minimize data movement and to map algorithmic requirements onto the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which shields the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.
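As a rough illustration of the kind of kernel involved (an unblocked pure-Python sketch, not MAGMA's blocked, hybrid-scheduled code), LU factorization with partial pivoting solving Ax = b looks like:

```python
def lu_solve(A, b):
    """Solve A x = b by in-place LU factorization with partial pivoting,
    followed by back substitution. Illustrative only: libraries such as
    MAGMA perform the same factorization in blocked form, scheduling
    panel factorizations and trailing-matrix updates as separate tasks
    across the CPU and the coprocessor."""
    n = len(A)
    A = [row[:] for row in A]  # work on copies
    b = b[:]
    for k in range(n):
        # Partial pivoting: bring the largest entry in column k to row k.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        # Eliminate column k below the diagonal (the "trailing update").
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x
```

The trailing-matrix update dominates the arithmetic, which is why the hybridization methodology described above maps it onto the coprocessor while the less parallel panel factorization stays on the CPU.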


2015 ◽  
Vol 2015 ◽  
pp. 1-20 ◽  
Author(s):  
Joo Hwan Lee ◽  
Nimit Nigania ◽  
Hyesoon Kim ◽  
Kaushik Patel ◽  
Hyojong Kim

Utilizing heterogeneous platforms for computation has become a general trend, making the portability issue important. OpenCL (Open Computing Language) serves this purpose by enabling portable execution on heterogeneous architectures. However, unpredictable performance variation on different platforms has become a burden for programmers who write OpenCL applications. This is especially true for conventional multicore CPUs, since the performance of general OpenCL applications on CPUs lags behind the performance of their counterparts written in the conventional parallel programming model for CPUs. In this paper, we evaluate the performance of OpenCL applications on out-of-order multicore CPUs from the architectural perspective. We evaluate OpenCL applications on various aspects, including API overhead, scheduling overhead, instruction-level parallelism, address space, data location, data locality, and vectorization, comparing OpenCL to conventional parallel programming models for CPUs. Our evaluation indicates unique performance characteristics of OpenCL applications and also provides insight into the optimization metrics for better performance on CPUs.


2020 ◽  
Vol 23 (3) ◽  
pp. 247-270
Author(s):  
Valery Fedorovich Aleksahin ◽  
Vladimir Aleksandrovich Bakhtin ◽  
Olga Fedorovna Zhukova ◽  
Dmitry Aleksandrovich Zakharov ◽  
Victor Alekseevich Krukov ◽  
...  

DVM-system is designed for the development of parallel programs for scientific and technical calculations in the C-DVMH and Fortran-DVMH languages. These languages share the single DVMH parallel programming model and extend standard C and Fortran with parallelism specifications in the form of compiler directives. The DVMH model makes it possible to create efficient parallel programs for heterogeneous computing clusters whose nodes can use accelerators, graphics processors, or Intel Xeon Phi coprocessors as computing devices alongside general-purpose multi-core processors. The article presents new features of DVM-system that have been developed recently.


2020 ◽  
Vol 23 (4) ◽  
pp. 594-614
Author(s):  
Vladimir Aleksandrovich Bakhtin ◽  
Dmitry Aleksandrovich Zakharov ◽  
Andrey Nikolaevich Kozlov ◽  
Veniamin Sergeevich Konovalov

DVM-system is designed for the development of parallel programs for scientific and technical calculations in the C-DVMH and Fortran-DVMH languages. These languages share the single DVMH parallel programming model and extend standard C and Fortran with parallelism specifications in the form of compiler directives. The DVMH model makes it possible to create efficient parallel programs for heterogeneous computing clusters whose nodes can use accelerators, graphics processors, or Intel Xeon Phi coprocessors as computing devices alongside general-purpose multi-core processors. The article describes the experience of successfully using DVM-system to develop a parallel software code for solving a problem of radiation magnetic gas dynamics and for studying plasma dynamics in the QSPA channel.


2010 ◽  
Vol 45 (5) ◽  
pp. 345-346 ◽  
Author(s):  
Aparna Chandramowlishwaran ◽  
Kathleen Knobe ◽  
Richard Vuduc
