The Parallel Tiled WZ Factorization Algorithm for Multicore Architectures

Beata Bylina; Jarosław Bylina

doi:10.2478/amcs-2019-0030

International Journal of Applied Mathematics and Computer Science ◽

10.2478/amcs-2019-0030 ◽

2019 ◽

Vol 29 (2) ◽

pp. 407-419

Author(s):

Beata Bylina ◽

Jarosław Bylina

Keyword(s):

Shared Memory ◽

Linear Algebra ◽

Multicore Architectures ◽

Numerical Accuracy ◽

Factorization Algorithm ◽

Computational Performance ◽

Parallel Implementations ◽

Diagonally Dominant Matrices ◽

Diagonally Dominant ◽

Level Parallelism

Abstract The aim of this paper is to investigate dense linear algebra algorithms on shared memory multicore architectures. The design and implementation of a parallel tiled WZ factorization algorithm that can fully exploit such architectures are presented. Three parallel implementations of the algorithm are studied. The first relies only on multithreaded BLAS (basic linear algebra subprograms) operations. The second, in addition to BLAS operations, employs the OpenMP standard to exploit loop-level parallelism. The third, in addition to BLAS operations, employs the OpenMP task directive with the depend clause. We report the computational performance and the speedup of the parallel tiled WZ factorization algorithm on shared memory multicore architectures for dense square diagonally dominant matrices. We then compare our parallel implementations with the corresponding LU factorization from a vendor-implemented LAPACK library. We also analyze the numerical accuracy. Two of our implementations achieve nearly the maximal theoretical speedup implied by Amdahl’s law.
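The abstract above measures its results against the maximal theoretical speedup implied by Amdahl's law. As a reminder of what that bound looks like, here is a minimal sketch. The function name `amdahl_speedup`, the parallel fraction of 0.95 and the worker counts are illustrative assumptions, not values taken from the paper.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the runtime scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unchanged; the parallel part is divided among the workers.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical parallel fraction, chosen only for illustration.
    for workers in (1, 4, 16, 64):
        print(workers, round(amdahl_speedup(0.95, workers), 2))
```

The bound flattens out as the worker count grows, because the serial fraction eventually dominates the runtime.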

OpenMP Issues Arising in the Development of Parallel BLAS and LAPACK Libraries

Scientific Programming ◽

10.1155/2003/278167 ◽

2003 ◽

Vol 11 (2) ◽

pp. 95-104 ◽

Cited By ~ 2

Author(s):

C. Addison ◽

Y. Ren ◽

M. van Waveren

Keyword(s):

Shared Memory ◽

Linear Algebra ◽

Distributed Memory ◽

Parallel Computations ◽

Dense Linear Algebra ◽

Fine Grain ◽

Parallel Implementations ◽

Work Distribution ◽

Multiple Array ◽

Parallel Library

Dense linear algebra libraries need to cope efficiently with a range of input problem sizes and shapes. Inherently, this means that parallel implementations have to exploit parallelism wherever it is present. While OpenMP allows relatively fine-grain parallelism to be exploited in a shared memory environment, it currently lacks features to make it easy to partition computation over multiple array indices or to overlap sequential and parallel computations. The inherently flexible nature of shared memory paradigms such as OpenMP poses other difficulties when it becomes necessary to optimise performance across successive parallel library calls. Notions borrowed from distributed memory paradigms, such as explicit data distributions, help address some of these problems, but the focus on data rather than work distribution appears misplaced in an SMP context.

The estimates of diagonally dominant degree and eigenvalues inclusion regions for the Schur complement of block diagonally dominant matrices

Open Mathematics ◽

10.1515/math-2015-0072 ◽

2015 ◽

Vol 13 (1) ◽

Author(s):

Feng Wang ◽

Deshu Sun

Keyword(s):

Control Theory ◽

Linear Algebra ◽

Matrix Theory ◽

Schur Complement ◽

Numerical Example ◽

Diagonally Dominant Matrices ◽

Diagonally Dominant ◽

And Control

Abstract The theory of Schur complement plays an important role in many fields, such as matrix theory and control theory. In this paper, applying the properties of Schur complement, some new estimates of diagonally dominant degree on the Schur complement of I(II)-block strictly diagonally dominant matrices and I(II)-block strictly doubly diagonally dominant matrices are obtained, which improve some relative results in Liu [Linear Algebra Appl. 435(2011) 3085-3100]. As an application, we present several new eigenvalue inclusion regions for the Schur complement of matrices. Finally, we give a numerical example to illustrate the advantages of our derived results.

Scheduling Two-Sided Transformations Using Tile Algorithms on Multicore Architectures

Scientific Programming ◽

10.1155/2010/574728 ◽

2010 ◽

Vol 18 (1) ◽

pp. 35-50 ◽

Cited By ~ 4

Author(s):

Hatem Ltaief ◽

Jakub Kurzak ◽

Jack Dongarra ◽

Rosa M. Badia

Keyword(s):

Linear Algebra ◽

High Performance ◽

Eigenvalue Problems ◽

Multicore Processors ◽

Multicore Architectures ◽

Band Matrices ◽

Fine Grain ◽

Dataflow Model ◽

Singular Value Decompositions ◽

Level Parallelism

The objective of this paper is to describe, in the context of multicore architectures, three different scheduler implementations for the two-sided linear algebra transformations, in particular the Hessenberg and bidiagonal reductions, which are the first steps for the standard eigenvalue problems and the singular value decompositions, respectively. State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors due to its inability to fully exploit thread-level parallelism. At the same time, the fine-grain dataflow model is gaining popularity as a paradigm for programming multicore architectures. Buttari et al. (Parallel Comput. Syst. Appl. 35 (2009), 38–53) introduced the concept of tile algorithms, in which parallelism is no longer hidden inside Basic Linear Algebra Subprograms but is brought to the fore to yield much better performance. Along with efficient scheduling mechanisms for data-driven execution, these tile two-sided reductions achieve high performance, reaching up to 75% of the DGEMM peak on a 12000×12000 matrix with 16 Intel Tigerton 2.4 GHz processors. The main drawback of the tile algorithms approach for two-sided transformations is that the full reduction cannot be obtained in one stage. Other methods have to be considered to further reduce the band matrices to the required forms.

Polyhedral approximations of the semidefinite cone and their application

Computational Optimization and Applications ◽

10.1007/s10589-020-00255-2 ◽

2021 ◽

Author(s):

Yuzhu Wang ◽

Akihiro Tanaka ◽

Akiko Yoshise

Keyword(s):

Initial Approximation ◽

Semidefinite Relaxation ◽

Stable Set ◽

Cutting Plane Methods ◽

Diagonally Dominant Matrices ◽

Maximum Stable Set ◽

Simple Expansion ◽

Diagonally Dominant ◽

Polyhedral Approximations ◽

Semidefinite Cone

Abstract We develop techniques to construct a series of sparse polyhedral approximations of the semidefinite cone. Motivated by the semidefinite (SD) bases proposed by Tanaka and Yoshise (Ann Oper Res 265:155–182, 2018), we propose a simple expansion of SD bases so as to keep the sparsity of the matrices composing it. We prove that the polyhedral approximation using our expanded SD bases contains the set of all diagonally dominant matrices and is contained in the set of all scaled diagonally dominant matrices. We also prove that the set of all scaled diagonally dominant matrices can be expressed using an infinite number of expanded SD bases. We use our approximations as the initial approximation in cutting plane methods for solving a semidefinite relaxation of the maximum stable set problem. It is found that the proposed methods with expanded SD bases are significantly more efficient than methods using other existing approximations or solving semidefinite relaxation problems directly.

Profiling high performance dense linear algebra algorithms on multicore architectures for power and energy efficiency

Computer Science - Research and Development ◽

10.1007/s00450-011-0191-z ◽

2011 ◽

Vol 27 (4) ◽

pp. 277-287 ◽

Cited By ~ 17

Author(s):

Hatem Ltaief ◽

Piotr Luszczek ◽

Jack Dongarra

Keyword(s):

Energy Efficiency ◽

Linear Algebra ◽

High Performance ◽

Multicore Architectures ◽

Dense Linear Algebra ◽

Power And Energy

New safe reliable design methodologies examined by fault injection testing and Monte Carlo simulation: tolerating shared-memory interferences in multicore architectures

International Journal of Embedded Systems ◽

10.1504/ijes.2021.117956 ◽

2021 ◽

Vol 14 (4) ◽

pp. 409

Author(s):

Abdullah El Bayoumi

Keyword(s):

Monte Carlo Simulation ◽

Monte Carlo ◽

Shared Memory ◽

Fault Injection ◽

Multicore Architectures ◽

Design Methodologies ◽

Reliable Design

Enumeration and investigation of acute 0/1-simplices modulo the action of the hyperoctahedral group

Special Matrices ◽

10.1515/spma-2017-0014 ◽

2017 ◽

Vol 5 (1) ◽

pp. 158-201

Author(s):

Jan Brandts ◽

Apo Cihangir

Keyword(s):

Convex Hull ◽

Computer Program ◽

Explicit Form ◽

Linear Algebra ◽

Dihedral Angles ◽

Cycle Index ◽

Hyperoctahedral Group ◽

Diagonally Dominant ◽

Group B ◽

Cycle Indices

Abstract The convex hull of n + 1 affinely independent vertices of the unit n-cube I^n is called a 0/1-simplex. It is nonobtuse if none of its dihedral angles is obtuse, and acute if additionally none of them is right. In terms of linear algebra, acute 0/1-simplices in I^n can be described by nonsingular 0/1-matrices P of size n × n whose Gramians G = P^T P have an inverse that is strictly diagonally dominant, with negative off-diagonal entries [6, 7]. The first part of this paper deals with giving a detailed description of how to efficiently compute, by means of a computer program, a representative from each orbit of an acute 0/1-simplex under the action of the hyperoctahedral group B_n [17] of symmetries of I^n. A side product of the investigations is a simple code that computes the cycle index of B_n, which can in explicit form only be found in the literature [11] for n ≤ 6. Using the computed cycle indices for B_3, ..., B_11 in combination with Pólya’s theory of enumeration shows that acute 0/1-simplices are extremely rare among all 0/1-simplices. In the second part of the paper, we study the 0/1-matrices that represent the acute 0/1-simplices that were generated by our code from a mathematical perspective. One of the patterns observed in the data involves unreduced upper Hessenberg 0/1-matrices of size n × n, block-partitioned according to certain integer compositions of n. These patterns will be fully explained using a so-called One Neighbor Theorem [4]. Additionally, we are able to prove that the volumes of the corresponding acute simplices are in one-to-one correspondence with the part of Kepler’s Tree of Fractions [1, 24] that enumerates ℚ ∩ (0, 1). Another key ingredient in the proofs is the fact that the Gramians of the unreduced upper Hessenberg matrices involved are strictly ultrametric [14, 26] matrices.
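The linear-algebra criterion quoted in the abstract above (the inverse of the Gramian of P is strictly diagonally dominant with negative off-diagonal entries) can be checked directly for small cases. The sketch below is not the authors' program. The helper names `gram_inverse` and `is_acute` are hypothetical, and the two test matrices are our own illustrations: the regular tetrahedron inscribed in the unit 3-cube, and the identity matrix, which gives a simplex with right dihedral angles. Exact rational arithmetic is used, so no floating-point tolerance is involved.

```python
from fractions import Fraction


def gram_inverse(P):
    """Return the exact inverse of G = P^T P using Gauss-Jordan elimination."""
    n = len(P)
    # G[i][j] is the inner product of columns i and j of P.
    G = [[Fraction(sum(P[k][i] * P[k][j] for k in range(n))) for j in range(n)]
         for i in range(n)]
    # Augment G with the identity, then reduce the left half to the identity.
    A = [row + [Fraction(int(i == j)) for j in range(n)] for i, row in enumerate(G)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if A[r][col] != 0), None)
        if pivot is None:
            raise ValueError("P is singular")
        A[col], A[pivot] = A[pivot], A[col]
        scale = A[col][col]
        A[col] = [x / scale for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [row[n:] for row in A]


def is_acute(P):
    """Test the criterion: inverse Gramian strictly diagonally dominant, off-diagonals negative."""
    H = gram_inverse(P)
    n = len(H)
    for i in range(n):
        off_diagonal = [H[i][j] for j in range(n) if j != i]
        if any(x >= 0 for x in off_diagonal):
            return False
        if abs(H[i][i]) <= sum(abs(x) for x in off_diagonal):
            return False
    return True


if __name__ == "__main__":
    tetrahedron = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]  # columns are the non-origin vertices
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    print(is_acute(tetrahedron), is_acute(identity))
```

For the tetrahedron, the Gramian has 2 on the diagonal and 1 elsewhere, so its inverse has 3/4 on the diagonal and -1/4 elsewhere, which satisfies both conditions. The identity fails because its off-diagonal entries are zero rather than negative.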

A fast and efficient MATLAB-based MPM solver: fMPMM-solver v1.1

Geoscientific Model Development ◽

10.5194/gmd-13-6265-2020 ◽

2020 ◽

Vol 13 (12) ◽

pp. 6265-6284

Author(s):

Emmanuel Wyser ◽

Yury Alkhimenkov ◽

Michel Jaboyedoff ◽

Yury Y. Podladchikov

Keyword(s):

Cantilever Beam ◽

Large Scale ◽

Solid Mechanics ◽

Large Deformations ◽

Material Point ◽

Snow Avalanches ◽

High Level Language ◽

Numerical Accuracy ◽

Computational Performance ◽

High Level

Abstract. We present an efficient MATLAB-based implementation of the material point method (MPM) and its most recent variants. MPM has gained popularity over the last decade, especially for problems in solid mechanics in which large deformations are involved, such as cantilever beam problems, granular collapses and even large-scale snow avalanches. Although its numerical accuracy is lower than that of the widely accepted finite element method (FEM), MPM has proven useful for overcoming some of the limitations of FEM, such as excessive mesh distortions. We demonstrate that MATLAB is an efficient high-level language for MPM implementations that solve elasto-dynamic and elasto-plastic problems. We accelerate the MATLAB-based implementation of the MPM method by using the numerical techniques recently developed for FEM optimization in MATLAB. These techniques include vectorization, the use of native MATLAB functions and the maintenance of optimal RAM-to-cache communication, among others. We validate our in-house code with classical MPM benchmarks including (i) the elastic collapse of a column under its own weight; (ii) the elastic cantilever beam problem; and (iii) existing experimental and numerical results, i.e. granular collapses and slumping mechanics respectively. We report an improvement in performance by a factor of 28 for a vectorized code compared with a classical iterative version. The computational performance of the solver is at least 2.8 times greater than those of previously reported MPM implementations in Julia under a similar computational architecture.

Calculating an Invariant Subspace of Diagonally Dominant Matrices - Part II

Missouri Journal of Mathematical Sciences ◽

10.35834/1993/0503116 ◽

1993 ◽

Vol 5 (3) ◽

pp. 116-122

Author(s):

Noah H. Rhee

Keyword(s):

Invariant Subspace ◽

Diagonally Dominant Matrices ◽

Diagonally Dominant

Executing linear algebra kernels in heterogeneous distributed infrastructures with PyCOMPSs

Oil & Gas Science and Technology – Revue d’IFP Energies nouvelles ◽

10.2516/ogst/2018047 ◽

2018 ◽

Vol 73 ◽

pp. 47 ◽

Cited By ~ 3

Author(s):

Ramon Amela ◽

Cristian Ramon-Cortes ◽

Jorge Ejarque ◽

Javier Conejero ◽

Rosa M. Badia

Keyword(s):

Programming Languages ◽

Linear Algebra ◽

Programming Model ◽

Xeon Phi ◽

Scientific Communities ◽

Heterogeneous Architectures ◽

Parallel Programming Model ◽

Significant Performance ◽

Thread Level Parallelism ◽

Level Parallelism

Python is a popular programming language due to the simplicity of its syntax, while still achieving good performance despite being an interpreted language. Its adoption by multiple scientific communities has led to the emergence of a large number of libraries and modules, which has helped to put Python at the top of the list of programming languages [1]. Task-based programming has been proposed in recent years as an alternative parallel programming model. PyCOMPSs follows this approach for Python, and this paper presents its extensions to combine task-based parallelism and thread-level parallelism. We also present how PyCOMPSs has been adapted to support heterogeneous architectures, including Xeon Phi and GPUs. Results obtained with linear algebra benchmarks demonstrate that significant performance can be obtained with a few lines of Python.
