processor architectures Latest Research Papers

Hardware Architecture for Asynchronous Cellular Self-Organizing Maps

Electronics ◽

10.3390/electronics11020215 ◽

2022 ◽

Vol 11 (2) ◽

pp. 215

Author(s):

Quentin Berthet ◽

Joachim Schmidt ◽

Andres Upegui

Keyword(s):

Specific Property ◽

Network Size ◽

Hardware Architecture ◽

Mammalian Brain ◽

Single Chip ◽

Self Organizing Maps ◽

Processor Architectures ◽

Underlying Network ◽

On Chip ◽

Self Organizing

Nowadays, one of the main challenges in computer architectures is scalability; indeed, novel processor architectures can include thousands of processing elements on a single chip and using them efficiently remains a big issue. An interesting source of inspiration for handling scalability is the mammalian brain and different works on neuromorphic computation have attempted to address this question. The Self-configurable 3D Cellular Adaptive Platform (SCALP) has been designed with the goal of prototyping such types of systems and has led to the proposal of the Cellular Self-Organizing Maps (CSOM) algorithm. In this paper, we present a hardware architecture for CSOM in the form of interconnected neural units with the specific property of supporting an asynchronous deployment on a multi-FPGA 3D array. The Asynchronous CSOM (ACSOM) algorithm exploits the underlying Network-on-Chip structure to be provided by SCALP in order to overcome the multi-path propagation issue presented by a straightforward CSOM implementation. We explore its behaviour under different map topologies and scalar representations. The results suggest that a larger network size with low precision coding obtains an optimal ratio between algorithm accuracy and FPGA resources.

HEPiX Benchmarking Solution for WLCG Computing Resources

Computing and Software for Big Science ◽

10.1007/s41781-021-00074-y ◽

2021 ◽

Vol 5 (1) ◽

Author(s):

Domenico Giordano ◽

Manfred Alef ◽

Luca Atzori ◽

Jean-Michel Barbet ◽

Olga Datskova ◽

...

Keyword(s):

Working Group ◽

High Energy Physics ◽

High Energy ◽

Computing Power ◽

Software Applications ◽

Processor Architectures ◽

Group A ◽

Benchmark Suite ◽

Main Components ◽

Energy Physics

The HEPiX Benchmarking Working Group has developed a framework to benchmark the performance of a computational server using the software applications of the High Energy Physics (HEP) community. This framework consists of two main components, named HEP-Workloads and HEPscore. HEP-Workloads is a collection of standalone production applications provided by a number of HEP experiments. HEPscore is designed to run HEP-Workloads and provide an overall measurement that is representative of the computing power of a system. HEPscore is able to measure the performance of systems with different processor architectures and accelerators. The framework is completed by the HEP Benchmark Suite that simplifies the process of executing HEPscore and other benchmarks such as HEP-SPEC06, SPEC CPU 2017, and DB12. This paper describes the motivation, the design choices, and the results achieved by the HEPiX Benchmarking Working Group. A perspective on future plans is also presented.

CoMeT: Configurable Tagged Memory Extension

Sensors ◽

10.3390/s21227771 ◽

2021 ◽

Vol 21 (22) ◽

pp. 7771

Author(s):

Jinjae Lee ◽

Derry Pratama ◽

Minjae Kim ◽

Howon Kim ◽

Donghyun Kwon

Keyword(s):

Memory Access ◽

Instruction Set ◽

Security Issues ◽

Processor Architectures ◽

Instruction Set Extension ◽

Instruction Set Extensions ◽

Access Permissions

Commodity processor architectures are releasing various instruction set extensions to support security solutions for the efficient mitigation of memory vulnerabilities. Among them, tagged memory extension (TME), such as ARM MTE and SPARC ADI, can prevent unauthorized memory access by utilizing tagged memory. However, our analysis found that TME has performance and security issues in practical use. To alleviate these, in this paper, we propose CoMeT, a new instruction set extension for tagged memory. The key idea behind CoMeT is not only to check whether the tag values in the address tag and memory tag are matched, but also to check the access permissions for each tag value. We implemented the prototype of CoMeT on the RISC-V platform. Our evaluation results confirm that CoMeT can be utilized to efficiently implement well-known security solutions, i.e., shadow stack and in-process isolation, without compromising security.
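
The key idea stated in this abstract, a tag match followed by a per-tag permission check, can be illustrated with a small software sketch. This is not CoMeT's instruction set or hardware design: the `Perm` flags, the `check_access` function, and the dictionary used as a permission table are hypothetical names chosen for illustration, and the way tags are stored is an assumption.

```python
from enum import Flag


class Perm(Flag):
    """Hypothetical per-tag access permissions (illustrative only)."""
    READ = 1
    WRITE = 2


def check_access(addr_tag, mem_tag, perms, requested):
    """Allow an access only if the tags match AND the tag permits the request.

    addr_tag:  tag carried in the pointer (address tag)
    mem_tag:   tag stored for the target memory location (memory tag)
    perms:     assumed table mapping a tag value to its allowed Perm flags
    requested: the Perm flags the access needs
    """
    # Step 1: the usual tagged-memory check, tags must be equal.
    if addr_tag != mem_tag:
        return False
    # Step 2: the additional check, the tag value must grant the access type.
    allowed = perms.get(addr_tag, Perm(0))
    return (allowed & requested) == requested


# Example: tag 3 is read-write, tag 5 is read-only.
table = {3: Perm.READ | Perm.WRITE, 5: Perm.READ}
print(check_access(3, 3, table, Perm.WRITE))  # matching tag, write allowed
print(check_access(5, 5, table, Perm.WRITE))  # matching tag, write denied
```

In this sketch a read-only tag rejects writes even when the address and memory tags agree, which is the extra condition the abstract describes on top of plain tag matching.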

Survival of the Fittest Amidst the Cambrian Explosion of Processor Architectures for Artificial Intelligence : Invited Paper

10.1109/pehc54839.2021.00010 ◽

2021 ◽

Author(s):

Sreenivas R. Sukumar ◽

Jacob A. Balma ◽

Cong Xu ◽

Sergey Serebryakov

Keyword(s):

Artificial Intelligence ◽

Cambrian Explosion ◽

Processor Architectures ◽

Survival Of The Fittest

Embedded Processor Architectures

10.1007/978-981-16-3293-8_12 ◽

2021 ◽

pp. 341-389

Author(s):

KCS Murti

Keyword(s):

Embedded Processor ◽

Processor Architectures

Combining admission tests for heuristic partitioning of real-time tasks on ARM big.LITTLE multi-processor architectures

Journal of Systems Architecture ◽

10.1016/j.sysarc.2021.102229 ◽

2021 ◽

pp. 102229

Author(s):

Agostino Mascitti ◽

Tommaso Cucinotta ◽

Luca Abeni

Keyword(s):

Real Time ◽

Processor Architectures ◽

Admission Tests

Changing Trends in Computer Architecture : A Comprehensive Analysis of ARM and x86 Processors

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit2173188 ◽

2021 ◽

pp. 619-631

Author(s):

Khushi Gupta ◽

Tushar Sharma

Keyword(s):

Computer Architecture ◽

High Performance ◽

Comprehensive Analysis ◽

Low Power Consumption ◽

Modern World ◽

Development Environment ◽

Processor Architectures ◽

Software Development Environment ◽

Changing Trends ◽

Microprocessor Industry

In the modern world, most microprocessors are based on either the ARM or the x86 architecture, the two most common processor architectures. ARM originally stood for ‘Acorn RISC Machine’ but was later changed to ‘Advanced RISC Machines’. It started as just an experiment but showed promising results, and it is now omnipresent in modern devices. Unlike x86, which is designed for high performance, ARM focuses on low power consumption with considerable performance. Because of advancements in ARM technology, ARM processors are becoming more powerful than their x86 counterparts. In this analysis we briefly compare the two architectures and conclude which will dominate the microprocessor industry. The processor that performs better in the different tests will be the more suitable one for readers to use in their applications. The shift in the industry towards ARM processors can change how we write software, which in turn will affect the whole software development environment.

Automatic Sublining for Efficient Sparse Memory Accesses

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3452141 ◽

2021 ◽

Vol 18 (3) ◽

pp. 1-23

Author(s):

Wim Heirman ◽

Stijn Eyerman ◽

Kristof Du Bois ◽

Ibrahim Hur

Keyword(s):

Dynamic Environment ◽

Large Data ◽

Main Memory ◽

Single Element ◽

Graph Analytics ◽

Available Bandwidth ◽

Processor Architectures ◽

Spatial Locality ◽

Potential Impact ◽

Memory Accesses

Sparse memory accesses, which are scattered accesses to single elements of a large data structure, are a challenge for current processor architectures. Their lack of spatial and temporal locality and their irregularity make caches and traditional stream prefetchers useless. Furthermore, performing standard caching and prefetching on sparse accesses wastes precious memory bandwidth and thrashes caches, deteriorating performance for regular accesses. Bypassing prefetchers and caches for sparse accesses, and fetching only a single element (e.g., 8 B) from main memory (subline access), can solve these issues. Deciding which accesses to handle as sparse accesses and which as regular cached accesses is a challenging task, with a large potential impact on performance. Treating sparse accesses as regular accesses reduces performance, but so does not caching accesses that do have locality, because it significantly increases their latency and bandwidth consumption. Furthermore, this decision depends on the dynamic environment, such as input set characteristics and system load, making a static decision by the programmer or compiler suboptimal. We propose the Instruction Spatial Locality Estimator (ISLE), a hardware detector that finds instructions that access isolated words in a sea of unused data. These sparse accesses are dynamically converted into uncached subline accesses, while keeping regular accesses cached. ISLE does not require modifying source code or binaries, and adapts automatically to a changing environment (input data, available bandwidth, etc.). We apply ISLE to a graph analytics processor running sparse graph workloads, and show that ISLE outperforms configurations with no subline accesses, manual sublining, and prior work on detecting sparse accesses.

Low-Complexity High-Throughput QC-LDPC Decoder for 5G New Radio Wireless Communication

Electronics ◽

10.3390/electronics10040516 ◽

2021 ◽

Vol 10 (4) ◽

pp. 516

Author(s):

Tram Thi Bao Nguyen ◽

Tuy Nguyen Tan ◽

Hanho Lee

Keyword(s):

Wireless Communication ◽

High Throughput ◽

Low Complexity ◽

Ldpc Decoder ◽

Processor Architectures ◽

New Radio ◽

Decoder Architecture ◽

Check Node ◽

Information Update ◽

Wireless Standards

This paper presents a pipelined layered quasi-cyclic low-density parity-check (QC-LDPC) decoder architecture targeting low complexity, high throughput, and efficient use of hardware resources, compliant with the specifications of the 5G new radio (NR) wireless communication standard. First, a combined min-sum (CMS) decoding algorithm, which is a combination of the offset min-sum and the original min-sum algorithm, is proposed. Then, a low-complexity and high-throughput pipelined layered QC-LDPC decoder architecture for enhanced mobile broadband specifications in 5G NR wireless standards, based on the CMS algorithm with pipelined layered scheduling, is presented. Enhanced versions of check node-based processor architectures are proposed to reduce the complexity of the LDPC decoders. An efficient minimum-finder for the check node unit architecture that reduces the hardware required for the computation of the first two minima is introduced. Moreover, a low-complexity a posteriori information update unit architecture, which requires only one adder array for its operations, is presented. The proposed architecture shows significant improvements in terms of area and throughput compared to other QC-LDPC decoder architectures available in the literature.
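
To show why the first two minima matter in a check node unit, the following is a minimal software sketch of the plain (original) min-sum check-node update. It is not the paper's CMS algorithm or its hardware minimum-finder; the function names `two_minima` and `check_node_update` are hypothetical, and floating-point messages are assumed for simplicity.

```python
def two_minima(mags):
    """Single pass over magnitudes: return (min1, min2, index of min1)."""
    if len(mags) < 2:
        raise ValueError("need at least two inputs")
    min1 = min2 = float("inf")
    idx1 = -1
    for i, m in enumerate(mags):
        if m < min1:
            # New smallest value: the old smallest becomes the second smallest.
            min2 = min1
            min1, idx1 = m, i
        elif m < min2:
            min2 = m
    return min1, min2, idx1


def check_node_update(llrs):
    """Plain min-sum update: output i uses every input except input i."""
    mags = [abs(x) for x in llrs]
    min1, min2, idx1 = two_minima(mags)
    # Product of all input signs, computed once.
    sign_product = 1
    for x in llrs:
        if x < 0:
            sign_product = -sign_product
    out = []
    for i, x in enumerate(llrs):
        # Excluding input i: the minimum is min1, unless i itself held min1.
        mag = min2 if i == idx1 else min1
        # Multiplying by the input's own sign removes it from the product.
        own_sign = -1 if x < 0 else 1
        out.append(sign_product * own_sign * mag)
    return out


print(check_node_update([2.0, -1.0, 3.0]))
```

Because every outgoing magnitude is either the smallest or the second-smallest input magnitude, a check node needs only those two values and the position of the smallest, which is why the cost of finding them dominates the check node unit.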

Efficient local locking for massively multithreaded in-memory hash-based operators

The VLDB Journal ◽

10.1007/s00778-020-00642-5 ◽

2021 ◽

Author(s):

Bashar Romanous ◽

Skyler Windh ◽

Ildar Absalyamov ◽

Prerna Budhkar ◽

Robert Halstead ◽

...

Keyword(s):

Relational Databases ◽

Aggregation Operators ◽

Main Memory ◽

Paradigm Shifts ◽

Multithreaded Processors ◽

Cache Hierarchies ◽

Processor Architectures ◽

Spatial Locality ◽

Content Addressable Memories ◽

Multi Core Processor

The join and group-by aggregation are two memory-intensive operators that are affecting the performance of relational databases. Hashing is a common approach used to implement both operators. Recent paradigm shifts in multi-core processor architectures have reinvigorated research into how the join and group-by aggregation operators can leverage these advances. However, the poor spatial locality of the hashing approach has hindered performance on multi-core processor architectures which rely on using large cache hierarchies for latency mitigation. Multithreaded architectures can better cope with poor spatial locality by masking memory latency with many outstanding requests. Nevertheless, the number of parallel threads, even in the most advanced multithreaded processors, such as UltraSPARC, is not enough to fully cover the main memory access latency. In this paper, we explore the hardware re-configurability of FPGAs to enable deeper execution pipelines that maintain hundreds (instead of tens) of outstanding memory requests across four FPGAs, drastically increasing concurrency and throughput. We present two end-to-end in-memory accelerators for the join and group-by aggregation operators using FPGAs. Both accelerators use massive multithreading to mask long memory delays of traversing linked-list data structures, while concurrently managing hundreds of thread states across four FPGAs locally. We explore how content addressable memories can be intermixed within our multithreaded designs to act as a synchronizing cache, which enforces locks and merges jobs together before they are written to memory. Throughput results for our hash-join operator accelerator show a speedup between 2× and 3.4× over the best multi-core approaches with comparable memory bandwidths on uniform and skewed datasets. The accelerator for the hash-based group-by aggregation operator demonstrates that leveraging CAMs achieves an average speedup of 3.3×, with a best case of 9.4×, in terms of throughput over CPU implementations across five types of data distributions.
