Multithreaded Processors: Latest Research Papers

Efficient local locking for massively multithreaded in-memory hash-based operators

The VLDB Journal ◽

10.1007/s00778-020-00642-5 ◽

2021 ◽

Author(s):

Bashar Romanous ◽

Skyler Windh ◽

Ildar Absalyamov ◽

Prerna Budhkar ◽

Robert Halstead ◽

...

Keyword(s):

Relational Databases ◽

Aggregation Operators ◽

Main Memory ◽

Paradigm Shifts ◽

Multithreaded Processors ◽

Cache Hierarchies ◽

Processor Architectures ◽

Spatial Locality ◽

Content Addressable Memories ◽

Multi Core Processor

The join and group-by aggregation are two memory-intensive operators that affect the performance of relational databases. Hashing is a common approach used to implement both operators. Recent paradigm shifts in multi-core processor architectures have reinvigorated research into how the join and group-by aggregation operators can leverage these advances. However, the poor spatial locality of the hashing approach has hindered performance on multi-core processor architectures, which rely on large cache hierarchies for latency mitigation. Multithreaded architectures can better cope with poor spatial locality by masking memory latency with many outstanding requests. Nevertheless, the number of parallel threads, even in the most advanced multithreaded processors, such as UltraSPARC, is not enough to fully cover the main memory access latency. In this paper, we explore the hardware re-configurability of FPGAs to enable deeper execution pipelines that maintain hundreds (instead of tens) of outstanding memory requests across four FPGAs, drastically increasing concurrency and throughput. We present two end-to-end in-memory accelerators for the join and group-by aggregation operators using FPGAs. Both accelerators use massive multithreading to mask long memory delays of traversing linked-list data structures, while concurrently managing hundreds of thread states across four FPGAs locally. We explore how content addressable memories (CAMs) can be intermixed within our multithreaded designs to act as a synchronizing cache, which enforces locks and merges jobs together before they are written to memory. Throughput results for our hash-join operator accelerator show a speedup between 2× and 3.4× over the best multi-core approaches with comparable memory bandwidths on uniform and skewed datasets. The accelerator for the hash-based group-by aggregation operator demonstrates that leveraging CAMs achieves an average speedup of 3.3×, with a best case of 9.4×, in terms of throughput over CPU implementations across five types of data distributions.
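
As a rough software analogy for the "synchronizing cache" idea in the abstract above, the sketch below merges updates to the same group key in a small buffer before they reach the backing hash table. It is not the paper's FPGA design: the class `MergingCache`, the function `group_by_sum`, the oldest-first eviction policy, and the `capacity` parameter are all assumptions made for illustration.

```python
from collections import OrderedDict


class MergingCache:
    """Toy stand-in for a CAM used as a merging buffer (hypothetical design)."""

    def __init__(self, capacity, table):
        self.capacity = capacity
        self.pending = OrderedDict()  # key -> partial aggregate not yet written
        self.table = table            # dict standing in for the in-memory hash table
        self.memory_writes = 0        # counts writes that reach the table

    def update(self, key, value):
        if key in self.pending:
            # Key already buffered: merge here, no write to the table.
            self.pending[key] += value
            return
        if len(self.pending) >= self.capacity:
            self._evict_oldest()
        self.pending[key] = value

    def _evict_oldest(self):
        # Write the oldest buffered partial aggregate to the table.
        key, partial = self.pending.popitem(last=False)
        self.table[key] = self.table.get(key, 0) + partial
        self.memory_writes += 1

    def flush(self):
        while self.pending:
            self._evict_oldest()


def group_by_sum(rows, capacity=4):
    """Sum values per key, returning the table and the number of table writes."""
    table = {}
    cache = MergingCache(capacity, table)
    for key, value in rows:
        cache.update(key, value)
    cache.flush()
    return table, cache.memory_writes
```

With `rows = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5)]`, a buffer of capacity 4 reaches the table twice, while a buffer of capacity 1 reaches it four times. Both produce the same sums, which is the kind of write merging the abstract attributes to the CAM.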


Dynamic issue queue capping for simultaneous multithreaded processors

TURKISH JOURNAL OF ELECTRICAL ENGINEERING & COMPUTER SCIENCES ◽

10.3906/elk-2005-50 ◽

2021 ◽

Keyword(s):

Multithreaded Processors


Applied On-Chip Machine Learning for Dynamic Resource Control in Multithreaded Processors

Parallel Processing Letters ◽

10.1142/s0129626419500130 ◽

2019 ◽

Vol 29 (03) ◽

pp. 1950013

Author(s):

Shane Carroll ◽

Wei-Ming Lin

Keyword(s):

Machine Learning ◽

Learning Algorithm ◽

Clock Cycle ◽

Shared Resources ◽

Multithreaded Processors ◽

Multiple Threads ◽

On Chip ◽

Control Instruction ◽

Cache Miss ◽

Fetch Bandwidth

In this paper, we propose a machine learning algorithm to control instruction fetch bandwidth in a simultaneous multithreaded CPU. In a simultaneous multithreaded CPU, multiple threads occupy pools of hardware resources in the same clock cycle. Under some conditions, one or more threads may undergo a period of inefficiency, e.g., a cache miss, thereby using shared resources inefficiently and degrading the performance of other threads. If these inefficiencies can be identified at runtime, the offending thread can be temporarily blocked from fetching new instructions into the pipeline and given time to recover, which prevents shared system resources from being wasted on a stalled thread. We propose a machine learning approach to determine when a thread should be blocked from fetching new instructions. The model is trained offline and its parameters are embedded in the CPU, where it can be queried with runtime statistics to determine whether a thread is running inefficiently and should be temporarily blocked from fetching. We propose two models: a simple linear model and a higher-capacity neural network. We test each model in a simulation environment and show that system performance can increase by up to 19% on average with a feasible implementation of the proposed algorithm.
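
The linear variant described above can be pictured as a weighted sum of runtime statistics compared against a threshold. The sketch below is a minimal illustration under stated assumptions: the feature names (`pending_misses`, `queue_occupancy`, `recent_ipc`), the weights, and the bias are invented for the example and are not taken from the paper.

```python
def should_block_fetch(stats, weights, bias, threshold=0.0):
    """Return True if the linear score says the thread should stop fetching."""
    score = bias + sum(weights[name] * stats[name] for name in weights)
    return score > threshold


def threads_allowed_to_fetch(per_thread_stats, weights, bias):
    """List the thread ids that the model does not block this cycle."""
    return [
        thread_id
        for thread_id, stats in enumerate(per_thread_stats)
        if not should_block_fetch(stats, weights, bias)
    ]


# Hypothetical parameters, as if produced by offline training.
WEIGHTS = {"pending_misses": 1.0, "queue_occupancy": 0.5, "recent_ipc": -2.0}
BIAS = -1.0

# Thread 0 looks stalled on cache misses; thread 1 is making progress.
per_thread_stats = [
    {"pending_misses": 2, "queue_occupancy": 0.9, "recent_ipc": 0.1},
    {"pending_misses": 0, "queue_occupancy": 0.3, "recent_ipc": 1.5},
]
allowed = threads_allowed_to_fetch(per_thread_stats, WEIGHTS, BIAS)
```

Here `allowed` contains only thread 1, since thread 0 scores above the threshold. A hardware implementation would presumably use fixed-point arithmetic and a small set of counters, but that detail is outside what the abstract states.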


Round Robin Thread Selection Optimization in Multithreaded Processors

Parallel Processing Letters ◽

10.1142/s0129626419500038 ◽

2019 ◽

Vol 29 (01) ◽

pp. 1950003

Author(s):

Shane Carroll ◽

Wei-Ming Lin

Keyword(s):

Resource Distribution ◽

Round Robin ◽

System Throughput ◽

Shared Resources ◽

Multithreaded Processors ◽

Run Time ◽

Multiple Stages

We propose a variation of round-robin ordering in a multithreaded pipeline to increase system throughput and resource distribution fairness. We show that using round robin with a typical arbitrary ordering results in inefficient use of shared resources and subsequent thread starvation. To address this while still using a simple round-robin approach, we optimally and dynamically sort the order of the round robin periodically at runtime. We show that with 4-threaded workloads, throughput can be improved by over 9% and harmonic throughput by over 3% by sorting thread order at runtime. We experiment with multiple stages of the pipeline and show consistent results throughout several experiments using the SPEC CPU 2006 benchmarks. Furthermore, since the technique is still a simple round robin, the increased performance requires little overhead to implement.
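
One way to picture a periodically re-sorted round robin is sketched below. The abstract does not say which metric drives the sort, so the per-thread `occupancy` counter, the ascending sort, and the `SortedRoundRobin` class are assumptions made for illustration only. The re-sort interval is chosen as a multiple of the thread count so that every thread still gets exactly one turn per round.

```python
class SortedRoundRobin:
    """Round-robin selector whose visiting order is re-sorted periodically."""

    def __init__(self, num_threads, resort_interval):
        self.order = list(range(num_threads))  # initial arbitrary order
        self.resort_interval = resort_interval
        self.position = 0
        self.cycle = 0

    def next_thread(self, occupancy):
        # Every resort_interval cycles, reorder threads by the assumed metric
        # (lowest shared-resource occupancy first) and restart the round.
        if self.cycle > 0 and self.cycle % self.resort_interval == 0:
            self.order.sort(key=lambda thread: occupancy[thread])
            self.position = 0
        thread = self.order[self.position]
        self.position = (self.position + 1) % len(self.order)
        self.cycle += 1
        return thread


selector = SortedRoundRobin(num_threads=4, resort_interval=4)
occupancy = [30, 10, 40, 20]  # hypothetical per-thread resource usage
picks = [selector.next_thread(occupancy) for _ in range(8)]
```

The first round visits threads in the initial order 0, 1, 2, 3. After the re-sort, the second round visits them as 1, 3, 0, 2. Between re-sorts the selector is an ordinary round robin, which matches the abstract's point that the scheme adds little overhead.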


Thread Assignment in Multicore/Multithreaded Processors: A Statistical Approach

IEEE Transactions on Computers ◽

10.1109/tc.2015.2417533 ◽

2016 ◽

Vol 65 (1) ◽

pp. 256-269 ◽

Cited By ~ 3

Author(s):

Petar Radojkovic ◽

Paul M. Carpenter ◽

Miquel Moreto ◽

Vladimir Cakarevic ◽

Javier Verdu ◽

...

Keyword(s):

Statistical Approach ◽

Multithreaded Processors
