Scalable Phylogeny Reconstruction with Disaggregated Near-memory Processing

Nikolaos Alachiotis; Panagiotis Skrimponis; Manolis Pissadakis; Dionisios Pnevmatikatos

doi:10.1145/3484983

Scalable Phylogeny Reconstruction with Disaggregated Near-memory Processing

ACM Transactions on Reconfigurable Technology and Systems ◽

10.1145/3484983 ◽

2022 ◽

Vol 15 (3) ◽

pp. 1-32

Author(s):

Nikolaos Alachiotis ◽

Panagiotis Skrimponis ◽

Manolis Pissadakis ◽

Dionisios Pnevmatikatos

Keyword(s):

Virtual Machines ◽

Likelihood Function ◽

Hardware Acceleration ◽

Operation Performance ◽

Data Movement ◽

Memory Processing ◽

Data Intensive ◽

Maximum Likelihood Methods ◽

Time And Energy ◽

Computational Kernel

Disaggregated computer architectures eliminate resource fragmentation in next-generation datacenters by enabling virtual machines to employ resources such as CPUs, memory, and accelerators that are physically located on different servers. While this paves the way for highly compute- and/or memory-intensive applications to potentially deploy all CPUs and/or memory resources in a datacenter, it poses a major challenge to the efficient deployment of hardware accelerators: input/output data can reside on different servers than the ones hosting accelerator resources, thereby requiring time- and energy-consuming remote data transfers that diminish the gains of hardware acceleration. Targeting a disaggregated datacenter architecture similar to the IBM dReDBox disaggregated datacenter prototype, the present work explores the potential of deploying custom acceleration units adjacent to the disaggregated-memory controller, which is implemented on FPGA technology, on memory bricks (in dReDBox terminology) to reduce data movement and improve performance and energy efficiency when reconstructing large phylogenies (evolutionary relationships among organisms). A fundamental computational kernel is the Phylogenetic Likelihood Function (PLF), which dominates the total execution time (up to 95%) of widely used maximum-likelihood methods. Numerous efforts to boost PLF performance over the years have focused on accelerating computation; since the PLF is a data-intensive, memory-bound operation, performance remains limited by data movement, and memory disaggregation only exacerbates the problem. We describe two near-memory processing models: one that addresses the problem of workload distribution to memory bricks, which is particularly tailored toward larger genomes (e.g., plants and mammals), and one that reduces overall memory requirements through memory-side data interpolation transparently to the application, thereby allowing the phylogeny size to scale to a larger number of organisms without requiring additional memory.
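To illustrate why the PLF is memory-bound, the following is a minimal Python sketch of one PLF step (Felsenstein's pruning update at a single inner node). It is not the implementation from the paper. The Jukes-Cantor substitution model, the four-state DNA alphabet, and the function names jc69_matrix, tip_vector, plf_node, and site_likelihoods are assumptions chosen for illustration. For every alignment site the kernel reads two child vectors and writes one parent vector while doing only a handful of multiply-adds, so data movement tends to dominate.

```python
import math

STATES = "ACGT"  # assumed four-state DNA alphabet


def jc69_matrix(t):
    """Jukes-Cantor transition probabilities for branch length t (assumed model)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def tip_vector(base):
    """Likelihood vector of an observed base at a leaf: 1 for that base, 0 otherwise."""
    return [1.0 if s == base else 0.0 for s in STATES]


def plf_node(left, right, t_left, t_right):
    """Combine the per-site likelihood vectors of two children into the parent's.

    left and right are lists with one 4-entry vector per alignment site.
    """
    p_l, p_r = jc69_matrix(t_left), jc69_matrix(t_right)
    parent = []
    for lv, rv in zip(left, right):  # one iteration per alignment site
        site = []
        for i in range(4):
            sum_left = sum(p_l[i][j] * lv[j] for j in range(4))
            sum_right = sum(p_r[i][j] * rv[j] for j in range(4))
            site.append(sum_left * sum_right)
        parent.append(site)
    return parent


def site_likelihoods(root):
    """Per-site likelihood at the root, using uniform base frequencies of 0.25."""
    return [sum(0.25 * x for x in site) for site in root]


# Usage: two leaves with the sequences "AC" and "AG", joined by short branches.
left = [tip_vector(b) for b in "AC"]
right = [tip_vector(b) for b in "AG"]
print(site_likelihoods(plf_node(left, right, 0.1, 0.1)))
```

In a full tree the same update is repeated for every inner node, which is why the per-site vectors, rather than the arithmetic, determine the memory traffic.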


Not in Name Alone: A Memristive Memory Processing Unit for Real In-Memory Processing

10.36227/techrxiv.12894941.v1 ◽

2020 ◽

Author(s):

Ameer Haj-Ali ◽

Nimrod Wald ◽

Ronny Ronen ◽

Shahar Kvatinsky ◽

Rotem Ben-Hur

Keyword(s):

Data Transfer ◽

Single Instruction Multiple Data ◽

Processing Unit ◽

Von Neumann ◽

Data Movement ◽

Memory Processing ◽

Data Intensive ◽

Root Cause ◽

Multiple Data ◽

Data Intensive Applications

Data movement between processing and memory is the root cause of the limited performance and energy efficiency in modern von Neumann systems. To overcome the data-movement bottleneck, we present the memristive Memory Processing Unit (mMPU), a real processing-in-memory system in which the computation is done directly in the memory cells, thus eliminating the necessity for data transfer. Furthermore, with its enormous inner parallelism, this system is ideal for data-intensive applications that are based on single instruction, multiple data (SIMD), providing high throughput and energy efficiency.


Not in Name Alone: A Memristive Memory Processing Unit for Real In-Memory Processing

10.36227/techrxiv.12894941 ◽

2020 ◽

Author(s):

Ameer Haj-Ali ◽

Nimrod Wald ◽

Ronny Ronen ◽

Shahar Kvatinsky ◽

Rotem Ben-Hur

Keyword(s):

Data Transfer ◽

Single Instruction Multiple Data ◽

Processing Unit ◽

Von Neumann ◽

Data Movement ◽

Memory Processing ◽

Data Intensive ◽

Root Cause ◽

Multiple Data ◽

Data Intensive Applications


Hardware-Accelerated Dual-Split Trees

Proceedings of the ACM on Computer Graphics and Interactive Techniques ◽

10.1145/3406185 ◽

2020 ◽

Vol 3 (2) ◽

pp. 1-21

Author(s):

Daqi Lin ◽

Elena Vasiou ◽

Cem Yuksel ◽

Daniel Kopta ◽

Erik Brunvand

Keyword(s):

Ray Tracing ◽

Hardware Acceleration ◽

Memory Storage ◽

Compact Representation ◽

Space Partitioning ◽

Data Movement ◽

Bounding Volume ◽

Bounding Boxes ◽

Split Trees ◽

Bounding Volume Hierarchies

Bounding volume hierarchies (BVH) are the most widely used acceleration structures for ray tracing due to their high construction and traversal performance. However, the bounding planes shared between parent and children bounding boxes are an inherent storage redundancy that limits further improvement in performance due to the memory cost of reading these redundant planes. Dual-split trees can create identical space partitioning as BVHs, but in a compact form using less memory by eliminating the redundancies of the BVH structure representation. This reduction in memory storage and data movement translates to faster ray traversal and better energy efficiency. Yet, the performance benefits of dual-split trees are undermined by the processing required to extract the necessary information from their compact representation. This involves bit manipulations and branching instructions, which are inefficient in software. We introduce hardware acceleration for dual-split trees and show that the performance advantages over BVHs are emphasized in a hardware ray tracing context that can take advantage of such acceleration. We provide details on how the operations needed for decoding dual-split tree nodes can be implemented in hardware and present experiments in a number of scenes with different sizes using path tracing. In our experiments, we have observed up to 31% reduction in render time and 38% energy saving using dual-split trees as compared to binary BVHs representing identical space partitioning.


GATECloud.net: a platform for large-scale, open-source text processing on the cloud

Philosophical Transactions of The Royal Society A Mathematical Physical and Engineering Sciences ◽

10.1098/rsta.2012.0071 ◽

2013 ◽

Vol 371 (1983) ◽

pp. 20120071 ◽

Cited By ~ 23

Author(s):

Valentin Tablan ◽

Ian Roberts ◽

Hamish Cunningham ◽

Kalina Bontcheva

Keyword(s):

Cloud Computing ◽

Language Processing ◽

Large Scale ◽

Virtual Machines ◽

Cost Benefit Analysis ◽

Text Processing ◽

Cost Benefit ◽

Data Intensive ◽

On Demand ◽

Usage Evaluation

Cloud computing is increasingly being regarded as a key enabler of the ‘democratization of science’, because on-demand, highly scalable cloud computing facilities enable researchers anywhere to carry out data-intensive experiments. In the context of natural language processing (NLP), algorithms tend to be complex, which makes their parallelization and deployment on cloud platforms a non-trivial task. This study presents a new, unique, cloud-based platform for large-scale NLP research, GATECloud.net. It enables researchers to carry out data-intensive NLP experiments by harnessing the vast, on-demand compute power of the Amazon cloud. Important infrastructural issues are dealt with by the platform, completely transparently for the researcher: load balancing, efficient data upload and storage, deployment on the virtual machines, security and fault tolerance. We also include a cost–benefit analysis and usage evaluation.


Normalized Transfer of Bulk Data By Using UDP In Dedicated Networks

International Journal of Smart Sensor and Adhoc Network. ◽

10.47893/ijssan.2012.1071 ◽

2012 ◽

pp. 243-246

Author(s):

Kurmachalam Ajay Kumar ◽

Saritha Vemuri ◽

Ralla Suresh

Keyword(s):

High Speed ◽

Data Transfer ◽

Long Distance ◽

Memory Processing ◽

Data Intensive ◽

Bulk Data ◽

Network Links ◽

Bulk Data Transfer ◽

Dedicated Networks ◽

Effective Buffer

High-speed bulk data transfer is an important part of many data-intensive scientific applications. TCP performs poorly when transferring large amounts of data over long distances across high-speed dedicated network links, because the system hardware is incapable of saturating the bandwidth supported by the network, which gives rise to buffer overflow and packet loss in the system. To overcome this, there is a need for a Performance Adaptive-UDP (PA-UDP) protocol that dynamically maximizes performance on different systems. A mathematical model and algorithms are used for effective buffer and CPU management. Performance Adaptive-UDP outperforms other protocols in its handling of memory processing, packet-loss processing, and CPU utilization. Based on this protocol, bulk data transfer is processed at high speed over dedicated network links.


An Overview of In-memory Processing with Emerging Non-volatile Memory for Data-intensive Applications

Proceedings of the 2019 on Great Lakes Symposium on VLSI - GLSVLSI '19 ◽

10.1145/3299874.3319452 ◽

2019 ◽

Cited By ~ 4

Author(s):

Bing Li ◽

Bonan Yan ◽

Hai Li

Keyword(s):

Memory Processing ◽

Data Intensive ◽

Non Volatile Memory ◽

Volatile Memory ◽

Data Intensive Applications


Multi-Agent Genetic Algorithm for Efficient Load Balancing in Cloud Computing

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.c8836.029420 ◽

2020 ◽

Vol 9 (4) ◽

pp. 45-51

Keyword(s):

Genetic Algorithm ◽

Cloud Computing ◽

Load Balancing ◽

Optimization Problem ◽

Virtual Machines ◽

Work Load ◽

Np Hard Problem ◽

Multi Agent ◽

Time And Energy ◽

User Priority

Cloud computing, one of the fastest growing fields, is the delivery of computing resources and services. Load balancing is a key problem in cloud computing (CC) that deals with the even distribution of work load across multiple virtual machines to ensure that no machine is overloaded or underutilized during the task computation. The load balancing optimization problem is an NP-hard problem, hence, for the optimal usage of available resources, we propose a new efficient user-priority multi-agent genetic algorithm (GA). Our algorithm takes the “users’ priority and earliest job finishing time” into consideration for minimizing the response time and energy. We simulate our algorithm using Cloud-Analyst and show that our algorithm outperforms the existing algorithms for load balancing.
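The two criteria named in the abstract, user priority and earliest job finishing time, can be illustrated with a simple greedy baseline. The sketch below is not the authors' multi-agent genetic algorithm: it is a plain greedy heuristic, and the function name assign_jobs, the job tuple layout, and the convention that a lower priority number means a more urgent job are all assumptions made for illustration.

```python
def assign_jobs(jobs, vm_speeds):
    """Greedy baseline (not a genetic algorithm).

    jobs: list of (name, priority, length); a lower priority number is
          treated as more urgent (an assumed convention).
    vm_speeds: processing speed of each virtual machine.
    Returns {name: (vm_index, finish_time)}.
    """
    free_at = [0.0] * len(vm_speeds)  # time at which each VM becomes idle
    schedule = {}
    # Serve jobs in priority order; sorted() is stable, so ties keep input order.
    for name, _priority, length in sorted(jobs, key=lambda job: job[1]):
        # Pick the VM on which this job would finish earliest.
        finish, vm = min(
            (free_at[v] + length / vm_speeds[v], v) for v in range(len(vm_speeds))
        )
        free_at[vm] = finish
        schedule[name] = (vm, finish)
    return schedule


# Usage: three jobs on one slow VM (speed 1.0) and one fast VM (speed 2.0).
jobs = [("a", 2, 10.0), ("b", 1, 20.0), ("c", 3, 10.0)]
print(assign_jobs(jobs, [1.0, 2.0]))
```

A heuristic like this could serve as a reference point against which a search-based method such as a genetic algorithm is compared, since it is cheap to compute but makes each decision without looking ahead.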


Deploying Multi-tenant FPGAs within Linux-based Cloud Infrastructure

ACM Transactions on Reconfigurable Technology and Systems ◽

10.1145/3474058 ◽

2022 ◽

Vol 15 (2) ◽

pp. 1-31

Author(s):

Joel Mandebi Mbongue ◽

Danielle Tchuinkou Kwadjo ◽

Alex Shuping ◽

Christophe Bobda

Keyword(s):

Software Architecture ◽

Hardware Acceleration ◽

Maximum Frequency ◽

Cloud Infrastructure ◽

Fpga Design ◽

Data Movement ◽

Field Programmable ◽

Minimal Data ◽

On Chip ◽

Cloud Users

Cloud deployments now increasingly exploit Field-Programmable Gate Array (FPGA) accelerators as part of virtual instances. While cloud FPGAs are still essentially single-tenant, the growing demand for efficient hardware acceleration paves the way to FPGA multi-tenancy. It then becomes necessary to explore architectures, design flows, and resource management features that aim at exposing multi-tenant FPGAs to the cloud users. In this article, we discuss a hardware/software architecture that supports provisioning space-shared FPGAs in Kernel-based Virtual Machine (KVM) clouds. The proposed hardware/software architecture introduces an FPGA organization that improves hardware consolidation and supports hardware elasticity with minimal data movement overhead. It also relies on VirtIO to decrease communication latency between hardware and software domains. Prototyping the proposed architecture with a Virtex UltraScale+ FPGA demonstrated near-specification maximum frequency for on-chip data movement and high throughput in virtual instance access to hardware accelerators. We demonstrate similar performance compared to single-tenant deployment while increasing FPGA utilization, which is one of the goals of virtualization. Overall, our FPGA design achieved about 2× higher maximum frequency than the state of the art and a bandwidth reaching up to 28 Gbps on 32-bit data width.


Live migration of virtual machines with their local persistent storage in a data intensive cloud

International Journal of High Performance Computing and Networking ◽

10.1504/ijhpcn.2017.10003771 ◽

2017 ◽

Vol 10 (1/2) ◽

pp. 134 ◽

Cited By ~ 1

Author(s):

P. Santhi Thilagam ◽

Abhinit Modi ◽

Raghavendra Achar

Keyword(s):

Virtual Machines ◽

Live Migration ◽

Data Intensive ◽

Persistent Storage


A Survey of Resource Management for Processing-In-Memory and Near-Memory Processing Architectures

Journal of Low Power Electronics and Applications ◽

10.3390/jlpea10040030 ◽

2020 ◽

Vol 10 (4) ◽

pp. 30

Author(s):

Kamil Khan ◽

Sudeep Pasricha ◽

Ryan Gary Kim

Keyword(s):

Resource Management ◽

Random Access ◽

Computation Offloading ◽

New Paradigm ◽

Data Movement ◽

Memory Processing ◽

Big Data Applications ◽

Challenges And Opportunities ◽

Management Techniques ◽

Access Patterns

Due to the amount of data involved in emerging deep learning and big data applications, operations related to data movement have quickly become a bottleneck. Data-centric computing (DCC), as enabled by processing-in-memory (PIM) and near-memory processing (NMP) paradigms, aims to accelerate these types of applications by moving the computation closer to the data. Over the past few years, researchers have proposed various memory architectures that enable DCC systems, such as logic layers in 3D-stacked memories or charge-sharing-based bitwise operations in dynamic random-access memory (DRAM). However, application-specific memory access patterns, power and thermal concerns, memory technology limitations, and inconsistent performance gains complicate the offloading of computation in DCC systems. Therefore, designing intelligent resource management techniques for computation offloading is vital for leveraging the potential offered by this new paradigm. In this article, we survey the major trends in managing PIM and NMP-based DCC systems and provide a review of the landscape of resource management techniques employed by system designers for such systems. Additionally, we discuss the future challenges and opportunities in DCC management.
