Integrating Web service and grid enabling technologies to provide desktop access to high-performance cluster-based components for large-scale data services

Recent progress in high-throughput instrumentations has led to an astonishing growth in both volume and complexity of biomedical data collected from various sources. The planet-size data brings serious challenges to the storage and computing technologies. Cloud computing is an alternative to crack the nut because it gives concurrent consideration to enable storage and high-performance computing on large-scale data. This work briefly introduces the data intensive computing system and summarizes existing cloud-based resources in bioinformatics. These developments and applications would facilitate biomedical research to make the vast amount of diversification data meaningful and usable.

Download Full-text

SW-LZMA: Parallel Implementation of LZMA Based on SW26010 Many-Core Processor

Wireless Communications and Mobile Computing ◽

10.1155/2021/4486494 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Bingzheng Li ◽

Jinchen Xu ◽

Zijing Liu

Keyword(s):

High Performance Computing ◽

High Performance ◽

Large Scale ◽

Cluster Systems ◽

Large Scale Data ◽

Many Core ◽

High Performance Computing Cluster ◽

Performance Computing ◽

Scale Data ◽

Computing Cluster

With the development of high-performance computing and big data applications, the scale of data transmitted, stored, and processed by high-performance computing cluster systems is increasing explosively. Efficient compression of large-scale data and reducing the space required for data storage and transmission is one of the keys to improving the performance of high-performance computing cluster systems. In this paper, we present SW-LZMA, a parallel design and optimization of LZMA based on the Sunway 26010 heterogeneous many-core processor. Combined with the characteristics of SW26010 processors, we analyse the storage space requirements, memory access characteristics, and hotspot functions of the LZMA algorithm and implement the thread-level parallelism of the LZMA algorithm based on Athread interface. Furthermore, we make a fine-grained layout of LDM address space to achieve DMA double buffer cyclic sliding window algorithm, which optimizes the performance of SW-LZMA. The experimental results show that compared with the serial baseline implementation of LZMA, the parallel LZMA algorithm obtains a maximum speedup ratio of 4.1 times using the Silesia corpus benchmark, while on the large-scale data set, speedup is 5.3 times.

Download Full-text

PHash: A memory-efficient, high-performance key-value store for large-scale data-intensive applications

Journal of Systems and Software ◽

10.1016/j.jss.2016.09.047 ◽

2017 ◽

Vol 123 ◽

pp. 33-44 ◽

Cited By ~ 2

Author(s):

Hyotaek Shim

Keyword(s):

High Performance ◽

Large Scale ◽

Data Intensive ◽

Large Scale Data ◽

Data Intensive Applications ◽

Scale Data ◽

Memory Efficient

Download Full-text

High-Performance Machine Learning for Large-Scale Data Classification considering Class Imbalance

Scientific Programming ◽

10.1155/2020/1953461 ◽

2020 ◽

Vol 2020 ◽

pp. 1-16

Author(s):

Yang Liu ◽

Xiang Li ◽

Xianbang Chen ◽

Xi Wang ◽

Huaqiang Li

Keyword(s):

Neural Network ◽

Machine Learning ◽

High Performance ◽

Large Scale ◽

Class Imbalance ◽

Data Classification ◽

Training Dataset ◽

Large Scale Data ◽

Input Layer ◽

Scale Data

Currently, data classification is one of the most important ways to analysis data. However, along with the development of data collection, transmission, and storage technologies, the scale of the data has been sharply increased. Additionally, due to multiple classes and imbalanced data distribution in the dataset, the class imbalance issue is also gradually highlighted. The traditional machine learning algorithms lack of abilities for handling the aforementioned issues so that the classification efficiency and precision may be significantly impacted. Therefore, this paper presents an improved artificial neural network in enabling the high-performance classification for the imbalanced large volume data. Firstly, the Borderline-SMOTE (synthetic minority oversampling technique) algorithm is employed to balance the training dataset, which potentially aims at improving the training of the back propagation neural network (BPNN), and then, zero-mean, batch-normalization, and rectified linear unit (ReLU) are further employed to optimize the input layer and hidden layers of BPNN. At last, the ensemble learning-based parallelization of the improved BPNN is implemented using the Hadoop framework. Positive conclusions can be summarized according to the experimental results. Benefitting from Borderline-SMOTE, the imbalanced training dataset can be balanced, which improves the training performance and the classification accuracy. The improvements for the input layer and hidden layer also enhance the training performances in terms of convergence. The parallelization and the ensemble learning techniques enable BPNN to implement the high-performance large-scale data classification. The experimental results show the effectiveness of the presented classification algorithm.

Download Full-text

Work in progress — Integration of the scientific workflow paradigm into high performance computing and large scale data management curricula

2010 IEEE Frontiers in Education Conference (FIE) ◽

10.1109/fie.2010.5673235 ◽

2010 ◽

Author(s):

Brandeis Marshall ◽

John Springer ◽

Thomas Hacker

Keyword(s):

Data Management ◽

High Performance Computing ◽

High Performance ◽

Large Scale ◽

Scientific Workflow ◽

Work In Progress ◽

Large Scale Data ◽

Performance Computing ◽

Scale Data

Download Full-text

Exploring the optimal strategy for large-scale data movement in high-performance networks

2012 IEEE 31st International Performance Computing and Communications Conference (IPCCC) ◽

10.1109/pccc.2012.6407676 ◽

2012 ◽

Author(s):

P. Brown ◽

Mengxia Zhu ◽

Qishi Wu ◽

Daqing Yun ◽

J. Zurawski

Keyword(s):

Optimal Strategy ◽

High Performance ◽

Large Scale ◽

Data Movement ◽

Large Scale Data ◽

Scale Data

Download Full-text

Large-Scale Co-Phylogenetic Analysis on the Grid

Cloud, Grid and High Performance Computing ◽

10.4018/978-1-60960-603-9.ch014 ◽

2011 ◽

pp. 222-237

Author(s):

Heinz Stockinger ◽

Alexander F. Auch ◽

Markus Göker ◽

Jan Meier-Kolthoff ◽

Alexandros Stamatakis

Keyword(s):

Phylogenetic Analysis ◽

Phylogenetic Trees ◽

High Performance ◽

Large Scale ◽

Large Scale Data ◽

Statistical Framework ◽

Phylogenetic Studies ◽

Data Analyses ◽

Host Parasite ◽

Scale Data

Phylogenetic data analysis represents an extremely compute-intensive area of Bioinformatics and thus requires high-performance technologies. Another compute- and memory-intensive problem is that of host-parasite co-phylogenetic analysis: given two phylogenetic trees, one for the hosts (e.g., mammals) and one for their respective parasites (e.g., lice) the question arises whether host and parasite trees are more similar to each other than expected by chance alone. CopyCat is an easy-to-use tool that allows biologists to conduct such co-phylogenetic studies within an elaborate statistical framework based on the highly optimized sequential and parallel AxParafit program. We have developed enhanced versions of these tools that efficiently exploit a Grid environment and therefore facilitate large-scale data analyses. Furthermore, we developed a freely accessible client tool that provides co-phylogenetic analysis capabilities. Since the computational bulk of the problem is embarrassingly parallel, it fits well to a computational Grid and reduces the response time of large scale analyses.

Download Full-text