Background:
Genotype imputation as a service has been developed to enable researchers to estimate
genotypes on haplotyped data without performing whole-genome sequencing. However, genotype
imputation is computationally intensive, so it remains a challenge to meet the high performance
requirements of genome-wide association studies (GWAS).
Objective:
In this paper, we propose a high-performance computing solution that reduces the execution
time of genotype imputation on supercomputers.
Method:
We design and implement a multi-level parallelization scheme comprising job-level, process-level
and thread-level parallelization, enabled by job scheduling management, the Message Passing Interface
(MPI) and OpenMP, respectively. The scheme involves job distribution, chunk partition and execution,
parallelized iteration for imputation, and data concatenation. This multi-level design allows us to
exploit the multi-machine/multi-core architecture to improve the performance
of genotype imputation.
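The chunk partition, parallel execution and concatenation steps described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: a Python thread pool stands in for the MPI processes and OpenMP threads, and the names `partition`, `impute_chunk` and `impute_parallel` are hypothetical. The placeholder `impute_chunk` simply fills each missing genotype with the most common observed value in its chunk, so that the example runs end to end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(n_markers, chunk_size):
    """Split marker indices [0, n_markers) into contiguous (start, end) chunks."""
    return [(start, min(start + chunk_size, n_markers))
            for start in range(0, n_markers, chunk_size)]


def impute_chunk(genotypes):
    """Placeholder imputation: replace None with the chunk's most common call.

    Hypothetical stand-in for a real imputation tool, used only so that
    the sketch is runnable.
    """
    observed = [g for g in genotypes if g is not None]
    fill = Counter(observed).most_common(1)[0][0] if observed else 0
    return [fill if g is None else g for g in genotypes]


def impute_parallel(genotypes, chunk_size, workers=4):
    """Partition the markers, impute chunks concurrently, concatenate in order."""
    chunks = [genotypes[s:e] for s, e in partition(len(genotypes), chunk_size)]
    # Assumption: a thread pool stands in for MPI ranks / OpenMP threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(impute_chunk, chunks))  # map keeps chunk order
    # Data concatenation: stitch the imputed chunks back together.
    return [g for chunk in results for g in chunk]


if __name__ == "__main__":
    data = [0, None, 0, 1, 1, None, 2, 2, None, 2]
    print(impute_parallel(data, chunk_size=3))
```

Because `pool.map` returns results in submission order, the concatenation step needs no extra bookkeeping to restore marker order; an MPI-based version would typically have to gather the chunks by rank or chunk index instead.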
Results:
Experimental results show that the proposed method outperforms the Hadoop-based
implementation of genotype imputation. We also evaluate the proposed method on supercomputers;
the evaluation shows that it significantly shortens the execution time of genotype imputation.
Conclusion:
The proposed multi-level parallelization, when deployed as imputation as a service,
will help bioinformatics researchers in Singapore to conduct genotype imputation and strengthen
their association studies.