Abstract
Background
Illumina DNA methylation arrays are high-throughput platforms for cost-effective genome-wide profiling of individual CpGs. Experimental and technical factors introduce appreciable measurement variation, some of which can be mitigated by careful “preprocessing” of raw data.
Methods
Here we describe the ENmix preprocessing pipeline and compare it with seven published alternative pipelines (ChAMP, Illumina, SWAN, Funnorm, Noob, wateRmelon, and RnBeads). We use two large sets of duplicate sample measurements from 450K and EPIC arrays, along with mixtures of isogenic methylated and unmethylated cell line DNA, to compare raw data with data preprocessed by each pipeline.
Results
Our evaluations show that the ENmix pipeline performs best, with significantly higher correlation and lower absolute difference between duplicate pairs, higher intraclass correlation coefficients (ICCs), and smaller deviations from expected methylation levels in mixture experiments. In addition to the pipeline function, the ENmix software provides an integrated set of functions for reading raw data files from mouse and human arrays, quality control, data preprocessing, visualization, detection of differentially methylated regions (DMRs), estimation of cell type proportions, and calculation of methylation age clocks. ENmix is computationally efficient and flexible, and it supports parallel computing. To facilitate further evaluations, we make all datasets and evaluation code publicly available.
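The duplicate-based metrics named above (correlation and absolute difference between duplicate pairs, and ICC) can be illustrated with a minimal Python sketch. This is not the evaluation code released with the study. The function name `duplicate_metrics`, the input layout (two CpG-by-pair matrices of beta values, one per replicate), and the choice of a one-way random-effects ICC are all assumptions made for illustration.

```python
import numpy as np


def duplicate_metrics(rep1, rep2):
    """Summarize agreement between duplicate methylation measurements.

    rep1, rep2: arrays of shape (n_cpgs, n_pairs) holding beta values for
    the first and second replicate of each duplicate pair (assumed layout).
    Returns per-pair Pearson correlation, per-pair mean absolute
    difference, and a per-CpG one-way ICC (assumed ICC variant).
    """
    rep1 = np.asarray(rep1, dtype=float)
    rep2 = np.asarray(rep2, dtype=float)
    if rep1.ndim != 2 or rep1.shape != rep2.shape:
        raise ValueError("rep1 and rep2 must be 2-D arrays of equal shape")
    n_cpgs, n_pairs = rep1.shape
    if n_pairs < 2:
        raise ValueError("at least two duplicate pairs are required")

    # Per-pair Pearson correlation, computed across CpGs.
    c1 = rep1 - rep1.mean(axis=0)
    c2 = rep2 - rep2.mean(axis=0)
    pair_corr = (c1 * c2).sum(axis=0) / np.sqrt(
        (c1 ** 2).sum(axis=0) * (c2 ** 2).sum(axis=0)
    )

    # Per-pair mean absolute difference, averaged across CpGs.
    pair_abs_diff = np.abs(rep1 - rep2).mean(axis=0)

    # Per-CpG one-way ICC with k = 2 measurements per pair:
    # (MSB - MSW) / (MSB + (k - 1) * MSW).
    # CpGs with no variation at all yield NaN here.
    k = 2
    pair_mean = (rep1 + rep2) / 2.0
    grand_mean = pair_mean.mean(axis=1, keepdims=True)
    msb = k * ((pair_mean - grand_mean) ** 2).sum(axis=1) / (n_pairs - 1)
    msw = (
        (rep1 - pair_mean) ** 2 + (rep2 - pair_mean) ** 2
    ).sum(axis=1) / (n_pairs * (k - 1))
    icc = (msb - msw) / (msb + (k - 1) * msw)

    return pair_corr, pair_abs_diff, icc
```

Under these assumptions, a pipeline that reduces technical variation would be expected to raise the per-pair correlations and per-CpG ICCs and to lower the per-pair absolute differences relative to raw data.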
Conclusion
Careful selection of robust data preprocessing methods is critical for DNA methylation array studies. In our evaluations, ENmix outperformed the other pipelines at minimizing experimental variation and improving data quality and study power.