scholarly journals anonymizeBAM: Versatile anonymization of human sequence data for open data sharing

2021 ◽  
Author(s):  
Christoph Ziegenhain ◽  
Rickard Sandberg

AbstractThe risks associated with re-identification of human genetic data are severely limiting open data sharing in life sciences. Here, we developed anonymizeBAM, a versatile tool for the anonymization of genetic variant information present in sequence data. Applying anonymizeBAM to single-cell RNA-seq and ATAC-seq datasets confirmed the complete removal of donor-related genetic information. Therefore, the accurate generation of de-identified sequence data will re-enable open sharing in sequencing-based studies for improved transparency, reproducibility, and innovation.

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Christoph Ziegenhain ◽  
Rickard Sandberg

AbstractThe risks associated with re-identification of human genetic data are severely limiting open data sharing in life sciences, even in studies where donor-related genetic variant information is not of primary interest. Here, we developed BAMboozle, a versatile tool to eliminate critical types of sensitive genetic information in human sequence data by reverting aligned reads to the genome reference sequence. Applying BAMboozle to functional genomics data, such as single-cell RNA-seq (scRNA-seq) and scATAC-seq datasets, confirmed the removal of donor-related single nucleotide polymorphisms (SNPs) and indels in a manner that did not disclose the altered positions. Importantly, BAMboozle only removes the genetic sequence variants of the sample (i.e., donor) while preserving other important aspects of the raw sequence data. For example, BAMboozled scRNA-seq data contained accurate cell-type associated gene expression signatures, splice kinetic information, and can be used for methods benchmarking. Altogether, BAMboozle efficiently removes genetic variation in aligned sequence data, which represents a step forward towards open data sharing in many areas of genomics where the genetic variant information is not of primary interest.


2018 ◽  
Author(s):  
Brian Tsui ◽  
Michelle Dow ◽  
Dylan Skola ◽  
Hannah Carter

The Sequence Read Archive (SRA) contains over one million publicly available sequencing runs from various studies using a variety of sequencing library strategies. These data inherently contain information about underlying genomic sequence variants which we exploit to extract allelic read counts on an unprecedented scale. We reprocessed over 250,000 human sequencing runs (>1000 TB data worth of raw sequence data) into a single unified dataset of allelic read counts for nearly 300,000 variants of biomedical relevance curated by NCBI dbSNP, where germline variants were detected in a median of 912 sequencing runs, and somatic variants were detected in a median of 4,876 sequencing runs, suggesting that this dataset facilitates identification of sequencing runs that harbor variants of interest. Allelic read counts obtained using a targeted alignment were very similar to read counts obtained from whole genome alignment. Analyzing allelic read count data for matched DNA and RNA samples from tumors, we find that RNA-seq can also recover variants identified by WXS, suggesting that reprocessed allelic read counts can support variant detection across different library strategies in SRA. This study provides a rich database of known human variants across SRA samples that can support future meta-analyses of human sequence variation.


2021 ◽  
pp. 002203452110202
Author(s):  
F. Schwendicke ◽  
J. Krois

Data are a key resource for modern societies and expected to improve quality, accessibility, affordability, safety, and equity of health care. Dental care and research are currently transforming into what we term data dentistry, with 3 main applications: 1) medical data analysis uses deep learning, allowing one to master unprecedented amounts of data (language, speech, imagery) and put them to productive use. 2) Data-enriched clinical care integrates data from individual (e.g., demographic, social, clinical and omics data, consumer data), setting (e.g., geospatial, environmental, provider-related data), and systems level (payer or regulatory data to characterize input, throughput, output, and outcomes of health care) to provide a comprehensive and continuous real-time assessment of biologic perturbations, individual behaviors, and context. Such care may contribute to a deeper understanding of health and disease and a more precise, personalized, predictive, and preventive care. 3) Data for research include open research data and data sharing, allowing one to appraise, benchmark, pool, replicate, and reuse data. Concerns and confidence into data-driven applications, stakeholders’ and system’s capabilities, and lack of data standardization and harmonization currently limit the development and implementation of data dentistry. Aspects of bias and data-user interaction require attention. Action items for the dental community circle around increasing data availability, refinement, and usage; demonstrating safety, value, and usefulness of applications; educating the dental workforce and consumers; providing performant and standardized infrastructure and processes; and incentivizing and adopting open data and data sharing.


Author(s):  
Di Xian ◽  
Peng Zhang ◽  
Ling Gao ◽  
Ruijing Sun ◽  
Haizhen Zhang ◽  
...  

AbstractFollowing the progress of satellite data assimilation in the 1990s, the combination of meteorological satellites and numerical models has changed the way scientists understand the earth. With the evolution of numerical weather prediction models and earth system models, meteorological satellites will play a more important role in earth sciences in the future. As part of the space-based infrastructure, the Fengyun (FY) meteorological satellites have contributed to earth science sustainability studies through an open data policy and stable data quality since the first launch of the FY-1A satellite in 1988. The capability of earth system monitoring was greatly enhanced after the second-generation polar orbiting FY-3 satellites and geostationary orbiting FY-4 satellites were developed. Meanwhile, the quality of the products generated from the FY-3 and FY-4 satellites is comparable to the well-known MODIS products. FY satellite data has been utilized broadly in weather forecasting, climate and climate change investigations, environmental disaster monitoring, etc. This article reviews the instruments mounted on the FY satellites. Sensor-dependent level 1 products (radiance data) and inversion algorithm-dependent level 2 products (geophysical parameters) are introduced. As an example, some typical geophysical parameters, such as wildfires, lightning, vegetation indices, aerosol products, soil moisture, and precipitation estimation have been demonstrated and validated by in-situ observations and other well-known satellite products. To help users access the FY products, a set of data sharing systems has been developed and operated. The newly developed data sharing system based on cloud technology has been illustrated to improve the efficiency of data delivery.


2019 ◽  
Vol 10 (20) ◽  
pp. 17 ◽  
Author(s):  
Mattia Previtali ◽  
Riccardo Valente

<p>The open data paradigm is changing the research approach in many fields such as remote sensing and the social sciences. This is supported by governmental decisions and policies that are boosting the open data wave, and in this context archaeology is also affected by this new trend. In many countries, archaeological data are still protected or only limited access is allowed. However, the strong political and economic support for the publication of government data as open data will change the accessibility and disciplinary expertise in the archaeological field too. In order to maximize the impact of data, their technical openness is of primary importance. Indeed, since a spreadsheet is more usable than a PDF of a table, the availability of digital archaeological data, which is structured using standardised approaches, is of primary importance for the real usability of published data. In this context, the main aim of this paper is to present a workflow for archaeological data sharing as open data with a large level of technical usability and interoperability. Primary data is mainly acquired through the use of digital techniques (e.g. digital cameras and terrestrial laser scanning). The processing of this raw data is performed with commercial software for scan registration and image processing, allowing for a simple and semi-automated workflow. Outputs obtained from this step are then processed in modelling and drawing environments to generate digital models, both 2D and 3D. These crude geometrical data are then enriched with further information to generate a Geographic Information System (GIS) which is finally published as open data using Open Geospatial Consortium (OGC) standards to maximise interoperability.</p><p><strong>Highlights:</strong></p><ul><li><p>Open data will change the accessibility and disciplinary expertise in the archaeological field.</p></li><li><p>The main aim of this paper is to present a workflow for archaeological data sharing as open data with a large level of interoperability.</p></li><li><p>Digital acquisition techniques are used to document archaeological excavations and a Geographic Information System (GIS) is generated that is published as open data.</p></li></ul>


2021 ◽  
Author(s):  
William T Clarke ◽  
Mark Mikkelsen ◽  
Georg Oeltzschner ◽  
Tiffany Bell ◽  
Amirmohammad Shamaei ◽  
...  

Purpose: The use of multiple data formats in the MRS community currently hinders data sharing and integration. NIfTI-MRS is proposed as a standard MR spectroscopy data format, which is implemented as an extension to the neuroimaging informatics technology initiative (NIfTI) format. Using this standardised format will facilitate data sharing, ease algorithm development, and encourage the integration of MRS analysis with other imaging modalities. Methods: A file format based on the NIfTI header extension framework was designed to incorporate essential spectroscopic metadata and additional encoding dimensions. A detailed description of the specification is provided. An open-source command-line conversion program is implemented to enable conversion of single-voxel and spectroscopic imaging data to NIfTI-MRS. To provide visualisation of data in NIfTI-MRS, a dedicated plugin is implemented for FSLeyes, the FSL image viewer. Results: Alongside online documentation, ten example datasets are provided in the proposed format. In addition, minimal examples of NIfTI-MRS readers have been implemented. The conversion software, spec2nii, currently converts fourteen formats to NIfTI-MRS, including DICOM and vendor proprietary formats. Conclusion: The proposed format aims to solve the issue of multiple data formats being used in the MRS community. By providing a single conversion point, it aims to simplify the processing and analysis of MRS data, thereby lowering the barrier to use of MRS. Furthermore, it can serve as the basis for open data sharing, collaboration, and interoperability of analysis programs. It also opens possibility of greater standardisation and harmonisation. By aligning with the dominant format in neuroimaging, NIfTI-MRS enables the use of mature tools present in the imaging community, demonstrated in this work by using a dedicated imaging tool, FSLeyes, as a viewer.


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e11875
Author(s):  
Tomoko Matsuda

Large volumes of high-throughput sequencing data have been submitted to the Sequencing Read Archive (SRA). The lack of experimental metadata associated with the data makes reuse and understanding data quality very difficult. In the case of RNA sequencing (RNA-Seq), which reveals the presence and quantity of RNA in a biological sample at any moment, it is necessary to consider that gene expression responds over a short time interval (several seconds to a few minutes) in many organisms. Therefore, to isolate RNA that accurately reflects the transcriptome at the point of harvest, raw biological samples should be processed by freezing in liquid nitrogen, immersing in RNA stabilization reagent or lysing and homogenizing in RNA lysis buffer containing guanidine thiocyanate as soon as possible. As the number of samples handled simultaneously increases, the time until the RNA is protected can increase. Here, to evaluate the effect of different lag times in RNA protection on RNA-Seq data, we harvested CHO-S cells after 3, 5, 6, and 7 days of cultivation, added RNA lysis buffer in a time course of 15, 30, 45, and 60 min after harvest, and conducted RNA-Seq. These RNA samples showed high RNA integrity number (RIN) values indicating non-degraded RNA, and sequence data from libraries prepared with these RNA samples was of high quality according to FastQC. We observed that, at the same cultivation day, global trends of gene expression were similar across the time course of addition of RNA lysis buffer; however, the expression of some genes was significantly different between the time-course samples of the same cultivation day; most of these differentially expressed genes were related to apoptosis. We conclude that the time lag between sample harvest and RNA protection influences gene expression of specific genes. It is, therefore, necessary to know not only RIN values of RNA and the quality of the sequence data but also how the experiment was performed when acquiring RNA-Seq data from the database.


Sign in / Sign up

Export Citation Format

Share Document