A blinded evaluation of privacy preserving record linkage with Bloom filters

Sean Randall; Helen Wichmann; Adrian Brown; James Boyd; Tom Eitelhuber; Alexandra Merchant; Anna Ferrante

doi:10.1186/s12874-022-01510-2

A blinded evaluation of privacy preserving record linkage with Bloom filters

BMC Medical Research Methodology ◽

10.1186/s12874-022-01510-2 ◽

2022 ◽

Vol 22 (1) ◽

Author(s):

Sean Randall ◽

Helen Wichmann ◽

Adrian Brown ◽

James Boyd ◽

Tom Eitelhuber ◽

...

Keyword(s):

Record Linkage ◽

Bloom Filter ◽

Privacy Preserving ◽

Bloom Filters ◽

Filter Method ◽

Morbidity Data ◽

Linkage Quality ◽

Hospital Morbidity ◽

Western Australian ◽

Direct Investigation

Abstract. Background: Privacy preserving record linkage (PPRL) methods using Bloom filters have shown promise for use in operational linkage settings. However, real-world evaluations are required to confirm their suitability in practice. Methods: Records extracted from the Western Australian (WA) Hospital Morbidity Data Collection 2011–2015 and WA Death Registrations 2011–2015 were encoded to Bloom filters and then linked using privacy-preserving methods. Results were compared to a traditional, un-encoded linkage of the same datasets using the same blocking criteria to enable direct investigation of the comparison step. The encoded linkage was carried out in a blinded setting, where there was no access to un-encoded data or a ‘truth set’. Results: The PPRL method using Bloom filters provided similar linkage quality to the traditional un-encoded linkage, with 99.3% of ‘groupings’ identical between privacy preserving and clear-text linkage. Conclusion: The Bloom filter method appears suitable for use in situations where clear-text identifiers cannot be provided for linkage.
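
For readers unfamiliar with the encoding step, the following is a minimal sketch of how an identifier can be turned into a Bloom filter and compared with the Dice coefficient. It is not the configuration used in the study: the filter size, the number of hash functions, the padded-bigram tokenisation, the keyed SHA-256 hashing and the names `bigrams`, `bloom_encode` and `dice` are all assumptions chosen for illustration.

```python
import hashlib


def bigrams(value):
    """Split a string into padded character bigrams, e.g. 'ann' -> _a, an, nn, n_."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, size=1024, num_hashes=4, secret="demo-key"):
    """Return the set of bit positions set to 1 for this value.

    Assumption: each bigram is hashed num_hashes times with a keyed
    SHA-256 digest; real deployments choose their own hashing scheme.
    """
    bits = set()
    for gram in bigrams(value):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{secret}|{seed}|{gram}".encode()).hexdigest()
            bits.add(int(digest, 16) % size)
    return bits


def dice(a, b):
    """Dice coefficient between two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    base = bloom_encode("catherine")
    print(dice(base, bloom_encode("katherine")))  # similar spelling, high score
    print(dice(base, bloom_encode("smith")))      # unrelated name, low score
```

Because similar strings share most of their bigrams, their filters share most of their set bits, which is what allows approximate matching without access to the clear-text values.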


Privacy preserving linkage using multiple dynamic match keys

International Journal for Population Data Science ◽

10.23889/ijpds.v4i1.1094 ◽

2019 ◽

Vol 4 (1) ◽

Author(s):

Sean Randall ◽

Adrian P Brown ◽

Anna M Ferrante ◽

James H Boyd

Keyword(s):

Bloom Filter ◽

Privacy Preserving ◽

Bloom Filters ◽

Filter Method ◽

Linkage Quality ◽

Filter Approach ◽

Synthetic Datasets ◽

Administrative Datasets ◽

Limited Accuracy ◽

Better Than

Introduction: Available and practical methods for privacy preserving linkage have shortcomings: methods utilising anonymous linkage codes provide limited accuracy, while methods based on Bloom filters have proven vulnerable to frequency-based attacks. Objectives: In this paper, we present and evaluate a novel protocol that aims to meld the accuracy of the Bloom filter method with the privacy achievable through the anonymous linkage code methodology. Methods: The protocol involves creating multiple match-keys for each record, with the composition of each match-key depending on attributes of the underlying datasets being compared. The protocol was evaluated through de-duplication of four administrative datasets and two synthetic datasets; the ‘answers’ outlining which records belonged to the same individual were known for each dataset. The results were compared against results achieved with un-encoded linkage and other privacy preserving techniques on the same datasets. Results: The multiple match-key protocol presented here achieved high quality across all datasets, performing better than record-level Bloom filters and the SLK, but worse than field-level Bloom filters. Conclusion: The presented method provides high linkage quality while avoiding the frequency-based attacks that have been demonstrated against the Bloom filter approach. The method appears promising for real-world use.
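
The general idea of linking on several hashed match-keys can be sketched as follows. This is an illustration only, not the authors' protocol: the abstract says the composition of each match-key depends on the datasets being compared, whereas the field combinations in `MATCH_KEY_FIELDS`, the field names, the shared secret and the "link if any key agrees" rule below are hypothetical and fixed in advance.

```python
import hashlib
import hmac

# Hypothetical match-key definitions; the paper derives these from the data.
MATCH_KEY_FIELDS = [
    ("first_name", "last_name", "dob"),
    ("last_name", "dob", "postcode"),
    ("first_name", "dob", "postcode"),
]


def match_keys(record, secret=b"shared-secret"):
    """Return the set of keyed hashes (one per usable match-key) for a record."""
    keys = set()
    for idx, fields in enumerate(MATCH_KEY_FIELDS):
        values = [str(record.get(f) or "").strip().lower() for f in fields]
        if not all(values):
            continue  # skip a match-key if any of its fields is missing
        message = f"{idx}|" + "|".join(values)
        keys.add(hmac.new(secret, message.encode(), hashlib.sha256).hexdigest())
    return keys


def is_link(a, b):
    """Assumed decision rule: two records link if any match-key agrees exactly."""
    return bool(match_keys(a) & match_keys(b))


if __name__ == "__main__":
    r1 = {"first_name": "Jon", "last_name": "Smith", "dob": "1980-01-02", "postcode": "6000"}
    r2 = {"first_name": "John", "last_name": "Smith", "dob": "1980-01-02", "postcode": "6000"}
    print(is_link(r1, r2))  # the first names differ, but the second match-key agrees
```

Each key is an exact-match hash, so a single key tolerates no error; tolerance comes from having several keys that omit different fields.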


Evaluating Binary Encoding Techniques in The Presence of Missing Values in Privacy-Preserving Record Linkage

International Journal for Population Data Science ◽

10.23889/ijpds.v5i5.1445 ◽

2020 ◽

Vol 5 (5) ◽

Author(s):

Thilina Ranbaduge ◽

Peter Christen

Keyword(s):

National Security ◽

Record Linkage ◽

Missing Values ◽

Bloom Filter ◽

Privacy Preserving ◽

Bloom Filters ◽

Practical Applications ◽

Binary Encoding ◽

Linkage Quality

Introduction: Applications in domains ranging from healthcare to national security increasingly require records about individuals in sensitive databases to be linked in privacy-preserving ways. Missing values make the linkage process challenging because they can affect the encoding of attribute values. No study has systematically investigated how missing values affect the outcomes of different encoding techniques used in privacy-preserving linkage applications. Objectives and Approach: Binary encodings, such as Bloom filters, are popular for linking sensitive databases. They are now employed in real-world linkage applications. However, existing encoding techniques assume the quasi-identifying attributes used for encoding to be complete. Missing values can lead to incomplete encodings, which can result in decreased or increased similarities and therefore in false non-matches or false matches. In this study we empirically evaluate three binary encoding techniques using real voter databases, where pairs of records that correspond to the same voter (with name or address changes) resulted in files of 100,000 and 500,000 records containing from 0% to 50% missing values. Results: We encoded between two and four of the attributes first and last name, street, and city into three record-level binary encodings: Cryptographic long-term key (CLK) [Schnell et al. 2009], record-level Bloom filter (RBF) [Durham et al. 2014], and tabulation Min-hashing (TBH) [Smith 2017]. Experiments showed a 10% to 25% drop on average in both precision and recall for all encoding techniques as the proportion of missing values increased. CLK resulted in the highest decrease in precision, while TBH resulted in the highest decrease in recall compared to the other encoding techniques. Conclusion: Binary encodings such as Bloom filters are now used in practical applications for linking sensitive databases. Our evaluation shows that such encoding techniques can result in lower linkage quality if there are missing values in quasi-identifying attributes. This highlights the need for novel encoding techniques that can overcome the challenge of missing values.
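
The effect of a missing attribute on a record-level encoding can be illustrated with a small sketch. It assumes a simplified CLK-style encoding in which the bigrams of all attributes are hashed into one shared filter; the parameters, the hashing scheme and the names `clk_encode` and `dice` are illustrative assumptions and do not reproduce the study's setup.

```python
import hashlib


def bigrams(value):
    """Padded character bigrams of a string."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def clk_encode(record, fields, size=1024, num_hashes=4, secret="demo-key"):
    """Hash the bigrams of every available field into one shared set of bits.

    Assumption: a missing field simply contributes no bits, which is the
    situation the abstract describes as an incomplete encoding.
    """
    bits = set()
    for field in fields:
        value = record.get(field) or ""
        if not value:
            continue
        for gram in bigrams(value):
            for seed in range(num_hashes):
                digest = hashlib.sha256(f"{secret}|{seed}|{gram}".encode()).hexdigest()
                bits.add(int(digest, 16) % size)
    return bits


def dice(a, b):
    """Dice coefficient between two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    fields = ["first_name", "last_name", "street", "city"]
    full = {"first_name": "Maria", "last_name": "Gonzalez",
            "street": "12 Main Road", "city": "Raleigh"}
    partial = dict(full, street=None)  # same person, street missing
    print(dice(clk_encode(full, fields), clk_encode(full, fields)))
    print(dice(clk_encode(full, fields), clk_encode(partial, fields)))
```

The second comparison scores below 1.0 even though both records describe the same person, which is how a missing value can turn a true match into a false non-match under a fixed similarity threshold.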


Secure Privacy Preserving Record Linkage of Large Databases by Modified Bloom Filter Encodings

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.29 ◽

2017 ◽

Vol 1 (1) ◽

Cited By ~ 2

Author(s):

Rainer Schnell ◽

Christian Borgs

Keyword(s):

Record Linkage ◽

Large Scale ◽

Bloom Filter ◽

Privacy Preserving ◽

Error Rates ◽

Bloom Filters ◽

Data Sets ◽

Research Subjects ◽

Practical Applications ◽

Large Databases

Abstract. Objective: In most European settings, record linkage across different institutions has to be based on personal identifiers such as names, birthday or place of birth. To protect the privacy of research subjects, the identifiers have to be encrypted. In practice, these identifiers show error rates of up to 20% per identifier; therefore, linking on encrypted identifiers usually implies the loss of large subsets of the databases. In many applications, this loss of cases is related to variables of interest for the subject matter of the study. Therefore, this kind of record linkage will generate biased estimates. These problems gave rise to techniques of Privacy Preserving Record Linkage (PPRL). Many different PPRL techniques have been suggested within the last 10 years, but very few of them are suitable for practical applications with large databases containing millions of records, as is typical for administrative or medical databases. One proven technique for large-scale applications is PPRL based on Bloom filters. Method: Using appropriate parameter settings, Bloom filter approaches show linkage results comparable to linkage based on unencrypted identifiers. Furthermore, this approach has been used in real-world settings with data sets containing up to 100 million records. By the application of suitable blocking strategies, linking can be done in reasonable time. Result: However, Bloom filters have been the subject of cryptographic attacks. Previous research has shown that the straightforward application of Bloom filters has a nonzero re-identification risk. We will present new results on recently developed techniques to defy all known attacks on PPRL Bloom filters. These computationally simple algorithms modify the identifiers by different cryptographic diffusion techniques. The presentation will demonstrate these new algorithms and show their performance concerning precision, recall and re-identification risk on large databases.


Optimization of the Mainzelliste software for fast privacy-preserving record linkage

Journal of Translational Medicine ◽

10.1186/s12967-020-02678-1 ◽

2021 ◽

Vol 19 (1) ◽

Author(s):

Florens Rohde ◽

Martin Franke ◽

Ziad Sehili ◽

Martin Lablans ◽

Erhard Rahm

Keyword(s):

Record Linkage ◽

Comprehensive Evaluation ◽

Personal Data ◽

Privacy Preserving ◽

Use Cases ◽

Locality Sensitive Hashing ◽

Third Party ◽

Bloom Filters ◽

Linkage Quality ◽

Order Of Magnitude

Abstract. Background: Data analysis for biomedical research often requires a record linkage step to identify records from multiple data sources referring to the same person. Due to the lack of unique personal identifiers across these sources, record linkage relies on the similarity of personal data such as first and last names or birth dates. However, the exchange of such identifying data with a third party, as is the case in record linkage, is generally subject to strict privacy requirements. This problem is addressed by privacy-preserving record linkage (PPRL) and pseudonymization services. Mainzelliste is an open-source record linkage and pseudonymization service used to carry out PPRL processes in real-world use cases. Methods: We evaluate the linkage quality and performance of the linkage process using several real and near-real datasets with different properties with respect to size and error rate of matching records. We conduct a comparison between (plaintext) record linkage and PPRL based on encoded records (Bloom filters). Furthermore, since the Mainzelliste software offers no blocking mechanism, we extend it by phonetic blocking as well as novel blocking schemes based on locality-sensitive hashing (LSH) to improve runtime for both standard and privacy-preserving record linkage. Results: The Mainzelliste achieves high linkage quality for PPRL using field-level Bloom filters due to the use of an error-tolerant matching algorithm that can handle variances in names, in particular missing or transposed name compounds. However, due to the absence of blocking, the runtimes are unacceptable for real use cases with larger datasets. The newly implemented blocking approaches improve runtimes by orders of magnitude while retaining high linkage quality. Conclusion: We conduct the first comprehensive evaluation of the record linkage facilities of the Mainzelliste software and extend it with blocking methods to improve its runtime. We observed very high linkage quality for both plaintext and encoded data, even in the presence of errors. The blocking methods improve runtime performance by an order of magnitude, thus facilitating use in research projects with large datasets and many participants.
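
To show why blocking reduces runtime, here is a minimal sketch of one generic form of LSH blocking on Bloom filters (sampling bit positions, sometimes called Hamming LSH). It is not the Mainzelliste extension itself: the function names, the number of tables, the key length and the fixed seed are assumptions for illustration.

```python
import random
from collections import defaultdict
from itertools import combinations


def lsh_keys(bits, size, num_tables=8, bits_per_key=12, seed=42):
    """Build one blocking key per table by sampling fixed bit positions.

    The same seed is used for every record so that all records are
    sampled at identical positions.
    """
    rng = random.Random(seed)
    keys = []
    for table in range(num_tables):
        positions = rng.sample(range(size), bits_per_key)
        keys.append((table, tuple(int(p in bits) for p in positions)))
    return keys


def candidate_pairs(filters, size):
    """Return the record-id pairs that share at least one blocking key.

    `filters` maps a record id to the set of bit positions that are 1.
    """
    buckets = defaultdict(set)
    for rid, bits in filters.items():
        for key in lsh_keys(bits, size):
            buckets[key].add(rid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    size = 1024
    a = set(range(0, size, 3))
    filters = {"a": a, "b": set(a), "c": set(range(size)) - a}
    print(candidate_pairs(filters, size))
```

Only the pairs returned by `candidate_pairs` need a full similarity comparison, so the quadratic all-pairs step is avoided; similar filters are likely to agree on at least one sampled key, while very different filters rarely do.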


Cryptanalysis of Basic Bloom Filters Used for Privacy Preserving Record Linkage

Journal of Privacy and Confidentiality ◽

10.29012/jpc.v6i2.640 ◽

2014 ◽

Vol 6 (2) ◽

Cited By ~ 23

Author(s):

Frank Niedermeyer ◽

Simone Steinmetzer ◽

Martin Kroll ◽

Rainer Schnell

Keyword(s):

Statistical Analysis ◽

Record Linkage ◽

Bloom Filter ◽

Privacy Preserving ◽

Bloom Filters ◽

Small Subset ◽

Data Set ◽

Successful Attack ◽

Global Data

Bloom filter encoded identifiers are increasingly used for privacy preserving record linkage applications, because they allow for errors in encrypted identifiers. However, little research on the security of Bloom filters has been published so far. In this paper, we formalize a successful attack on Bloom filters composed of bigrams. It has previously been assumed in the literature that an attacker knows the global data set from which a sample is drawn. In contrast, we suppose that an attacker does not know this global data set. Instead, we assume the adversary knows a publicly available list of the most frequent attributes. The attack is based on subtle filtering and elementary statistical analysis of encrypted bigrams. The attack described in this paper can be used for the deciphering of a whole database instead of only a small subset of the most frequent names, as in previous research. We illustrate our proposed method with an attack on a database of encrypted surnames. Finally, we describe modifications of the Bloom filters for preventing similar attacks.


Encoding Diagnostic Codes for Privacy-Preserving Record Linkage

International Journal for Population Data Science ◽

10.23889/ijpds.v5i5.1461 ◽

2020 ◽

Vol 5 (5) ◽

Author(s):

Rainer Schnell ◽

Christian Borgs

Keyword(s):

Record Linkage ◽

Privacy Preserving ◽

New Method ◽

Bloom Filters ◽

Mortality Data ◽

Sensitive Information ◽

Diagnostic Code ◽

Relational Similarity ◽

Linkage Quality ◽

The Impact

Introduction: Diagnostic codes, such as the ICD-10, may be considered as sensitive information. If such codes have to be encoded using current methods for data linkage, all hierarchical information given by the code positions will be lost. We present a technique (HPBFs) for preserving the hierarchical information of the codes while protecting privacy. The new method modifies a widely used Privacy-preserving Record Linkage (PPRL) technique based on Bloom filters for the use with hierarchical codes. Objectives and Approach: Assessing the similarities of hierarchical codes requires considering the code positions of two codes in a given diagnostic hierarchy. The hierarchical similarities of the original diagnostic code pairs should correspond closely to the similarity of the encoded pairs of the same code. Furthermore, to assess the hierarchy-preserving properties of an encoding, the impact on similarity measures from differing code positions at all levels of the code hierarchy can be evaluated. A full match of codes should yield a higher similarity than partial matches. Finally, the new method is tested against ad-hoc solutions as an addition to a standard PPRL setup. This is done using real-world mortality data with a known link status of two databases. Results: In all applications for encoded ICD codes where either categorical discrimination, relational similarity or linkage quality in a PPRL setting is required, HPBFs outperform other known methods. Lower mean differences and smaller confidence intervals between clear-text codes and encrypted code pairs were observed, indicating better preservation of hierarchical similarities. Finally, using these techniques allows for much better hierarchical discrimination for partial matches. Conclusion: The new technique yields better linkage results than all other known methods to encrypt hierarchical codes. In all tests, comparing categorical discrimination, relational similarity and PPRL linkage quality, HPBFs outperformed methods currently used.


Securing Bloom Filters for Privacy-preserving Record Linkage

Proceedings of the 29th ACM International Conference on Information & Knowledge Management ◽

10.1145/3340531.3412105 ◽

2020 ◽

Author(s):

Thilina Ranbaduge ◽

Rainer Schnell

Keyword(s):

Record Linkage ◽

Privacy Preserving ◽

Bloom Filters


Encoding Hierarchical Classification Codes for Privacy-Preserving Record Linkage Using Bloom Filters

Machine Learning and Knowledge Discovery in Databases - Communications in Computer and Information Science ◽

10.1007/978-3-030-43887-6_12 ◽

2020 ◽

pp. 142-156

Author(s):

Rainer Schnell ◽

Christian Borgs

Keyword(s):

Record Linkage ◽

Hierarchical Classification ◽

Privacy Preserving ◽

Bloom Filters


Efficient Cryptanalysis of Bloom Filters for Privacy-Preserving Record Linkage

Advances in Knowledge Discovery and Data Mining - Lecture Notes in Computer Science ◽

10.1007/978-3-319-57454-7_49 ◽

2017 ◽

pp. 628-640 ◽

Cited By ~ 11

Author(s):

Peter Christen ◽

Rainer Schnell ◽

Dinusha Vatsalan ◽

Thilina Ranbaduge

Keyword(s):

Record Linkage ◽

Privacy Preserving ◽

Bloom Filters


Validation of intellectual disability coding through hospital morbidity records using an intellectual disability population-based database in Western Australia

BMJ Open ◽

10.1136/bmjopen-2017-019113 ◽

2018 ◽

Vol 8 (1) ◽

pp. e019113 ◽

Cited By ~ 6

Author(s):

Jenny Bourke ◽

Kingsley Wong ◽

Helen Leonard

Keyword(s):

Intellectual Disability ◽

International Classification Of Diseases ◽

Population Based ◽

Reliable Source ◽

Morbidity Data ◽

Classification Of Diseases ◽

Data Source ◽

Hospital Morbidity ◽

Western Australian ◽

Hospital Morbidity Data

Objectives: To investigate how well intellectual disability (ID) can be ascertained using hospital morbidity data compared with a population-based data source. Design, setting and participants: All children born in 1983–2010 with a hospital admission in the Western Australian Hospital Morbidity Data System (HMDS) were linked with the Western Australian Intellectual Disability Exploring Answers (IDEA) database. The International Classification of Diseases hospital codes consistent with ID were also identified. Main outcome measures: The characteristics of those children identified with ID through either or both sources were investigated. Results: Of the 488 905 individuals in the study, 10 218 (2.1%) were identified with ID in either IDEA or HMDS with 1435 (14.0%) individuals identified in both databases, 8305 (81.3%) unique to the IDEA database and 478 (4.7%) unique to the HMDS dataset only. Of those unique to the HMDS dataset, about a quarter (n=124) had died before 1 year of age and most of these (75%) before 1 month. Children with ID who were also coded as such in the HMDS data were more likely to be aged under 1 year, female, non-Aboriginal and have a severe level of ID, compared with those not coded in the HMDS data. The sensitivity of using HMDS to identify ID was 14.7%, whereas the specificity was much higher at 99.9%. Conclusion: Hospital morbidity data are not a reliable source for identifying ID within a population, and epidemiological researchers need to take these findings into account in their study design.
