Bloom filter medicine

3/8/2024

The accuracy or ‘quality’ of privacy-preserving record linkage (PPRL) techniques has often been overlooked. To be considered for operational use within large-scale linkage systems, their accuracy must be sufficiently high.

A Bloom filter is a probabilistic data structure that is used to approximate the equality of two sets. Such similarity comparisons are extremely useful in record linkage because they allow for typographic errors and variations in spelling. Bloom filters are implemented using an array of bits. Text values are first split into elements (typically bigrams), and each element is added to the Bloom filter by applying one or more hash functions to it. The results of these hash functions determine which positions in the bit array are set to one.

Typically, PPRL techniques that use Bloom filters are applied at either the field or record level. Field level Bloom filters encode each identifier into a separate Bloom filter. Record linkage techniques (deterministic and probabilistic) can then be used to link records in much the same way as with unencrypted identifiers. Record level (or composite) Bloom filters encode two or more identifiers into a single Bloom filter. Composite Bloom filters may be useful in certain situations where a single linkage field is desirable or even mandated, but handling missing values and identifiers that change over time (such as address) remains an issue.

Probabilistic record linkage is preferred by many data linkage centres due to its proven track record of producing high quality linkage results from unencrypted identifiers. It has been shown to produce equally good results when applied to Bloom filters. An extension to the basic probabilistic model of record linkage allows for approximate matches between fields. An approximate match is typically assigned a ‘similarity score’. These scores are then converted into partial agreement or partial disagreement weights (as distinct from full agreement or full disagreement). The use of partial agreement linkage models has been shown to greatly improve linkage quality when compared to the use of exact comparisons.
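The Bloom filter construction described in this post (bigrams, hash functions, bit array), together with the field level and record level variants, can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the function names, the 64-bit filter length, the two hash functions derived from seeded SHA-256, the underscore padding, and building the composite filter as a bitwise OR of the field filters. It is not the encoding used in the research discussed here, and an operational scheme would need keyed hashing and carefully chosen parameters.

```python
import hashlib

NUM_BITS = 64    # assumed filter length, chosen only for illustration
NUM_HASHES = 2   # assumed number of hash functions applied to each bigram


def bigrams(text):
    """Split a text value into overlapping two-character elements."""
    padded = f"_{text.lower()}_"  # padding and lower-casing are assumptions
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def bloom_encode(text, num_bits=NUM_BITS, num_hashes=NUM_HASHES):
    """Encode one text value as a list of 0/1 bits."""
    bits = [0] * num_bits
    for gram in bigrams(text):
        for seed in range(num_hashes):
            # Seeded SHA-256 stands in for a family of hash functions.
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            position = int.from_bytes(digest[:8], "big") % num_bits
            bits[position] = 1
    return bits


def encode_fields(record):
    """Field level: one separate Bloom filter per identifier."""
    return {name: bloom_encode(value) for name, value in record.items()}


def encode_composite(record):
    """Record level: a single filter, here the bitwise OR of the field filters."""
    field_filters = encode_fields(record)
    return [max(column) for column in zip(*field_filters.values())]


# Hypothetical record used only to exercise the functions.
record = {"first_name": "Catherine", "surname": "Smith"}
field_level = encode_fields(record)
record_level = encode_composite(record)
print(sum(field_level["surname"]), "bits set for surname")
print(sum(record_level), "bits set in the composite filter")
```

Because the composite filter merges every identifier into one bit array, a missing or changed identifier alters the whole filter, which is one way to see why missing values and changing identifiers are harder to handle at record level.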

In recent years, record linkage centres have adopted many different models and linkage methods to ensure the protection of individual privacy as part of their operational processes. With growing demand for linked data, it has been critical for record linkage centres to implement methods which protect privacy, yet maximise the benefits that can be derived from data assets. As a result, research around privacy-preserving record linkage (PPRL) methods has become a pressing area of inquiry, with much focus on the use of Bloom filters. Much research has focussed on the security aspect of Bloom filters, such as cryptographic analyses of encoding methods, modifications, and hashing variations.

The need for increased privacy protection in data linkage has driven the development of PPRL techniques. A popular technique uses Bloom filters, and cryptographic analyses, modifications, and hashing variations to optimise privacy have been the focus of much research in this area. With few applications of Bloom filters within a probabilistic framework, there is limited information on whether approximate matches between Bloom filtered fields can improve linkage quality. In this study, we evaluate the effectiveness of three approximate comparison methods for Bloom filters within the context of the Fellegi-Sunter model of record linkage: Sørensen–Dice coefficient, Jaccard similarity and Hamming distance.

Using synthetic datasets with introduced errors to simulate a range of data quality, together with a large real-world administrative health dataset, the research estimated partial weight curves for converting similarity scores (for each approximate comparison method) to partial weights at both field and dataset level. Deduplication linkages were run on each dataset using these partial weight curves. The quality of the approximate comparison techniques was then compared with that of linkages using simple cut-off similarity values and linkages using only exact matching.

Linkages using approximate comparisons produced significantly better quality results than those using exact comparisons only. Field level partial weight curves for a specific dataset produced the best quality results. The Sørensen–Dice coefficient and Jaccard similarity produced the most consistent results across a spectrum of synthetic and real-world datasets.

The use of Bloom filter similarity comparisons for probabilistic record linkage can produce linkage quality results which are comparable to Jaro-Winkler string similarities with unencrypted linkages. Probabilistic linkages using Bloom filters benefit significantly from the use of similarity comparisons, with partial weight curves producing the best results, even when not optimised for that particular dataset.
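The three approximate comparison methods named in this post, and the step of turning a similarity score into a partial weight, can be sketched as follows. This is an illustration under stated assumptions, not the study's method: Bloom filters are represented as equal-length lists of 0/1 bits, Hamming distance is rescaled to a similarity between 0 and 1, two empty filters are treated as identical, and `partial_weight` uses a hypothetical straight-line curve with an arbitrary cut-off of 0.8, whereas the study estimated its partial weight curves from data. The `fs_weights` function uses the standard Fellegi-Sunter agreement and disagreement weights; the m and u values in the example are invented.

```python
import math


def dice(a, b):
    """Sørensen-Dice coefficient of two bit lists."""
    common = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * common / total if total else 1.0  # empty vs empty: assumed 1.0


def jaccard(a, b):
    """Jaccard similarity of two bit lists."""
    common = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return common / union if union else 1.0  # empty vs empty: assumed 1.0


def hamming_similarity(a, b):
    """Hamming distance rescaled to a similarity (an assumed normalisation)."""
    differing = sum(x ^ y for x, y in zip(a, b))
    return 1 - differing / len(a)


def fs_weights(m, u):
    """Fellegi-Sunter full agreement and full disagreement weights (log base 2)."""
    return math.log2(m / u), math.log2((1 - m) / (1 - u))


def partial_weight(score, agree_weight, disagree_weight, cutoff=0.8):
    """Hypothetical linear partial weight curve.

    Scores below `cutoff` receive the full disagreement weight; scores from
    `cutoff` to 1.0 are interpolated up to the full agreement weight.
    """
    if score < cutoff:
        return disagree_weight
    fraction = (score - cutoff) / (1 - cutoff)
    return disagree_weight + (agree_weight - disagree_weight) * fraction


# Two small hypothetical Bloom filters.
filter_a = [1, 1, 0, 0, 1, 1, 0, 1]
filter_b = [1, 1, 0, 0, 1, 0, 0, 1]

agree, disagree = fs_weights(m=0.95, u=0.01)  # invented m and u probabilities
for name, measure in [("dice", dice), ("jaccard", jaccard),
                      ("hamming", hamming_similarity)]:
    score = measure(filter_a, filter_b)
    print(name, round(score, 3), round(partial_weight(score, agree, disagree), 3))
```

A simple cut-off comparison corresponds to replacing the interpolation with a step: full agreement weight at or above the cut-off and full disagreement weight below it.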
