Probabilistic Blocking and Distributed Bayesian Entity Resolution

  • Ted Enamorado
  • Rebecca C. Steorts

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

    4 Scopus citations

    Abstract

    Entity resolution (ER) is becoming an increasingly important task across many domains (e.g., official statistics, human rights, and medicine), where databases contain duplicate records of entities that need to be removed before later inferential and prediction tasks. Motivated by scaling to large data sets and providing uncertainty propagation, we propose a generalized approach to the blocking and ER pipeline, which consists of two steps. First, a probabilistic blocking step, for which we consider the method of [5], which is an ER method in its own right. Using it for blocking greatly reduces the comparison space and provides overlapping blocks for any ER method in the literature. Second, the output of the probabilistic blocking step is passed to any ER method, where one can evaluate uncertainty propagation depending on the ER task. We consider the method of [12], a joint Bayesian method for blocking and ER that provides a joint posterior distribution over both the blocking and the ER. It scales to large datasets, but it does so at a slower rate than when used in tandem with [5]. Through simulation and empirical studies, we show that our proposed methodology outperforms [5] and [12] when each is used in isolation. It produces reliable estimates of the underlying linkage structure and the number of true entities in each dataset. Furthermore, it produces an approximate posterior distribution and preserves transitive closures of the linkages.
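
    The two-step shape of the pipeline described in the abstract (overlapping blocks first, then resolution within blocks with transitive closure preserved) can be illustrated with a small toy sketch. This is not the authors' method: the methods of [5] and [12] are probabilistic and Bayesian, whereas the sketch below uses deterministic blocking keys and a hand-written match rule. All names here (`overlapping_blocks`, `candidate_pairs`, `resolve`, the sample records, and the match rule) are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def overlapping_blocks(records, keys):
    """Group record ids under every blocking key; one record may fall in several blocks."""
    blocks = {}
    for rid, rec in records.items():
        for i, key in enumerate(keys):
            blocks.setdefault((i, key(rec)), set()).add(rid)
    return blocks

def candidate_pairs(blocks):
    """Only pairs that share at least one block are ever compared."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def resolve(records, pairs, match):
    """Link matching pairs with union-find, so the result is transitively closed."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        if match(records[a], records[b]):
            parent[find(a)] = find(b)

    clusters = {}
    for rid in records:
        clusters.setdefault(find(rid), set()).add(rid)
    return list(clusters.values())

# Toy data (made up for this sketch).
records = {
    1: {"first": "john", "last": "smith", "year": 1980},
    2: {"first": "jon", "last": "smith", "year": 1980},
    3: {"first": "john", "last": "smyth", "year": 1980},
    4: {"first": "mary", "last": "jones", "year": 1975},
    5: {"first": "maria", "last": "jones", "year": 1975},
    6: {"first": "alan", "last": "turing", "year": 1912},
}

# Two blocking keys, so the blocks overlap.
keys = [lambda r: r["last"], lambda r: (r["first"][0], r["year"])]

# Hypothetical deterministic match rule standing in for a probabilistic model.
match = lambda a, b: a["first"] == b["first"] or a["last"] == b["last"]

blocks = overlapping_blocks(records, keys)
pairs = candidate_pairs(blocks)
clusters = resolve(records, pairs, match)

print(len(pairs), "candidate pairs instead of", len(records) * (len(records) - 1) // 2)
print(sorted(sorted(c) for c in clusters))
```

    In this toy run, records 2 and 3 do not match each other directly, but both match record 1, so the union-find step places all three in one cluster. The number of clusters plays the role of the estimated number of entities; unlike the approach in the abstract, the sketch attaches no uncertainty to that estimate.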

    Original language: English
    Title of host publication: Privacy in Statistical Databases - UNESCO Chair in Data Privacy, International Conference, PSD 2020, Proceedings
    Editors: Josep Domingo-Ferrer, Krishnamurty Muralidhar
    Publisher: Springer Science and Business Media Deutschland GmbH
    Pages: 224-239
    Number of pages: 16
    ISBN (Print): 9783030575205
    State: Published - 2020
    Event: International Conference on Privacy in Statistical Databases, PSD 2020 - Tarragona, Spain
    Duration: Sep 23 2020 to Sep 25 2020

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 12276 LNCS
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Conference

    Conference: International Conference on Privacy in Statistical Databases, PSD 2020
    Country/Territory: Spain
    City: Tarragona
    Period: 09/23/20 to 09/25/20
