TY - GEN
T1 - Probabilistic Blocking and Distributed Bayesian Entity Resolution
AU - Enamorado, Ted
AU - Steorts, Rebecca C.
N1 - Publisher Copyright: © 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
N2 - Entity resolution (ER) is becoming an increasingly important task across many domains (e.g., official statistics, human rights, medicine), where databases contain duplicate entities that need to be removed before later inferential and prediction tasks. Motivated by scaling to large data sets and providing uncertainty propagation, we propose a generalized approach to the blocking and ER pipeline, which consists of two steps. The first is a probabilistic blocking step, for which we consider the method of [5], which is an ER method in its own right. Using it for blocking greatly reduces the comparison space and provides overlapping blocks for any ER method in the literature. Second, the output of the probabilistic blocking step is passed to any ER method, where one can evaluate uncertainty propagation depending on the ER task. We consider the method of [12], a joint Bayesian method for blocking and ER that provides a joint posterior distribution over both the blocking and the ER. It scales to large datasets, but does so at a slower rate than when used in tandem with [5]. Through simulation and empirical studies, we show that our proposed methodology outperforms [5, 12] when each is used in isolation. It produces reliable estimates of the underlying linkage structure and the number of true entities in each dataset. Furthermore, it produces an approximate posterior distribution and preserves transitive closures of the linkages.
AB - Entity resolution (ER) is becoming an increasingly important task across many domains (e.g., official statistics, human rights, medicine), where databases contain duplicate entities that need to be removed before later inferential and prediction tasks. Motivated by scaling to large data sets and providing uncertainty propagation, we propose a generalized approach to the blocking and ER pipeline, which consists of two steps. The first is a probabilistic blocking step, for which we consider the method of [5], which is an ER method in its own right. Using it for blocking greatly reduces the comparison space and provides overlapping blocks for any ER method in the literature. Second, the output of the probabilistic blocking step is passed to any ER method, where one can evaluate uncertainty propagation depending on the ER task. We consider the method of [12], a joint Bayesian method for blocking and ER that provides a joint posterior distribution over both the blocking and the ER. It scales to large datasets, but does so at a slower rate than when used in tandem with [5]. Through simulation and empirical studies, we show that our proposed methodology outperforms [5, 12] when each is used in isolation. It produces reliable estimates of the underlying linkage structure and the number of true entities in each dataset. Furthermore, it produces an approximate posterior distribution and preserves transitive closures of the linkages.
UR - https://www.scopus.com/pages/publications/85092113060
U2 - 10.1007/978-3-030-57521-2_16
DO - 10.1007/978-3-030-57521-2_16
M3 - Conference contribution
AN - SCOPUS:85092113060
SN - 9783030575205
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 224
EP - 239
BT - Privacy in Statistical Databases - UNESCO Chair in Data Privacy, International Conference, PSD 2020, Proceedings
A2 - Domingo-Ferrer, Josep
A2 - Muralidhar, Krishnamurty
PB - Springer Science and Business Media Deutschland GmbH
T2 - International Conference on Privacy in Statistical Databases, PSD 2020
Y2 - 23 September 2020 through 25 September 2020
ER -