TY - GEN
T1 - What is the Value of Data? on Mathematical Methods for Data Quality Estimation
AU - Raviv, Netanel
AU - Jain, Siddharth
AU - Bruck, Jehoshua
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/6
Y1 - 2020/6
N2 - Data is one of the most important assets of the information age, and its societal impact is undisputed. Yet, rigorous methods of assessing the quality of data are lacking. In this paper, we propose a formal definition for the quality of a given dataset. We assess a dataset's quality by a quantity we call the expected diameter, which measures the expected disagreement between two randomly chosen hypotheses that explain it, and has recently found applications in active learning. We focus on Boolean hyperplanes, and utilize a collection of Fourier analytic, algebraic, and probabilistic methods to come up with theoretical guarantees and practical solutions for the computation of the expected diameter. We also study the behaviour of the expected diameter on algebraically structured datasets, conduct experiments that validate this notion of quality, and demonstrate the feasibility of our techniques.
AB - Data is one of the most important assets of the information age, and its societal impact is undisputed. Yet, rigorous methods of assessing the quality of data are lacking. In this paper, we propose a formal definition for the quality of a given dataset. We assess a dataset's quality by a quantity we call the expected diameter, which measures the expected disagreement between two randomly chosen hypotheses that explain it, and has recently found applications in active learning. We focus on Boolean hyperplanes, and utilize a collection of Fourier analytic, algebraic, and probabilistic methods to come up with theoretical guarantees and practical solutions for the computation of the expected diameter. We also study the behaviour of the expected diameter on algebraically structured datasets, conduct experiments that validate this notion of quality, and demonstrate the feasibility of our techniques.
UR - https://www.scopus.com/pages/publications/85090411941
U2 - 10.1109/ISIT44484.2020.9174311
DO - 10.1109/ISIT44484.2020.9174311
M3 - Conference contribution
AN - SCOPUS:85090411941
T3 - IEEE International Symposium on Information Theory - Proceedings
SP - 2825
EP - 2830
BT - 2020 IEEE International Symposium on Information Theory, ISIT 2020 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 IEEE International Symposium on Information Theory, ISIT 2020
Y2 - 21 July 2020 through 26 July 2020
ER -