TY - JOUR
T1 - PAC Guarantees and Effective Algorithms for Detecting Novel Categories
AU - Liu, Si
AU - Garrepalli, Risheek
AU - Hendrycks, Dan
AU - Fern, Alan
AU - Mondal, Debashis
AU - Dietterich, Thomas G.
N1 - Publisher Copyright:
© 2022 Si Liu, Risheek Garrepalli, Dan Hendrycks, Alan Fern, Debashis Mondal, Thomas G. Dietterich.
PY - 2022
Y1 - 2022
N2 - Open category detection is the problem of detecting “alien” test instances that belong to categories or classes that were not present in the training data (Liu et al., 2018). In many applications, reliably detecting such aliens is central to ensuring the safety and accuracy of test set predictions. Unfortunately, there are no algorithms that provide theoretical guarantees on their ability to detect aliens under general assumptions. Further, while there are algorithms for open category detection, there are few empirical results that directly report alien detection rates. Thus, there are significant theoretical and empirical gaps in our understanding of open category detection. In this paper, we take a step toward addressing this gap by studying a simple, but practically-relevant variant of open category detection. In our setting, we are provided with a “clean” training set that contains only the target categories of interest and an unlabeled “contaminated” training set that contains a fraction α of alien examples. Under the assumption that we know an upper bound on α, we develop an algorithm that gives PAC-style guarantees on the alien detection rate, while aiming to minimize false alarms. Given an overall budget on the amount of training data, we also derive the optimal allocation of samples between the mixture and the clean data sets. Experiments on synthetic and standard benchmark datasets evaluate the regimes in which the algorithm can be effective and provide a baseline for further advancements. In addition, for the situation when an upper bound for α is not available, we employ nine different anomaly proportion estimators, and run experiments on both synthetic and standard benchmark data sets to compare their performance.
AB - Open category detection is the problem of detecting “alien” test instances that belong to categories or classes that were not present in the training data (Liu et al., 2018). In many applications, reliably detecting such aliens is central to ensuring the safety and accuracy of test set predictions. Unfortunately, there are no algorithms that provide theoretical guarantees on their ability to detect aliens under general assumptions. Further, while there are algorithms for open category detection, there are few empirical results that directly report alien detection rates. Thus, there are significant theoretical and empirical gaps in our understanding of open category detection. In this paper, we take a step toward addressing this gap by studying a simple, but practically-relevant variant of open category detection. In our setting, we are provided with a “clean” training set that contains only the target categories of interest and an unlabeled “contaminated” training set that contains a fraction α of alien examples. Under the assumption that we know an upper bound on α, we develop an algorithm that gives PAC-style guarantees on the alien detection rate, while aiming to minimize false alarms. Given an overall budget on the amount of training data, we also derive the optimal allocation of samples between the mixture and the clean data sets. Experiments on synthetic and standard benchmark datasets evaluate the regimes in which the algorithm can be effective and provide a baseline for further advancements. In addition, for the situation when an upper bound for α is not available, we employ nine different anomaly proportion estimators, and run experiments on both synthetic and standard benchmark data sets to compare their performance.
KW - Alien detection rate
KW - Anomaly detection
KW - False positive rate
KW - Open category detection
KW - PAC guarantees
UR - http://www.scopus.com/inward/record.url?scp=85124236261&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:85124236261
SN - 1532-4435
VL - 23
JO - Journal of Machine Learning Research
JF - Journal of Machine Learning Research
ER -