TY - JOUR
T1 - The effect of sample size and disease prevalence on supervised machine learning of narrative data.
AU - McKnight, Lawrence K.
AU - Wilcox, Adam
AU - Hripcsak, George
PY - 2002
Y1 - 2002
N2 - This paper examines the independent effects of outcome prevalence and training sample sizes on inductive learning performance. We trained 3 inductive learning algorithms (MC4, IB, and Naïve-Bayes) on 60 simulated datasets of parsed radiology text reports labeled with 6 disease states. Datasets were constructed to define positive outcome states at 5 prevalence rates (1, 5, 10, 25, and 50%) in training set sizes of 200 and 2,000 cases. We found that the effect of outcome prevalence is significant when outcome classes drop below 10% of cases. The effect appeared independent of sample size, induction algorithm used, or class label. Work is needed to identify methods of improving classifier performance when output classes are rare.
AB - This paper examines the independent effects of outcome prevalence and training sample sizes on inductive learning performance. We trained 3 inductive learning algorithms (MC4, IB, and Naïve-Bayes) on 60 simulated datasets of parsed radiology text reports labeled with 6 disease states. Datasets were constructed to define positive outcome states at 5 prevalence rates (1, 5, 10, 25, and 50%) in training set sizes of 200 and 2,000 cases. We found that the effect of outcome prevalence is significant when outcome classes drop below 10% of cases. The effect appeared independent of sample size, induction algorithm used, or class label. Work is needed to identify methods of improving classifier performance when output classes are rare.
UR - http://www.scopus.com/inward/record.url?scp=0036364399&partnerID=8YFLogxK
M3 - Article
C2 - 12463878
AN - SCOPUS:0036364399
SN - 1531-605X
SP - 519
EP - 522
JO - Proceedings / AMIA ... Annual Symposium. AMIA Symposium
JF - Proceedings / AMIA ... Annual Symposium. AMIA Symposium
ER -