Abstract

This paper examines the independent effects of outcome prevalence and training sample size on inductive learning performance. We trained 3 inductive learning algorithms (MC4, IB, and Naïve-Bayes) on 60 simulated datasets of parsed radiology text reports labeled with 6 disease states. Datasets were constructed to define positive outcome states at 5 prevalence rates (1, 5, 10, 25, and 50%) in training set sizes of 200 and 2,000 cases. We found that the effect of outcome prevalence is significant when outcome classes drop below 10% of cases. The effect appeared independent of sample size, induction algorithm used, or class label. Further work is needed to identify methods of improving classifier performance when output classes are rare.
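
The arithmetic behind the design can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the 60 datasets arise from fully crossing the 6 disease states with the listed prevalence rates and the 2 training set sizes, which the abstract implies but does not state outright. The label names and the `design_grid` function are hypothetical placeholders, since the abstract does not name the disease states.

```python
from itertools import product

# Values taken from the abstract.
PREVALENCE_PCT = [1, 5, 10, 25, 50]  # percent of cases with a positive outcome
TRAIN_SIZES = [200, 2000]            # cases per training set

# Hypothetical placeholder labels: the abstract does not name the 6 disease states.
DISEASE_LABELS = [f"disease_{i}" for i in range(1, 7)]


def design_grid():
    """Enumerate one configuration per (label, prevalence, size) combination.

    Assumes a full factorial crossing of the three factors.
    """
    grid = []
    for label, pct, n_train in product(DISEASE_LABELS, PREVALENCE_PCT, TRAIN_SIZES):
        grid.append({
            "label": label,
            "prevalence_pct": pct,
            "n_train": n_train,
            # Number of positive cases available to the learner.
            "n_positive": n_train * pct // 100,
        })
    return grid


grid = design_grid()
print(len(grid))  # number of dataset configurations

# Positive-case counts for a single label, to show how few examples
# the rare-outcome settings provide.
for row in grid:
    if row["label"] == "disease_1":
        print(row["n_train"], row["prevalence_pct"], row["n_positive"])
```

Under these assumptions, the 1% setting at 200 cases leaves only 2 positive training examples, which gives a sense of why performance might degrade once prevalence falls below 10%.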

Original language: English
Pages (from-to): 519-522
Number of pages: 4
Journal: Proceedings / AMIA ... Annual Symposium. AMIA Symposium
State: Published - 2002

Title: The effect of sample size and disease prevalence on supervised machine learning of narrative data.