Abstract
This paper examines the independent effects of outcome prevalence and training sample size on inductive learning performance. We trained 3 inductive learning algorithms (MC4, IB, and Naïve-Bayes) on 60 simulated datasets of parsed radiology text reports labeled with 6 disease states. Datasets were constructed to define positive outcome states at 5 prevalence rates (1, 5, 10, 25, and 50%) in training set sizes of 200 and 2,000 cases. We found that the effect of outcome prevalence is significant when the positive outcome class drops below 10% of cases. The effect appeared independent of sample size, induction algorithm used, and class label. Further work is needed to identify methods of improving classifier performance when outcome classes are rare.
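The design described above (6 disease states, 5 prevalence rates, 2 training set sizes) can be sketched as a grid of 60 conditions. The abstract does not say how the simulated datasets were drawn, so the sampling routine below is only an illustrative assumption: it draws a fixed number of positive and negative cases without replacement from two pools. The names `sample_at_prevalence`, `PREVALENCES`, `SIZES`, `N_LABELS`, and `design` are hypothetical and do not come from the paper.

```python
import random
from itertools import product

# Values taken from the abstract.
PREVALENCES = [0.01, 0.05, 0.10, 0.25, 0.50]
SIZES = [200, 2000]
N_LABELS = 6  # six disease states


def sample_at_prevalence(positives, negatives, size, prevalence, rng):
    """Draw `size` cases containing round(size * prevalence) positives.

    Assumption: simple sampling without replacement from separate pools
    of positive and negative cases; the paper's procedure may differ.
    """
    n_pos = round(size * prevalence)
    n_neg = size - n_pos
    if n_pos > len(positives) or n_neg > len(negatives):
        raise ValueError("case pool too small for the requested design")
    sample = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    rng.shuffle(sample)  # avoid an ordered positives-then-negatives set
    return sample


# One condition per (label, prevalence, size) combination.
design = list(product(range(N_LABELS), PREVALENCES, SIZES))
```

Under this sketch, a 200-case training set at 1% prevalence contains only 2 positive cases, which gives a sense of how little positive evidence a learner sees at the lowest prevalence rates.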
| Original language | English |
|---|---|
| Pages (from-to) | 519-522 |
| Number of pages | 4 |
| Journal | Proceedings / AMIA ... Annual Symposium. AMIA Symposium |
| State | Published - 2002 |
| Title | The effect of sample size and disease prevalence on supervised machine learning of narrative data |