TY - JOUR
T1 - Using discordance to improve classification in narrative clinical databases
T2 - An application to community-acquired pneumonia
AU - Hripcsak, George
AU - Knirsch, Charles
AU - Zhou, Li
AU - Wilcox, Adam
AU - Melton, Genevieve B.
N1 - Funding Information:
This work was supported by the National Library of Medicine (R01 LM06910) “Discovering and applying knowledge in clinical databases” and by Pfizer Inc.
PY - 2007/3
Y1 - 2007/3
N2 - Data mining in electronic medical records may facilitate clinical research, but much of the structured data may be miscoded, incomplete, or non-specific. The exploitation of narrative data using natural language processing may help, although nesting, varying granularity, and repetition remain challenges. In a study of community-acquired pneumonia using electronic records, these issues led to poor classification. Limiting queries to accurate, complete records led to vastly reduced, possibly biased samples. We exploited knowledge latent in the electronic records to improve classification. A similarity metric was used to cluster cases. We defined discordance as the degree to which cases within a cluster give different answers for some query that addresses a classification task of interest. Cases with higher discordance are more likely to be incorrectly classified, and can be reviewed manually to adjust the classification, improve the query, or estimate the likely accuracy of the query. In a study of pneumonia, in which the ICD9-CM coding was found to be very poor, the discordance measure was statistically significantly correlated with classification correctness (.45; 95% CI .15-.62).
AB - Data mining in electronic medical records may facilitate clinical research, but much of the structured data may be miscoded, incomplete, or non-specific. The exploitation of narrative data using natural language processing may help, although nesting, varying granularity, and repetition remain challenges. In a study of community-acquired pneumonia using electronic records, these issues led to poor classification. Limiting queries to accurate, complete records led to vastly reduced, possibly biased samples. We exploited knowledge latent in the electronic records to improve classification. A similarity metric was used to cluster cases. We defined discordance as the degree to which cases within a cluster give different answers for some query that addresses a classification task of interest. Cases with higher discordance are more likely to be incorrectly classified, and can be reviewed manually to adjust the classification, improve the query, or estimate the likely accuracy of the query. In a study of pneumonia, in which the ICD9-CM coding was found to be very poor, the discordance measure was statistically significantly correlated with classification correctness (.45; 95% CI .15-.62).
KW - Classification
KW - Community-acquired pneumonia
KW - Data mining
KW - Discordance
KW - Electronic medical records
KW - Natural language processing
KW - Similarity metrics
UR - http://www.scopus.com/inward/record.url?scp=33846018355&partnerID=8YFLogxK
U2 - 10.1016/j.compbiomed.2006.02.001
DO - 10.1016/j.compbiomed.2006.02.001
M3 - Article
C2 - 16620802
AN - SCOPUS:33846018355
SN - 0010-4825
VL - 37
SP - 296
EP - 304
JO - Computers in Biology and Medicine
JF - Computers in Biology and Medicine
IS - 3
ER -