The second track of the 2014 i2b2 challenge asked participants to automatically identify risk factors for heart disease among diabetic patients by applying natural language processing techniques to clinical notes. This paper describes a rule-based system built from a combination of regular expressions, concepts from the Unified Medical Language System (UMLS), and freely available community resources. With a performance (F1 = 90.7) that is significantly higher than the median (F1 = 87.20) and close to that of the top-performing system (F1 = 92.8), it was the best rule-based system among the challenge submissions. We also used this system to evaluate the utility of different UMLS terminologies for the challenge task. Of the 155 terminologies in the UMLS, 129 (76.78%) have no representation in the corpus. The Consumer Health Vocabulary had very good coverage of relevant concepts and was the most useful terminology for the challenge task. While segmenting notes into sections and lists has a significant impact on performance, identifying negations and the experiencer of the medical event yields negligible gain.
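As a rough illustration of the kind of pipeline the abstract describes (regular expressions, section segmentation, and a negation check), a minimal sketch might look like the following. Everything in it is an assumption made for illustration: the patterns, the section-header heuristic, the negation cues, and the function name `find_risk_factors` are hypothetical and are not the rules or resources used in the paper, which also draws on UMLS concepts not modeled here.

```python
import re

# Hypothetical risk-factor patterns for illustration only;
# these are NOT the rules used in the paper.
RISK_PATTERNS = {
    "hypertension": re.compile(r"\b(hypertension|htn|high blood pressure)\b", re.I),
    "diabetes": re.compile(r"\b(diabetes( mellitus)?|dm2?|t2dm)\b", re.I),
    "smoker": re.compile(r"\b(smoker|smokes|tobacco use)\b", re.I),
}

# Assumption: a section header is a line of letters and spaces ending in a colon.
SECTION_HEADER = re.compile(r"^([A-Za-z ]+):\s*$")

# Assumption: a negation cue earlier on the same line, with no sentence
# break in between, negates the mention that follows it.
NEGATION = re.compile(r"\b(no|denies|without|negative for)\b[^.;]*$", re.I)


def find_risk_factors(note, skip_sections=("family history",)):
    """Return (risk_factor, section) pairs for non-negated mentions."""
    found = []
    section = "unknown"
    for line in note.splitlines():
        header = SECTION_HEADER.match(line.strip())
        if header:
            # Track the current section so mentions can be filtered by it.
            section = header.group(1).strip().lower()
            continue
        if section in skip_sections:
            # Crude experiencer handling: skip sections about other people.
            continue
        for label, pattern in RISK_PATTERNS.items():
            for match in pattern.finditer(line):
                # Look for a negation cue in the text before the mention.
                if NEGATION.search(line[:match.start()]):
                    continue
                found.append((label, section))
    return found
```

Calling `find_risk_factors` on a note string returns a list of `(risk_factor, section)` tuples. Keeping the section alongside each hit is one simple way to let section-specific rules act on the result, in line with the abstract's observation that segmentation matters more than negation or experiencer detection.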

Original language: English
Pages (from-to): S103-S110
Journal: Journal of Biomedical Informatics
State: Published - Dec 1 2015


  • Electronic health records
  • Natural language processing
  • Rule-based system
  • Unified Medical Language System


Title: Comparison of UMLS terminologies to identify risk of heart disease using clinical notes