Objectives: We aim to extract a subset of social factors from clinical notes using common text classification methods.

Design: Retrospective chart review.

Setting: We collaborated with a local level I trauma hospital located in an underserved area, where about 6.5% of the patient population is housing unstable, and extracted text notes related to various social determinants for acute care patients.

Participants: Notes were retrospectively extracted from 43 798 acute care patients.

Methods: We used only open-source Python packages to test simple text classification methods that could be readily generalised and implemented. We extracted social history text from various sources, such as admission and emergency department notes, over a 5-year timeframe and performed manual chart reviews to ensure data quality. We manually labelled the sentiment of the notes, treating each text entry independently. Four different models, each with two different feature selection methods (bag of words and bigrams), were used to classify and predict housing stability, tobacco use and alcohol use status for the extracted clinical text.

Results: Applying open-source classification techniques yielded positive results overall; the accuracy scores were 91.2%, 84.7% and 82.8% for housing stability, tobacco use and alcohol use, respectively. Our analysis had many limitations, including social factors not being recorded because of patient condition, multiple copy-forward entries and shorthand. Additionally, it was difficult to translate degrees of usage for tobacco and alcohol. However, when compared with structured data sources, our classification approach on unstructured notes yielded more results for housing and alcohol use; tobacco use proved less fruitful for unstructured notes.
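The feature extraction and classification step described in the Methods can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration: the abstract does not name the four models or the packages used, so a small multinomial naive Bayes classifier stands in for them, and the example notes and the labels `"stable"` and `"unstable"` are invented toy data, not study data. In practice an open-source library such as scikit-learn would likely replace the hand-written pieces.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    tokens = (tok.strip(".,;:()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def bag_of_words(text):
    """Unigram counts: each token is one feature."""
    return Counter(tokenize(text))


def bigrams(text):
    """Counts of adjacent token pairs."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative stand-in)."""

    def __init__(self, featurize):
        self.featurize = featurize

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(self.featurize(text))
        self.vocab = {f for c in self.feature_counts.values() for f in c}
        return self

    def predict(self, text):
        feats = self.featurize(text)
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for feat, n in feats.items():
                if feat in self.vocab:  # ignore features never seen in training
                    score += n * math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(predicted, actual):
    """Fraction of predictions that match the manual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Invented toy notes and labels, for illustration only.
notes = [
    "patient is homeless and lives in a shelter",
    "currently homeless, staying with friends",
    "lives at home with spouse",
    "lives in own apartment with family",
]
labels = ["unstable", "unstable", "stable", "stable"]

bow_model = NaiveBayes(bag_of_words).fit(notes, labels)
bigram_model = NaiveBayes(bigrams).fit(notes, labels)
```

Calling `bow_model.predict("homeless, sleeping in a shelter")` or `bigram_model.predict("lives at home with family")` classifies a new note with the respective feature set, and `accuracy` compares a list of such predictions against manually assigned labels, which is the metric the Results report.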

Original language: English
Article number: e048397
Journal: BMJ Open
Issue number: 1
State: Published - Jan 18 2022


  • biotechnology & bioinformatics
  • health informatics
  • history (see medical history)
  • social medicine

