Examining the Gateway Hypothesis and Mapping Substance Use Pathways on Social Media: A Machine Learning Approach

Yunhao Yuan, Erin Kasson, Jordan Taylor, Patricia Cavazos-Rehg, Munmun De Choudhury, Talayeh Aledavood

Research output: Contribution to journal › Review article › peer-review


Background: Substance misuse presents significant global public health challenges. Understanding transitions between substance types and the timing of shifts to polysubstance use is vital for targeted prevention, harm reduction, and recovery strategies. The longstanding gateway hypothesis suggests that high-risk substance use is preceded by lower-risk substance use. However, the source of this correlation is hotly contested: while some claim that low-risk substance use causes subsequent, riskier substance use, most users of low-risk substances do not escalate to higher-risk substances. Social media data has the potential to shed light on the factors contributing to substance use transitions.

Objective: By leveraging social media data, our study aims to gain a better understanding of substance use pathways. By identifying and analyzing individuals' transitions between different risk levels of substance use, we aim to find specific linguistic cues in their social media posts that could indicate escalating or de-escalating patterns of substance use.

Methods: We conducted a large-scale analysis of Reddit data collected between 2015 and 2019, comprising over 2.29 million posts and approximately 29.37 million comments by around 1.4 million users of substance use subreddits. From these data, we created a risk transition dataset reflecting the substance use behaviors of these users. We deployed deep learning and machine learning techniques, including fine-tuned BERT and RoBERTa models, to predict escalation or de-escalation in risk levels based on the initial transition phases documented in posts and comments. Additionally, we conducted an extensive linguistic analysis of the language patterns associated with transitions in substance use, emphasizing the role of n-gram features in predicting future risk trajectories.

Results: Our models showed promise in predicting escalation or de-escalation in risk levels from Reddit users' historical data on initial transition phases across drug-related subreddits, achieving an accuracy of 78.48% and an F1-score of 79.20%. We highlighted key predictive features indicative of future risk escalation, such as specific substance names and tools. Our linguistic analysis showed that terms linked with harm reduction strategies were instrumental in signaling de-escalation, whereas descriptors of frequent substance use were characteristic of escalating transitions.

Conclusions: This study sheds light on the complexities surrounding the gateway hypothesis of substance use through an examination of online behavior on Reddit. While certain findings support the hypothesis, indicating a progression from lower-risk substances such as marijuana to higher-risk ones, a significant number of individuals did not show this transition. The research underscores the potential of combining machine learning with social media analysis to predict substance use transitions. Our results emphasize the role of linguistic features as predictors and the importance of timely interventions.

Original language: English
Journal: JMIR Formative Research
State: Published - Apr 6 2023

