How bandwidth selection algorithms impact exploratory data analysis using kernel density estimation

Jared K. Harpole, Carol M. Woods, Thomas L. Rodebaugh, Cheri A. Levinson, Eric J. Lenze

Research output: Contribution to journal › Article › peer-review

20 Scopus citations

Abstract

Exploratory data analysis (EDA) can reveal important features of underlying distributions, and these features often have an impact on inferences and conclusions drawn from data. Graphical analysis is central to EDA, and graphical representations of distributions often benefit from smoothing. A viable method of estimating and graphing the underlying density in EDA is kernel density estimation (KDE). This article provides an introduction to KDE and examines alternative methods for specifying the smoothing bandwidth in terms of their ability to recover the true density. We also illustrate the comparison and use of KDE methods with 2 empirical examples. Simulations were carried out in which we compared 8 bandwidth selection methods (Sheather-Jones plug-in [SJDP], normal rule of thumb, Silverman's rule of thumb, least squares cross-validation, biased cross-validation, and 3 adaptive kernel estimators) using 5 true density shapes (standard normal, positively skewed, bimodal, skewed bimodal, and standard lognormal) and 9 sample sizes (15, 25, 50, 75, 100, 250, 500, 1,000, 2,000). Results indicate that, overall, SJDP outperformed all methods. However, for smaller sample sizes (25 to 100) either biased cross-validation or Silverman's rule of thumb was recommended, and for larger sample sizes the adaptive kernel estimator with SJDP was recommended. Information is provided about implementing the recommendations in the R computing language.
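To make the role of the bandwidth concrete, here is a minimal sketch of a fixed-bandwidth Gaussian KDE using two of the simpler selectors named in the abstract. It is illustrative only: the article implements its recommendations in R, whereas this sketch uses plain Python. The function names are hypothetical. The formulas are the common textbook forms of the normal rule of thumb (1.06 · sd · n^(-1/5)) and Silverman's rule of thumb (0.9 · min(sd, IQR/1.34) · n^(-1/5)), which are assumed here to match the article's definitions.

```python
import math
import random
import statistics


def normal_reference_bandwidth(data):
    """Normal rule of thumb (textbook form, assumed): 1.06 * sd * n^(-1/5)."""
    n = len(data)
    return 1.06 * statistics.stdev(data) * n ** (-1 / 5)


def silverman_bandwidth(data):
    """Silverman's rule of thumb (textbook form, assumed):
    0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    n = len(data)
    sd = statistics.stdev(data)
    # Quartiles use Python's default 'exclusive' method, which can differ
    # slightly from the quantile definitions used by other software.
    q1, _, q3 = statistics.quantiles(data, n=4)
    spread = min(sd, (q3 - q1) / 1.34)
    return 0.9 * spread * n ** (-1 / 5)


def gaussian_kde(data, bandwidth):
    """Return a function that evaluates a fixed-bandwidth Gaussian KDE."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, each with sd = bandwidth.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


# Hypothetical example data: 100 draws from a standard normal distribution.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]

h_normal = normal_reference_bandwidth(sample)
h_silverman = silverman_bandwidth(sample)
kde = gaussian_kde(sample, h_silverman)

# Evaluate the estimate on a grid, e.g. for plotting.
grid = [-4 + 0.1 * i for i in range(81)]
estimate = [kde(x) for x in grid]
```

A smaller bandwidth gives a rougher estimate that follows individual observations, and a larger one smooths over features such as a second mode. That tradeoff is why the choice of selector matters for the density shapes and sample sizes compared in the simulations. The plug-in, cross-validation, and adaptive methods from the abstract are not sketched here.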

Original language: English
Pages (from-to): 428-434
Number of pages: 7
Journal: Psychological Methods
Volume: 19
Issue number: 3
State: Published - 2014

Keywords

  • Adaptive kernel density estimation
  • Bandwidth selection
  • Exploratory data analysis
  • Graphical analysis
  • Kernel density estimation
