Umpire 2.0: Simulating realistic, mixed-type, clinical data for machine learning [version 2; peer review: 1 approved, 1 approved with reservations]

Caitlin E. Coombes, Zachary B. Abrams, Samantha Nakayiza, Guy Brock, Kevin R. Coombes

Research output: Contribution to journal › Article › peer-review


The Umpire 2.0 R-package offers a streamlined, user-friendly workflow to simulate complex, heterogeneous, mixed-type data with known subgroup identities, dichotomous outcomes, and time-to-event data, while providing ample opportunities for fine-tuning and flexibility. Here, we describe how we have expanded the core Umpire 1.0 R-package, developed to simulate gene expression data, to generate clinically realistic, mixed-type data for use in evaluating unsupervised and supervised machine learning (ML) methods. As the availability of large-scale clinical data for ML has increased, clinical data have posed unique challenges, including widely variable size, individual biological heterogeneity, data collection and measurement noise, and mixed data types. Developing and validating ML methods for clinical data requires data sets with known ground truth, generated from simulation. Umpire 2.0 addresses challenges to simulating realistic clinical data by providing the user a series of modules to generate survival parameters and subgroups, apply meaningful additive noise, and discretize to single or mixed data types. Umpire 2.0 provides broad functionality across sample sizes, feature spaces, and data types, allowing the user to simulate correlated, heterogeneous, binary, continuous, categorical, or mixed-type data from the scale of a small clinical trial to data on thousands of patients drawn from electronic health records. The user may generate elaborate simulations by varying parameters in order to compare algorithms or interrogate operating characteristics of an algorithm in both supervised and unsupervised ML.
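The workflow described in the abstract (known subgroups, additive noise, then discretization to mixed types) can be illustrated with a minimal conceptual sketch. This is not the Umpire API and does not reproduce the package's statistical models; the function name `simulate_mixed_data`, its parameters, the Gaussian signal model, and the fixed cut points are all assumptions made for illustration, written in plain Python rather than R.

```python
import random


def simulate_mixed_data(n_patients=60, n_features=6, n_clusters=3,
                        noise_sd=0.3, seed=0):
    """Toy pipeline: subgroups -> continuous signal -> additive noise -> mixed types.

    Hypothetical illustration only; none of these names come from Umpire.
    """
    rng = random.Random(seed)

    # Ground-truth subgroup label for each simulated patient.
    labels = [rng.randrange(n_clusters) for _ in range(n_patients)]

    # Each subgroup gets its own mean for every feature (assumed Gaussian).
    centers = [[rng.gauss(0.0, 2.0) for _ in range(n_features)]
               for _ in range(n_clusters)]

    # Feature types cycle through continuous, binary, categorical.
    kinds = ["continuous", "binary", "categorical"]
    feature_types = [kinds[j % 3] for j in range(n_features)]

    rows = []
    for k in labels:
        row = []
        for j, kind in enumerate(feature_types):
            # Subgroup signal plus additive measurement noise.
            value = rng.gauss(centers[k][j], 1.0) + rng.gauss(0.0, noise_sd)
            if kind == "binary":
                # Dichotomize at an assumed cut point of zero.
                row.append(int(value > 0.0))
            elif kind == "categorical":
                # Bin into three levels at assumed cut points of -1 and 1.
                if value < -1.0:
                    row.append("low")
                elif value > 1.0:
                    row.append("high")
                else:
                    row.append("mid")
            else:
                row.append(value)
        rows.append(row)
    return rows, labels, feature_types


rows, labels, feature_types = simulate_mixed_data()
```

Because `labels` is returned alongside the data, a clustering or classification method run on `rows` can be scored against the known subgroup assignments, which is the role ground truth plays in the evaluation setting the abstract describes.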

Original language: English
Pages (from-to): 1-28
Number of pages: 28
State: Published - 2021


  • clinical data
  • clinical informatics
  • clustering
  • machine learning
  • mixed data
  • mixed-type data
  • supervised machine learning
  • unsupervised machine learning


