STUDY DESIGN: Retrospective analysis of administrative billing data.

OBJECTIVE: To evaluate the extent to which a metric of serious complications derived from administrative data can reliably profile hospital performance in spine fusion surgery.

SUMMARY OF BACKGROUND DATA: As payers increasingly implement pay-for-performance measures, quality metrics must reliably reflect true differences in performance among the hospitals profiled.

METHODS: We used State Inpatient Databases from nine states to characterize serious complications after elective cervical and thoracolumbar fusion. Hierarchical logistic regression was used to adjust for differences in case mix and for variability arising from low case volumes. The reliability of the resulting risk-stratified complication rate (RSCR) was assessed as the proportion of between-hospital variation not attributable to chance alone, calculated separately by fusion type and year. Finally, we estimated the proportion of hospitals with sufficient case volume to obtain reliable (reliability >0.7) complication estimates.

RESULTS: From 2010 to 2017, we identified 154,078 cervical and 213,133 thoracolumbar fusion surgeries. Among cervical fusion patients, 4.2% had a serious complication, and the median RSCR increased from 4.2% in 2010 to 5.5% in 2017. The reliability of the RSCR for cervical fusion was poor and varied substantially by year (range 0.04-0.28). Overall, 7.7% of thoracolumbar fusion patients experienced a serious complication, and the RSCR varied from 6.8% to 8.0% during the study period. Although still modest, RSCR reliability was higher for thoracolumbar fusion (range 0.16-0.43). Depending on the study year, 0% to 4.5% of hospitals had sufficient cervical fusion case volume to report reliable (>0.7) estimates, whereas 15% to 36% of hospitals reached this threshold for thoracolumbar fusion.
CONCLUSION: A metric of serious complications was unreliable for benchmarking cervical fusion outcomes and only modestly reliable for thoracolumbar fusion. When assessed using administrative datasets, these measures appear inappropriate for high-stakes applications, such as public reporting or pay-for-performance.

LEVEL OF EVIDENCE: 3.