Wide-field calcium imaging (WFCI) with genetically encoded calcium indicators allows spatiotemporal recording of neuronal activity in preclinical models. When applied to the study of sleep, WFCI data are manually scored into the sleep states of wakefulness, non-rapid eye movement (NREM) sleep, and rapid eye movement (REM) sleep by use of adjunct electroencephalogram (EEG) and electromyogram (EMG) recordings. However, this process is time-consuming, invasive, and suffers from low inter- and intra-rater reliability. To overcome these limitations, an automated sleep state classification method that operates on spatiotemporal WFCI recordings is desired. Previous work that classifies sleep states from WFCI data by use of multiplex visibility graphs and deep learning leverages only shared information derived from average time series across parcellated brain regions, and thus fails to fully exploit the spatiotemporal calcium dynamics recorded. In this work, a hybrid network architecture was proposed to jointly learn spatial and temporal information from WFCI sleep data: a convolutional neural network (CNN) extracts spatial features from image frames, and a bidirectional long short-term memory network (BiLSTM) with an attention mechanism identifies temporal dependencies among different time points. Nineteen transgenic mice expressing GCaMP6f in excitatory neurons were used for network training and testing. The CNN-BiLSTM achieved a weighted F1-score of 0.84 and a Cohen's κ of 0.64, indicating substantial agreement with EEG/EMG-based human scoring. Gradient-weighted class activation maps were computed to provide deeper insight into the brain regions most relevant to the inference of each sleep state. This work will enable further investigation of sleep neural activity using WFCI.
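The hybrid architecture described above can be sketched as follows. This is an illustrative PyTorch implementation, not the authors' code: the layer sizes, frame resolution, clip length, and the additive-attention pooling are assumptions chosen for demonstration, and the real network would differ in depth and hyperparameters.

```python
# Hypothetical sketch of a CNN-BiLSTM with attention for classifying
# sleep states (wake / NREM / REM) from clips of WFCI frames.
# All shapes and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, n_classes=3, hidden=64):
        super().__init__()
        # Per-frame spatial feature extractor (small CNN).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),     # -> (B*T, 32)
        )
        # Temporal model over the sequence of frame features.
        self.lstm = nn.LSTM(32, hidden, batch_first=True, bidirectional=True)
        # Additive attention weights over time steps.
        self.attn = nn.Linear(2 * hidden, 1)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                              # x: (B, T, 1, H, W)
        B, T = x.shape[:2]
        f = self.cnn(x.flatten(0, 1)).view(B, T, -1)   # (B, T, 32)
        h, _ = self.lstm(f)                            # (B, T, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)         # (B, T, 1)
        ctx = (w * h).sum(dim=1)                       # attention-pooled summary
        return self.fc(ctx)                            # class logits

model = CNNBiLSTM()
logits = model(torch.randn(2, 10, 1, 64, 64))          # 2 clips of 10 frames
print(logits.shape)                                    # torch.Size([2, 3])
```

The attention weights `w` indicate which time points the network relies on most, complementing the Grad-CAM analysis of spatially informative brain regions mentioned in the abstract.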