Respiration-induced organ motion is one of the major uncertainties in lung cancer radiotherapy, and it is crucial to be able to model lung motion accurately. Most work so far has focused on the motion of a single point (usually the tumor center of mass), and much less has been done to model the motion of the entire lung. Inspired by the work of Zhang et al (2007 Med. Phys. 34 4772-81), we believe that the spatiotemporal relationship of the entire lung motion can be accurately modeled using principal component analysis (PCA), and that a sparse subset of the entire lung, such as an implanted marker, can then be used to drive the motion of the entire lung (including the tumor). The goal of this work is twofold. First, we aim to understand why PCA is effective for modeling lung motion and to find the optimal number of PCA coefficients for accurate lung motion modeling. We attempt to address these questions both in a theoretical framework and in the context of real clinical data. Second, we propose a new method to derive the entire lung motion from a single internal marker based on the PCA model. The main results of this work are as follows. We derived an important property that reveals the implicit regularization imposed by the PCA model. We then studied the model using two mathematical respiratory phantoms and 11 clinical 4DCT scans for eight lung cancer patients. For the mathematical phantoms with cosine motion and with an even power (2n) of cosine motion, we proved that 2 and 2n PCA coefficients and eigenvectors, respectively, completely represent the lung motion. Moreover, for the cosine phantom, we derived the equivalence conditions for the PCA motion model and the physiological 5D lung motion model (Low et al 2005 Int. J. Radiat. Oncol. Biol. Phys. 63 921-9). For the clinical 4DCT data, we demonstrated the modeling power and generalization performance of the PCA model.
The average 3D modeling error using PCA was within 1 mm (0.7 ± 0.1 mm). When a single artificial internal marker was used to derive the lung motion, the average 3D error was found to be within 2 mm (1.8 ± 0.3 mm) through comprehensive statistical analysis. The optimal number of PCA coefficients needs to be determined on a patient-by-patient basis, although two PCA coefficients seem to be sufficient for accurate modeling of the lung motion for most patients. In conclusion, we have presented a thorough theoretical analysis and clinical validation of the PCA lung motion model. The feasibility of deriving the entire lung motion using a single marker has also been demonstrated on clinical data using a simulation approach.
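
The phantom results and the marker-driven reconstruction can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the implementation used in this work: the phantom amplitudes and phase lags, the number of sampled breathing phases, and the helper names (`make_phantom`, `fit_pca`, `numerical_rank`, `drive_from_marker`) are all hypothetical, and the least-squares fit of the PCA coefficients to the marker displacement is one plausible choice rather than the method described here. It checks that cosine motion with spatially varying phase needs 2 PCA components, that cos^4 motion (n = 2) needs 2n = 4, and that a single three-component marker then determines the whole cosine phantom.

```python
import numpy as np

def make_phantom(n_dof, n_phases, power, rng):
    """Return an (n_dof, n_phases) displacement matrix over one breathing period."""
    t = np.arange(n_phases) / n_phases                     # normalized time in [0, 1)
    amp = rng.uniform(5.0, 15.0, size=(n_dof, 1))          # hypothetical amplitudes (mm)
    phase = rng.uniform(0.0, np.pi / 2, size=(n_dof, 1))   # hypothetical phase lags
    if power == 1:
        return amp * np.cos(2 * np.pi * t - phase)         # plain cosine motion
    return amp * np.cos(np.pi * t - phase) ** power        # even power of cosine

def fit_pca(displacements):
    """Mean-center over time and return (mean, eigenvectors, singular values)."""
    mean = displacements.mean(axis=1, keepdims=True)
    u, s, _ = np.linalg.svd(displacements - mean, full_matrices=False)
    return mean, u, s

def numerical_rank(s, tol=1e-8):
    """Count singular values that are non-negligible relative to the largest."""
    return int(np.sum(s > tol * s[0]))

def drive_from_marker(mean, eigvecs, marker_rows, marker_disp):
    """Estimate the full motion from marker rows only (assumed least-squares fit)."""
    a = eigvecs[marker_rows, :]                  # marker part of each eigenvector
    b = marker_disp - mean[marker_rows]          # centered marker displacement
    w, *_ = np.linalg.lstsq(a, b, rcond=None)    # PCA coefficients per phase
    return mean + eigvecs @ w

rng = np.random.default_rng(0)
cos_motion = make_phantom(300, 20, power=1, rng=rng)    # 100 voxels x 3 components
cos4_motion = make_phantom(300, 20, power=4, rng=rng)   # n = 2, so 2n = 4

mean, u, s = fit_pca(cos_motion)
rank_cos = numerical_rank(s)
rank_cos4 = numerical_rank(fit_pca(cos4_motion)[2])

# Treat the first voxel (rows 0..2) as the marker and keep 2 PCA components.
marker_rows = np.arange(3)
estimate = drive_from_marker(mean, u[:, :2], marker_rows, cos_motion[marker_rows, :])
max_error = float(np.abs(estimate - cos_motion).max())

print(rank_cos, rank_cos4, max_error)
```

On this noise-free phantom the reconstruction is exact up to floating-point error because the motion lies entirely in the span of the retained eigenvectors. Clinical data carry no such guarantee, which is consistent with the nonzero errors reported above.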