Data in language testing often have a hierarchical structure, in which nonindependent observations and responses at the lower levels are nested within higher-level units. For example, when data are collected in schools, the performance of students from the same class may be more uniform than that of students from different classes, in part because students in a given class are taught by the same teacher. When such a hierarchical structure is not properly modeled, standard errors tend to be underestimated, which increases the probability of a Type I error. This problem often arises with traditional methods such as analysis of variance (ANOVA) and multiple regression analysis, which assume that observations are independent. To address this issue, this chapter introduces the application of multilevel modeling (MLM) to examine sources of variability in second language (L2) test scores in cross-sectional designs. We explain how MLM works, apply MLM to L2 vocabulary test data, discuss considerations for using MLM, and conclude with some potential applications of MLM to L2 testing research.