Tuesday, December 3, 2019

Results provide a cautionary tale for the naïve application of VAMs to teacher evaluation and other settings; they point to the possibility of the misidentification of sizable teacher “effects” where none exist

Teacher Effects on Student Achievement and Height: A Cautionary Tale. Marianne Bitler, Sean Corcoran, Thurston Domina, Emily Penner. NBER Working Paper No. 26480, November 2019. https://www.nber.org/papers/w26480

Abstract: Estimates of teacher “value-added” suggest teachers vary substantially in their ability to promote student learning. Prompted by this finding, many states and school districts have adopted value-added measures as indicators of teacher job performance. In this paper, we conduct a new test of the validity of value-added models. Using administrative student data from New York City, we apply commonly estimated value-added models to an outcome teachers cannot plausibly affect: student height. We find the standard deviation of teacher effects on height is nearly as large as that for math and reading achievement, raising obvious questions about validity. Subsequent analysis finds these “effects” are largely spurious variation (noise), rather than bias resulting from sorting on unobserved factors related to achievement. Given the difficulty of differentiating signal from noise in real-world teacher effect estimates, this paper serves as a cautionary tale for their use in practice.

6 Discussion
Schools and districts across the country want to employ teachers who can best help students to learn, grow, and achieve academic success. Identifying such individuals is integral to schools' success but is also difficult to do in practice. In the face of data and measurement limitations, school leaders and state education departments seek low-cost, unbiased ways to observe and monitor the impact that their teachers have on students. Although many have criticized the use of VAMs to evaluate teachers, they remain a widely used measure of teacher performance. In part, their popularity is due to convenience: while observational protocols that send observers to every teacher's classroom require expensive training and considerable resources to implement at scale, VAMs use existing data and can be calculated centrally at low cost. Further, VAMs are arguably less biased than many other evaluation methods that districts might use instead (Bacher-Hicks et al. 2017; Harris et al. 2014; Hill et al. 2011).

Yet questions remain about the reliability, validity, and practical use of VAMs. This paper interrogates concerns raised by prior research on VAMs and raises new concerns about the use of VAMs in career and compensation decisions. We explore the bias and reliability of commonly estimated VAMs by comparing estimates of teacher value-added in mathematics and ELA with parallel estimates of teacher value-added on a well-measured biomarker that teachers should not impact: student height. Using administrative data from New York City, we find estimated teacher “effects” on height that are comparable in magnitude to actual teacher effects on math and ELA achievement: 0.22σ, compared to 0.29σ and 0.26σ, respectively. On its face, such results raise concerns about the validity of these models.

Fortunately, subsequent analysis finds that teacher effects on height are primarily noise, rather than bias due to sorting on unobserved factors. To ameliorate the effect of sampling error on value-added estimates, analysts sometimes “shrink” VAMs, scaling them by their estimated signal-to-noise ratio. When we apply the shrinkage method from Kane and Staiger (2008) across multiple years of data, the persistent teacher “effect” on height goes away, becoming the expected (and known) mean of zero. This procedure is not always done in practice, however, and requires multiple years of classroom data for the same teachers to implement. Of course, for making hiring and firing decisions, it seems important to consider that value-added measures that require multiple years of data to implement will likely permit identification of persistently bad teachers, but not provide a performance evaluation metric that can be met by teachers trying to improve their performance. In more realistic settings where the persistent effect is not zero, it is less clear that shrinkage would have a major influence on performance decisions, since it has modest effects on the relative rankings of teachers.
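
To make the idea of shrinkage concrete, here is a minimal sketch in Python. It is not the authors' code and not the exact Kane and Staiger (2008) procedure; it is a simplified illustration that assumes each teacher has one raw estimate in each of two years, treats the cross-year covariance as the persistent "signal" variance, and scales each year-1 estimate by the resulting signal-to-noise (reliability) ratio. The function names and the two-year setup are hypothetical.

```python
from statistics import mean, pvariance


def covariance(xs, ys):
    """Population covariance of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def shrink_year1(year1, year2):
    """Shrink year-1 teacher estimates toward their mean.

    year1[i] and year2[i] are raw estimates for the same teacher i.
    Assumption: noise is independent across years, so only the
    persistent teacher component contributes to the covariance.
    Returns (shrunk_estimates, reliability).
    """
    # Persistent ("signal") variance, floored at zero.
    signal_var = max(covariance(year1, year2), 0.0)
    # Total variance of a single-year estimate (signal plus noise).
    total_var = (pvariance(year1) + pvariance(year2)) / 2
    reliability = signal_var / total_var if total_var > 0 else 0.0
    center = mean(year1)
    shrunk = [center + reliability * (est - center) for est in year1]
    return shrunk, reliability
```

If the year-to-year estimates are uncorrelated, as one would expect for an outcome like height that teachers cannot affect, the estimated reliability is zero and every shrunk estimate collapses to the mean. If the estimates are identical across years, reliability is one and nothing is shrunk.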

Taken together, our results provide a cautionary tale for the naïve application of VAMs to teacher evaluation and other settings. They point to the possibility of the misidentification of sizable teacher “effects” where none exist. These effects may be due in part to spurious variation driven by the typically small samples of children used to estimate a teacher's individual effect.
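
The small-sample point can be illustrated with a toy simulation. This is a sketch under strong assumptions, not a replication of the paper: students are assigned to teachers at random, the outcome is standardized (mean 0, SD 1), there is no true teacher effect at all, and the "teacher effect" is simply the classroom mean. The function name, class sizes, and number of teachers are all hypothetical choices.

```python
import random
from statistics import pstdev


def spurious_teacher_sd(n_teachers, class_size, seed=0):
    """SD of classroom means when the true teacher effect is zero."""
    rng = random.Random(seed)
    class_means = []
    for _ in range(n_teachers):
        # Standardized outcomes with no teacher component whatsoever.
        outcomes = [rng.gauss(0.0, 1.0) for _ in range(class_size)]
        class_means.append(sum(outcomes) / class_size)
    return pstdev(class_means)


if __name__ == "__main__":
    for size in (25, 100):
        print(size, round(spurious_teacher_sd(2000, size), 3))
```

Under these assumptions the spread of the estimated "effects" is roughly 1 / sqrt(class_size) in student standard deviation units, so it shrinks only slowly as classes grow. Real value-added models condition on prior scores and other covariates, so the numbers here should not be compared directly with the paper's estimates.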
