Student Teaching Evaluations are Effective, but Not in the Way You Think

By Keith Devlin @profkeithdevlin

As a department chair for four years and a dean for eight, I read a great many student evaluation folders of the faculty who taught them. In both institutions, we also had classroom peer review of instructors. I always felt far more confident in the information provided by experienced fellow instructors than I did with the student evaluations. For one thing, student evaluations encourage younger instructors whose career development depends on them to “teach to the evaluation”, resulting in the evaluations being more a measure of the degree to which the students enjoyed the course, rather than how effective it was in producing good learning. For another, even if the students tried to be objective about what they learned, they have an inadequate baseline to make such an evaluation. That, after all, is why they are taking the class!

But it gets worse. Not even an experienced instructor can say, at the time, how well the students have learned from the course. The only measure of learning that has any real value is to see how course graduates perform in subsequent situations where the material learned has relevance: in a follow-on course or, even more meaningful, a course or a job where the mathematics supposedly learned has to be used. Without such down-the-line evaluation, both the student and the instructor have to rely on assignments completed during the course and the results of a final exam in the last week, neither of which says much about effective learning.

My periods as a department chair and a dean were at two selective liberal arts colleges, from 1989 to 2001. Subsequently, after moving to Stanford to focus more on research in 2001, I learned considerably more about cognitive science (long a side interest of mine), where researchers had amassed a considerable amount of knowledge about how people learn, and what factors make for good learning.

One study in particular showed that student evaluations are good predictors of learning. But there’s a twist. The correlation between student evaluations and quality of learning is negative. The higher the instructor’s score in the student evaluations, the worse the learning; and the lower the evaluation score, the better the learning. (Feel free to read that again. There is no typo.)

Conducting a study that can produce such results with any reliability is clearly a difficult task. Indeed, in a normal undergraduate institution, it’s impossible. As I alluded to above, to see how effective a particular course has been, you need to see how well a student performs when they later face challenges for which the course experience is—or at least, should be—relevant. But in a regular university, it’s just not possible to set up a randomized, controlled study that follows groups of course graduates for several years, all the time evaluating them in a standardized, systematic way. Even if the original course is in the first year, leaving the student three further years in the same institution, students drop out, select different subsequent elective courses, or even change major tracks.

That problem is what made one particular study so significant. Conducted from 1997 to 2007, it drew its subjects from students (12,568 in all) at the US Air Force Academy (USAFA) in Colorado Springs, Colorado. The researchers were Scott E. Carrell, of the Department of Economics at the University of California, Davis, and James E. West, of the Department of Economics and Geosciences at USAFA.

Students at the US Air Force Academy in Colorado Springs, Colorado

Since this is a fairly unique higher education institute, extrapolation of the study’s results to other colleges or universities clearly requires knowledge of what kind of institution USAFA is.

The US Air Force Academy is a fully accredited undergraduate institution of higher education with an approximate enrollment of 4,200 students. It offers 32 majors, including humanities, social sciences, basic sciences, and engineering. The average SAT for the 2005 entering class was 1309 with an average high school GPA of 3.60 (Princeton Review 2007). Applicants are selected for admission on the basis of academic, athletic, and leadership potential, and a nomination from a legal nominating authority. All students receive a 100 percent scholarship to cover their tuition, room, and board. Additionally, each student receives a monthly stipend of $845 to cover books, uniforms, computer, and other living expenses. All students are required to graduate within four years, after which they must serve for five years as a commissioned officer in the Air Force.

Approximately 17% of the study sample was female, 5% was black, 7% Hispanic, and 5% Asian.

Academic aptitude for entry to USAFA is measured through SAT verbal and SAT math scores and an academic composite that is a weighted average of an individual's high school GPA, class rank, and the quality of the high school attended. All entering students take a mathematics placement exam upon matriculation, which tests algebra, trigonometry, and calculus. The sample mean SAT math and SAT verbal are 663 and 632, with respective standard deviations of 62 and 66.

USAFA students are required to take a core set of approximately 30 courses in mathematics, basic sciences, social sciences, humanities, and engineering. Grades are determined on an A, A-, B+, B, …, C-, D, F scale, where an A is worth 4 grade points, an A- is 3.7 grade points, a B+ is 3.3 grade points, etc. The average GPA for the study sample was 2.78. Over the ten-year period of the study there were 13,417 separate course-sections taught by 1,462 different faculty members. Average class size was 18 students per class and approximately 49 sections of each core course were taught each year.

USAFA faculty, who include both military officers and civilian employees, have graduate degrees from a broad sample of high quality programs in their respective disciplines, similar to a comparable undergraduate liberal arts college.

Clearly, in many respects, this reads like the academic profile of many American four-year colleges and universities. The main difference is the nature of the student body: USAFA students enter with a specific career path in mind (at least for nine years), albeit one admitting a great many variations, and perhaps also, in many cases, with a high degree of motivation. While that difference clearly has to be kept in mind when using the study’s results to make inferences for higher education as a whole, the research benefits of working with such an organization are significant, leading to results that are highly reliable for that institution.

First, there is the sheer size of the study population. It was so large that there was no problem randomly assigning students to professors over a wide variety of standardized core courses. That random assignment of students to professors, together with substantial data on both professors and students, enabled the researchers to examine how professor quality affects student achievement, free from the usual problems researchers face due to student self-selection.

Moreover, grades in USAFA core courses are a consistent measure of student achievement, because faculty members teaching the same course use an identical syllabus and give the same exams during a common testing period.

Student grades in mathematics courses are particularly reliable measures. Math professors grade only a small proportion of their own students’ exams, which vastly reduces the ability of “easy” or “hard” grading professors to affect their students’ grades. Math exams are jointly graded by all professors teaching the course during that semester in “grading parties,” where Professor A grades question 1 for all students, Professor B grades question 2 for all students, and so on. Additionally, all professors are given copies of the exams for the course prior to the start of the semester. All final grades in all core courses are determined on a single grading scale and are approved by the department chair. Student grades can thus be taken to reflect the manner in which the course is taught by each professor.
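For readers who like to see the arithmetic, here is a minimal Python sketch of why a “grading party” neutralizes easy and hard graders. It is an illustration only, not anything taken from the study: the three professors, their leniency offsets, and the student scores are all hypothetical numbers invented for the example, and it assumes, for simplicity, an exam with exactly one question per professor.

```python
# Hypothetical leniency of each professor: points added to (or taken off)
# every question that professor grades. Invented for illustration.
LENIENCY = {"A": 2.0, "B": 0.0, "C": -1.0}

# Hypothetical "true" per-question scores, grouped by the professor who
# teaches the section. Every section has identical true performance.
TRUE_SCORES = {
    "A": [[10, 10, 10], [8, 8, 8]],
    "B": [[10, 10, 10], [8, 8, 8]],
    "C": [[10, 10, 10], [8, 8, 8]],
}


def own_section_grading(true_scores, leniency):
    """Each professor grades every question for their own students only."""
    return {
        prof: [sum(q + leniency[prof] for q in student) for student in students]
        for prof, students in true_scores.items()
    }


def grading_party(true_scores, leniency):
    """Question i is graded by the i-th professor for ALL students."""
    graders = sorted(leniency)  # grader of question 0, 1, 2, ...
    return {
        prof: [
            sum(q + leniency[graders[i]] for i, q in enumerate(student))
            for student in students
        ]
        for prof, students in true_scores.items()
    }


def section_means(totals):
    """Average exam total for each professor's section."""
    return {prof: sum(t) / len(t) for prof, t in totals.items()}


if __name__ == "__main__":
    print("own-section grading:", section_means(own_section_grading(TRUE_SCORES, LENIENCY)))
    print("grading party:      ", section_means(grading_party(TRUE_SCORES, LENIENCY)))
```

With these invented numbers, own-section grading makes the three sections look different even though their true performance is identical, whereas under the grading party every student receives the same total offset, so the section averages coincide. Any difference that remains between sections in real data is then more plausibly due to teaching than to grading.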

A further significant research benefit of conducting the study at USAFA is that students are required to take, and are randomly assigned to, numerous follow-on courses in mathematics, humanities, basic sciences, and engineering, so that performance in subsequent courses can be used to measure effectiveness of earlier ones—surely a far more meaningful measure of learning than weekly assignments or an end-of-term exam.

It is worth noting also that, even if a student has a particularly bad introductory course instructor, they still are required to take the follow-on related curriculum.

If you are like me, given that background information, you will take seriously the research results obtained from this study. At the cost of focusing on a special subset of students, the statistical results of the study will be far more reliable and meaningful than for most educational studies.

So what are those results?

First, the researchers found there are relatively large and statistically significant differences in student achievement across professors in the contemporaneous course being taught. A one-standard deviation difference in a professor “fixed effect” results in a 0.08 to 0.21 standard deviation change in student achievement. (A fixed effect is a variable like age, sex, ethnicity, or qualifications, that is constant across individuals for the course of the study.)

Introductory course professors significantly affect student achievement in follow-on related courses, but these effects are quite heterogeneous across subjects.

But here is the first surprising result. Students of professors who as a group perform well in the initial mathematics course perform significantly worse in the (mandatory) follow-on related math, science, and engineering courses. For math and science courses, academic rank, teaching experience, and terminal degree status of professors are negatively correlated with contemporaneous student achievement, but positively related to follow-on course achievement. That is, students of less experienced instructors who do not possess terminal degrees perform better in the contemporaneous course being taught, but perform worse in the follow-on related courses.

Presumably, what is going on is that less academically qualified instructors may spur (potentially erroneous) interest in a particular subject through higher grades, but those students perform significantly worse in follow-on related courses that rely on the initial course for content. (Interesting side note: for humanities courses, the researchers found almost no relationship between professor observable attributes and student achievement.)

And what about student evaluations, the issue with which I began this essay? This is the second surprising result, although for many experienced educators the surprise may be that such a substantial study could be conducted to prove what they long suspected. The study found that student evaluations positively predict student achievement in contemporaneous courses, but are very poor predictors of follow-on student achievement. This finding raises a big question regarding the value and accuracy of contemporaneous student evaluations in making instructor personnel decisions.

So what is going on here? To answer that question, we must look to other research on how people learn. This is a huge topic in its own right, with research contributions from several disciplines, including neurophysiology. Since this essay is already over 1,700 words long, I’ll leave it here with an extremely brief summary oriented toward teaching. (I’ll come back to the issue in a later post. It’s not an area I have worked in, but I am familiar with the work of others who do, and have collaborated with some of them.)

Learning occurs when we get something wrong and have to correct it. So a particularly effective way to teach something in a way that will stick is to put students in a position of having to arrive at the best answer they can, without hints, even if that answer is wrong. Then, after they have committed (preferably some time after), you can correct them, preferably with a hint (just one) to prompt them to rectify the error themselves. Psychologists who have studied this refer to the approach as introducing “desirable difficulties.” Google it if you have not come across this term before. (The term itself is due to the Stanford psychologist Robert Bjork.)

The result of this approach makes students (and likely their parents and their instructor) feel uncomfortable, since the student does not appear to be making progress. In particular, their assignment work and end-of-term test will be littered with errors. (Instructors should grade on the curve. I frequently set the pass mark in a math course around 30%, with a score of 60% or more correct getting an A, although in an ideal world I would have preferred to not be obliged to assign a letter grade, at least based solely on contemporaneous testing.)

Of course, the students are not going to be happy about this, and their frustration with themselves is likely to be offloaded onto the instructor. But, for all that it may seem counterintuitive, they will walk away from that course with far better, more lasting, and more usable learning than if they had spent the time in a feel-good semester of shallow reinforcement because they were getting everything right. (It’s not hard to get something right if someone else guides you to it, or even ends up showing you exactly how to do it.)

Getting things right, with the well-deserved feeling of accomplishment it brings, is a wonderful thing to experience, and should be acknowledged and rewarded—when you are on your own, out in the world, applying your learning to do things. But getting everything right is a counterproductive goal in education. Learning is what happens by correcting what you got wrong. Indeed, as I alluded to earlier, the learning is better if the correction occurs some time after the error is made. Stewing for a while in frustration at being wrong, and not seeing how to fix it, before finally figuring out what you were doing/getting wrong, turns out to be a good thing.

So, if you are a student, and your instructor refuses to put you out of your misery as you struggle to master the new concepts and complete the assignments, at least be aware that the instructor most likely is doing so because they want you to learn and, as a trained professional, know what that takes. You can’t learn to ride a bike or skateboard without bruising your knees and your elbows. And you can’t learn math (and various other skills) without bruising your ego.

NOTE: In two related blog posts from the Stanford Mathematics Outreach Project, I examine implications of the USAFA study for K-12 mathematics education.

Devlin's Angle, TEACHING & LEARNING | Keith Devlin | October 1, 2019 | Tags: assessment, student evaluations of instructors, how people learn, USAFA study, desirable difficulties, Keith Devlin