Tuesday, November 11, 2014

Evaluating Exam Efficacy

Some instructors pride themselves on writing exam questions that only 20% of the class gets right, because they believe this helps them figure out who the strong students are and who the weak students are. It's a noble goal, but raw difficulty is not actually the best metric for evaluating an exam's ability to discriminate strong from weak students.

When I'm evaluating whether my exam is effective and valid, I'm generally interested in two properties of each exam item: its difficulty and its discriminability.

The metric I use to determine whether an exam question is easy or difficult is the proportion of students in the class who get the question correct - p(correct). If p(correct) for a multiple choice question is 0.25 (i.e., 25% of students get the answer correct), it suggests that the exam question is too difficult. On a 5-option multiple choice question, chance is 0.20, so a p(correct) that close to chance suggests that most (if not all) students who got the question correct were guessing. I am conservative, so I tend to throw out questions that have a p(correct) below 0.3. Ideally, the p(correct) on multiple choice exam questions will be between 0.5 and 0.8, but I keep questions with a p(correct) anywhere from 0.3 to 0.9 and then look at whether they have good discriminability.
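
To make those cutoffs concrete, here's a minimal sketch in Python of how flagging items by difficulty could work. The 0/1 `responses` matrix (one row per student, one column per item) and the numbers in it are made up for illustration, not real exam data, and the labels simply encode the thresholds above.

```python
import numpy as np

# Hypothetical 0/1 score matrix: one row per student, one column per exam item
# (1 = correct, 0 = incorrect). Real data would come from the scoring software.
responses = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 0, 1, 0],
])

# p(correct) for each item is simply the column mean.
p_correct = responses.mean(axis=0)

for item, p in enumerate(p_correct, start=1):
    if p < 0.3:
        verdict = "too difficult - candidate to throw out"
    elif p > 0.9:
        verdict = "outside my keep range - probably too easy"
    elif 0.5 <= p <= 0.8:
        verdict = "ideal difficulty"
    else:
        verdict = "keep, but check discriminability"
    print(f"Item {item}: p(correct) = {p:.2f} -> {verdict}")
```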

A question that has good discriminability will have a p(correct) that is higher for students who performed well on the exam overall and lower for students who performed poorly. If you take the students who scored in the top 25% of the class, on any given question they should outperform students who scored in the bottom 25% of the class. The bigger the difference in p(correct) between those two groups, the better the discriminability of the individual question. Looking at performance on each exam item as a function of the top vs. bottom quartile of the class is a rough estimate; a more precise measure is the point biserial coefficient, which is the correlation between performance on an exam question and overall exam performance. Ideally, the point biserial will be at least 0.25, and I have questions that are as high as 0.45, and even 0.57 (a new high for Psy270!). If an exam question is really tough (e.g., p(correct) = 0.3) but it has a high point biserial coefficient (0.3+), I will keep it in, despite its difficulty. The best questions in my arsenal have a p(correct) around 0.7 and a point biserial coefficient above 0.32.
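
Here's one way both of those calculations could look in code, under the same assumptions as the earlier sketch (a 0/1 `responses` matrix). The function name is mine, and note it correlates each item with the total score including that item; some people prefer a corrected item-total correlation that leaves the item out, which runs slightly lower.

```python
import numpy as np

def item_discrimination(responses):
    """Top-vs-bottom quartile gap and point biserial coefficient per item.

    `responses` is a 0/1 matrix (students x items), as in the earlier sketch.
    """
    responses = np.asarray(responses, dtype=float)
    totals = responses.sum(axis=1)

    # Rough estimate: p(correct) in the top 25% of the class minus
    # p(correct) in the bottom 25%.
    order = np.argsort(totals)
    quarter = max(1, len(totals) // 4)
    bottom, top = order[:quarter], order[-quarter:]
    quartile_gap = responses[top].mean(axis=0) - responses[bottom].mean(axis=0)

    # More precise: point biserial, i.e. the correlation between each item's
    # 0/1 scores and the overall exam score.
    point_biserial = np.array([
        np.corrcoef(responses[:, j], totals)[0, 1]
        for j in range(responses.shape[1])
    ])
    return quartile_gap, point_biserial

# Hypothetical 0/1 data again: rows are students, columns are items.
responses = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 0, 1, 0],
])
gap, pb = item_discrimination(responses)
for item, (g, r) in enumerate(zip(gap, pb), start=1):
    flag = "keep" if r >= 0.25 else "inspect"
    print(f"Item {item}: quartile gap = {g:+.2f}, point biserial = {r:+.2f} ({flag})")
```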

The worst questions are ones that have a low p(correct) and a negative point biserial coefficient. A negative coefficient means that students who performed poorly on the exam were more likely to get the question correct than students who performed well. Generally, when I look at these questions I find they are unintentionally misleading. The strongest students in the class tend to read into these questions too much, over-interpreting them and talking themselves out of the correct answer, while weaker students don't know enough to be misled. I will always toss out these questions because they don't do what they're supposed to do - discriminate weak from strong students.

Finally, I look at the overall exam reliability. The reliability coefficient indicates how likely the exam is to produce consistent results. High reliability means that students who answered a given question correctly were likely to answer other questions correctly. While the reliability coefficient can theoretically range from 0.00 to 1.00, in practice coefficients tend to fall between 0.5 and 0.9. An exam with a reliability coefficient above 0.9 is excellent, and is generally where standardized testing services like the Educational Testing Service want their exams to fall. I'm pretty pleased to say that the exam reliability coefficients for my second-year courses are between 0.85 and 0.9.
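
If you want to compute a coefficient like this yourself, one common formula for exams scored 0/1 is KR-20, which is equivalent to Cronbach's alpha for dichotomous items. The sketch below is just that textbook formula, offered as an illustration; it isn't necessarily the exact computation any particular scoring service uses.

```python
import numpy as np

def kr20(responses):
    """Kuder-Richardson 20 reliability for dichotomously scored (0/1) items.

    Equivalent to Cronbach's alpha when every item is scored 0 or 1.
    `responses` is the same kind of students x items matrix as above.
    """
    responses = np.asarray(responses, dtype=float)
    n_items = responses.shape[1]
    p = responses.mean(axis=0)                      # p(correct) per item
    q = 1.0 - p                                     # p(incorrect) per item
    # Variance of total exam scores (conventions differ on sample vs.
    # population variance; sample variance is used here).
    total_var = responses.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1.0 - (p * q).sum() / total_var)
```

Run on a students-by-items matrix like the `responses` example earlier, this returns a single number; by the rule of thumb above, anything from 0.5 to 0.9 is typical and anything above 0.9 is excellent.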

So the next time a prof tells you that tough questions help them discriminate the strong students from the weak students, ask them what metric they're using to make that evaluation. Ideally it won't be based exclusively on how many students answered the question correctly.