How often have you heard your fellow instructors lament,

I don’t know why I bother with comments on the exams or even handing them back – students don’t go over their exams to see what they got right and wrong, they just look at the mark and move on.

If you often say or think this, you might want to ask yourself, What’s their motivation for going over the exam, besides “It will help me learn…”? But that’s the topic for another post.

In the introductory gen-ed astronomy class I’m working on, we gave a midterm exam last week. We dutifully marked it, which was simple because the midterm was multiple-choice, answered on Scantron cards. And calculated the average. And fixed the scoring on a couple of questions where the question stem was ambiguous (when you say “summer in the southern hemisphere,” do you mean June or do you mean when it gets hot?). And we moved on.

Hey, wait a minute! Isn’t that just what the students do — check the mark and move on?

Since I have the data, every student’s answer to every question, via the Scantron and already in Excel, I decided to “go over the exam” to try to learn from it.

*(Psst: I just finished wringing some graphs out of Excel and I wanted to start writing this post before I got distracted by, er, life so I haven’t done the analysis yet. I can’t wait to see what I write below!)*

Besides the average (23.1/35 questions or 66%) and standard deviation (5.3/35 or 15%), I created a histogram of the students’ choices for each question. Here is a selection of questions which, as you’ll see further below, are spread widely across the good-to-bad scale.

**Question 9:** You photograph a region of the night sky in March, in September, and again the following March. The two March photographs look the same but the September photo shows 3 stars in different locations. Of these three stars, the one whose position shifts the most must be

A) farthest away

B) closest

C) receding from Earth most rapidly

D) approaching Earth most rapidly

E) the brightest one

**Question 16:** What is the shape of the shadow of the Earth, as seen projected onto the Moon, during a lunar eclipse?

A) always a full circle

B) part of a circle

C) a straight line

D) an ellipse

E) a lunar eclipse does not involve the shadow of the Earth

**Question 25:** On the vernal equinox, compare the number of daytime hours in 3 cities, one at the north pole, one at 45 degrees north latitude and one at the equator.

A) 0, 12, 24

B) 12, 18, 24

C) 12, 12, 12

D) 0, 12, 18

E) 18, 18, 18

How much can you learn from these histograms? Quite a bit. Question 9 is too easy and we should use our precious time to better evaluate the students’ knowledge. The “straight line” choice on Question 16 should be replaced with a better distractor – no one “fell for” that one. I’m a bit alarmed that 5% of the students think that the Earth’s shadow has nothing to do with eclipses but then again, that’s only 1 in 20 (actually, 11 in 204 students – aren’t data great!) We’re used to seeing these histograms because in class, we have frequent think-pair-share episodes using i>clickers and use the students’ vote to decide how to proceed. If these were first-vote distributions in a clicker question, we wouldn’t do Question 9 again but we’d definitely get them to pair and share for Question 16 and maybe even Question 25. As I’ve written elsewhere, a 70% “success rate” can mean only about 60% of the students chose the correct answer for the right reasons.

I decided to turn it up a notch by following some advice I got from Ed Prather at the Center for Astronomy Education. He and his colleagues analyze multiple-choice questions using the point-biserial correlation coefficient. I’ll admit it – I’m not a statistics guru, so I had to look that one up. Wikipedia helped a bit, so did this article and Bardar et al. (2006). Normally, a correlation coefficient tells you how two variables are related. A favourite around Vancouver is the correlation between property crime and distance to the nearest Skytrain station (with all the correlation-causation arguments that go with it.) With point-biserial correlation, you can look for a relationship between students’ test scores and their success on a particular question (this is the “dichotomous variable” with only two values, 0 (wrong) and 1 (right).) It allows you to speculate on things like,

- (for high correlation) “If they got this question, they probably did well on the entire exam.” In other words, that one question could be a litmus test for the entire test.
- (for low correlation) “Anyone could have got this question right, regardless of whether they did well or poorly on the rest of the exam.” Maybe we should drop that question since it does nothing to discriminate or resolve the student’s level of understanding.

I cranked up my Excel worksheet to compute the coefficient, usually called ρ_{pb} or ρ_{pbis}:

$$\rho_{pbis} = \frac{\mu_{+} - \mu_{x}}{\sigma_{x}} \sqrt{\frac{p}{q}}$$

where μ_{+} is the average test score for all students who got this particular question correct, μ_{x} is the average test score for all students, σ_{x} is the standard deviation of all test scores, *p* is the fraction of students who got this question right and *q*=(1-*p*) is the fraction who got it wrong. You compute this coefficient for every question on the test. The key step in my Excel worksheet, after giving each student a 0 or 1 for each question they answered, was the AVERAGEIF function: for each question I computed

=AVERAGEIF(B$3:B$206,"=1",$AL$3:$AL$206)

where, for example, Column B holds the 0 and 1 scores for Question 1 and Column AL holds the exam marks. This function takes the average of the exam scores only for those students (rows) who have got a “1” on Question 1. At last then, the point-biserial correlation coefficients for each of the 35 questions on the midterm, sorted from lowest to highest:

First of all, ooo shiney! I can’t stand the default graphics settings of Excel (and PowerPoint) but with some adjustments, you can produce a reasonable plot. Not that this one is perfect, but it’s not bad. Gotta work on the labels and a better way to represent the bands of “desirable”, “weak”, etc.
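For anyone who would rather not wrestle with Excel, here is a minimal Python sketch of the same calculation. The function and variable names are illustrative, not taken from the worksheet, and it assumes the population standard deviation for σ_{x} (the post doesn’t say which version the worksheet used). The list comprehension plays the role of AVERAGEIF.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(item_scores, total_scores):
    """Point-biserial coefficient for one question.

    item_scores  -- 0 (wrong) or 1 (right) for each student on this question
    total_scores -- each student's total exam mark, in the same order
    """
    if len(item_scores) != len(total_scores):
        raise ValueError("need one item score and one total per student")

    p = mean(item_scores)        # fraction who got the question right
    q = 1 - p                    # fraction who got it wrong
    if p == 0 or p == 1:
        raise ValueError("undefined when everyone is right or everyone is wrong")

    # Equivalent of AVERAGEIF: average exam mark of students with a 1 here.
    mu_plus = mean([t for s, t in zip(item_scores, total_scores) if s == 1])
    mu_x = mean(total_scores)
    sigma_x = pstdev(total_scores)   # assumption: population standard deviation

    return (mu_plus - mu_x) / sigma_x * sqrt(p / q)


# Made-up data for five students, purely for illustration.
question = [1, 1, 0, 1, 0]
marks = [5, 4, 2, 3, 1]
print(round(point_biserial(question, marks), 2))
```

With the population standard deviation, this version gives the same number as the ordinary Pearson correlation between the 0/1 column and the exam marks, which is a handy cross-check.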

Back to going over the exam, how did the questions I included above fare? Question 9 has a weak, not desirable coefficient, just 0.21. That suggests anyone could get this question right (or equivalently, no one could get this question right). It does nothing to discriminate or distinguish high-performing students from low-performing students. Question 16, with ρ_{pb} = 0.37 is in the desirable range – just hard enough to begin to separate the high- and low-performing students. Question 25 is one of the best on the exam, I think.

In case you’re wondering, Question 6 (with the second highest ρ_{pb} ) is a rather ugly calculation. It discriminated between high- and low-performing students but personally, I wouldn’t include it – doesn’t match the more conceptual learning goals IMHO.

I was pretty happy with this analysis (and my not-such-a-novice-anymore skills in Excel and statistics.) I should have stopped there. But like a good scientist making sure every observation is consistent with the theory, I looked at Question 26, the one with the highest point-biserial correlation coefficient. I was shocked, alarmed even. The most discriminating question on the test was this?

**Question 26:** What is the phase of the Moon shown in this image?

A) waning crescent

B) waxing crescent

C) waning gibbous

D) waxing gibbous

E) third quarter

It’s waning gibbous, by the way, and 73% of the students knew it. That’s a lame, Bloom’s taxonomy Level 1, memorization question. *Damn*. To which my wise and mentoring colleague asked, “Well, what was the exam really testing, anyway?”

Alright, perhaps I didn’t get the result I wanted. But that’s not the point of science. Of this exercise. I definitely learned a lot by “going over the exam”, about validating questions, Excel, statistics and WordPress. And perhaps made it easier for the next person, shoulders of giants and all that…

Impressive analysis! Thanks for doing this. I would like to do more looking at the post-exam results to see what questions are working. So hard to find the motivation and time! 🙂

I’m still so new to teaching astronomy that I don’t have one way of testing that I really like. I’ve been trying new things each semester and my class size is small enough that I can do written answer and problems. But after the CAE workshop this past weekend, I wonder if I should be doing calculator problems at all.

I will read through your results more carefully soon…for now have to run to the next astronomy lab. 🙂

This point-biserial correlation is tailored to exams with questions that can be marked right or wrong. It lends itself naturally to multiple-choice questions. I’ll admit that I used to think multiple-choice-only exams were a poor form of assessment but I’m starting to think otherwise, if the questions are good ones. But I can imagine using this ρ_{pb} to look at a multiple-choice question in an exam with other kinds of questions, too – you’d still be able to find the average scores for students who got your multiple-choice question right and wrong. Perhaps you could even analyze a short answer question, as long as you’re willing to assign it all or nothing marks (but not part marks.)

I’m glad you had a chance to attend a CAE workshop. Keep going back – you learn more each time. And then sign up for a Tier II!

Hi Peter,

I’m pretty happy about the timing of your post since I’m right in the middle of grading at the moment. For long answer questions (with “continuous” grades) would the Pearson correlation be the right tool to use?

In your case though, shouldn’t it be p*q instead of p/q in the square root?

Cheers!

Pat

If I didn’t say it explicitly, you could probably figure it out from this post: I’m not very confident in my stats ability. But I’m quite sure that when comparing 2 continuous variables, you can use the Pearson correlation coefficient, as you suggest. For example, you could look for correlations between the number of classes they attend and their mark on the final exam. Ideally, I think, you want both variables to have a normal distribution. But don’t get suckered into assigning causation to correlation. Coming to class might get you a better exam mark. Or the people who get high exam marks have learned that coming to class is a good learning strategy, along with doing the homework, the pre-reading and attending review sessions.

As for p/q or p*q, I used the first formula for r_pb in the Wikipedia page which has (n1/n)*(n2/n), essentially p*q. Then I tried the formula from Bardar et al. (the formula for ρ_{pbis} above with p/q). And miracle-of-miracles, the results are identical! Now there’s a good homework problem…
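One way to work that homework problem, as a sketch: write μ_{-} for the average test score of the students who got the question wrong (a symbol introduced here for convenience, it isn’t used in the post). The overall average is the weighted mix of the two groups, so

$$\mu_{x} = p\,\mu_{+} + q\,\mu_{-} \quad\Longrightarrow\quad \mu_{+} - \mu_{x} = q\,(\mu_{+} - \mu_{-})$$

Substituting into the p/q version gives

$$\frac{\mu_{+} - \mu_{x}}{\sigma_{x}} \sqrt{\frac{p}{q}} = \frac{q\,(\mu_{+} - \mu_{-})}{\sigma_{x}} \sqrt{\frac{p}{q}} = \frac{\mu_{+} - \mu_{-}}{\sigma_{x}} \sqrt{p\,q}$$

which is the p*q version. The two agree as long as both use the same σ_{x}; the difference is only whether μ_{+} is compared with the whole-class average or with the average of the students who got the question wrong.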

Hi Peter,

This probably doesn’t make a difference for you, as your test has a fair number of items, but the following point becomes more important for shorter assessments. Folks could consider using “corrected” test scores and “corrected” standard deviation for this calculation. In other words: for mu_{+}, use the mean of the corrected total test scores for those who answered correctly; for mu_{x}, use the mean of the corrected total test scores for the whole sample; and, for sigma_{x}, use the standard deviation of all scores on the corrected total test. The correction entails a total score which excludes the response to the item in question, as total scores which include the item in question will possess inauthentically greater correlation than total scores consisting only of other items in the test, especially when the assessment possesses relatively few items. [1]
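A rough sketch of that correction in Python, under stated assumptions: each question is worth one mark, so the corrected total is just the total minus the 0/1 item score, and σ is the population standard deviation. The function name is made up for illustration.

```python
from math import sqrt
from statistics import mean, pstdev


def corrected_point_biserial(item_scores, total_scores):
    """Point-biserial using totals that exclude the item being analysed.

    Assumes each item is worth one mark, so removing the item from a
    student's total is simply total minus the 0/1 item score.
    """
    # "Corrected" totals: the score on all the *other* items.
    rest = [t - s for s, t in zip(item_scores, total_scores)]

    p = mean(item_scores)
    q = 1 - p
    if p == 0 or p == 1:
        raise ValueError("undefined when everyone is right or everyone is wrong")

    mu_plus = mean([r for s, r in zip(item_scores, rest) if s == 1])
    mu_x = mean(rest)
    sigma_x = pstdev(rest)           # assumption: population standard deviation

    return (mu_plus - mu_x) / sigma_x * sqrt(p / q)


# Made-up data for five students, purely for illustration.
print(round(corrected_point_biserial([1, 1, 0, 1, 0], [5, 4, 2, 3, 1]), 2))
```

On a five-item toy test like this one the corrected value comes out noticeably lower than the uncorrected one; with 35 items the gap should be much smaller.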

I’ve heard arguments that state values of rpb > 0.2 are okay (i.e., considered desirable) [2]. In fact, a relatively low value should not be surprising, and perhaps even expected, if one considers assessments designed to test multiple abilities in as few questions as possible.

Lastly, fyi, a minimum critical Pearson point-biserial correlation coefficient has been defined. [3] It is two standard deviations above zero, with the standard deviation calculated by:

sigma_r = 1/sqrt(N-1)

where N is the sample size. For example, if you had a class of 150 students, your critical cutoff would be 0.16.
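As a quick check of that arithmetic, here is a small sketch (the function name is hypothetical) that computes two standard deviations above zero for a given class size:

```python
from math import sqrt


def critical_r_pb(n_students):
    """Minimum critical point-biserial: two standard deviations above zero,
    with sigma_r = 1/sqrt(N-1) as defined above."""
    if n_students < 2:
        raise ValueError("need at least two students")
    sigma_r = 1 / sqrt(n_students - 1)
    return 2 * sigma_r


print(round(critical_r_pb(150), 2))
```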

[1] M. J. Allen and W. M. Yen, Introduction to Measurement Theory (Long Grove, IL: Waveland Press, 1979) p. 123.

[2] P. Kline, A Handbook of Test Construction: Introduction to psychometric design (London: Methuen, 1986) p. 143.

[3] L. Crocker and J. Algina, Introduction to Classical and Modern Test Theory (New York, NY: Holt, 1986) p. 34.

Thanks, James, for checking my stats. I did read about the “corrected” point-biserial. As you so nicely describe, the success or failure on the particular question itself should probably be taken out of the test success rates, since it drags the test scores up (or down). There were a couple of reasons why I went with this simpler “biased” version, though. First, as you mention, I have 35 questions and I didn’t think the effect would be huge. If there were only, say, 5 questions, then certainly I’d need to use the “corrected” test scores.

The other reason I kept this “biased” definition is because it’s one Bardar et al. (2006) used in their analysis of the LSCI. They quote 0.30 – 0.70 as “acceptable” so I wanted to use the same statistic.

That paper also mentions the lower limit, 1/sqrt(N-1). In my case with N=204, that gives 0.07. It doesn’t show very well (at all – must work on graphics) in the graphic but there’s a green line at 0.07 below which values are “insignificant”.

Thanks for the comment. Having someone with much more statistics expertise explain the difference between the biased and unbiased coefficient, and having it match what I thought, boosts my stats confidence.

I really hate questions like this:

Question 25: On the vernal equinox, compare the number of daytime hours in 3 cities, one at the north pole, one at 45 degrees north latitude and one at the equator.

A) 0, 12, 24

B) 12, 18, 24

C) 12, 12, 12

D) 0, 12, 18

E) 18, 18, 18

Of course, at first thought, most people would answer (C). But that’s not strictly correct: at the north pole, the sun is above the horizon for the entire 24 hour period centered on the vernal equinox. The sun climbs in declination about 24′ every 24 hours at the vernal equinox. Therefore, while the sun is centered on the horizon at the moment of vernal equinox as seen from the north pole, its top limb (15.5′ higher) would still be above the horizon 12 hours earlier (when the sun is 12′ lower). And 12 hours later, most of the solar disk is above the horizon. And this ignores atmospheric refraction, which raises the apparent height of objects on the horizon by 34′. So, in reality, a person at the north pole would see the entire solar disk above the horizon for the 24 hour period centered on the vernal equinox.

So the correct answer would be 24, 12, 12, which is an option NOT provided. So what’s the poor student to do when he realizes this?
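The arithmetic above can be checked with a short sketch. It uses only the numbers quoted in this comment and makes two simplifying assumptions: at the pole the altitude of the sun’s centre equals its declination, and refraction is treated as a constant 34′ lift. The names are illustrative.

```python
# All angles are in arcminutes; values are the ones quoted in the comment.
DECLINATION_RATE = 24.0   # change in the sun's declination per 24 hours
SOLAR_RADIUS = 15.5       # angular radius of the solar disk
REFRACTION = 34.0         # apparent lift of objects at the horizon


def limb_altitudes(hours_from_equinox, refraction=0.0):
    """Return (lower_limb, upper_limb) altitudes as seen from the north pole.

    Assumes the centre of the sun sits exactly on the horizon at the
    moment of equinox and that refraction is a constant offset.
    """
    centre = DECLINATION_RATE * hours_from_equinox / 24.0 + refraction
    return centre - SOLAR_RADIUS, centre + SOLAR_RADIUS


for hours in (-12, 0, 12):
    geometric = limb_altitudes(hours)
    apparent = limb_altitudes(hours, REFRACTION)
    print(hours, geometric, apparent)
```

A positive upper-limb altitude means some of the disk is above the horizon; a positive lower-limb altitude means the whole disk is.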

Phil

Thanks, Phil, for your clear explanation. Perhaps this question and our 12, 12, 12 answer is another victim of over-simplifying the real world so we get the answer we want.

I wonder, though, if a student would have any appreciation of the impact of refraction if they were not first expecting the 12, 12, 12 answer. I think it harks back to the scientific method: we predict the answer will be 12, 12, 12. Yep, the simulations, textbook illustrations and visualization of the celestial sphere say that’s correct. Unfortunately, we can’t get to the North Pole on the equinox, so we’ll accept 12, 12, 12. Oh, but wait, someone HAS made that observation and 12, 12, 12 is not true? How interesting. I wonder what went wrong? Oh, right – we forgot about refraction. Let’s go back and revise our theory, make a new prediction (“Hmm, I wonder what happens at the South Pole?”) and continue exploring…

With this introductory gen-ed Astro 101 audience, though, I’m content with getting the students over the first hurdle, visualizing the celestial sphere, and preparing them to have the deeper conversations about refraction.

At my school we have access to the “Remark” software (I think published by Gravic—company names change so often) that scans bubblesheets and can do detailed item analysis such as the point bi-serial stat. Although this may sound heretical: I re-use test questions (mix them up and option orders from semester to semester, of course) and it’s interesting to see that the point bi-serial stat for a question can vary quite a lot from semester to semester. I don’t know how to analyze the results over multiple semesters (combine multiple point bi-serials for a given question) short of re-scanning all of an exam’s bubble sheets over multiple semesters making due allowance for different question and option orders. [If there were 48 hours in a day and I didn’t have to sleep, I’d consider doing that.] So, this is a note of caution that a great question on this semester’s exam might not be so great on another exam and vice versa. There is also the fact that we teachers are constantly trying to improve our teaching of a concept based on what we gleaned from student difficulties in the previous semesters. That has got to have an effect on a question’s point bi-serial from semester to semester but I don’t have the knowledge of how to account for that in the statistical analysis.

Anyway, now that you have the ability to calculate those pt bi-serials, it would be interesting to see what you’d find if you re-use questions on future exams.

Thanks for the suggestions, Nick. When I’m authoring an exam, I worry about the order of the questions. For example, I don’t want to put a hard question first that might stump (and demoralize) too many students. But that’s not based on science, just a gut feeling. Using the point-biserial on the same questions from different versions of the exam is a great idea for quantifying that feeling – Does order make a difference?

Hmm, in large classes I often have 2 versions of midterms and exams that I distribute in stripes down the room so neighbours to your left or right have a different test than yours. Same questions, just shuffled. That could be the data we’re looking for. Gonna have to try it!

My point was about exams in different semesters (i.e., with different classes) so I don’t think that two versions of an exam given to the same class would address the point I was making. However, it would still be interesting to check how question order on a given exam to the same class could affect things.

For my classes I always have two versions of the exam and I have students sit in assigned seats such that they will not be sitting next to someone with the same version they have. When I do the analysis of the questions I merge the two exams to the same question order so I lose any information on whether question order matters or not. My classes are small enough that I’d worry about small sample size statistics leading to erroneous conclusions if I didn’t merge the two exam versions. A stats guru could probably take care of the small sample size but that’s not me!

It sounds like we all have ways to discourage cheating during exams. That would make a good astrolrnr thread, what with the term about to begin.

As for your first point, comparing students’ success on 2 scrambled versions of the same midterm (as I suggested in my previous reply) would better probe the effects of order, I think, because we’re guaranteed all the students had the same instruction. Comparisons of question order from one term to the next could be clouded by changes in instruction, too.

Both experiments are interesting and each will reveal something about teaching and learning astronomy. I think I still have the raw scantron data for the exam discussed in this blog post – I might look at processing the 2 versions separately, just to see if anything shows up…

Thanks, Nick, for the discussion about going over going over the exam. That’s some kind of “meta” going-over, I think. Or maybe we’re iterating towards a stable result!