When scores for the June SAT were released last month, many students were in for a rude surprise. Although their raw scores were higher than on their previous exam(s), their scaled scores were lower, in some cases dramatically so.

An article in The Washington Post recounted the story of Campbell Taylor, who in March scored a 1470—20 points shy of the score he needed to qualify for a scholarship at his top-choice school:

[T]he 17-year-old resolved to take the test again in June and spent the intervening months buried in SAT preparation books and working with tutors. Taylor awoke at 7:30 a.m. Wednesday and checked his latest score online. The results were disappointing: He received a 1400.

He missed one more question overall in June than in March but his score, he said, dropped precipitously. And in the math portion of the exam, he actually missed fewer questions but scored lower: Taylor said he got a 770 in March after missing five math questions but received a 720 in June after missing just three math questions.

A student who contacted me, asking me to call attention to the situation, described something similar:

My personal experience is similar to others, my score dropped by the 90 points that most students are reporting. My June SAT score was a 1390 but with previous scales it should have been a 1480. My score was actually 10 points off from what most colleges that I am planning to apply to are expecting. Another girl I talked to had a June SAT score of 1150 but with the previous scale it should have been a 1240. She was looking to gain more scholarships and aid for the college she was accepted into.

When the student emailed me, she included a breakdown of the number of questions at each difficulty level on the last few exams, and in comparison to the May test, there were notably fewer hard questions on all three sections of the June test (17 vs. 21 for reading; 3 vs. 9 for writing; and 16 vs. 25 for math).

Now obviously, it is impossible to ensure perfect consistency from exam to exam, and an easier test should have a less forgiving scale. If you’re interested in the nitty-gritty of how scales get tinkered with post-exam, Brian McElroy of McElroy Tutoring has a detailed explanation of the process. But I would also argue that getting too caught up in the minutiae of how exams are equated, whether before or after the test, is really to miss the point here.
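
To make the arithmetic concrete, here is a toy sketch of how equating can produce the result Taylor described. The two conversion tables below are invented for illustration only and are merely loosely consistent with the scores reported in the Post article; they are not the actual March or June scales, which the College Board does not break down this way publicly.

```python
# A toy sketch (not real College Board data): two hypothetical raw-to-scaled
# curves for the math section. On an easier form, each miss costs more scaled
# points, so missing fewer questions can still yield a lower score.

MARCH_CURVE = {0: 800, 1: 800, 2: 790, 3: 780, 4: 780, 5: 770}  # misses -> scaled
JUNE_CURVE  = {0: 800, 1: 780, 2: 750, 3: 720, 4: 700, 5: 680}  # harsher curve

def scaled_math_score(misses: int, curve: dict) -> int:
    """Look up the scaled score for a given number of missed questions."""
    return curve[misses]

print(scaled_math_score(5, MARCH_CURVE))  # 770 -- five misses on the harder form
print(scaled_math_score(3, JUNE_CURVE))   # 720 -- three misses on the easier form
```

Equating is supposed to keep back-to-back forms close enough in difficulty that the two curves never diverge this sharply in the first place, which is exactly the complaint.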

Yes, there is no way to predict with 100% certainty how a particular group of test-takers will fare on a given exam. But the very fact that the College Board somehow ended up with such strikingly different numbers of hard questions on back-to-back administrations suggests that something is very wrong.

To state the obvious, the number of questions at each difficulty level should remain more or less consistent from test to test; a student who answers more questions correctly on a retake should not see their score drop by these margins. Ten or 20 points, fine; 30, maybe; but 50-100 is just too extreme. By definition, a standardized test must be consistent. If it isn’t consistent, it isn’t standardized. These kinds of wild swings simply did not occur before David Coleman took over, a fact that is even more notable when you consider that the old exam had five levels of difficulty rather than just three. That version of the test may have had its problems, but it was calibrated exceedingly carefully and produced remarkably stable results.

Even if you consider this level of variation acceptable, there seems to be an additional problem. A student who commented on the WaPo article also made the following point, which, interestingly, was not mentioned in the article itself:

There is also the fact that 4 questions were thrown out by CollegeBoard for this test, 2 in reading and 2 in writing. Throwing out 4 questions (marked “unscoreable”) is unheard of. It reeks of a flawed test that was rushed. CB’s response is that students weren’t penalized for those missing 4 questions, but they were. Why? Because they still had to spend time answering them! And if these questions were so flawed that they had to be thrown out, it is not a stretch to believe students spent an inordinate amount of time trying to answer them. 

As I recall, the CB also threw out questions on one of the first administrations of the new exam. At the time, that could be passed off as a normal part of the transition period, but more than two years in, the excuse doesn’t hold water.

To understand how this type of scaling inconsistency could happen—particularly when nothing comparable occurred prior to 2016—it is important to realize that although ETS is still playing a role in the administration of the SAT, the exam is now being written directly by the College Board for the first time in its history. That was a major shift, and one that never received anywhere near enough scrutiny.

According to sources I spoke with around the time the redesigned exam was introduced, the most experienced College Board psychometricians were left out of the development process for the new test and replaced by weaker hires from the ACT.

And while there is still an experimental section on the new exam, it is no longer universally administered (at least to the best of my knowledge), and the selection process for new questions does seem to have become notably less rigorous. In the past, questions were field-tested for several years with a variety of demographic groups to ensure scoring consistency, but the current fiasco suggests that things are a lot sloppier now.

If you’re a senior already committed to taking the SAT, there is unfortunately little you can do at this point other than remain aware that scoring has the potential to be exceedingly inconsistent and that the published scales may not in fact be accurate. If you can stand to do so, you might want to allow for one additional sitting, in case something unexpected happens on your retake.

It’s possible that the College Board will tread more carefully when constructing future tests. But then again, given the inroads the CB has made into the state testing market and in recapturing market share from the ACT, the organization doesn’t have much of an incentive to be careful—huge numbers of students will still be required to take the SAT regardless of its scoring irregularities, and students who sign up for the Saturday test can be dismissed as whiners who don’t properly appreciate the subtleties of the equating process. If things are working well enough, why bother to fix them? Besides, admitting error is not exactly something the College Board is known for, especially these days.

So if you’re just beginning the test-prep process, I would still strongly recommend taking a hard look at the ACT, which remains a far less risky prospect in terms of scoring consistency. This is particularly true if you are aiming for merit scholarships that have a clear cut-off. If your ability to pay for college is on the line, this is not a chance you should take.