The Washington Post reported yesterday that the new SAT will in fact continue to include an experimental section. According to James Murphy of the Princeton Review, guest-writing in Valerie Strauss’s column, the change was announced at a meeting for test-center coordinators in Boston on February 4th.
To sum up:
The SAT has traditionally included an extra section — either Reading, Writing, or Math — that is used for research purposes only and is not scored. In the past, every student taking the exam under regular conditions (that is, without extra time) received an exam that included one of these sections. On the new SAT, however, only students not taking the test with writing (essay) will be given versions of the test that include experimental multiple-choice questions, and then only some of those students. The College Board has not made it clear what percentage will take the “experimental” version, nor has it indicated how those students will be selected.
In all the public relations the company has done for the new SAT, however, no mention has been made of an experimental section. This omission led test-prep professionals to conclude that the experimental section was dead.
He’s got that right — I certainly assumed the experimental section had been scrapped! And I spend a fair amount of time communicating with people who stay much more in the loop about the College Board’s less publicized wheelings and dealings than I do.
The College Board has not been transparent about the inclusion of this section. Even in that one place it mentions experimental questions—the counselors’ guide available for download as a PDF — you need to be familiar with the language of psychometrics to even know that what you’re actually reading is the announcement of experimental questions.
The SAT will be given in a standard testing room (to students with no testing accommodations) and consist of four components — five if the optional 50-minute Essay is taken — with each component timed separately. The timed portion of the SAT with Essay (excluding breaks) is three hours and 50 minutes. To allow for pretesting, some students taking the SAT with no Essay will take a fifth, 20-minute section. Any section of the SAT may contain both operational and pretest items.
The College Board document defines neither “operational” nor “pretest.” Nor does this paragraph make it clear whether all the experimental questions will appear only on the fifth section, at the start or end of the test, or will be dispersed throughout the exam. During the session, I asked if all the questions on the extra section won’t count and was told they would not. This paragraph is less clear on that issue, since it suggests that experimental (“pretest”) questions can show up on any section.
When The Washington Post asked for clarification on this question, they were sent the counselor’s paragraph, verbatim. Once again, the terminology was not defined and it was not clarified that “pretest” does not mean before the exam, but experimental.
For starters, I was unaware that the term “pretest” could have a second meaning. Even by the College Board’s current standards, that’s pretty brazen (although closer to the norm than not).
Second, I’m not sure how it is possible to have a standardized test that has different versions with different lengths, but one set of scores. (Although students who took the old test with accommodations did not receive an experimental section, they presumably formed a group small enough not to be statistically significant.) In order to ensure that scores are as valid as possible, it would seem reasonable to ensure that, at bare minimum, as many students as possible receive the same version of the test.
As Murphy rightly points out, issues of fatigue and pacing can have a significant effect on students’ scores — a student who takes a longer test will, almost certainly, become more tired and thus more likely to incorrectly answers questions that he or should would otherwise have gotten right.
Second, I’m no expert in statistics, but there would seem to be some problems with this method of data collection. Because the old experimental section was given to nearly all test-takers, any information gleaned from it could be assumed to hold true for the general population of test-takers.
The problem now is not simply that only one group of testers will be given experimental questions, but that the the group given experimental questions and the group not given experimental questions may not be comparable.
If you consider that the colleges requiring the Essay are, for the most part, quite selective, and that students tend not to apply to those schools unless they’re somewhere in the ballpark academically, then it stands to reason that the group sitting for the Essay will be, on the whole, a higher-scoring group than the group not sitting for the Essay.
As a result, the results obtained from the non-Essay group might not apply to test-takers across the board. Let’s say, hypothetically, that test takers in the Essay group are more likely to correctly answer a certain question than are test-takers in the non-Essay group. Because the only data obtained will be from students in the non-Essay group, the number of students answering that question correctly is lower than it would be if the entire group of test-takers were taken into account.
If the same phenomenon repeats itself for many, or even every, experimental question, and new tests are created based on the data gathered from the two unequal groups, then the entire level of the test will eventually shift down — perhaps further erasing some of the score gap, but also giving a further advantage to the stronger (and likely more privileged) group of students on future tests.
All of this is speculation, of course. It’s possible that the College Board has some way of statistically adjusting for the difference in the two groups (maybe the Hive can help with that!), but even so, you have to wonder… Wouldn’t it just have been better to create a five-part exam and give the same test to everyone?