11

Of CATs and Claims

The First Step toward Wisdom

The first step toward wisdom is calling
things by their right names.
—Confucius

I've got a little CAT,
And I'm very fond of that.
—Joseph Tabrar, 1892

The goal of this book parallels the principal goal of much of science—to try to understand the world through the integration of evidence and careful thought. I would like to end with a brief discussion of one instance where this tactic, which has always seemed completely convincing to me, was an utter failure in convincing some others. Such failures occur, alas, with disheartening frequency. Much of this book has tried to establish that anecdotes are not data, and so basing decisions on anecdotes yields no guarantee of a positive outcome. In this chapter I try to go one step further and emphasize an important, but too rarely recognized, distinction between data and evidence. One might say, “I believe that you'll have a nice birthday, but my belief is not supported by any data.” Of course there are likely lots of data: the temperature, the Yankees' team batting average, my blood sugar level. What I mean is, “My belief is not supported by any evidence.”

Evidence is data related to a claim. If I want to establish a claim regarding your facility in French, I could record your shoe size and Social Security number—these are undoubtedly data—but they are not evidence. To illustrate this distinction on a topic relevant to this book's theme, I will take the opportunity to introduce you to a modern marvel of measurement, computerized adaptive testing, in which the test shapes itself to suit the examinee. It is in the use of this technology that an instance of confusion between data and evidence emerges. But I am getting ahead of myself; let me begin at the beginning.

THE SHIFTING CHARACTER OF TESTING FROM INDIVIDUAL TO GROUP ADMINISTRATIONS

The use of tests to improve the use of human resources is very old indeed, with recorded instances reaching back to China's Xia dynasty more than 4,000 years ago. But the beginning of the twentieth century saw a shift in the way tests were administered. For most of testing's history the majority of examinations were given one-on-one, by a wise examiner to a nervous candidate. This began to change with the nineteenth-century Indian civil service exams instituted under the British Raj, and accelerated when the military needed to classify large numbers of enlisted men in a short time at modest cost. After World War I, the mass-administered examination became the dominant form.

A mass-administered exam had many advantages. It was much cheaper and easier to administer, especially after the technology of multiple-choice items was mastered. Also, because its content could be carefully scrutinized and standardized, it was fairer and vastly more reliable. And because each test item took a relatively small amount of time to administer, a much broader span of topics could be covered, thus minimizing the possibility that an unfortunate choice of topics for the exam would disadvantage examinees whose instruction offered a slightly different view of the subject matter.

SHORTCOMINGS OF GROUP ADMINISTRATIONS

But group administration also had some drawbacks. The most important one was efficiency for the individual examinee. When a test is individually administered the examiner doesn't waste time asking inappropriate questions. If a question is clearly too hard we learn nothing when the examinee gets it wrong; similarly, if a question is too easy, we learn nothing when she gets it right. We get maximal information from a question that is aimed close to the examinee's ability. A mass-administered standardized test must be constructed for the ability distribution of the anticipated population, with most of its items in the middle of the difficulty range and relatively few at the extremes. Thus the examinees whose abilities are near the middle are measured most accurately, whereas those at the extremes are measured much less so. This issue becomes critical for diagnostic testing, in which we wish to use performance on the test to guide further instruction. An individualized exam can zero in on the examinee's weaknesses and pinpoint areas for remediation. A mass-administered exam would need to become impractically long to yield equivalent results.

CATS: A COMPROMISE

All this changed in the latter part of the twentieth century as computers moved out of climate-controlled rooms and onto everyone's desktop. A new form of testing was developed called a Computerized Adaptive Test, or CAT for short.1

A CAT tries to emulate a wise examiner. In a CAT the examinee sits in front of a computer monitor and the testing algorithm begins the test by selecting an item of moderate difficulty from a large pool of test items. These items had been constructed previously to span the entire range of examinee abilities. If the examinee gets the item right the CAT algorithm chooses a more difficult one; if the examinee gets it wrong it selects an easier one. This continues until the algorithm has zeroed in on the examinee's ability with as much accuracy as required. Typically, this takes about half as many items as would be required for a linearly administered paper-and-pencil test.
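The selection loop just described can be sketched in a few lines of Python. Everything here (the pool of twenty items per difficulty level, the one-step up/down rule, and the fixed test length) is an assumption invented for illustration; a real CAT chooses items and decides when to stop using item response theory, not this simple ladder.

```python
import random

# Hypothetical item pool: 20 items at each difficulty level from 1 to 10.
# (Pool size and the 1-10 scale are assumptions made for illustration.)
POOL = {d: [f"item-{d}-{i}" for i in range(20)] for d in range(1, 11)}

def run_cat(answers_correctly, n_items=10, start=5):
    """Administer a toy CAT: step up after a right answer, down after a wrong one."""
    difficulty = start
    record = []  # (item, difficulty, correct) for each administered item
    for _ in range(n_items):
        item = random.choice(POOL[difficulty])  # pick among equally hard items
        correct = answers_correctly(difficulty)
        record.append((item, difficulty, correct))
        if correct:
            difficulty = min(10, difficulty + 1)  # harder next time
        else:
            difficulty = max(1, difficulty - 1)   # easier next time
    return record

# A simulated examinee of ability 7: right on items at or below difficulty 7.
record = run_cat(lambda d: d <= 7)
```

Starting from the moderate difficulty 5, the administered difficulties climb to the examinee's level and then oscillate around it, which is the zeroing-in the text describes.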

Of course, the item selection is more complicated than this. The algorithm must also choose items that cover all aspects of the topic being examined, and it tries to use all items at each difficulty level equally. The reason for the former restriction is obvious; the latter is a matter of test security. Because computers are more expensive than #2 pencils, CATs are typically administered a few at a time, continuously. This is in sharp contrast to gathering hundreds of examinees together in a school gymnasium on a specific Saturday in October and passing out test books, answer sheets, and #2 pencils. Continuous test administration opens wider the doors to cheating through item pilferage, hence the need for large item pools in which all items at the same difficulty level are used with more or less equal frequency.

A CAT is scored differently than a linear test. Because all examinees correctly answer approximately 50 percent of the items presented to them, the old scoring standby “percentage correct” is meaningless. Instead we must use something akin to “the difficulty of the hardest items answered correctly consistently.” Or, more metaphorically, think of the test items as a sequence of hurdles of increasing heights. A typical response pattern would yield a series of low hurdles (easy items) left standing and a sequence of high hurdles (hard items) knocked over. The examinee's ability is measured in the same metric as the height of the hurdles, and it lies between the height of the highest hurdle cleared and the height of the lowest hurdle knocked over. If an examinee answers all items asked of her correctly, all we can say for sure is that her ability is greater than the difficulty of the most difficult item presented.
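The hurdle metaphor suggests a simple bracketing rule, sketched below under the same toy assumption of a single difficulty number per item. This is an illustration only; operational CATs score with maximum-likelihood or Bayesian estimates from an item response model.

```python
def ability_bracket(record):
    """Bracket ability between the highest hurdle cleared and the lowest
    hurdle knocked over, given (difficulty, correct) pairs."""
    cleared = [d for d, correct in record if correct]
    missed = [d for d, correct in record if not correct]
    low = max(cleared) if cleared else None   # highest hurdle cleared
    high = min(missed) if missed else None    # lowest hurdle knocked over
    return low, high

# Easy hurdles left standing, hard ones knocked over:
low, high = ability_bracket([(3, True), (5, True), (7, True), (8, False), (9, False)])
# ability lies between difficulty 7 and difficulty 8

# All items answered correctly: no upper bound on ability.
low_only, no_high = ability_bracket([(4, True), (6, True)])
```

The second call shows the case the text ends with: when every presented item is answered correctly, the bracket has no top, and all we know is that ability exceeds the hardest item presented.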

CATS CAN DO A LOT, BUT NOT EVERYTHING

When CATs were first introduced we were all delighted with what could be done.

 

1. Individuals can work at their own pace, and speed of response can be recorded as additional information.

2. Each individual can stay busy productively—everyone is challenged but not discouraged.

3. The test can be scored immediately.

4. A greater variety of questions can be included, expanding the test developer's repertoire beyond multiple-choice questions.

5. Any errors uncovered in the test's items can be corrected on the fly, and hence the number of examinees exposed to a flawed item can be limited.

But some benefits of linear testing were lost. For example, you cannot review previously answered items and change your answer. Because the efficiency of the test depends on selecting an appropriate item after observing your response, if you change that response the item selected is no longer optimal. Skipping items, either with the notion of coming back to them later, or just going on and seeing what comes next, is no longer acceptable, because the item selection algorithm doesn't know what to do next. Thus, if you omit an item, your response must be treated as incorrect.

DOING IT ANYWAY: THE TRIUMPH
OF HOPE OVER LOGIC

The problem is never how to get new,
innovative thoughts into your mind,
but how to get old ones out.
—Dee Hock, founder and former
CEO of VISA International

In a pair of articles published in 1992 and 1994, Mary Lunz and her colleagues asserted that “the opportunity to review items and alter responses is important to examinees,” that they “feel at a disadvantage when they cannot review and alter their responses,” and that “the security of a final review can provide comfort and reassurance to the examinee.” They looked into allowing examinees some control over their exam by allowing them to skip an item they didn't want to answer. If an item was skipped, another one was chosen from the pool on the same topic and at about the same difficulty. Lunz et al. described experimental results that suggested that allowing examinees this sort of freedom had only a small effect on the outcome and hence recommended that operational tests consider including this additional flexibility.

Let us consider each of these ideas.

First, skipping: Consider a spelling test with a corpus of, say, 200,000 words. If I asked you to spell 100 words that I had chosen at random and you spelled 80 of them correctly, I could safely infer that you could spell roughly 80 percent of the words in the corpus. But suppose after I asked you a word, you could choose to skip it and its successors until I offered a word that was more to your liking. And we continued doing this until you accepted 100 words (perhaps it took 1,000 words to arrive at these 100). Now if you spelled all 100 of these correctly, what could we infer about your spelling ability? All we can be sure of is that you can spell at least 100 words out of the corpus. The difference between these two tests—with and without skipping—is what statisticians characterize as the difference between ignorable missingness and nonignorable missingness. In the first case your nonresponses to the 199,900 words that I didn't ask you about can be ignored, because I know why you didn't respond—I didn't ask them. But in the second case your nonresponse to the 900 words that you refused to answer cannot be ignored, because I don't know why you didn't answer. I might infer that you didn't try to spell them because you didn't know how, but even that isn't certain. So all that allowing skipping gets us is an increase in the uncertainty about the meaning of the test score.
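A small simulation makes the contrast concrete. The corpus size, the 80 percent spelling rate, and the random seed are all assumptions made for illustration: random sampling recovers the examinee's true rate, while unlimited skipping yields a perfect score regardless of ability.

```python
import random

random.seed(1)

# A toy corpus of 200,000 "words"; the examinee can spell 80% of them.
CORPUS = list(range(200_000))
CAN_SPELL = set(random.sample(CORPUS, int(0.8 * len(CORPUS))))

# Ignorable missingness: the examiner picks 100 words at random.
asked = random.sample(CORPUS, 100)
pct_random = sum(w in CAN_SPELL for w in asked) / 100  # close to 0.80

# Nonignorable missingness: the examinee skips every word she can't
# spell until she has accepted 100 words, then spells them all.
accepted = []
for w in CORPUS:
    if w in CAN_SPELL:
        accepted.append(w)
    if len(accepted) == 100:
        break
pct_skipping = sum(w in CAN_SPELL for w in accepted) / 100  # always 1.00
```

The first score estimates ability; the second is 100 percent whether the examinee can spell 80 percent of the corpus or 8 percent, which is exactly why the missingness cannot be ignored.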

Second, item review: Allowing item review can be benign, if an examinee doesn't change very many responses, or if the responses that are changed occurred toward the end of the test, when her ability had already been pretty well estimated. But if she decides to revise the answers to items at the very beginning of the test, there could be a more profound effect. In addition, since most tests that allow review and revision of answers do so only after the test has been completed, revision of responses must diminish the efficiency of the sequence of items that were chosen for the examinee. Thus the accuracy of the final score will be diminished. Whether this inaccuracy is of serious consequence depends on many factors. But neither accuracy nor efficiency is our primary concern, for minor variations from optimality yield only minor effects.

Our concern is with the consequences of using the flexibility provided by these well-intentioned deviations from proper procedure to game the system.

GATHERING DATA, BUT NOT EVIDENCE

The desire to keep the old procedures within the new format was remarkably strong. Hence there were dozens of studies2 measuring the effect of allowing skipping and revision in a CAT. Many of these studies were extensive and careful with control groups, random assignment, and all of the other features that characterize the gold standard of science. And these studies provided data that supported the notion that examinees who made only modest use of these extra freedoms would not change their final scores very much.

But are these data evidence? To answer this question, we must make explicit exactly what are the claims that these data were gathered to support. As I discussed in chapter 6, tests can have three purposes:

 

1. Tests as measurements

2. Tests as prods

3. Tests as contests

When tests are used for measurement, as they usually are to guide decisions of placement or instruction, there is little incentive to cheat. For if you score higher than you deserve, you might end up in a too difficult course or missing out on some needed remediation. No sane person cheats on an eye test.

When a test is a prod, for example a test given to get students to study, cheating may be a sensible strategy, depending on the consequences of the score.

But when tests are contests, where the higher scorer wins a job, gets a license, earns a scholarship, or is offered entry into an elite school, cheating may have practical value. This value may overwhelm the legal, moral, and ethical strictures against it.

And so the data generated by the studies examining moderate use of the freedom to skip items and change answers do provide evidence supporting claims about the use of CATs for measurement. But what about for contests?

GAMING THE SYSTEM

Alas, for those examinees who decide to game the system, these modifications make it easier. Consider the situation where a test is used for licensure. This is high-stakes indeed, for if you pass, you can practice your profession, whereas if you fail, you cannot. So the goal of the examinee is to get a score above some established standard. If the standard is high relative to an examinee's knowledge, is there a strategy that can be followed to increase her chances of passing?

Suppose someone follows these four steps:

 

1. Never answer a question whose answer you do not know (keep skipping until you get one you know).

2. If you know the answer, be sure to choose an incorrect response. The item selection algorithm will then provide you with an easier one.

3. Continue with this until the testing session is complete. With the full assistance of the item selection machinery, you should have then received the easiest test that could be built from the item pool.

4. Use the opportunity to review your responses to change the answers to all items so that they are all correct.

 

Now comes the fun. Depending on the scoring method used, you will either be given a passing score (remember that if you get all items right, all that can be said of your ability is that it is greater than the most difficult item asked), or you get to smirk in court as the testing company tries to explain to a judge how it could fail someone who answered correctly every question she was asked. I imagine somewhere in the cross-examination the testing company will explain that you failed because you only answered easy questions, to which you can respond, “But those were the only questions you asked me.”
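The four steps can be played out in a few lines. This is a sketch, not a model of any real CAT: it assumes a selector that simply eases the difficulty one step after every miss, and a pool whose easiest items sit at difficulty 1.

```python
def easiest_test(n_items=10, start=5):
    """Steps 1-3: deliberately miss every item, so the (assumed) selector
    steps the difficulty down to the floor of the pool."""
    difficulty, asked = start, []
    for _ in range(n_items):
        asked.append(difficulty)
        difficulty = max(1, difficulty - 1)  # selector eases after each miss
    return asked

asked = easiest_test()            # difficulties: 5, 4, 3, 2, then all 1s
# Step 4: on review, change every answer so all are correct.
final_record = [(d, True) for d in asked]
# With every presented item answered correctly, all a scorer can say is
# that ability exceeds the hardest item asked, here the modest starting item.
hardest_asked = max(asked)
```

The final record contains nothing but correct answers to easy items, which is precisely the courtroom scenario: no response in it distinguishes a weak examinee who gamed the selector from a strong one who was never asked anything hard.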

This illustrates my point that data gathered to study these two kinds of modifications to CATs provides no evidence regarding claims when there is a real incentive to game the system. An experiment that would provide such evidence might use some examinees whose abilities are just below the passing standard and ask them to try this strategy and count how many of them can work it to a successful conclusion.

Of course one needn't be this extreme to game the system. There are many other possible strategies. For example, if, after you answer an item, the next one seems clearly easier, you can conclude you made a mistake and change your answer. But expanding on ways to beat the system is not my point. It is enough to show that by deviating from the “rules” of CAT you get into trouble. Some sources of trouble might be foreseeable and perhaps mitigated, but some surely will not be.

WHERE ARE WE NOW?

Don't worry about people stealing your ideas.
If your ideas are any good, you'll have
to ram them down people's throats.
—Howard Aiken3

Proposals for stretching the rules for CATs were made in 1992. In 1993 I laid out how those stretches could allow a motivated examinee to game the system. This warning did not deter large numbers of testing specialists from recommending these changes or from publishing irrelevant studies purportedly showing that such strategies were unlikely to have large effects. In the more than seventeen years since, published studies are still appearing4 that offer suggestions on minimizing the potential damage. Too often data, but not evidence, are offered in support.

Old theories never die,
just the people who believe them.
—Albert Einstein


This chapter developed from H. Wainer, “Some Practical Considerations When Converting a Linearly Administered Test to an Adaptive Format,” Educational Measurement: Issues and Practice 12 (1): 15-20, 1993.

1 The standard reference for CAT is Wainer 2000.

2 See Vispoel 1998 for a thorough and accurate summary of this work.

3 Howard Hathaway Aiken (1900-1973) was a pioneer in computing, the primary engineer behind IBM's Harvard Mark I computer.

4 A recent instance by Papanastasiou and Reckase (2007) tries to provide a workaround to help mitigate the damage. I remain convinced that the best strategy is to avoid the problem rather than to try to repair its damage.
