Baron, J., & Ubel, P. A. (2002). Types of inconsistency in health-state utility judgments. Organizational Behavior and Human Decision Processes, 89, 1100-1118.

Types of inconsistency in health-state utility judgments

Jonathan Baron and Peter A. Ubel
University of Michigan

Abstract

In making judgments of health-related quality of life, respondents often compare the relative magnitude of two intervals between health states, such as the interval between normal health and blindness compared to that between normal health and death. We examined two ways of comparing such intervals: person trade-off (PTO) - in which the judgment concerns matching numbers of people so that two changes are equivalent - and direct judgment of the ratio. Both measures showed ratio inconsistency (a ratio that should be the product of two other ratios is too high) and superadditivity (two ratios that should add to 1 are too high). Some responses in both methods implied that two intervals which should have been different (because they shared a top or bottom point, but differed on the other point) were nevertheless viewed by subjects as being of equal size. These equality responses were more common when death was the bottom (worse end) of both intervals being compared (e.g. the interval between death and blindness is perceived as being the same size as the interval between death and normal health) than when any other condition was at the bottom or when the condition common to the two intervals was at the top. A second experiment indicated that subjects really do consider the intervals to be equal. Our findings argue for giving subjects a chance to reflect on such apparent inconsistencies in practical utility elicitation.

Introduction

Choice among programs, such as those concerned with allocation of medical resources, is sometimes based on measurement of benefits of competing options on a common scale of utility. Utility measures require numerical judgments based on descriptions of health states. In recent years, a few methods have been used to elicit these judgments. In standard gambles, subjects are asked for a probability p at which they would be indifferent between obtaining some health state (e.g., blindness) and a gamble with a p chance of obtaining a better health state (e.g., normal health) and a 1-p chance of obtaining a worse state (e.g., death). In direct judgment, they are asked to assign a number to blindness on a scale in which 0 is death and 100 is normal health (for example). In time trade-off, they are asked how many years of normal health, followed by death, are equivalent to, for example, 10 years of blindness followed by death. And in person trade-off, they are asked, for example, how many people must be prevented from dying for this to be just as good as preventing 100 people from becoming blind. These methods have been used in practical decisions and have been the topic of extensive study (Baron, 1997; Bleichrodt & Johannesson, 1997; Krabbe, Essink-Bot, & Bonsel, 1997; Llewellyn-Thomas, 1997; Nord, 1995, 1999; Ubel, Loewenstein, Scanlon, & Kamlet, 1996, 1998). Similar methods have been used in decision analysis (Keeney & Raiffa, 1976; von Winterfeldt & Edwards, 1986).

At least three factors make such measurement difficult (Baron, 1997). First, the judgment may over- or under-weigh some factor that makes a health condition more or less severe, such as the extent of adaptation to a chronic condition. More generally, people may change the way in which they think about a health condition as they reflect on what it involves. This is not an issue that we address here. Second, judgments may be affected by aspects of the judgment task that are unrelated to the utility of the conditions. For example, responses to hypothetical gambles may be affected by attitudes toward risk or by the overweighing of certainty (de Neufville & Delquié, 1988; Tversky & Kahneman, 1992). Third, numerical responses may be distorted by scaling problems, such as tendencies to underweigh differences farther away from a reference point (Tversky & Kahneman, 1992) or to space judgments evenly along the given range (Mellers & Birnbaum, 1983; Poulton, 1979).

The studies reported here investigate the last two problems as they apply to two methods that might be used for utility estimation: difference judgments and person-tradeoff (PTO). The main purpose of this article is to determine the nature of internal inconsistencies in PTO and direct measures, so that we can know how to deal with them for practical purposes. One inconsistency of particular interest results from judgments which imply that two health states are equally distant from a third (e.g., death) yet still different from each other.

We chose these two methods because they represent two general approaches to measurement. In direct judgment, subjects focus on the conditions and make a sort of psychophysical comparison of magnitudes. The PTO, by contrast, requires a hypothetical decision, and is therefore subject to influence by factors other than the magnitudes in question. The PTO, like standard gamble and time-tradeoff, has been advocated exactly because it is potentially sensitive to factors, such as fairness, other than utility magnitudes of health states.

The two methods: difference judgments and person-tradeoff

We think of each judgment as a comparison of two utility intervals, such as the interval between condition A and condition B versus the interval between conditions A and C. One of the ``conditions'' could be normal health, or death. Each judgment permits us to infer the ratio of the two utility differences, e.g., the size of the smaller interval relative to the larger. In difference judgments, this is what the subject is asked to judge.

In PTO judgments, we infer the judgment from the subject's answer. If preventing 100 people's health from declining from Health Condition A to B is judged to be equally important as preventing 200 people from declining from A to C, we infer that the difference between A and C is half of the distance between A and B. This inference assumes that the subject takes the utilitarian approach of adding differences across people. (Again, we return to this assumption later.)

In difference judgments, the subject makes a direct comparison of the sizes of the two intervals, such as the interval between normal health and blindness and the interval between normal health and death. This corresponds to the method of swing weights used in decision analysis (Fischer, 1995; von Winterfeldt & Edwards, 1986). Difference judgments may be distorted in several ways (Baron, 1997; Birnbaum, 1978). Of interest here is the effect of distance from the reference point. Much evidence supports the general conclusion that people underweigh differences that are more distant from a reference point, regardless of whether the states at issue are better or worse than the reference point. For example, people seem to judge that the disutility of monetary losses is marginally declining as the loss increases: losing $100 seems less bad when it is added to a $900 loss than when it occurs alone (Kahneman & Tversky, 1979; Thaler, 1985). When we measure the utility of health states, two different reference points are available: normal health and death. It may matter which we use. If some condition X is roughly half as bad as death, then people may judge it as numerically closer to death when they use normal health as the reference, and as closer to normal health when they use death as the reference. When people are asked ``How large is the difference between normal health and X compared to the difference between normal health and death?'', they might answer 60%, and when they are asked ``How large is the difference between X and death compared to the difference between normal health and death?'', they might answer 60% again. Their answers would thus sum to 120%, not 100%. We call this effect superadditivity.

In PTO judgments, the subject provides a number of people such that some benefit (harm) to that number is just as good (bad) as another benefit (harm) to a given number of other people (Nord, 1995; Ubel, Loewenstein, Scanlon, & Kamlet, 1996, 1997). This provides a utility measure if we assume that the judgment is utilitarian, so that the total benefit to a group is just the benefit per person times the number of people. For example, if preventing 250 cases of blindness is seen as equivalent to preventing 100 deaths (of normal people), then we infer that the interval between normal and blindness is .40 of the interval between normal and death. This is because the total utility of preventing blindness is .40 ·250, which equals the total utility of preventing 100 deaths. If we infer the utility of blindness on a scale on which normal is 1 and death is 0, then we infer that blindness has a utility of .60.
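As a rough illustration of this arithmetic, the inference can be sketched in Python. The function name and its arguments are hypothetical, introduced only for this sketch, and the calculation holds only under the utilitarian assumption just stated.

```python
def pto_interval_ratio(n_larger_change, n_smaller_change):
    """Size of the smaller utility interval relative to the larger one.

    n_larger_change:  people spared the larger change (e.g., normal -> death)
    n_smaller_change: people spared the smaller change (e.g., normal -> blindness)
                      when the two programs are judged equally good.
    Assumes total benefit = benefit per person * number of people.
    """
    if n_larger_change <= 0 or n_smaller_change <= 0:
        raise ValueError("numbers of people must be positive")
    return n_larger_change / n_smaller_change


# Example from the text: preventing 250 cases of blindness is judged
# equivalent to preventing 100 deaths.
ratio = pto_interval_ratio(n_larger_change=100, n_smaller_change=250)
utility_of_blindness = 1 - ratio  # scale with death = 0 and normal health = 1
print(round(ratio, 2), round(utility_of_blindness, 2))  # 0.4 0.6
```

The same function covers the item format used in the experiments reported here: if X people prevented from the larger change are matched to 100 people prevented from the smaller change, the implied ratio is X/100.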

PTO judgments are affected by principles of fairness, regardless of the condition that remains when lives are saved (Nord, 1995; Ubel, Loewenstein, Scanlon, & Kamlet, 1996, 1997). PTO measures have been advocated as a way of capturing attitudes about policy. Some of these attitudes are not utilitarian. Still, if these measures are not internally consistent, it is difficult to see how they can be used. Internal inconsistency may arise from fairness considerations. Of particular interest here is the desire to give patients equal opportunity for treatment regardless of their present condition.

This principle may be applied in particular to saving lives. People may feel that the decision to save lives should not depend on the quality of the life after it is saved. There are many reasons for such a view. For example, people may feel that health professionals should not be in the business of judging the worth of a person's life. Once we allow such factors as handicaps to enter decisions about life saving, where do we draw the line between these conditions and such other factors as number of friends or economic productivity? People may also think in terms of the emotional or utility effects of changes rather than differences between existence in two different states, and it may be approximately true that the reaction to the change from a prospect of certain death to the prospect of continued life may be just as good for a person with a severe handicap as to a person without one. Or, alternatively, people may just apply a heuristic of fairness.

Consistency checks

The present experiments look for these effects. In part for this purpose, we test consistency among judgments in two ways, explained shortly, which we call ratio consistency and additivity. Consistency checks are also of interest in their own right. Few efforts to test consistency are reported in the literature (Bleichrodt & Johannesson, 1997; Miyamoto & Eraker, 1988). For PTO, only ratio consistency has been tested (Ubel, Loewenstein, Scanlon, & Kamlet, 1996), and it is, as we show, a necessary but not a sufficient condition for utility measurement. Internal consistency is necessary if a judgment is to be interpreted as a judged utility function and used for making decisions that depend on such judgments. Yet, both researchers and practitioners in medical areas almost never test internal consistency in any way.

In the experiment reported here, subjects made PTO and difference judgments on the same set of conditions, all of which are long lasting and serious. To describe the judgments in general, we use the notation u(A)-u(B) to mean the utility difference, and we take J{[u(A)-u(B)]/[u(A)-u(C)]} to refer to the judged or inferred ratio of u(A)-u(B) to u(A)-u(C). We can think of the denominator u(A)-u(C) as the unit of utility. For example, u(A)-u(B) could be the difference between normal health and blindness, and u(A)-u(C) could be the difference between normal health and blindness-and-deafness. Someone who thought that blindness was half as bad as blindness-and-deafness, relative to normal health, would judge the utility of u(A)-u(B) as .5 using u(A)-u(C) as the standard. In other words, they would judge the u(A)-u(B) interval as half the size of the u(A)-u(C) interval.

As noted, we test two kinds of internal consistency: ratio consistency and additivity. Ratio consistency (Fagot, 1978) tests the condition that


J{[u(A)-u(B)]/[u(A)-u(C)]} ·J{[u(A)-u(C)]/[u(A)-u(D)]} = J{[u(A)-u(B)]/[u(A)-u(D)]}

If we suppose that judgments are insensitive to differences among utility intervals, then


J{[u(A)-u(B)]/[u(A)-u(C)]} ·J{[u(A)-u(C)]/[u(A)-u(D)]} < J{[u(A)-u(B)]/[u(A)-u(D)]}

For example, suppose that A is normal health, B is blindness, C is combined blindness and deafness, and D is death. Then each interval is the ``badness'' of B, C, or D compared to normal health. If B is half as bad as C and if C is half as bad as D, then B should be a quarter as bad as D. If, in the extreme, a subject is so insensitive to differences among utility intervals as to say ``half'' to every question, the product of the first two judgments will still be a quarter, but the third judgment will be a half. Such insensitivity could result from undersensitivity to large differences, oversensitivity to small differences, or both. Ubel, Loewenstein, Scanlon, and Kamlet (1996, 1997) found such an effect in PTO and called it ``multiplicative intransitivity.'' Fagot (1978) found ratio inconsistency (though not of the same kind that we find) in a variety of psychophysical judgments; we use Fagot's term.
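The example can be checked numerically with a short Python sketch. The function and argument names are hypothetical; the three arguments stand for the judged ratios J{[u(A)-u(B)]/[u(A)-u(C)]}, J{[u(A)-u(C)]/[u(A)-u(D)]}, and J{[u(A)-u(B)]/[u(A)-u(D)]}.

```python
def ratio_check(j_b_over_c, j_c_over_d, j_b_over_d):
    """Return (product of the two component ratios, directly judged ratio).

    Ratio consistency requires the two values to be equal.
    """
    return j_b_over_c * j_c_over_d, j_b_over_d


# A consistent judge: half, half, and therefore a quarter.
print(ratio_check(0.5, 0.5, 0.25))  # (0.25, 0.25)

# The extreme insensitive judge who answers "half" to every question:
# the directly judged ratio exceeds the product.
print(ratio_check(0.5, 0.5, 0.5))   # (0.25, 0.5)
```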

Ratio inconsistency could also be related to the finding of Birnbaum and his colleagues (Birnbaum, 1978; Birnbaum & Sutton, 1992) that subjects asked to judge the ratio of two stimuli respond (with a nonlinear response function) to the difference between the stimuli rather than to the ratio of their distances from zero (no stimulation, in a sensory task). However, when subjects are asked for ratios of differences - e.g., what is the ratio between the utility (or loudness, etc.) difference between A and B and the difference between C and D? - they tend to base their responses correctly on the ratio of the differences, and not the difference of the differences. Birnbaum's result can be taken to imply that we must state the reference point explicitly if we want subjects to use it, so we do this when we ask about differences. Explicitness in stating the ends of ranges being compared is one of the prescriptions of decision analysis (Fischer, 1995), but it is not used routinely in other value-elicitation tasks. Still, explicit statement of end points may not remove all of the tendency to respond in terms of differences. Such a tendency could be described as a neglect of changes in the standard of comparison. When subjects attend more to differences (as opposed to ratios), they might be undersensitive to changes in the standard of comparison they are given, from question to question. Such underattention would also produce ratio inconsistency and is, in our procedures, indistinguishable from other explanations of this effect.

Additivity tests whether the judged utilities of two parts of an interval add up to the whole interval, that is, whether


J{[u(A)-u(B)]/[u(A)-u(C)]} + J{[u(B)-u(C)]/[u(A)-u(C)]} = 1

assuming A, B, and C are ordered. For example, if the difference between normal health and blindness is .3 of the difference between normal health and blindness-and-deafness, then the difference between blindness and blindness-and-deafness should be .7 of the difference between normal health and blindness-and-deafness. Superadditivity could result from numbers that are too high (.4 and .8 instead of .3 and .7). Superadditivity could also result from adoption of one of the states in the numerator (e.g., A in u(A)-u(B) or C in u(B)-u(C)) as the reference point, which would produce the effect if we assume that differences near the reference point loom larger.
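A minimal Python sketch of the additivity test follows, using the numbers from the example above. The function name is hypothetical; the score is simply the sum of the two judged ratios minus 1, so that 0 means the judgments are additive and a positive value means they are superadditive.

```python
def superadditivity(j_upper_part, j_lower_part):
    """Sum of the two judged ratios minus 1.

    j_upper_part: J{[u(A)-u(B)]/[u(A)-u(C)]}
    j_lower_part: J{[u(B)-u(C)]/[u(A)-u(C)]}
    """
    return j_upper_part + j_lower_part - 1.0


# Additive judgments (.3 and .7) versus judgments that are both too high (.4 and .8).
print(round(superadditivity(0.3, 0.7), 2))  # 0.0
print(round(superadditivity(0.4, 0.8), 2))  # 0.2
```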

Additivity is nearly sufficient for a utility scale, as such a scale can be defined in terms of units that are added. Additivity implies ratio consistency, but ratio consistency does not imply additivity. The Appendix explains these relationships, and it also discusses inversion consistency, which concerns the direction of comparison - smaller to larger interval or the reverse. Inversion inconsistency was found by Baron, Wu, Brennan, Weeks, and Ubel (2001). It is a special case of ratio inconsistency.

Experiment 1

This experiment compared two methods for eliciting judgments from which utility can be inferred. Each method compares two intervals, such as the interval between normal health and blindness and the interval between normal health and death. Each of the two intervals has a top (the better health state) and a bottom (the worse health state). Sometimes the two intervals share the same top, and sometimes they share the same bottom. Sometimes the bottom is death, and sometimes the top is normal health.

In the difference method, subjects make direct numerical judgments of the size of one interval compared to another, the standard, which has a size of 100. For example, people are asked to imagine that the difference between normal health and death is 100 and then asked what the difference between normal health and some other health state would be. We infer judgments from PTO responses as described earlier. In the PTO, we ask subjects to compare the benefit of preventing changes for the worse, such as death, getting a handicap that one does not have, or getting a worse handicap. In previous studies not reported here, we found that different wording, such as ``saving'' rather than ``preventing'' yielded identical results.

Subjects completed our studies on the World Wide Web. This method of recruitment is new (Birnbaum, 2000) and is therefore worthy of comment. We originally recruited subjects by posting to Usenet news groups, but then various other parties, such as the American Psychological Society, put links to our studies on their web pages, and still others found us with search engines (looking for terms like ``surveys'' and ``pay''). Subjects who wish to be paid must provide their name, email address, address, and (for U.S. residents) their Social Security Number. It is therefore at least as difficult as with other methods to complete the same questionnaire under different names. Several studies find little or no difference between web subjects and other subjects (Birnbaum, 2000; Soetikno, Mrad, Pao, & Lenert, 1997; Soetikno, Provenzale, & Lenert, 1997). Of course, the web, like paper, can be used in many ways. One feature of our approach is to include checks for sensible answers (using JavaScript, a programming language for web pages) so that subjects who do not take the study seriously find it difficult to complete.

Method

One hundred and fourteen subjects completed a questionnaire on the World Wide Web, for $3. Three additional subjects were eliminated because they gave the same response to every item (100 or 50). Ages ranged from 13 to 69 (median 29); 71% were female; and, coincidentally, 71% were non-students.

Each subject answered a series of 32 questions regarding the relative magnitudes of differences in health states. There were two question frames, one using PTO and one using the difference method, with 16 items each.

The introduction began:

This study is about different ways of eliciting numerical judgments of health. If we could measure the benefit of curing or preventing different conditions, we could allocate scarce resources so as to get the greatest benefit per dollar. The following questions are about methods that might really be used to measure the badness of various conditions. The worse a condition, the more good is done by preventing it.

All the items refer to conditions that people acquire at about age 60, from strokes, and last for the rest of their lives. DEATH refers to death at age 60. The conditions are always in CAPITAL LETTERS. Here are the conditions:

PARALYSIS OF LEGS
PARALYSIS OF ARMS AND LEGS
BLINDNESS
BLINDNESS AND DEAFNESS
NORMAL HEALTH
DEATH

All questions require judgments on a numerical scale from 0 to 100. It never makes sense here to give an answer of more than 100.

All questions concern comparison of two ranges. For example one range might be between normal health and deafness, and another range might be between deafness and death.

In one type of question, we ask about the relative size of two ranges. How large is the smaller range, as a percent of the larger one? You might think of each range as a distance defined by two locations. How large is the distance from New York to Paris, as a percent of the distance from Chicago to Moscow? This is the kind of judgment you must make.

The second kind of question is like this:

How many people must be prevented from changing from
normal health to death
in order to make this just as good as
preventing 100 people from changing from
normal health to deafness?

In this case, you should give a number less than 100. Preventing deafness does a lot of good, but not as much as preventing death, so it would require fewer cases of preventing death to do the same amount of good.

Going back to the distance example:

How many trips from
Chicago to Moscow
is just as far as
100 trips from
New York to Paris?

The subject then did a practice item using the deafness example. Rather than filling in a number, the subject started from an initial number of 50 and pressed one of two buttons to indicate which option was better, until the subject judged the options to be equal. This was to get the subject used to the idea of filling in a number that made two options equal. (The subject could not go on without doing the practice item.)

Each of the 32 screens presented either a PTO item or a difference item. The PTO item was of the form:

How many people must be prevented from changing from
A to B
in order to make this just as good as
preventing 100 people from changing from
C to D.

The difference item was of the form:

If the difference between
A and B
is 100, how big is the difference between
C and D?

In each case, one end of the two intervals was the same. For example, the ``tops'' of both intervals (A and C) could be normal health, or the ``bottom'' (B and D) could be death. The C-D interval was always intended as smaller than the A-B interval. Thus, we intended that the answers to both PTO and difference questions would be less than or equal to 100. For both kinds of questions, a response of 100 implies that the intervals are equal (if they are interpreted as utility intervals). Table 1 shows the basic comparisons, using paralysis items.

Insert Table 1.

For example, for the first item, the difference form was, ``If the difference between normal health and death is 100, how big is the difference between normal health and paralysis of legs?'' The PTO form was, ``How many people must be prevented from changing from normal health to death in order to make this just as good as preventing 100 people from changing from normal health to paralysis of legs?''

The same items were used for sensory impairments, with blindness substituted for paralysis of legs and blindness and deafness substituted for paralysis of arms and legs.

Notice that items 1-3 compare intervals with Normal in common at the top; 5, 6, and 8 compare intervals with Death in common at the bottom; 7 has a non-death condition at the bottom; and 4 has a non-normal condition at the top. Comparisons 1-3 allow a test of ratio consistency: the ratio in 1 should be the product of the ratios in 2 and 3. Likewise for 6, 8, and 5. Finally, comparisons 1 and 5 together test additivity, as do comparisons 2 and 6, 3 and 7, and 4 and 8.
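The structure of these tests can be illustrated with a short Python sketch. Two things here are assumptions rather than part of the study: the utility values are made up for illustration, and the assignment of interval pairs to item numbers is inferred from the description in the text (Table 1 itself is not reproduced here). N, L, AL, and D stand for normal health, paralysis of legs, paralysis of arms and legs, and death.

```python
import math

# Hypothetical utilities on a 0-1 scale (illustrative values, not data).
u = {"N": 1.0, "L": 0.7, "AL": 0.4, "D": 0.0}


def ratio(smaller, larger):
    """Ratio of two utility intervals, each given as a (top, bottom) pair."""
    (a, b), (c, d) = smaller, larger
    return (u[a] - u[b]) / (u[c] - u[d])


# Inferred item structure: (judged interval, standard interval).
items = {
    1: ratio(("N", "L"), ("N", "D")),    # common top: normal
    2: ratio(("N", "AL"), ("N", "D")),   # common top: normal
    3: ratio(("N", "L"), ("N", "AL")),   # common top: normal
    4: ratio(("L", "AL"), ("L", "D")),   # common top: not normal
    5: ratio(("L", "D"), ("N", "D")),    # common bottom: death
    6: ratio(("AL", "D"), ("N", "D")),   # common bottom: death
    7: ratio(("L", "AL"), ("N", "AL")),  # common bottom: not death
    8: ratio(("AL", "D"), ("L", "D")),   # common bottom: death
}

# Ratio consistency: 1 = 2 * 3, and 6 = 8 * 5.
assert math.isclose(items[1], items[2] * items[3])
assert math.isclose(items[6], items[8] * items[5])

# Additivity: each pair of ratios sums to 1.
for first, second in [(1, 5), (2, 6), (3, 7), (4, 8)]:
    assert math.isclose(items[first] + items[second], 1.0)
```

With true utilities every check passes by construction; the experiments ask whether subjects' judged ratios satisfy the same identities.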

The 32 items (the eight items above, each in its paralysis or sensory form, and then either PTO or difference) appeared in a different random order for each subject.

Results

The nature of the disability - sensory or paralysis - did not affect the judgments, and this factor did not interact significantly with PTO vs. difference or with comparison (1-8). Accordingly, we combined the results for types of disability for analysis and presentation, and we use the paralysis conditions to stand for both.

Table 2 shows the mean utility ratios for the eight comparisons, the percentage of responses in which each measure yielded equality responses, that is, responses indicating that the two intervals were equal (100), and the mean utility ratios with equality responses excluded. In contrast to previous results (e.g., Baron et al., 2001), PTO and difference judgments were very close, and their overall means did not differ significantly across subjects (with or without the equality responses). Conceivably, the random intermixing of the two measures encouraged subjects to see them as more similar. (Baron et al., 2001, Experiment 2, found that close proximity of the two measures made their results closer.)

Insert Table 2.

Equality responses

An equality response is a response of 100, which we take to indicate that the subject judged the two intervals to be equal. We expected fewer of these responses in the difference method than in the PTO method because, in the PTO, subjects might think that any other response implies unequal treatment of two groups differing in current health or in what can be prevented. The difference method is likely to be seen as a judgment of seriousness without any implications for how two groups should be treated. As expected, the proportion of equality responses was greater for PTO than for the difference method (t(113)=4.94, p=.0000, across subjects). Table 2 shows the mean proportion of equality responses for each of the eight comparisons.

We classified the comparisons into four types according to what condition was common to both intervals being compared: both intervals have normal health at the top (the better end of the interval - comparisons 1-3, called Common-top-normal); both have the same condition at the top, other than normal (comparison 4, Common-top-not-normal); both have death at the bottom (comparisons 5, 6, and 8, Common-bottom-death); and both have some other condition at the bottom (comparison 7, Common-bottom-not-death). We computed the mean proportion of equality responses for each of the four types.

The four types differed in an analysis of variance that included subject, PTO-vs.-difference, and type as factors (F(3,339)=11.50, p=.0000). The differences among the types did not interact significantly with PTO-vs.-difference. Equality responses were more frequent in the Common-bottom-death comparisons than in the other types, which were approximately equal. Collapsing across PTO and difference, the proportion of equality responses was .15 in Common-bottom-death and .08, .07, and .07, respectively, in the other three types: Common-top-normal, Common-top-not-normal, and Common-bottom-not-death. In separate t tests, the proportion of equality responses in Common-bottom-death was higher than for each of the three other types at p < .002 (with Bonferroni correction, collapsing across PTO vs. difference). No other differences between types were significant.

In sum, equality responses are relatively frequent for Common-bottom-death, that is, when both intervals compared involve saving lives.

Additivity

We assessed additivity for comparisons involving death but not normal health (comparison 4 plus comparison 8), those involving normal health but not death (3 plus 7), and those involving both normal and death (extremes: 1 plus 5 and 2 plus 6). Table 3 shows the measure of superadditivity, the sum of the two ratios, minus 1. The measure is 0 if the two ratios are additive.

Utility ratios were superadditive, as shown in Table 3. Superadditivity was significantly positive overall (F(1,113)=82.9, p=.0000), and in each of the six types (three for PTO and three for difference; p < .002 by t test). Although the six types differed significantly, the differences are difficult to interpret.

Insert Table 3.

Superadditivity was reduced when equality responses were excluded (t(113)=6.28, p=.0000, using the average across the six types - three for difference and three for PTO). When equality responses were excluded, superadditivity remained overall (t(113)=2.41, p=.0176).

The substantial reduction in superadditivity - from a mean of .28 to a mean of .11 - suggests that superadditivity results largely from equality responses. For example, if the difference between death and leg paralysis is equal to the difference between death and normal health, then any judgment that there is a difference between leg paralysis and normal health will lead to superadditivity.
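The effect of excluding equality responses can be sketched in Python. The response pairs below are hypothetical and chosen only for illustration, and the exclusion rule (dropping any pair that contains a response of 100) is an assumption about how such an analysis might be carried out, not a description of the exact procedure used here.

```python
from statistics import mean


def superadditivity(resp_upper, resp_lower):
    """Sum of two 0-100 responses, expressed as ratios, minus 1."""
    return (resp_upper + resp_lower) / 100 - 1


# Hypothetical pairs of responses to two items that should sum to 100,
# e.g., comparison 1 (normal to legs) and comparison 5 (legs to death),
# both judged against the normal-to-death standard.
pairs = [(40, 100), (30, 80), (50, 60), (20, 100), (40, 70)]

all_scores = [superadditivity(a, b) for a, b in pairs]
kept_scores = [superadditivity(a, b) for a, b in pairs if a != 100 and b != 100]

# Whenever one response is 100, the score equals the other ratio,
# so any nonzero judgment on the other item yields superadditivity.
print(round(mean(all_scores), 2), round(mean(kept_scores), 2))  # 0.18 0.1
```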

On the other hand, the results also suggest that superadditivity is in part a scaling effect, i.e., the J function is nonlinear. This observed nonlinearity is not just a by-product of equality responses, nor of principles applied to choices among groups of people. Superadditivity was significant for the difference method alone, without equality responses (t(113)=2.83, p=.0052). It could arise either in the way conditions are perceived or in the mapping of these perceptions into numerical responses. When subjects compare intervals with one end of one of the intervals at normal, or one end at death, they may take that end as a reference point, exaggerate differences near it and minimize differences far from it, thus producing ratios that are too high.

Ratio inconsistency

Ratio inconsistency was positive for all measures (p < .0005, with or without equality responses, t > 5 for both). The measure of inconsistency was not affected by method (difference vs. PTO) or by whether the comparisons involved normal health (comparisons 1-3) or death (comparisons 5, 6, and 8). The measure was slightly higher when equality responses were excluded. The mean of the logarithmic ratio was .18 with all responses and .20 with equality responses excluded (using all remaining data, t(113)=2.49, p=.0142). These means correspond, respectively, to ratios of 1.52 and 1.60. In other words, a ratio that should be the product of two others is more than 1.5 times larger than the product, on the average. This result (also found by Baron et al., 2001) suggests that people do not differentiate large ratios and small ratios sufficiently.
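The relation between the logarithmic measure and the ratios can be sketched in Python. The function names are hypothetical, and the use of base-10 logarithms is an assumption: the text does not state the base, but base 10 comes closest to the reported ratios. Back-transforming the rounded means gives values slightly below the reported 1.52 and 1.60, presumably because the reported means are themselves rounded.

```python
import math


def log_inconsistency(direct_ratio, product_of_ratios):
    """Base-10 log of the directly judged ratio over the product (assumed base)."""
    return math.log10(direct_ratio / product_of_ratios)


def implied_ratio(mean_log):
    """Convert a mean log-inconsistency back into a ratio."""
    return 10 ** mean_log


# The "half to every question" judge: direct ratio .5, product .25.
print(round(log_inconsistency(0.5, 0.25), 3))  # 0.301

# Back-transforming the reported (rounded) means of .18 and .20.
print(round(implied_ratio(0.18), 2), round(implied_ratio(0.20), 2))  # 1.51 1.58
```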

Experiment 2

Experiment 1 suggested that subjects were using an equality principle, in which they truly judged the intervals to be equal (in the difference method) or thought that two groups should get equal priority (in the PTO method). Experiment 2 attempted to find direct evidence for these principles by asking subjects about them. To avoid biasing the subjects by our questions, we also asked about the opposite principle, the idea of complete triviality of one interval compared to the other, so that one interval ``dominates'' the other.

We also changed the response format. The use of an open-ended response in Experiment 1 might have encouraged subjects to respond with round numbers, and a response of 100 might have meant ``closer to 100 than to 90.'' To avoid this possibility, we gave subjects a set of buttons labeled 100, 99, 95, 90, 80, .... This response mode makes a response of 99 just as acceptable as a response of 100.

Method

Sixty-eight subjects completed a questionnaire on the World Wide Web, for $3. Ages ranged from 17 to 73 (median 34); 69% were female; and 72% were non-students.

The procedure was identical to that of Experiment 1, except for the addition of a set of buttons for responses, and the addition of two questions, one about equality, one about what we call dominance. An example of a difference-method item is:

If the difference between
BLINDNESS and DEATH
is 100, how big is the difference between
BLINDNESS AND DEAFNESS and DEATH?

Choose the closest number:

100
99
95
90
80
70
60
50
40
30
20
10
5
1
0

Consider the following statements:

The difference between
BLINDNESS AND DEAFNESS and DEATH
is as large as the difference between
BLINDNESS and DEATH.
    AGREE    
    DISAGREE    
The difference between
BLINDNESS AND DEAFNESS and DEATH
is trivial compared to the difference between
BLINDNESS and DEATH.
    AGREE    
    DISAGREE    

For the PTO item, the buttons were the same, but the equality and dominance questions were (using a different example):

It is just as important to prevent people from changing from
PARALYSIS OF ARMS AND LEGS to DEATH as it is to prevent people from changing from
PARALYSIS OF LEGS to DEATH.
    AGREE    
    DISAGREE    
When we have a choice between preventing the change from
PARALYSIS OF ARMS AND LEGS to DEATH in some people, and preventing the change from
PARALYSIS OF LEGS to DEATH in other people, we should always prevent the change from PARALYSIS OF LEGS to DEATH no matter how few people we can help.
    AGREE    
    DISAGREE    

Results

Table 4 shows the mean utility ratios for the eight comparisons, the percentage of responses in which each measure yielded equality responses, and the percent agreement with the equality and dominance questions. In general, the proportion of equality responses (numerical responses of "100") was higher than in Experiment 1, despite our effort to reduce them by giving the subjects a clear "99" option. The equality question may have called attention to the possibility that intervals were equal. Interestingly, the dominance question - although often endorsed - did not seem to play an analogous role. (Only 2.6% of the responses were the lowest possible, which turned out to be 1 rather than 0 because of a programming error, an error that no subject complained about.)

Insert Table 4.

Of primary interest, subjects endorsed the equality question more often than they followed it literally in their numerical judgments, but their responses to this question followed the same pattern. Again, we classified the comparisons into four types according to what condition was common to both intervals being compared - Common-top-normal, Common-top-not-normal, Common-bottom-death, Common-bottom-not-death - and we examined each subject's proportion of each type of response for each of these types. The effect of type was highly significant for all three relevant measures: numerical utility judgments (F(3,201) = 14.04, p = .0000), equality responses (judgments of 100; F(3,201) = 17.67, p = .0000), and equality agreement (F(3,201) = 29.34, p = .000). As in Experiment 1, Common-bottom-death differed very clearly from all other types for utility judgments, equality responses, and equality agreement.

Unlike Experiment 1, some of the other type differences were significant too. For equality responses and equality agreement, all of these involved the Common-bottom-not-death type (Condition 7 in Table 1 and Table 4), which was higher than both Common-top-normal and Common-top-not-normal types (p < .025 in all cases). These results suggest that subjects pay more attention to the bottom of the interval than the top in making these equality judgments.

The pattern for the dominance question is not just the reverse of the equality question, as seen in Table 4. Answers to the dominance question are most frequent when one of the intervals includes death and the other does not (cases 1, 2 and 4 vs. all others; t(67) = 6.48, p = .0000, for the means of these cases vs. the means of the others). Subjects seem to apply a principle that preventing death dominates preventing anything else.

PTO and difference judgments also differed significantly in utility judgments, equality responses, and equality agreement (p < .03 for all) with more equality and higher utilities in PTO. Dominance responses showed no effect.

In sum, direct judgments of the equality of intervals support our interpretation of the results of Experiment 1. Agreement that intervals are equal was more frequent for Common-bottom-death, that is, when both intervals compared involve saving lives. We also found some evidence of greater attention to the bottom of the interval than the top, and for judgments of dominance when one bottom involved death and the other did not.

Results for superadditivity replicated those of Experiment 1, although the effect was somewhat larger (with a minimum of .21, for the difference method when normal health was involved but not death). Again, the effect was significantly reduced when equality responses were omitted (t(33) = 3.03, p = .0048). This time, however, there was no significant superadditivity effect when the equality responses were removed. Note that many subjects had insufficient data for these tests, because the number of equality responses was so much higher than in Experiment 1.

Ratio inconsistency was, again, significantly positive for all measures (p < .025, with or without equality responses). Again, this measure of inconsistency was not affected by method (difference vs. PTO), by whether the comparisons involved normal health (comparisons 1-3) or death (comparisons 5, 6, and 8), or by whether or not equality responses were excluded.

General Discussion

We examined two methods, direct judgment and person trade-off. The former asks the subject to think about utility in an abstract way. The latter infers the subject's utility from a matching response in a hypothetical decision. Both methods are subject to distortions resulting from scaling effects or heuristics.

Our results suggest that further use of these methods - and possibly others as well - should include checks for consistency, such as those we have used. Such checks may help respondents bring their judgments into line with their true values, or construct their values through a process of reflection with the help of the consistency checks. For example, a person might come to agree that the difference between death and normal health is greater than the difference between death and some state of impairment, after reflecting on the implication of equality judgments that the impairment is no different from normal health.

Equality responses and superadditivity

We found evidence for a heuristic or principle of equality in which subjects judged the intervals being compared as equal. This principle led to higher mean utility ratios when they were measured with death as the reference point, especially in PTO, and the principle also led to superadditivity. It is possible that some of these responses resulted from careless responding. Careless responding, however, cannot account for the observed differences among conditions. The results of Experiment 2, in which subjects acknowledge equality in a consistent pattern, also argue against a carelessness account. Equality responses in PTO (but not in direct rating) have also been found using interview methods (Pinto-Prades & Lopez-Nicholás, 1998).

We would expect superadditivity to result from equality responses. If, for example, a subject judged J{[u(A)-u(B)]/[u(A)-u(C)]} to be 1 and J{[u(B)-u(C)]/[u(A)-u(C)]} to be greater than 0, then the sum would be greater than 1. In Experiment 1, superadditivity was, in fact, reduced when these equality responses were removed, but it was not eliminated. Note also that ratio inconsistency was not reduced by removal of equality responses, and there is no general reason to expect such reduction.
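
The link between equality responses and superadditivity can be sketched numerically. The example below follows the definition used in Table 3 (the sum of two ratios that should add to 1, minus 1), with responses entered on the 0-100 scale. The function name and the particular responses (100, 30, 70) are hypothetical and are not taken from the data.

```python
def superadditivity(pct_first, pct_second):
    """Sum of two judged ratios (given in percent) that should add to
    100 percent, minus 1, expressed as a proportion."""
    return (pct_first + pct_second - 100) / 100

# Hypothetical subject: an equality response (100) for the A-B interval
# relative to A-C, and a response of 30 for the B-C interval.
with_equality = superadditivity(100, 30)

# A consistent pair for comparison: 70 + 30 = 100.
consistent = superadditivity(70, 30)

print(with_equality, consistent)
```

Any positive response to the second question combined with an equality response to the first yields a positive score, which is why removing equality responses would be expected to reduce the measure.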

If utility measures are affected by a principle of equality, it is not clear which measures are relevant to which decisions. At issue, in part, are the reasons for the principle. One possible reason is that subjects regard handicaps as irrelevant when they are properties of people. In this regard, they are like economic or social status, outside the bounds of medical decisions - we do not take these things into account in allocating health care. This explanation, however, is inconsistent with our finding that equality responses are particularly common only when death is at the bottom end of both intervals being compared. If the top condition were generally irrelevant when it differed between the two intervals, then equality responses would be used as much when the bottom condition was some non-death condition as when it was death. It seems that the equality responses are peculiar to comparisons involving death. Moreover, equality responses involving death were prevalent in difference judgments as well as PTO judgments.

Another possible explanation of equality responses is that subjects regard preventing death or saving lives as lexically ordered before any other health outcomes (Norcross, 1997). This, too, conflicts with our results, as we did not find many "zero" responses, in which subjects gave infinite weight to death in comparisons where the top of the two intervals was constant (comparisons 1 and 2). In Experiment 2, the pattern of responses to the dominance question, which assessed this principle directly, was somewhat different from that of the equality responses. On the other hand, the same experiment provided some evidence that, in general, people pay more attention to the bottom of an interval, so they are more likely to make equality responses if the bottom of two intervals is the same than if the top is the same.

One possible explanation, consistent with our results, is that people employ a heuristic principle that is specific to life-saving, which is that decisions about life-saving are special, in that everyone should be treated equally. The use of this principle for PTO measures is consistent with the finding that these measures showed more equality responses than difference measures. These results are also consistent with the findings of Ubel and Richardson (2000): In a PTO task, subjects compared saving the lives of people with paralysis of their legs to saving the lives of people in normal health. The people were either paralyzed before they became ill or they were in normal health before they became ill. When the pre-existing paralysis case came first, most subjects gave equal priority to the two life-saving procedures, but, when the people were normal beforehand, most subjects gave priority to restoring people to normal health over restoring them to a state of paralysis. Evidently, the pre-existing condition elicited the equality heuristic more often than did the control condition.

We found equality responses for difference judgments as well as for PTO, and they were more frequent when death was at one end of both intervals. This finding suggests that people sometimes see saving lives as equally valuable regardless of the end state after the life is saved. Difference judgments are about magnitudes, and they are not hypothetical decisions. Equality responses for difference judgments thus would result from a judgment of the value for each individual saved. Equality responses in general, then, are not easily understood as resulting from a heuristic involving fairness, since they occur for difference judgments as well as PTO judgments. Also, fairness to the handicapped would seem to apply even when the bottom condition is not death, and death is crucial for equality judgments in both PTO and difference judgments.

In principle, such a judgment that two different intervals are equal, or nearly equal, could be correct, if people think of intervals as changes from one state to another rather than as differences between two persistent states. The utility of a change need not depend solely on the difference between the utility of being in the start state and the utility of being in the end state. For example, the utility of change itself might have some maximum. Thus, the utility of a change from A to B might be just as great as the utility of a change from A to C, even if the change from B to C has non-zero utility as well.

It is unlikely, however, that this perception of value is typically accurate for the affected people. Given a choice between having one's life saved and being handicapped and having one's life saved and not being handicapped, most people would strongly prefer the latter, even in the face of great happiness in knowing that their life would be saved. Although we could simply assume that these judgments represent true expressions of public preferences, we should find out how people respond to challenges, either by asking them outright whether they prefer to be handicapped or not or by explaining the superadditivity effect to them. (We may also try to explain to them that they are to judge states rather than changes.)

Superadditivity could also result from a tendency to see intervals as more equal than they are. This, in turn, could result from declining influence of differences further from a reference point, as discussed in the Introduction.

Ratio inconsistency

Ratio inconsistency, which we also found, can be described as a failure to make sufficient distinctions among high and low ratios; subjects tend to give responses toward the middle of the scale they are given (Poulton, 1979).

More generally, our results suggest two general problems with utility measurement, one resulting from the use of heuristic principles and the other resulting from psychophysical effects. Further research is needed to examine the malleability of these effects, and also whether these effects are found in other measures of utility, such as standard gambles and time tradeoffs. We are undertaking this research. These are not the only possible problems, of course.

PTO as a utility measure

Some have argued that the PTO is not a utility measure but, instead, a measure of societal value (Nord, Pinto, Richardson, Menzel, & Ubel, 1999). As such, it is acceptable, they argue, for PTO measurements to show superadditivity. If, for example, saving the life of someone who is left blind is just as valuable as saving someone else's life who will have full vision, according to a PTO measurement, then people should still be allowed to say that there is value to curing blindness. They would not be able to do so if the PTO was a measure of utility and was not allowed to show superadditivity.

Suppose that treatment A saves people's lives and leaves them blind, and treatment B cures blindness. Treatment C saves people's lives and leaves them with full vision. People judge that giving A to 10 people is as valuable as giving C to 10 people, and giving B to 10 people is half as valuable as either of the other treatments (i.e., giving B to 20 people is equally valuable). Then, if value is additive, the value of giving A, followed by B, to 10 people would be greater than the value of giving C to 14 people. (Assume that patients are under anesthesia once when A and B are given to the same patients.) Yet the result would be the same.
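
The arithmetic of this example can be sketched as follows. The per-person values are in arbitrary units chosen only to match the judgments described (A equal to C, B half as valuable as either), and the names VALUE and additive_value are hypothetical.

```python
from fractions import Fraction

# Judged value per treated person, in arbitrary units (hypothetical scale):
# A (life saved, left blind) is judged equal to C (life saved, full vision),
# and B (blindness cured) is judged half as valuable as either.
VALUE = {"A": Fraction(1), "B": Fraction(1, 2), "C": Fraction(1)}

def additive_value(treatments, n_people):
    """Total value if the values of the listed treatments simply add."""
    return sum(VALUE[t] for t in treatments) * n_people

a_then_b = additive_value(["A", "B"], 10)  # 15 units
c_for_14 = additive_value(["C"], 14)       # 14 units
c_for_10 = additive_value(["C"], 10)       # 10 units

# Additivity ranks A followed by B for 10 people above C for 14 people,
print(a_then_b > c_for_14)
# although A followed by B leaves 10 people in the same state as C for 10.
print(a_then_b, c_for_10)
```

Under additivity the same end result (10 people alive with full vision) is valued at 15 units by one route and 10 units by the other, which is the conflict the example is meant to expose.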

The only escape from this apparent conflict is either to say that values of events are not additive, or else to say that decisions should not be based on consequences. Ubel et al. (in press) have discussed the possibility that values need not be additive. In principle, such a state of affairs could prevent us from making inferences about choices, without asking about the particular choices. In practice, such situations may be rare, and unproblematic. We have suggested earlier that, for some situations involving changes, values need not be additive.

If decisions are not based on consequences, then we must ask for justification. If, for example, we choose the combination of A and B over giving C to more people, how can we explain to the people who would have gotten C why they are not getting it? We cannot say that we had to make a choice and that the benefits to others were greater. Moreover, suppose people were behind a veil of ignorance and did not know which group they were in, the A+B group (10) or the C group (14), but their chance of being in each group depended on its size. In this case, the policy of choosing C would be better for each individual. We thus could make a decision that is judged to be good on the basis of a PTO judgment and that makes everyone worse off. (The fact that people are worse off only ex ante does not seem to weaken the force of this argument, which is based on the work of Kaplow and Shavell (2000).)
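
A minimal sketch of the ex ante comparison, under simplifying assumptions that are not spelled out in the text: the pool consists of exactly the 24 people in the two groups, each person is equally likely to occupy any position, and every treated person ends up in the same state (alive with full vision) under either policy. The names are hypothetical.

```python
from fractions import Fraction

def chance_of_benefit(n_treated, pool_size):
    """Ex ante chance that a randomly placed person is among those treated."""
    return Fraction(n_treated, pool_size)

# Hypothetical pool: the 10 people who could get A+B plus the 14 who could get C.
POOL = 10 + 14

# Either policy leaves each treated person alive with full vision,
# so behind the veil only the chance of being treated differs.
policy_a_plus_b = chance_of_benefit(10, POOL)
policy_c = chance_of_benefit(14, POOL)

print(policy_a_plus_b, policy_c, policy_c > policy_a_plus_b)
```

On these assumptions each individual's chance of being helped is higher under the policy of choosing C, whatever value is attached to the outcome itself.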

In addition, even as a measure of societal value and not utility, PTO measures should not show ratio inconsistency, as they have done in this study (Baron et al., 2001). The problem is again that, by combining judgments we could arrive at different conclusions about the same choice. If curing one-eye blindness in 200 people is as good as curing blindness in 50, and if curing blindness in 50 is as good as saving the lives of 10, then we must conclude that curing one-eye blindness in 200 is better than saving the lives of 9, yet someone might judge that the figure is really 18 rather than 9. Thus, preference itself would be intransitive.

These considerations argue that the PTO measure must behave like a utility measure, although it could still reflect a different kind of utility that we might call societal utility. Yet, consider again the veil of ignorance. Any PTO judgment has an equivalent judgment using gambles for identically placed individuals. (The fact that people are rarely identically placed is, again, irrelevant, since a normative theory of the sort we are considering must apply everywhere.) If the PTO judgment disagrees with the judgment of gambles, then it could lead to a societal decision that makes everyone worse off, in terms of their own judgments of personal good. This sort of argument could apply to any other utility measure (including direct ratings, if people take their direct ratings to be interval measures of personal good).

Conclusion

Our findings suggest that practical uses of utility elicitation methods, even methods as simple as direct judgment, should be accompanied by consistency checks of the sort we have used (as suggested, for example, by Keeney & Raiffa, 1976, p. 271, and Baron et al., 2001). Consistency checks are common practice in decision analysis but are rarely used in other traditions of practical judgment elicitation. The evidence to date suggests that people can make judgments that are both consistent and honest. That is, people are generally willing to accept the conclusion that inconsistent judgments are erroneous (Baron et al., 2001).

When consistency checks fail, subjects should be confronted with the results and asked to resolve them. When one of the ends of the scale is death and when subjects give equality responses, they should also be asked to reflect on these responses, asking themselves, for example, whether, given that someone's life was to be saved, it really didn't matter what state they were in after that. This can be done in a personal interview, or with the use of an interactive computer program, or both together. The equality effect is not, itself, inherently inconsistent for a single judgment, but it apparently leads to superadditivity when it is combined with other judgments, so the checks should focus on the inconsistency rather than on the equality effect itself. Moreover, as we noted, we have no reason to expect reduction of the equality effect to influence other types of inconsistency, aside from superadditivity.

Our results also suggest that, when PTO and difference judgments differ, the difference judgments yield fewer equality responses and are thus somewhat less subject to inconsistencies. This does not eliminate the need for consistency checks. But our results suggest that direct comparison of intervals remains a viable method for assessment of utility.

Our results are limited to the two methods we have used. Similar inconsistencies have been found using other methods. For example, Ubel et al. (1996) found ratio inconsistency in the standard-gamble method (which is based on expected-utility theory), and O'Leary et al. (1995) found unwillingness to trade off any time in the time-tradeoff, a result that may be analogous to equality responses here. As yet we do not know whether the same patterns of results would be found for these other measures (involving the nature of the common bottom) as we found here.

Appendix

Why additivity is special

Additivity, under reasonable conditions, implies ratio consistency, but ratio consistency does not imply additivity. Moreover, additivity is closer to being a sufficient condition for a utility scale. (The additional assumptions required are likely to hold.)

Additivity as defined above can be seen as a special case of a more general additivity property of utility judgments: J{[u(A)-u(B)]/[u(E)-u(F)]} + J{[u(B)-u(C)]/[u(E)-u(F)]} = J{[u(A)-u(C)]/[u(E)-u(F)]}. The denominator need not be u(A)-u(C); the denominator u(E)-u(F) simply defines the unit of measurement. This general additivity property implies "monotonicity" (Krantz, Luce, Suppes, & Tversky, 1971, p. 145 - actually as stated, "weak monotonicity" is based on inequality rather than equality, but we use the equality version of p. 137 for clarity of exposition). Specifically, omitting the denominator J[u(E)-u(F)] (which is the unit of measurement for all judgments), the monotonicity property is:

if J[u(A)-u(B)] = J[u(A′)-u(B′)] and J[u(B)-u(C)] = J[u(B′)-u(C′)], then J[u(A)-u(C)] = J[u(A′)-u(C′)].

In other words, if the interval between A′ and B′ is judged equal to that between A and B (where the primes indicate different health states) and B′ and C′ is equivalent to B and C, then A′ and C′ is equivalent to A and C. This is the most critical property required for an interval scale of utility. The general form of the additivity test (with J[u(E)-u(F)] in the denominator) implies monotonicity, since J[u(A)-u(C)] is the sum of J[u(A)-u(B)] and J[u(B)-u(C)] and must therefore have the same value for any other intervals equal to these two, respectively. Here, as noted, we test a special case in which A and C define the unit, but, if additivity fails for this case, it is not generally true, so a utility scale cannot be constructed. If such a scale can be constructed, then ratio consistency also holds.

But the converse does not hold. For example, suppose that judgments are based on the square of utility differences rather than the differences themselves. Thus

J{[u(A)-u(B)]/[u(A)-u(C)]} = [u(A)-u(B)]^2/[u(A)-u(C)]^2.

Then the following are equivalent:

1. J{[u(A)-u(B)]/[u(A)-u(C)]} ·J{[u(A)-u(C)]/[u(A)-u(D)]} = J{[u(A)-u(B)]/[u(A)-u(D)]}

2. {[u(A)-u(B)]^2/[u(A)-u(C)]^2} · {[u(A)-u(C)]^2/[u(A)-u(D)]^2} = {[u(A)-u(B)]^2/[u(A)-u(D)]^2}

3. {[u(A)-u(B)]/[u(A)-u(C)]} ·{[u(A)-u(C)]/[u(A)-u(D)]} = {[u(A)-u(B)]/[u(A)-u(D)]}

So ratio consistency will still hold, but additivity will not hold in general, since

J{[u(A)-u(B)]/[u(A)-u(C)]} + J{[u(B)-u(C)]/[u(A)-u(C)]}

= {[u(A)-u(B)]^2/[u(A)-u(C)]^2} + {[u(B)-u(C)]^2/[u(A)-u(C)]^2}

= {[u(A)^2 - 2u(A)u(B) + u(B)^2] + [u(B)^2 - 2u(B)u(C) + u(C)^2]}/[u(A)-u(C)]^2 ,

which need not equal {[u(A)-u(B)] + [u(B)-u(C)]}^2/[u(A)-u(C)]^2, that is, 1.

In sum, additivity (together with other assumptions that are likely to hold, such as larger intervals having greater utilities) implies a utility scale and the two consistency measures do not.
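
The argument can be checked with a small numerical sketch. The utilities assigned to A through D are hypothetical values on an arbitrary scale, and the function name judged is an illustrative assumption; exact fractions are used so that the equalities are not blurred by rounding.

```python
from fractions import Fraction

# Hypothetical utilities on an arbitrary scale.
u = {"A": Fraction(10), "B": Fraction(6), "C": Fraction(3), "D": Fraction(0)}

def judged(top1, bottom1, top2, bottom2):
    """Judged ratio of interval (top1, bottom1) to interval (top2, bottom2)
    when judgments follow the SQUARE of the true ratio of utility differences."""
    true_ratio = (u[top1] - u[bottom1]) / (u[top2] - u[bottom2])
    return true_ratio ** 2

# Ratio consistency: J(AB/AC) * J(AC/AD) should equal J(AB/AD).
product = judged("A", "B", "A", "C") * judged("A", "C", "A", "D")
direct = judged("A", "B", "A", "D")

# Additivity: J(AB/AC) + J(BC/AC) should equal 1.
total = judged("A", "B", "A", "C") + judged("B", "C", "A", "C")

print(product == direct)  # True: ratio consistency holds
print(total)              # 25/49: additivity fails
```

With these particular values the sum falls below 1, so squaring is not offered as an account of the superadditivity observed in the experiments; the sketch only illustrates the logical point that ratio consistency can hold while additivity fails.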

Why inversion consistency is a special case of ratio consistency

Inversion consistency (Baron et al., 2001) refers to the effect of direction of comparison, J{[u(A)-u(B)]/[u(A)-u(C)]} vs. J{[u(A)-u(C)]/[u(A)-u(B)]}. For example, "How big is the difference between normal health and blindness compared to that between normal health and blindness-and-deafness?" vs. "How big is the latter difference compared to the former?" If subjects are biased toward higher numerical responses, no matter what the question, then the product of the two ratios will be greater than 1, as found for a number of psychophysical judgments (Fagot, 1978; Fagot & Pokorny, 1989). High numbers could result from a tendency to give numbers in the middle of the given range; for numbers over 100, subjects could see the top of the range as very high. Inversion consistency can be seen as a special case of ratio consistency (as shown below).

Inversion consistency amounts to J{[u(A)-u(B)]/[u(A)-u(C)]} ·J{[u(A)-u(C)]/[u(A)-u(B)]}=1,

but it can also be written as

J{[u(A)-u(B)]/[u(A)-u(C)]} ·J{[u(A)-u(C)]/[u(A)-u(B)]} = J{[u(A)-u(B)]/[u(A)-u(B)]}. Ratio consistency says that J{[u(A)-u(B)]/[u(A)-u(C)]} ·J{[u(A)-u(C)]/[u(A)-u(D)]} = J{[u(A)-u(B)]/[u(A)-u(D)]}. If we set D equal to B, it is apparent that ratio consistency implies inversion consistency, if ratio consistency holds for reversals (such as J{[u(A)-u(C)]/[u(A)-u(B)]}).

Because inversion consistency is a special case of ratio consistency, it is no more definitive as a consistency check. It is necessary, but not sufficient, for a utility scale.
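
A short sketch of the inversion test follows. The response rule biased_judgment, with its constant inflation factor of 1.2, is a purely hypothetical stand-in for a bias toward higher numbers; it is not a model fitted to any data.

```python
def biased_judgment(true_ratio, bias=1.2):
    """Hypothetical response rule: every judged ratio is inflated by a
    constant factor (bias > 1 means a lean toward higher numbers)."""
    return true_ratio * bias

def inversion_product(true_ratio, judge):
    """Product of a judged ratio and the judgment of its reciprocal;
    inversion consistency requires this product to equal 1."""
    return judge(true_ratio) * judge(1.0 / true_ratio)

unbiased = inversion_product(0.5, lambda r: r)      # 0.5 * 2.0
inflated = inversion_product(0.5, biased_judgment)  # 0.6 * 2.4

print(unbiased, round(inflated, 2))
```

Setting D equal to B in the ratio-consistency equation gives exactly this product, so a judge who inflates both directions of comparison fails the inversion test in the same way that ratio consistency fails more generally.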

References

Baron, J. (1997). Biases in the quantitative measurement of values for public decisions. Psychological Bulletin, 122, 72-88.

Baron, J., Wu, Z., Brennan, D. J., Weeks, C., & Ubel, P. A. (2001). Analog scale, ratio judgment and person trade-off as measures of health utility: biases and their correction. Journal of Behavioral Decision Making, 14, 17-34.

Birnbaum, M. H. (1978). Differences and ratios in psychological measurement. In N. Castellan & F. Restle (Eds.), Cognitive theory (Vol. 3, pp. 33-74). Hillsdale, NJ: Erlbaum.

Birnbaum, M. H. (Ed.) (2000). Psychological Experiments on the Internet. New York: Academic Press.

Birnbaum, M. H., & Sutton, S. E. (1992). Scale convergence and utility measurement. Organizational Behavior and Human Decision Processes, 52, 183-215.

Bleichrodt, H., & Johannesson, M. (1997). The validity of QALYs: An experimental test of constant proportional tradeoff and utility independence. Medical Decision Making, 17, 21-32.

de Neufville, R., & Delquié, P. (1988). A model of the influence of certainty and probability "effects" on the measurement of utility. In B. Munier (Ed.), Risk, Decision, and Rationality (pp. 189-205). Dordrecht: Reidel.

Fagot, R. F. (1978). A theory of relative judgment. Perception and Psychophysics, 24, 243-252.

Fagot, R. F., & Pokorny, R. (1989). Bias effects on magnitude and ratio estimation power function exponents. Perception and Psychophysics, 45, 221-230.

Fischer, G. W. (1995). Range sensitivity of attribute weights in multiattribute value models. Organizational Behavior and Human Decision Processes, 62, 252-266.

Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decisions under risk. Econometrica, 47, 263-291.

Kaplow, L., & Shavell, S. (2000). Principles of fairness versus human welfare: On the evaluation of legal policy. Discussion Paper No. 277, Center for Law, Economics, and Business, Harvard Law School. http://www.law.harvard.edu/programs/olin_center

Keeney, R. L., & Raiffa, H. (1976). Decisions with multiple objectives. New York: Wiley.

Krabbe, P. F. M., Essink-Bot, M.-L., & Bonsel, G. K. (1997). The comparability and reliability of five health-state valuation methods. Social Science and Medicine, 45, 1641-1652.

Krantz, D. H., Luce, R. D., Suppes, P., & Tversky, A. (1971). Foundations of measurement (Vol. 1). New York: Academic Press.

Llewellyn-Thomas, H. A. (1997). Investigating patients' preferences for different treatment options. Canadian Journal of Nursing Research, 29, 45-64.

Mellers, B. A., & Birnbaum, M. H. (1983). Loci of contextual effects in judgment. Journal of Experimental Psychology: Human Perception and Performance, 8, 582-601.

Miyamoto, J. M., & Eraker, S. A. (1988). A multiplicative model of survival duration and health quality. Journal of Experimental Psychology: General, 117, 3-20.

Norcross, A. (1997). Comparing harms: Headaches and human lives. Philosophy and Public Affairs, 26, 135-167.

Nord, E. (1995). The person trade-off approach to valuing health care programs. Medical Decision Making, 15, 201-208.

Nord, E. (1999). Cost-value analysis in health care: Making sense out of QALYs. Cambridge, UK: Cambridge University Press.

Nord, E., Pinto, J. L., Richardson, J., Menzel, P., & Ubel, P. A. (1999). Incorporating societal concerns for fairness in numerical valuations of health programs. Health Economics, 8, 25-39.

O'Leary, J. F., Fairclough, D. L., Jankowski, M. K., & Weeks, J. C. (1995). Comparison of time-tradeoff utilities and rating scale values of cancer patients and their relatives. Medical Decision Making, 15, 132-137.

Pinto-Prades, J.-L., & Lopez-Nicholás, A. (1998). More evidence on the plateau effect: A social perspective. Medical Decision Making, 18, 287-294.

Poulton, E. C. (1979). Models for biases in judging sensory magnitude. Psychological Bulletin, 86, 777-803.

Soetikno, R. M., Mrad, R., Pao, V., & Lenert, L. A. (1997). Quality-of-life research on the Internet: feasibility and potential biases in patients with ulcerative colitis. Journal of the American Medical Informatics Association, 4, 426-435.

Soetikno, R. M., Provenzale, D., & Lenert, L. A. (1997). Studying ulcerative colitis over the World Wide Web. American Journal of Gastroenterology, 92, 457-460.

Thaler, R. (1985). Mental accounting and consumer choice. Marketing Science, 4, 199-214.

Tversky, A., & Kahneman, D. (1992). Advances in prospect theory: Cumulative representations of uncertainty. Journal of Risk and Uncertainty, 5, 297-323.

Ubel, P. A., Nord, E., Gold, M., Menzel, P., Pinto-Prades, J.-L., & Richardson, J. (in press). Improving value measurement in cost-effectiveness analysis. Medical Care.

Ubel, P. A., & Richardson, J. (2000). Order effects in person trade-off elicitations. Manuscript.

Ubel, P. A., Loewenstein, G., Scanlon, D., & Kamlet, M. (1996). Individual utilities are inconsistent with rationing choices: A partial explanation of why Oregon's cost-effectiveness list failed. Medical Decision Making, 16, 108-116.

Ubel, P. A., Loewenstein, G., Scanlon, D., & Kamlet, M. (1998). Value measurement in cost-utility analysis: Explaining the discrepancy between analog scale and person trade-off elicitations. Health Policy, 43, 33-44.

von Winterfeldt, D., & Edwards, W. (1986). Decision analysis and behavioral research. New York: Cambridge University Press.

Table 1.
The values of A-D for comparisons 1-8, illustrated with paralysis.

Comparison  A                  B                           C                           D
1.          Normal             Death                       Normal                      Paralysis of legs
2.          Normal             Death                       Normal                      Paralysis of arms and legs
3.          Normal             Paralysis of arms and legs  Normal                      Paralysis of legs
4.          Paralysis of legs  Death                       Paralysis of legs           Paralysis of arms and legs
5.          Normal             Death                       Paralysis of legs           Death
6.          Normal             Death                       Paralysis of arms and legs  Death
7.          Normal             Paralysis of arms and legs  Paralysis of legs           Paralysis of arms and legs
8.          Paralysis of legs  Death                       Paralysis of arms and legs  Death



Table 2.
Mean judged ratios of utility differences (in percent), percentage of equality responses, and judged ratios excluding equality responses, Experiment 1.

                      Mean ratio                Equality responses        Mean, excluding equality
Comparison            Diff.        PTO          Diff.        PTO          Diff.        PTO
1. N-D vs. N-L        47           49           2            12           46           42
2. N-D vs. N-A&L      58           56           2            15           57           48
3. N-A&L vs. N-L      56           63           3            13           55           57
4. L-D vs. L-A&L      53           53           2            11           52           47
5. N-D vs. L-D        61           63           10           20           56           53
6. N-D vs. A&L-D      54           59           8            18           50           48
7. N-A&L vs. L-A&L    52           56           3            10           50           51
8. L-D vs. A&L-D      65           63           17           16           58           57

Table 3.
Superadditivity, the sum of two ratios that should add to 1, minus 1, for Experiment 1. Equality responses are excluded for numbers in parentheses.

Items in comparison             Difference    PTO
Death but not normal health     .18 (.09)     .16 (.04)
Normal health but not death     .08 (.06)     .19 (.08)
Both normal health and death    .10 (.06)     .14 (-.02)

Table 4.
Mean judged ratios of utility differences, percentage of equality responses, and percent agreement with equality and dominance questions, Experiment 2.

                      Mean ratio          Equality responses  Equality agree      Dominance agree
Comparison            Diff.     PTO       Diff.     PTO       Diff.     PTO       Diff.     PTO
1. N-D vs. N-L        51        58        4         29        13        19        53        59
2. N-D vs. N-A&L      61        60        5         27        21        30        42        47
3. N-A&L vs. N-L      60        67        12        26        33        40        21        27
4. L-D vs. L-A&L      54        57        8         21        20        27        45        49
5. N-D vs. L-D        72        76        29        47        52        69        24        16
6. N-D vs. A&L-D      64        66        24        36        43        57        24        22
7. N-A&L vs. L-A&L    61        67        17        32        31        52        32        24
8. L-D vs. A&L-D      71        74        32        36        56        61        18        18


Footnotes:

1. This research was supported by N.S.F. grant SBR95-20288, and by a grant from the Penn Cancer Center. Peter Ubel's work was supported by the Department of Veterans Affairs through a Career Development Award in health services research and by the Robert Wood Johnson Foundation's Generalist Physician Faculty Scholar Program. We thank Gretchen Chapman, Michael DeKay, Leslie Lenert and the reviewers for comments on a draft. Send correspondence to Jonathan Baron, Department of Psychology, University of Pennsylvania, 3815 Walnut St., Philadelphia, PA 19104-6196, or (e-mail) baron@psych.upenn.edu.

