Baron, J. (2000). Measuring value tradeoffs: problems and some solutions. In E. U. Weber, J. Baron, & G. Loomes (Eds.) Conflict and tradeoffs in decision making: Essays in honor of Jane Beattie. New York: Cambridge University Press.
The measurement of value tradeoffs is central to applied decision analysis. It was also a major concern of Jane Beattie from the time of her thesis to her work with Graham Loomes and others, described in the last chapter. In part of her thesis, Jane examined the use of holistic rating judgments of two-attribute stimuli, as a way of measuring the tradeoff between the two attributes (Beattie & Baron, 1991).
One possible effect of difficulty is to make judgments of tradeoffs more labile, more influenced by extraneous factors. One kind of tradeoff judgment is to make holistic desirability ratings of stimuli in a set of stimuli that vary in at least two dimensions, e.g., cost of a purchase and travel time in order to buy it. Tradeoffs between two dimensions can be assessed by asking how much of one dimension must be given up in order to compensate for a change in the other dimension, with respect to the effect of these changes on the rating. This measure of tradeoffs should reflect the effect of these changes on the goals that motivate the judgments, e.g., the willingness to sacrifice time to save money. It should not be affected by the range of values on either dimension.
We found, in general, that tradeoffs were, in fact, unaffected by the range of variation, provided that the range conveyed no relevant information about utility. If this result is generally true, then holistic judgments would be a good way to measure tradeoffs, for practical purposes. Mellers and Cooke (1994), however, found range effects in several similar tasks.
The present research finds an inconsistent pattern of sensitivity to ranges themselves. However, it also has found substantial effects of magnitude of the values, which can be independent of the range of variation within a group of trials. For example, the amount of money that must be saved to justify spending an extra hour is greater when the purchase price is higher, as if the utility of money were judged as a proportion of the price rather than an absolute amount. Some range effects may result from magnitude effects, but we still do not understand the conditions that produce range effects when magnitude is held constant.
The measurement of value tradeoffs is central to applied decision analysis. How much money is a life worth? A year of health? An hour's relief from pain? In choosing a cancer treatment, how should we trade off the symptoms of the treatment against the probability of a cure? In buying a car, how should we trade off the safety of the car for the driver against its effects on air pollution? If we could answer these questions on the average, then we could design policies to maximize utility. For example, a health-care provider could provide all life-saving treatments up to the point at which the average cost per year of life is more than its customers or citizens, on the average, think should be paid. The same can be said for other public expenditures, from highway safety to protection of wilderness.
A problem with this approach is that responses are often internally inconsistent (Baron, 1997a). Some of the inconsistency is specific to the methods used. For example, the use of hypothetical gambles seems to be distorted by the certainty effect, and, more generally, by the fact that probabilities greater than zero and less than one seem to be treated as more similar than they should be.
Other sources of inconsistency are more ubiquitous. Primarily, people are insensitive to quantity when they compare two attributes. People find it surprisingly easy to say that health is more important than money, without paying much attention to the amount of health or money in question. For example, Jones-Lee, Loomes, and Philips (1995) asked respondents to evaluate hypothetical automobile safety devices that would reduce the risk of road injuries. The respondents indicated their willingness to pay (WTP) for the devices. WTP judgments were on the average only 20% higher for a risk reduction of 12 in 100,000 than for a reduction of 4 in 100,000. Such results imply that the rate of substitution between money and the good, the dollars per unit, depends strongly on the amount of the good. If a risk reduction of 12 is worth $120 and a risk reduction of 4 is worth $100, then the dollars per unit of risk reduction are 10 and 25, respectively. If we extrapolate downward linearly, a risk reduction of 0 would be worth $90. Or we might think that the scale is logarithmic, so 1/3 of the risk reduction would be worth 5/6 of the price. So a risk reduction of 4 ·1/3·1/3, or .44, would be worth $100 ·5/6 ·5/6 or $69.44, or about $156 per unit of risk reduction. The dollars per unit could increase without limit. We cannot communicate the size of the error with a confidence interval - even on a logarithmic scale - because the confidence interval is potentially unbounded. This makes it difficult to generalize results to different amounts of money or risk, a generalization that is nearly always required. Even when such generalization is not required, such extreme insensitivity over small ranges raises questions about validity of any single estimate.
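The arithmetic of this example can be laid out as a short sketch. It uses only the hypothetical dollar figures given above, not data from Jones-Lee et al., and the name `log_scale_wtp` is introduced here purely for illustration.

```python
# Hypothetical figures from the text: $120 for a risk reduction of 12
# (per 100,000) and $100 for a risk reduction of 4.
wtp = {12: 120.0, 4: 100.0}

# Dollars per unit of risk reduction at each amount.
rate_12 = wtp[12] / 12
rate_4 = wtp[4] / 4

# Linear extrapolation downward: slope between the two points, extended to 0.
slope = (wtp[12] - wtp[4]) / (12 - 4)
wtp_at_zero = wtp[4] - slope * 4


def log_scale_wtp(steps):
    """Start from a reduction of 4 worth $100 and divide the reduction by 3
    `steps` times, assuming each division multiplies WTP by 5/6."""
    reduction = 4 * (1 / 3) ** steps
    dollars = 100 * (5 / 6) ** steps
    return reduction, dollars, dollars / reduction


# Dollars per unit keep growing as the risk reduction shrinks.
for steps in range(5):
    reduction, dollars, per_unit = log_scale_wtp(steps)
    print(f"reduction {reduction:6.3f}  WTP ${dollars:6.2f}  ${per_unit:8.2f} per unit")
```

Under the logarithmic assumption each further division by 3 multiplies the dollars per unit by 2.5, which is why no upper bound can be placed on the rate of substitution.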
The problem is not limited to WTP. It happens when respondents are asked to assign relative weights directly to non-monetary attributes. Typically, the attributes are given with explicit ranges, such as ``the difference between paying $10 and paying $20,'' and ``the difference between a risk of 1 in 10,000 and a risk of 2 in 10,000.'' The weights are typically undersensitive to the range of the attributes. If risk is judged to be twice as important as cost, this judgment is relatively unaffected when risk reduction is doubled (Weber & Borcherding, 1993). Keeney (1992, p. 147) calls this kind of undersensitivity to range ``the most common critical mistake.''
A third type of judgment suffers from the same problem, the judgment of the relative utility of two intervals. In health contexts, respondents are often asked to evaluate some condition, such as blindness in one eye, on a scale anchored at normal health and at death. Implicitly, they are asked to compare two intervals: normal - blind-in-one-eye, and normal - death. What happens when we change the standard, the second interval? Normatively, the judgment should change in proportion. For example, keeping normal at one end of each dimension, the utility of blind-in-one-eye relative to death should be the product of two other proportions: the utility of blind-in-one-eye relative to blindness (in both eyes), and the utility of blindness relative to death. In fact, people do not adjust sufficiently for changes in the standard (Ubel et al., 1996), just as they do not adjust sufficiently for changes in the magnitude of other dimensions involved in other judgments of tradeoffs. I shall call this phenomenon ``ratio inconsistency,'' since it is based on a product of ratios (following Baron et al., 1999). I shall also view these various forms of insensitivity as manifestations of the same problem. In principle, all known manifestations of insensitivity could be understood as tendencies to give the same answer regardless of the question. It is an open question whether the various forms of inconsistency can be explained in the same ways or not.
Undersensitivity to range can be reduced. Fischer (1995) found complete undersensitivity to range when respondents were asked simply to assign weights to ranges (e.g., to the difference between a starting salary of $25,000 and $35,000 and between 5 and 25 vacation days - or between 10 and 20 vacation days - for a job). When the range of vacation days doubled, the judged importance of the full range of days (10 vs. 20) relative to the range of salaries ($10,000) did not increase. Thus, respondents showed inconsistent rates of substitution depending on the range considered. Respondents were more sensitive to the range, with their weights coming closer to the required doubling with a doubling of the range, when they used either direct tradeoffs or swing weights. In a direct tradeoff, the respondent changed one value of the more important dimension so that the two dimensions were equal in importance, e.g., by lowering the top salary of the salary dimension. (Weights must then be inferred by either measuring or assuming a utility function on each dimension.) In the swing weight method, respondents judged the ratio between the less important and more important ranges, e.g., ``the difference between 5 and 25 vacation days is 1/5 of the difference between $25,000 and $35,000.''
In the direct tradeoff method, the range is given for one dimension only. The task is thus analogous to a free-response contingent valuation (CV) judgment, so we might still expect - and Fischer still found - some insensitivity. Baron and Greene (1996) found that this insensitivity could be reduced still further by giving no specific ranges for either dimension. Respondents were asked to produce two intervals, one on one dimension and one on the other, that were equally large in utility. For example, instead of asking ``How much would you be willing to pay in increased taxes per year to prevent a 10% reduction in acquisition of land for national parks?'', the two-interval condition asked subjects to give an amount of taxes and a percentage reduction that they would find equivalent. Of course, one end of each interval was zero.
Another way to measure tradeoffs is to ask for ratings of stimuli that vary in two or more dimensions. For example, the stimuli could be policies that differ in cost and amount of risk reduced. If the respondent produces enough of these judgments, we could fit simple models to her responses and infer how much of a change in one dimension is needed to make up for a change in another dimension, so that both changes together would yield the same rating. The rating response need not be a linear function of overall utility (but we could assume that it was for a first approximation). A great variety of methods use this general approach. The two most common terms are functional measurement (e.g., Anderson & Zalinski, 1988) and conjoint analysis (Green & Wind, 1973; Green & Srinivasan, 1990; Louviere, 1988).
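As a minimal sketch of this idea, assuming that the rating is a linear, additive function of price and travel time, the rate of substitution is the ratio of the two fitted slopes. The ratings below are made up for one hypothetical respondent, and `fit_additive` is an illustrative name, not a procedure taken from the studies cited.

```python
def fit_additive(prices, hours, ratings):
    """Least-squares slopes for rating = a + b_price*price + b_time*hours."""
    n = len(ratings)
    mean_p = sum(prices) / n
    mean_h = sum(hours) / n
    mean_r = sum(ratings) / n
    dp = [p - mean_p for p in prices]
    dh = [h - mean_h for h in hours]
    dr = [r - mean_r for r in ratings]
    s_pp = sum(x * x for x in dp)
    s_hh = sum(x * x for x in dh)
    s_ph = sum(x * y for x, y in zip(dp, dh))
    s_pr = sum(x * y for x, y in zip(dp, dr))
    s_hr = sum(x * y for x, y in zip(dh, dr))
    # Solve the 2x2 normal equations for the two slopes.
    det = s_pp * s_hh - s_ph ** 2
    b_price = (s_pr * s_hh - s_hr * s_ph) / det
    b_time = (s_hr * s_pp - s_pr * s_ph) / det
    return b_price, b_time


# Made-up ratings: each extra dollar costs 0.1 rating points,
# each extra hour of travel costs 2 rating points.
stimuli = [(p, h) for p in (80, 100, 120) for h in (0.5, 1.0, 1.5)]
prices = [p for p, _ in stimuli]
hours = [h for _, h in stimuli]
ratings = [9 - 0.1 * (p - 80) - 2 * (h - 0.5) for p, h in stimuli]

b_price, b_time = fit_additive(prices, hours, ratings)

# Dollars needed to offset one extra hour while leaving the rating unchanged.
dollars_per_hour = b_time / b_price
print(round(dollars_per_hour, 2))  # about 20 for these made-up ratings
```

If the rating is not linear in utility, the ratio of slopes is only a first approximation, as noted above.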
In such a method, the numbers given to the respondent on each dimension represent attributes that the respondent values, such as minutes or dollars. The value of these attributes does not, we assume, depend on what other things are available. Thus the tradeoff between a given change on one dimension and a given change on the other should be unaffected by the range of either dimension within the experimental session. If a change from 50 to 100 minutes is worth a change from $20 to $40, then this should be true regardless of whether the dollar range is from $20 to $40 or from $2 to $400. The need for invariance in the substitution of time and money arises from the basic idea of utility itself, which is that it is about goal achievement (Baron, 1994). The extent to which goals are achieved depends on what happens, not what options were considered.
Two exceptions should be noted, however. First, sometimes the options considered affect the outcome, through their effects on emotions. Winning $80 may seem better if that is the first prize than if $160 is the first prize, because of the disappointment of not winning the first prize. Second, in some cases the meaning of a description in terms of goal achievement will depend on the range. For example, the raw score on an examination can have a different effect on goal achievement as the range of scores is varied, if the examination is graded on a curve. Even when this is not true, respondents who know little about a quantitative variable may think of it this way, because they cannot evaluate the significance of the numbers (e.g., ``total harmonic distortion'' when buying an audio system - see Hsee, 1996).
Beattie and Baron (1991), using such a holistic rating task, found no effects of relative ranges on rates of substitution with several pairs of dimensions, but we found range effects with some dimensions, particularly those for which the numerical representation was not clearly connected to fundamental objectives, e.g., numerical grades on an exam. (The meaning of exam grades depends on the variance.) This gave us hope that holistic ratings could provide consistent and meaningful judgments of tradeoffs. Lynch et al. (1991) also found mostly no range effects for hypothetical car purchases, except in one study with novice consumers. (They used correlations rather than rates of substitution, however, so it is difficult to tell how much their results were due to changes in variance.) Mellers and Cooke (1994), however, found range effects in tasks where the relation of the numbers to fundamental objectives was clear, e.g., distance to campus of apartments.
The experiments I report here make me more pessimistic about holistic ratings. Although I cannot fully explain the discrepant results, I have been able to show that holistic ratings are generally subject to another effect that is potentially just as serious, specifically, a magnitude effect. People judge the utility of a change or a difference as a proportion of the overall magnitude of potential, even when the change alone is more closely related to the goal (Baron, 1997a). The result is that judgments are dependent on the maximum magnitude on each attribute scale. The classic example is the jacket-calculator problem of Tversky and Kahneman (1981; replicated under some conditions by Darke & Freedman, 1993).
Imagine that you are about to purchase a jacket for $125, and a calculator for $15. The calculator salesman informs you that the calculator you wish to buy is on sale for $10 at the other branch of the store, located 20 minutes' drive away. Would you make the trip to the other store? (Tversky & Kahneman, 1981, p. 457)
Most subjects asked were willing to make the trip to save the $5. Very few subjects were willing to make the trip to save $5 on the jacket, though, in an otherwise identical problem. In both cases, the ``real'' question is whether you would be willing to drive 20 minutes for $5. People judge the utility of saving $5 as a proportion of the total amount, rather than in terms of its effects on other goals, i.e., its opportunity cost. Baron (1997b) found a similar effect: subjects were less willing to pay for government medical insurance for diseases when the number of people who could not be cured was higher, holding constant the number who could be cured. When many people were not cured, the effect of curing a few seemed like a ``drop in the bucket'' and was thus undervalued.
Typically, magnitude and range effects are confounded. Magnitude is defined as the difference between the maximum and zero, and range is defined as the difference between the maximum and minimum. Usually experimenters who vary the range manipulate the maximum as well. Indeed, both Beattie and Baron (1991) and Mellers and Cooke (1994) had higher magnitudes of the maximum on each attribute whenever the range was higher. Evidently, magnitude effects do not always occur. The fact that they occur, however, makes the measure untrustworthy. The point is that they would occur if magnitude were varied enough, so the tradeoff that subjects make is specific to the magnitudes of the dimensions they are given.
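The two definitions, and the way they can be confounded or separated, can be illustrated with a small sketch. The price list is illustrative, and the two manipulations shown (multiplying every price by a constant, here 3, and adding a constant, here $100) are simplified versions of the kinds of manipulation used in the experiments reported in this chapter.

```python
def magnitude(values):
    """Difference between the maximum and zero."""
    return max(values) - 0


def value_range(values):
    """Difference between the maximum and the minimum."""
    return max(values) - min(values)


base_prices = [80, 90, 100, 110, 120]      # illustrative prices in dollars
tripled = [3 * p for p in base_prices]     # multiply every price by 3
shifted = [p + 100 for p in base_prices]   # add $100 to every price

for label, prices in [("base", base_prices), ("x3", tripled), ("+100", shifted)]:
    print(label, "magnitude:", magnitude(prices), "range:", value_range(prices))
```

Multiplying changes magnitude and range together, so their effects cannot be told apart; adding a constant changes magnitude while leaving the range as it was.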
Baron (1997b) suggests that the magnitude effect is part of a more basic confusion between similar (and often correlated) quantitative measures. Just as young children answer questions about number as if they were about length (a correlated attribute), and vice versa, so do adults answer questions about differences as if they were about ratios, and vice versa. Differences and ratios are correlated. Thus, in discussions of drug effects on risk, people talk about relative risk (e.g., ratio of breast cancer cases with the drug to cases without it) rather than change in risk (difference between cancer probability with and cancer probability without). It is the latter that is more relevant to decision making.
Both pure range effects and magnitude effects can result from the use of a proportionality heuristic. Someone who uses this heuristic evaluates a change on one attribute as a proportion of something else, even when it should be evaluated on its own. This is a reasonable heuristic to use when we know nothing about the meaning of an attribute. For example, when we evaluate the difference between 30 points and 40 points on a midterm exam, the meaning of this difference may well depend on whether the range of scores was 20-50 or 0-60.
In the rest of this chapter, I describe two sets of experiments. The first set shows the existence of magnitude effects in holistic ratings and describes some of the limits on their occurrence. The results are damaging to the idea of using holistic ratings to measure tradeoffs.
In the last two experiments, I explore a different approach, picking up where Loomes left off in his chapter. Perhaps we can measure value tradeoffs by working with the respondents, facing them with the inconsistencies in their judgments and asking them to resolve these inconsistencies. Decision analysts claim that consistency checks usually do not violate the respondent's best judgment, for example: ``... if the consistency checks produce discrepancies with the previous preferences indicated by the decision maker, these discrepancies must be called to his attention and parts of the assessment procedure should be repeated to acquire consistent preferences. ... Of course, if the respondent has strong, crisp, unalterable views on all questions and if these are inconsistent, then we would be in a mess, wouldn't we? In practice, however, the respondent usually feels fuzzier about some of his answers than others, and it is this degree of fuzziness that usually makes a world of difference. For it then becomes usually possible to generate a final coherent set of responses that does not violently contradict any strongly held feelings'' (Keeney & Raiffa, 1993, p. 271). Such checks can even improve the perceived validity of numerical judgments (e.g., Keeney & Raiffa, 1993, p. 200).
Baron et al. (1999) found evidence supporting these claims in studies of elicitation of health utilities. Consistency checks for the kind of ratio inconsistency described above led to no serious objections from the subjects. Moreover, different kinds of utility measures were more likely to agree when each measure was adjusted, by the subject, to make it consistent.
Experiment 1 built on the jacket-calculator problem. Subjects did three tasks:
Rating: Subjects rated purchases that differed in price and time, for attractiveness.
WTP: Subjects expressed their willingness to pay (WTP) money to save time, or time to save money.
Difference judgments: Subjects compared a time interval (e.g., ``the difference between 30 minutes and 1 hour'') and a price interval (e.g., ``the difference between $90 and $100''). They indicated which mattered more to them, and the relative sizes of the intervals in terms of what mattered.
Magnitude and range varied somewhat independently. Magnitude was manipulated by multiplying price by 3. Range was varied in the rating task by changing the first two items in each group of 8.
Fifty-three subjects - 38% males, 92% students, ages 17-52 (median 19) - completed a questionnaire on the World Wide Web. Subjects were solicited through postings on newsgroups and links from various web pages. They were paid $4, and they had to provide an address and social-security number in order to be paid.
The questionnaire had two orders. Order did not affect the results. The questionnaire had four sections, Ratings, WTP, Difference judgments, and Ratings again.
The ratings task began, ``Imagine you are buying a portable compact-disk player and you have settled on a brand that lists for $120. It is available at several stores, which differ in travel time from where you live (round trip), sale price, and terms of the warranty. Rate the following options for attractiveness on a scale from 1 to 9, where 1 means that you are very unlikely to choose this option and 9 means that you are very likely to choose it. Try to use the entire scale. The first two items are the worst and best.'' Items differed in price, travel time, and warranty. The warranty was not analyzed. It was used simply to create variation to allow duplicate presentation of items that were otherwise the same. It was counterbalanced with all other variables. A typical list of items to be rated was:
| Price | Travel time | Warranty | Rating ... |
|-------|-------------|----------|------------|
| $80   | 30 min.     | 1 year   |            |
| $90   | 1.5 hours   | 1 year   |            |
| $110  | 1 hour      | 1 year   |            |
| $90   | 1 hour      | 1 year   |            |
| $120  | 1.5 hours   | 1 year   |            |
| $110  | 30 min.     | 1 year   |            |
| $100  | 30 min.     | 1 year   |            |
| $100  | 1.5 hours   | 1 year   |            |
Notice that, in this list, the first two items in each group of 8 have a price range of $40 and a time range of 1 hour. In the contrasting condition, the time range was 2 hours (0 to 2 hours) and the price range was $20 ($90 to $110). In the high-magnitude conditions, price was simply multiplied by 3, so that the range was also multiplied by 3. Two goods, a CD player and a TV set, appeared in two orders, given to different subjects. In one order, the conditions were:
CD, low price, high price range (low time range);
TV, high price, high price range;
CD, low price, high time range (low price range);
TV, high price, high time range.
In the other order, the conditions were reversed.
In between the first two and second two ratings were the WTP and Difference conditions (always in that order). A typical item in the WTP condition read, ``You plan to buy a $110 CD player at a store that is 1 hour away. What is the most time you would be willing to spend traveling in order to buy it for $100 instead?'' or ``What is the most you would be willing to pay for one that is 30 min. away?'' The subject was instructed to answer in terms of total price or time. For the first order, the WTP conditions were ordered as shown in Table 1, and these were reversed for the second order.
Table 1. Goods used for WTP conditions in Experiment 1. In the rightmost column are the geometric means of the inferred dollars per hour.
| Good | Price | Time      | Price or time | Dollars/hour |
|------|-------|-----------|---------------|--------------|
| CD   | $90   | 1.5 hours | 30 min.       | $15.68       |
| CD   | $90   | 1.5 hours | 1 hour        | $21.63       |
| TV   | $270  | 1.5 hours | 30 min.       | $29.44       |
| TV   | $270  | 1.5 hours | 1 hour        | $38.86       |
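One way to infer dollars per hour from a single WTP response is sketched here. The rule itself is an assumption, since the text does not spell out the computation: the response is taken to make the two options equally attractive, so that the price difference is divided by the time difference. The responses are hypothetical, and `dollars_per_hour` is an illustrative name.

```python
def dollars_per_hour(price_a, hours_a, price_b, hours_b):
    """Implied value of an hour when options A and B are judged equally good.

    Assumed rule: absolute price difference over absolute time difference.
    """
    if hours_a == hours_b:
        raise ValueError("the two options must differ in travel time")
    return abs(price_a - price_b) / abs(hours_a - hours_b)


# Time response: a $110 CD player 1 hour away versus $100 elsewhere.
# A hypothetical respondent would travel at most 1.5 hours in total.
time_response = dollars_per_hour(110, 1.0, 100, 1.5)

# Money response: the same $110 player versus one 30 min. away.
# A hypothetical respondent would pay at most $125 in total.
money_response = dollars_per_hour(110, 1.0, 125, 0.5)

print(time_response, money_response)
```

Because the answer is stated as a total price or time, the difference from the starting option has to be taken before dividing.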
For the Difference judgment, a typical item was:
Which difference matters more to you?
1. The difference between $90 and $100 for a CD player.
2. The difference between 30 minutes and 1.5 hours travel time.
What percent of the larger difference is the smaller difference, in terms of how much it matters?
(In retrospect, the wording of this item is difficult to understand. In the data analysis, however, subjects were eliminated who showed misunderstanding by responding in the reverse way.)
For the first order, the items are shown in Table 2 (reversed for the second order). Notice that the range manipulation was in both price and time: when the price range is higher, the time range is lower. This makes the range manipulation stronger.
Table 2. Items used in the Difference task in Experiment 1. The table shows the good, the intervals compared, and the geometric mean implied dollars per hour of the responses.
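| Good | Price interval |     | Time interval          | Dollars/hour |
|------|----------------|-----|------------------------|--------------|
| CD   | $90 - $100     | vs. | 30 minutes - 1.5 hours | $15.54       |
| CD   | $90 - $110     | vs. | 30 minutes - 1 hour    | $24.06       |
| TV   | $270 - $300    | vs. | 30 minutes - 1.5 hours | $25.83       |
| TV   | $270 - $330    | vs. | 30 minutes - 1 hour    | $56.47       |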
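A difference judgment can be converted to implied dollars per hour along the following lines. This is a sketch under two assumptions that the text does not state: that utility is linear within each interval, and that the percentage response gives the smaller interval's importance as a fraction of the larger one's. The responses are hypothetical.

```python
def implied_dollars_per_hour(dollar_gap, hour_gap, larger, percent):
    """Dollars per hour implied by one difference judgment.

    larger  -- "time" or "price": the interval the subject says matters more
    percent -- the smaller interval as a percent of the larger one
    """
    ratio = percent / 100.0
    if larger == "time":
        # The price gap is only `ratio` of the time gap in importance.
        dollars_for_time_gap = dollar_gap / ratio
    elif larger == "price":
        # The time gap is only `ratio` of the price gap in importance.
        dollars_for_time_gap = dollar_gap * ratio
    else:
        raise ValueError("larger must be 'time' or 'price'")
    return dollars_for_time_gap / hour_gap


# $90 - $100 vs. 30 minutes - 1.5 hours; time matters more, price gap is 50%.
first = implied_dollars_per_hour(10, 1.0, "time", 50)

# $90 - $110 vs. 30 minutes - 1 hour; price matters more, time gap is 60%.
second = implied_dollars_per_hour(20, 0.5, "price", 60)

print(first, second)
```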
The design permitted an inference of the tradeoff between dollars and hours in all conditions. For ratings, I calculated the orthogonal contrast for the price and time effects on ratings, and took the ratio. I calculated the geometric mean across subjects and did statistical tests on the logarithms. (It is arbitrary whether to use the ratio or its reciprocal. Using the log means that this choice affects only the sign, not the distance from zero.)
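The aggregation step can be sketched as follows, starting from one dollars-per-hour value per subject in each of two conditions (the contrast computation itself is not shown). The values are made up, and the use of a paired t statistic is an assumption about the form of the test.

```python
import math
from statistics import mean, stdev


def geometric_mean(values):
    """Exponential of the mean logarithm."""
    return math.exp(mean([math.log(v) for v in values]))


def paired_t_on_logs(cond_a, cond_b):
    """Paired t statistic on log(a) - log(b), with its degrees of freedom."""
    diffs = [math.log(a) - math.log(b) for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Made-up dollars per hour for five hypothetical subjects.
low_price = [10.0, 20.0, 40.0, 16.0, 25.0]
high_price = [30.0, 80.0, 90.0, 40.0, 100.0]

gm_low = geometric_mean(low_price)
gm_high = geometric_mean(high_price)
t, df = paired_t_on_logs(high_price, low_price)

# Using hours per dollar instead (the reciprocal) only flips the sign of t.
t_recip, _ = paired_t_on_logs([1 / v for v in high_price],
                              [1 / v for v in low_price])

print(round(gm_low, 2), round(gm_high, 2), round(t, 2), df, round(t_recip, 2))
```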
For ratings, the inferred monetary value of time was affected by magnitude (confounded with range) but unaffected by range alone. The (geometric mean) values were (for subjects who had sufficient data in both conditions being compared): $20.32 for low amounts of money vs. $89.33 for high amounts (t47 = 13.71, p = .0000); and $43.60 when the money range was small (and the time range large) vs. $46.13 when the money range was large (and time small).
For WTP, geometric means of inferred dollars per hour are shown in Table 1. T tests on the means of the relevant conditions (e.g., all the high-dollars vs. all the low dollars) showed that time was worth more when the dollar magnitude was higher (t52 = 11.4, p = .0000) and when the subject responded with money rather than time (t44 = 8.14, p = .0000). Subjects also paid more for time when the range of time was small or when the range of money was large, holding magnitude constant (t51 = 4.34, p = .0001). In sum, the WTP measure showed both magnitude and range effects, whereas the rating measure showed only a magnitude effect.
Difference judgments also showed effects of both range (t47 = 3.85, p = .0004) and magnitude (t47 = 5.69, p = .0000), as shown in Table 2. Magnitude was confounded with range. So these effects can be seen as a replication of the finding that matching judgments are insensitive to range (e.g., Fischer, 1995; see Baron, 1997a, for discussion).
To summarize the results, all three measures - holistic ratings, willingness to pay, and difference judgment - were affected by magnitude (confounded with range), but only difference judgments and WTP were affected by range alone. One explanation of these results is that the WTP and difference tasks presented two ends of the range to be compared, and this encouraged subjects to consider these two ends as the relevant reference points. Holistic ratings, by contrast, may have allowed subjects to adopt an implicit zero as the low end of each range.
Whatever the explanation, the fact remains that magnitude effects render these tasks unsatisfactory as measures of tradeoffs.
Experiment 2 manipulated the range and magnitude locally, within each group of four hypothetical purchases, by presenting two items to establish a range and then another two to test the effect of the first two. Range was manipulated by holding constant the top of each dimension and varying the bottom: in one condition, the money ranged from $120 to $100 and the time from 120 to 0 min., and, in the other condition, the money ranged from $120 to $80 and the time from 120 to 60 min. The magnitude manipulation simply added $100 to the price, holding range constant.
Eighty subjects - 25% males, 51% students, ages 16 to 51 (median 23) - completed a questionnaire on the World Wide Web for $5. The questionnaire began:
Purchases: time and money
This is about how people make tradeoffs between time and money when they buy consumer goods. Imagine that all the items refer to some piece of audio or video equipment like a compact-disk player or a TV. You have decided to buy a certain model in the price range indicated on each screen. The issue is whether you are willing to travel some distance in order to save money on the price.
Half the time, you will evaluate one purchase at a time on a 9-point scale (1=very unlikely to buy, 5=indifferent, 9=very likely to buy). The rest of the time, you will compare two purchases, also on a 9-point scale (1=A is much better, 5=equal, 9=B is much better). Some purchases will be repeated several times. This is not to annoy you but to make sure that you pay attention to their existence. When you see these repeated purchases, you don't have to give the same answer you have given before, but you can if you want.
There are 56 screens of questions (2 or 4 questions on a screen), followed by a few questions about you.
Each single-purchase evaluation (evaluation, for short) screen had four purchases, and each purchase-comparison screen had two. The purchases were described in terms of price and time, e.g., ``$100, 60 minutes.'' Table 3 shows the base values used for both evaluation and comparison conditions:
Table 3. Base conditions for Experiment 2. Each row represents the items presented on one screen. In the comparison condition, the subject compared A and B, and then C and D. In the evaluation condition, the subject evaluated A, B, C, and D. In cases 1-7, the time range is high (0-120) and the dollar range low (100-120). In cases 8-14, the time range is low (60-120) and the dollar range high (80-120).
Each comparison screen presented a comparison of A and B, and of C and D. Each evaluation screen presented A, B, C, and D separately. Notice that, within this basic design, the first 7 purchases have a high range of times (0-120 min.) for purchases A and B, and a low range of prices ($100-$120). The second 7 purchases are the reverse (60-120 min. vs. $80-$120). The last two purchases in each screen are the same for the corresponding items. Thus, effects of the relative ranges of times and prices are determined by examining the responses to purchases C and D. Notice also that the tops of the ranges ($120 and 120 min.) are constant within the items in the basic design.
This basic design was replicated four times, to make the 56 screens. Replications 1 and 2 were comparisons, 3 and 4 were evaluations. (Because of a programming error, evaluation data were lost for 28 subjects, leaving 52.) Replications 2 and 4 extended the magnitude of prices, but not the range, by adding $100 to each price. Comparisons of replications 2 with 1, and 4 with 3, then, test for a magnitude effect.
The order of the 56 screens was randomized separately for each subject.
As a measure of the tradeoff, I computed the relative preference for the option with lower price (and higher time). If people evaluate price and time with respect to their ranges, this relative preference would be greater when the range of prices is small and the range of times is high. This result occurred in the evaluations (t51 = 4.67, p = .0000) but not in the comparisons (t = 0.94). For the evaluation items, when price range was small, subjects favored the low-priced item by a mean rating difference of .30, but when the range was high, they favored the low-time item by .34.
A simple explanation of this result is that, in the evaluation condition, subjects attend to all four items presented on each screen. When one of the items contains a very low price, they give it a high rating, but then they feel obliged to give a lower rating to the item that does not have such a low price. In sum, for the evaluation items, the first two items set up a range of responses. In the comparison items, on the other hand, subjects simply compare the two items they are given. They do not feel bound by their responses to other items on the same screen.
Subjects showed no significant magnitude effect in either condition. Although this result seems optimistic, the presence of a range effect undercuts the optimism for using this task to measure value tradeoffs. The magnitude effect may depend on encouraging the subject to use 0 as one of the reference points. When both ends of the dimension are explicitly stated (e.g., 120 minutes and 80 minutes) - rather than leaving it implicit that one end is 0 - range effects may take over.
Experiments 1 and 2 show either range effects or magnitude effects in holistic rating. Despite the promising results of Beattie and Baron (1991), holistic rating tasks do not seem to provide consistent measures of tradeoffs. The measures they provide seem to depend on what subjects use as the top and bottom reference points of each scale.
Another approach to eliciting consistent tradeoffs is to face the respondent with her inconsistencies and ask her to resolve them. That is difficult to do in holistic rating tasks, because the respondent would have to deal with a great many responses at once. When the respondent makes direct judgments of relative magnitude, however, resolution of inconsistency might be easier.
Experiment 3 is an example of one method that might be used to help in the resolution of inconsistency. It involves the comparison of utility intervals. Examples of possible intervals include ``the difference between 60 and 120 minutes,'' ``the difference between $90 and $120,'' and ``the difference between normal health and complete hair loss.'' The last sort of difference is of interest for measurement of health utilities. For example, if we wanted to determine whether the benefit of chemotherapy for cancer is worth the cost, part of the cost might be the side effects of the therapy. A standard way to measure utilities in health is to compare everything to the interval between ``normal health'' and ``death.'' Policy makers often assume that this interval has the same utility for everyone.
Experiment 3 concerns health intervals of this sort rather than those involving time and money. The subject judges the utility of interval A as a proportion of B, B as a proportion of C, and A as a proportion of C. The AC proportion should be the product of the AB proportion and the BC proportion. Typically, the AC proportion is too high (as I noted earlier), which is a kind of insensitivity to the standard of comparison.
In the method used here, the subject is forced to resolve the inconsistency but is not told how to resolve it. The subject answers three questions on a computer screen. Then, if they are inconsistent, buttons appear on the screen next to each judgment. Each button says ``Increase'' or ``Decrease'' according to whether the judgment is too low or too high, respectively, relative to the other two judgments. Each button raises or lowers its associated response by one unit. The subject can make the responses consistent by clicking any or all of the buttons.
This experiment used three different methods for comparing intervals: time tradeoff (TTO), standard gamble (SG), and direct rating (in two versions, DT and DP, to be described). In the TTO method, the subject made a judgment of how many weeks with one health condition was equivalent to 100 weeks with a less serious health condition. The ratio of the answer to 100 is taken as a measure of the utility of the less serious health condition relative to the more serious one, on the assumption that time and utility multiply to give a total utility. In the SG method, the subject gives a probability of the more serious health condition, and this is taken as a measure of the utility of the less serious condition, on the assumption that the expected utility is what matters. The direct rating method asks simply for a comparison of the intervals.
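The scoring rules just described for TTO and SG can be sketched as follows. This is only an illustration of the arithmetic, not the program used in the experiment; the function names and the 100-week default are assumptions introduced here.

```python
def tto_relative_utility(weeks_severe: float, weeks_mild: float = 100.0) -> float:
    """Time tradeoff: the subject says how many weeks of the more severe
    condition are as bad as `weeks_mild` weeks of the less severe one.
    The ratio is read as the disutility of the less severe condition
    relative to the more severe one (assumes time and utility multiply)."""
    if weeks_mild <= 0:
        raise ValueError("weeks_mild must be positive")
    return weeks_severe / weeks_mild


def sg_relative_utility(percent_chance_severe: float) -> float:
    """Standard gamble: the subject gives the percent chance of the more
    severe condition that is as bad as the less severe condition for sure.
    Under expected utility, that probability is the relative disutility."""
    if not 0 <= percent_chance_severe <= 100:
        raise ValueError("percent chance must be between 0 and 100")
    return percent_chance_severe / 100.0


# Hypothetical responses: 40 weeks of S2 equated with 100 weeks of S1,
# and a 25% chance of S2 equated with S1 for sure.
print(tto_relative_utility(40))   # 0.4
print(sg_relative_utility(25))    # 0.25
```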
The intervals to be compared were constructed by manipulating either the health condition or its probability or duration. Each interval was bounded by normal health at one end. Two health conditions were used for the other end of each set of intervals, one more severe and one less severe. For TTO, and DP (where P stands for probability), the third condition was a 50% chance of the less severe health condition. For SG and DT (T for time), the third condition was 50 weeks of the less severe condition, instead of 100 weeks.
The idea of manipulating a health condition by changing its probability comes from Bruner (1999). Bruner was interested in measuring the utilities of the major side effects of prostate cancer treatments, sexual impotence and urinary incontinence. She used the time-tradeoff method. She asked subjects, in effect, how much of their life expectancy they would sacrifice rather than have a treatment that would give them an 80% chance of impotence, or a 40% chance, for example. Over a wide range of probabilities, the answer to this question was insensitive to probability. Subjects' willingness to sacrifice their life expectancy did not depend on whether the probability of impotence was 40% or 80% (although they were a little less willing when it went up to 99%). This sort of insensitivity to probability makes the measure useless as a way of eliciting judgments of the utility of impotence.
The critical question is the one that compares the discounted less severe health condition (50% or 50 weeks) with the more severe condition. For this to be a good utility measure, the answer should be half of that to the question that compares the non-discounted less severe condition to the more severe condition. Will the adjustment process lead to this result?
Sixty-three subjects completed a questionnaire on the World Wide Web, for $3. The subjects were 60% female, 51% students, and had a median age of 24 (range: 13 to 45). Three additional subjects were not used because they gave the same initial answer to every group of items.
The introduction to the study, called ``Health judgments,'' began:
This study is about different ways of eliciting numerical judgments of health quality. If we could measure the loss in health quality from the side effects of various cancer treatments, for example, we could help patients and policy makers decide whether the benefits of treatment are worth the costs in loss of quality.
The side effects are always written in CAPITAL LETTERS. Here are the effects:
HAIR LOSS (complete)
NAUSEA (food consumption half normal)
DIARRHEA (three times per day)
FATIGUE (enough to be unable to work)
There are also combinations of effects.
In some question, you make two options equal by saying how much time with one side effect is equivalent to a longer time with some other side effect that isn't so bad. In some cases, the side effects are not certain.
Make sure to try the practice items before going on.
In one kind of question, you give a time. You must answer with a number from 0 to 20. Feel free to use decimals. Here is an example (using deafness):
A. 100 weeks with deafness.
B. 50 weeks with blindness and deafness.
To answer this, you must pick a number for B so that the two options are equal. Try picking different numbers of weeks for B, going up and down, until you feel A and B are equal. Do this now by clicking on one of these two buttons:
The buttons were labeled ``A is worse now'' and ``B is worse now.'' Clicking one button adjusted the number of days in the box by smaller amounts. The next practice item used probability instead of time to equate two options. Subjects were also told about the rating items, and they were told the number of items. Finally, they were told:
After you enter your answers, the buttons will suggest changes in your numbers. They will say ``Increase'' or ``Decrease.'' Please choose the button that is most consistent with your true judgment of the conditions. Keep clicking one button or another until you are told you can go on. I am interested in how you choose to adjust your responses when you are forced to adjust them. ...
The items were worded as follows, with S1 being the less severe of two symptoms and S2 the more severe. S1 was always one of the four symptoms listed. S2 was either two of the symptoms, including S1 (e.g., NAUSEA AND FATIGUE when S1 was NAUSEA) or all four. (Each symptom occurred equally often as a member of the pair.)
Time tradeoff
Fill in each blank so that the two options are equal.
A. 50% chance of S1 for 100 weeks
B. S1 for ___ weeks
A. S1 for 100 weeks
B. S2 for ___ weeks
A. 50% chance of S1 for 100 weeks
B. S2 for ___ weeks
Standard gamble
Fill in each blank so that the two options are equal.
A. S1 for 50 weeks
B. ___% chance of S1 for 100 weeks
A. S1 for 100 weeks
B. ___% chance of S2 for 100 weeks
A. S1 for 50 weeks
B. ___% chance of S2 for 100 weeks
Direct judgment (time)
If the difference between normal health and 100 weeks of S1 is 100, how large is the difference between normal health and 50 weeks of S1?
If the difference between normal health and 100 weeks of S2 is 100, how large is the difference between normal health and 100 weeks of S1?
If the difference between normal health and 100 weeks of S2 is 100, how large is the difference between normal health and 50 weeks of S1?
Direct judgment (probability)
If the difference between normal health and S1 is 100, how large is the difference between normal health and a 50% chance of S1? All the symptoms in this example are for 100 weeks.
If the difference between normal health and S2 is 100, how large is the difference between normal health and a 100% chance of S1?
If the difference between normal health and S2 is 100, how large is the difference between normal health and a 50% chance of S1?
To the right of each response box was a button, blank at the outset. After the responses were filled in, the program first checked whether the third was less than each of the others and required a change of answers if it was not. Then the program checked to see whether they were consistent. Consistency was defined in terms of the relation of the three responses: after all the responses were divided by 100, the third response had to be the product of the other two, to the nearest unit. If the responses were consistent, the subject could go on to the next screen. If the third response was too high, the word ``Increase'' appeared on the first two buttons and ``Decrease'' appeared on the third button. The subject clicked any of the three buttons until told to go on. Each button adjusted the response by 1 unit, up for increases and down for decreases. (The subject could also type in the response.) If the third response was too low, ``Increase'' and ``Decrease'' were switched. The subject had to make the responses consistent before going to the next screen.
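The checking logic described above might be sketched as follows. This is a reconstruction from the description, not the original program: the function names are hypothetical, and treating ``to the nearest unit'' as a tolerance of half a unit on the 0-100 response scale is an assumption.

```python
from typing import Optional, Tuple

Labels = Tuple[str, str, str]


def needs_reordering(r1: float, r2: float, r3: float) -> bool:
    """First check: the third response must be less than each of the others."""
    return not (r3 < r1 and r3 < r2)


def button_labels(r1: float, r2: float, r3: float,
                  tolerance: float = 0.5) -> Optional[Labels]:
    """Second check: return None if the responses are consistent, otherwise
    the labels for the three buttons. Responses are on a 0-100 scale, so
    the product of the first two (as proportions) is r1 * r2 / 100.
    The half-unit tolerance is an assumed reading of 'nearest unit'."""
    target = r1 * r2 / 100.0
    if abs(r3 - target) <= tolerance:
        return None                                   # consistent: go on
    if r3 > target:                                   # third answer too high
        return ("Increase", "Increase", "Decrease")
    return ("Decrease", "Decrease", "Increase")       # third answer too low


# Hypothetical screen: 50 and 40 imply a third answer of 20.
print(button_labels(50, 40, 20))   # None
print(button_labels(50, 40, 30))   # ('Increase', 'Increase', 'Decrease')
```

Because any of the three buttons moves the responses toward consistency, the subject is free to decide which judgment to revise, which is the point of the method.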
Subjects were initially inconsistent and insensitive to probability and time, as expected. The requirement for them to become consistent made them more sensitive to probability and time.
The measure of inconsistency for each screen was the log (base 10) of the ratio of the third answer to the product of the first two answers (after dividing all answers by 100). This would be 0 if responses were consistent. The mean inconsistency over all four elicitation methods was .0245 (t(63) = 3.45, p = .0010), which implies that the third answer was about 6% too high, averaged in this way. The four methods differed in the size of this effect (F(3,189) = 5.57, p = .0011): .0422 for time tradeoff, .0335 for gambles, .0098 for direct-judgment (probability), and .0123 for direct-time.
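As a check on the arithmetic, the following sketch computes the per-screen measure and converts a mean log inconsistency back into a percentage. The function names are illustrative assumptions; only the formula and the value .0245 come from the text.

```python
import math


def log_inconsistency(r1: float, r2: float, r3: float) -> float:
    """Log (base 10) of the third answer over the product of the first two,
    after dividing all answers by 100. Zero means consistent."""
    p1, p2, p3 = r1 / 100.0, r2 / 100.0, r3 / 100.0
    return math.log10(p3 / (p1 * p2))


def percent_too_high(mean_log: float) -> float:
    """Back-transform a mean log10 inconsistency into the percentage by
    which the third answer exceeds the product of the first two."""
    return (10 ** mean_log - 1) * 100


print(round(percent_too_high(0.0245), 1))   # 5.8, i.e., about 6%
```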
The main measure of insensitivity to probability and time was the ratio of the third answer to the second, minus .5. (The normative standard was .5.) This measure was positive, as expected if subjects adjust too little for the change in probability and time between the second and third answers. The mean was .0260 (t(63) = 2.96, p = .0043). The four methods differed in the size of this effect (F(3,189) = 6.49, p = .0001): .0145 for time tradeoff, .0475 for gambles, .0284 for direct-judgment (probability), and .0135 for direct-time. (Note that these differences cannot be understood as involving effects of time vs. probability.)
The response to the first question did not differ significantly from .5 overall, although the four methods differed (F(3,189) = 5.33, p = .0015), with means of .4714 for time tradeoff, .5083 for standard gamble, .5042 for direct-time, and .5106 for direct-probability.
The main result of interest concerned the ratio of the responses to the second and third questions. This ratio should have been .5, but its mean over all methods was, as noted, too high by .0260. After the adjustment for consistency, the mean excess was .0050, not significantly different from 0. The change was significant (t(63) = 3.65, p = .0005). However, this result could arise artifactually if the adjustment button on the third answer said Decrease more often than it said Increase, assuming that the direction of change did not affect the magnitude of change. Accordingly, I computed the measure for the Increase and Decrease trials separately. The average change for the Decrease trials was .0842 (in a downward direction), and the average for the Increase trials was .0479 (in an upward direction). For the 52 subjects who had data for both cases, the mean difference between these was .0382. That is, the downward change was greater than the upward change so that, on the whole, subjects became more consistent (t(51) = 3.36, p = .0015). Thus, the benefit of the adjustment is not simply the result of forcing subjects to move in the direction required. When they were forced to move in this direction, they moved more than when they were forced to move in the opposite direction. They also moved more often in the former direction (66% of the possible cases vs. 56%; t(51) = 1.76, p = .0422, one-tailed).
Experiment 4 illustrates another approach to consistency adjustment. Subjects are given an estimate of what their responses would be if they were consistent. Unlike Experiment 3, the subjects do not have to adjust their responses. They are given the adjusted responses as a suggestion only. At issue is whether they will accept the suggestion and become more consistent.
Experiment 4 used three different health conditions, rather than using two health conditions one of which was discounted. It used only two methods, time tradeoff and direct rating.
Fifty-eight subjects completed a questionnaire on the World Wide Web, for $5. The subjects were 65% female, 38% students, and had a median age of 27 (range: 12 to 69).
The introduction to the study, called ``Health judgments,'' began:
This study is about different ways of eliciting numerical judgments of health quality. If we could measure the loss in health quality from various conditions, we could measure the benefits of treating or preventing these conditions. This would allow more efficient allocation of resources.
The conditions we consider are:
NEARSIGHTEDNESS (need glasses)
BLINDNESS IN ONE EYE
PARTIAL DEAFNESS (hearing aid restores normal hearing)
DEAFNESS IN ONE EAR (complete, hearing aid doesn't help)
LOSS OF WALKING IN ONE LEG
LOSS OF WALKING IN BOTH LEGS
PARALYSIS OF ALL LIMBS
SPLINT ON INDEX FINGER (dominant hand)
SPLINT ON HAND (dominant side)
SPLINT ON ARM (dominant side)
CAST ON FOOT
CAST ON LEG
CAST ON BOTH LEGS
LOSS OF EYEBROWS
LOSS OF HAIR ON FACE AND HEAD
LOSS OF ALL HAIR (including face and head)
Notice that these conditions are in six groups of three. Within each group, the conditions are ordered in severity. The subject had to do a time-tradeoff practice item before beginning, as in the last experiment. They were also told about the rating items. They were told the number of questions and encouraged to use decimals in their answers.
The first 36 trials contained 18 time-tradeoff judgments and 18 direct judgment items. Each time-tradeoff item was introduced with ``How many days makes these two outcomes equal?'' The direct judgment items were worded the same as the practice item.
The subject made each type of judgment three times for each of the six groups of conditions. Within each group of conditions, the subject compared the first and second, first and third, and second and third. S1 and S2 thus stand for the conditions being judged, e.g., ``BLINDNESS IN ONE EYE'' and ``TOTAL BLINDNESS.'' By using all three comparisons, I could test for internal consistency. In particular, the judgment of the extremes (first as a proportion of third) should be the product of the other two judgments (as proportions). These 36 trials were presented in a random order, each on its own screen, which disappeared when the subject responded.
After these 36 trials, the subject saw 24 screens with three judgments to a screen, again in a random order. These consisted of two types of judgments, each in a trained and untrained version, for each of the six groups of conditions.
In both trained and untrained conditions, each screen began: ``Please respond to all three items again in the boxes provided. You do not need to give the same response you gave, and you do not need to make your answers consistent. Try to make your answers reflect your true judgment.''
In the trained direct-judgment condition, the next paragraph read, ``The second column shows one way to make the ratios of your answers agree. The second row percentage is the product of the percentage in the first and third rows.'' In the trained time-tradeoff condition, the last sentence read, ``The second row ratio of days (to 100) is the product of the ratios in the first and third rows. This assumes that all days count equally.''
The subject then saw a table with the items on the left and either two or three columns of numbers, for example:
[Example table: three rows, each stating a comparison of the form ``... was 5% as bad as ...'' (e.g., involving DEAFNESS IN ONE EAR), followed by the columns of numbers.]
For the time tradeoff, the upper left entry would have read, ``100 days of DEAFNESS IN ONE EAR was as bad as 5 days of PARTIAL DEAFNESS,'' and the second column would contain days instead of percent. For the untrained condition, the second column was omitted. Let us refer to the three comparisons as AB, BC, and AC, for the three rows, respectively. AB and BC are adjacent, and AC is extreme.
The consistent values in the second column were computed so as to preserve the ratio of the two adjacent comparisons and otherwise make the responses consistent. The two adjacent comparisons (AB and BC) were multiplied by a correction factor, and the extreme comparison (AC) was divided by the same factor. (The correction factor was not constrained to be more or less than one.) In particular, the correction factor was $[AC/(AB \cdot BC)]^{1/3}$. The corrected values were rounded to the nearest unit, but the subject was encouraged to use decimals.
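The correction can be illustrated with a short sketch. The cube root follows from requiring the corrected extreme comparison to equal the product of the corrected adjacent comparisons: if f is the factor, then AC/f = (AB f)(BC f), so f cubed is AC/(AB BC). The function name is hypothetical, and computing the factor on proportions (responses divided by 100) rather than on raw percentages is an assumption.

```python
from typing import Tuple


def consistent_values(ab: float, bc: float, ac: float) -> Tuple[int, int, int]:
    """Given three responses as percentages, return suggested consistent
    values, rounded to the nearest unit. AB and BC are multiplied by the
    correction factor and AC is divided by it, which preserves the ratio
    AB/BC. Assumes the factor is computed on proportions (0-1)."""
    p_ab, p_bc, p_ac = ab / 100.0, bc / 100.0, ac / 100.0
    factor = (p_ac / (p_ab * p_bc)) ** (1.0 / 3.0)
    return (round(p_ab * factor * 100),
            round(p_bc * factor * 100),
            round(p_ac / factor * 100))


# Hypothetical responses of 50% to all three comparisons: the extreme
# comparison is too high relative to the product of the adjacent ones (25%).
print(consistent_values(50, 50, 50))   # (63, 63, 40)
```

In this example the suggested adjacent values multiply to about 40% (.63 times .63 is about .397), which matches the suggested extreme value after rounding.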
For each condition group, I computed a measure of inconsistency: $\log_{10}[AC/(AB \cdot BC)]$. (The direction of the ratio is arbitrary. It could be inverted. The log ensures that inversion would affect only the sign of the inconsistency, not its magnitude.) I averaged this measure over the six sets of conditions, for each of the two methods, in the initial, trained, and untrained conditions. I also computed an absolute-value inconsistency measure for each condition group, and averaged it in the same way. Table 4 shows the means of these two measures for the two methods.
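The difference between the signed and absolute measures can be seen in a small sketch. The data are invented for illustration, the function names are assumptions, and the responses are taken to be proportions (already divided by 100).

```python
import math
from statistics import mean
from typing import Iterable, Tuple


def signed_inconsistency(ab: float, bc: float, ac: float) -> float:
    """log10 of AC over the product AB * BC; zero when consistent.
    Inverting the ratio would flip the sign but keep the magnitude."""
    return math.log10(ac / (ab * bc))


def summarize(groups: Iterable[Tuple[float, float, float]]) -> Tuple[float, float]:
    """Mean signed and mean absolute inconsistency over condition groups."""
    signed = [signed_inconsistency(ab, bc, ac) for ab, bc, ac in groups]
    return mean(signed), mean(abs(s) for s in signed)


# Invented responses for three groups: consistent, AC too high, AC too low.
groups = [(0.5, 0.5, 0.25), (0.5, 0.5, 0.5), (0.5, 0.5, 0.125)]
mean_signed, mean_absolute = summarize(groups)
print(round(mean_signed, 3), round(mean_absolute, 3))
```

Errors in opposite directions cancel in the signed mean but not in the absolute mean, so the signed measure indexes systematic bias and the absolute measure indexes overall inconsistency.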
Table 4. Mean inconsistency measures for the two methods.
|Method|Inconsistency (signed)|Absolute inconsistency|
|Direct judgment, initial|.25|.35|
|Direct judgment, untrained|.08|.21|
|Direct judgment, trained|.04|.14|
The trained items were more consistent than the untrained, which, in turn, were more consistent than the initial items, by both signed and absolute measures. I tested this with four analyses of variance, one for initial vs. untrained and one for untrained vs. trained, for each of the two inconsistency measures. It is superfluous to compare initial and trained. But the comparison of initial and untrained tests the (confounded) effects of doing the items together in a group and doing them for the second time. The initial vs. untrained effect was significant for signed measures (F(1,57) = 68.2, p = .0000) and for unsigned measures (F(1,57) = 62.0, p = .0000). The effect of time-tradeoff vs. direct judgment was significant only for the absolute measures (F(1,57) = 17.9, p = .0001): time-tradeoff was less consistent. In neither case was the interaction between method and initial vs. untrained significant. The improvement that resulted from presenting the items together and again was present for both methods.
Inconsistency was smaller in trained than untrained for both signed and unsigned measures (F(1,57) = 8.14, p = .0060, and F(1,57) = 51.6, p = .0000, respectively). Again, the effect of method was significant only for the absolute measure (F(1,57) = 13.1, p = .0006). The interaction between training and method was not significant. Training improves consistency in both methods.
The first two experiments add to existing demonstrations that holistic ratings are sometimes subject to extraneous influences in the form of range effects or magnitude effects. We can account for these effects in general by assuming that subjects adopt two reference points, top and bottom, for each dimension and evaluate the position of an item relative to these reference points, at least some of the time. That is, they think of variation along the dimension as a proportion of the distance from top to bottom rather than as an absolute change along a dimension whose units have value in their own right. This is sometimes a reasonable method of evaluation, e.g., in evaluating examination grades. But it is used even when the subject can evaluate the units in their own right.
What is adopted as the top and bottom is somewhat variable and dependent on details of the task. Experiment 1 found magnitude effects in holistic ratings, WTP, and difference judgment, in the tradeoff of time and money. It found range effects in WTP and difference judgment but not in holistic ratings. As noted, the rating task may have differed from the others in that subjects might have found it easier to adopt zero as the implicit bottom of the range.
Experiment 2 used four items at a time, with the first two items setting the range. It found range effects when subjects evaluated items one at a time, but not when they compared one item to the other within a single question. A possible explanation of this result is that the comparison format provides its own context, so subjects ignore the context in previous questions. If so, the direct comparison may be helpful in overcoming range effects. This conclusion would be similar to that of Fischer (1995). Note, however, that the direct comparison is very much like the direct judgment tasks used in Experiments 3 and 4.
In those experiments, subjects compared two intervals rather than making a judgment of a single two-attribute stimulus. As found in previous studies (Baron et al., 1999; Ubel et al., 1996), all of these measures showed ratio inconsistency: subjects did not give small enough numbers when they compared a small interval to a much larger one (or, conversely, they did not give large enough numbers in their other responses). When this inconsistency was called to their attention, responses became more consistent. This is the recommended approach of applied decision analysis, and so far it seems to work, at least in the sense that it yields usable, consistent answers.
Holistic judgments have other problems. When respondents are asked to rate multiattribute stimuli with several attributes, they seem to attend only to a couple of attributes that they find particularly important, thus ignoring the less important attributes too much (Hoffman, 1960, Figs. 3-7; von Winterfeldt & Edwards, 1986, p. 365). However, this is likely to be a less serious problem when respondents rate two attributes at a time. Still, the existence of range and magnitude effects seems difficult to avoid. The only way to avoid it seems to be to present explicit intervals for comparison.
This claim is consistent with the finding of Birnbaum and his colleagues (1978; Birnbaum & Sutton, 1992) that subjects asked to judge the ratio of two stimuli respond (with a nonlinear response function) to the difference between the stimuli rather than to the ratio of their distances from zero (no stimulation, in a sensory task). However, when subjects are asked for ratios of differences - e.g., what is the ratio between the utility (or loudness, etc.) difference between A and B and the difference between C and D? - they base their responses on the ratio of the differences, and not the difference of the differences. It would seem that the two-stimulus ratio task does involve four stimuli, because a reference point is implied, e.g., zero loudness or normal health. Birnbaum's result can be taken to imply, however, that we must state the reference point explicitly if we want subjects to use it, so we do this when we ask about differences.
Explicitness in stating the ends of ranges being compared is one of the prescriptions of decision analysis (Fischer, 1995), but it is not used routinely in other value-elicitation tasks. The results reported here suggest that such explicitness in the comparison of intervals is a good starting point for value elicitation. The rest of the process involves applying consistency checks and asking respondents to make adjustments. The checks used here are only examples of many others that could be used.
Baron, J. (1994). Thinking and deciding (2nd ed.). New York: Cambridge University Press.
Baron, J. (1997a). Biases in the quantitative measurement of values for public decisions. Psychological Bulletin, 122, 72-88.
Baron, J. (1997b). Confusion of relative and absolute risk in valuation. Journal of Risk and Uncertainty, 14, 301-309.
Baron, J., & Greene, J. (1996). Determinants of insensitivity to quantity in valuation of public goods: contribution, warm glow, budget constraints, availability, and prominence. Journal of Experimental Psychology: Applied, 2, 107-125.
Baron, J., Wu, Z., Brennan, D. J., Weeks, C., & Ubel, P. A. (1999). Analog scale, ratio judgment and person trade-off as utility measures: biases and their correction. Manuscript.
Beattie, J., & Baron, J. (1991). Investigating the effect of stimulus range on attribute weight. Journal of Experimental Psychology: Human Perception and Performance, 17, 571-585.
Birnbaum, M. H. (1978). Differences and ratios in psychological measurement. In N. Castellan & F. Restle (Eds.), Cognitive theory, (Vol. 3, pp. 33-74). Hillsdale, NJ: Erlbaum.
Birnbaum, M. H., & Sutton, S. E. (1992). Scale convergence and utility measurement. Organizational Behavior and Human Decision Processes, 52, 183-215.
Bruner, D. W. (1999). Determination of preferences and utilities for the treatment of prostate cancer. Doctoral dissertation, School of Nursing, University of Pennsylvania.
Darke, P. R., & Freedman, J. L. (1993). Deciding whether to seek a bargain: Effects of both amount and percentage off. Journal of Applied Psychology, 78, 960-965.
Fischer, G. W. (1995). Range sensitivity of attribute weights in multiattribute value models. Organizational Behavior and Human Decision Processes, 62, 252-266.
Green, P. E., & Srinivasan, V. (1990). Conjoint analysis in marketing: New developments with implications for research and practice. Journal of Marketing, 45, 33-41.
Green, P. E., & Wind, Y. (1973). Multiattribute decisions in marketing: A measurement approach. Hinsdale, IL: Dryden Press.
Hoffman, P. J. (1960). The paramorphic representation of clinical judgment. Psychological Bulletin, 57, 116-131.
Hsee, C. K. (1996). The evaluability hypothesis: An explanation of preference reversals between joint and separate evaluation of alternatives. Organizational Behavior and Human Decision Processes, 46, 247-257.
Jones-Lee, M. W., Loomes, G., & Philips, P. R. (1995). Valuing the prevention of non-fatal road injuries: contingent valuation vs. standard gambles. Oxford Economic Papers, 47, 676 ff.
Keeney, R. L. (1992). Value-focused thinking: A path to creative decisionmaking. Cambridge, MA: Harvard University Press.
Keeney, R. L., & Raiffa, H. (1993). Decisions with multiple objectives. New York: Cambridge University Press (originally published by Wiley, 1976).
Louviere, J. J. (1988). Analyzing individual decision making: Metric conjoint analysis. Newbury Park, CA: Sage.
Lynch, J. G., Jr., Chakravarti, D., & Mitra, A. (1991). Contrast effects in consumer judgments: Changes in mental representations or in the anchoring of rating scales? Journal of Consumer Research, 18, 284-297.
Mellers, B. A., & Cooke, A. D. J. (1994). Tradeoffs depend on attribute range. Journal of Experimental Psychology: Human Perception and Performance, 20, 1055-1067.
Tversky, A., & Kahneman, D. (1981). The framing of decisions and the psychology of choice. Science, 211, 453-458.
Ubel, P. A., Loewenstein, G., Scanlon, D., & Kamlet, M. (1996). Individual utilities are inconsistent with rationing choices: A partial explanation of why Oregon's cost-effectiveness list failed. Medical Decision Making, 16, 108-116.
von Winterfeldt, D., & Edwards, W. (1986). Decision analysis and behavioral research. Cambridge University Press.
Weber, M., & Borcherding, K. (1993). Behavioral influences on weight judgments in multiattribute decision making. European Journal of Operations Research, 67, 1-12.
1 This research was supported by N.S.F. grant SBR95-20288, and by a grant from the University of Pennsylvania Cancer Center.