Baron, J., & Ubel, P. A. (2001). Revising a priority list based on cost-effectiveness: The role of the prominence effect and distorted utility judgments. Medical Decision Making, 21, 278-287.
A natural strategy for the use of cost-effectiveness analysis in health care allocation is to rank treatments by cost-effectiveness and then cover all the treatments at the top of the ranking, going down as far as the budget allows. The experience of the state of Oregon suggests, however, that such rankings are intuitively unappealing. The state created a priority list for determining which health care services its Medicaid enrollees would receive. The list ranked the cost-effectiveness of 709 condition/treatment pairs, such as antibiotic treatment for pneumonia. However, the list was an immediate failure. It was not even forwarded to the state legislature by the commission that created it, on the grounds that it did not capture people's health care priorities. For example, it ranked surgical treatment for ectopic pregnancy and for appendicitis at about the same part of the list as dental caps for ``pulp or near pulp exposure.'' The commission that created the list found these rankings counterintuitive and abandoned cost-effectiveness as the sole basis of its priority list.
A number of experts have debated why Oregon's cost-effectiveness list was such a failure. Some argued that Oregon's list was plagued by technical flaws and measurement error. Hadorn argued that Oregon's cost-effectiveness list failed, not because of inadequate cost-effectiveness data, but because cost-effectiveness itself fails to capture people's preferences. Cost-effectiveness ignores the rule of rescue: people want to give priority to treatment for life-threatening illnesses in a way that is not captured by the cost-effectiveness of those treatments.1 When people are given a list based on cost-effectiveness, they will want to give higher priority to expensive life-saving treatments.
Others argue that the rule of rescue, or similar values, could be captured in cost-effectiveness analysis if utility measures were altered to account for the societal value people place on various treatment benefits, that is, their desired priority in public policy, which may differ from their utility for patients.3,4 For example, a person may say that a condition is typically in the middle between death and normal health, for a typical individual patient. This judgment would imply a utility of .5 on a scale on which death is 0 and normal health is 1. But the same person might rather save the lives of 25 people than cure 100 people of the condition in question. Choices that involve numbers of people reflect, it is argued, a societal perspective. From this perspective, the condition would have a ``utility'' of .75 or higher for the purpose of cost-effectiveness analysis. This kind of judgment is the basis of the person trade-off method of utility elicitation, which asks respondents (for example) how many saved lives are equivalent to curing 100 people of the condition.
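The arithmetic behind this example can be written out explicitly. The symbols below are notation introduced here for illustration, not the paper's own: $u$ is the utility of living with the condition (death = 0, normal health = 1), so curing one patient is worth $1 - u$, and $N$ is the number of saved lives judged equivalent to curing 100 patients.

```latex
\begin{align*}
\text{Individual judgment:}\quad & u = 0.5 \;\Rightarrow\; 1 - u = 0.5 \\
\text{Person trade-off indifference:}\quad & N \cdot 1 = 100\,(1 - u) \;\Rightarrow\; u = 1 - \frac{N}{100} \\
\text{Saving 25 lives judged at least as good as 100 cures:}\quad & 100\,(1 - u) \le 25 \;\Rightarrow\; u \ge 0.75
\end{align*}
```

Under these assumptions, the same respondent implies a utility of .5 from the individual perspective and .75 or higher from the societal perspective.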
This paper addresses primarily a different possible explanation of the conflict between intuitive rankings and cost-effectiveness analysis: the prominence effect. When people make choices, they pay attention to the most prominent attribute of the available options, the attribute usually judged more ``important,'' whereas, in judgment or matching tasks, they pay attention to all the relevant attributes.5-8 In one study, subjects were asked to choose between two hypothetical highway safety programs. Program X saved 100 lives and cost $55 million, and Program Y saved 30 lives and cost $12 million. Subjects usually chose program X, despite the cost. When similar subjects were asked to indicate how many lives saved by program X would make the two programs equally attractive, however, they usually gave answers greater than 100. Such answers imply that program X would be less attractive than Y with the figures presented in the original choice. Presumably, subjects considered both cost and benefit when making the matching judgment (in which they indicated what number would make the two equivalent), but they considered mainly the benefit when making the choice. This is because, in this case, the benefit is more prominent.
The prominence effect could explain the desire to revise a ranking based on cost-effectiveness analysis because the analysis is based on judgments in which the respondent must supply a numerical judgment and the revision of the list is more like choice. When people provide numerical judgments of utility, they could attend to costs as well as benefit, but when they look at a priority list, their desire to revise it could be based largely on the benefit, because benefit is the more prominent attribute. Just as cost is less prominent than benefit, it may also be true that number of patients helped is less prominent than the amount of benefit per patient.
For example, suppose Treatment A costs $100 and yields an average of 0.01 QALYs, therefore, costing $10,000 per QALY. And suppose Treatment B costs $10,000 and yields 1 QALY, therefore having similar cost-effectiveness. Because of the different costs of each treatment, 100 people can receive Treatment A for the same cost as providing one person with Treatment B (in both cases yielding 1 QALY). A cost-effectiveness ranking would show these two treatments as being equally cost-effective. But, when evaluating such a ranking, people might focus primarily on the more prominent attribute, the amount of benefit brought by providing one person with each treatment and, thus, they may want to move Treatment B up higher on the list.
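The arithmetic in this example can be checked with a short sketch. The figures are the ones given in the text; the function names and the fixed $10,000 budget are illustrative assumptions.

```python
def cost_per_qaly(cost_per_patient, qalys_per_patient):
    """Cost-effectiveness ratio in dollars per QALY."""
    return cost_per_patient / qalys_per_patient


def outcome_for_budget(budget, cost_per_patient, qalys_per_patient):
    """Patients treated and total QALYs gained if the whole budget buys one treatment."""
    patients = budget // cost_per_patient
    return patients, patients * qalys_per_patient


# Figures from the example in the text: (cost per patient, QALYs per patient).
treatments = {"A": (100, 0.01), "B": (10_000, 1.0)}
budget = 10_000  # assumed budget, equal to the cost of one Treatment B

for name, (cost, qalys) in treatments.items():
    patients, total_qalys = outcome_for_budget(budget, cost, qalys)
    print(f"Treatment {name}: ${cost_per_qaly(cost, qalys):,.0f} per QALY, "
          f"{patients} patient(s), {total_qalys:.2f} QALYs in total")
```

Both treatments come out at the same cost per QALY and the same total benefit for the budget; they differ only in benefit per patient, which is the attribute the prominence hypothesis says people will focus on.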
In the studies we report here, we conducted utility elicitations for a series of hypothetical treatment/condition pairs. We used two methods for eliciting judgments of utility: visual analog scale (because it is easy for subjects) and person trade-off (because it is designed to reflect preferences about allocation). Then we show subjects a cost-effectiveness ranking based on their own utility elicitations. In some cases, we provide the cost of the treatment at the same time we elicit utilities. In other cases we do not. In all experiments, we ask people to look at the cost-effectiveness ranking and adjust any items that they think are misplaced. We hypothesize that people pay more attention to the prominent attribute of benefit - and therefore less attention to cost - when evaluating a cost-effectiveness ranking than when responding to utility elicitations. They will thus want to increase the priority of treatments that are expensive or highly beneficial.
As a secondary issue (addressed in Experiment 2), we also test the possibility that the initial judgments of the utility of treating mild conditions are exaggerated because subjects misuse the judgment scale. Previous research suggests that such distortion occurs and can be corrected somewhat by asking subjects to think about the relation of what they are judging to the midpoint of the scale.9
Both our hypotheses are about psychological factors that affect the evaluation of priority lists derived from utility judgments. One factor, the prominence effect, affects the response to the list, and the other factor involves distortion of the initial utility judgments that are used to produce the list.
In Experiment 1, each subject made two kinds of utility judgments. In one, called ``without-cost,'' subjects used a standard visual analog scale to rate the amount of benefit brought by treating various health conditions. The two ends of the rating scale were labeled ``no good at all'' and ``as good as preventing death.''
In the other kind of judgment, called ``with-cost,'' subjects had the opportunity to consider cost as well as benefits. The two ends of the rating scale were ``no good at all'' and ``as good as preventing death at no cost.'' This type of judgment is not traditionally used in medical cost-utility analysis, although it is used in holistic judgments of consumer goods, when the price of the good is one attribute among many. As we show, our subjects did attend to cost when making these holistic judgments.
Either method - without-cost judgments, or holistic with-cost judgments - can lead to a cost-utility priority list. The utility/cost ratio for the without-cost method is based on assumed costs (which can then be listed in the rank list). The ratio for the with-cost method is determined directly from subjects' judgments of utility, because they include cost as an attribute when they make these judgments.
We used three levels of cost. The middle level was intended to be plausible. The low level was half of the middle, and the high level was twice the middle. This manipulation allowed us to determine whether subjects were following the instructions to take cost into account. If they took cost into account, their with-cost ratings would be higher for the low-cost items and lower for the high-cost items.
After subjects rated the 16 pairs in these four versions (without-cost, and with-cost using three levels of cost), we presented them with a list of condition-treatment pairs ranked according to the middle-cost with-cost judgments. The list included the cost of each item. The subject could then indicate which, if any, of these pairs should be moved higher in the list, or lower. They did not have to indicate how many steps higher or lower, but the answer to this question allows us to assess which (if any) of the pairs seemed out of place to each subject. To analyze these results, we can attempt to predict the subject's desire to move an item from the item's cost. If high-cost items are moved up, this supports the hypothesis that people tend to ignore cost when looking at a ranking but not when making the ratings that determined the ranking. Note that this analysis requires computation of a correlation coefficient across the condition-treatment pairs within each subject. The hypothesis concerns the across-subjects mean of these within-subject correlations.
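The analysis described here (a correlation computed within each subject, then a test of the mean of those correlations across subjects) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the revision coding (+1 up, -1 down, 0 unchanged) is the one the paper reports in its results, and the one-sample t statistic is assumed to be the test behind the reported t values.

```python
import math
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation; returns None if either variable has zero variance."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # e.g. a subject who revised nothing
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def mean_within_subject_r(revisions_by_subject, item_costs):
    """Mean within-subject correlation, its one-sample t statistic, and df."""
    rs = [pearson(rev, item_costs) for rev in revisions_by_subject]
    rs = [r for r in rs if r is not None]  # drop subjects with no variance
    m = mean(rs)
    t = m / (stdev(rs) / math.sqrt(len(rs)))
    return m, t, len(rs) - 1


# Made-up data: 4 subjects x 4 items (the experiments used 16 items).
item_costs = [3, 2, 2, 1]          # hypothetical item costs
revisions = [
    [1, 0, 0, -1],                 # moves the costliest item up
    [0, 0, 0, 0],                  # no revisions: excluded from the mean
    [1, 1, 0, -1],
    [0, 1, 0, 0],
]
m, t, df = mean_within_subject_r(revisions, item_costs)
print(f"mean r = {m:.2f}, t({df}) = {t:.2f}")
```

Subjects who revise nothing drop out because their correlation is undefined, which is why the degrees of freedom depend on the number of subjects who made at least one revision.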
In sum, the final list is based on ratings of the items in the list, complete with cost information. If subjects want to change the rankings for systematic reasons, this is the simplest possible demonstration of a reversal of preference between rating and ranking.
Seventy subjects completed a questionnaire on the World Wide Web. They found out about the study from links in a variety of web pages, including one advertising ``free stuff on the internet.'' They ranged in age from 13 to 50, with a median of 29; 71% were female; and 36% were students. Subjects were paid $3, and they had to provide an address and social-security number in order to be paid. The questionnaire included a variety of checks to make sure that responses were serious.
The questionnaire began as follows:
Health judgments
Health insurance can be provided by the government or by private companies. Insurance covers different things. Almost all insurance covers emergency-room care for heart attacks. Almost no insurance covers in-vitro fertilization (artificial insemination) for couples who cannot conceive a child. There are hundreds of such treatments or preventive measures that could be covered or not.
Suppose that a commission were appointed to try to draw up a list of priorities for insurance coverage. The idea would be that each insurer would go down the list until it ran out of money. Heart-attack treatment would be near the top of the list, so all insurers would cover it. Insurers with a lot of money might cover the top 500 treatments. Insurers with less might cover only 400. Suppose there was a law about this. The law says that, if you cover one treatment, you must cover everything above it on the list, unless you get special permission.
In such a situation, insurers must decide how to spend their money most wisely. Judgments of the importance of curing or preventing various conditions will play a role.
In the items that follow, you will rate the value of treating or preventing various conditions. There are two kinds of items. One has a dark grayish background and provides information about cost. When you respond to this item, take the cost of the treatment into account. 48 items are of this type. The other type has a dark red background, and it involves only the treatment. This is about the benefit of the treatment alone, irrespective of cost. 16 items are of this type, for a total of 64.
In each item, you provide a numerical rating, but you do this by clicking on arrows or on a scale. At the end, you will have a chance to examine and change the priority list that resulted from your responses.
Then the subjects saw 64 screens in a random order chosen differently for each subject. The 64 screens required judgments of 16 condition-treatment pairs (henceforth simply ``pairs'') under four versions of cost: without-cost, low, medium, and high. We assigned costs to treatments in an effort to be plausible to the subjects rather than accurate. Here are the pairs with their costs in the medium-cost version:
|READING GLASSES TO RESTORE ABILITY TO READ||$100|
|CATARACT SURGERY TO RESTORE NORMAL VISION||$4000|
|ANTIBIOTICS FOR PNEUMONIA||$30|
|ANTIDEPRESSANT DRUG FOR DEPRESSION (1 YR.)||$2000|
|EMERGENCY TREATMENT FOR HEART ATTACK||$1000|
|SURGERY FOR APPENDICITIS||$10000|
|MEDICATION FOR HIGH BLOOD PRESSURE (10 YRS. AVG.)||$10000|
|INSULIN FOR DIABETES (20 YRS. AVG.)||$5000|
|REMOVAL OF WARTS FROM HANDS||$300|
|CAPPING OF BROKEN TOOTH||$500|
|LIVER TRANSPLANT FOR ALCOHOL INDUCED CIRRHOSIS||$100000|
|REMOVAL OF PRE-CANCEROUS SKIN MOLE||$100|
|ANTIBIOTICS FOR STREP THROAT||$30|
|VACCINATION AGAINST CHICKEN POX||$20|
|BANDAGE FOR SPRAINED ANKLE||$50|
|HEARING AID TO RESTORE NORMAL HEARING||$1000|
The cost in the low-cost version was half of that given here, and the cost in the high-cost version was double. On each with-cost trial, the subject saw a screen with a heading like the following (from the high-cost version):
How much good does this do?:
REMOVAL OF WARTS FROM HANDS at a cost of $600 per case (which means 1667 cases for $1,000,000).
Below the heading on the left was a scale ranging from 0 to 100, in units of 5, as well as two arrows pointing up, labeled +5 and +1, respectively, and two arrows pointing down, labeled -1 and -5, respectively. The scale was labeled ``As much as preventing death at no cost'' at the top and ``No good at all'' at the bottom. It was white above the utility value in effect, and red below it. To the right of the scale was a summary of the judgment so far:
REMOVAL OF WARTS FROM HANDS at a cost of $600 per case (which means 1667 cases for $1,000,000) is ___% as good as preventing death at no cost.
The blank was initially filled in with 0, but its value changed as the subject manipulated the scale or the arrows. The visual scale was marked in units of 5, so finer adjustments (from using the +1 and -1 arrows) were reflected only in the blank space.
In the without-cost version, the information about cost and number of people helped for $1,000,000 was omitted, and the top of the scale was labeled ``As much as preventing death.'' The without-cost version was presented in a different color, as noted in the instructions.
At the end, subjects were told, ``Here is a list of treatments, ranked according to your responses. Please check it to see whether you want to change anything. The treatments higher in the list would get priority. (For example, A would get priority over B.)'' They were instructed to type the letters of conditions ranked too low on the list, and, separately, those ranked too high. The list included cost information by adding ``at a cost of ...'' to the end of each description. The list was based on the responses to the middle-cost items.
We used the logarithm (base 10) of cost in all calculations because the distribution of cost was highly skewed.
To determine whether the with-cost ratings were sensitive to cost across the 16 condition-treatment pairs, we regressed, for each subject, the with-cost utilities for each cost level against the without-cost utilities and cost, across the 16 condition-treatment pairs. The mean within-subject (unstandardized) regression weights for cost were -2.4, -2.6, and -2.3, for the three cost levels (low, medium, high), respectively. That is, multiplying the cost by 10 reduced the utility by about 2.5 points on the 100 point scale. The mean of the three coefficients was significantly different from zero (t66 = 4.04, p = .0001). Cost level (low-medium-high) also affected utility judgments overall. The mean utility ratings for the low, medium, and high cost levels were, respectively, 46.4, 44.7, and 43.2 (t69 = 4.12, p = .0001, for the declining linear trend across the three levels). Thus, subjects took cost into account in the with-cost condition.
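The per-subject regression described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' analysis code: `cost_weight` is a hypothetical name, the model is ordinary least squares with an intercept and two predictors (without-cost utility and log10 cost), and the data are made up so that the true cost weight is -2.5.

```python
import math


def cost_weight(with_cost, without_cost, costs):
    """Unstandardized OLS weight for log10(cost) when with-cost utility is
    regressed on without-cost utility and log10(cost), with an intercept."""
    n = len(costs)
    log_cost = [math.log10(c) for c in costs]

    def centered(values):
        m = sum(values) / n
        return [v - m for v in values]

    y, u, c = centered(with_cost), centered(without_cost), centered(log_cost)
    suu = sum(a * a for a in u)
    scc = sum(a * a for a in c)
    suc = sum(a * b for a, b in zip(u, c))
    suy = sum(a * b for a, b in zip(u, y))
    scy = sum(a * b for a, b in zip(c, y))
    det = suu * scc - suc ** 2  # zero if the two predictors are collinear
    return (suu * scy - suc * suy) / det


# Made-up data for one subject and four items (the experiment used 16).
costs = [10, 100, 1000, 10000]
without_cost = [20, 80, 50, 90]
with_cost = [u - 2.5 * math.log10(c) for u, c in zip(without_cost, costs)]

print(round(cost_weight(with_cost, without_cost, costs), 3))
```

Because cost enters as log10, a weight of about -2.5 means that multiplying cost by 10 lowers the rating by about 2.5 points, which is how the text interprets the reported coefficients.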
Subjects had a chance to revise, at the end, the ranking of condition-treatment pairs based on the standard costs. We asked if they wanted to move a pair up or down on the priority list. We coded their response as 1 if they wanted to move a pair up, as -1 if they wanted to move a pair down, and as 0 if they did not want to move a pair. 23 subjects did not revise any ranking. Our hypotheses all concern within-subject correlates, across the 16 pairs, of wanting to move a pair up or down in the ranking, so these 23 subjects did not contribute to the relevant correlations. They had no variance in one of the variables being correlated.
We first computed, for each subject who made revisions, the correlation between the revision (1 for up, -1 for down, 0 for no revision) and the without-cost rating, across the 16 condition-treatment pairs. The mean of these correlation coefficients (one for each subject) was .10, which was significantly positive (t44 = 2.42 across the 45 subjects who had nonzero variance in both variables, p = .0195). In other words, subjects wanted to revise the rankings based on their with-cost ratings in the direction of those that would be based on their without-cost ratings, the ratings of benefit that ignored cost.
The revision was also correlated with cost across the 16 condition-treatment pairs. The mean within-subject correlation, across the 47 subjects, was .11 (t46 = 2.32, p = .0251). In other words, pairs that were moved up tended to be those with high cost. Presumably, this happened because subjects took cost into account in their ratings (as we noted), which were then used to make the ranking. High-cost items were thus ranked lower than they would have been ranked on the basis of their benefit alone. When subjects examined the final ranking, they did not attend to the cost as much as they had done in their initial ratings - even though the cost information was provided in the list - so they wanted to give the high-cost items higher priority.
More generally, a mean of 27% of adjacent items in the ranking that each subject wanted to switch (either by moving the higher item down or the lower item up, or both) were characterized by both higher without-cost utility and higher cost for the item moved up, vs. 12% that were both lower in cost and lower in without-cost utility.
In sum, subjects wanted to move high-benefit and high-cost treatments higher than the position implied by the subjects' own ratings that took cost into account. Their response to the final ranking was less influenced by cost considerations, and more influenced by benefit alone, than were the utility ratings they made for the same items.
Experiment 2 examined a second reason for the desire to revise rankings. The utility measure itself may be invalid because subjects may use the scale in a way that does not correspond to their own representations of what they are judging. In particular, the usual methods of utility assessment may overstate the disutility of mild or moderate health conditions, or, equivalently, they may overstate the benefit of curing or preventing such conditions.9 For example, Ubel et al. asked subjects, ``You have a ganglion cyst on one hand. This cyst is a tiny bulge on top of one of the tendons in your hand. It does not disturb the function of your hand. You are able to do everything you could normally do, including activities that require strength or agility of the hand. However, occasionally you are aware of the bump on your hand, about the size of a pea. And once every month or so the cyst causes mild pain, which can be eliminated by taking an aspirin. On a scale from 0 to 100, where 0 is as bad as death and 100 is perfect health, how would you rate this condition?''4 The mean answer was 92. The cyst was thus judged about 1/12 as bad as death. This disutility seems too high: the rating is too far from 100. If the seriousness of minor conditions is overrated because of a distortion in the response scale, then these conditions will get higher priority in the final ranking than they deserve. Subjects would then want to move them down.
Baron et al. found that rating scale judgments yield smaller disutilities for health conditions and are more consistent with each other when subjects are first asked whether a condition is closer to one end of the scale or the other (normal health or death).9 For example, a subject who rates blindness as 40 on a scale from 0 (death) to 100 (normal) may revise this number upward after judging that blindness is closer to normal health than to death. These results suggest that, indeed, conditions tend to be rated as closer to death than they ought to be, and that this bias can be corrected by instructions to consider the midpoint of the scale. When the bias is corrected, internal inconsistencies among ratings are reduced. Baron et al. suggest that this effect results from a general tendency to use normal health as a reference point and exaggerate the differences near the reference point.
We use such ``interval instructions'' here as a way to ``debias'' utility judgments. Before the subjects rated a condition-treatment pair relative to preventing death, we asked them whether the pair did more or less than half as much good as preventing death, for half of the trials. This question may have made subjects think more carefully about the nature of the scale as a measure of differences (even if they would have given the same answer if they had given their ratings first and based the answer on their ratings).
Sixty-six subjects completed a questionnaire on the Web, as in Experiment 1. Ages ranged from 13 to 65 (median 27); 73% were female; and 44% were students.
The method was the same as Experiment 1, with the following changes. We used only the middle cost, not high and low. We used four versions, 16 trials each, with the order of all 64 trials randomized separately for each subject. Two of the versions provided cost information and two did not. One of the with-cost versions and one of the without-cost versions provided interval instructions. We did not use the visual scale (because we thought it would detract from the salience of the interval instructions). The text for each trial with the interval instructions read as follows (for example):
The treatment here is to provide INSULIN FOR DIABETES (20 YRS. AVG.) at a cost of $5000 per case (which means 200 cases for $1,000,000).
First, think about whether this does more or less than half as much good as preventing death at no cost.
If it does more than half as much good, type 'm'.
If it does less than half, type 'l'.
If it does exactly half as much good, type 'h'.
On a scale where
0 means 'no good at all' and
100 means 'as much good as preventing death at no cost', how much good does it do to provide
INSULIN FOR DIABETES (20 YRS. AVG.)
at a cost of $5000 per case
(which means 200 cases for $1,000,000)?
The version without the interval instructions began with ``On a scale ....'' The version without cost omitted the cost information. The introduction was changed to reflect the changes just described.
Utility ratings (that is, ratings of benefit from treating the condition) were lower when cost information was provided (F1,65 = 8.67, p = .0045) and they were lower when interval instructions were provided (F1,65 = 7.13, p = .0095). The second result shows that the interval instructions were effective in inducing subjects to provide lower ratings of benefit, as we hypothesized.
Of interest were the correlates of the final revision of the ranks. Sixteen subjects did not revise their ranking. Recall that the ranking was based on the cost-no-instructions version. We asked about the correlation between desired revisions in ranking and the ratings in the other three versions, as well as cost. The mean within-subject correlations, across the 50 remaining subjects, were .13 for the correlation between revision and ratings in without-cost-no-instructions (t49 = 4.41, p = .0001), .11 for without-cost-instructions (t49 = 3.61, p = .0004), .12 for cost-instructions (t49 = 3.99, p = .0002), and .09 for cost itself (t49 = 2.24, p = .0298).
There are three results here. First, as we found in Experiment 1, subjects wanted to revise in the direction of giving priority to treatments with greater benefit, regardless of cost (the without-cost correlations). Second, as found in Experiment 1, they wanted to revise in the direction of higher cost (the correlation with cost). Third, they wanted to revise in the direction of the instructed ratings (cost-instructions).
The third finding suggests that one of the reasons that people want to change rankings based on the analog scale is that the scale provided invalid utility ratings. That is, the ratings do not represent people's most reflective judgments. This invalidity can be reduced by asking subjects to reflect on the relation of the condition they are rating to the midpoint of the utility scale. When they do this, their ratings are more consistent with their final ranking.
In Experiments 1 and 2, we based the final ranking on judgments when the subjects were provided cost information. In essence, we asked them to evaluate each condition-treatment pair in terms of its benefit-cost ratio, and we used this to determine the priority list. In Experiment 3, we asked only about benefit, and we constructed the priority list by computing the benefit-cost ratio for the subjects. For this computation, we used the same costs as in Experiment 2. We showed subjects these costs and their implications for the number of patients who could be treated for $1,000,000, as part of the priority list, as in Experiments 1 and 2.
We used two methods to elicit utilities for benefits. One was the simple direct-rating method used in Experiment 2 (without cost information and without interval instructions). The other was the person-tradeoff (PTO). We asked subjects how many lives saved was just as attractive as 100 people getting each of the treatments in the list. For example, if the subject says that saving 25 lives is just as attractive as giving some treatment to 100 people, and if we interpret this answer as a measure of relative benefit, then the treatment would be 1/4 as beneficial as saving a life.
We focused primarily on the PTO, because some have suggested that judgments of benefits for the purpose of allocation might differ from judgments of benefits for other purposes (such as individual decision making).3,4 In addition, the PTO explicitly asks people to think about the relative number of patients who would have to receive a treatment to bring a specific amount of benefit. If these arguments are correct, PTO judgments should yield an acceptable final ranking, one that people would not want to revise in any systematic way. The assumption here is not that the PTO automatically takes cost into account. We did not provide cost information. Rather, the argument assumes that PTO ought to provide the correct measure of benefit for allocation based on cost as well as benefit.
Seventy-four subjects completed a questionnaire on the Web, as in Experiments 1 and 2. Ages ranged from 13 to 65 (median 30, or 28 after omissions described below); 75% were female (70% after omissions); and 40% were students (47% after omissions).
The PTO item read: ``How many people saved from death is just as attractive as providing 100 people with [the pair].'' The introduction to the experiment explained the PTO as follows:
In another type of item, you are to imagine that you have a choice between saving some number of lives, call it X, and giving 100 people the treatment in question. The question is, at what value of X would you find your two options equally attractive. For example:
How many people saved from death is just as attractive as providing 100 people with chemotherapy for breast cancer?
Again, imagine that the people treated and the people saved from death are similar, and consider the average person who would get the treatment. In this case, you should not give a number greater than 100. Providing chemotherapy does a lot of good, but not as much good as saving a life, on the average. So your answer would be lower than 100. When the treatment in question does less good, then your answer should be lower.
Subjects then did a practice item in which they clicked on buttons that adjusted the answer higher or lower until they were satisfied.
The first 32 screens of the experiment mixed the 16 rating items (without cost information) and the 16 PTO items in a different random order for each subject. Then the subject saw two screens with rankings, like those in Experiments 1 and 2. The first screen was always based on the PTO judgments. The second was based on the ratings. In each case the priority rankings were determined by dividing the utility (rating or PTO) by the cost.
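The construction of the priority list (utility divided by cost, sorted from highest to lowest) can be sketched as follows. The costs are the medium-cost figures listed for Experiment 1; the PTO answers are invented for illustration, and the function names are hypothetical.

```python
def pto_to_utility(lives_equivalent, people_treated=100):
    """Relative benefit of one treatment, with saving a life = 1."""
    return lives_equivalent / people_treated


def priority_list(utilities, costs):
    """Pairs sorted by utility/cost ratio, highest priority first."""
    ratios = {pair: utilities[pair] / costs[pair] for pair in utilities}
    return sorted(ratios, key=ratios.get, reverse=True)


# Medium-cost figures from the paper's list of pairs.
costs = {
    "ANTIBIOTICS FOR PNEUMONIA": 30,
    "EMERGENCY TREATMENT FOR HEART ATTACK": 1000,
    "REMOVAL OF WARTS FROM HANDS": 300,
    "LIVER TRANSPLANT FOR ALCOHOL INDUCED CIRRHOSIS": 100000,
}

# Hypothetical PTO answers: lives saved judged equal to treating 100 people.
pto_answers = {
    "ANTIBIOTICS FOR PNEUMONIA": 60,
    "EMERGENCY TREATMENT FOR HEART ATTACK": 90,
    "REMOVAL OF WARTS FROM HANDS": 2,
    "LIVER TRANSPLANT FOR ALCOHOL INDUCED CIRRHOSIS": 80,
}

utilities = {pair: pto_to_utility(x) for pair, x in pto_answers.items()}
for rank, pair in enumerate(priority_list(utilities, costs), start=1):
    print(rank, pair, f"(${costs[pair]})")
```

With these invented answers, the liver transplant falls below wart removal despite its much larger judged benefit, because its cost is so much higher. This is the kind of placement that subjects are hypothesized to want to revise.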
Subjects found the PTO task difficult, some by their own admission. To determine whether they interpreted the task correctly, we formed an index of sensitivity to severity for each measure, ratings and PTO. We formed the index by subtracting the responses for the three pairs with the lowest utility on both measures on the average (warts, sprained ankle, hearing aid) from the three rated highest (antibiotics for pneumonia, emergency treatment for heart attacks, insulin for diabetes). We examined a scatterplot of these two sensitivity indices (one for ratings, one for PTO) and found one group of subjects in which they were highly correlated and close to one another and another in which the index for PTO was much lower than that for ratings. These subjects presumably were the ones who did not use the PTO in a way that reflected their views. We eliminated subjects who differed by .25 (on a scale of 0-1) in this direction. This left 54 subjects out of the original 73. The subjects who were eliminated in this way were grouped together in the scatterplot, with a space between them and the subjects who were retained. We did this before looking at other results. Of course, we also examined the results with all subjects included, but we wanted to make sure that any positive results could not arise from the inclusion of subjects who had serious difficulty with the PTO.
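The exclusion rule can be sketched as follows. Several details are assumptions made for illustration, since the text does not spell them out: responses are rescaled to 0-1, the index is the mean response to the three high-utility pairs minus the mean response to the three low-utility pairs, and a subject is dropped when the PTO index falls below the rating index by .25 or more. The short pair labels and function names are hypothetical.

```python
LOW_PAIRS = ["warts", "sprained ankle", "hearing aid"]
HIGH_PAIRS = ["pneumonia", "heart attack", "diabetes"]


def sensitivity(responses):
    """Mean response to high-utility pairs minus mean response to low-utility
    pairs, with responses on a 0-1 scale (assumed form of the index)."""
    high = sum(responses[p] for p in HIGH_PAIRS) / len(HIGH_PAIRS)
    low = sum(responses[p] for p in LOW_PAIRS) / len(LOW_PAIRS)
    return high - low


def keep_subject(rating_responses, pto_responses, threshold=0.25):
    """Keep the subject unless the PTO index is lower than the rating index
    by the threshold or more (assumed reading of the exclusion rule)."""
    gap = sensitivity(rating_responses) - sensitivity(pto_responses)
    return gap < threshold


# Made-up responses for two subjects.
ratings = {"pneumonia": 0.9, "heart attack": 0.9, "diabetes": 0.8,
           "warts": 0.1, "sprained ankle": 0.2, "hearing aid": 0.3}
pto_consistent = {"pneumonia": 0.8, "heart attack": 0.9, "diabetes": 0.7,
                  "warts": 0.05, "sprained ankle": 0.1, "hearing aid": 0.15}
pto_flat = {pair: 0.5 for pair in ratings}  # same answer for every pair

print(keep_subject(ratings, pto_consistent))  # retained
print(keep_subject(ratings, pto_flat))        # excluded
```

A subject who gives the same PTO answer to every pair has a PTO index of zero, so the gap from the rating index is large and the subject is excluded.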
Each subject wanted to move the more costly treatments higher in the priority list. Revision upward was correlated with cost. The mean within-subject correlation for the list based on the PTO was .43 (t48 = 9.87, p < .0001) and the mean correlation for the list based on ratings was .40 (t48 = 8.78, p < .0001). The two correlations did not differ significantly. (When all subjects were included, the two correlations were, respectively, .46 and .45. Again, both correlations were significantly positive and did not differ significantly.)2
Perhaps the cost information was not salient enough when the ranking was presented to the subjects. It is possible that subjects would pay attention to cost, and to the number of people who could be treated for a fixed cost, if this information were more salient in the ranking. (Of course, the tendency to ignore cost when evaluating a priority list would be exacerbated if the cost information is missing from the listed items.) To determine whether subjects would still want to move high-cost items upward under these conditions, we reran Experiment 3 with one change. In the final presentation of the list, the information about number of people was presented first. Thus, each line in the list was of the form, ``1000 cases of HEARING AID TO RESTORE NORMAL HEARING (cost $1000).'' The questionnaire was completed by 70 subjects (ages 15-68, median 28; 61% females, 41% students).
Once again, we eliminated subjects with more than a .25 difference between the indices for PTO and ratings, and also subjects with less than .25 on either measure (suggesting little attention to the difference between serious and mild conditions), leaving 49. The mean within-subject correlations between cost and upward revision were .47 for PTO (t(41) = 10.32, p = .0000) and .44 for ratings (t(40) = 10.99, p = .0000). These two means did not differ significantly. (When all subjects were included, the two mean correlations were, respectively, .42 and .45; again, these did not differ significantly and were significantly positive.) These means were no lower than in the original version of the experiment, which suggests that the manipulation of salience had no effect. Only three subjects wanted to move high-cost items down (or low-cost items up) for each measure.
In sum, when subjects see a priority list based on cost-effectiveness of condition-treatment pairs, even when the information about cost and the number of patients who can be helped is salient, they want to ignore cost and move high-cost items up. The use of the PTO does not reduce this effect.
We found evidence for two factors that affect evaluation of priority lists derived from cost-effectiveness analysis. First, when people evaluate a priority list, they attend less to cost and more to benefit than when they are asked to make explicit tradeoffs of cost and benefit. This is, we believe, an instance of the prominence effect. We found this consistently in several different cases: when the priority list was based on ratings that took cost into account, when it was based on ratings of benefit only (with cost taken into account afterward, in devising the list), and when person tradeoff (PTO) was used to elicit utility judgments of benefit. The last result is of interest because PTO judgments have been suggested as closer to the actual allocation decision, hence capturing a type of value that is relevant to allocation decisions in particular.3,4
Second, numerical utility judgments are often too high, especially for small effects. This is an artifact of the way in which people assign numbers to internal representations.10 When cost-effectiveness analysis is based on such judgments, it favors allocation of too many resources to options with minor benefits. We found that this bias exists and that it is reduced if subjects are asked to consider the midpoint of the scale as well as its two ends. Such a debiasing method may be useful in practice. Simple rating scales are often the easiest methods to use, but they are suspect because of just this sort of problem. Interval instructions could help overcome the problem (and make the scales more theoretically justifiable as well).11
Of course, our results do not address the existence of other possible factors that cause conflict between priority lists and the judgments that led to them. We did not examine judgment of cost at all. It is possible that the public has a broader conception of cost than the costs typically used in cost-effectiveness analyses, which includes such things as effects on the family. It is also possible that priority lists conflict with intuitions about fairness.10
It is also possible that the rule of rescue is an additional factor, beyond the prominence effect. Note first that the prominence effect could explain results that seem to imply a rule of rescue, if those results show that the rule is applied in choices rather than matching judgments. Such an explanation could be based on the assumption that benefit itself consists of two dimensions: the seriousness of the condition before treatment; and the benefit of the treatment (or the final condition after treatment). Of these two dimensions of benefit, seriousness might be more prominent. But that is a different kind of prominence effect than the one we have examined. Demonstration of such an effect would require comparison of matching judgments (or utility ratings) and choice judgments (or judgments about the revision of a priority list).
A major limitation of our results is that our lists were hypothetical, not real, and the subjects were not necessarily representative of activist citizens who complain about priority lists. Thus, our experiments address questions about the psychology of evaluating priority lists, but not about the political and institutional factors that affect the evaluation of such lists in the real world. Any conclusions about the real world are based on the assumption that properties of human judgment studied in the laboratory are active outside the laboratory enough to affect policy outcomes. Although this assumption has been defended,12,13 it cannot be taken for granted.
If our results are relevant at all outside the laboratory, their main implication is that people should be wary of attempts to tinker with a priority list on the basis of judgments about the list itself, even if it contains information about costs and their implications. When questions about a list are raised, they should be resolved by re-doing the procedure that generated it, perhaps more carefully or with additional checks (such as the interval instructions used in Experiment 2), rather than by using intuition to change the ranks.
2. Eddy DM. Oregon's methods: Did cost-effectiveness analysis fail? Journal of the American Medical Association, 1991;266:2135-41.
3. Nord E. The person trade-off approach to valuing health care programs. Medical Decision Making, 1995; 15:201-208.
4. Ubel PA, Loewenstein G, Scanlon D, Kamlet M. Individual utilities are inconsistent with rationing choices: A partial explanation of why Oregon's cost-effectiveness list failed. Medical Decision Making, 1996;16:108-16.
5. Fischer GW, Carmon Z, Ariely D, Zauberman G. Goal-based construction of preferences: Task goals and the prominence effect. Management Science, 1999; 45:1057-1075.
6. Hsee CK. The evaluability hypothesis: An explanation of preference reversals between joint and separate evaluation of alternatives. Organizational Behavior and Human Decision Processes, 1996;67:247-57.
7. Schkade DA, Johnson E. Cognitive processes in preference reversals. Organizational Behavior and Human Decision Processes, 1989;44:203-31.
8. Tversky A, Sattath S, Slovic P. Contingent weighting in judgment and choice. Psychological Review, 1988;95:371-84.
9. Baron J, Wu Z, Brennan DJ, Weeks C, Ubel PA. Analog scale, ratio judgment and person trade-off as measures of health utility: biases and their correction. Journal of Behavioral Decision Making, in press.
10. Baron J. Biases in the quantitative measurement of values for public decisions. Psychological Bulletin, 1997; 122:72-88.
11. Krantz DH, Luce RD, Suppes P, Tversky A. Foundations of measurement (Vol. 1). New York: Academic Press, 1971, ch. 4.
12. Baron J. Judgment misguided: Intuition and error in public decision making. New York: Oxford University Press, 1998.
13. Baron J. Thinking and deciding (3rd edition). New York: Cambridge University Press, 2000.
1This research was supported by a pilot-project grant from the Penn Cancer Center and by National Science Foundation grant SES9876469. Dr. Ubel is a Robert Wood Johnson Foundation Generalist Physician Faculty Scholar, recipient of a career development award in health services research from the Department of Veterans Affairs, and recipient of a Presidential Early Career Award for Scientists and Engineers. This research was also supported by the National Cancer Institute (R01-CA78052-01). We thank Andrea Gurmankin and the reviewers for comments. Send correspondence to Jonathan Baron, Department of Psychology, University of Pennsylvania, 3815 Walnut St., Philadelphia, PA 19104-6196, or (e-mail) firstname.lastname@example.org.
2These correlations were
considerably higher than those in Experiment 1, for example.
The difference may be related to the fact that subjects made
many more revisions in the list in Experiment 3. The mean
number of revisions was 6.4 in Experiment 3 and 2.4 in
Experiment 1 (t(140) = 7.82, p = .0000). The correlation is
necessarily low when the number of revisions is small. This
difference in number of revisions is itself something we cannot