I heard a great presentation this morning by Joe Newhouse, from the Department of Health Policy and Management at Harvard Medical School. One point he made really caught my attention. It was a citation of a 2004 article in the Journal of the American Medical Association (Dimick et al., JAMA 2004; 292: 849) that asked how many cases of a certain clinical procedure you would need to collect to determine that a given hospital's mortality for that procedure was twice the national average. It turns out that only for CABGs (coronary artery bypass grafts) are there enough cases performed to have statistical confidence that a hospital has that poor a record compared to the national average. For other procedures (hip replacements, abdominal aortic aneurysm repairs, pediatric heart surgery, and the like) there are just not enough cases done to make this assessment. (By the way, if you just want to know whether a hospital is, say, 20% worse on relative mortality, you need an even bigger sample size.)

I have copied the basic chart above. Sorry, but I couldn't nab the whole slide. The vertical axis is "Observed 3 year hospital case loads", or the number of cases performed over three years. The horizontal axis is "Operative mortality rates". The line curving down through the graph shows the frontier at which statistical significance can be determined. As you see, only CABGs are above the line.

And, as Joe pointed out, this chart is based on three years of data for each hospital. With only a year's worth from each hospital, you surely don't have enough cases to draw statistically interesting conclusions about relative mortality. And remember, too, that this is hospital-wide data. No one doctor does enough cases to cross the statistical threshold.

So, this would suggest that publication of hospital mortality rates for many procedures would not be helpful to consumers or to referring physicians.

Meanwhile, though, you might recall a post I wrote on surgical results as calculated by the American College of Surgeons in their NSQIP project. This program produces an accurate calculation of a hospital's actual versus expected outcomes for a variety of surgical procedures. Unfortunately, the ACS does not permit these data to be made public.

Where does this leave us? Well, as I noted in a Business Week article, the main value of transparency is not necessarily to enable easier consumer choice or to give a hospital a competitive edge. It is to provide creative tension within hospitals so that they hold themselves accountable. This accountability is what will drive doctors, nurses, and administrators to seek constant improvements in the quality and safety of patient care. So, even if we can't compare hospital to hospital on several types of surgical procedures, we can still commend hospitals that publish their results as a sign that they are serious about self-improvement.

## Monday, December 17, 2007

## 13 comments:

I'd like an explanation, in relatively layman's terms, of why you need a *bigger* sample to prove 20% greater mortality than 100%.

I understand a fair amount about statistics - the null hypothesis, how confidence levels tell you how likely it is that a given result was random. But I have a distant sense that you need more data to be confident about a more extreme finding.

Can anyone help here?

Joe explained it in a very concise and understandable fashion -- which I immediately forgot. I used to know this stuff . . .

Well let's go GET it from Joe, then. People need to understand this.

Should I contact him? Can you invite him to come explain it here, clearly and concisely?

Mr Levy (doctor?), that explains what I said in my blog entry several days ago about my own reported actual and expected mortality data. And I highly suspect my sample size of 400 discharges in a year is nowhere near large enough to make my numbers statistically significant.

My expected mortality was about 6.4% and my actual mortality was 6.3%. The national average actual/expected mortality is a little under 4.5%.

As you can see, 2% higher represents only 8 patients in a 400 discharge year, which to me says I had 8 sicker patients on average than my group and the nation as a whole this year.

It does not say to me that I see, on average sicker patients or that my patients on average died at a higher rate. Just that I had 8 more really sick patients.

The public may not understand that if they see my patients are dying at a rate of 6.4% when all the other docs' patients around me are dying at 4.5%.

I am a firm believer in outcomes reporting, but realize that statistically, it will not always be meaningful.

patient dave, I will try to explain briefly.

Let's say there are 100 patients being treated at each of two hospitals for the same condition.

Let's say that one hospital has a mortality of 10% and the other has a mortality of 20%.

So at hospital one, 10 people die.

At hospital two, 20 people die.

Let's assume the national average is 10%.

So hospital one is average.

Hospital two is 100% worse than the average.

What is 20% worse? Well, that would be 20% of 10%, or 2%.

Or 2 people.

So being only 20% worse than average would mean that 2 more people a year would die, or 10+2=12 people would die.

You can see, if you are dealing with really small numbers, it becomes harder to quantify smaller percentages.

How about 5% worse?

Well, 5% of 10% would be .5%, or 1/2 of a person. So 10.5 people would die to be 5% worse than the national average.

So if a hospital had a mortality 5% above the national average, that would mean only 1/2 extra died per year.

If you started with 10,000 patients instead of 100, then that 1/2 a person becomes 50 people.

You are more likely to catch a pattern with 50 people than 1/2 of a person.

It is a much more statistically meaningful number. In statistics land, I believe it is called the "power" of the test: are there enough people to "power" the numbers towards statistical significance?

I hope that helps

The larger your sample size, the more likely you are able to detect smaller changes.
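For anyone who wants to put rough numbers on this, here is a minimal sketch using the 10% national rate from the example above. It relies on a textbook normal-approximation formula for comparing one hospital's rate with a known benchmark. The 5% one-sided significance level, the 80% power, and the function name `cases_needed` are my own assumptions, not figures from Joe's talk or the JAMA article.

```python
from math import ceil, sqrt
from statistics import NormalDist


def cases_needed(national_rate, relative_excess, alpha=0.05, power=0.80):
    """Approximate number of cases needed to detect that a hospital's
    mortality exceeds the national rate by `relative_excess`
    (1.0 means 100% worse, 0.2 means 20% worse).

    Normal approximation, one-sided test against a fixed benchmark.
    alpha and power are illustrative defaults (assumptions).
    """
    p0 = national_rate                          # benchmark rate
    p1 = national_rate * (1 + relative_excess)  # hospital's true rate
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # significance cutoff
    z_beta = NormalDist().inv_cdf(power)        # power cutoff
    numerator = z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))
    return ceil((numerator / (p1 - p0)) ** 2)


if __name__ == "__main__":
    for excess in (1.00, 0.20, 0.05):
        n = cases_needed(0.10, excess)
        print(f"{excess:.0%} worse than a 10% national rate: about {n} cases")
```

The required caseload grows roughly with the inverse square of the gap you are trying to detect, which is why a 20% difference needs far more cases than a 100% difference.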

Thanks, HH, for the statistics explanation in your second comment. I think you have it exactly right and your explanation is elegant.

Just to clarify on your first comment. The NSQIP is risk-adjusted, so it takes care of the problem you mentioned.

Yes, HH's explanation is excellent. Let me test my understanding by paraphrasing.

As I said last night, I know the purpose of calculating probability values (e.g. p<.04) is to answer the question "How likely is it that this result is just a random fluke?" And if one test sample's result is 100% greater than the other's, it ain't very likely. But if the other sample's result is just a 20% difference, there's a greater risk that it's random.

Now, here's the next step. That reasoning seems to test whether there's ANY effect going on, or if it's ALL random. In other words, if we had an infinite data set (which we never do), we might eventually find out that the 100% result really settles out at 80%, right? So although we have confidence that there's SOMEthing going on ("yes there is a real difference"), we're not certain it's at the 100% level.

Yes?

Imagining the statistics with bell curves and Christmas trees might help.

Each hospital's mortality rate is a separate bell curve (the highest point of the curve is the most likely actual mortality rate, and the rest of the curve covers other possible rates for that hospital, due to chance factors).

Ok, imagine 3 hospital bell curves as 3 upright Christmas trees. Imagine a sequential number line on the floor of the Christmas tree lot that represents all possible mortality rates. Let's line up these bell curves/trees along this number line. If two trees had the same mortality bell curve, they would occupy the same place on this number line and one would stand behind the other. The distance between the trees is relative to the distance between their respective mortality rates. For example, the 2% and 3% mortality curves/trees would be shoulder to shoulder with some of the branches of one tree smushed into the other because they were standing so close together. The 3% and 6% curves/trees might be maybe arm's distance apart.

Let's step away (as if you were viewing these Christmas trees from the parking lot). The 2% and 3% trees are standing so close to each other that they essentially look like one tree from where you are standing. The 3% and 6% trees, however, are far enough apart from each other that you can actually distinguish that they are two separate trees, even from the distance you are standing.

Basically, if these were bell curves that represent each hospital's possible 'true' mortality rate, the farther the curves are from each other (and the less they overlap), the easier it is to tell them apart. In order to be able to tell apart 2 bell curves that are standing so close together, you would need more information. That information usually comes from having more people in your sample (i.e. you put on your glasses and all of a sudden you see 'more').

I hope this made sense... I would love to hear Joe's concise explanation.
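The tree picture can be sketched in a few lines of code. This is only an illustration: it treats each tree's width as a 95% confidence interval from the usual normal approximation, and the case counts (1,000 and 5,000) and function names are made up for the example. Checking whether two intervals overlap is a rough visual rule of thumb, not a formal significance test.

```python
from math import sqrt
from statistics import NormalDist

Z_95 = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval


def interval(rate, cases):
    """Approximate 95% interval for a mortality rate observed over `cases`."""
    half_width = Z_95 * sqrt(rate * (1 - rate) / cases)
    return rate - half_width, rate + half_width


def trees_overlap(rate_a, rate_b, cases):
    """True if the two intervals (the 'branches') overlap at this caseload."""
    low_a, high_a = interval(rate_a, cases)
    low_b, high_b = interval(rate_b, cases)
    return low_a <= high_b and low_b <= high_a


if __name__ == "__main__":
    for cases in (1000, 5000):  # hypothetical caseloads
        print(cases, "cases:",
              "2% vs 3% overlap:", trees_overlap(0.02, 0.03, cases),
              "| 3% vs 6% overlap:", trees_overlap(0.03, 0.06, cases))
```

Because the interval width shrinks with the square root of the number of cases, more cases make each tree narrower, and that is what eventually lets you tell the 2% and 3% trees apart.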

Beautifully (and seasonally!) done. I wish you had taught my statistics classes.

Why CABG and not hip replacements as well? I think they are done in far greater numbers, and thus should be just as statistically robust.

One point Joe didn't make; mortality rates for most of these procedures are very low. If you look at complications which occur much more frequently, then you need lower numbers to identify statistically significant differences.

I puzzled over the median number of hip replacements on the chart. I guess every binky rural hospital must be doing hip replacements. Pretty scary. I'd want a surgeon who does at least 100 a year.

anonymous,

I understand that when the Christmas trees and their branches overlap, we can not attribute real differences. At what distance between the branches does a statistical difference become convincing?

One more explanation on the stats thing that worked for me in the past:

If someone said they had a coin that, when flipped, would be heads 90 percent of the time, tails 10 percent, you might believe him after a relatively small number of flips confirmed it (say, 50 flips with 45 heads and 5 tails, or even fewer).

If someone else said his coin would be heads 55 percent of the time, tails 45 percent, 50 flips with heads 28 and tails 22 wouldn't convince you of anything (could be just chance). So you'd need a heck of a lot of flips before you believed the difference was real.

That's how I've always thought about it.
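The coin example is easy to check directly. The sketch below computes how likely a perfectly fair coin would be to produce at least that many heads in 50 flips, using the exact binomial formula. The function name `chance_of_at_least` is just a label chosen for this illustration.

```python
from math import comb


def chance_of_at_least(heads, flips, p_heads=0.5):
    """Exact probability of seeing `heads` or more heads in `flips` tosses
    of a coin whose true chance of heads is `p_heads` (fair by default)."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )


if __name__ == "__main__":
    # 45 heads in 50 flips: essentially impossible for a fair coin
    print("45+ heads of 50:", chance_of_at_least(45, 50))
    # 28 heads in 50 flips: happens by chance quite often
    print("28+ heads of 50:", chance_of_at_least(28, 50))
```

A fair coin lands on 28 or more heads out of 50 fairly often, so that result proves little, while 45 or more is so rare that a fair coin is hard to believe.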
