March 13, 2011
“What’s a ‘p-value’ and why do I need one?”
Late last night somebody asked me roughly that question, which led me to realize that I can’t really formulate a clear description of what’s going on in regression analysis. So I thought I’d write a post and give it a try.
Imagine that somebody tells you that, because of a recent head trauma he suffered, he can accurately predict the outcome of coin tosses. Not being an overly credulous person, but having seen a number of sitcoms where such events are possible, you want to find out whether or not what your friend says is true.
You propose a simple test: your friend makes a prediction, “heads” or “tails,” and then you flip a coin and record whether or not he got the prediction right. You will then repeat this simple procedure for 10 total coin flips, and add up his scores to see how many he got right overall.
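This testing procedure is simple enough to sketch in a few lines of Python (the function names here are just for illustration, not from any particular library):

```python
import random

def run_test(predict, flips=10):
    """Score a predictor: count how many of `flips` coin tosses it calls correctly."""
    score = 0
    for _ in range(flips):
        guess = predict()                              # the friend's prediction
        outcome = random.choice(["heads", "tails"])    # the actual coin toss
        if guess == outcome:
            score += 1
    return score

# A "normal person" with no powers just guesses at random.
print(run_test(lambda: random.choice(["heads", "tails"])))
```

Run it a few times and you'll see scores clustering around five, which is the baseline the rest of this post compares against.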
From intuition, it’s clear that getting, say, 6 right out of 10 would not be a very impressive result. After all, anyone has a 50% chance of guessing the outcome of any given coin toss. You would hardly be ready to declare your friend psychic if he got one more correct answer than you would expect from pure chance.
How many correct answers would it take to convince you that there is something to these legendary sitcom injuries? Eight correct? Nine? When you make this judgment, you are implicitly comparing your friend’s result to the result you would expect from a person with no special powers, and thinking about whether the difference between those two results is large enough to be convincing.
The p-value is just a formalization of this intuition (the “p” stands for “probability”). After you perform the test and look at the number your friend got correct, you want to answer the question, “What is the chance I would see a result as extreme as the one I am now seeing, if my friend were just a normal person with no powers?”
For flipping coins, it turns out this question is very easy to answer. If you guess and flip a coin once, your chance of getting it right is 50%. The chance of getting it right twice in a row is 25%. The chance of getting it wrong twice in a row is also 25%. Some basic extensions of this type of reasoning lead to the following table, for 10 coin flips:

Correct guesses    Probability
      0               0.10%
      1               0.98%
      2               4.39%
      3              11.72%
      4              20.51%
      5              24.61%
      6              20.51%
      7              11.72%
      8               4.39%
      9               0.98%
     10               0.10%
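These numbers come straight from the binomial distribution: there are 2^10 = 1024 equally likely sequences of ten flips, and the count of sequences with exactly k correct guesses is "10 choose k." A quick sketch in Python, using only the standard library:

```python
from math import comb

def prob_exactly(k, n=10, p=0.5):
    """Probability of exactly k correct guesses in n fair coin flips."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

for k in range(11):
    print(f"{k:2d} correct: {prob_exactly(k):6.2%}")
```

The probabilities are symmetric around five and sum to 100%, as they must.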
To put the same information in a chart, it looks like this:

[Bar chart of the probabilities above: symmetric, peaking at five correct guesses and tapering toward zero and ten.]
At this point, it helps to reformulate what your friend is saying into a testable hypothesis. What we would expect, for a normal person, is that when doing this test with 10 coin flips, the average number they would get right is five. What your friend is saying, in effect, is “My personal average is not five.” The purpose of this test is to figure out whether it’s reasonable to believe your friend.
Okay, so let’s say you do the test and your friend gets 8 out of 10. Vindicated? I don’t know, let’s see. If his real average is 5, what is the chance of seeing a result at least as far away from 5 as this is? To find out, we look at all the results that are at least three away from five. That means 8, 9, and 10, but also 0, 1, and 2. Adding up all those percentages, we find that the probability that a normal person would get a result at least this extreme is almost 11%. It was a pretty good performance overall, but not really overwhelming evidence of psychic powers.
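The same arithmetic in code: a two-sided p-value adds up every outcome at least as far from the expected five as the observed score (again, the function name is just for illustration):

```python
from math import comb

def p_value(observed, n=10, p=0.5):
    """Two-sided p-value: probability of a result at least as far from
    the expected count (n * p) as `observed`, under pure chance."""
    expected = n * p
    distance = abs(observed - expected)
    return sum(
        comb(n, k) * p**k * (1 - p)**(n - k)
        for k in range(n + 1)
        if abs(k - expected) >= distance
    )

print(f"{p_value(8):.4f}")  # 112/1024 = 0.1094, i.e. almost 11%
```

For 8 out of 10, the qualifying outcomes are 0, 1, 2, 8, 9, and 10, giving 112 of the 1024 possible sequences.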
What if you had performed the test and your friend got 0 out of 10 correct? That is, every single time he says “heads,” it comes up “tails,” and vice versa. Even though all his answers were wrong, you could still take this as strong evidence that his personal expected average is different from 5. According to the table, in only 0.20% of cases would you see a result as extreme as this one for a normal person (chance of getting 0 right plus chance of getting 10 right). Even though he got all the answers wrong, it would still seem to be the case that you can just reverse all his predictions and do better than chance.
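The all-wrong case works out the same way. Only two outcomes are that far from five, so the arithmetic fits on a few lines (standard library only):

```python
from math import comb

n = 10
# Outcomes at least five away from the expected five: 0 correct or 10 correct.
p_zero = comb(n, 0) * 0.5**n   # 1/1024
p_ten  = comb(n, 10) * 0.5**n  # 1/1024
print(f"{p_zero + p_ten:.2%}")  # 0.20%
```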
So this is basically what we’re trying to do when we do statistical analysis. We have some information, our friend saying “heads” or “tails,” that we’re trying to use to predict a real-world event, a coin actually coming up either heads or tails. We want to know whether that information helps us make better predictions or whether it’s just garbage. The p-value is a way to measure that: it tells us the probability that, if his predictions really were garbage, they would nonetheless have appeared at least as useful as they actually did.