Sample variance
Set up
We have a random variable $X$ with unknown mean $\mu$ and variance $\sigma^2$.
We have taken a sample from the distribution: $x_1, x_2, \dots, x_n$. The sample has mean $\bar{x}$.
We want to estimate $\sigma^2$, using the sample.
Question
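Why do we use $\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$ as an estimate for $\sigma^2$, instead of $\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2$?
i.e. why divide by $n-1$ instead of by $n$? Dividing by $n$ is what we usually do when calculating variance...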
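To make the two candidates concrete, here is a minimal sketch that computes both for one small sample. The data values are made up purely for illustration. The comparison against Python's `statistics.pvariance` (divides by $n$) and `statistics.variance` (divides by $n-1$) is only there to show that both conventions are in common use.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(sample)
x_bar = sum(sample) / n

# Sum of squared deviations from the sample mean.
sum_sq = sum((x - x_bar) ** 2 for x in sample)

divide_by_n = sum_sq / n                # the "usual" variance formula
divide_by_n_minus_1 = sum_sq / (n - 1)  # the estimate in question

# The standard library offers both conventions.
print(divide_by_n, statistics.pvariance(sample))
print(divide_by_n_minus_1, statistics.variance(sample))
```

The second number is always the larger of the two, by a factor of $\frac{n}{n-1}$.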
A suggestion...
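Let $s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2$ and let $\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$.
Instead of thinking that $\hat{\sigma}^2 = \frac{n}{n-1}s^2$,
think about it as $\hat{\sigma}^2 = s^2 + \frac{\hat{\sigma}^2}{n}$ (an equivalent statement).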
Why?
Play with the app below to get a feel for what it is doing. Click "sample" to take different samples. Then read on...
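We want an estimate for the population variance: $\sigma^2$.
This relates to the green lines: how the $x_i$ vary around $\mu$.
Let $\hat{\sigma}^2$ be our estimate for $\sigma^2$.
If we knew $\mu$, we could use $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2$ (the mean of the squares of the green lines).
But we don't know $\mu$.
We have $s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2$ (the mean of the squares of the blue lines).
This is something we can calculate: we have values for $\bar{x}$ and for each $x_i$.
This is the sample variance and does give a measure of how the $x_i$ vary around $\bar{x}$.
We also know that if the mean of our sample is treated as a random variable, $\bar{X}$, then its variance is given by the expression: $\mathrm{Var}(\bar{X}) = \dots = \frac{\sigma^2}{n}$ (the "..." is left as an exercise).
So if we're estimating $\sigma^2$ as $\hat{\sigma}^2$, then an estimate for $\mathrm{Var}(\bar{X})$ is $\frac{\hat{\sigma}^2}{n}$.
This relates to the purple line: how $\bar{x}$ varies around $\mu$.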
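The claim that the sample mean has variance $\frac{\sigma^2}{n}$ can be checked by simulation. This is a rough sketch, not a proof. It assumes a normal population with made-up values for $n$ and $\sigma$, and the function name `variance_of_sample_means` is invented for this example.

```python
import random
import statistics


def variance_of_sample_means(n, sigma, trials, rng):
    """Draw `trials` samples of size n and return the variance of their means."""
    means = []
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        means.append(sum(xs) / n)
    # The list of means is treated as a population in its own right.
    return statistics.pvariance(means)


rng = random.Random(0)  # fixed seed so the run is repeatable
n, sigma = 5, 2.0       # assumed values, for illustration only
observed = variance_of_sample_means(n, sigma, trials=20000, rng=rng)
predicted = sigma ** 2 / n

# The two numbers should be close, though not identical.
print(observed, predicted)
```

Increasing `n` shrinks both numbers: the purple line gets shorter as the sample gets larger.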
Putting it together...
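The green arrows are equivalent to following the purple arrow and then the blue arrows: how the $x_i$ vary around $\mu$ depends on how $\bar{x}$ varies around $\mu$, and then how the $x_i$ vary around $\bar{x}$.
So roughly, $\hat{\sigma}^2 = \frac{\hat{\sigma}^2}{n} + s^2$, where $s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2$. (... the maths checks out on this; again, an exercise, with a starting point given below*.)
Solving this equation for $\hat{\sigma}^2$ gives: $\hat{\sigma}^2 = \frac{n}{n-1}s^2$.
i.e. $\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$.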
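A simulation gives a feel for why the correction matters. The sketch below draws many samples from a population whose variance we have chosen ourselves, then averages the divide-by-$n$ and divide-by-$(n-1)$ estimates. The normal population, the parameter values and the function name `average_estimates` are all assumptions made for illustration.

```python
import random


def average_estimates(n, sigma, trials, rng):
    """Average the /n and /(n-1) variance estimates over many samples."""
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        x_bar = sum(xs) / n
        sum_sq = sum((x - x_bar) ** 2 for x in xs)
        total_n += sum_sq / n
        total_n_minus_1 += sum_sq / (n - 1)
    return total_n / trials, total_n_minus_1 / trials


# Assumed population: normal with sigma = 2, so the true variance is 4.
avg_n, avg_n1 = average_estimates(n=5, sigma=2.0, trials=20000,
                                  rng=random.Random(0))

# Dividing by n should land near (n-1)/n * 4 = 3.2 on average;
# dividing by n-1 should land near the true value 4.
print(avg_n, avg_n1)
```

With a small $n$ the gap is easy to see. As $n$ grows, $\frac{n}{n-1}$ approaches 1 and the two estimates become hard to tell apart.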
*The detail
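To consider algebraically why $\hat{\sigma}^2$ should satisfy $\hat{\sigma}^2 = \frac{\hat{\sigma}^2}{n} + s^2$, start here:
$\sum_{i=1}^{n}(X_i-\mu)^2 = \sum_{i=1}^{n}\big((X_i-\bar{X}) + (\bar{X}-\mu)\big)^2$
aiming at:
$E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i-\mu)^2\right] = E\left[(\bar{X}-\mu)^2\right] + E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i-\bar{X})^2\right]$, i.e. $\sigma^2 = \frac{\sigma^2}{n} + E[s^2]$.
Recall that, as the $X_i$ are sampled independently, $\mathrm{Cov}(X_i, X_j) = 0$ for $i \neq j$.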
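Before taking expectations, it may help to see that the "green = purple + blue" split holds exactly for any single sample: $\frac{1}{n}\sum(x_i-\mu)^2 = (\bar{x}-\mu)^2 + \frac{1}{n}\sum(x_i-\bar{x})^2$, because the cross term vanishes when $\sum(x_i-\bar{x}) = 0$. The sketch below checks this numerically for one sample. The values of `mu`, `sigma` and `n` are made up, and the variable names simply follow the colours used on this page.

```python
import random

rng = random.Random(1)        # fixed seed so the run is repeatable
mu, sigma, n = 10.0, 3.0, 8   # assumed "true" values, for illustration only

xs = [rng.gauss(mu, sigma) for _ in range(n)]
x_bar = sum(xs) / n

# Mean squared distance of the points from the true mean (green lines).
green = sum((x - mu) ** 2 for x in xs) / n
# Squared distance of the sample mean from the true mean (purple line).
purple = (x_bar - mu) ** 2
# Mean squared distance of the points from the sample mean (blue lines).
blue = sum((x - x_bar) ** 2 for x in xs) / n

# The two printed values agree up to floating-point rounding.
print(green, purple + blue)
```

The "roughly" in the main argument comes from replacing the unknown green and purple quantities with estimates, not from this identity, which is exact.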