Why divide by n - 1 to calculate the variance of a sample?
In a recent tweet I asked the question why we use n - 1 to calculate the variance of a sample. Many people contributed an answer, but many of them were of the type I feared. Most consisted of some statistical jargon that confuses me more, rather than less. Other responses were very useful, though, so I recommend checking out the replies to the tweet. In this post, I will try to describe my favorite way of looking at the issue.
- Published On
- Last Updated
Many thanks to the people who replied to my tweet about why you should divide by n - 1. Below I try to show the intuition behind why this is necessary. If you want to follow along in R, you can copy the code from each code section; beginning with some setup code.
The formula
The formula for calculating the variance is:
The variance is a measure of the dispersion around the mean, and in that sense this formula makes sense. We calculate all the deviations from the mean (), square them (for reasons I might go into in a different post) and sum them. We then divide this sum by the number of observations as a scaling factor. If we ignore this number, we could get a very high variance simply by observing a lot of data. So, to fix that problem, we divide by the total number of observations.
However, this is the formula for the population variance. The formula for calculating the variance of a sample is:
Why do we divide by n - 1?
You can Google this question of course, and you will get some answers. You might find a mathematical proof of why you need to subtract 1 or something about degrees of freedom. These kinds of answers don’t work for me. I trust them to be correct, but it doesn’t produce any insight. It does not actually help me understand 1) the problem and 2) why the solution is the solution that it is. So, below I am going to try to figure it out in a way that actually makes conceptual and intuitive sense (to me).
The problem
The problem with using the population variance formula to calculate the variance of a sample is that it is biased. It is biased in that it produces an underestimation of the true variance. Let’s demonstrate that with some simulated data.
We simulate a population of 1000 data points from a uniform distribution with a range from 1 to 10. Below I show the histogram that represents our population.
The variance is 8.76. Note that this is our population variance (often denoted as ). We want to estimate this value using samples drawn from our population, so let’s do that.
To start, we can draw a single sample of size 5. Say we do that and get the following values: 7, 6, 3, 5, 5. We can then calculate the variance in two ways, using division by n and division by n - 1. In the former case, this will result in 1.76 and in the latter case it results in 2.2.
Now let’s do that many many times. Below I show the results of draws from our population. I simulated drawing samples of size 2 to 10, each 1000 different times. I then plotted for each sample size the average biased variance (dividing by n) and the average unbiased variance (dividing by n - 1). Let’s first look at what the draws look like without dividing by n - 1.
We see that the biased measure of variance is indeed biased. The average variance is lower than the true variance (indicated by the dashed line), for each sample size.
And indeed, the unbiased variance is indeed unbiased. On average, the sample variance matches that of the population variance.
The results of using the biased measure of variance reveals several clues for understanding the solution to the bias. We see that the amount of bias is larger when the sample size of the samples is smaller. So the solution should be a function of sample size, such that the required correction will be smaller as the sample size increases. We also see that that the bias at n = 2 is half that of the true variance, at n = 3, at n = 4, and so on. Interesting.
But before we go into the solution, we still need to figure out what exactly causes the bias.
Ideally we would estimate the variance of the sample by subtracting each value from the population mean. However, since we don’t know what the population mean is, we use the next best thing—the sample mean. This is where the bias comes in. When you use the sample mean, you’re guaranteed that the mean lies somewhere within the range of your data points. In fact, the mean of a sample minimizes the sum of squared deviations from the mean. This means that the sum of deviations from the sample mean is almost always smaller than the sum of deviations from the population mean. The only exception to that is when the sample mean happens to be the population mean.
Let’s illustrate this with a few graphs. Below is a graph in which I show 10 data points that represent our population. I also highlight two data points from this population, which represents our sample. In this first graph I show the deviations from the sample mean.
The next graph shows the deviations from the population mean.
We see that in the first graph the sum of squared deviations is much smaller than in the second graph. The sum is (8 - 9)² + (10 - 9)² = 2 in the left graph and in the right graph it’s (8 - 5.8)² + (10 - 5.8)² = 22.48. The sum is smaller when using the sample mean compared to using the population mean.
This is true for any sample you draw from the population (again, except when the sample mean happens to be the same as the population mean). Let’s look at one more draw where the sample mean is closer to the population mean. In this case, the deviations from the sample mean look like this:
And the deviations from the population mean look like this:
The sum of deviations in the first graph is (2 - 6)² + (10 - 6)² = 32 and the sum in the second graph is (2 - 5.8)² + (10 - 5.8)² = 32.08. The difference is small now, but using the sample mean still results in a smaller sum compared to using the population mean.
In short, the source of the bias comes from using the sample mean instead of the population mean. The sample mean is always guaranteed to be in the middle of the observed data, thereby reducing the variance, and creating an underestimation.
The solution
Now that we know that the bias is caused by using the sample mean, we can figure out how to solve the problem.
Looking at the previous graphs, we see that if the sample mean is far from the population mean, the sample variance is smaller and the bias is large. If the sample mean is close to the population mean, the sample variance is larger and the bias is small. So, the more the sample mean moves around the population mean, the greater the bias.
In other words, besides the variance of the data points around the sample mean, there is also the variance of the sample mean around the population mean. We need both variances in order to accurately estimate the population variance.
The population variance is thus the sum of two variances:
Let’s confirm that this is true. For that we need to know how to calculate the variance of the sample mean around the population mean. This is relatively simple; it’s the variance of the population divided by n (). This makes sense because the greater the variance in the population, the more the mean can jump around, but the more data you sample, the closer you get to the population mean.
Now that we can calculate both the variance of the sample and the variance of the sample mean, we can check whether adding them together results in the population variance.
Below I show a graph in which I again sampled from our population with varying sample sizes. For each sample, I calculated the sample variance (the biased one) and the variance of the mean of that sample ()1. I did this 1000 times per sample size, took the average of each and put them on top of each other. I also added a dashed line to indicate the variance of the population, which is the benchmark we’re trying to reach.
Indeed, we see that the variance of the sample and the variance of the mean of the sample together form the population variance.
The math
Now that we know that the variance of the population consists of the variance of the sample and the variance of the sample mean, we can figure out the correction factor we need to apply to make the biased variance measure unbiased.
Previously, we found an interesting pattern in the simulated samples, which is also visible in the previous figure. We saw that at sample size n = 2, the (biased) sample variance appears to be half that of the (unbiased) population variance. At sample size n = 3, it’s . At sample size n = 4, it’s , and so on.
This means that we can fix the biased variance measure by multiplying it with . At n = 2, we multiply the biased variance by . For sample size n = 3, we multiply by . At sample size n = 4, it’s .
In other words, to unbias the biased variance measure, we multiply it by a correction factor of . But where does this correction factor come from?
Well, because the sample variance misses the variance of the sample mean, we can expect that the variance of the sample is biased by an amount equal to the variance of the population minus the variance of the sample mean. In other words:
Rewriting this 2, produces:
The variance of a sample will be biased by an amount equal to . To correct that bias we should multiply the sample variance by the inverse of this bias: 3. This is also called Bessel’s correction.
So, an unbiased measure of our sample variance is the biased sample variance times the correction factor:
Because the n in the denominator of the left term (the biased variance formula) cancels out the n in the numerator of the right term (the bias correction), the formula can be rewritten as:
Summary
We calculate the variance of a sample by summing the squared deviations of each data point from the sample mean and dividing it by n - 1. The n - 1 actually comes from a correction factor that is needed to correct for a bias caused by taking the deviations from the sample mean rather than the population mean. Taking the deviations from the sample mean only constitutes the variance around the sample mean, but ignores the variation of the sample mean around the population mean, producing an underestimation equal to the size of the variance of the sample mean: . The correction factor corrects for this underestimation, producing an unbiased estimate of the population variance.