This is where the bias comes in. The mean of a sample minimizes the sum of squared deviations from itself. This means that the sum of squared deviations from the sample mean is always smaller than the sum of squared deviations from the population mean. The only exception is when the sample mean happens to equal the population mean.
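To make this concrete, here is a small sketch in Python. The numbers are made up for illustration; any population and any sample drawn from it would show the same inequality.

```python
# A hypothetical population of 10 values; any numbers would work.
population = [2, 4, 4, 4, 5, 5, 7, 9, 10, 12]
pop_mean = sum(population) / len(population)

sample = [2, 9]  # two points drawn from the population
sample_mean = sum(sample) / len(sample)

def sum_sq_dev(data, center):
    """Sum of squared deviations of `data` from `center`."""
    return sum((x - center) ** 2 for x in data)

# The sample's own mean never gives a larger sum than any other
# center, including the population mean.
print(sum_sq_dev(sample, sample_mean))  # deviations from the sample mean
print(sum_sq_dev(sample, pop_mean))     # deviations from the population mean
```

Swap in any other sample and the first number stays less than or equal to the second.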
Below are two graphs. In each graph I show 10 data points that represent our population. I also highlight two data points from this population, which represent our sample. In the left graph I show the deviations from the sample mean, and in the right graph the deviations from the population mean. We see that in the left graph the sum of squared deviations is much smaller than in the right graph.
The sum is smaller when using the sample mean than when using the population mean. This holds for any sample you draw from the population; the only exception, again, is when the sample mean happens to be the same as the population mean. Even when the sample mean lands close to the population mean and the difference is small, using the sample mean still results in a smaller sum than using the population mean. In short, the bias comes from using the sample mean instead of the population mean. The sample mean is always guaranteed to sit in the middle of the observed data, thereby reducing the sum of squared deviations and creating an underestimate of the variance.
Now that we know that the bias is caused by using the sample mean, we can figure out how to solve the problem. Looking at the previous graphs, we see that if the sample mean is far from the population mean, the sample variance is smaller and the bias is large. If the sample mean is close to the population mean, the sample variance is larger and the bias is small. So, the more the sample mean moves around the population mean, the greater the bias.
In other words, besides the variance of the data points around the sample mean, there is also the variance of the sample mean around the population mean. We need both variances in order to accurately estimate the population variance. For that we need to know how to calculate the variance of the sample mean around the population mean: it equals the population variance divided by the sample size, σ²/n. This makes sense, because the greater the variance in the population, the more the sample mean can jump around, while the more data you sample, the closer you get to the population mean.
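A quick simulation sketch can check both claims at once. The population below is made up, and for simplicity the samples are drawn with replacement (for which the σ²/n formula holds exactly in expectation); it estimates the variance of the sample mean and the average biased sample variance over many repeated samples.

```python
import random

random.seed(0)
population = [2, 4, 4, 4, 5, 5, 7, 9, 10, 12]  # hypothetical population
N = len(population)
pop_mean = sum(population) / N
pop_var = sum((x - pop_mean) ** 2 for x in population) / N

n = 2              # sample size
trials = 100_000
mean_var = 0.0     # running estimate of Var(sample mean)
biased_var = 0.0   # running estimate of E[biased sample variance]
for _ in range(trials):
    sample = [random.choice(population) for _ in range(n)]
    m = sum(sample) / n
    mean_var += (m - pop_mean) ** 2 / trials
    biased_var += sum((x - m) ** 2 for x in sample) / n / trials

# Var(sample mean) comes out near pop_var / n, and adding it to the
# average biased sample variance comes out near the population variance.
print(mean_var, pop_var / n)
print(biased_var + mean_var, pop_var)
```

The two printed pairs agree up to simulation noise, which is the additivity we wanted to check.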
Now that we can calculate both the variance of the sample and the variance of the sample mean, we can check whether adding them together recovers the population variance. Below I show a graph in which I again sampled from our population with varying sample sizes.

So the first idea is the idea of the mean. If we're trying to calculate the mean for the population, is that going to be a parameter or a statistic?
Well, when we're trying to calculate it on the population, we are calculating a parameter. So let me write this down: for the population, we are calculating a parameter.
And when we attempt to calculate something for a sample, we would call that a statistic. So how do we think about the mean for a population? Well, first of all, we denote it with the Greek letter mu. And we essentially take every data point in our population. So we take the sum of every data point: we start at the first data point and go all the way to the capital Nth data point. So every i-th data point x sub i gets added up: x sub 1 plus x sub 2, all the way to x sub capital N.
And then we divide by the total number of data points we have. Well, how do we calculate the sample mean? We do a very similar thing with the sample, and we denote it with an x with a bar over it. That's going to be taking every data point in the sample, going up to a lowercase n, adding them up (the sum of all the data points in our sample), and then dividing by the number of data points we actually had.
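The two formulas above can be sketched in a few lines of Python. The numbers here are hypothetical; the only point is that the same recipe (add everything up, divide by the count) runs over capital N points for the parameter and lowercase n points for the statistic.

```python
# Hypothetical data: a population of size N = 10 and a sample of n = 3.
population = [2, 4, 4, 4, 5, 5, 7, 9, 10, 12]
sample = [4, 9, 12]

# mu: add up x_1 through x_N, divide by capital N -- a parameter.
mu = sum(population) / len(population)

# x-bar: add up the n sample points, divide by lowercase n -- a statistic.
x_bar = sum(sample) / len(sample)

print(mu, x_bar)
```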
Now, the other thing that we're trying to calculate for the population, which is a parameter, and that we'll also try to calculate for the sample (and estimate for the population), is the variance, which is a measure of how dispersed the data is, or how much the data points vary from the mean.
So let's write variance right over here. And how do we denote and calculate the variance for a population? Well, for the population, we'd say that the variance, which we denote with a Greek letter sigma squared, is equal to the mean of the squared distances from the population mean. What we do is, for each data point, i equals 1 all the way to capital N, we take that data point and subtract the population mean from it.
So if you want to calculate this, you'd first want to figure out the population mean. Well, that's one way to do it; we'll see there are other ways, where you can calculate them at the same time. But the easiest, or most intuitive, is to calculate the mean first, then for each data point subtract the mean from it, square the result, and then divide by the total number of data points you have.
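Here is a sketch of both routes on a made-up population: the intuitive two-pass version just described, and the single-sweep alternative the transcript alludes to (mean of the squares minus the square of the mean, which only needs a running sum and a running sum of squares).

```python
population = [2, 4, 4, 4, 5, 5, 7, 9, 10, 12]  # hypothetical population
N = len(population)

# Two-pass: first the population mean, then the mean squared distance.
mu = sum(population) / N
sigma_sq = sum((x - mu) ** 2 for x in population) / N

# One-pass alternative: accumulate sum and sum of squares in a single
# sweep, then take mean of squares minus square of the mean.
sigma_sq_alt = sum(x * x for x in population) / N - mu ** 2

print(sigma_sq, sigma_sq_alt)
```

Both routes give the same σ² up to floating-point rounding.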
Now we get to the interesting part: sample variance. When people talk about sample variance, there are several tools in their toolkit, several ways to calculate it.
One way is the biased sample variance, a non-unbiased estimator of the population variance. That's usually denoted by s with a subscript n. And how do we calculate this biased estimator? Well, we calculate it very similarly to how we calculated the variance right over here, but we do it for our sample, not our population.
So for every data point in our sample (we have n of them) we take that data point, subtract our sample mean from it, square it, and then divide by the number of data points that we have.
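That recipe can be sketched directly; the sample below is made up. The transcript stops at the biased version, but for comparison the sketch also shows the Bessel-corrected variant that divides by n − 1 instead of n, which is the correction motivated by the bias discussion earlier, and which the standard library's statistics.variance computes.

```python
import statistics

sample = [4, 9, 12]        # hypothetical sample, n = 3
n = len(sample)
x_bar = sum(sample) / n

# Biased sample variance (s sub n, squared): divide by n.
s_n_sq = sum((x - x_bar) ** 2 for x in sample) / n

# Bessel-corrected version: divide by n - 1 instead; this matches
# what statistics.variance computes.
s_sq = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

print(s_n_sq, s_sq, statistics.variance(sample))
```

The biased value is always the smaller of the two, which is exactly the underestimation the first half of the page explains.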