
  • $\mu$: Population mean
  • $\sigma$: Population standard deviation
  • $n$: Sample size
  • $x$: A single data point
  • $\bar{x}$: Sample mean
  • $s$: Sample standard deviation
  • $z$: Z-score
  • $SE$: Standard error

Let’s talk a little bit about how you can figure out the chance that something you are seeing is just due to chance.

Central Limit Theorem

Very briefly, the central limit theorem states that if you keep on drawing samples of some fixed size from a population:

  • The distribution of the means of these samples will be normally distributed (regardless of the distribution of the original data)
  • The mean of this distribution will be the same as the population mean of the original data
  • The standard deviation of this distribution will depend on the standard deviation of the original data (the population std) and the size of the samples that you keep drawing. This standard deviation is called the standard error.

The standard error (SE) is the standard deviation of the distribution of the means of repeated samples.
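
In symbols, if each sample has size $n$ and the population has standard deviation $\sigma$:

$$SE = \frac{\sigma}{\sqrt{n}}$$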

By the way, the distribution of the means of repeated samples is usually called the sampling distribution of the sample mean, or sometimes just the sampling distribution.

Anyhow, the central limit theorem (especially bullet 3) basically tells you that if your sample size is large enough, the distribution of the means of repeated samples will be very tight around the population mean, since $SE = \sigma / \sqrt{n}$ shrinks as $n$ grows.
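
If you want to see this in action, here is a minimal simulation sketch (using numpy, with a made-up exponential population) that checks bullets 2 and 3:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up population: exponential, i.e. deliberately not normal.
population = rng.exponential(scale=2.0, size=100_000)

# Draw 10,000 samples of size n and take each sample's mean.
n = 50
means = rng.choice(population, size=(10_000, n)).mean(axis=1)

# Bullet 2: the mean of the sample means is ~ the population mean.
print(means.mean(), population.mean())

# Bullet 3: their std is ~ sigma / sqrt(n), i.e. the standard error.
print(means.std(), population.std() / np.sqrt(n))
```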

P-Value

The p-value is the probability that you see something just due to chance. More formally, it is the probability of observing a test statistic at least as extreme as the one you actually observed, given that the null hypothesis is true.

The null hypothesis is the hypothesis that there is no effect or no difference. The alternative hypothesis is the hypothesis that there is an effect or a difference.

The p-value is a measure of how much evidence you have against the null hypothesis: the smaller the p-value, the stronger the evidence against it.

I like to think of p-value as the probability of seeing what you see just due to chance.
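
One way to make this concrete: simulate a world where the null hypothesis is true and count how often you see something at least as extreme as what you observed. A minimal sketch, assuming the test statistic is standard normal under the null and using a made-up observed value of 2.0:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw a million test statistics from the null distribution.
null_stats = rng.standard_normal(1_000_000)
observed = 2.0  # made-up observed statistic

# Two-tailed p-value: fraction of null draws at least as extreme.
p = np.mean(np.abs(null_stats) >= abs(observed))
print(p)  # ~0.0455
```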

Z-Test (single data)

Let’s take a concrete example.

Let’s say you have some (roughly normally distributed) random variable with a known mean and standard deviation, and someone gives you a single data point. You want to know whether this data point is “significantly” different from the mean of the random variable.

First, you calculate the z-score for your data point:

$$z = \frac{x - \mu}{\sigma}$$

Then, you can calculate the p-value using the z-score:

$$p = 2 \cdot \left(1 - \Phi(|z|)\right)$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution.

$\Phi(z)$ is basically the area under the curve of the standard normal distribution to the left of a particular z-score. So $1 - \Phi(z)$ is the area under the curve to the right of the z-score.

The $1 - \Phi(|z|)$ term covers all the values that are more extreme than the z-score, but only the positive ones. If we want the extreme negative ones as well, we multiply it by 2.

This just depends on whether you wanna know the probability of seeing a value as extreme as the one you observed in one specific direction (only positive, or only negative), or in either direction. In other words, if you wanna know “any extreme”, you multiply by 2.
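
Putting the last few formulas together, here is a minimal sketch (the population parameters and the data point are made up; scipy's norm.cdf plays the role of $\Phi$):

```python
from scipy.stats import norm

mu, sigma = 100, 15  # made-up population mean and std
x = 130              # made-up single data point

z = (x - mu) / sigma                 # z-score
p_one_tailed = 1 - norm.cdf(abs(z))  # area to the right of |z|
p_two_tailed = 2 * p_one_tailed      # "any extreme": both tails

print(z, p_one_tailed, p_two_tailed)  # 2.0, ~0.0228, ~0.0455
```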

Z-Test (sample mean)

Now, let’s say you have a sample of $n$ data points and you want to know if the sample mean is “significantly” different from the population mean.

In this case, the z-score formula is:

$$z = \frac{\bar{x} - \mu}{SE}$$

where

$$SE = \frac{\sigma}{\sqrt{n}}$$

Then, you can calculate the p-value using the z-score, as before.
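
Same idea in code, a sketch with made-up numbers (the population std is known here, so a z-test still applies):

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 100, 15  # made-up, known population parameters
sample = np.array([104, 110, 98, 107, 112, 101, 109, 105, 96, 111])

se = sigma / np.sqrt(len(sample))  # standard error of the mean
z = (sample.mean() - mu) / se
p = 2 * (1 - norm.cdf(abs(z)))     # two-tailed p-value, as before

print(z, p)  # z ~ 1.12, p ~ 0.26
```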

T-Test

The t-test is used when you don’t know the population standard deviation and so you have to estimate it using the sample standard deviation.

The formula for the t-score is:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

where $s$ is the sample standard deviation.

Then, you can calculate the p-value using the t-score. However, the p-value calculation is a little bit different for the t-test. You have to use the t-distribution instead of the normal distribution:

$$p = 2 \cdot \left(1 - F_{n-1}(|t|)\right)$$

where $F_{n-1}$ is the cumulative distribution function of the t-distribution with $n - 1$ degrees of freedom.
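
As a sketch (made-up data; scipy's ttest_1samp does the whole thing in one call, so we can check the manual computation against it):

```python
import numpy as np
from scipy.stats import t, ttest_1samp

mu = 100  # hypothesized population mean; population std unknown
sample = np.array([104, 110, 98, 107, 112, 101, 109, 105, 96, 111])

n = len(sample)
s = sample.std(ddof=1)  # sample std, n - 1 in the denominator
t_score = (sample.mean() - mu) / (s / np.sqrt(n))
p = 2 * (1 - t.cdf(abs(t_score), df=n - 1))  # two-tailed

print(t_score, p)                       # t ~ 3.03, p ~ 0.014
print(ttest_1samp(sample, popmean=mu))  # should match
```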

It is also worth writing down the formula for the sample standard deviation:

$$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

This looks a lot like the formula for the population standard deviation but with $n - 1$ in the denominator instead of $n$. It is a well-established fact that the plain $n$-denominator version, computed on a sample, underestimates the population standard deviation, so we account for this bias by making the sample standard deviation a little bit bigger (by dividing by a slightly smaller number).
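
In numpy this is the ddof (“delta degrees of freedom”) argument; a quick sketch with made-up data:

```python
import numpy as np

x = np.array([104, 110, 98, 107, 112, 101, 109, 105, 96, 111])

print(np.std(x))          # population formula: divides by n
print(np.std(x, ddof=1))  # sample formula: divides by n - 1, slightly bigger
```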

Confidence Interval

The confidence interval is a range of values that you can be confident (to a certain degree) contains the population mean.

This is when you are trying to estimate the population mean from a sample mean. In other words, you are trying to make a generalization about the population based on a sample.

The formula for the confidence interval is:

$$\bar{x} \pm z^{*} \cdot SE$$

where $z^{*}$ is the z-score that corresponds to the desired confidence level.

How do you map a confidence level to a z-score? You can use a z-table or a z-score calculator.

For example, if you want a 95% confidence interval, you would use a z-score of 1.96 (according to the table).
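
Programmatically, the “z-table lookup” is just the inverse CDF of the standard normal; for example, with scipy:

```python
from scipy.stats import norm

# 95% confidence leaves 2.5% in each tail.
print(norm.ppf(0.975))  # ~1.96
```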

The SE used is often:

$$SE = \frac{s}{\sqrt{n}}$$

Notice this is slightly different from the SE we used previously. This is because we substitute the sample standard deviation for the population standard deviation (because we often don’t know the population standard deviation).

This is a general rule, by the way: if you don’t know the population std, you can use the sample std in a lot of tests/situations, but you have to use the corrected formula (with $n - 1$ in the denominator). And remember, you do this because the uncorrected sample std underestimates the population std. (Strictly speaking, once you substitute the sample std you should also use the t-distribution instead of the normal one, as in the t-test above; for large samples the two are nearly identical.)
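
Finally, a minimal end-to-end sketch (made-up data) that builds a 95% confidence interval using the sample std:

```python
import numpy as np
from scipy.stats import norm

sample = np.array([104, 110, 98, 107, 112, 101, 109, 105, 96, 111])

se = sample.std(ddof=1) / np.sqrt(len(sample))  # SE from the sample std
z_star = norm.ppf(0.975)                        # ~1.96 for 95% confidence

mean = sample.mean()
print(mean - z_star * se, mean + z_star * se)   # the 95% CI
```

For a small sample like this, you would swap z_star for the corresponding t critical value (t.ppf with $n - 1$ degrees of freedom), per the note above.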