- The sampling procedure must be random
- Samples must be independent of each other
- The sample size should be no more than 10% of the population when sampling without replacement
- The sample size should be sufficiently large (a size of n=30 is usually considered large enough, though this really depends on the underlying population)
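Under these conditions, the theorem states that the distribution of the sample mean approaches a Normal distribution as n grows, regardless of the shape of the population's own distribution:

```latex
\bar{X}_n \;\xrightarrow{d}\; \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{as } n \to \infty
```

where \mu and \sigma^2 are the population mean and variance.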
To fully appreciate this theorem, let’s visualize it in Python. I’m going to create random samples of men’s weights (imagining they range between 60 and 90 kg), each of size n=50. Then I will run this simulation multiple times and see whether the distribution of the sample means resembles a Normal distribution.
from numpy.random import seed
from numpy.random import randint
from numpy import mean
# seed the random number generator, so that the experiment is replicable
seed(1)
# generate a sample of 50 men's weights (randint's upper bound is exclusive,
# so values range from 60 to 89)
weights = randint(60, 90, 50)
print(weights)
print('The average weight is {} kg'.format(mean(weights)))
Now let’s repeat the same experiment 1,000 times:
import matplotlib.pyplot as plt
# seed the random number generator, so that the experiment is replicable
seed(1)
# calculate the mean of 50 men's weights 1000 times
means = [mean(randint(60, 90, 50)) for _i in range(1000)]
# plot the distribution of sample means
plt.hist(means)
plt.show()
print('The mean of the sample means is {}'.format(mean(means)))
According to the CLT, the mean of the sample means (74.54) should be a good estimate of the true population mean (which, in a real study, would be unknown; here the weights are drawn uniformly from the integers 60 to 89, so the true mean is 74.5).
To be sure of our result, let’s run a normality test on our data. For this purpose, I’m going to use the Shapiro-Wilk normality test (you can read more about this test here), where the hypotheses are:
H0: data follow a Normal distribution
H1: data do not follow a Normal distribution
So, if our sample means follow a Normal distribution, we will fail to reject the null hypothesis.
from scipy.stats import shapiro
stat, p = shapiro(means)
print('Statistics={}, p={}'.format(stat, p))
alpha = 0.05
if p > alpha:
    print('Sample looks Normal (do not reject H0)')
else:
    print('Sample does not look Normal (reject H0)')
Since the p-value is far greater than our significance level alpha (indeed, it is greater than any commonly used significance level), we do not reject H0.
Now let’s see what happens if we increase the sample size from 50 to, respectively, 80, 90 and 100:
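The comparison can be reproduced with a simple loop (a sketch reusing the same seed and 1,000 repetitions as before; the exact statistics you get may vary slightly across NumPy/SciPy versions):

```python
from numpy.random import seed, randint
from numpy import mean
from scipy.stats import shapiro

seed(1)
for n in [80, 90, 100]:
    # compute 1,000 sample means for samples of size n
    means = [mean(randint(60, 90, n)) for _ in range(1000)]
    # run the Shapiro-Wilk normality test on the sample means
    stat, p = shapiro(means)
    print('n={}: Statistics={:.4f}, p={:.4f}'.format(n, stat, p))
```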
As you can see, the larger the sample size n, the higher the p-value, and the greater the confidence with which we fail to reject the null hypothesis of normality.