ENV710 Statistics Review Website

Sampling

Sampling and the Central Limit Theorem

Learning objectives

1. Understand the why and how of simple random sampling.

2. Understand the Central Limit Theorem and its profundity in statistics.

3. Understand the sampling distribution of the sample mean.

4. Be able to calculate and apply the concept of standard error.

Simple Random Sampling:

The Why and the How!

There are several reasons in statistics why we sample instead of analyzing the whole of a population. Studying a whole population is often cumbersome and resource intensive (expensive!). It may not be possible to identify and measure each individual in the population, or it may be destructive to do so. Luckily, the mathematics of statistics (probability!) allows us to take a sample from a population and make inferences about that population. Simple random sampling is one of the core concepts underlying much of data collection and analysis. In simple random sampling, each individual or object in a population has an equal probability of being selected into the sample. Often the first step in simple random sampling is to define the population of concern (the list of its members is often called a sampling frame). Individuals or objects are then selected from this population through a random process (often using a random number generator). Other methods of sampling include systematic random sampling, cluster sampling and stratified random sampling.
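As a minimal sketch of the selection step described above, the following Python snippet draws a simple random sample with a random number generator. The sampling frame of 500 ID numbers, the sample size of 50, and the seed are all made-up values for illustration.

```python
import random

# Hypothetical sampling frame: ID numbers for 500 individuals in the population.
sampling_frame = list(range(1, 501))

# Fixed seed (an arbitrary choice) so the draw can be reproduced.
rng = random.Random(710)

# Draw 50 individuals without replacement. Every individual in the frame
# has the same probability of ending up in the sample.
sample_ids = rng.sample(sampling_frame, k=50)

print(sorted(sample_ids))
```

Because `sample` draws without replacement, no individual can be selected twice, which matches how a simple random sample is usually collected in practice.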

Sampling Error and the Sampling Distribution of the Sample Mean

Because we sample (at random), the distribution of each sample we collect will differ from the population (and from each other) due to the random processes of sampling. Our samples will not exactly mimic our population of concern. The difference between a sample statistic (such as a mean, xbar) and the true population parameter (such as mu) is called the SAMPLING ERROR. We can develop a sampling distribution of the sample means to see the distribution of means of multiple samples. For example, from a population of incoming MEM students, we might sample 50 students and ask them their age. Our first sample gives us a mean of 25.1. We then repeat this 99 more times (100 different samples, each with a sample size of 50).

We can then develop a histogram of the sample means! This is what is called the ‘sampling distribution of the sample mean’. We would expect the distribution of sample means to be less dispersed than the distribution of the ages of ALL incoming MEM students. If the sample size is large enough, we would expect that the mean of the sample means would approach the true population mean. In addition, as the sample size increases, the distribution of the means will approach the normal distribution. With a large sample size, the sample means are approximately normally distributed with a mean of μ (mu) and a standard deviation of σ/sqrt(n). The standard deviation of the sample means is called the standard error of the mean (σ/sqrt(n)). In statistical notation: xbar ~ N(μ, σ/sqrt(n)), where the second term is the standard deviation of the sample means (the standard error).

The profundity of the Central Limit Theorem: As sample size gets larger, even if you start with a non-normal distribution, the sampling distribution approaches a normal distribution.
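The repeated-sampling idea above (100 samples of 50 students each) can be sketched in Python. This is an illustration under stated assumptions, not the data behind this page: the population of 10,000 ages is made up, and the seed is arbitrary.

```python
import random
import statistics

rng = random.Random(42)  # arbitrary seed for reproducibility

# Hypothetical population: ages of 10,000 students (made-up values, 21 to 35).
population = [rng.randint(21, 35) for _ in range(10_000)]
mu = statistics.fmean(population)      # true population mean
sigma = statistics.pstdev(population)  # true population standard deviation

n = 50             # size of each sample
num_samples = 100  # number of repeated samples

# Mean of each simple random sample (drawn without replacement).
sample_means = [
    statistics.fmean(rng.sample(population, n)) for _ in range(num_samples)
]

# Sampling error of the first sample: sample statistic minus parameter.
first_sampling_error = sample_means[0] - mu

# Theoretical standard error versus the observed spread of the sample means.
standard_error = sigma / n ** 0.5
observed_sd = statistics.stdev(sample_means)

print(f"population mean = {mu:.2f}, population SD = {sigma:.2f}")
print(f"sampling error of first sample = {first_sampling_error:.2f}")
print(f"mean of sample means = {statistics.fmean(sample_means):.2f}")
print(f"sigma/sqrt(n) = {standard_error:.2f}, observed SD = {observed_sd:.2f}")
```

The population is kept much larger than the sample so that sampling without replacement has a negligible effect on the σ/sqrt(n) formula.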

The Central Limit Theorem!

The essence of the Central Limit Theorem: As the sample size increases, the sampling distribution of the sample mean ( xbar ) concentrates more and more around µ (the population mean). The shape of the distribution also gets closer and closer to the normal distribution as sample size n increases. An example follows. The figure below shows a histogram of our (make believe) population.

Typically the exact distribution of a population is UNKNOWN, but to demonstrate the Central Limit Theorem, we will start with this known distribution (in blue). This population has a mean (mu) of 2.25 and a standard deviation of 3.93. From this population distribution, I randomly selected a sample of 2 (n=2) and calculated an average (xbar). I then repeated this 1,000 times and made a histogram of the 1,000 sample means.

As you can see in the red histogram (sample size n=2), the dispersion of the distribution of sample means is less than the parent population (a greater concentration of values around the mean). The empirical mean of this distribution is 2.31 with a standard deviation of 2.79. I repeated this sampling process three more times with sample sizes of 5, 20 and 100 (see the histograms below). As you can see, as sample size increases, the distribution gets increasingly narrow and increasingly approaches a normal distribution. This is the essence of the Central Limit Theorem.
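The demonstration above can be imitated with a short simulation. The sketch below does not use the population shown on this page (its exact distribution is not given here); instead it assumes a hypothetical right-skewed exponential population with a mean of 2.0. The helper names `sample_mean` and `skewness` are made up for this example.

```python
import random
import statistics

rng = random.Random(2024)  # arbitrary seed for reproducibility

POP_MEAN = 2.0  # assumed population mean; for an exponential, the SD is also 2.0


def sample_mean(n):
    """Mean of n independent draws from the skewed (exponential) population."""
    draws = [rng.expovariate(1 / POP_MEAN) for _ in range(n)]
    return statistics.fmean(draws)


def skewness(values):
    """Moment-based skewness; close to 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


results = {}
for n in (2, 5, 20, 100):
    # 1,000 sample means for this sample size.
    means = [sample_mean(n) for _ in range(1000)]
    results[n] = (
        statistics.fmean(means),  # should stay near the population mean
        statistics.stdev(means),  # should shrink roughly like sigma/sqrt(n)
        skewness(means),          # should move toward 0 (more normal-looking)
    )
    print(f"n={n:>3}: mean={results[n][0]:.2f}, "
          f"SD={results[n][1]:.2f}, skewness={results[n][2]:.2f}")
```

As n grows, the standard deviation of the sample means should fall and the skewness should drift toward zero, which is the narrowing and "normalizing" visible in the histograms.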

Calculating Z-Scores with the Sampling Distribution of the Sample Means

The z-score of a sample mean equals the sample mean minus the population mean, all divided by the standard error of the sample mean: z = (xbar - μ) / (σ/sqrt(n)). Remember, when dealing with sampling, and therefore sampling error, you need to use the standard error in your z-calculations!

Now on to an example. Let’s say we have a production process that produces long-lasting light bulbs. The average lifespan of the bulbs produced is 1,500 hours with a population standard deviation of 300 hours.

(1) We select 100 light bulbs at random. What is the standard deviation of the sample means?

(2) What is the probability that one bulb, selected at random, will last longer than 1,800 hours?

(3) What is the probability that the average of 100 randomly selected bulbs is greater than 1,800 hours?

Solution

(1) The standard deviation of the sample means equals the known population standard deviation divided by the square root of the sample size (n). Therefore the SD(xbar)= 300/sqrt(100) = 300/10= 30 hours.

(2) Because we are interested in just one bulb (not an average), we use this z-score formula: Z = (X - μ) / σ

Therefore, Z=(1,800 – 1,500)/300 = 300/300 = 1. We want to know the probability that X>1,800, which translates into Z>1. Assuming bulb lifespans are normally distributed, we look this up in our z table and find that p(Z>1) = 1 - 0.8413 = 0.1587. There is approximately a 15.9% chance that one light bulb chosen at random will last longer than 1,800 hours.

(3) Because we are interested in the average, we use the following z-score formula: z = (xbar - μ) / (σ/sqrt(n))

Therefore, z=(1,800 – 1,500)/(300/sqrt(100))=300/30 = 10. And the probability that z>10 is practically zero.
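The three light bulb calculations can be checked with Python's standard library instead of a z table. This is a sketch that, like the solution above, assumes lifespans (and therefore the sample mean) follow a normal distribution; the variable names are chosen for this example.

```python
from math import sqrt
from statistics import NormalDist

mu = 1500         # population mean lifespan (hours)
sigma = 300       # population standard deviation (hours)
n = 100           # number of bulbs in the sample
threshold = 1800  # lifespan of interest (hours)

standard_normal = NormalDist()  # mean 0, standard deviation 1

# (1) Standard deviation of the sample means (the standard error).
standard_error = sigma / sqrt(n)

# (2) One bulb: use the population standard deviation.
z_single = (threshold - mu) / sigma
p_single = 1 - standard_normal.cdf(z_single)

# (3) Average of n bulbs: use the standard error.
z_mean = (threshold - mu) / standard_error
p_mean = 1 - standard_normal.cdf(z_mean)

print(f"(1) standard error = {standard_error} hours")
print(f"(2) z = {z_single}, P(X > 1800) = {p_single:.4f}")
print(f"(3) z = {z_mean}, P(xbar > 1800) = {p_mean:.3g}")
```

The only difference between parts (2) and (3) is the denominator of the z-score: σ for a single bulb, σ/sqrt(n) for an average.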

Sample Problems

1. True or False: As sample size doubles, its standard error halves, holding the standard deviation constant.

2. True or False: Sample means calculated from random samples from a given population will always be normally distributed.

3. Suppose that the IQs of Duke University students can be described by a normal distribution with mean 130 and standard deviation of eight points (this is the population).

a. We select one Duke student at random. What is the probability that this student’s IQ is less than 120?

b. We select 5 Duke students at random. What is the probability that their AVERAGE IQ is less than 120?

SOLUTIONS


This website was developed by Elizabeth A. Albright, PhD of the Nicholas School of the Environment, Duke University.


Photo credit: Elizabeth A. Albright, PhD, A Croaker from the Neuse River at Minesott Beach, NC.