**Abstract:**

I found it very difficult to comprehend the concepts of sampling, the sampling distribution of the mean and the confidence interval. These concepts play a very important role in inferential statistics, which is an integral part of the six sigma tools. This is an attempt to simplify the concept of the confidence interval, or simply CI.

In practical life we need to take decisions about a population based on the analysis of a sample drawn from that population. For example, a batch of one million tablets is to be qualified by QA based on a sample of, say, 50 tablets. The confidence interval calculated from the sample gives us an interval estimate that may contain the population parameter, with some degree of error α.

Suppose the population average is a golden fish that we want to catch from a pond using 100 nets of different sizes (each net is equivalent to a CI). Then about 95 of those 100 nets would capture the golden fish if we consider an error of 5%.

**Introduction to confidence interval**

Let’s assume that we are in the fishing business and we have our own farm where we are raising fish right from the egg stage to the mature fish.

Traditionally, we incubate the eggs for 10-15 days to get the larvae, which are then transferred to the juvenile tank, where they are fed and monitored for another 3 months. Once they are 3 months old, we transfer all the fish from the juvenile tank to a larger tank, where they are kept on a different diet for another 3 months. After this, all the fish are sold to the wholesaler.

Now the problem is that, in order to maximize the profit per fish, we would like to sell only those fish that weigh around 900-950 Gm. Any fish lighter than that would be a loss to us, as we would be giving away a larger number of fish for a given order from the wholesaler. You could say, let’s weigh each and every fish and sell only those between 900-950 Gm. Yes, that can be a solution, but imagine the effort required to take out every fish, weigh it and arrange to keep it in some other tank. This is very difficult, as at any given time we have 1000-1500 fish in the tank. Is there a way to estimate the average weight of all the fish in the tank?

**Yes, Statistics does that. But how?**

Statistically, it can be done by drawing a random sample, calculating the average of the sample and then using it to estimate the average of the population. But we need to understand a very important point here: there are chances that the sample thus collected might not represent the entire population of fish (this is called sampling error). Hence, there would be some margin of error in estimating the population average.

Therefore, the average weight of all fish in the pond can be expressed as

Average weight of all fish = average weight of the sample ± margin of error

Or simply

Average population parameter = sample’s average ± margin of error (equation-1)

**Sample Average**: This average will vary from sample to sample but, as the sample size increases, the average of the sample will get closer to the population average.

**Margin of Error:** This margin of error is calculated statistically. Right now it is sufficient to know that the margin of error is directly proportional to the standard deviation of the sample and inversely proportional to the square root of the sample size. It means that if we want a narrow interval for the estimate of the population average, we need to take a larger sample with small variation.

*The ratio of the sample’s standard deviation to the square root of the sample size, s/√n, is generally known as the standard error.*

There is also a statistical constant involved in the calculation of the margin of error, called the critical t-value. This critical t-value depends on the presumed error α and the degrees of freedom.

**What is this error α?**

This error, or risk, is denoted by α. Let’s assume, in the above fish example, that we draw 100 samples of 5 fish each and calculate all 100 intervals using equation-1. We would like all these intervals to contain the population average, but there are chances that some intervals might not contain it. This is because we are working with samples, and sampling error is bound to happen. So we accept an error or risk α: the proportion of intervals (calculated from repeated samples) that would not contain the population average. This α is expressed as a percentage or as a probability. For example, α = 0.05 means that each interval has a 5% chance (0.05 probability) of missing the population average, so about 5 out of 100 intervals thus calculated are expected not to contain it.

This α is decided prior to starting any experimentation using samples. It’s purely a business decision based on the risk appetite of the company. We can work with α = 0.05, 0.1 or 0.15; generally we work with 0.05.
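The meaning of α can be checked with a small simulation. The sketch below is only illustrative: it assumes a hypothetical pond whose weights are normally distributed with a mean of 930 Gm and a standard deviation of 28 Gm (both made-up values), and it uses the t-table value 2.776 for α = 0.05 with samples of 5 fish. The function names `ci_for_sample` and `coverage` are introduced here for illustration.

```python
import random
import statistics

# Two-tailed critical t-value for alpha = 0.05 and df = 4 (t-table value).
T_CRIT_DF4 = 2.776

def ci_for_sample(sample, t_crit):
    """Return (lower, upper) limits of the confidence interval for the mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / n ** 0.5  # sample std. deviation / sqrt(n)
    margin = t_crit * std_err
    return mean - margin, mean + margin

def coverage(true_mean, true_sd, n_samples, sample_size, t_crit, seed=1):
    """Fraction of intervals that contain the true population mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # Assumption: weights follow a normal distribution.
        sample = [rng.gauss(true_mean, true_sd) for _ in range(sample_size)]
        lower, upper = ci_for_sample(sample, t_crit)
        if lower <= true_mean <= upper:
            hits += 1
    return hits / n_samples

# Hypothetical pond: mean 930 grams, standard deviation 28 grams (assumed).
print(coverage(930, 28, n_samples=100, sample_size=5, t_crit=T_CRIT_DF4))
print(coverage(930, 28, n_samples=20000, sample_size=5, t_crit=T_CRIT_DF4))
```

With only 100 samples the fraction will wobble around 0.95 from run to run; with many more samples it should settle close to 1 - α = 0.95.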

**Once we have defined α, we can now discuss t-critical.**

The t-critical is the threshold value (like the z-value, discussed earlier) on the t-distribution beyond which the process is no longer considered the same. That is, if the observations fall in the region below t_{critical} (in the figure below), we would say that the samples are coming from the same parent population; otherwise, they are coming from a different population. This t-value is characterized by two parameters, α and the degrees of freedom (df = number of observations - 1, or simply n-1), and its value can be obtained from the t-distribution table.

**When to take α or α/2?**

We have considered a total acceptable error of 5%. Depending on the scenario, the calculated interval might miss the population parameter on either side of the interval, so the total error is distributed equally between both ends of the interval. For example, say the calculated interval is 915 to 935 Gm. The actual population mean might be less than 915 or more than 935. So if the total error we started with is 5%, then 2.5% is assigned to each end. This is the case of a two-tailed test, and the error on a single side is represented by α/2. Had this been a one-tailed test, there would be no need to split the error, and it would be represented by α.
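To see how the choice between α and α/2 changes the critical value, here is a rough sketch that computes t-critical numerically instead of reading it from a table. It uses only the Python standard library; the helper names `t_pdf`, `t_cdf` and `t_critical` are made up for this illustration, and the numerical integration is a simple approximation, not a replacement for a proper statistics package.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0: Simpson's rule on [0, x] plus 0.5 for the left half."""
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(tail_area, df):
    """Find t such that the area to the right of t equals tail_area (bisection)."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1 - t_cdf(mid, df) > tail_area:
            lo = mid  # tail still too large, move right
        else:
            hi = mid
    return (lo + hi) / 2

alpha, df = 0.05, 4
print(round(t_critical(alpha / 2, df), 3))  # two-tailed: alpha split across both ends
print(round(t_critical(alpha, df), 3))      # one-tailed: all of alpha on one end
```

For df = 4 the results should agree with the t-table values of about 2.776 (two-tailed) and 2.132 (one-tailed). The two-tailed value is larger because each tail is allowed only half of the error.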

Now we are ready to estimate the population parameter

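Average population parameter = sample’s average ± margin of error

where margin of error = t_{α/2,df} × s/√n, s is the sample’s standard deviation and n is the sample size.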

Now we can see that if we want a narrower interval, we need to decrease the term “margin of error”, and for that we have to increase the sample size or decrease the standard deviation. Since controlling the standard deviation is not in our hands (it is a characteristic of the random sample), we can increase the sample size in order to get closer to the population parameter.

The interval calculated above is called the CONFIDENCE INTERVAL, or simply CI.

The concept of CI obtained from a group of samples is illustrated below
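The effect of the sample size on the margin of error can be sketched as follows. This is an illustration under an assumption: the sample’s standard deviation is held at 27.97 Gm for every sample size, which would not happen exactly in practice. The critical t-values are standard t-table values for α = 0.05 (two-tailed), and the function name `margin_of_error` is introduced here for illustration.

```python
import math

# Two-tailed critical t-values for alpha = 0.05, keyed by degrees of freedom
# (standard t-table values).
T_CRIT_95 = {4: 2.776, 9: 2.262, 29: 2.045, 99: 1.984}

def margin_of_error(std_dev, n, t_table=T_CRIT_95):
    """Margin of error = t_critical * s / sqrt(n), with df = n - 1."""
    return t_table[n - 1] * std_dev / math.sqrt(n)

# Assumption: the same standard deviation (27.97 grams) for every sample size.
for n in (5, 10, 30, 100):
    print(n, round(margin_of_error(27.97, n), 1))
```

The margin shrinks for two reasons as n grows: the standard error s/√n gets smaller, and the critical t-value itself decreases with more degrees of freedom.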

**Example:**

We have no idea about the average weight of all fishes in the pond and we also don’t have any idea about the standard deviation of the weights of all fishes in the tank. In that case we have to estimate the average weight of all fishes in the pond based on the sample’s average and its standard deviation.

Let’s take out the first sample of five fish from the pond and calculate its CI.

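Methodology to be applied

CI = sample’s average ± t_{α/2,df} × s/√n

Average weight of five fish from first sample = 928 Gm

Standard deviation of five fish from first sample = 27.97 Gm

Now calculate t_{α/2,df}

As α/2 = 0.025 and df = 5-1 = 4

Therefore, from the t-distribution table, t_{0.025,4} = 2.776

Margin of error = 2.776 × 27.97/√5 ≈ 34.7 Gm

Confidence interval of the first sample = 928 ± 34.7 Gm, i.e. approximately 893 to 963 Gm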
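The calculation for the first sample can be reproduced with a few lines of Python. This is a minimal sketch that takes the sample average and standard deviation given above and assumes the t-table value 2.776 for α/2 = 0.025 and df = 4; the function name `confidence_interval` is made up for this illustration.

```python
import math

def confidence_interval(mean, std_dev, n, t_crit):
    """Return (lower, upper) = mean -/+ t_crit * std_dev / sqrt(n)."""
    std_err = std_dev / math.sqrt(n)   # standard error of the mean
    margin = t_crit * std_err          # margin of error
    return mean - margin, mean + margin

# First sample: 5 fish, average 928 grams, standard deviation 27.97 grams.
lower, upper = confidence_interval(mean=928, std_dev=27.97, n=5, t_crit=2.776)
print(f"{lower:.1f} to {upper:.1f} grams")
```

Rounded to whole grams, the limits come out at roughly 893 and 963.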

**Inference from the above CI**

From the first sample, we got a CI of 893 to 963 Gm. It means that our interval estimate for the average weight of all fish in the pond is 893 to 963 Gm. But we still can’t pinpoint the exact average of all fish in the pond!

Further, if we draw 99 more samples of 5 fish each and calculate the corresponding CIs, we will find that about 95 out of the 100 CIs contain the population average, and only about 5 do not. Even then, we can’t pinpoint the exact average of all fish in the pond!

Let’s draw some more samples and calculate their CI

**How we can use this concept in production:**

Suppose we made a lot of a product (be it a million tablets, bulbs etc.) and QA needs to qualify that batch. QA takes a random sample and calculates its CI. If this CI contains the population mean (the specification), the lot is passed. Have a look at the following blog for the application part.

Related Blog for the utility of CI

*How to provide a realistic range for a CQAs during product development to avoid unwanted OOS-1.*
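The lot-qualification rule described above can be sketched in code. Everything numeric here is hypothetical: the ten tablet weights, the 500 mg target and the function name `lot_passes` are made up for illustration, and 2.262 is the t-table value for α/2 = 0.025 with df = 9.

```python
import math
import statistics

def lot_passes(sample, target, t_crit):
    """Pass the lot when the CI of the sample mean contains the target value."""
    n = len(sample)
    mean = statistics.mean(sample)
    margin = t_crit * statistics.stdev(sample) / math.sqrt(n)
    return mean - margin <= target <= mean + margin

# Hypothetical tablet weights in mg; the 500 mg target is also an assumed value.
weights = [498, 502, 501, 499, 503, 497, 500, 502, 498, 501]

# 2.262 is the t-table value for alpha/2 = 0.025 and df = 9.
print(lot_passes(weights, target=500, t_crit=2.262))
print(lot_passes(weights, target=505, t_crit=2.262))
```

The first call checks the sample against the assumed specification; the second shows the same sample against a target that lies well outside its interval.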

But always remember this

Suppose the population average is a golden fish that we want to catch from a pond using 100 nets of different sizes (each net is equivalent to a CI). Then about 95 of those 100 nets would capture the golden fish if we consider an error of 5%.

**Common misconception about CI**

The biggest mistake we make while interpreting confidence intervals is thinking that a CI represents the percentage of the data from a given sample that falls between two limits. A related mistake: in the above example, the first CI was found to be 893-963 Gm, and people would assume that there is a 95% chance that the mean of all fish falls within this particular range. This is incorrect! The population mean is a fixed value, so a given interval either contains it or does not; the 95% refers to the proportion of such intervals, calculated from repeated samples, that would contain the population mean.

The following books give an excellent presentation of the confidence interval and the sampling distribution of the mean through cartoons