Is it Difficult for you to Comprehend the Concept of Confidence Interval? Try this out

    Amrendra Roy


    Abstract:

    I found it very difficult to comprehend the concepts of sampling, the sampling distribution of the mean, and the confidence interval. These concepts play a very important role in inferential statistics, which is an integral part of the Six Sigma toolkit. This is an attempt to simplify the concept of the confidence interval, or simply CI.

    In practice, we often need to make decisions about a population based on the analysis of a sample drawn from that population. For example, a batch of one million tablets is to be qualified by QA based on a sample of, say, 50 tablets. The confidence interval calculated from the sample gives us an interval estimate that may contain the population parameter, with some accepted risk of error α.

    Suppose the population average is a golden fish that we want to catch from a pond using 100 nets of different sizes (each net is equivalent to a CI). If we accept an error of 5%, then we expect about 95 of those 100 nets to capture the golden fish.


    Introduction to confidence interval

    Let’s assume that we are in the fishing business and we have our own farm where we are raising fish right from the egg stage to the mature fish.

    picture1

    Traditionally, we incubate the eggs for 10-15 days to get the larvae, which are then transferred to the juvenile tank, where they are fed and monitored for another 3 months. Once they are 3 months old, we transfer all the fish from the juvenile tank to a larger tank, where they are kept on a different diet for another 3 months. After this, all the fish are sold to the wholesaler.

    Now the problem is that we would like to sell only those fish that weigh around 900-950 g, in order to maximize the profit per fish. Any fish lighter than that would be a loss to us, as we would be giving away more fish for a given order from the wholesaler. You could say, let’s weigh each and every fish and sell only those in the target range. Yes, that can be a solution, but imagine the effort required to take out every fish, weigh it, and arrange to keep it in some other tank. This is very difficult because, at any given time, we have 1000-1500 fish in the tank. Is there a way to estimate the average weight of all the fish in the tank?

    Yes, Statistics does that. But how?

    Statistically, it can be done by drawing a random sample, calculating the average of the sample, and then using it to estimate the average of the population. But we need to understand a very important point here: there is a chance that the sample collected does not represent the entire population of fish (this is called sampling error). Hence, there will be some margin of error in estimating the population average.

    Therefore, the average weight of all fish in the pond can be expressed as

    = average weight of the sample ± margin of error

    Or simply

    Population average = sample average ± margin of error

    Sample Average: This average will vary from sample to sample but, as the sample size increases, the average of the sample will tend to be closer to the population average.

    Margin of Error: This margin of error is calculated statistically. Right now it is sufficient to know that the margin of error is directly proportional to the standard deviation of the sample and inversely proportional to the square root of the sample size. It means that if we want a narrow interval for the estimate of the population average, we need to take a larger sample with small variation.

    picture10a

    This term, the sample standard deviation divided by the square root of the sample size (s/√n), is generally known as the standard error.

    There is also a statistical constant involved in the calculation of the margin of error, called the critical t-value. This critical t-value depends on the presumed error α and the degrees of freedom.

    What is this error α?

    This error, or risk, is denoted by α. Let’s assume that, in the fish example above, we draw 100 samples of 5 fish each and calculate all 100 intervals as sample average ± margin of error. We would like all of these intervals to contain the population average, but there is a chance that some of them will not. This is because we are working with samples, and sampling error is bound to happen. So we accept an error or risk α: a fraction α of the intervals calculated from repeated samples is expected to miss the population average. This α is expressed as a percentage or as a probability. For example, α = 0.05 means that the procedure has a 5% chance (0.05 probability) of producing an interval that misses the population average, so about 5 out of 100 intervals calculated this way are expected not to contain it.
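
    The idea of repeated sampling can be tried out with a small simulation. This is only a sketch under stated assumptions: the population mean (925 g) and standard deviation (28 g) are hypothetical values chosen for illustration, the weights are assumed to be normally distributed, and the constant 2.776 is the two-tailed critical t-value for α = 0.05 with 4 degrees of freedom, as read from a t-table.

```python
import math
import random
import statistics

# Two-tailed critical t for alpha = 0.05 and df = 4 (t-table value).
# It is only valid for samples of 5 observations.
T_CRIT_DF4 = 2.776


def count_covering_intervals(n_samples, true_mean=925.0, true_sd=28.0,
                             sample_size=5, t_crit=T_CRIT_DF4, seed=0):
    """Count how many sample-based intervals contain the true population mean.

    true_mean and true_sd are hypothetical population values, used only
    so that the simulation has a known "golden fish" to look for.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # Draw one random sample of fish weights from the assumed population.
        sample = [rng.gauss(true_mean, true_sd) for _ in range(sample_size)]
        mean = statistics.mean(sample)
        std_err = statistics.stdev(sample) / math.sqrt(sample_size)
        margin = t_crit * std_err
        # Does this interval capture the population mean?
        if mean - margin <= true_mean <= mean + margin:
            hits += 1
    return hits


# Number of intervals, out of 100, that contain the population mean.
print(count_covering_intervals(100))
```

    With only 100 samples the count will usually be close to 95 but not exactly 95; with many more samples, the fraction of covering intervals settles near 0.95.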

    picture3

    This α is decided before starting any experimentation using samples. It’s purely a business decision based on the risk appetite of the company. We can work with α = 0.05, 0.1, or 0.15. Generally we work with 0.05.

    Once we have defined α, we can now discuss t-critical.

    The t-critical is the threshold value (like the z-value, discussed earlier) on the t-distribution beyond which the process is no longer considered the same, i.e. if the observations fall in the region below t-critical (in the figure below), then we would say that the samples come from the same parent population; otherwise, they come from a different population. This t-value is characterized by two parameters, α and the degrees of freedom (df = number of observations - 1, or simply n - 1), and its value can be obtained from the t-distribution table.

    picture5

    When to take α or α/2?

    We have accepted a total error of 5%. Depending on the scenario, the calculated interval might miss the population parameter on either side. Hence the total error is distributed equally between both ends of the interval. For example, say the calculated interval is 915 to 935 g. There is a chance that the actual population mean is less than 915 g or more than 935 g. So if the total error we started with is 5%, then 2.5% is assigned to each end. This is the case of a two-tailed test, and the error on a single side is represented by α/2. If this had been a one-tailed test, there would be no need to split the error, and it would be represented by α.
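
    The choice between α and α/2 simply changes which column of the t-table is read. The sketch below is an illustration only: the function name `critical_t_df4` is hypothetical, and the small table holds standard t-table values for 4 degrees of freedom only.

```python
# Critical t-values for df = 4, keyed by the area in ONE tail
# (standard t-table values; other df values are not covered here).
T_TABLE_DF4 = {0.10: 1.533, 0.05: 2.132, 0.025: 2.776}


def critical_t_df4(alpha, two_tailed=True):
    """Look up the critical t-value for df = 4.

    For a two-tailed case the total error alpha is split equally,
    so the table is read at alpha / 2; for a one-tailed case it is
    read at alpha itself.
    """
    tail_area = alpha / 2 if two_tailed else alpha
    return T_TABLE_DF4[tail_area]


print(critical_t_df4(0.05))                    # two-tailed, read at 0.025
print(critical_t_df4(0.05, two_tailed=False))  # one-tailed, read at 0.05
```

    For the same α, the two-tailed critical value is larger than the one-tailed one, because each tail is allowed only half of the total error.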

    picture6

    Now we are ready to estimate the population parameter

    Population average = sample average ± margin of error
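
    Written out with the terms introduced so far, the interval takes the standard form of a t-based confidence interval for a mean. The symbols used here are notation assumed for this sketch rather than taken from the original figures: x̄ is the sample average, s the sample standard deviation, n the sample size, and df = n - 1.

```latex
\text{margin of error} = t_{\alpha/2,\,df} \times \frac{s}{\sqrt{n}},
\qquad df = n - 1

\text{CI} = \bar{x} \pm t_{\alpha/2,\,df} \times \frac{s}{\sqrt{n}}
```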

    Now we can see that, if we want a narrower interval, we need to decrease the “margin of error” term, and for that we have to increase the sample size or decrease the standard deviation σ. Since controlling σ is not in our hands (it is a characteristic of the random sample), we can increase the sample size in order to get closer to the population parameter.

    The interval calculated above is called the CONFIDENCE INTERVAL, or simply CI.

    The concept of CIs obtained from a group of samples is illustrated below
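
    The effect of sample size can be checked with a few lines of arithmetic. This is a rough sketch, assuming the standard deviation stays at 27.97 g (the value from the first fish sample used in the example) for every sample size, which would not be exactly true in practice. The critical t-values are standard two-tailed t-table values for α = 0.05.

```python
import math

# Two-tailed critical t-values for alpha = 0.05 (t-table values),
# keyed by sample size n, i.e. df = n - 1 = 4, 9 and 29.
T_CRIT_BY_N = {5: 2.776, 10: 2.262, 30: 2.045}


def margin_of_error(std_dev, n):
    """Margin of error = critical t * standard deviation / sqrt(n)."""
    return T_CRIT_BY_N[n] * std_dev / math.sqrt(n)


# Assumption: the same standard deviation is used for every n.
for n in sorted(T_CRIT_BY_N):
    print(n, round(margin_of_error(27.97, n), 1))
```

    Under this assumption the margin of error shrinks from roughly 35 g with 5 fish to roughly 10 g with 30 fish, both because √n grows and because the critical t-value itself gets smaller as the degrees of freedom increase.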

    picture8

    Example:

    We have no idea about the average weight of all the fish in the pond, and we also have no idea about the standard deviation of their weights. In that case, we have to estimate the average weight of all the fish in the pond based on the sample’s average and its standard deviation.

    Let’s take out the first sample of five fish from the pond and calculate its CI.

    picture7

    Methodology to be applied

    picture2

    Average weight of five fish from first sample = 928 g

    Standard deviation of five fish from first sample = 27.97 g

    Now calculate t(α/2, df)

    As α/2 = 0.025 and df = 5 - 1 = 4

    Therefore, from the t-distribution table, t(0.025, 4) = 2.776

     

    picture4a1

    Margin of error

    picture12a

    Confidence Interval of the first sample

    picture13
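
    The calculation for the first sample can be reproduced in a few lines. This is a minimal sketch using the numbers stated above (average 928 g, standard deviation 27.97 g, n = 5); the critical t-value 2.776 is the t-table value for α/2 = 0.025 and df = 4, and the function name `confidence_interval` is only illustrative.

```python
import math


def confidence_interval(mean, std_dev, n, t_crit):
    """Return (lower, upper) limits: mean ± t_crit * std_dev / sqrt(n)."""
    margin = t_crit * std_dev / math.sqrt(n)
    return mean - margin, mean + margin


# Values from the first sample of five fish; 2.776 is read from a t-table.
lower, upper = confidence_interval(mean=928, std_dev=27.97, n=5, t_crit=2.776)
print(f"95% CI: {lower:.1f} g to {upper:.1f} g")
```

    This gives a margin of error of about 34.7 g, i.e. an interval of roughly 893 g to 963 g.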

    Inference from the above CI

    From the first sample, we got a CI of about 893 to 963 g. It means we can say, with 95% confidence, that the average weight of all the fish in the pond lies between 893 and 963 g.

    Further, if we draw 99 more samples of 5 fish each and calculate the corresponding CIs, then we expect about 95 out of the 100 CIs to contain the population average, and about 5 not to contain it. But still, we can’t pinpoint the exact average of all the fish in the pond!

    Let’s draw some more samples and calculate their CI

    picture9a

    How we can use this concept in production:

    Suppose we made a lot of a product (be it a million tablets, bulbs, etc.) and QA needs to qualify that batch. QA takes a random sample and calculates its CI. If this CI contains the target population mean (the specification), the lot is passed. Have a look at the following blogs for the application part.
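
    As a rough illustration of this check, the sketch below tests whether the CI of a sample contains a target value. Everything specific in it is an assumption made for the example: the tablet weights are made-up numbers, the 500 mg target is hypothetical, the function name `ci_contains_target` is illustrative, and the critical t-value 2.776 applies only because the sample has 5 observations (df = 4, α = 0.05, two-tailed).

```python
import math
import statistics


def ci_contains_target(measurements, target, t_crit):
    """Return True if the CI of the sample mean contains the target value."""
    n = len(measurements)
    mean = statistics.mean(measurements)
    margin = t_crit * statistics.stdev(measurements) / math.sqrt(n)
    return mean - margin <= target <= mean + margin


# Made-up tablet weights in mg, for illustration only.
weights = [498.2, 501.5, 499.8, 502.1, 500.4]

# Hypothetical specification target of 500 mg; t-table value for df = 4.
print(ci_contains_target(weights, target=500.0, t_crit=2.776))
```

    A real qualification would use the sample size and acceptance criteria defined in the company’s own procedure; the point here is only the mechanics of comparing a CI with a target.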

    Related Blog for the utility of CI

    How to provide a realistic range for a CQAs during product development to avoid unwanted OOS-1.

    How to provide a realistic range for a CQAs during product development to avoid Unwanted OOS-2 Case Study

    But always remember this

    Suppose the population average is a golden fish that we want to catch from a pond using 100 nets of different sizes (each net is equivalent to a CI). If we accept an error of 5%, then we expect about 95 of those 100 nets to capture the golden fish.

    Common Misconception about CI

    The biggest mistake we make while interpreting confidence intervals is to think that a CI represents the percentage of the data from a given sample that falls between two limits. For example, the first CI above was found to be 893-963 g. People make the mistake of assuming that there is a 95% chance that the mean of all the fish falls within this particular range. This is incorrect! The population mean is a fixed value; the 95% describes how often intervals calculated this way, from repeated samples, would contain it.

    The following books give an excellent presentation of the confidence interval and the sampling distribution of the mean through cartoons

     
