Fishing and Hypothesis Testing!

Let’s assume that we are in the business of fish farming. We want to make sure that the mean weight of all the fish in the pond is 2 Kg before we sell them in the market. It is not possible to take all of them out of the pond (we don’t even know how many fish there are!) and weigh each one. So what we do is take out a sample of, say, 25 fish and measure their weights. If the mean weight of the 25 fish is close to 2 Kg (more precisely, if it lies within 2 ± 0.15 Kg), we assume that the lot is ready to be sold in the market. This is hypothesis testing: we make a claim about a population parameter (the mean weight of all the fish in the pond) and check it against the mean weight of a sample of 25 fish, with an acceptance criterion fixed in advance (2 ± 0.15 Kg).
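The acceptance rule described above can be sketched in a few lines of Python. This is only an illustration: the function name and the 25 weights are invented here, and only the 2 Kg target and the ± 0.15 Kg band come from the text.

```python
from statistics import mean

TARGET_KG = 2.0      # target mean weight from the text
TOLERANCE_KG = 0.15  # acceptance band from the text (2 ± 0.15 Kg)

def lot_is_ready(sample_weights_kg):
    """Return True if the sample mean falls inside the acceptance band."""
    sample_mean = mean(sample_weights_kg)
    return abs(sample_mean - TARGET_KG) <= TOLERANCE_KG

# Hypothetical weights (Kg) of 25 fish, invented for illustration only.
sample = [1.9, 2.1, 2.0, 2.2, 1.8] * 5
print(lot_is_ready(sample))
```

A fixed band like this is the simplest possible criterion; the steps later in this post replace it with a threshold derived from α.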

In inferential statistics we try to estimate a population parameter by studying a sample drawn from that population. It is not always possible to study the whole population (called a census). Take the example of rice being cooked in a restaurant. The entire lot of rice taken for cooking may be considered the population. Let’s further assume that there is a set protocol for cooking and that, after a predetermined cooking time, the chef hypothesizes that the whole lot has been cooked properly. In order to check his hypothesis, he takes out a few grains of the cooked rice (the sample) and tests his hypothesis by subjecting the sample to some test (pressing the grains between his fingers). Finally, based on the sample’s result, the chef decides whether the whole lot of rice is cooked or not.

Picture1

We executed the following steps in order to estimate the degree of cooking of the entire lot of rice based on the small sample drawn from the pot.

  • We select the population to be studied (the entire lot of rice under cooking)
  • A cooking protocol is followed by the chef and he makes a hypothesis that the whole lot of rice (population) might have been cooked properly → trying to make an estimate about the population parameter
  • Then we draw a sample from the pot → sample
  • We have a criterion in mind for saying that the rice is overcooked, undercooked or properly cooked as per expectation → we set a threshold limit (confidence interval, CI) within which we assume that the rice is properly cooked. Below that it is undercooked, and above that it is overcooked.
  • We test the sample of cooked rice by pressing the grains between our fingers → test statistic
  • The result of the test is compared with the threshold limit set prior to conducting the experiment and, based on this comparison, a decision is taken whether the whole lot of rice is cooked properly or not → inference about the population parameter.

It should be clear from the above discussion that we use hypothesis testing to challenge whether some claim about a population is true or not, using the sample information. For example,

  • The mean height of all the students in the high school in a given state is 160 Cm.
  • The mean salary of the fresh MBA graduates is $65000
  • The mean mileage of a particular brand of car is 15 Km/liter of the gasoline.

All of the above statements are claims about a population parameter that we hypothesize to be true. To test these hypotheses, we take a sample (say the heights of 100 students selected at random from the high school, or the salary data of 25 students selected at random from the MBA class) and subject it to a statistical test, which yields a test statistic (equivalent to pressing the rice between the fingers), to conclude whether the statement made about the population parameter is true or not.

But, before we go any further, it is important to understand that we are using a sample to estimate the population parameter and that the sample is very small relative to the population. As a result, estimating the population parameter from the sample statistic involves some uncertainty, or error. This is represented by the following equation

Population Parameter = Sample statistics ± margin of error

The above equation gives an interval (because of the ± sign) within which the population parameter is expected to be found. This interval is called the confidence interval (CI). It also means that every sample drawn from the population would give a different confidence interval!

Suppose we are trying to estimate the population mean (which is usually unknown) and we draw 100 samples from the population. Those 100 samples would give 100 different CIs, because each sample has its own sample mean and its own “margin of error”. Now the question arises: “would all 100 CIs contain the population parameter?” As stated earlier, because of sampling error we can never estimate the population parameter exactly; there will always be some degree of error in estimating it from the sample statistics. Hence, we should be wise enough to accept an inherent error rate before conducting any hypothesis test. Let’s say that if I collect 100 samples from the population and obtain 100 different CIs, then I accept that about 5 of those CIs might not contain the population parameter. This accepted error rate is called α, and in hypothesis testing it corresponds to the type-I error. α represents the acceptable error, or the level of significance, and it has to be fixed before conducting any hypothesis test. Usually it is a management decision. For more detail see “Is it difficult for you to comprehend confidence interval?”
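The idea that roughly 95 out of 100 intervals capture the population mean can be checked with a small simulation. The sketch below makes several assumptions that are not in the post: the population is normal with a hypothetical mean of 2.0 and standard deviation of 0.3, and the standard deviation is treated as known, so every interval has the same width (a z-interval). With an unknown standard deviation each sample would also get its own margin of error, as described above.

```python
import random
from statistics import NormalDist

def coverage(n_samples=2000, n=25, mu=2.0, sigma=0.3, alpha=0.05, seed=1):
    """Fraction of z-intervals (sigma assumed known) that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    margin = z * sigma / n ** 0.5             # margin of error of each interval
    hits = 0
    for _ in range(n_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / n_samples

# The printed fraction should land close to 1 - alpha = 0.95.
print(coverage())
```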

Based on the above discussion, a 7-Step Process for the Hypothesis Testing is used (note: step-2 is described before step-1, this is done because it helps us in writing the hypothesis correctly)

Step 2: State the Alternate Hypothesis.

This is denoted by Ha and this is the real thing about the population that we want to test. In other words, Ha denotes what we want to prove.

For example:

  • The mean height of all students in the high school is 160 Cm.
    • Ha: μ ≠ 160 Cm
  • The mean salary of the fresh MBA graduates is $65000
    • Ha: μ ≠ $65000
  • The mean mileage of a particular brand of car is greater than 15 Km/liter of the gasoline.
    • Ha: μ > 15 Km/liter

Step 1: State the Null Hypothesis.

This is denoted by Ho. We state the null hypothesis as if we are extremely lazy persons and we don’t want to do any work! For example, if a new gasoline claims an average mileage of greater than 15 Km/liter, then my null hypothesis would be “it is less than or equal to 15 Km/liter”. By doing so, we would not take any pain in testing the new gasoline. We are happy with the status quo!

So the null hypotheses in all of the above cases are

  • The mean height of all students in the high school is 160 Cm.
    • Ho: μ = 160 Cm
  • The mean salary of the fresh MBA graduates is $65000
    • Ho: μ = $65000
  • The mean mileage of a particular brand of car is greater than 15 Km/liter of the gasoline.
    • Ho: μ ≤ 15 Km/liter

Therefore, if you want me to work, first you make an effort to reject the null hypothesis!

Step 3: Set α

As noted earlier, we are using a sample to estimate the population parameter, and the sample is much smaller than the population. Because of this sampling error, any estimate of the population parameter contains some uncertainty. There are two types of error that we can make in hypothesis testing.

Following is the contingency table for the null hypothesis. We can make two errors: rejecting the null hypothesis when it is true (α error), or accepting the null hypothesis when it is false (β error). Hence, the acceptable limit for both errors is decided prior to hypothesis testing.

Picture5

Using z-transformation or t-test, we determine the critical value (threshold) corresponding to error α.

The level of significance α is the probability of rejecting the null hypothesis when it is true. This is like rejecting a good lot of material by mistake.

Picture6

Whereas β is the probability of a type-II error, i.e. the probability of accepting the null hypothesis when it is false. This is like accepting a bad lot of material by mistake.

Let’s understand null and alternate hypothesis graphically

Following are the distributions of Ho and Ha, with means μa and μb respectively, both having a variance of σ².

Picture4

We also have an error term α, representing a threshold value on the distribution of Ho beyond which we would fail to accept Ho (in other words, we would reject Ho).

Now the issue to be resolved is: “how can we say whether the two distributions represented by Ho and Ha are the same or not?”

This is usually done by measuring the extent of overlap between the two distributions, which we do by measuring the distance between their means (of course, taking the inherent variance in the system into account). There are statistical tools like the z-test, t-test, ANOVA etc. which help us conclude whether the two distributions overlap significantly or not.

Step 4: Collect the Data

Step 5: Calculate a test statistic.

The test statistic is a numerical measure computed from the sample data, which is then compared with the critical value to determine whether or not the null hypothesis should be rejected. Another way is to convert the test statistic to a probability value called the p-value, which is then compared with α to conclude whether the hypothesis made about the population is to be rejected or not.
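As an illustration of what “computed from the sample data” means, here is a minimal sketch of a one-sample t statistic: the distance between the sample mean and the hypothesized mean, measured in standard errors. The function name and the mileage readings are invented for this example. Converting the statistic to a p-value needs the t-distribution, which the Python standard library does not provide, so that step is left out here.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample, mu0):
    """One-sample t statistic: (sample mean - hypothesized mean) / standard error."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)
    return (mean(sample) - mu0) / standard_error

# Hypothetical mileage readings (Km/liter), invented for illustration only.
mileage = [15.8, 16.1, 15.2, 16.4, 15.9, 16.0, 15.5, 16.3, 15.7]
t = t_statistic(mileage, mu0=15.0)
print(round(t, 2))
```

The larger the statistic in absolute value, the harder it is to reconcile the sample with the hypothesized mean of 15 Km/liter.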

Also See “p-value, what the hell is it?”

Conceptualizing “Distribution” Will Help You in Understanding Your Problem in a Much Better Way

Is it difficult for you to comprehend confidence interval?

Step 6: Construct Acceptance / Rejection regions.

 

Picture2

The critical value is used as a benchmark or used as a threshold limit to determine whether the test statistic is too extreme to be consistent with the null hypothesis.

Picture8

Step 7: Based on steps 5 and 6, draw a conclusion about H0.

The decision whether to accept or reject the null hypothesis is based on the following criteria:

  • If the absolute value of the test statistic exceeds the absolute value of the critical value, the null hypothesis is rejected.
  • Otherwise, we fail to reject the null hypothesis (loosely speaking, Ho is accepted).
  • The simplest way is to compare α and the p-value. If the p-value is < α, reject Ho.
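The criteria above can be sketched for a two-tailed z-test, where both the critical value and the p-value come from the standard normal distribution. The function name is made up for this example, and a z-test is assumed only because the Python standard library covers the normal distribution; the same comparison applies to a t-test with values from the t-distribution.

```python
from statistics import NormalDist

def z_test_decision(z, alpha=0.05):
    """Two-tailed decision for a z statistic, by critical value and by p-value."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    reject_by_critical = abs(z) > critical
    reject_by_p = p_value < alpha
    return reject_by_critical, reject_by_p, p_value

print(z_test_decision(2.5))   # both criteria reject Ho at alpha = 0.05
print(z_test_decision(1.0))   # both criteria fail to reject Ho
```

The two criteria always agree, because the p-value drops below α exactly when the statistic passes the critical value.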

 

Summary:

The null and alternative hypotheses are competing statements made about the population based on the sample. Either the null hypothesis (H0) is true or the alternative hypothesis (Ha) is true, but not both. Ideally the hypothesis testing procedure should lead to the acceptance of H0 when H0 is true and the rejection of H0 when Ha is true. Unfortunately, the correct conclusions are not always possible because hypothesis tests are based on sample information. Therefore, we must make provision for the possibility of type-I and type-II errors.

 

Is it Difficult for you to Comprehend the Concept of Confidence Interval? Try this out


Abstract:

I found it very difficult to comprehend the concepts of sampling, the sampling distribution of the mean and the confidence interval. These concepts play a very important role in inferential statistics, which is an integral part of the six sigma tools. This is an attempt to simplify the concept of the confidence interval, or simply CI.

In practical life we need to take decisions about a population based on the analysis of a sample drawn from that population. For example, a batch of one million tablets is to be qualified by QA based on a sample of, say, 50 tablets. The confidence interval from the sample gives us an interval estimate that may contain the population parameter, with some degree of error α.

Assume that the population average is a golden fish that we want to catch from a pond using 100 nets of different sizes (each net is equivalent to a CI). Then about 95 of those 100 nets would capture the golden fish if we accept an error of 5%.


Introduction to confidence interval

Let’s assume that we are in the fishing business and we have our own farm where we are raising fish right from the egg stage to the mature fish.

picture1

Traditionally, we will incubate the eggs for 10-15 days to get the larvae and then that would be transferred to the juvenile tank where, they would be fed and monitored for another 3 months. Once they are 3 months old we would again transfer all fish from the juvenile tank to a larger tank where they would be on different diet for another 3 months. After this all fish would be sold to the wholesaler.

Now the problem is that we would like to sell only those fish which weigh around 900-950 Gm, in order to maximize the profit per fish. Any fish lighter than that would be a loss to us, as we would be giving away more fish for a given order from the wholesaler. You could say: let’s weigh each and every fish and sell only those between 925-950 Gm. Yes, that is a solution, but imagine the effort required to take out every fish, weigh it and arrange to keep it in some other tank. This is very difficult as, at any given time, we have 1000-1500 fish in the tank. Is there a way to estimate the average weight of all the fish in the tank?

Yes, Statistics does that. But how?

Statistically it can be done by drawing a random sample, calculating the average of the sample and then using it to estimate the average of the population. But we need to understand a very important point here: there are chances that the sample thus collected might not represent the entire population of fish (this is called sampling error), hence there would be some margin of error in estimating the population average.

Therefore, the average weight of all fish in the pond can be expressed as

= average weight of the sample ± margin of error

Or simply

Average population parameter = sample’s average ± margin of error

Sample Average: This average will vary from sample to sample but, as the sample size increases, the average of the sample will get closer to the population average.

Margin of Error: This margin of error is calculated statistically. Right now it is sufficient to know that the margin of error is directly proportional to the standard deviation of the sample and inversely proportional to the square root of the sample size. It means that if we want a narrow interval for the estimate of the population average, we need a larger sample with small variation.

picture10a

This term is generally known as standard error
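A quick sketch of the standard error, assuming the usual formula s / √n (sample standard deviation divided by the square root of the sample size). The function name is invented here; the value 27.97 is the standard deviation from the worked example later in this post.

```python
from math import sqrt

def standard_error(sample_sd, n):
    """Standard error of the mean: s / sqrt(n)."""
    return sample_sd / sqrt(n)

s = 27.97  # standard deviation from the worked example (Gm)
for n in (5, 20, 80):
    print(n, round(standard_error(s, n), 2))
```

Each fourfold increase in the sample size halves the standard error, which is why a narrower interval quickly becomes expensive in terms of sampling effort.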

There is also a statistical constant involved in the calculation of the margin of error, called the critical t-value. This critical t-value depends on the presumed error α and the degrees of freedom.

What is this error α?

This error, or risk, is denoted by α. Let’s assume, in the above fish example, that we draw 100 samples of 5 fish each and calculate all 100 intervals using the equation above. We would like all these intervals to contain the population average, but there are chances that some intervals might not. This is because we are working with samples, and sampling error is bound to happen. So we accept an error or risk α: out of 100 intervals (calculated from 100 samples), a fraction α would not contain the population average. This α is expressed as a percentage or as a probability. For example, α = 0.05 means that there is a 5% chance (0.05 probability) that any one interval misses the population average, so about 5 out of 100 intervals thus calculated might not contain it.

picture3

This α, is decided prior to starting any experimentation using samples. It’s purely a business decision based on the risk appetite of the company. We can work with α = 0.05 or 0.1 or 0.15. Generally we work with 0.05.

Once we have defined α, we can now discuss t-critical.

The t-critical is the threshold value (like the z-value, discussed earlier) on the t-distribution beyond which the process is no longer considered the same, i.e. if the observations fall in the region < t-critical (in the figure below), we would say that the samples come from the same parent population; otherwise they come from a different population. This t-value is characterized by two parameters, α and the degrees of freedom (df = number of observations - 1, or simply n-1), and its value can be obtained from the t-distribution table.

picture5

When to take α or α/2?

As we have considered a total acceptable error of 5%, there are chances that the calculated interval might miss the population parameter on either side. Hence the total error is distributed equally at both ends of the interval. For example, say the calculated interval is 915 to 935 Gm. The actual population mean might be less than 915 or more than 935. So if the total error we started with is 5%, then 2.5% is placed at each end. This is the case of a two-tailed test, and the error on a single side is represented by α/2. If this had been a one-tailed test, there would be no need to split it, and the error would be represented by α.
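The effect of splitting α between two tails can be seen on the critical value itself. The sketch below uses the standard normal distribution as a stand-in, because the Python standard library has no t-distribution; the splitting logic is the same for t-critical values. The function name is made up for this example.

```python
from statistics import NormalDist

def critical_z(alpha=0.05, two_tailed=True):
    """Critical value of the standard normal for a one- or two-tailed test."""
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

print(round(critical_z(0.05, two_tailed=True), 3))   # 1.96
print(round(critical_z(0.05, two_tailed=False), 3))  # 1.645
```

For the same total α, the two-tailed threshold sits further out because only α/2 is allowed in each tail.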

picture6

Now, we are ready to estimate the population parameter

Average population parameter = sample’s average ± margin of error

Now we can see that, if we want a narrower interval, we need to decrease the “margin of error”, and for that we have to increase the sample size or decrease σ. Since controlling σ is not in our hands (it is a characteristic of the random sample), we can increase the sample size in order to get closer to the population parameter.

Interval calculated above is called as CONFIDENCE INTERVAL or simply CI.

Concept of CI obtained from a group of samples is illustrated below

picture8

Example:

We have no idea about the average weight of all fishes in the pond and we also don’t have any idea about the standard deviation of the weights of all fishes in the tank. In that case we have to estimate the average weight of all fishes in the pond based on the sample’s average and its standard deviation.

Let’s take out the first sample of five fish from the pond and calculate its CI.

picture7

Methodology to be applied

picture2

Average weight of five fish from first sample = 928 Gm

Standard deviation of five fish from first sample = 27.97

Now calculate tα/2,df

As α/2 = 0.025 and df = 5-1 = 4

Therefore from t-distribution table

 

picture4a1

Margin of error

picture12a

Confidence Interval of the first sample

picture13
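The calculation for the first sample can be reproduced with a few lines of Python. The mean, standard deviation and sample size are the ones given above; the critical value 2.776 is the standard t-table entry for α/2 = 0.025 and df = 4, typed in by hand because the Python standard library has no t-distribution.

```python
from math import sqrt

n = 5
sample_mean = 928.0      # Gm, from the first sample
sample_sd = 27.97        # Gm, from the first sample
t_critical = 2.776       # t-table value for alpha/2 = 0.025, df = 4

margin = t_critical * sample_sd / sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin
print(f"margin = {margin:.1f} Gm, CI = {lower:.0f} to {upper:.0f} Gm")
# margin = 34.7 Gm, CI = 893 to 963 Gm
```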

Inference from the above CI

From the first sample, we got a CI of 893 to 963 Gm. It means that, at the 95% confidence level, we estimate the average weight of all fish in the pond to lie between 893 and 963 Gm. But we still can’t pinpoint the exact average of all fish in the pond!

Further, if we draw 99 more samples of 5 fish each and calculate the corresponding CIs, then we will find that about 95 out of the 100 CIs contain the population average and only about 5 do not. But we still can’t pinpoint the exact average of all fish in the pond!

Let’s draw some more samples and calculate their CI

picture9a

How we can use this concept in production:

Suppose we made a lot of a product (be it a million tablets, bulbs etc.) and QA needs to qualify that batch. What QA does is take a random sample and calculate its CI. If this CI contains the specified mean (the specification), the lot is passed. Have a look at the following blogs for the application part.

Related Blog for the utility of CI

How to provide a realistic range for a CQAs during product development to avoid unwanted OOS-1.

How to provide a realistic range for a CQAs during product development to avoid Unwanted OOS-2 Case Study

But always remember this

Assume that the population average is a golden fish that we want to catch from a pond using 100 nets of different sizes (each net is equivalent to a CI). Then about 95 of those 100 nets would capture the golden fish if we accept an error of 5%.

Common Misconception about CI

The biggest mistake we make while interpreting confidence intervals is to think that the CI represents the percentage of the data from a given sample that falls between two limits. A related mistake: in the above example, the first CI was found to be 893-963 Gm, and people would assume that there is a 95% chance that the mean of all fish falls within this particular range. This is incorrect! The 95% describes the procedure: about 95 out of 100 intervals built this way contain the population mean, while any single interval either contains it or does not.

The following books give an excellent presentation of the confidence interval and the sampling distribution of the mean through cartoons

 

How to provide a realistic range for a CQAs during product development to avoid unwanted OOS-1.

 picture61
It is very important to understand the concept of CI/PI/TI before we can understand the reasons for OOS.

Let’s start from following situation

You have to reach the office before 9:30 AM. Now tell me how confident are you about reaching the office exactly between

(A) 9:10 to 9:15 (hmm…, such a narrow range, I am ~90% confident)

(B) 9:05 to 9:20 (a-haa.., now I am 95% confident)

(C) 9:00 to 9:25 (this is very easy, I am almost 99% confident)

The point to be noted here is that your confidence increases as the time interval widens (remember this for the rest of the discussion).

More important thing is that, it is difficult to estimate the exact arrival time, but we can say with some confidence that my arrival time would be between some time interval.

Say my average arrival time for the last five days (assuming all other factors remain constant) was 9:17 AM; then I can say with a certain confidence (say 95%) that my arrival time would be given by

Average arrival time (over, say, 5 days) ± margin of error

The confidence we are showing is called as confidence level and the interval estimated by above equation at a given confidence level is called as CONFIDENCE INTERVAL (CI). This confidence interval may or may not contain my mean arrival time.

Now let’s move to a manufacturing scenario

We all are aware of the diagram given below, the critical quality attribute (CQA or y) of any process is affected by many inputs like critical material attribute (CMA), critical process parameter (CPP) and other uncontrollable factors.

Picture21

Since CQAs are affected by CPPs and CMAs, it is said that a CQA or any output Y is a function of X (X = CPPs/CMAs).

Picture23

The relationship between Y and X is given by following regression equation

Picture33

The following points are worth mentioning:

  1. The value of Y depends on the value of X, which means that if there is a deviation in X then there will be a corresponding deviation in Y. E.g. if the level of an impurity (y) is influenced by the temperature (x), then a deviation in the impurity level can be attributed to a change in temperature.
  2. If you hold X constant at some value and perform the process many times (say 100), the 100 products (Y) would not all be of the same quality, because of the inherent variation/noise in the system, which in turn comes from other uncontrollable factors. That’s why we have an error term in our regression equation. If the error term becomes zero, the relationship is described perfectly by a straight line y = mx + C. In this condition the regression line gives the expected value of Y, represented by E(Y) = b0+b1X1.

Picture34

As we have seen that there will be a variation in Y even if you hold X constant. Hence, the term ‘expected value of Y’ represents the average value of Y for a given value of X.

picture2
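The meaning of “expected value of Y for a given X” can be illustrated with a small simulation. Everything numeric in this sketch is hypothetical: the intercept, the slope and the size of the noise are invented, and normally distributed noise is an assumption made only for the illustration.

```python
import random
from statistics import mean

B0, B1 = 2.0, 0.5    # hypothetical intercept and slope
NOISE_SD = 0.2       # hypothetical spread caused by the uncontrollable factors

def run_process(x, rng):
    """One run of the process at setting x: straight line plus random noise."""
    return B0 + B1 * x + rng.gauss(0, NOISE_SD)

rng = random.Random(0)
x_fixed = 10.0
runs = [run_process(x_fixed, rng) for _ in range(100)]

expected_y = B0 + B1 * x_fixed   # E(Y) at x = 10, here 7.0
print(min(runs), max(runs))      # individual runs differ from each other
print(mean(runs), expected_y)    # their average sits close to E(Y)
```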

It’s fine that for a given value of X there will be a range of Y values because of inherent variation/noise in the process, and that the average of these Y values is called the expected value of Y for that value of X. But tell me, how is this going to help me in investigating OOS/OOT?

Let’s come to the point, assume that we have manufactured one million tablets of 500 mg strength with a mixing time of 15 minutes (= x), Now I want to know the exact mean strength of all the tablets in the entire batch?

In statistical terms,

Picture24

It’s not possible to estimate the exact mean strength of all the tablets in the entire batch as it would require destructive analysis of the entire one million tablets.

Then, what is the way out? How we can estimate the mean strength of the entire batch?

The best thing we can do is to take out a sample and analyze it and, based on the sample mean strength, make an intelligent guess about the mean strength of the entire batch … but it would be with some error, as we are using a sample for the estimation. This error is called sampling error. The interval obtained from the sample data, which may contain the population mean, is given by

Sample mean ± margin of error = confidence interval (CI)

The term “Sample mean ± margin of error” is called the confidence interval, which may or may not contain the population mean.

picture4

It is unlikely that two samples from a given population will yield identical confidence intervals (CI), it means that every sample would provide a different interval but, if we repeat the sampling many times and calculate all CI, then a certain percentage of the resulting confidence intervals would contain the unknown population parameter. The percentage of these CI that contain the parameter is called as confidence level of the interval. The interval estimated by the sample is called as confidence interval (CI). This CI is for a given value of X. This CI will change, with change in X.

Picture25

Note: Don’t be afraid of the formulas, we will be covering them later

If 100 samples are withdrawn then we can have following confidence level

A 90% confidence level would indicate that the confidence interval (CI) generated by 90 samples (out of 100) would contain the unknown population parameter.

A 95% confidence level indicates that the CI estimated by 95 samples (out of 100) would contain the unknown population parameter.

Picture30

To summarize, we can estimate the population mean by using confidence interval with certain degree of confidence level.

It’s fine that the CI gives me a range which, with 95% or 99% confidence, contains the mean strength of the entire batch. But I have an additional issue: I am also interested in knowing how many tablets (out of one million) would be bracketed by this interval or any other interval, and how many would fall outside it. This will help me in determining the failure rate once we compare this interval with the customer’s specifications.

More precisely we want to know the interval which would contain the 99% of the tablets with desired strength and how confident we are about this interval that it will contain 99% of the population?

picture5

If we can get this interval, we can compare it with the customer’s specification which in turn would tell me something about the process capability. How this can be resolved?

Let’s understand the problem once again

If we understood the issue correctly, then we want to estimate an interval (with required characteristics) based on the sample data that will cover say 99% or 95% of the population and then we want to overlap this interval with the customer’s specification to check the capability of the process. This is represented by scenario-1 and scenario-2 (ideal) in the figure given below.

picture6

Having understood the issue, the solution lies in calculating another interval known as Tolerance Interval for the population with a desired characteristics (Y) for a given value of process parameter X.

Tolerance Interval: this interval captures the values of a specified proportion of all future observations of the response variable for a particular combination of the values of the predictor variables with some high confidence level.

We have seen that CI width is entirely due to the sampling error. As the sample size increases and approaches the entire population size, the width of the confidence interval approaches zero. This is because the term “margin of error” would become zero.

In contrast, the width of a tolerance interval is due to both sampling error and variance in the population. As the sample size approaches the entire population, the sampling error diminishes and the estimated percentiles approach the true population percentiles.
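The contrast between the two widths can be sketched numerically. This is not a tolerance interval calculation: the code assumes a normal population with a known, hypothetical standard deviation, and compares the width of a confidence interval for the mean with the width of the interval that holds 99% of the population, which is the limit a tolerance interval approaches as the sample grows. The function names are invented for this example.

```python
from statistics import NormalDist

def ci_width(sigma, n, confidence=0.95):
    """Width of a z-based confidence interval for the mean (sigma assumed known)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * sigma / n ** 0.5

def coverage_width(sigma, proportion=0.99):
    """Width of the interval holding the given proportion of a normal population."""
    z = NormalDist().inv_cdf(0.5 + proportion / 2)
    return 2 * z * sigma

sigma = 2.0  # hypothetical standard deviation of tablet strength (mg)
for n in (10, 1000, 100000):
    print(n, round(ci_width(sigma, n), 3), round(coverage_width(sigma), 3))
```

The first width keeps shrinking as n grows, while the second stays put, because it reflects the spread of the tablets themselves and not the sampling error.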

e.g. A 95% tolerance interval that captures 98 % of the population of a future batch of the tablets at a mixing time of 15 minutes is 485.221 to 505.579 (this is Y).

Now, if the customer’s specification for the tablet strength is 497 to 502, then we are in trouble (scenario-1 in the above figure), and we need to work on the process (increase the mixing time) to reduce the variability.

Let’s assume that we increased the mixing time to 35 minutes and as a result, 95% tolerance interval which captures 99% of the population is given by 498.598 to 501.902. Now we are comfortable with the customer’s specification (scenario-2 in above figure). Hence, we need to blend the mixture for 35 minutes before compressing it into tablets.
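The comparison between a tolerance interval and the customer’s specification is a simple containment check. The sketch below uses the numbers quoted above; the function and variable names are invented for this example.

```python
def within_spec(interval, spec):
    """True if the whole interval lies inside the specification limits."""
    low, high = interval
    spec_low, spec_high = spec
    return spec_low <= low and high <= spec_high

spec = (497.0, 502.0)            # customer's specification from the text
ti_15_min = (485.221, 505.579)   # tolerance interval at 15 minutes mixing time
ti_35_min = (498.598, 501.902)   # tolerance interval at 35 minutes mixing time

print(within_spec(ti_15_min, spec))  # False: scenario-1
print(within_spec(ti_35_min, spec))  # True: scenario-2
```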

We need to be careful while reading a tolerance interval, as it contains two percentage terms. The first one, 95%, is the confidence level, and the second one, i.e. 98%, is the proportion of the total population with the required quality attributes that we want to bracket by the tolerance interval for a constant mixing time of 15 minutes.

To summarize: in order to generate tolerance intervals, we must specify both the proportion of the population to be covered and a confidence level. The confidence level is the likelihood that the interval actually covers the proportion.

This is what we wanted during the product development.

picture13

Let’s calculate the 95% CI using excel sheet

In the next post we will try to clarify the confusion that we have created in this post with a real-time example. So, keep visiting us

Related posts:

Why We Have Out of Specifications (OOS) and Out of Trend (OOT) Batches?

Proposal for Six Sigma Way of Investigating OOT & OOS in Pharmaceutical Products-1

Proposal for Six Sigma Way of Investigating OOT & OOS in Pharmaceutical Products-2


Note on Regression Equation:

The regression line represents the expected value of y = E(yp) for a given value of x = xn. Hence, the point estimate of y for a given value of x = xn is given by

Picture37

xn = given value of x

yn = Value of output y corresponding to xn

E(yp) = mean or expected value of y for given value of x = xn, it denotes the unknown mean value of all y’s where x = xn.

Theoretically, Picture38 is the point estimate of E(yp), hence the two should be equal. But in general this seldom happens. If we want to measure how close the true mean value E(yp) is to the point estimator Picture38, then we need to measure the standard deviation of Picture38 for the given value xp.

Picture44

Confidence interval for the expected value E(yp) is given by

Picture42

Why do we need this equation right now? (I don’t want you to get terrified!) If you focus on the numerator of the standard deviation formula, one important observation is that if xp is equal to the mean of the x values, then the standard deviation is at its minimum, and as you move away from the mean, the standard deviation goes on increasing. It implies that the CI would be narrowest at Picture43 and would widen as you move away from the mean.

Hence, the width of the CI depends on the value of CPP (x)

Picture45
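The widening of the CI away from the mean of x can be sketched with the textbook formula for simple linear regression, s × √(1/n + (xp − x̄)² / Σ(xi − x̄)²). That this is the formula shown in the pictures above is an assumption, and the mixing times and the residual standard deviation below are invented for illustration.

```python
from math import sqrt

def se_of_fitted_mean(x_values, residual_sd, x_p):
    """Textbook standard deviation of the fitted mean at x_p (simple linear regression)."""
    n = len(x_values)
    x_bar = sum(x_values) / n
    sxx = sum((x - x_bar) ** 2 for x in x_values)
    return residual_sd * sqrt(1 / n + (x_p - x_bar) ** 2 / sxx)

# Hypothetical mixing times (minutes) and residual standard deviation.
mixing_times = [5, 10, 15, 20, 25, 30, 35]
s = 1.5
for x_p in (5, 20, 35):
    print(x_p, round(se_of_fitted_mean(mixing_times, s, x_p), 3))
```

The standard deviation is smallest at the mean mixing time (20 minutes here) and grows symmetrically on both sides, so the CI around the regression line is narrowest in the middle of the studied range.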