## Fishing and Hypothesis Testing!

Let’s assume that we are in the business of fish farming. We want to make sure that the mean weight of all the fish in the pond is 2 Kg before we sell them in the market. It is not possible to take all of them out of the pond (we don’t even know how many fish there are!) and weigh each one. So we take out a sample of, say, 25 fish and measure their weights. If the mean weight of the 25 fish is close to 2 Kg (more precisely, if it lies within 2 ± 0.15 Kg), we assume that the lot is ready to be sold in the market. This is hypothesis testing: we are drawing a conclusion about the population parameter (the mean weight of all fish in the pond) from the mean weight of a sample of 25 fish, with an acceptance criterion in mind (2 ± 0.15 Kg).
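
The acceptance rule above can be sketched in a few lines of Python. The 25 weights are made-up numbers for illustration, and `lot_ready` is a hypothetical helper name, not part of any library.

```python
from statistics import mean

def lot_ready(weights_kg, target=2.0, tolerance=0.15):
    """Apply the rule 'sample mean within target ± tolerance'."""
    sample_mean = mean(weights_kg)
    return sample_mean, abs(sample_mean - target) <= tolerance

# Hypothetical weights (Kg) of 25 fish taken out of the pond.
sample = [1.9, 2.1, 2.0, 2.2, 1.8, 2.05, 1.95, 2.1, 2.0, 1.85,
          2.15, 2.0, 1.9, 2.1, 2.05, 1.95, 2.0, 2.2, 1.8, 2.0,
          2.1, 1.9, 2.0, 2.05, 1.95]

sample_mean, ready = lot_ready(sample)
print(f"sample mean = {sample_mean:.3f} Kg, ready to sell: {ready}")
```

The function only looks at the sample mean; how wide the tolerance should be for a given sample size is exactly what the rest of this article is about.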

In inferential statistics we try to estimate a population parameter by studying a sample drawn from that population. It is not always possible to study the whole population (which is called a census). Take the example of rice being cooked in a restaurant. The entire lot of rice taken for cooking may be considered the population. Let’s further assume that there is a set protocol for cooking and that, after a predetermined cooking time, the chef hypothesizes that the whole lot has been cooked properly. In order to check his hypothesis, he takes out a few grains of the cooked rice (the sample) and tests them (by pressing them between his fingers). Finally, based on the sample’s result, the chef decides whether the whole lot of rice is cooked or not.

We have executed the following steps in order to make an estimate about the degree of cooking of the entire lot of rice based on the small sample drawn from the pot.

• We select the population to be studied (the entire lot of rice being cooked)
• A cooking protocol is followed by the chef and he makes a hypothesis that the whole lot of rice (population) has been cooked properly → a statement about the population parameter
• Then we draw a sample from the pot → sample
• We have a criterion in mind for saying that the rice is overcooked, undercooked or properly cooked → we set a threshold limit (confidence interval, CI) within which we assume that the rice is properly cooked. Below it, the rice is undercooked; above it, the rice is overcooked.
• We test the sample of cooked rice by pressing the grains between our fingers → test statistic
• The result of the test is compared with the threshold limit set before the experiment and, based on this comparison, we decide whether the whole lot of rice is cooked properly or not → inference about the population parameter.

It should be reasonably clear from the above discussion that we use hypothesis testing to check, using sample information, whether some claim about a population is true or not. For example,

• The mean height of all the students in the high school in a given state is 160 Cm.
• The mean salary of the fresh MBA graduates is \$65000
• The mean mileage of a particular brand of car is 15 Km/liter of the gasoline.

Each of the above statements is a claim about a population parameter that we hypothesize to be true. To test such a hypothesis, we take a sample (say, the heights of 100 students selected at random from the high school, or the salaries of 25 students selected at random from the MBA class) and subject it to a statistical test that produces a test statistic (equivalent to pressing the rice between the fingers), in order to conclude whether the statement made about the population parameter is true or not.

But, before we go any further, it is important to understand that we are using a sample to estimate the population parameter, and the sample is very small relative to the population. As a result, estimating the population parameter from the sample statistic involves some uncertainty or error. This is represented by the following equation

Population Parameter = Sample statistics ± margin of error

The above equation gives an interval (because of the ± sign) within which the population parameter is expected to be found. This interval is called the confidence interval (CI). This means every sample drawn from the population would give a different confidence interval!
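
As a minimal sketch of "sample statistic ± margin of error" for a mean, the snippet below builds a z-based interval. It assumes that the population standard deviation `sigma` is known and that the data are roughly normal; both are simplifying assumptions, and the function name and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_confidence_interval(sample, sigma, confidence=0.95):
    """Sample mean ± z * sigma / sqrt(n), assuming sigma is known."""
    n = len(sample)
    # z multiplier that leaves (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    centre = mean(sample)
    return centre - margin, centre + margin

# Hypothetical fish weights (Kg) and an assumed sigma of 0.2 Kg.
low, high = z_confidence_interval([1.9, 2.1, 2.0, 2.2, 1.8], sigma=0.2)
print(f"95% CI for the mean: {low:.3f} to {high:.3f} Kg")
```

When `sigma` is not known it has to be estimated from the sample, and a t multiplier is normally used instead of z; that variant is not shown here.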

Suppose we are trying to estimate the population mean (which is usually unknown) and we draw 100 samples from the population. Those 100 samples would give 100 different CIs, because each sample has its own sample mean and its own “margin of error”. The question then arises: “would all 100 CIs thus obtained contain the population parameter?” As stated earlier, because of sampling error we can never estimate the population parameter exactly; there will always be some degree of error when we estimate it from a sample statistic. Hence, we should be wise enough to accept an inherent error rate before conducting any hypothesis test. Let’s say that if I collect 100 samples from the population and obtain 100 different CIs, I accept that about 5 of them might not contain the population parameter. This accepted error rate is called α, or the type-I error rate. It is also known as the level of significance, and it has to be fixed before conducting any hypothesis test. Usually it is a management decision. For more detail see “Is it difficult for you to comprehend confidence interval?”
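
The "about 5 out of 100 intervals miss" idea can be checked with a small simulation. This is a sketch under stated assumptions: the population is normal with a known, made-up mean and standard deviation, so the margin of error is the same for every sample and only the centre of the interval moves. With an estimated standard deviation the margin would change from sample to sample as well.

```python
import random
from math import sqrt
from statistics import NormalDist

def coverage(true_mean=2.0, sigma=0.3, n=25, alpha=0.05,
             trials=10_000, seed=1):
    """Fraction of simulated CIs that contain the true population mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n and compute its mean.
        xbar = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        if xbar - margin <= true_mean <= xbar + margin:
            hits += 1
    return hits / trials

observed = coverage()
print(f"share of intervals containing the true mean: {observed:.3f}")
```

With α = 0.05 the share should come out close to 0.95, i.e. roughly 5 intervals in every 100 fail to capture the population mean.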

Based on the above discussion, a 7-Step Process for the Hypothesis Testing is used (note: step-2 is described before step-1, this is done because it helps us in writing the hypothesis correctly)

Step 2: State the Alternate Hypothesis.

This is denoted by Ha and this is the real thing about the population that we want to test. In other words, Ha denotes what we want to prove.

For example:

• The mean height of all students in the high school is 160 Cm.
• Ha: μ ≠ 160 Cm
• The mean salary of the fresh MBA graduates is \$65000
• Ha: μ ≠ \$65000
• The mean mileage of a particular brand of car is greater than 15 Km/liter of the gasoline.
• Ha: μ > 15 Km/liter

Step 1: State the Null Hypothesis.

This is denoted by Ho. We state the null hypothesis as if we were extremely lazy people who don’t want to do any work! For example, if a new gasoline claims to give an average mileage greater than 15 Km/liter, then my null hypothesis would be “it is less than or equal to 15 Km/liter”. By doing so, we would not take any pains in testing the new gasoline. We are happy with the status quo!

So the null hypotheses for the above cases are

• The mean height of all students in the high school is 160 Cm.
• Ho: μ = 160 Cm
• The mean salary of the fresh MBA graduates is \$65000
• Ho: μ = \$65000
• The mean mileage of a particular brand of car is greater than 15 Km/liter of the gasoline.
• Ho: μ ≤ 15 Km/liter

Therefore, if you want me to work, first you make an effort to reject the null hypothesis!

Step 3: Set α

As noted earlier, we are using a sample to estimate the population parameter, and the sample is much smaller than the population. Because of this sampling error, any conclusion about the population parameter carries some uncertainty. There are two types of error that we can make in hypothesis testing.

We can make two errors with respect to the null hypothesis: first, rejecting the null hypothesis when it is true (α error), and second, accepting the null hypothesis when it is false (β error). Hence, the acceptable limit for both errors is decided prior to hypothesis testing.

Using z-transformation or t-test, we determine the critical value (threshold) corresponding to error α.

The level of significance α is the probability of rejecting the null hypothesis when it is true. This is like rejecting a good lot of material by mistake.

Whereas β is called the type-II error, and it is the probability of accepting the null hypothesis when it is false. This is like accepting a bad lot of material by mistake.

Let’s understand the null and alternate hypotheses graphically. Consider the distributions of Ho and Ha with means μa and μb respectively, both having a variance of σ². We also have an error term α, representing a threshold value on the distribution of Ho beyond which we would fail to accept Ho (in other words, we would reject Ho). Now the issue to be resolved is “how can we say whether the two distributions represented by Ho and Ha are the same or not?” It is usually done by measuring the extent of overlap between the two distributions. This we do by measuring the distance between the means of the two distributions (of course, we need to consider the inherent variance in the system). There are statistical tools like the z-test, t-test, ANOVA etc. which help us in concluding whether the two distributions are significantly overlapping or not.
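
To make the picture of two overlapping distributions concrete, here is a rough sketch that computes the critical value set by α on the Ho distribution and the resulting β under one specific Ha. It assumes a one-sided test on a mean with known σ; the mileage numbers and the names `mu0` and `mu_a` are illustrative assumptions, not values from this article.

```python
from math import sqrt
from statistics import NormalDist

def critical_value_and_beta(mu0, mu_a, sigma, n, alpha=0.05):
    """One-sided test of Ho: mean = mu0 against Ha: mean = mu_a (> mu0)."""
    se = sigma / sqrt(n)                      # spread of the sample mean
    # Threshold on the Ho distribution with area alpha to its right.
    critical = NormalDist(mu0, se).inv_cdf(1 - alpha)
    # Beta: area of the Ha distribution that falls below the threshold.
    beta = NormalDist(mu_a, se).cdf(critical)
    return critical, beta

critical, beta = critical_value_and_beta(mu0=15.0, mu_a=16.0, sigma=2.0, n=25)
print(f"reject Ho if the sample mean exceeds {critical:.2f} Km/liter")
print(f"beta (chance of missing a true mean of 16) = {beta:.3f}")
```

Moving `mu_a` further from `mu0`, or increasing `n`, reduces the overlap between the two distributions and therefore reduces β.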

Step 4: Collect the Data

Step 5: Calculate a test statistic.

The test statistic is a numerical measure computed from the sample data, which is then compared with the critical value to determine whether or not the null hypothesis should be rejected. Another way is to convert the test statistic into a probability value called the p-value, which is then compared with α to conclude whether the hypothesis that was made about the population is to be rejected or not.
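
A minimal sketch of both routes, for the student-height claim (Ho: μ = 160 Cm). It uses a two-sided z-test and assumes a known population standard deviation; the sample mean of 161.5 Cm and σ = 8 Cm are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Return the z statistic and the two-sided p-value."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Probability of a value at least this extreme in either tail under Ho.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = one_sample_z_test(sample_mean=161.5, mu0=160.0, sigma=8.0, n=100)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

The z value is what gets compared with the critical value, and the p-value is what gets compared with α; both comparisons lead to the same decision.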

Also see “p-value, what the hell is it?” and “Is it difficult for you to comprehend confidence interval?”

Step 6: Construct Acceptance / Rejection regions.

The critical value is used as a benchmark or used as a threshold limit to determine whether the test statistic is too extreme to be consistent with the null hypothesis.

Step 7: Based on steps 5 and 6, draw a conclusion about H0.

The decision, whether to accept or reject the null hypothesis is based on following criterion:

• If the absolute value of the test statistic exceeds the absolute value of the critical value, the null hypothesis is rejected.
• Otherwise, we fail to reject the null hypothesis (loosely speaking, Ho is accepted)
• Simplest way is to compare α and the p-value. If p-value is < α, reject the Ho.
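
The decision rules above can be sketched as follows for a two-sided z-test. `decide` is a hypothetical helper; it returns the verdict of the critical-value rule and of the p-value rule side by side so that their agreement is visible.

```python
from statistics import NormalDist

def decide(z, alpha=0.05):
    """Return (reject by critical value, reject by p-value) for a two-sided z-test."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    reject_by_critical = abs(z) > critical
    reject_by_p_value = p_value < alpha
    return reject_by_critical, reject_by_p_value

for z in (1.875, 2.5, -2.5):
    by_critical, by_p = decide(z)
    print(f"z = {z:+.3f}: reject by critical value = {by_critical}, "
          f"reject by p-value = {by_p}")
```

For a given α the two rules are equivalent, which is why comparing the p-value with α is the shortcut most software reports.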

Summary:

The null and alternative hypotheses are competing statements about the population that are tested using the sample. Either the null hypothesis (H0) is true or the alternative hypothesis (Ha) is true, but not both. Ideally the hypothesis testing procedure should lead to the acceptance of H0 when H0 is true and the rejection of H0 when Ha is true. Unfortunately, correct conclusions are not always possible because hypothesis tests are based on sample information; therefore, we must make provision for the possibility of type-I and type-II errors.

## “p-value” What the Hell is it?

Whenever we go to the supermarket, say to buy tomatoes, we go to the vegetable section and, merely by looking at them, form a hypothesis in our mind that the tomatoes must be of good or bad quality. What we are doing is intuitively setting a qualitative limit on the quality, which we can call the theoretical limit. Now we go to the shelf, pick a sample of tomatoes and press them between our fingers to check the hardness; we can take this as the experimental value of hardness. If this experimental value is better than the theoretical value, we end up buying the tomatoes. In business decisions, when we want to compare two processes, we have a theoretical limit represented by α and a corresponding experimental value represented by the p-value. If the p-value (experimental or observed value) is found to be less than α (theoretical value), then the two processes are different.

You must have understood the following “normal distribution” after having gone through so many blogs on this site. Let’s revise what we know about the normal curve

If a process is stable, it will follow the bell shaped curve called as normal curve. It means that, if we plot all historical data obtained from a stable process – it will give a symmetrical curve as shown above. The distance from the mean (μ) in either direction is measured in the terms of σ. The σ represents the standard deviation (a measurement of variation)

The main characteristic of the above curve is the proportion of the population captured between any two σ values. For example, μ±2σ contains about 95% of the total population and μ±3σ contains 99.73% of the total population.

The normal curve never touches the x-axis, i.e. it extends from −∞ to +∞. This is very important for understanding the p-value concept. The implication is that there is always a possibility of finding an observation anywhere between −∞ and +∞ or, in other words, “even a stable process can give a product with a specification anywhere between −∞ and +∞”. But as you move away from the mean, the probability of finding an observation decreases. For example, the total probability of finding an observation beyond μ±2σ is only about 5% (2.5% on either side of the normal curve). This probability decreases to about 0.3% for the interval μ±3σ.
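
These proportions are easy to verify with the standard normal distribution. The short sketch below uses only Python's standard library and makes no assumption beyond normality.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def inside(k):
    """Probability of falling within mean ± k standard deviations."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within ±{k}σ: {inside(k):.4%}   beyond: {1 - inside(k):.4%}")
```

The exact figures are 95.45% for ±2σ and 99.73% for ±3σ, so the "5%" and "0.3%" used in the text are rounded values.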

Now, let’s understand this: I am manufacturing something, my process is quite robust and it follows the normal distribution. As noted above, there is always a possibility that the specification of the product can fall anywhere between −∞ and +∞. But I can’t go to my customer and make this statement. The point I want to make here is that there has to be a THRESHOLD DISTANCE (control limits) from the mean (say μ ± xσ) as the acceptance criterion for the product, and any product whose specification falls beyond this threshold limit would be rejected (will not be shipped to the customer).

In other words, if my process is giving me the products with sampling distribution of mean beyond the threshold limits, then I will assume that my process has deviated from the SOP (standard operating procedure) due to some assignable cause and now the current process is different from the earlier process! Or simply there are two processes that are running in the plant.

Generally μ±3σ is taken as the threshold limit. This threshold is represented by alpha (α) or the % acceptable error. In present case (μ ± 3σ), α = 0.3% or 0.003.

From the above point, it is clear that, as long as the process is giving me the sampling distribution of mean within μ±3σ, we would say that the products are coming from the same process. If the sampling distribution of mean of a batch of the manufactured product is falling beyond μ±3σ then, it would represent a different process.

Till now we have defined a theoretical threshold limit called as alpha (α). Now consider two sampling distribution of mean of two processes described below

In case-3, we can confidently say that the two processes are different, as there is minimal overlap between the two sampling distributions of the mean. But what about case-2 and case-1? In these two cases, taking a decision would be difficult because there is (or at least appears to be) significant overlap between the two distributions! In these circumstances, we need a statistical tool to assess whether the overlap is significant or not; in other words, a tool is required to ascertain whether the sampling distributions of the mean of the two processes are far enough apart to say that the two processes are different.

In order to do that, we need to collect some data from both processes and then subject them to a statistical test (z-test, t-test, F-test etc.) to check whether the difference between the means of the two processes is significant or not. This significance is obtained by collecting a sample from each population followed by a statistical analysis, and the result comes in the form of a probability called the p-value. The point to be noted here is that the p-value is generated from a statistical test (equivalent to an experimental value).

We can say that the α is the theoretical threshold limit and the p-value is the experimentally generated threshold limit and if the p-value is less than or equal to the theoretical threshold limit α then we would say two processes are really different.

When we say the p-value < α, it means that the sampling distribution of mean of the new process is significantly different from the existing process.

To summarize

1. In general α = 0.05, i.e. we accept a 5% chance of concluding that the two processes are different when they are actually the same.
2. If the p-value (experimental or observed value) is < α (theoretical value), then the new process is different.

More details would be covered in hypothesis testing

Abstract: You will be surprised to learn that we are all aware of this concept of distribution and use it intuitively, all the time! Don’t believe me? Let me ask you a simple question: to which income class do you belong? Let’s assume that your answer is the middle income class. On what basis did you make this statement? Probably you have the following distribution of income groups in your mind and, based on this image, you are placing yourself towards the left side or towards the middle income group on this distribution.

Figure-1: How we are making use of distribution in our daily life, intuitively

Note: This article gives a conceptual view of the tools that we use in inferential statistics. Here we are not explaining the concept of sampling or the sampling distribution. Instead, we are using the distribution of individual values and assuming them to be normally distributed (which is not always the case) in order to explain and illustrate the concept. We advise readers to read something on “sampling and sampling distribution” immediately after reading this article for better clarity, as we are giving an oversimplified version of the same here. Don’t miss the “Central limit theorem”.

Introduction to the Concept of Distribution

When you say that your child is not good at studies, you are drawing a distribution of all students in your mind and implicitly placing your child towards the left of that distribution. Whenever we talk of adjectives like rich, poor, tall, handsome, beautiful, intelligent, cost of living etc., we subconsciously associate a distribution with those adjectives and just try to pinpoint the position of a given subject on this distribution.

What we are dealing here is called as inferential statistics because, it helps in drawing inferences about the population based on a sample data. This is just opposite of probability as shown below.

Figure-2: Difference between probability & statistics

Inferential statistics empowers us to take a decision based on a small sample drawn from a population.

Why, it is so difficult to take decisions or what causes this difficulty?

This is because we are dealing with samples instead of the population. Let’s assume we are making a batch of one million tablets (population) of Lipitor and, before releasing this batch into the market, we want to make sure that each tablet has a Lipitor content of 98-102%. Can we analyze all one million tablets? Absolutely not! What we actually do is analyze, say, 100 tablets (sample) selected at random from the one million tablets and, based on the results, accept or reject the whole lot of one million tablets (we usually use a z-test or t-test for taking such decisions).

BUT, there is a catch. Since we are working with small samples, there is always a chance of taking a wrong decision, because the selected sample may not represent the entire population well (sampling error). The risk of such a wrong decision that we are willing to accept is denoted by alpha (α) and is decided by the management prior to performing any study. In line with the definition of α as the probability of rejecting the null hypothesis when it is true, it means there is a probability of α that we reject a good batch of Lipitor by mistake. Since α is a theoretical threshold limit, it must be compared against an experimental probability value. This experimental or observed probability value is called the p-value (see blog on p-value).

Another aspect of the above discussion arises if we draw two or more samples (of 100 tablets each) and try to analyze them. Let me make it more complicated for you. You are the analyst and I come to you with three samples and want to know whether all three samples come from a single batch (i.e. belong to the same parent population) or not. The point I want to emphasize is that even though multiple samples are drawn from the same population, they would seldom be exactly the same because of the sampling error. The concept is described in figure-3 below. This type of decision, where the number of samples is ≥ 3, is taken by ANOVA.

Figure-3: The distribution overlap and the decision making (or inferential statistics)

We have seen earlier that α is the theoretical probability or a threshold limit beyond which we assume that the process is no longer the same. This theoretical limit is then tested by collecting a dataset followed by performing some statistical tests (t-test, z-test etc.) to obtain an experimental or observed probability value or the p-value and if, this p-value is found to be less than α, we say that samples are coming from two different populations. This concept is represented below

Figure-4: The relationship between p-value and the alpha value for taking statistical decisions.

Let’s remember the above diagram and try to visualize some more situations that we face every day, where we are supposed to take decisions. But before we do that, one important point, we must identify the target population correctly otherwise whole exercise would be a futile one.

For example

As a high end apparel store, I am interested in the monthly expenditure of females, but wait a second, shouldn’t we specify what kind of females? Yes, we require to study the females of following two categories

The employed and the self-employed females (great! at least we have identified the population categories to be compared). Now next dilemma is whether to consider the females of all age groups or the females below certain age? As my store is more interested in young professionals hence I would compare the above two groups of females but with an age restriction of less than or equal to thirty years.

Figure-5: Identifying the right population for study is important

Another important point, in order to compare two (using z-test or t-test) or more samples (using ANOVA), we also require information about the mean and standard deviation of the samples, before we can tell whether they are coming from same or different parent population.

For example, suppose the mean monthly expenditure on apparel by a sample of 30 employed females is \$1500 and the mean expenditure by 30 self-employed females is \$1510. Immediately we will compare these two means and conclude that they are almost the same. In the back of our mind we are assuming that, even though the means are different, there will be some variation in the data and, if we consider this variation, the difference is not significant. Remember! We have made some kind of distribution in our mind before making this statement (statistically we do it with a two-sample t-test).

Figure-6: Significant overlap between distributions, indicating no difference between them

What if the mean expenditure by self-employed females is \$1525? We can still say the difference is not big enough to be significant (again, we are assuming that there will be variability in the data). What if the mean expenditure by self-employed females is \$1600? In this case we are certain that the difference is significant. In all three cases discussed above, it is assumed that the variance remains constant.

Figure-7: Insignificant overlap between distributions, indicating that there is a difference between them

In real life, whenever we encounter two samples, we are tempted to compare the mean directly for taking decisions. But, in doing so, we forget to consider the standard deviation (variation) that is there in the data of two samples. If we consider the standard deviation and then if we find that there is no significant overlap between the distribution of the monthly expenditure by employed females and the distribution of the monthly expenditure by self-employed, then we can conclude that the expenditure behavior of the two groups are different (see figure-6 & 7 above).
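
The apparel example can be turned into a rough calculation. The article names the two-sample t-test; the sketch below swaps in a two-sample z-test (a large-sample approximation) so that it runs on Python's standard library alone. The standard deviation of \$60 per group is an assumption made up for illustration, since the article does not give one.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_test(mean1, mean2, sd1, sd2, n1, n2):
    """z statistic and two-sided p-value for the difference of two means."""
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = (mean2 - mean1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

ALPHA = 0.05
ASSUMED_SD = 60.0   # hypothetical spread of monthly spend in each group

results = {}
for self_employed_mean in (1510, 1525, 1600):
    z, p = two_sample_z_test(1500, self_employed_mean,
                             ASSUMED_SD, ASSUMED_SD, 30, 30)
    results[self_employed_mean] = p
    verdict = "different" if p < ALPHA else "no significant difference"
    print(f"1500 vs {self_employed_mean}: z = {z:.2f}, p = {p:.4f} -> {verdict}")
```

Under this assumed spread, the first two gaps are not significant at α = 0.05 while the third one is. A smaller or larger standard deviation would move those verdicts, which is exactly why the means cannot be compared on their own.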

Some other situations that could be understood by drawing the distribution. It will help us in comprehending the situation in a much better way.

The women in the workforce are protesting that there is a gender bias in the pay scale in your company. Is it so?

Once again, be careful about selecting the population for the study! We should only compare males and females of the same designation or with the same work experience. Let’s take the designation (males and females at manager and senior manager level) as the criterion for the comparison. Since we have identified the population, we can now select random samples from both genders at manager and senior manager level. We can have two situations: either the two distributions overlap or they do not. If there is a significant overlap (p-value > α), then there is no difference in salary based on gender. On the other hand, if the two distributions are far apart (p-value < α), then there is a gender bias.

Figure-8: Intuitive scenarios for taking decision, based on the degree of distribution overlap

Our new gasoline formula gives a better mileage than the other types of gasoline available in the market, should we start selling it?

This problem can be visualized by following diagram. But be careful! While measuring mileage, make sure you are taking same kind of car and testing them on the same road and running them for the same number of kilometers at a same constant speed! Since number of samples ≥ 3, use ANOVA.

Figure-9: Understanding the gasoline efficiency using distribution
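
For three or more groups, the ANOVA comparison boils down to an F statistic: variation between group means divided by variation within groups. The sketch below computes it by hand for made-up mileage figures (Km/liter). It stops at the F value, because turning it into a p-value needs the F distribution, which Python's standard library does not provide.

```python
from statistics import mean

def one_way_f_statistic(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical mileage readings for three gasolines, three cars each.
mileage = {
    "gasoline A": [14, 15, 16],
    "gasoline B": [15, 16, 17],
    "new formula": [19, 20, 21],
}
f_stat = one_way_f_statistic(list(mileage.values()))
print(f"F = {f_stat:.2f}")
```

For this data F works out to 21 with (2, 6) degrees of freedom. Standard F tables give a critical value of about 5.14 at α = 0.05, so these made-up samples would be judged to come from different populations.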

A new filament increases the lifetime of a bulb by 10%, should we commercialize it?

For this problem, let’s produce two sets of bulbs, first set with the old filament and second set with the new filament. This is followed by testing the samples from each group for their lifespan, what we are expecting is represented below

Figure-10: Understanding the filament efficiency using distribution

A new catalyst developed by the R&D team can increase the yield of the process by 5%, should we scale up the process?

Here we need to establish whether the 5% increase in yield is really higher or not. Can this case be represented by case-1 or by case-2 in above diagrams?

The efficacy of a new drug is 30% better than of the existing drug in the market, is it so?

The soap manufacturing plant finds that some of the soaps are weighing 55 gm instead of 50-53 gm (the target weight); should it reset the process for corrective actions?

A new production process claims to reduce the manufacturing time by 4 hrs, should we invest in this new process?

The students of ABC management school are offered better salary than that of the XYZ School, is it so? Colleges advertise like that!

Let’s have a look how the data is usually manipulated here. In order to promote a brand, companies usually distort the distribution when they compare their products with the other brands.

Figure-11: Misuse of statistics

ABC College or any other company promoting their brands would take samples from the upper band of their distribution and then they compare it with the distribution of the XYZ College or with other available brands. This gives a feeling that ABC College or a given brand in question is better than others. Alternatively, you can take competitor’s samples from the lower end of their distribution for comparison for getting the feel good factor about your brand!
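
The distortion described above can be demonstrated with a small simulation. Both colleges are given the same hypothetical salary distribution (the numbers are invented), and the "advertised" figure for ABC is computed from only the top quarter of its own sample.

```python
import random
from statistics import mean

rng = random.Random(42)
# Both colleges draw salaries from the SAME assumed distribution.
abc = [rng.gauss(60_000, 8_000) for _ in range(200)]
xyz = [rng.gauss(60_000, 8_000) for _ in range(200)]

def upper_band(values, fraction=0.25):
    """Keep only the highest `fraction` of the values."""
    ordered = sorted(values, reverse=True)
    keep = max(1, int(len(ordered) * fraction))
    return ordered[:keep]

honest_gap = mean(abc) - mean(xyz)
distorted_gap = mean(upper_band(abc)) - mean(xyz)

print(f"honest difference in mean salary:     {honest_gap:10.0f}")
print(f"cherry-picked difference (top 25%):   {distorted_gap:10.0f}")
```

The cherry-picked gap is large even though the two populations are identical by construction, so the comparison says more about how the sample was chosen than about the colleges.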

Yield of a process has decreased from 90% to 87%, should we take it as a six sigma project?

Again, we need to establish whether the decrease of 3% yield is really significant or not. Can this case be represented by case-1 or by case-2 in above diagrams?

If we look at the situations described above, we are forced to think: “what is the minimum separation required between the means of two samples to tell whether there is significant overlap or not?”

Figure-12: What should be the minimum separation between distribution?

This is usually done in following steps (this will be dealt separately in next blog on hypothesis testing)

1. Hypothesis testing
   1. Null and alternate hypothesis
   2. Decide α
2. Test statistics
   1. Use an appropriate statistical test (Z-test, t-test, F-test etc.) to estimate the p-value
3. Compare p-value and α
4. Take decision based on whether p-value is < or > α

Concept of distribution and the hypothesis testing

Let’s see how the above concept of distribution helps in understanding hypothesis testing. In hypothesis testing we make two statements about the same population, which we then test using the sample. These two statements are known as the “null” and “alternate” hypotheses.

Null Hypothesis (H0): Mean mileage from a liter of new gasoline ≤ 20 Km (first distribution)

Alternate Hypothesis (Ha): Mean mileage from a liter of new gasoline > 20 Km (second distribution)

The above two statements can be represented by the following two distributions

Figure-13: Distributions of null and alternate hypothesis

Now, if H0 is true, i.e. the new gasoline is no better than the existing one, then we would expect the two distributions to overlap significantly (p-value > α)

Figure-14: Pictorial view of the condition when null hypothesis is true

On the other hand, if H0 is false or Ha is true (the new gasoline is really better than the existing one), then these two distributions will be far from each other, i.e. there would be no significant overlap of the two distributions (p-value < α)

Figure-15: The pictorial view of the case when null hypothesis is not true
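
The gasoline hypotheses above are one-sided (H0: mean ≤ 20 Km, Ha: mean > 20 Km), so only the upper tail counts as evidence against H0. A minimal sketch, assuming a known standard deviation and using invented test-drive results:

```python
from math import sqrt
from statistics import NormalDist

def upper_tail_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Test H0: mean <= mu0 against Ha: mean > mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Only values ABOVE mu0 count against H0, so use the upper tail.
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value, p_value < alpha

# Hypothetical: 36 test drives, mean 20.9 Km/liter, assumed sigma of 2.
z, p_value, reject_h0 = upper_tail_z_test(sample_mean=20.9, mu0=20.0,
                                          sigma=2.0, n=36)
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

A sample mean only slightly above 20 (for example 20.2 with the same spread and sample size) would give a large p-value and H0 would stand, which corresponds to the heavily overlapping picture in Figure-14.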

Above discussion can be extended to understand ANOVA, Regression analysis etc.

Summary

This article tries to give a pictorial view to a given statistical problem, we can call it as “The Tale of Two Distributions”.

Any business problem that requires decision making can be visualized in the form of overlapping or non-overlapping distributions. This gives the management a pictorial view of the problem and makes it easier to comprehend.

Another important point here is the exercise of identifying the right target population, i.e. we must make sure that an apple is compared to an apple!

Going forward, this understanding will help you in understanding hypothesis testing in upcoming blog.