## Understanding the Difference Between Long and Short Term Sigma

We have seen that the main difference between Cpk and Ppk is the way in which the value of sigma (the standard deviation) is calculated.

In Cpk, the value of sigma comes from the control chart and is usually given by the formula

sigma-short = R̄ / d2

where R̄ is the average of the absolute values of the moving ranges (each obtained as the difference between two consecutive points when the data is arranged in time order). The term d2 is a statistical constant that depends on the subgroup size (d2 = 1.128 for moving ranges of two consecutive points).
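To make the formula concrete, here is a minimal sketch (in Python, purely illustrative; the article's own simulation below uses R) that estimates sigma-short from the average moving range using d2 = 1.128, and compares it with the overall standard deviation. The data values are made up for illustration:

```python
import statistics

def sigma_short(data, d2=1.128):
    """Estimate short-term sigma from the average moving range.

    d2 = 1.128 is the control-chart constant for moving ranges
    of two consecutive points (subgroup size n = 2).
    """
    moving_ranges = [abs(b - a) for a, b in zip(data, data[1:])]
    r_bar = sum(moving_ranges) / len(moving_ranges)
    return r_bar / d2

def sigma_long(data):
    """Overall (long-term) sigma: the ordinary sample standard deviation."""
    return statistics.stdev(data)

# Hypothetical measurements, in time order
data = [99.2, 100.1, 101.3, 98.7, 100.9, 99.5, 100.4]
print(round(sigma_short(data), 4))
print(round(sigma_long(data), 4))
```

Reordering `data` changes the moving ranges, and hence sigma-short, while sigma-long is unaffected.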

This sigma-short is affected by the time order of the data, i.e., every time you change the time order, sigma-short changes.

In Ppk, on the other hand, sigma is calculated using the traditional formula and is called the overall sigma or sigma-long. In this case, sigma-long is not affected by the time order of the data points. This is why it is also called the overall standard deviation.

Usually, sigma-short is less than sigma-long.

Let's do a simulation in R to check whether sigma-short is really affected by the time order or not.

```r
# Set the seed for reproducibility
set.seed(2307)

# Load the qcc library
library(qcc)

# Generate a normal sample of 50 data points
d <- rnorm(50, 100, 1.1)

# Objects for storing the control-chart output, sigma-short and sigma-long
IMR <- list()
sigma_short <- c()
sigma_long <- c()

# Blank matrix of 10 rows and 50 columns to store 10 random
# reorderings of the same 50 data points
sam <- matrix(nrow = 10, ncol = 50, byrow = TRUE)

# Generate 10 random reorderings of the sample (d) generated above
for (i in 1:10) {
  # Store the i-th reordering in the matrix sam
  sam[i, ] <- sample(d, 50, replace = FALSE)
  # Generate the I-MR chart of the i-th reordering
  IMR <- qcc(sam[i, ], type = "xbar.one", plot = FALSE)
  # Sigma-short of the i-th reordering (from the control chart)
  sigma_short[i] <- IMR$std.dev
  # Sigma-long of the i-th reordering (overall standard deviation)
  sigma_long[i] <- sd(sam[i, ])
}

# Print sigma-short and sigma-long for all 10 reorderings
(data_table <- cbind(sigma_short, sigma_long))
```

Table-1: Short and long sigma generated from the same simulated data but with different time order.

| sigma_short | sigma_long |
|------------:|-----------:|
| 1.1168596 | 1.09059 |
| 1.1462365 | 1.09059 |
| 1.1023853 | 1.09059 |
| 0.9902320 | 1.09059 |
| 1.1419678 | 1.09059 |
| 1.2173854 | 1.09059 |
| 0.9941954 | 1.09059 |
| 1.0408088 | 1.09059 |
| 1.1038588 | 1.09059 |
| 1.2275286 | 1.09059 |

It is evident from the simulation that sigma-short does get affected by the time order of the data, while sigma-long stays the same. Therefore, the sigma calculated from the control chart (sigma-short) and the overall sigma (sigma-long) are different.

For more on Cpk and Ppk, see the links below:

Car Parking & Six-Sigma

What Taguchi Loss Function has to do with Cpm?

What do we mean by garage’s width = 12σ and car’s width = 6σ?

## Fishing and Hypothesis Testing!

Let's assume that we are in the business of fish farming. We want to make sure that the mean weight of all the fish in the pond is 2 kg before we sell them in the market. It is not possible to take all of them out of the pond (we don't even know how many fish are there!) and measure their individual weights. So what we do is take out a sample of, say, 25 fish and measure their weights. If the mean weight of the 25 fish is near 2 kg (more precisely, if it is between 2 ± 0.15 kg), we assume that the lot is ready to be sold in the market. This is hypothesis testing: we are trying to estimate the population parameter (the mean weight of all the fish in the pond) based on the mean weight of a sample of 25 fish, with an acceptance criterion in mind (between 2 ± 0.15 kg).
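The fish-pond decision rule above can be sketched in a few lines of code. This is an illustrative sketch only: the weights below are made up, and the function simply checks the article's acceptance criterion (sample mean within 2 ± 0.15 kg):

```python
import statistics

def ready_to_sell(weights, target=2.0, tolerance=0.15):
    """Decide whether the lot is ready: sample mean within target ± tolerance."""
    sample_mean = statistics.mean(weights)
    return abs(sample_mean - target) <= tolerance, sample_mean

# Hypothetical weights (kg) of a sample of 25 fish
weights = [2.1, 1.9, 2.0, 2.2, 1.8, 2.05, 1.95, 2.1, 2.0, 1.9,
           2.15, 1.85, 2.0, 2.1, 1.9, 2.05, 2.0, 1.95, 2.1, 1.9,
           2.0, 2.05, 1.95, 2.1, 1.9]
ok, m = ready_to_sell(weights)
print(ok, round(m, 3))
```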

In inferential statistics we try to estimate a population parameter by studying a sample drawn from that population, because it is not always possible to study the whole population (called a census). Take the example of rice being cooked in a restaurant. The entire lot of rice taken for cooking may be considered the population. Let's further assume that there is a set protocol for cooking, and after a predetermined cooking time the chef hypothesizes that the whole lot has been cooked properly. To check his hypothesis, he takes out a few grains of the cooked rice (the sample) and tests his hypothesis by subjecting the sample to a test (pressing the grains between his fingers). Finally, based on the sample's result, the chef decides whether the whole lot of rice is cooked or not. We have executed the following steps to make an estimate about the degree of cooking of the entire lot of rice based on the small sample drawn from the pot.

• We select the population to be studied (the entire lot of rice under cooking)
• A cooking protocol is followed by the chef and he makes a hypothesis that the whole lot of rice (population) might have been cooked properly → trying to make an estimate about the population parameter
• Then we draw a sample from the pot → sample
• We have a criterion in our mind to say whether the rice is overcooked, undercooked, or properly cooked as per expectation → we set a threshold limit (cf. a confidence interval, CI) within which we assume that the rice is properly cooked. Below that it is undercooked, and above that it is overcooked.
• We test the sample of cooked rice by pressing the grains between our fingers → test statistic
• The result of the test statistic is compared with the threshold limit set prior to conducting the experiment, and based on this comparison a decision is taken on whether the whole lot of rice is cooked properly or not → inference about the population parameter.

It must be somewhat clear from the above discussion that we use hypothesis testing to challenge whether some claim about a population is true or not by utilizing sample information. For example:

• The mean height of all the students in the high schools of a given state is 160 cm.
• The mean salary of fresh MBA graduates is \$65000.
• The mean mileage of a particular brand of car is 15 km per liter of gasoline.

Each of the above statements is a claim about a population parameter that we hypothesize to be true. To test these hypotheses, we take a sample (say, the heights of 100 students selected at random from the high school, or the salary data of 25 students selected at random from the MBA class) and subject it to a statistical test, called a test statistic (equivalent to pressing the rice between the fingers), to conclude whether the statement made about the population parameter is true or not.

But before we go any further, it is important to understand that we are using a sample for estimating the population parameter, and the size of the sample is very small relative to the size of the population. As a result, estimating the population parameter based on the sample statistic involves some uncertainty or error. This is represented by the following equation:

Population Parameter = Sample statistics ± margin of error

The above equation gives an interval (because of the ± sign) within which the population parameter is expected to be found. This interval is called the confidence interval (CI). Note that every sample drawn from the population would give a different confidence interval!
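The equation above can be turned into code. The sketch below is illustrative only: it uses a normal quantile, which is reasonable when the sample standard deviation estimates the population sigma well, while small samples would strictly call for a t quantile; the sample values are hypothetical:

```python
from statistics import NormalDist, mean, stdev
from math import sqrt

def confidence_interval(sample, confidence=0.95):
    """Approximate CI for the population mean:
    sample mean ± z * (s / sqrt(n)), with z the normal quantile.
    """
    n = len(sample)
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # about 1.96 for 95%
    margin = z * stdev(sample) / sqrt(n)            # the margin of error
    m = mean(sample)
    return m - margin, m + margin

sample = [2.1, 1.9, 2.0, 2.2, 1.8, 2.05, 1.95, 2.1, 2.0, 1.9]
lo, hi = confidence_interval(sample)
print(round(lo, 3), round(hi, 3))
```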

Suppose we are trying to estimate the population mean (which is usually unknown) and we draw 100 samples from the population. Those 100 samples would give 100 different CIs, because each sample has a different sample mean and a different margin of error. Now the question arises: would all 100 CIs thus obtained contain the population parameter? As stated earlier, because of sampling error we can never estimate the population parameter exactly; in other words, there will be some degree of error in estimating the population parameter from the sample statistic. Hence, we should be wise enough to accept an inherent error rate prior to conducting any hypothesis test. Let's say that if I collect 100 samples from the population and obtain 100 different CIs, then there is a chance that 5 of those CIs might not contain the population parameter. This is called the α error or type-I error. This α represents the acceptable error, or the level of significance, and it has to be determined prior to conducting any hypothesis test. Usually it is a management decision. For more detail see "Is it difficult for you to comprehend confidence interval?"
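The "5 CIs out of 100 might miss" idea can be checked by simulation. The following sketch is illustrative (the population parameters and seed are arbitrary choices): it draws 100 samples from a known population and counts how many of the resulting 95% intervals fail to contain the true mean:

```python
import random
from statistics import NormalDist, mean, stdev
from math import sqrt

random.seed(42)
MU, SIGMA, N, TRIALS = 100.0, 5.0, 30, 100
z = NormalDist().inv_cdf(0.975)  # two-sided 95% quantile

missed = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    m = mean(sample)
    margin = z * stdev(sample) / sqrt(N)
    if not (m - margin <= MU <= m + margin):
        missed += 1  # this CI does not contain the true mean

print(missed)  # expected to land somewhere near 5 out of 100
```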

Based on the above discussion, a 7-step process for hypothesis testing is used (note: step 2 is described before step 1; this is done because it helps us write the hypotheses correctly).

Step 2: State the Alternate Hypothesis.

This is denoted by Ha, and it is the claim about the population that we actually want to test. In other words, Ha denotes what we want to prove.

For example:

• The mean height of all students in the high school is 160 cm.
• Ha: μ ≠ 160 cm
• The mean salary of fresh MBA graduates is \$65000.
• Ha: μ ≠ \$65000
• The mean mileage of a particular brand of car is greater than 15 km/liter of gasoline.
• Ha: μ > 15 km/liter

Step 1: State the Null Hypothesis.

This is denoted by Ho. We state the null hypothesis as if we are extremely lazy and don't want to do any work! For example, if a new gasoline is claimed to give an average mileage of greater than 15 km/liter, then my null hypothesis would be "it is less than or equal to 15 km/liter"; by stating it this way, we would not take any pains in testing the new gasoline. We are happy with the status quo!

So the null hypotheses in all of the above cases are:

• The mean height of all students in the high school is 160 cm.
• Ho: μ = 160 cm
• The mean salary of fresh MBA graduates is \$65000.
• Ho: μ = \$65000
• The mean mileage of a particular brand of car is greater than 15 km/liter of gasoline.
• Ho: μ ≤ 15 km/liter

Therefore, if you want me to work, first you make an effort to reject the null hypothesis!

Step 3: Set α

But before we go any further, it is important to understand that we are using a sample for estimating the population parameter, and the size of the sample is much smaller than the size of the population. Because of this sampling error, the estimate of the population parameter will contain some uncertainty or error. There are two types of error that we can make in hypothesis testing.

Following is the contingency table for the null hypothesis. We can make two errors: first, rejecting the null hypothesis when it is true (the α error); and second, accepting the null hypothesis when it is false (the β error). Hence, the acceptance limit for both errors is decided prior to hypothesis testing. Using the z-transformation or the t-test, we determine the critical value (threshold) corresponding to the error α.

The level of significance α is the probability of rejecting the null hypothesis when it is true. This is like rejecting a good lot of material by mistake. The β, called the type-II error, is the probability of accepting the null hypothesis when it is false. This is like accepting a bad lot of material by mistake.
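As a numerical illustration (the numbers below are hypothetical, not from the article), the two error probabilities can be computed for a one-sided z-test: α fixes the critical value on the Ho distribution, and β is then the area of the Ha distribution that still falls on the "accept Ho" side of that critical value:

```python
from statistics import NormalDist
from math import sqrt

# Hypothetical setup: Ho: mu = 15, Ha: mu = 16, sigma = 2, n = 25
mu0, mu_a, sigma, n = 15.0, 16.0, 2.0, 25
se = sigma / sqrt(n)  # standard error of the sample mean

alpha = 0.05
# Reject Ho if the sample mean exceeds this critical value
critical = NormalDist(mu0, se).inv_cdf(1 - alpha)

# beta: probability of NOT rejecting Ho when Ha (mu = 16) is true
beta = NormalDist(mu_a, se).cdf(critical)

print(round(critical, 3), round(beta, 3))
```

Note how α and β trade off: lowering α pushes the critical value further out, which raises β for the same sample size.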

Let's understand the null and alternate hypotheses graphically. Following are the distributions of Ho and Ha with means μa and μb respectively, both having a variance of σ². We also have an error term α, representing a threshold value on the distribution of Ho beyond which we would fail to accept Ho (in other words, we would reject Ho). Now the issue to be resolved is: how can we say whether the two distributions represented by Ho and Ha are the same or not? This is usually done by measuring the extent of overlap between the two distributions, which we do by measuring the distance between their means (of course, we need to consider the inherent variance in the system). Statistical tools like the z-test, t-tests, ANOVA, etc. help us conclude whether the two distributions overlap significantly or not.

Step 4: Collect the Data

Step 5: Calculate a test statistic.

The test statistic is a numerical measure computed from the sample data, which is then compared with the critical value to determine whether or not the null hypothesis should be accepted. Another way is to convert the test statistic into a probability value called the p-value, which is then compared with α to conclude whether the hypothesis made about the population should be accepted or rejected.
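As a sketch of this conversion (the numbers are illustrative, and a z-test assumes the population σ is known), the test statistic and its p-value for the "mean height = 160 cm" claim might be computed as:

```python
from statistics import NormalDist
from math import sqrt

def z_test_p_value(sample_mean, mu0, sigma, n, two_sided=True):
    """Compute the z statistic and its p-value for a test of mu = mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    tail = 1 - NormalDist().cdf(abs(z))        # one-tail probability
    return z, 2 * tail if two_sided else tail  # double it for a two-sided test

# Hypothetical survey: 100 students, sample mean 162 cm, assumed sigma 8 cm
z, p = z_test_p_value(sample_mean=162.0, mu0=160.0, sigma=8.0, n=100)
print(round(z, 2), round(p, 4))
```

With α = 0.05, a p-value this small would lead us to reject Ho: μ = 160 cm.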

Also See “p-value, what the hell is it?”

Conceptualizing “Distribution” Will Help You in Understanding Your Problem in a Much Better Way

Is it difficult for you to comprehend confidence interval?

Step 6: Construct acceptance / rejection regions.

The critical value is used as a benchmark or threshold limit to determine whether the test statistic is too extreme to be consistent with the null hypothesis.

Step 7: Based on steps 5 and 6, draw a conclusion about H0.

The decision whether to accept or reject the null hypothesis is based on the following criteria:

• If the absolute value of the test statistic exceeds the absolute value of the critical value, the null hypothesis is rejected.
• Otherwise, we fail to reject the null hypothesis (or, simply, Ho is accepted).
• The simplest way is to compare α and the p-value: if the p-value is < α, reject Ho.

Summary:

The null and alternative hypotheses are competing statements about the population, assessed using the sample. Either the null hypothesis (H0) is true or the alternative hypothesis (Ha) is true, but not both. Ideally the hypothesis-testing procedure should lead to the acceptance of H0 when H0 is true and the rejection of H0 when Ha is true. Unfortunately, correct conclusions are not always possible: because hypothesis tests are based on sample information, we must allow for the possibility of type-I and type-II errors.

## "p-value" What the Hell is it?

Whenever we go to the supermarket, say to buy tomatoes, we go to the vegetable section and, by merely looking at them, form a hypothesis in our mind that the tomatoes must be of good or bad quality. What we are doing is intuitively setting a qualitative limit on the quality; we can call it the theoretical limit. Now we go to the shelf, pick a sample of tomatoes, and press them between our fingers to check their hardness; we can take this as the experimental value of hardness. If this experimental value is better than the theoretical value, we end up buying the tomatoes. In business decisions, when we want to compare two processes, we have a theoretical limit represented by α and a corresponding experimental value represented by the p-value. If the p-value (the experimental or observed value) is found to be less than α (the theoretical value), then the two processes are different.

You must have understood the following "normal distribution" after having gone through so many blogs on this site. Let's revise what we know about the normal curve. If a process is stable, it will follow the bell-shaped curve called the normal curve. It means that if we plot all the historical data obtained from a stable process, it will give a symmetrical curve as shown above. The distance from the mean (μ) in either direction is measured in terms of σ, the standard deviation (a measure of variation). The main characteristic of the curve is the proportion of the population captured between any two σ values: for example, μ ± 2σ contains approximately 95% of the total population and μ ± 3σ contains 99.73% of the total population.

The normal curve doesn't touch the x-axis, i.e., it extends from –∞ to +∞. This information is very important for understanding the p-value concept. The implication of this statement is that there is always a possibility of finding an observation anywhere between –∞ and +∞; in other words, even a stable process can give a product with a measurement anywhere between –∞ and +∞. But as you move away from the mean, the probability of finding an observation decreases: for example, the total probability of finding an observation beyond μ ± 2σ is only about 5% (2.5% on either side of the normal curve), and this probability decreases to about 0.3% for the interval μ ± 3σ. Now, let's understand this: I am manufacturing something, my process is quite robust, and it follows the normal distribution. As per point number 2 (see above), there is always a possibility that a product's measurement can fall anywhere between –∞ and +∞. But I can't go to my customer and make this statement. The point I want to make here is that there has to be a THRESHOLD DISTANCE (control limits) from the mean (say μ ± xσ) as the acceptance criterion for the product, and anything falling beyond this threshold limit would be rejected (not shipped to the customer). In other words, if my process is giving me products with a sampling distribution of the mean beyond the threshold limits, then I will assume that my process has deviated from the SOP (standard operating procedure) due to some assignable cause, and that the current process is now different from the earlier process; or simply, there are two processes running in the plant.

Generally μ ± 3σ is taken as the threshold limit. This threshold is represented by alpha (α), the acceptable error rate. In the present case (μ ± 3σ), α = 0.3% or 0.003. From the above point it is clear that, as long as the process gives me a sampling distribution of the mean within μ ± 3σ, we would say that the products are coming from the same process. If the sampling distribution of the mean of a batch of the manufactured product falls beyond μ ± 3σ, then it represents a different process.
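The percentages quoted above can be verified directly from the normal CDF; a quick sketch:

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal: mean 0, sigma 1
for k in (1, 2, 3):
    inside = nd.cdf(k) - nd.cdf(-k)   # proportion within mu ± k*sigma
    outside = 1 - inside              # proportion beyond the threshold
    print(f"mu ± {k} sigma: {inside:.2%} inside, {outside:.2%} outside")
```

This reproduces the familiar values: roughly 95.45% inside μ ± 2σ (about 4.55% outside) and 99.73% inside μ ± 3σ (about 0.27% outside).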

So far we have defined a theoretical threshold limit called alpha (α). Now consider the sampling distributions of the means of two processes, described below. In case 3 we can confidently say that the two processes are different, as there is minimal overlap between the two sampling distributions of the mean. But what about case 2 and case 1? In these two cases taking a decision would be difficult, because there is (or at least appears to be) significant overlap between the two distributions! In these circumstances we need a statistical tool to assess whether the overlap is significant; in other words, a tool is required to ascertain whether the sampling distributions of the means of the two processes are far enough apart to say that the two processes are different. To do that, we collect some data from both processes and then subject the data to a statistical test (z-test, t-test, F-test, etc.) to check whether the difference between the means of the two processes is significant. The result of this analysis is obtained in the form of a probability called the p-value. The point to note here is that the p-value is generated from a statistical test (equivalent to an experimental value).

We can say that α is the theoretical threshold limit and the p-value is the experimentally generated value; if the p-value is less than or equal to α, then we say the two processes are really different.

When we say that the p-value < α, it means that the sampling distribution of the mean of the new process is significantly different from that of the existing process.

To summarize

1. In general α = 0.05, i.e., we accept a 5% risk of declaring the two processes different when they are actually the same.
2. If the p-value (experimental or observed value) is < α (theoretical value), then the new process is different.

More details are covered under hypothesis testing.

## Conceptualizing "Distribution" Will Help You in Understanding Your Problem in a Much Better Way

Abstract: You will be surprised that we are all aware of this concept of distribution and are using it intuitively, all the time! Don't believe me? Let me ask you a simple question: to which income class do you belong? Let's assume that your answer is the middle income class. On what basis did you make this statement? Probably you have the following distribution of income groups in your mind, and based on this image you are placing yourself towards the middle income group of this distribution. Figure-1: How we make use of distribution in our daily life, intuitively
Note: This article gives a conceptual view of the tools that we use in inferential statistics. Here we are not explaining the concept of sampling or the sampling distribution. Instead, we are using distributions of individual values and assuming them to be normally distributed (which is not always the case) in order to explain the concept and for illustration purposes. We advise readers to read something on "sampling and sampling distributions" immediately after reading this article for better clarity, as we are giving an oversimplified version here. Don't miss the "central limit theorem".

Introduction to the Concept of Distribution

When you say that your child is not good at studies, you are drawing a distribution of all students in your mind and implicitly indicating the position of your child towards the left of that distribution. Whenever we use adjectives like rich, poor, tall, handsome, beautiful, intelligent, high cost of living, etc., we subconsciously associate a distribution with the adjective and simply try to pinpoint the position of a given subject on that distribution.

What we are dealing with here is called inferential statistics, because it helps in drawing inferences about a population based on sample data. This is just the opposite of probability, as shown below. Figure-2: Difference between probability & statistics

Inferential statistics empowers us to take a decision based on a small sample drawn from a population.

Why is it so difficult to take decisions, or what causes this difficulty?

This is because we are dealing with samples instead of the population. Let's assume we are making a batch of one million tablets (the population) of Lipitor, and before releasing this batch into the market we want to make sure that each tablet has a Lipitor content of 98-102%. Can we analyze all one million tablets? Absolutely not! What we actually do is analyze, say, 100 tablets (a sample) selected at random from the one million tablets, and based on the results we accept or reject the whole lot of one million tablets (we usually use a z-test or t-test for taking such decisions).

BUT there is a catch. Since we are working with small samples, there is always a chance of taking a wrong decision, because the sample thus selected may not be representative enough of the entire population (sampling error). This error is denoted by alpha (α) and is decided by the management prior to performing any study, i.e., we are accepting an error of α. It means there is a probability of α that we are accepting a failed batch of Lipitor. Since α is a theoretical threshold limit, it must be vetted by some experimental probability value. This experimental or observed probability value is called the p-value (see the blog on the p-value).

Another aspect of the above discussion arises if we draw two or more samples (of 100 tablets each) and try to analyze them. Let me make it more complicated for you: you are the analyst, and I come to you with three samples and want to know whether these three samples come from a single batch (i.e., belong to the same parent population) or not. The point I want to emphasize is that even though multiple samples are drawn from the same population, they will seldom be exactly the same, because of sampling error. The concept is described in figure-3. This type of decision, where the number of samples ≥ 3, is taken using ANOVA. Figure-3: The distribution overlap and the decision making (or inferential statistics)
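For the three-sample question, the ANOVA decision boils down to an F statistic (the ratio of between-group to within-group variation). A minimal sketch with hypothetical assay numbers; a full test would compare F against the F-distribution critical value, which needs a statistics library:

```python
from statistics import mean

def one_way_anova_F(*groups):
    """Compute the one-way ANOVA F statistic for k groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)        # between-group mean square
    ms_within = ss_within / (n_total - k)    # within-group mean square
    return ms_between / ms_within

# Hypothetical assay results (%) from three samples of the same batch
s1 = [99.1, 100.2, 99.8, 100.5, 99.4]
s2 = [99.9, 100.1, 99.6, 100.3, 100.0]
s3 = [99.5, 100.4, 99.7, 100.2, 99.8]
F = one_way_anova_F(s1, s2, s3)
print(round(F, 3))
```

A small F, well below the critical value (roughly 3.9 here at α = 0.05 with 2 and 12 degrees of freedom), supports the view that all three samples come from the same batch.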

We have seen earlier that α is a theoretical probability or threshold limit beyond which we assume that the process is no longer the same. This theoretical limit is then tested by collecting a dataset and performing a statistical test (t-test, z-test, etc.) to obtain an experimental or observed probability value, the p-value; if this p-value is found to be less than α, we say that the samples come from two different populations. This concept is represented below. Figure-4: The relationship between the p-value and the alpha value for taking statistical decisions.

Let's remember the above diagram and try to visualize some more situations that we face every day where we are supposed to take decisions. But before we do that, one important point: we must identify the target population correctly, otherwise the whole exercise would be futile.

For example

As a high-end apparel store, I am interested in the monthly expenditure of females. But wait a second, shouldn't we specify what kind of females? Yes, we need to study females of the following two categories:

The employed and the self-employed females (great! at least we have identified the population categories to be compared). The next dilemma is whether to consider females of all age groups or only females below a certain age. As my store is more interested in young professionals, I would compare the above two groups of females with an age restriction of less than or equal to thirty years. Figure-5: Identifying the right population for study is important

Another important point: in order to compare two samples (using a z-test or t-test) or more (using ANOVA), we also require information about the mean and standard deviation of the samples before we can tell whether they come from the same or different parent populations.

For example, suppose the mean monthly expenditure on apparel by a sample of 30 employed females is \$1500 and the mean expenditure by 30 self-employed females is \$1510. Immediately we will compare these two means and conclude that they are almost the same. At the back of our minds we are assuming that even though the means differ, there will be some variation in the data, and if we consider this variation then the difference is not significant. Remember! We have made some kind of distribution in our mind before making this statement. (Statistically, we do this with a two-sample t-test.) Figure-6: Significant overlap between distributions, indicating no difference between them

What if the mean expenditure by self-employed females is \$1525? We can still say it's not a big enough difference to be significant (again assuming there will be variability in the data). But what if the mean expenditure by self-employed females is \$1600? In this case we are fairly sure that the difference is significant. In all three cases discussed above, it is assumed that the variance remains constant. Figure-7: Insignificant overlap between distributions, indicating that there is a difference between them

In real life, whenever we encounter two samples, we are tempted to compare the means directly for taking decisions. But in doing so we forget to consider the standard deviation (variation) present in the data of the two samples. If we consider the standard deviation and then find that there is no significant overlap between the distribution of the monthly expenditure by employed females and that by self-employed females, we can conclude that the expenditure behavior of the two groups is different (see figures 6 & 7 above).
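The \$1510-versus-\$1600 intuition can be quantified with a two-sample t statistic built from summary statistics. In this sketch the standard deviation of \$100 per group is an assumption (the article gives only the means):

```python
from math import sqrt

def t_from_summary(m1, s1, n1, m2, s2, n2):
    """Two-sample t statistic from summary statistics (Welch form)."""
    return (m1 - m2) / sqrt(s1**2 / n1 + s2**2 / n2)

# The article's scenario, with an assumed standard deviation of $100 per group
t_small = t_from_summary(1500, 100, 30, 1510, 100, 30)  # means $10 apart
t_large = t_from_summary(1500, 100, 30, 1600, 100, 30)  # means $100 apart
print(round(t_small, 2), round(t_large, 2))
```

With roughly 58 degrees of freedom the two-sided critical value at α = 0.05 is about 2.0, so the \$10 gap (|t| ≈ 0.39) is not significant while the \$100 gap (|t| ≈ 3.87) is.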

Some other situations can also be understood by drawing the distributions; it will help us comprehend the situation in a much better way.

The women workforce is protesting that there is gender bias in the pay scale in your company. Is it so?

Once again, be careful about selecting the population for the study! We should only compare males and females of the same designation or with the same work experience. Let's take designation (males and females at the manager and senior manager level) as the criterion for comparison. Since we have identified the population, we can now select random samples from both genders at the manager and senior manager level. We can have two situations: either the two distributions overlap or they do not. If there is significant overlap (p-value > α), then there is no difference in salary based on gender. On the other hand, if the two distributions are far apart (p-value < α), then there is a gender bias. Figure-8: Intuitive scenarios for taking decisions, based on the degree of distribution overlap

Our new gasoline formula gives better mileage than the other types of gasoline available in the market. Should we start selling it?

This problem can be visualized by the following diagram. But be careful! While measuring mileage, make sure you are taking the same kind of car, testing on the same road, and running for the same number of kilometers at the same constant speed! Since the number of samples is ≥ 3, use ANOVA. Figure-9: Understanding the gasoline efficiency using distribution

A new filament increases the lifetime of a bulb by 10%. Should we commercialize it?

For this problem, let's produce two sets of bulbs: the first set with the old filament and the second set with the new filament. This is followed by testing samples from each group for their lifespan; what we expect is represented below. Figure-10: Understanding the filament efficiency using distribution

A new catalyst developed by the R&D team can increase the yield of the process by 5%. Should we scale up the process?

Here we need to establish whether the 5% increase in yield is really significant or not. Can this case be represented by case 1 or by case 2 in the diagrams above?

The efficacy of a new drug is 30% better than that of the existing drug in the market. Is it so?

A soap manufacturing plant finds that some of the soaps weigh 55 g instead of 50-53 g (the target weight). Should the process be reset for corrective action?

A new production process claims to reduce the manufacturing time by 4 hrs, should we invest in this new process?

The students of the ABC management school are offered better salaries than those of the XYZ school. Is it so? Colleges advertise like that!

Let's have a look at how the data is usually manipulated here. In order to promote a brand, companies usually distort the distribution when they compare their products with other brands. Figure-11: Misuse of statistics

ABC College, or any other company promoting its brand, would take samples from the upper band of its own distribution and then compare them with the distribution of XYZ College or the other available brands. This gives the feeling that ABC College, or the brand in question, is better than the others. Alternatively, you can take the competitor's samples from the lower end of their distribution for the comparison, to get that feel-good factor about your brand!

The yield of a process has decreased from 90% to 87%. Should we take it up as a six sigma project?

Again, we need to establish whether the decrease of 3% in yield is really significant or not. Can this case be represented by case 1 or by case 2 in the diagrams above?

If we look at the situations described in points 4-8 above, we are forced to think: what is the minimum separation required between the means of two samples to tell whether there is significant overlap or not? Figure-12: What should be the minimum separation between distributions?

This is usually done in the following steps (dealt with separately in the next blog, on hypothesis testing):

1. Hypothesis Testing
1. Null and alternate hypothesis
2. Decide α
2. Test statistics
1. Use appropriate statistical test to estimate p-value like Z-test, t-test, F-test etc.
3. Compare p-value and α
4. Take decision based on whether p-value is < or > α

Concept of distribution and the hypothesis testing

Let's see how the above concept of distribution helps in understanding hypothesis testing. In hypothesis testing we make two statements about the same population based on the sample. These two statements are known as the "null" and "alternate" hypotheses.

Null Hypothesis (H0): Mean mileage from a liter of new gasoline ≤ 20 Km (first distribution)

Alternate Hypothesis (Ha): Mean mileage from a liter of new gasoline > 20 Km (second distribution)

The above two statements can be represented by the following two distributions.

Figure-13: Distributions of null and alternate hypothesis

Now, if H0 is true, i.e. the new gasoline is no better than the existing one, then we would expect the two distributions to overlap significantly (p-value > α).

Figure-14: Pictorial view of the condition when the null hypothesis is true

On the other hand, if H0 is false, or Ha is true (the new gasoline is really better than the existing one), then the two distributions will be far apart, i.e. there will be no significant overlap between them (p-value < α).

Figure-15: Pictorial view of the case when the null hypothesis is not true

The above discussion can be extended to understand ANOVA, regression analysis, etc.

Summary

This article tries to give a pictorial view of a given statistical problem; we can call it "The Tale of Two Distributions".

Any business problem that requires decision making can be visualized in the form of overlapping or non-overlapping distributions. This gives management a pictorial view of the problem and makes it easier to comprehend.

Another important point here is the exercise of identifying the right target population, i.e. we must make sure that an apple is compared to an apple!

Going forward, this understanding will help you with hypothesis testing in the upcoming blog.

## Is it Difficult for you to Comprehend the Concept of Confidence Interval? Try this out

Abstract:

I found it very difficult to comprehend the concepts of sampling, the sampling distribution of the mean, and the confidence interval. These concepts play a very important role in inferential statistics, which is an integral part of the Six Sigma toolkit. This is an attempt to simplify the concept of the confidence interval, or simply CI.

In practical life, we need to take decisions about a population based on the analysis of a sample drawn from that population. For example, a batch of one million tablets is to be qualified by QA based on a sample of, say, 50 tablets. The confidence interval from the sample enables us to have an interval estimate that may contain the population parameter, with some degree of error α.

If we assume that the population average is a golden fish that we want to catch from a pond using 100 nets of different sizes (each net equivalent to a CI), then 95 of those 100 nets would capture the golden fish if we allow an error of 5%.

Introduction to confidence interval

Let's assume that we are in the fishing business and have our own farm where we raise fish right from the egg stage to maturity. Traditionally, we incubate the eggs for 10-15 days to get the larvae, which are then transferred to the juvenile tank, where they are fed and monitored for another 3 months. Once they are 3 months old, all the fish are transferred from the juvenile tank to a larger tank, where they are kept on a different diet for another 3 months. After this, all the fish are sold to the wholesaler.

Now, the problem is that we would like to sell only those fish weighing around 925-950 Gm, in order to maximize the profit per fish. Any fish lighter than that is a loss to us, as we would be giving away a larger number of fish for a given order from the wholesaler. You could say: let's weigh each and every fish and sell only those between 925-950 Gm. Yes, that could be a solution but, imagine the effort required to take out every fish, weigh it, and arrange to keep it in some other tank. This is very difficult as, at any given time, we have 1000-1500 fish in the tank. Is there a way to estimate the average weight of all the fish in the tank?

Yes, Statistics does that. But how?

Statistically, it can be done by drawing a random sample, calculating the average of the sample, and then using it to estimate the average of the population. But we need to understand a very important point here: there is a chance that the sample thus collected does not represent the entire population of fish (this gives rise to sampling error); hence, there will be some margin of error in estimating the population average.

Therefore, the average weight of all the fish in the pond can be expressed as

Average weight of all fish = average weight of the sample ± margin of error

Or simply

Population parameter = sample's average ± margin of error          (Equation-1)

Sample Average: This average varies from sample to sample but, as the sample size increases, the average of the sample gets closer to the population average.

Margin of Error: The margin of error is calculated statistically. For now, it is sufficient to know that the margin of error is directly proportional to the standard deviation of the sample and inversely proportional to the square root of the sample size. This means that if we want a narrow interval for estimating the population average, we need a larger sample with small variation. The term s/√n is generally known as the standard error.

There is also a statistical constant involved in the calculation of the margin of error, called the critical t-value. This critical t-value depends on the presumed error α and the degrees of freedom.

What is this error α?

This error, or risk, is denoted by α. In the fish example above, suppose we draw 100 samples of 5 fish each and calculate all 100 intervals using Equation-1. We assume that all these intervals contain the population average, but there is a chance that some intervals do not, because we are working with samples and sampling error is bound to happen. So we accept an error, or risk, α: out of the 100 intervals (calculated from the 100 samples), some intervals will not contain the population average. This α is expressed as a percentage or as a probability. For example, α = 0.05 means there is a 5% chance, or 0.05 probability, that 5 out of the 100 intervals thus calculated will not contain the population average. This α is decided prior to starting any experimentation with samples. It is purely a business decision based on the risk appetite of the company. We can work with α = 0.05, 0.1 or 0.15; generally, we work with 0.05.

Once we have defined α, we can now discuss t-critical.

The t-critical value is the threshold value (like the z-value discussed earlier) on the t-distribution beyond which the process is no longer considered the same, i.e. if the observations fall in the region below t-critical (in the figure below), we would say that the samples come from the same parent population; otherwise, they come from a different population. This t-value is characterized by two parameters, α and the degrees of freedom (df = number of observations − 1, or simply n − 1), and its value can be obtained from the t-distribution table.

When to take α or α/2?

As we have allowed a total acceptable error of 5%, there is a chance that the calculated interval misses the population parameter on either side. Hence, the total error is distributed equally at both ends of the interval. For example, say the calculated interval is 915 to 935 Gm. There is a chance that the actual population mean is less than 915 or more than 935. So if the total error we started with is 5%, then 2.5% is allotted to each end. This is the case of a two-tailed test, and the error on a single side is represented by α/2. If it had been a one-tailed test, there would be no need to split the error, and it would be represented by α.

Now we are ready to estimate the population parameter.

Average population parameter = sample's average ± margin of error

Now we can see that if we want a narrower interval, we need to decrease the margin of error; for that, we have to increase the sample size or decrease σ. Since controlling σ is not in our hands (it is a characteristic of the random sample), we can increase the sample size in order to get closer to the population parameter.
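The 1/√n behaviour of the margin of error can be seen in a tiny sketch. This is illustrative only: 1.96 (the two-tailed z-critical value for α = 0.05) is used here in place of a t-value for simplicity, and the standard deviation 27.97 is borrowed from the fish example below.

```python
from math import sqrt

def margin_of_error(s, n, crit=1.96):
    """Margin of error = critical value x standard error (s / sqrt(n))."""
    return crit * s / sqrt(n)

s = 27.97                          # sample standard deviation (illustrative)
moe_5  = margin_of_error(s, 5)     # margin with n = 5
moe_20 = margin_of_error(s, 20)    # quadrupling n halves the margin
```

Quadrupling the sample size halves the margin of error, which is why larger samples give narrower confidence intervals.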

The interval calculated above is called the CONFIDENCE INTERVAL, or simply CI.

The concept of the CI obtained from a group of samples is illustrated below.

Example:

We have no idea about the average weight of all the fish in the pond, nor about the standard deviation of their weights. In that case, we have to estimate the average weight of all the fish in the pond based on the sample's average and its standard deviation.

Let's take a first sample of five fish from the pond and calculate its CI.

Methodology to be applied:

Average weight of five fish from the first sample = 928 Gm

Standard deviation of five fish from the first sample = 27.97 Gm

Now calculate t(α/2, df).

As α/2 = 0.025 and df = 5 − 1 = 4, from the t-distribution table, t(0.025, 4) = 2.776.

Margin of error = 2.776 × 27.97/√5 ≈ 34.7 Gm

Confidence interval of the first sample = 928 ± 34.7 = 893 to 963 Gm

Inference from the above CI:

From the first sample, we got a CI of 893 to 963, which means that the average weight of all the fish in the pond lies between 893 and 963 Gm. But we still can't pinpoint the exact average of all the fish in the pond!
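The interval above can be reproduced with a short script. A minimal sketch using Python's standard library; the t-critical value 2.776 for α/2 = 0.025 and df = 4 is taken from the t-table.

```python
from math import sqrt

xbar, s, n = 928, 27.97, 5       # mean, sd and size of the first fish sample
t_crit = 2.776                   # t(0.025, df = 4) from the t-distribution table

moe = t_crit * s / sqrt(n)       # margin of error = t x s / sqrt(n)
ci_low, ci_high = xbar - moe, xbar + moe   # approximately 893 to 963 Gm
```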

Further, if we draw 99 more samples of 5 fish each and calculate the corresponding CIs, we will find that about 95 of the 100 CIs contain the population average; only about 5 CIs do not. But we still can't pinpoint the exact average of all the fish in the pond!
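The "95 out of 100 intervals" behaviour can be checked by simulation. A sketch assuming a hypothetical pond with true mean 925 Gm and σ = 28 Gm (both invented for illustration):

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(7)
mu, sigma, n, t_crit = 925, 28, 5, 2.776   # assumed pond parameters; t(0.025, 4)

trials, covered = 1000, 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]   # draw 5 fish
    m, s = mean(sample), stdev(sample)
    moe = t_crit * s / sqrt(n)
    if m - moe <= mu <= m + moe:        # does this CI capture the true mean?
        covered += 1

coverage = covered / trials             # expected to be close to 0.95
```

Over many repetitions, roughly 95% of the intervals "catch the golden fish", which is exactly what the 95% confidence level means.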

Let's draw some more samples and calculate their CIs.

How can we use this concept in production?

Suppose we made a lot of a product (say a million tablets, bulbs, etc.) and QA needs to qualify that batch. QA takes a random sample and calculates its CI. If this CI contains the specified population mean (the specification), the lot is passed. Have a look at the following blogs for the application part.

Related Blog for the utility of CI

How to provide a realistic range for a CQAs during product development to avoid unwanted OOS-1.

How to provide a realistic range for a CQAs during product development to avoid Unwanted OOS-2 Case Study

But always remember this

If we assume that the population average is a golden fish that we want to catch from a pond using 100 nets of different sizes (each net equivalent to a CI), then 95 of those nets (out of 100 nets) would capture the golden fish if we consider an error of 5%.

###### Common Misconception about CI

The biggest mistake we make while interpreting confidence intervals is to think that a CI represents the percentage of the data from a given sample that falls between two limits. For example, in the example above, the first CI was found to be 893-963 Gm. People would mistakenly assume that there is a 95% chance that the mean of all the fish falls within this particular range. This is incorrect! The 95% refers to the long-run proportion of such intervals that capture the population mean, not to any single interval.

The following books give an excellent presentation of the confidence interval and the sampling distribution of the mean through cartoons.

## Why Standard Normal Distribution Table is so important?

###### Abstract

Since we are entering the technical/statistical part of the subject, it would be better for us to understand the concept first.

For many business decisions, we need to calculate the likelihood, or probability, of an event occurring. A histogram along with the relative frequencies of a dataset can be used to some extent. But for every problem we come across, we would need to draw the histogram and compute the relative frequencies to find the probability using the area under the curve (AUC).

In order to overcome this limitation, the standard normal distribution, or Z-distribution (a Gaussian distribution with mean 0 and standard deviation 1), was developed, and the AUC, or probability, between any two points on this distribution is well documented in statistical tables and can also easily be found using an Excel sheet.

But in order to use the standard normal distribution table, we first need to convert the parent dataset (irrespective of its unit of measurement) into the standard normal distribution using the Z-transformation. Once that is done, we can look up the standard normal distribution table to calculate the probabilities. From my experience, I found that books in the category "statistics for business & economics" are much better for understanding Six Sigma concepts than a pure statistics book. Try any of these books as a reference guide.

Introduction

Let's understand this with an example.

A company is trying to write a job description for a manager-level position, and the most important criterion is the years of experience a person should possess. They collected a sample of ten managers from their company; the data is tabulated below along with its histogram. As an HR person, I want to know the mean years of experience of a manager and the various probabilities discussed below.

Average experience = 3.9 years

What is the probability that X ≤ 4 years?

What is the probability that X ≤ 5 years?

What is the probability that 3 < X ≤ 5 years?

In order to calculate the above probabilities, we need to calculate the relative frequency and the cumulative frequency.

Now we can answer the above questions:

What is the probability that X ≤ 4 years? = 0.7 (see cumulative frequency)

What is the probability that X ≤ 5 years? = 0.9

What is the probability that 3 < X ≤ 5 years? = (probability X ≤ 5) − (probability X ≤ 3) = 0.9 − 0.3 = 0.6, i.e. 60% of the managers have between 3 and 5 years of experience.
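The relative- and cumulative-frequency calculations above can be scripted. The frequency table below is my reconstruction from the cumulative frequencies quoted (1, 2, 4, 2 and 1 managers with 2-6 years of experience respectively); treat it as an assumption, since the original table is not reproduced here.

```python
# Assumed frequency table reconstructed from the histogram: years -> count
freq = {2: 1, 3: 2, 4: 4, 5: 2, 6: 1}
total = sum(freq.values())          # 10 managers in the sample

def p_at_most(x):
    """Cumulative relative frequency P(X <= x)."""
    return sum(c for yrs, c in freq.items() if yrs <= x) / total

p_le_4 = p_at_most(4)                    # P(X <= 4) = 0.7
p_le_5 = p_at_most(5)                    # P(X <= 5) = 0.9
p_3_to_5 = p_at_most(5) - p_at_most(3)   # P(3 < X <= 5) = 0.6
```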

Area under the curve (AUC) as a measure of probability:

Width of a bar in the histogram = 1 unit

Height of the bar = frequency of the class

Area under the curve for a given bar = 1 × frequency of the class

Total area under the curve (AUC) = total area under all bars = 1×1+1×2+1×4+1×2+1×1 = 10

Fraction of the AUC for the class 3 < x ≤ 5 = (AUC of the 3rd class + AUC of the 4th class)/total AUC = (4 + 2)/10 = 0.6 = probability of finding x between 3 and 5 (excluding 3)

Now, what about the probability of (3.2 < x ≤ 4.3)? It would be difficult to calculate by this method, as it requires the use of calculus.

Yes, we could use calculus for calculating various probabilities, or AUCs, for this problem. But are we going to repeat this whole exercise for each and every problem we come across?

With God’s grace, our ancestors gave us the solution in the form of Z-distribution or Standard normal distribution or Gaussian distribution, where the AUC between any two points is already documented.

This standard normal distribution is widely used in scientific measurements and for drawing statistical inferences. The normal curve is perfectly symmetrical and bell shaped.

The standard normal probability distribution has the following characteristics:

1. The normal curve is defined by two parameters, µ = 0 and σ = 1, which determine the location and shape of the distribution.
2. The highest point on the normal curve is at the mean, which is also the median and the mode.
3. The normal distribution is symmetrical, and the tails of the curve extend to infinity, i.e. they never touch the x-axis.
4. Probabilities of the normal random variable are given by the AUC. The total AUC for the normal distribution is 1; the AUC to the right of the mean = the AUC to the left of the mean = 0.5.
5. The percentage of observations within a given interval around the mean in a standard normal distribution is shown below.

The AUC for the standard normal distribution has been calculated for every value of z and is available in tables that can be used for calculating probabilities. Note: be careful whenever you use such a table, as some tables give the area for ≤ z and some give the area between two z-values.

Let's try to calculate some probabilities using the above table.

Problem-1:

Probability p(z ≥ 1.25). This problem is depicted below. Look for z = 1.2 in the vertical column and then for 0.05 (the second decimal place) in the horizontal row of the z-table: p(z ≤ 1.25) = 0.8944.

Note! The z-distribution table given above gives the cumulative probability p(z ≤ 1.25), but here we want p(z ≥ 1.25). Since the total probability, or AUC, is 1, p(z ≥ 1.25) is given by 1 − p(z ≤ 1.25).

Therefore

p(z ≥ 1.25) = 1 − p(z ≤ 1.25) = 1 − 0.8944 = 0.1056

Problem-2:

Probability p(z ≤ -1.25). This problem is depicted below. Note! The above z-distribution table doesn't contain -1.25, but p(z ≤ -1.25) = p(z ≥ 1.25), as the standard normal curve is symmetrical.

Therefore

Probability p(z ≤ -1.25) = 0.1056

Problem-3:

Probability p(-1.25 ≤ z ≤ 1.25). This problem is depicted below. For obvious reasons, this can be calculated by subtracting the AUC of the two yellow tail regions from one.

p(-1.25 ≤ z ≤ 1.25) = 1- {p(z ≤ -1.25) + p(z ≥ 1.25)} = 1 – (2 x 0.1056) = 0.7888
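All three lookups can be reproduced without a printed table using Python's standard library (note that the last value differs from the table-based 0.7888 in the fourth decimal place, because table entries are rounded):

```python
from statistics import NormalDist

z = NormalDist()                         # standard normal: mean 0, sd 1

p_upper = 1 - z.cdf(1.25)                # p(z >= 1.25)  ~ 0.1056
p_lower = z.cdf(-1.25)                   # p(z <= -1.25) ~ 0.1056 (symmetry)
p_between = z.cdf(1.25) - z.cdf(-1.25)   # p(-1.25 <= z <= 1.25) ~ 0.789
```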

From the above discussion, we learnt that a standard normal distribution table (which is readily available) can be used for calculating probabilities.

Now comes the real problem! Somehow I have to convert my original dataset into the standard normal distribution, so that calculating any probability becomes easy. In simple words, my original dataset has a mean of 3.9 years with σ = 1.37 years, and we need to convert it into the standard normal distribution with a mean of 0 and σ = 1.

The formula for converting any normal random variable x with mean µ and standard deviation σ to the standard normal distribution is the Z-transformation, and the value so obtained is called the z-score:

z = (x − µ)/σ

Note that the numerator in the above equation is the distance of a data point from the mean. This distance is then divided by σ, giving the distance of the data point from the mean in terms of σ, i.e. now we can say that a particular data point is 1.25σ away from the mean. The data also becomes unit-less! Let's do it for the example discussed earlier.

Note: strictly, the z-distribution table should only be used when the number of observations is ≥ 30. Here we are using it just to demonstrate the concept; we should really be using the t-distribution in this case.

We can say that managers with 4 years of experience are 0.073σ above the mean (on the right-hand side), whereas managers with 3 years of experience are 0.657σ below the mean (on the left-hand side).
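A quick sketch of the Z-transformation for these two data points:

```python
mu, sigma = 3.9, 1.37      # sample mean and sd of years of experience

def z_score(x):
    """Distance of x from the mean, expressed in units of sigma."""
    return (x - mu) / sigma

z4 = z_score(4)    # ~ 0.073 sigma above the mean
z3 = z_score(3)    # ~ -0.657 sigma, i.e. below the mean
```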

Now, if you look at the distribution of the z-scores, it resembles the standard normal distribution with mean = 0 and standard deviation = 1.

But one question still needs to be answered: what is the advantage of converting a given dataset into the standard normal distribution?

There are three advantages. First, it enables us to calculate the probability between any two points instantaneously. Second, once you convert your original data into the standard normal distribution, you end up with a unit-less distribution (both the numerator and denominator in the Z-transformation formula have the same units)! Hence, it makes it possible to compare an orange with an apple. For example, suppose I wish to compare the variation in the salary of employees with the variation in their years of experience. Since salary and experience have different units of measurement, it is not possible to compare them directly but, once both distributions are converted to the standard normal distribution, we can compare them (as both are now unit-less).

The third advantage is that, while solving problems, we needn't convert everything to z-scores, as explained by the following example.

One hundred historical batches from the plant have given a mean yield of 88% with a standard deviation of 2.1. Now I want to know various probabilities.

Probability of batches having yield between 85% and 90%

Step-1: Transform the yield (x) data into z-scores

What we are looking for is the probability of a yield between 85 and 90%, i.e. p(85 ≤ x ≤ 90).

Step-2: Always draw a rough standard normal curve and mark the area you are interested in.

Step-3: Use the Z-distribution table for calculating the probabilities.

The Z-distribution table given above can be used in the following way to calculate p(−1.43 ≤ z ≤ 0.95), where −1.43 = (85 − 88)/2.1 and 0.95 = (90 − 88)/2.1.

Diagrammatically, p(−1.43 ≤ z ≤ 0.95) = p(z ≤ 0.95) − p(z ≤ −1.43) is represented below:

p(−1.43 ≤ z ≤ 0.95) = p(z ≤ 0.95) − p(z ≤ −1.43) = 0.83 − 0.076 = 0.75

So 75% of the batches, or equivalently a probability of 0.75, will have a yield between 85 and 90%.

It can also be interpreted as "the probability of getting a yield between 85 and 90%, given that the population mean is 88% with a standard deviation of 2.1".

Probability of yield ≥ 90%

What we are looking for is the probability of yield ≥ 90%, i.e. p(x ≥ 90) = p(z ≥ 0.95).

p(x ≥ 90) = p(z ≥ 0.95) = 1 − p(z ≤ 0.95) = 1 − 0.83 = 0.17, i.e. there is only a 17% probability of getting a yield ≥ 90%.

Probability of yield ≤ 90%

This is very easy; just subtract p(x ≥ 90) from 1.

Therefore,

p(x ≤ 90) = 1 − p(x ≥ 90) = 1 − 0.17 = 0.83, i.e. 83% of the batches will have a yield ≤ 90%.
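The three yield probabilities can be verified in one go. NormalDist accepts the original units directly (it performs the z-transformation internally), which illustrates the point that we needn't convert everything to z-scores by hand:

```python
from statistics import NormalDist

yield_dist = NormalDist(mu=88, sigma=2.1)   # historical yield distribution

p_85_to_90 = yield_dist.cdf(90) - yield_dist.cdf(85)  # p(85 <= x <= 90) ~ 0.75
p_ge_90 = 1 - yield_dist.cdf(90)                      # p(x >= 90) ~ 0.17
p_le_90 = yield_dist.cdf(90)                          # p(x <= 90) ~ 0.83
```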

Now let's work the problem in reverse: I want to know the yield corresponding to a cumulative probability of 0.85.

Graphically, it can be represented as shown below. Since the table we are using gives probabilities for values ≤ z, we first need to find the z-value corresponding to a probability of 0.85. Looking into the z-distribution table for a probability close to 0.85, the probability of 0.8508 corresponds to a z-value of 1.04.

Now we have a z-value of 1.04 and we need to find the corresponding x-value (i.e. yield) using the Z-transformation formula: 1.04 = (x − 88)/2.1. Solving for x:

x = 90.18

Therefore, there is a 0.85 probability of getting a yield ≤ 90.18% (as the z-distribution table we are using gives the probability for ≤ z); hence, there is only a 0.15 probability that the yield will be greater than 90.18%.
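The reverse lookup can be done with the inverse CDF instead of scanning the table:

```python
from statistics import NormalDist

yield_dist = NormalDist(mu=88, sigma=2.1)

# Yield value below which 85% of batches fall; matches the table-based 90.18
x = yield_dist.inv_cdf(0.85)
```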

The above problem can be represented by the following diagram.

Exercise:

The historical data shows that the average time taken to complete the BB exam is 135 minutes with a standard deviation of 15 minutes.

Find the probability that

1. The exam is completed in less than 140 minutes
2. The exam is completed between 135 and 145 minutes
3. The exam takes more than 150 minutes
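One way to check your answers afterwards (a sketch using Python's standard library):

```python
from statistics import NormalDist

exam = NormalDist(mu=135, sigma=15)    # exam completion time, minutes

p1 = exam.cdf(140)                     # P(time < 140)
p2 = exam.cdf(145) - exam.cdf(135)     # P(135 < time < 145)
p3 = 1 - exam.cdf(150)                 # P(time > 150)
```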

Summary:

This article shows the limitations of the histogram and relative-frequency methods for calculating probabilities, as we would need to draw them for every problem. To overcome this, a standardized method using the standard normal distribution is adopted, where the AUC between any two points on the curve gives the corresponding probability, which can easily be calculated using an Excel sheet or a z-distribution table. The only thing we need to do is convert the given data into the standard normal distribution using the Z-transformation. This also enables us to compare two unrelated things, as the Z-distribution is unit-less with mean = 0 and standard deviation = 1. If the population standard deviation is known, we can use the z-distribution; otherwise, we have to work with the sample's standard deviation and use Student's t-distribution.

## Why Does the Pharmaceutical Industry Require Quality by Design (QbD)?

(This article is a part of the PhD thesis of Mr. Abdul Qayum Mohammed, who is my PhD student)

Authors:

Abdul Qayum Mohammed, Phani Kiran Sunkari, Amrendra Kumar Roy*

*Corresponding Author, email: Amrendra@6sigma-concepts.com

KEYWORDS: QbD, 4A’s, DoE, FMEA, Design space, control strategy

ABSTRACT

QbD is of paramount importance for patient safety, but there is another side of the coin: QbD is also required for the timely and uninterrupted supply of medicines to the market. This timely, uninterrupted supply is needed to fulfill the 4A's requirement of any regulatory body, as that is their main KRA. The manufacturers, however, are given the impression that patients are their main customer, which is not true; as a result, QbD implementation by generic API manufacturers has not picked up. This article argues that the real customer is not the patient but the regulatory bodies, who deal with the manufacturer on the patients' behalf. Hence, regulators need to tell manufacturers that QbD is required not only for patient safety but also for meeting the 4A's requirement, which is equally important. This article correlates the effect of an inconsistent manufacturing process with the KRA of the regulatory bodies and makes a business case out of it. It will help in developing a strong customer-supplier relationship between the two parties and can trigger the smooth acceptance of QbD by generic players. This article also presents the detailed sequence of steps involved in QbD using a process flow diagram.

Introduction:

Nowadays, quality by design (QbD) is an essential and interesting topic in pharmaceutical development, be it for a drug substance or a drug product. Various guidelines have been published by different regulatory agencies.[i] There is a plethora of literature available on the QbD approach to process development[ii],[iii] for drug substances and drug products, and to analytical method development[iv]. Most of the available literature focuses mainly on patient safety (QTPP), but if QbD is to sail through, then generic manufacturers must know why, and for whom, it is required (apart from patients), and what is in it for them. They should not take regulators as an obstacle to their business but as a part of the business itself. There has to be a business perspective behind QbD, as everything in this world is driven by economics. It has to be a win-win situation for the regulators and the manufacturers, which means that each other's expectations have to be synchronized. This synchronization will be most effective if API manufacturers (i.e. suppliers) consider regulators as their customer and try to understand their requirements. In this context, it is very important to understand the regulators' expectations and their responsibility towards their fellow countrymen.

Regulators Expectations:

The sole responsibility of any regulator towards its country is to ensure not only acceptable (quality, safety and efficacy) and affordable medicines, but also their availability (no shortage) in the country at all times. Even that is not enough; those medicines must also be easily accessible to patients at their local pharmacies. These may be called the 4A's (acceptable, affordable, available, accessible) and are the KRA of any regulatory body. If regulators miss any one of the 4A's, they will be held accountable by their government for endangering the lives of patients.

In the earlier days, when health services had not penetrated large sections of society, the main focus of regulators was on the quality and price of medicines. In those days margins were quite high, and the effect of reprocessing and rework on manufacturers' margins was small, so regulators were happy: they were getting good quality at the best price for their citizens. Gradually, health services penetrated larger sections of society in developed countries, which consequently needed more and more medicines at affordable prices. The KRA of the regulators changed from "high quality at a low price" to "quality medicine at an affordable price, available all the time at the patient's doorstep". Another event that led to further cost erosion was the arrival of medical insurance and tender-based purchasing systems in hospitals. Increased demand made manufacturers increase their batch sizes but, because of the insurance and tender-based purchasing systems, they no longer had the advantage of high margins and could not afford batch failures or reprocessing anymore. These wastages led to erratic production and an irregular supply of medicines in the market, thereby creating shortages. This affected the KRA (4A's) of the regulatory bodies; hence, they were forced to intervene in the suppliers' systems. They realized that ensuring their 4A's requires a robust process at the manufacturer's site: with that in place, medicines would automatically be available in the country (no shortages) and accessible to all patients at an affordable price. Such process robustness is possible with the use of proven statistical tools like Six Sigma and QbD during the manufacturing of an API.
This path to a robust process was shown by the regulators in the form of the Q8/Q9/Q10/Q11 guidelines1, where QbD was made mandatory for formulators; it is strongly recommended for API manufacturers and may soon be made mandatory for them too. While making QbD mandatory, the regulators emphasized how QbD is related to patient safety and how it makes the process robust for manufacturers, which in turn would eliminate the fear of audits. The regulators are right, but somewhere they failed to communicate the business perspective behind QbD implementation, i.e. manufacturers had little clue about the regulators' KRA, and as a result a customer-supplier relationship never developed.

Figure 1: Regulator's unsaid expectations

Manufacturer’s point of view

While the regulators were insisting on QbD, manufacturers had their own constraints in the plant due to the inconsistency of the process (Figure 2). As the regulators' emphasis was on patient safety rather than the 4A's, manufacturers took patients as their customer instead of the regulators and made sure that there was no compromise on the quality of the medicines, in order to delight that customer, i.e. the patients. It did not matter to the manufacturer if the quality was achieved by reprocessing or rework, as long as the material was of acceptable quality to the customer. Due to this misconception about who the real customer is, the 4A's got neglected by the manufacturer. Another problem is the definition of quality as perceived by the two parties. The quality of an API from the customer's perspective has always been defined with respect to patient safety (i.e. the QTPPs, which are indeed very important), but for the manufacturer quality meant only the purity of the product, as he enjoyed a handsome margin.

Profit = MP − COGS                (Eq-1)

MP = market price

COGS = genuine manufacturing cost + waste cost (COPQ)

COPQ = cost of variation, batch failures, reprocessing & rework, and product recalls = an increase in drug product/drug substance cost = loss of customer faith (an intangible cost)

Coming to the prevailing market scenario, manufacturers don't have the luxury of defining the selling price; the market is very competitive, and the prices of goods and services are dictated by the market. Hence, it is called the market price (MP) instead of the selling price (SP). This led to a change in the perception of quality: quality is now defined as producing goods and services that meet the customer's specification at the right price. Manufacturers are forced to sell their goods and services at the market rate, and as a result profit is now defined as the difference between the market price and the cost of goods sold (COGS), as in Eq-1. If the manufacturing process is not robust enough, then the COPQ will be high, resulting in a high COGS, and one of the parties (patient or manufacturer) has to bear the cost. According to Taguchi, it is a loss to society as a whole, as neither party benefits. If these failures are frequent, they lead to production losses; as a result, the product is not available in the market on time, and the manufacturer fails to fulfill the 4A's criteria of the customer. This leads not only to loss of market share but also to loss of the customer's confidence, and the customer in turn will look for other suppliers who can fulfill their requirements. This is an intangible loss to the manufacturer.

The COPQ is directly related to the way the process has been developed. There are two ways in which a process can be optimized (Figure 3). It is clear from Figure 3 that focusing on process optimization leads to a lower COPQ, and the process becomes more robust in terms of quality, quantity and timelines, thereby reducing the COGS by eliminating the COPQ. This raises another question: how is process optimization different from product optimization, and how is it going to solve all the problems related to inconsistency? This can be understood from the relationship between the QTPPs/CQAs and the CPPs/CMAs. As a manufacturer, we must realize that any CQA (y) is a function of the CPPs and CMAs (x), i.e. the value of a CQA is dictated by the CPPs/CMAs and not vice versa (Figures 4 & 7). It means that by controlling the CPPs/CMAs we can control the CQAs, but in order to do this we need to study and understand the process very well. This helps in quantifying the effect of the CPPs/CMAs on the CQAs, and once that is done, it is possible to control the CQAs at the desired level just by controlling the CPPs/CMAs. This way of process development is called process optimization, and QbD insists on it. Another important concept associated with process optimization is the way in-process monitoring of the reaction is done. Traditionally, a desired CQA is monitored for any abnormality during the reaction, whereas in the process optimization methodology the CPP/CMA responsible for that CQA is monitored instead (Figure 4). Hence, a paradigm shift is required in the way the process is developed and the control strategy is formulated by a manufacturer if the focus is on process optimization.

Figure 3: Two ways of optimization

From the above discussion, it is clear that the real customer for a generic manufacturer is not the patient but the Regulator. Patients cannot judge, and do not have the capability to test, the quality of medicines; for them all brands are the same. Hence the Regulators come into the picture: on behalf of the patients they deal with the manufacturers, because they have the means and the capability to do so. Going by Figure 5, patients are the real customers of the Regulators, who in turn are the customers of the manufacturer. In a business sense, patients are just the end users of the manufacturer's product once it is approved by the Regulators for use.

Figure 4: Relationship between CQAs and CPPs/CMAs

Since the Regulators are the real customers of the manufacturer, it is clear that with the current inefficient process the manufacturer is not helping its customer meet their goal (the 4A's). Manufacturers can now understand the relationship between their inefficient manufacturing process and the customer's KRA (Figure 6). In addition, they can clearly visualize the advantage of process optimization over product optimization and how QbD can act as an enabler in developing a robust process, thereby fulfilling the requirement of the 4A's. This encourages manufacturers to adopt QbD, because it now makes a strong business case for retaining the existing market and also serves as a strategy for entering new markets: a win-win situation for both parties. Therefore, QbD should be pursued by the manufacturer not out of regulatory fear but as a tool for fulfilling the customer's KRA, which in turn benefits the manufacturer by minimizing the COPQ. In addition, it helps in building the customer's trust, which is an intangible asset for any manufacturer. This enables manufacturers to accept Regulators as their customers rather than as obstacles, and results in better commitment from manufacturers to implementing QbD, because the definition of a customer given by Mahatma Gandhi is very relevant even today.

“A customer is the most important visitor on our premises. He is not dependent on us. We are dependent on him. He is not an interruption in our work. He is the purpose of it. He is not an outsider in our business. He is part of it. We are not doing him a favor by serving him. He is doing us a favor by giving us an opportunity to do so.” ― Mahatma Gandhi

Figure 5: Dynamic customer-supplier relationship throughout the supply chain

Figure 6: Manufacturer's perception after understanding the customer-supplier relationship

Manufacturer in customer’s shoes:

Another reason given by manufacturers for inconsistency is the quality of the KSM supplied by their vendors: any quality issue with the KSM will affect the quality of the API, as shown by Figure 7 and equation 2. Until now the manufacturer was acting as a supplier to the Regulators, but now the manufacturer is in the shoes of a customer and can understand the problems caused by inconsistent KSM quality from its own supplier (Figure 5, Table 1). The manufacturer can now empathize with the Regulatory bodies and is in a position to understand the effect of its process on its customer's KRA (Figure 6). Table 1 is equally applicable to the relationship between the manufacturer and the Regulatory bodies.

Table 1: Effect of process inconsistency from supplier/manufacturer on API quality

Consider Case-1 (Table 1), which represents the ideal condition, where the process is robust on both sides. Case-2 and Case-3 represent an inconsistent process at one of the two parties, and this inconsistency is reflected as inconsistency in the quality of the API at the manufacturer's site. The result is an unsatisfied customer (the Regulator) and loss of market share to someone else. Lastly, an inconsistent process on both sides (Case-4) results in a disastrous situation where it is difficult for the manufacturer to control the quality of the API, because the variances from both sides simply add up (equation 2). In this case the customer cannot even consider taking material from the manufacturer, as it would pose a threat to patients' lives, and no regulatory body would allow that. One could argue that if consistency is an issue at the supplier's end (Case-3), the manufacturer could negotiate cherry-picking of the good batches, but no supplier will cherry-pick without extra cost, which in turn increases the cost of the API. Another consequence of such handpicking is interruption of the timely supply of KSM, which delays production at the manufacturer's site. This increases the idle time of resources and hence the overheads, which ultimately reflects in an increased API cost. Apart from the increased cost, it also results in sporadic supply to the customer. Another option for circumventing inconsistency at the supplier's end is to reprocess the KSM at the manufacturer's site; obviously this is not a viable solution either, as it escalates the COGS. Hence there is no choice but to take the supplier into confidence and make them understand the implications of their product quality for your business, and how their business in turn would be affected by it.
The best solution is to discuss the issue with the supplier and ask them to improve their process (if the supplier has the capability), or to help them improve the KSM process (if the manufacturer has the capability).

Note: Apart from a robust process, Regulators also audit the manufacturer's site for safety and for the effluent treatment plant (ETP) facility. This is done for the same reason: ensuring the continuous supply of medicines to their country.

How does process inconsistency affect quality, and how will QbD help in getting rid of this inconsistency?

Realizing that we need consistent quality and uninterrupted production is not enough; as manufacturers we must also understand the various sources of inconsistency and how they can affect the quality of the API.

Any chemical reaction happening in a reactor is a black box for us (Figure 7), and there are three kinds of inputs that go into the reactor. The first, the material attributes (MAs), are the chemical entities that go into the reactor (KSM, reagents and other chemicals). The second, the process parameters (PPs), are the reaction/process parameters that can be controlled by the manufacturer. The third are the environmental/external factors, such as room temperature, age of the equipment, operators, etc., which cannot be controlled. As variance (σ²) has an additive property, inconsistency from all three types of factors amplifies the inconsistency of the product quality. The variation caused by the third type, the external factors, is called inherent variation, and we have to live with it; at most, the effect of these nuisance factors can be nullified by blocking and randomization during DoE studies. Because of this inherent variation, yield and other quality parameters are reported as a range instead of a single number. The variation due to the other two types of factors (MAs and PPs), however, can be controlled by studying their effect on the product quality attributes (QAs) using a combination of risk-analysis tools and statistical tools for optimization. This combination of risk-based assessment of MAs and PPs with statistical tools such as DoE/MVA for optimizing their effect on the QAs is what is called QbD. Hence QbD is the tool manufacturers are looking for to eliminate the inconsistency in their product, thereby fulfilling the customer's expectations.
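The additive property of variance can be checked with a quick simulation in R. This is only a sketch: the three sources of variation and their sigmas (0.8, 0.5 and 0.3) are invented for illustration.

```r
# Simulate three independent sources of variation feeding one output
set.seed(2307)
n <- 1e5
ma  <- rnorm(n, mean = 0, sd = 0.8)   # variation from material attributes (MAs)
pp  <- rnorm(n, mean = 0, sd = 0.5)   # variation from process parameters (PPs)
env <- rnorm(n, mean = 0, sd = 0.3)   # inherent variation from external factors

y <- 100 + ma + pp + env              # output quality attribute

# Variance of the output is (approximately) the sum of the input variances
var(y)
0.8^2 + 0.5^2 + 0.3^2                 # theoretical total variance = 0.98
```

The simulated `var(y)` comes out very close to the theoretical sum 0.98, which is exactly why inconsistency in any one input inflates the inconsistency of the product.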

The variance shown in Figure 7 represents the variation at a single stage only. Consider a multi-step synthesis (the most common scenario): the total variance at the API stage is the culmination of the variances from all the stages, resulting in a totally out-of-control process, as shown by equation 3:

σ²(API) = σ²(stage 1) + σ²(stage 2) + … + σ²(stage n)    (equation 3)

Figure 7: Cumulative effect of variance from various sources on the variance of API quality

At what stage of product development should QbD be applied?

The traditional approach to process development of any API focuses on filing the DMF at the earliest. As a result of this improper process development there are failures at commercial scale, and the process comes back to R&D for fine-tuning. If, instead, the process is developed with the QbD approach at the R&D stage itself, it certainly takes more time initially, but the time is worth investing, as there will be few or no failures at commercial scale and the process can be scaled up in much less time. This reduces reprocessing and rework at commercial scale, thereby minimizing the COPQ: a win-win situation for all, as depicted in Figure 8.

Figure 8: Risk and reward associated with the QbD and traditional approaches

[i]. (a) ICH Q8 Pharmaceutical Development (R2); U.S. Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and Research (CDER): Rockville, MD, Aug 2009. (b) ICH Q9 Quality Risk Management; U.S. Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and Research (CDER): Rockville, MD, June 2006. (c) ICH Q10 Pharmaceutical Quality System; U.S. Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and Research (CDER): Rockville, MD, April 2009. (d) Understanding Challenges to Quality by Design, Final deliverable for the FDA Understanding Challenges to QbD Project, December 18, 2009.

[ii]. (a) Jacky Musters, Leendert van den Bos, Edwin Kellenbach, Org. Process Res. Dev., 2013, 17, 87. (b) Zadeo Cimarosti, Fernando Bravo, Damiano Castoldi, Francesco Tinazzi, Stefano Provera, Alcide Perboni, Damiano Papini, Pieter Westerduin, Org. Process Res. Dev., 2010, 14, 805. (c) Fernando Bravo, Zadeo Cimarosti, Francesco Tinazzi, Gillian E. Smith, Damiano Castoldi, Stefano Provera, Pieter Westerduin, Org. Process Res. Dev., 2010, 14, 1162.

[iii]. (a) Sandeep Mohanty, Amrendra Kumar Roy, Vinay K. P. Kumar, Sandeep G. Reddy, Arun Chandra Karmakar, Tetrahedron Letters, 2014, 55, 4585. (b) Sandeep Mohanty, Amrendra Kumar Roy, S. Phani Kiran, G. Eduardo Rafael, K. P. Vinay Kumar, A. Chandra Karmakar, Org. Process Res. Dev., 2014, 18, 875.

[iv]. Girish R. Deshpande, Amrendra K. Roy, N. Someswara Rao, B. Mallikarjuna Rao, J. Rudraprasad Reddy, Chromatographia, 2011, 73, 639.

## Concept of Quality — We Must Understand this before Learning 6sigma!

Before we try to understand the 6sigma concept, we need to define the term “quality”.

##### What is Quality?

The term “quality” has many interpretations, but according to the ISO definition, quality is: “The totality of features and characteristics of a product or service that bear on its ability to satisfy stated or implied needs.”

If we read between the lines, the definition varies with the reference frame we use to define “quality”. The reference frames used here are the manufacturer (who supplies the product) and the customer (who uses the product), and quality can be defined with respect to each of them. This “goal post” approach to quality treats a product as simply pass or fail; it does not matter if the quality is on the borderline (the football just missed the goalpost and, luckily, a goal was scored). This definition was applicable as long as the manufacturer had a monopoly or faced only limited competition in the market. Manufacturers were not worried about failures, as they could easily pass the cost on to the customer, and having no choice, the customer had to bear it. This follows from the traditional definition of profit, in which the manufacturer was free to set the selling price:

Profit = SP - COGS

Coming to the current business scenario, manufacturers no longer have the luxury of defining the selling price. The market is very competitive and the price of goods and services is dictated by the market; hence it is called the market price instead of the selling price. This led to a change in the perception of quality: quality is now defined as producing goods and services that meet the customer's specifications at the right price. Manufacturers are forced to sell their goods and services at the market rate, and as a result profit is now defined as the difference between the market price and the cost of goods sold (COGS).

In the current scenario, if a manufacturer wants to make a profit, the only option is to reduce the COGS. To do so, one has to understand the components that make up the COGS: the genuine cost of production and the cost of quality. The genuine COGS will be nearly the same for all manufacturers; the real differentiator is the cost of quality. The manufacturer with the lowest cost of quality enjoys the highest profit and can influence the market price to keep the competition at bay. But to keep the cost of quality at its lowest possible level, the manufacturer has to hit the football right at the center of the goalpost every time! The cost of quality comprises the cost incurred to monitor and ensure quality (the cost of conformance) and the cost of non-conformance, or cost of poor quality (COPQ). The cost of conformance is a necessary evil, whereas the COPQ is a waste, an opportunity lost. Moreover, with the increasing demand for goods and services, manufacturers are required to fulfill their delivery commitments on time, otherwise their customers lose market share to competitors. Manufacturers have realized that their business depends on the business prospects of their customers; hence the timely supply of products and services is very important. This can be understood much better using the pharmaceutical industry as an example.

The sole responsibility of any Regulator (say, the FDA) towards its country is to ensure not only acceptable (in quality, safety and efficacy) and affordable medicines, but also their availability (no shortage) in the country at all times. Even that is not enough: those medicines must also be easily accessible to patients at their local pharmacies. These may be called the 4A's (acceptable, affordable, available, accessible) and are the KRA of any Regulatory body. If the Regulators miss any one of the 4A's, they will be held accountable by their Government for endangering the lives of patients. The point to be emphasized here is the importance of the TIMELY SUPPLY of medicines, besides other parameters like quality and price.

Hence, the definition of quality was again modified to “producing goods and services in the desired quantity, delivered on time, meeting all the customer's specifications of quality and price.” A term used in operational excellence, OTIF, is an acronym for “on time, in full”, meaning delivering goods and services that meet the customer's specifications, on time and in full quantity.

Coming once again to the definition of profit in the present-day scenario:

Profit = MP - COGS

We have seen that the selling price is driven by the market, and hence the manufacturer cannot control it beyond an extent. So what can he do to increase his margin or profit? The only option he has is to reduce his COGS. We have seen that the COGS has two components: the genuine COGS and the COPQ. Manufacturers have little scope to reduce the genuine COGS, as it is a necessary evil of producing goods and services. We will see later, in LEAN manufacturing, how this genuine COGS can be reduced to some extent (wait till then!): for example, if the throughput or the yield of the process is improved, there is less scrap, which decreases the raw-material cost per unit of goods produced.
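The effect of yield on the genuine COGS can be seen with simple arithmetic. The figures below (raw-material cost per batch, batch size and the two yields) are made up purely for illustration:

```r
# Hypothetical numbers: RM spend per batch and the output expected at 100% yield
rm_cost_per_batch <- 100000   # currency units of raw material per batch
batch_output_kg   <- 100      # kg of product if nothing were lost

# RM cost per kg of good product at a given fractional yield
cost_per_kg <- function(yield) rm_cost_per_batch / (batch_output_kg * yield)

cost_per_kg(0.70)   # at 70% yield: about 1428.6 per kg
cost_per_kg(0.85)   # at 85% yield: about 1176.5 per kg, a lower genuine COGS
```

The same raw-material spend is simply spread over more saleable product, which is why throughput improvement directly lowers the genuine COGS.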

But the real culprit behind a high COGS is an unwarranted, high COPQ.

The main reasons for a high COPQ are:

1. Low throughput or yield
2. More out-of-specification (OOS) product, which must be either
   1. reprocessed,
   2. reworked, or
   3. scrapped
3. Inconsistent quality, leading to higher after-sales service and warranty costs
4. Biggest of all, loss of the customer's confidence, which is intangible

If we look at the outcomes of COPQ discussed above, we can conclude one thing: “the process is not robust enough to meet the customer's specifications”, and because of this manufacturers face the problem of COPQ. All these wastages are called “mudas” in Lean terminology and will be dealt with in detail later. But the important question remains:

What causes COPQ?

Before we can answer this important question, we need to understand the concept of variance. Take a simple example: you leave home for the office at exactly the same time every day; do you reach the office at exactly the same time every day? The answer will be a big no, or better: “it takes anywhere between 40 and 45 minutes to reach the office if I start at exactly 7:30 AM.” This variation in arrival time can be attributed to many causes, such as variation in the starting time itself (I just can't start at exactly 7:30 every day), variation in traffic conditions, and so on. There will always be variation in any process, and we need to control that variation. In a manufacturing environment, too, there are sources of variation, such as wear and tear of machines, change of operators, etc. Because of this, there is always variation in the output (the goods and services produced by the process). Hence we do not get a product with fixed quality attributes; instead each quality attribute has a range (bounded by the process control limits) which needs to be compared with the customer's specification limits (the goal post).
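The commute example can be mimicked in R: even with a fixed departure time, small day-to-day variations add up to a spread in arrival times. The means and sigmas below are invented to roughly match the 40-45 minute story:

```r
set.seed(2307)
n_days <- 250   # roughly one working year

departure_delay <- rnorm(n_days, mean = 0, sd = 1)    # minutes late leaving home
driving_time    <- rnorm(n_days, mean = 42, sd = 1.5) # minutes spent on the road

commute <- departure_delay + driving_time             # total door-to-door time

range(commute)   # arrival times spread over several minutes, never one fixed number
sd(commute)      # close to sqrt(1^2 + 1.5^2), i.e. about 1.8 minutes
```

No single day is “wrong”; the spread itself is the inherent variation, and the observed range is what gets compared against the goal posts.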

If my process control limits lie close to the goal post (the boundaries of the customer's specification limits), then my failure rate will be quite high, resulting in more failures, scrap, rework and warranty cost. This is nothing but COPQ.
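The failure rate can be quantified with the normal distribution in R: for a process centred on the target, the fraction of product outside the goal posts depends only on how wide the process spread is relative to the specification limits. The specification (95 to 105) and the two sigmas are assumed numbers for illustration:

```r
# Customer specification (goal posts): 95 to 105, target in the center
lsl <- 95; usl <- 105; target <- 100

# Fraction of product outside specification for a centred normal process
oos_fraction <- function(sigma) {
  pnorm(lsl, mean = target, sd = sigma) +
    (1 - pnorm(usl, mean = target, sd = sigma))
}

oos_fraction(2.5)   # wide process, limits near the goal posts: ~4.5% failures
oos_fraction(1.0)   # process well within the goal posts: below 1 part per million
```

The same specification with a smaller process spread turns a several-percent failure rate into a negligible one, which is exactly the difference between the two cases discussed here.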

Alternatively, if my aim (the process limits) is well within the goal posts (case-2), my success rate is much higher and I will have less scrap and rework, thereby decreasing my COPQ.

###### Taguchi Loss Function

A paradigm shift in the definition of quality was given by Taguchi, who introduced the concept of producing products with quality targeted at the center of the customer's specifications (a mutually agreed target). He stated that as we move away from the center of the specification, we incur cost either at the producer's end or at the consumer's end, in the form of rework and reprocessing; holistically, it is a loss to society. The concept also states that producing goods and services beyond the customer's specification is a loss to society, as the customer will not be willing to pay for it: there is a sharp increase in the COGS as we try to improve the quality of goods and services beyond the specification. For example:

If the purity of the medicine I am producing is > 99.5% (say, the specification) and I try to improve it to 99.8%, it will decrease my throughput, as one extra purification is needed, resulting in yield loss and increased COGS.

When buying a ready-made suit, it is very difficult to find one that perfectly matches your body's contours, so you end up going for alterations, which incur cost. Whereas if you get a suit stitched by a tailor to fit your body contours (the specification), no extra cost for rework is incurred.
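Taguchi's idea is usually written as a quadratic loss function, L(y) = k(y − T)², which is zero only when the quality characteristic y sits exactly on the target T and grows as y drifts toward either specification limit, even while the product still "passes". A minimal R sketch, where the target T = 100 and the cost constant k = 4 are illustrative values:

```r
# Taguchi quadratic loss: L(y) = k * (y - T)^2
taguchi_loss <- function(y, target = 100, k = 4) {
  k * (y - target)^2
}

taguchi_loss(100)    # exactly on target: zero loss
taguchi_loss(99)     # 1 unit off target: loss = 4
taguchi_loss(97.5)   # near the goal post: loss = 25, though the product still passes
```

Unlike the pass/fail goal-post view, the loss rises smoothly with distance from the target, which is why Taguchi insists on centering the process rather than merely staying inside the limits.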

###### Six Sigma and COPQ

It is apparent from the above discussion that “variability in the process” is the single biggest culprit behind failures, resulting in a high cost of goods produced. This variability is the single most important concept in six sigma and needs to be comprehended very well. We will encounter this monster (variability) everywhere when dealing with six sigma tools like the histogram, the normal distribution, the sampling distribution of the mean, ANOVA, DoE, regression analysis and, most importantly, statistical process control (SPC).

Hence, the industry required a tool to study variability and to find ways to reduce it; the six sigma methodology was developed to fulfill this requirement. We will look in detail at why it is called six sigma, and not five or seven sigma, later on.

Before we go any further, we must understand one very important thing and always remember it: “any goods and services produced are an outcome of a process”, and “there are many inputs that go into the process, such as raw materials, technical procedures, men, etc.”

Hence, any variation in the inputs (x) to a given process causes variation in the quality of the output (y), i.e. y = f(x). Another important aspect is that variance has an additive property, i.e. the variances from all the inputs add up to give the variance of the output.

###### How does Six Sigma work?

Six sigma works by decreasing the variation coming from the different sources in order to reduce the overall variance in the system. It is a continuous improvement journey.

###### Summary
1. The definition of quality has changed drastically over time; it is no longer just “fit for purpose” but also includes on time and in full (OTIF).
2. In this world of globalization, the market place determines the selling price, and manufacturers either have to reduce their COPQ or perish.
3. There is a customer specification and a process capability. The aim is to bring the process capability well within the customer's specifications.
4. The main culprit behind out-of-specification product is an unstable process, which in turn is caused by variability in the process coming from different sources.
5. Variance has an additive property.
6. Lean is a tool to eliminate wastage in the system, and six sigma is a tool to reduce defects in the process.

References

1. To understand the consequences of a bad process, see the red bead experiment designed by Deming on YouTube: https://www.youtube.com/watch?v=JeWTD-0BRS4
2. For different definitions of quality, see http://www.qualitydigest.com/magazine/2001/nov/article/definition-quality.html#