## Understanding the Difference Between Long and Short Term Sigma

We have seen that the main difference between Cpk and the Ppk is the way in which the value of sigma (standard deviation) is being calculated.

In Cpk, the value of sigma comes from the control chart and is usually given by the formula

$$\hat{\sigma}_{short} = \frac{\overline{MR}}{d_2}$$

where $\overline{MR}$ is the average of the absolute moving ranges (each obtained as the difference between two consecutive points when the data is arranged in time order). The term $d_2$ is a statistical constant that depends on the sample size; for a moving range of two consecutive points, $d_2 = 1.128$.

This sigma-short is affected by the time order of the data, i.e. every time you change the time order, sigma-short changes.

In Ppk, on the other hand, sigma is calculated using the traditional sample standard deviation formula

$$\hat{\sigma}_{long} = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$$

and is called the overall sigma or sigma-long. In this case, sigma-long is not affected by the time order of the data points.
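As a quick illustration of the two definitions, here is a minimal base-R sketch. The five data values and the helper name `sigma_short_mr` are made up for illustration, and the constant 1.128 assumes a moving range of two consecutive points.

```r
# Hypothetical measurements in time order (illustrative values only)
x_time <- c(10, 11, 12, 12, 13)
# The same five values arranged in a different order
x_reordered <- c(10, 12, 11, 13, 12)

# sigma-short: average absolute moving range divided by d2
# (d2 = 1.128 for a moving range of two consecutive points)
sigma_short_mr <- function(x, d2 = 1.128) {
  moving_range <- abs(diff(x))
  mean(moving_range) / d2
}

short_time <- sigma_short_mr(x_time)
short_reordered <- sigma_short_mr(x_reordered)

# sigma-long: the ordinary sample standard deviation
long_time <- sd(x_time)
long_reordered <- sd(x_reordered)

print(c(short_time = short_time, short_reordered = short_reordered))
print(c(long_time = long_time, long_reordered = long_reordered))
```

Re-ordering the values changes the moving ranges and therefore sigma-short, while `sd()` returns the same value for both orderings.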

Usually, sigma-short is less than sigma-long.

Let’s do a simulation in R to check whether sigma-short is really affected by the time order or not.

    # Set the seed for reproducibility
    set.seed(2307)

    # Load the qcc package
    library(qcc)

    # Generate a normal sample of 50 data points
    d <- rnorm(50, 100, 1.1)

    # Containers for the control chart output, sigma-short and sigma-long
    IMR <- list()
    sigma_short <- c()
    sigma_long <- c()

    # Blank matrix of 10 rows and 50 columns to store 10 random
    # re-orderings of d, each having 50 data points
    sam <- matrix(nrow = 10, ncol = 50, byrow = TRUE)

    # Generate 10 random re-orderings of the normal sample d generated above
    for (i in 1:10) {
      # Generate the ith re-ordering and store it in the matrix sam
      sam[i, ] <- sample(d, 50, replace = FALSE)
      # Generate the I-MR chart of the ith sample
      IMR <- qcc(sam[i, ], type = "xbar.one", plot = FALSE)
      # Calculate sigma-short of the ith sample
      sigma_short[i] <- IMR$std.dev
      # Calculate sigma-long of the ith sample
      sigma_long[i] <- sd(sam[i, ])
    }

    # Print the table containing sigma-short and sigma-long of all 10 samples
    (data_table <- cbind(sigma_short, sigma_long))

Table-1: Short and long sigma generated from the same simulated data but with different time order.

| sigma_short | sigma_long |
|---|---|
| 1.1168596 | 1.09059 |
| 1.1462365 | 1.09059 |
| 1.1023853 | 1.09059 |
| 0.9902320 | 1.09059 |
| 1.1419678 | 1.09059 |
| 1.2173854 | 1.09059 |
| 0.9941954 | 1.09059 |
| 1.0408088 | 1.09059 |
| 1.1038588 | 1.09059 |
| 1.2275286 | 1.09059 |

It is evident from the simulation that sigma-short does get affected by the time order of the data, whereas sigma-long stays the same. Therefore, the sigma or standard deviation calculated from the control chart (short sigma) and the overall sigma are different.

For more on Cpk and Ppk, see the links below:

Car Parking & Six-Sigma

What Taguchi Loss Function has to do with Cpm?

What do we mean by garage’s width = 12σ and car’s width = 6σ?

Abstract: You will be surprised that we are all aware of this concept of distribution and are using it intuitively, all the time! Don’t believe me? Let me ask you a simple question: to which income class do you belong? Let’s assume that your answer is the middle income class. On what basis did you make this statement? Probably you have the following distribution of income groups in your mind and, based on this image, you are placing yourself towards the left side or towards the middle income group on this distribution.

Figure-1: How we are making use of distribution in our daily life, intuitively

Note: This article gives a conceptual view of the tools that we use in inferential statistics. Here we are not explaining the concept of sampling or the sampling distribution. Instead, we are using the distribution of individual values and assuming them to be normally distributed (which is not always the case) in order to explain the concept and for illustration purposes. We advise readers to read something on “sampling and sampling distribution” immediately after reading this article for better clarity, as we are giving an oversimplified version of the same in the present article. Don’t miss the “Central limit theorem”.

Introduction to the Concept of Distribution

When we say that our child is not good at studies, we are drawing a distribution of all students in our mind and implicitly placing our child towards the left of that distribution. Whenever we talk of adjectives like rich, poor, tall, handsome, beautiful, intelligent, cost of living etc., we subconsciously associate a distribution with those adjectives and just try to pinpoint the position of a given subject on this distribution.

What we are dealing with here is called inferential statistics because it helps in drawing inferences about the population based on sample data. This is just the opposite of probability, as shown below.

Figure-2: Difference between probability & statistics

Inferential statistics empowers us to take a decision based on a small sample drawn from a population.

Why is it so difficult to take decisions, or what causes this difficulty?

This is because we are dealing with samples instead of the population. Let’s assume we are making a batch of one million tablets (population) of Lipitor and, before releasing this batch into the market, we want to make sure that each tablet has a Lipitor content of 98-102%. Can we analyze all one million tablets? Absolutely not! What we actually do is analyze, say, 100 tablets (sample) selected at random from the one million tablets and, based on the results, accept or reject the whole lot of one million tablets (we usually use a z-test or t-test for taking such decisions).

BUT, there is a catch. Since we are working with small samples, there is always a chance of taking a wrong decision because the sample selected may not represent the entire population well (sampling error). The risk of a wrong decision that we are willing to accept is denoted by alpha (α) and is decided by the management prior to performing any study. Strictly speaking, α is the probability of rejecting the null hypothesis when it is actually true; for example, if the null hypothesis is that the batch of Lipitor conforms, α is the probability of rejecting a good batch. Since α is a theoretical threshold limit, it must be compared against an experimental probability value. This experimental or observed probability value is called the p-value (see blog on p-value).

Another aspect of the above discussion arises if we draw two or more samples (of 100 tablets each) and try to analyze them. Let me make it more complicated for you. You are the analyst and I come to you with three samples and want to know whether all three samples are coming from a single batch (i.e. belong to the same parent population) or not. The point I want to emphasize is that, even though multiple samples are withdrawn from the same population, they would seldom be exactly the same because of the sampling error. The concept is described in figure-3 below. This type of decision, where the number of samples is ≥ 3, is taken by ANOVA.

Figure-3: The distribution overlap and the decision making (or inferential statistics)

We have seen earlier that α is the theoretical probability or threshold limit beyond which we assume that the process is no longer the same. This theoretical limit is then tested by collecting a dataset and performing some statistical test (t-test, z-test etc.) to obtain an experimental or observed probability value, the p-value. If this p-value is found to be less than α, we say that the samples are coming from two different populations. This concept is represented below.

Figure-4: The relationship between p-value and the alpha value for taking statistical decisions.

Let’s remember the above diagram and try to visualize some more situations that we face every day, where we are supposed to take decisions. But before we do that, one important point: we must identify the target population correctly, otherwise the whole exercise would be futile.

For example

As a high-end apparel store, I am interested in the monthly expenditure of females. But wait a second, shouldn’t we specify what kind of females? Yes, we need to study females in the following two categories:

The employed and the self-employed females (great, at least we have identified the population categories to be compared). The next dilemma is whether to consider females of all age groups or only females below a certain age. As my store is more interested in young professionals, I would compare the above two groups of females with an age restriction of thirty years or less.

Figure-5: Identifying the right population for study is important

Another important point: in order to compare two samples (using a z-test or t-test) or more samples (using ANOVA), we also require information about the mean and standard deviation of the samples before we can tell whether they are coming from the same or different parent populations.

For example, let the mean monthly expenditure on apparel by a sample of 30 employed females be \$1500 and the mean expenditure by 30 self-employed females be \$1510. Immediately we will compare these two means and conclude that they are almost the same. At the back of our mind we are assuming that, even though the means are different, there will be some variation in the data and, if we consider this variation, the difference is not significant. Remember! We have made some kind of distribution in our mind before making this statement (statistically we do it by a two-sample t-test).

Figure-6: Significant overlap between distributions, indicating no difference between them

What if the mean expenditure by self-employed females is \$1525? We can still say that the difference is not big enough to be significant (again we are assuming that there will be variability in the data). What if the mean expenditure by self-employed females is \$1600? In this case we are fairly certain that the difference is significant. In all three cases discussed above, it is assumed that the variance remains constant.

Figure-7: Insignificant overlap between distributions, indicating that there is a difference between them

In real life, whenever we encounter two samples, we are tempted to compare the means directly for taking decisions. But in doing so, we forget to consider the standard deviation (variation) that is there in the data of the two samples. If we consider the standard deviation and find that there is no significant overlap between the distribution of the monthly expenditure by employed females and the distribution of the monthly expenditure by self-employed females, then we can conclude that the expenditure behavior of the two groups is different (see figure-6 & 7 above).
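To make the role of the standard deviation concrete, here is a minimal R sketch that turns the three mean differences discussed above into p-values of a pooled two-sample t-test. The common standard deviation of \$100 is purely an assumption (the article does not state one), and `p_from_summary` is a hypothetical helper name.

```r
# Two-sided p-value of a pooled two-sample t-test computed from summary statistics.
# Assumes equal group sizes (n per group) and a common standard deviation s.
p_from_summary <- function(mean1, mean2, s, n) {
  t_stat <- (mean2 - mean1) / (s * sqrt(2 / n))
  df <- 2 * n - 2
  2 * pt(-abs(t_stat), df)
}

s_assumed <- 100  # hypothetical standard deviation in dollars (not from the article)
n <- 30           # sample size per group, as in the example

p_1510 <- p_from_summary(1500, 1510, s_assumed, n)
p_1525 <- p_from_summary(1500, 1525, s_assumed, n)
p_1600 <- p_from_summary(1500, 1600, s_assumed, n)

print(round(c(p_1510 = p_1510, p_1525 = p_1525, p_1600 = p_1600), 4))
```

With this assumed spread, only the \$1600 case falls below α = 0.05. A smaller assumed standard deviation would make smaller differences significant, which is exactly why the means alone are not enough.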

Some other situations can also be understood by drawing the distributions, which helps us comprehend them much better.

The women in the workforce are protesting that there is a gender bias in the pay scale in your company. Is it so?

Once again, be careful about selecting the population for the study! We should only compare males and females of the same designation or with the same work experience. Let’s take the designation (males and females at manager and senior manager level) as the criterion for the comparison. Since we have identified the population, we can now select random samples from both genders at the manager and senior manager level. We can have two situations: either the two distributions overlap or they do not. If there is a significant overlap (p-value > α), then there is no significant difference in salary based on gender. On the other hand, if the two distributions are far apart (p-value < α), then the data point to a gender bias.

Figure-8: Intuitive scenarios for taking decision, based on the degree of distribution overlap

Our new gasoline formula gives better mileage than the other types of gasoline available in the market. Should we start selling it?

This problem can be visualized by the following diagram. But be careful! While measuring mileage, make sure you are taking the same kind of car, testing on the same road and running for the same number of kilometers at the same constant speed! Since the number of samples is ≥ 3, use ANOVA.

Figure-9: Understanding the gasoline efficiency using distribution
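The gasoline comparison can be sketched in R with a one-way ANOVA. The mileage figures and the gasoline labels below are invented purely for illustration; only the use of ANOVA for three or more samples comes from the text.

```r
# Hypothetical mileage (km per liter) for three gasoline types; illustrative values only
mileage <- data.frame(
  km_per_liter = c(18, 19, 20, 19, 18,
                   19, 20, 18, 19, 20,
                   24, 25, 26, 25, 24),
  gasoline = factor(rep(c("A", "B", "new"), each = 5))
)

# One-way ANOVA: does the mean mileage differ between gasoline types?
fit <- aov(km_per_liter ~ gasoline, data = mileage)
anova_table <- summary(fit)[[1]]
p_value <- anova_table[["Pr(>F)"]][1]

alpha <- 0.05  # significance level decided before the study
print(anova_table)
print(p_value < alpha)
```

A p-value below α only says that at least one gasoline type differs in mean mileage; a follow-up comparison would be needed to say which one.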

A new filament increases the lifetime of a bulb by 10%. Should we commercialize it?

For this problem, let’s produce two sets of bulbs, the first set with the old filament and the second set with the new filament. We then test samples from each group for their lifespan. What we are expecting is represented below.

Figure-10: Understanding the filament efficiency using distribution

A new catalyst developed by the R&D team can increase the yield of the process by 5%. Should we scale up the process?

Here we need to establish whether the 5% increase in yield is really significant or not. Can this case be represented by case-1 or by case-2 in the above diagrams?

The efficacy of a new drug is 30% better than that of the existing drug in the market. Is it so?

The soap manufacturing plant finds that some of the soaps weigh 55 g instead of 50-53 g (the target weight). Should it reset the process for corrective action?

A new production process claims to reduce the manufacturing time by 4 hrs. Should we invest in this new process?

The students of ABC management school are offered better salaries than those of XYZ School. Is it so? Colleges advertise like that!

Let’s have a look how the data is usually manipulated here. In order to promote a brand, companies usually distort the distribution when they compare their products with the other brands.

Figure-11: Misuse of statistics

ABC College or any other company promoting their brands would take samples from the upper band of their distribution and then they compare it with the distribution of the XYZ College or with other available brands. This gives a feeling that ABC College or a given brand in question is better than others. Alternatively, you can take competitor’s samples from the lower end of their distribution for comparison for getting the feel good factor about your brand!

The yield of a process has decreased from 90% to 87%. Should we take it up as a six sigma project?

Again, we need to establish whether the 3% decrease in yield is really significant or not. Can this case be represented by case-1 or by case-2 in the above diagrams?

If we look at the situations described above, we are forced to think: “what is the minimum separation required between the means of two samples to tell whether there is a significant overlap or not?”

Figure-12: What should be the minimum separation between distributions?

This is usually done in the following steps (this will be dealt with separately in the next blog on hypothesis testing):

1. Hypothesis testing
   1. Null and alternate hypothesis
   2. Decide α
2. Test statistics
   1. Use an appropriate statistical test (Z-test, t-test, F-test etc.) to estimate the p-value
3. Compare p-value and α
4. Take a decision based on whether p-value is < or > α

Concept of distribution and the hypothesis testing

Let’s see how the above concept of distribution helps in understanding hypothesis testing. In hypothesis testing we make two statements about the same population based on the sample. These two statements are known as the “null” and “alternate” hypotheses.

Null Hypothesis (H0): Mean mileage from a liter of new gasoline ≤ 20 Km (first distribution)

Alternate Hypothesis (Ha): Mean mileage from a liter of new gasoline > 20 Km (second distribution)

The above two statements can be represented by the following two distributions

Figure-13: Distributions of null and alternate hypothesis

Now, if H0 is true, i.e. the new gasoline is no better than the existing one, we would expect the two distributions to overlap significantly (p-value > α)

Figure-14: Pictorial view of the condition when null hypothesis is true

On the other hand, if H0 is false or Ha is true (the new gasoline is really better than the existing one), then these two distributions will be far from each other, i.e. there would be no significant overlap between them (p-value < α)

Figure-15: The pictorial view of the case when null hypothesis is not true
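The gasoline hypotheses above can be tested in R with a one-sided, one-sample t-test. The ten mileage readings below are invented for illustration, and α = 0.05 is an assumed significance level.

```r
# Hypothetical mileage readings (km per liter) for the new gasoline; illustrative values only
new_gasoline <- c(20.8, 21.4, 19.9, 22.1, 21.0, 20.5, 21.7, 20.9, 21.3, 20.6)

alpha <- 0.05  # significance level decided before the study (assumed here)

# H0: mean mileage <= 20 km per liter; Ha: mean mileage > 20 km per liter
result <- t.test(new_gasoline, mu = 20, alternative = "greater")

print(result$p.value)
if (result$p.value < alpha) {
  print("Reject H0: the data support a mean mileage above 20 km per liter")
} else {
  print("Fail to reject H0")
}
```

The `alternative = "greater"` argument matches the direction of Ha; a two-sided test would answer a different question (whether the mean differs from 20 in either direction).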

Above discussion can be extended to understand ANOVA, Regression analysis etc.

Summary

This article tries to give a pictorial view of a given statistical problem; we can call it “The Tale of Two Distributions”.

Any business problem that requires decision making can be visualized in the form of overlapping or non-overlapping distributions. This gives the management a pictorial view of the problem and makes it easier to comprehend.

Another important point here is the exercise of identifying the right target population, i.e. we must make sure that an apple is compared to an apple!

Going forward, this understanding will help you in understanding hypothesis testing in upcoming blog.

## 7QC Tools: Why do we Require to Plot X-bar and R-charts Simultaneously

Abstract: The main purpose of control charts is to monitor the health of the process, and this is done by monitoring both the accuracy and the precision of the process. Control charts help us do so by plotting the following two charts simultaneously:

1. Control chart for the mean (for accuracy of the process)
2. Control chart for variability (for precision of the process)

E.g. the X-bar and R chart (also called the averages and range chart) and the X-bar and s chart.

The Accuracy and the precision

We all must be aware of the following diagram that explains the concept of precision and accuracy in analytical development.

Case-1:

Hitting the bull’s eye all the time is called accuracy, and having all your shots concentrated at the same point is called precision.

Figure-1: Accuracy and precision

Case-2:

You are off the target (inaccurate) all the time but your shots are concentrated at the same point i.e. there is not much variation (Precision)

Case-3:

It is an interesting case. Your shots are scattered around the bull’s eye but, on average, your shots are on target (Accuracy); this is because of the averaging effect. But your shots are widely spread around the center (Imprecision).

Case-4:

In this case all your shots are off target and precision is also lost.

Before we could correlate the above concept with the manufacturing process, we must have a look at the following diagram that explains the characteristics of a given manufacturing process.

Figure-2: Precision and Accuracy of a manufacturing process

The distance between the average of the process control limits and the target value (average of the specification limits) represents the accuracy of the process or how much the process mean is deviating from the target value.

Whereas the spread of the process i.e. the difference between LCL and UCL of the process represents the precision of the process or how much variation is there in the process.

Having understood the above two diagrams, it would be interesting to visualize the control chart patterns in all four cases discussed above. But before that, let’s have a look at the effect of time on a given process, i.e. what happens to the process over time?

As the process continues to run, there will be wear and tear of machines, changes of operators etc., and because of that there will be shift and drift in the process, as represented by the four scenarios described in the following diagram.

Figure-3: Process behavior in a long run

A shift in the process mean from the target value is a loss of accuracy, and a change in the process control limits is a loss of precision. By Six Sigma convention, a process shift of up to ±1.5σ in the long run is considered acceptable.

If we combine figure-1 and figure-3, we get figure-4, which enables us to comprehend control charts in a much better way. It gives a picture of the manufacturing process in the form of control charts for the four scenarios discussed above.

Figure-4: Control chart pattern in case of precision and accuracy issue
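To show how the two charts are built side by side, here is a minimal base-R sketch that computes the center line and control limits of an X-bar chart (accuracy) and an R chart (precision). The subgroup data are invented for illustration; the constants A2, D3 and D4 are the standard tabulated values for a subgroup size of 5.

```r
# Hypothetical data: 4 subgroups (rows) of 5 measurements each; illustrative values only
subgroups <- matrix(c(50, 51, 49, 50, 52,
                      49, 50, 50, 51, 48,
                      51, 52, 50, 49, 50,
                      50, 49, 51, 50, 50),
                    nrow = 4, byrow = TRUE)

# Standard control chart constants for subgroup size n = 5
A2 <- 0.577
D3 <- 0
D4 <- 2.114

# Per-subgroup mean (tracks accuracy) and range (tracks precision)
subgroup_means <- rowMeans(subgroups)
subgroup_ranges <- apply(subgroups, 1, function(r) max(r) - min(r))

grand_mean <- mean(subgroup_means)
mean_range <- mean(subgroup_ranges)

# X-bar chart limits
xbar_limits <- c(LCL = grand_mean - A2 * mean_range,
                 CL = grand_mean,
                 UCL = grand_mean + A2 * mean_range)

# R chart limits
r_limits <- c(LCL = D3 * mean_range,
              CL = mean_range,
              UCL = D4 * mean_range)

print(round(xbar_limits, 3))
print(round(r_limits, 3))
```

Four subgroups are far too few for real limits (20 or more is the usual guidance); the point is only that the X-bar limits depend on the average range, so the mean chart cannot be read without the variability chart.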

Above discussion is useful in understanding the reasons behind the importance of the control charts.

1. Most processes don’t run under statistical control for a long time. There are drifts and shifts in the process over time, hence the process needs adjustment at regular intervals.
2. Process deviation is caused by assignable and common causes. Hence a monitoring tool is required to identify the assignable causes. This tool is called a control chart.
3. Control charts help in determining whether an abnormality in the process is due to assignable causes or common causes.
4. They enable timely detection of abnormalities, prompting us to take timely corrective action.
5. They provide an online test of the hypothesis that the process is under control, which helps in deciding whether to interfere with the process or not.
   1. H0: Process is under control (common causes)
   2. Ha: Process is out of control (assignable causes)
6. They help in continuous improvement:

Figure-5: Control Charts provide an opportunity for continuous improvement

## Why Standard Normal Distribution Table is so important?

###### Abstract

Since we are entering the technical/statistical part of the subject, it would be better for us to understand the concept first.

For many business decisions, we need to calculate the likelihood or probability of an event occurring. Histograms along with the relative frequency of a dataset can be used to some extent. But then, for every problem we come across, we need to draw the histogram and relative frequency to find the probability using the area under the curve (AUC).

In order to overcome this limitation, the standard normal distribution (or Z-distribution) was developed. The AUC or probability between any two points on this distribution is well documented in statistical tables and can also be easily found using an Excel sheet.

But in order to use standard normal distribution table, we need to convert the parent dataset (irrespective of the unit of measurement) into standard normal distribution using Z-transformation. Once it is done, we can look into the standard normal distribution table to calculate the probabilities.

From my experience, I found the books belonging to category “statistics for business & economics” are much better for understanding the 6sigma concepts rather than a pure statistical book. Try any of these books as a reference guide.

Introduction

Let’s understand by this example

A company is trying to write a job description for a manager-level position, and the most important criterion is the years of experience a person should possess. They collected a sample of ten managers from their company; the data is tabulated below along with its histogram.

As an HR person, I want to know the mean years of experience of a manager and the various probabilities as discussed below

Average experience = 3.9 years

What is the probability that X ≤ 4 years?

What is the probability that X ≤ 5 years?

What is the probability that 3 < X ≤ 5 years?

In order to calculate the above probabilities, we need to calculate the relative frequency and cumulative frequency

Now we can answer above questions

What is the probability that X ≤ 4 years? = 0.7 (see cumulative frequency)

What is the probability that X ≤ 5 years? = 0.9

What is the probability that 3 < X ≤ 5 years? = (probability X ≤ 5) - (probability X ≤ 3) = 0.9 - 0.3 = 0.6, i.e. 60% of the managers have more than 3 and up to 5 years of experience.
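The same relative-frequency calculation can be sketched in R. The data table is not reproduced here, so the ten values below are an assumption, chosen only because they match the stated mean (3.9 years), the standard deviation quoted later (1.37 years) and the cumulative probabilities above.

```r
# Assumed dataset: ten values chosen to match the stated mean, sigma and
# cumulative probabilities; the original table is not shown here
experience <- c(1, 3, 3, 4, 4, 4, 4, 5, 5, 6)

avg_experience <- mean(experience)
sd_experience <- sd(experience)

# The proportion of observations satisfying a condition is its relative frequency
p_le_3 <- mean(experience <= 3)
p_le_4 <- mean(experience <= 4)
p_le_5 <- mean(experience <= 5)

# p(3 < X <= 5) = p(X <= 5) - p(X <= 3)
p_3_to_5 <- p_le_5 - p_le_3

print(c(mean = avg_experience, sd = round(sd_experience, 2)))
print(c(p_le_4 = p_le_4, p_le_5 = p_le_5, p_3_to_5 = p_3_to_5))
```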

Area under the curve (AUC) as a measure of probability:

Width of a bar in the histogram = 1 unit

Height of the bar = frequency of the class

Area under the curve for a given bar = 1 × frequency of the class

Total area under the curve (AUC) = total area under all bars = 1×1+1×2+1×4+1×2+1×1 = 10

Relative area under the curve for the class 3 < x ≤ 5 = (AUC of 3rd class + AUC of 4th class) / total AUC = (4+2)/10 = 0.6 = probability of finding x between 3 and 5 (excluding 3)

Now, what about the probability of (3.2 < x ≤ 4.3)? It will be difficult to calculate by this method, as it requires the use of calculus.

Yes, we can use calculus for calculating various probabilities or AUC for this problem. Are we going to do this whole exercise again and again for each and every problem we come across?

With God’s grace, our ancestors gave us the solution in the form of Z-distribution or Standard normal distribution or Gaussian distribution, where the AUC between any two points is already documented.

This Standard normal distribution or Gaussian distribution is widely used in the scientific measurements and for drawing statistical inferences. This normal curve is shown by a perfectly symmetrical and bell shaped curve.

The Standard normal probability distribution has following characteristics

1. The normal curve is defined by two parameters, µ = 0 and σ = 1. They determine the location and shape of the normal distribution.
2. The highest point on the normal curve is at the mean which is also the median and mode.
3. The normal distribution is symmetrical and tails of the curve extend to infinity i.e. it never touches the x-axis.
4. Probabilities of the normal random variable are given by the AUC. The total AUC for normal distribution is 1. The AUC to the right of the mean = AUC to the left of mean = 0.5.
5. Percentage of observations within a given interval around the mean in a standard normal distribution is shown below

The AUC for the standard normal distribution has been calculated for all values of z and is available in tables that can be used for calculating probabilities. The table used here gives the cumulative probability p(Z ≤ z).

Note: be careful whenever you are using such a table, as some tables give the area for ≤ z and some give the area between two z-values.

Let’s try to calculate some of the probabilities using above table

Problem-1:

Probability p(z ≥ 1.25). This problem is depicted below

Look for z = 1.2 in the vertical column and then look for 0.05 (the second decimal place) in the horizontal row of the z-table: p(z ≤ 1.25) = 0.8944

Note! The z-distribution table given above gives the cumulative probability p(z ≤ 1.25), but here we want p(z ≥ 1.25). Since the total probability or AUC = 1, p(z ≥ 1.25) is given by 1 - p(z ≤ 1.25)

Therefore

p(z ≥ 1.25) = 1 - p(z ≤ 1.25) = 1 - 0.8944 = 0.1056

Problem-2:

Probability p(z ≤ -1.25). This problem is depicted below

Note! The above z-distribution table doesn’t contain -1.25, but p(z ≤ -1.25) = p(z ≥ 1.25) as the standard normal curve is symmetrical.

Therefore

Probability p(z ≤ -1.25) = 0.1056

Problem-3:

Probability p(-1.25 ≤ z ≤ 1.25). This problem is depicted below

For the obvious reasons, this can be calculated by subtracting the AUC of yellow region from one.

p(-1.25 ≤ z ≤ 1.25) = 1- {p(z ≤ -1.25) + p(z ≥ 1.25)} = 1 – (2 x 0.1056) = 0.7888
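The three table look-ups can be cross-checked in R, where `pnorm()` returns the cumulative probability p(Z ≤ z) of the standard normal distribution. This is only a sketch of an alternative to the printed table; the variable names are arbitrary.

```r
# Cumulative probability p(z <= 1.25), as read from a "less than or equal" z-table
p_le <- pnorm(1.25)

# Problem-1: upper tail p(z >= 1.25)
p_upper <- 1 - p_le

# Problem-2: lower tail p(z <= -1.25); equals the upper tail by symmetry
p_lower <- pnorm(-1.25)

# Problem-3: central area p(-1.25 <= z <= 1.25)
p_central <- pnorm(1.25) - pnorm(-1.25)

print(round(c(p_le = p_le, p_upper = p_upper,
              p_lower = p_lower, p_central = p_central), 4))
```

The results should agree with the table-based answers up to rounding in the last decimal place, since the table values are themselves rounded to four decimals.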

From the above discussion, we learnt that a standard normal distribution table (which is readily available) could be used for calculating the probabilities.

Now comes the real problem! Somehow I have to convert my original dataset into the standard normal distribution, so that calculating any probabilities becomes easy. In simple words, my original dataset has a mean of 3.9 years with σ = 1.37 years and we need to convert it into the standard normal distribution with a mean of 0 and σ = 1.

The formula for converting any normal random variable x with mean µ and standard deviation σ to the standard normal distribution is the z-transformation, and the value so obtained is called the z-score:

$$z = \frac{x - \mu}{\sigma}$$

Note that the numerator in the above equation is the distance of a data point from the mean. This distance is divided by σ, giving the distance of a data point from the mean in terms of σ, i.e. now we can say that a particular data point is, for example, 1.25σ away from the mean. Now the data becomes unitless!

Let’s do it for the above example discussed earlier

Note: Z-distribution table is used only in the cases where number of observations ≥ 30. Here we are using it to demonstrate the concept. Actually we should be using t-distribution in this case.

We can say that the managers with 4 years of experience are 0.073σ away from the mean, on the right-hand side, whereas the managers with 3 years of experience are 0.657σ away from the mean on the left-hand side (z = -0.657).
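Written out with the stated mean of 3.9 years and σ = 1.37 years, the z-scores for 4 and 3 years of experience work out as:

$$z_{4} = \frac{4 - 3.9}{1.37} \approx 0.073, \qquad z_{3} = \frac{3 - 3.9}{1.37} \approx -0.657$$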

Now, if you look at the distribution of the Z-scores, it resembles the standard normal distribution with mean = 0 and standard deviation =1.

But still one question needs to be answered. What is the advantage of converting a given data set into the standard normal distribution?

There are three advantages. The first is that it enables us to calculate the probability between any two points instantaneously. Secondly, once you convert your original data into the standard normal distribution, you end up with a unitless distribution (both numerator and denominator in the Z-transformation formula have the same units)! Hence, it becomes possible to compare an orange with an apple. For example, I wish to compare the variation in the salary of the employees with the variation in their years of experience. Since salary and experience have different units of measurement, it is not possible to compare them directly but, once both distributions are converted to the standard normal distribution, we can compare them (now both are unitless).

The third advantage is that, while solving problems, we needn’t convert every data point to z-scores, as explained by the following example

The historical 100 batches from the plant have given a mean yield of 88% with a standard deviation of 2.1%. Now I want to know the various probabilities

Probability of batches having yield between 85% and 90%

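Step-1: Transform the yield (x) data into z-scores

What we are looking for is the probability of a yield between 85 and 90%, i.e. p(85 ≤ x ≤ 90)

$$z_{85} = \frac{85 - 88}{2.1} = -1.43, \qquad z_{90} = \frac{90 - 88}{2.1} = 0.95$$

so that p(85 ≤ x ≤ 90) = p(-1.43 ≤ z ≤ 0.95)

Step-2: Always draw a rough standard normal curve and mark the area you are interested in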

Step-3: Use the Z-distribution table for calculating probabilities.

The Z-distribution table given above can be used in following way to calculate p(-1.43 ≤ z ≤ 0.95)

Diagrammatically, p(-1.43 ≤ z ≤ 0.95) = p(z ≤ 0.95) – p(z ≤ -1.43), is represented below

p(-1.43 ≤ z ≤ 0.95) = p(z ≤ 0.95) – p(z ≤ -1.43)= 0.83-0.076 = 0.75

That is, about 75% of the batches are expected to have, or there is a probability of 0.75 of getting, a yield between 85 and 90%.

It can also be interpreted as “the probability of getting a batch yield between 85 and 90%, given that the population mean is 88% with a standard deviation of 2.1%”.

Probability of yield ≥ 90%

What we are looking for is the probability of yield ≥ 90% i.e. p(x ≥ 90)

= p(z ≥ 0.95)

Diagrammatically, p(z ≥ 0.95) = 1 - p(z ≤ 0.95) is represented below

p(x ≥ 90) = p(z ≥ 0.95) = 1 - p(z ≤ 0.95) = 1 - 0.83 = 0.17, i.e. there is only a 17% probability of getting a yield ≥ 90%

Probability of yield ≤ 90%

This is very easy: just subtract p(x ≥ 90) from 1

Therefore,

p(x ≤ 90) = 1- p(x ≥ 90) = 1- 0.17 = 0.83 or 83% of the batches would be having yield ≤ 90%.
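The three yield probabilities can be cross-checked in R without computing z-scores by hand, because `pnorm()` accepts the mean and standard deviation directly. This is a sketch that assumes, as the example does, that batch yield is normally distributed with mean 88 and standard deviation 2.1.

```r
yield_mean <- 88  # mean yield (%) of the historical batches
yield_sd <- 2.1   # standard deviation of the yield

# p(85 <= x <= 90)
p_between <- pnorm(90, mean = yield_mean, sd = yield_sd) -
  pnorm(85, mean = yield_mean, sd = yield_sd)

# p(x >= 90): upper tail
p_at_least_90 <- pnorm(90, mean = yield_mean, sd = yield_sd, lower.tail = FALSE)

# p(x <= 90): cumulative probability
p_at_most_90 <- pnorm(90, mean = yield_mean, sd = yield_sd)

print(round(c(p_between = p_between,
              p_at_least_90 = p_at_least_90,
              p_at_most_90 = p_at_most_90), 2))
```

Rounded to two decimals, the values should match the table-based answers of 0.75, 0.17 and 0.83.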

Now let’s work the problem in the reverse way: I want to know the yield below which 85% of the batches fall, i.e. the x for which p(yield ≤ x) = 0.85.

Graphically it can be represented as

Since the table that we are using gives the probability for ≤ z, we first need to find the z-value corresponding to a probability of 0.85. Let’s look into the z-distribution table and find the probability closest to 0.85

The probability of 0.8508 corresponds to a z-value of 1.04

Now we have a z-value of 1.04 and we need to find the corresponding x-value (i.e. yield) using the Z-transformation formula

$$1.04 = \frac{x - 88}{2.1}$$

Solving for x

$$x = 88 + 1.04 \times 2.1 = 90.18$$

Therefore, there is a 0.85 probability of getting a yield ≤ 90.18% (as the z-distribution table we are using gives the probability for ≤ z); hence, there is only a 0.15 probability that the yield will be greater than 90.18%.
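The reverse look-up can be sketched in R with `qnorm()`, the inverse of `pnorm()`. As before, this assumes a normally distributed yield with mean 88 and standard deviation 2.1; the variable names are arbitrary.

```r
yield_mean <- 88
yield_sd <- 2.1

# z-value below which 85% of the standard normal distribution lies
z_85 <- qnorm(0.85)

# Yield below which 85% of the batches are expected to fall
x_85 <- qnorm(0.85, mean = yield_mean, sd = yield_sd)

# Same result via the rearranged z-transformation: x = mean + z * sd
x_check <- yield_mean + z_85 * yield_sd

print(round(c(z_85 = z_85, x_85 = x_85, x_check = x_check), 3))
```

Because `qnorm()` uses the exact probability 0.85 rather than the nearest table entry (0.8508), the result differs from 90.18 slightly in the second decimal place.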

Above problem can be represented by following diagram

Exercise:

The historical data shows that the average time taken to complete the BB exam is 135 minutes with a standard deviation of 15 minutes.

Find the probability that

1. Exam is completed in less than 140 minutes
2. Exam is completed between 135 and 145 minutes
3. Exam takes more than 150 minutes

Summary:

This article shows the limitations of the histogram and relative frequency methods in calculating probabilities, as they need to be drawn for every problem. To overcome this challenge, a standardized method using the standard normal distribution is adopted, where the AUC between any two points on the curve gives the corresponding probability, which can easily be calculated using an Excel sheet or a z-distribution table. The only thing we need to do is convert the given data into the standard normal distribution using the Z-transformation. This also enables us to compare two unrelated things, as the Z-distribution is unitless with mean = 0 and standard deviation = 1. If the population standard deviation is known, we can use the z-distribution; otherwise we have to work with the sample’s standard deviation and use Student’s t-distribution.

## Concept of Quality — We Must Understand this before Learning 6sigma!

Before we try to understand the 6sigma concept, we need to define the term “quality”.

##### What is Quality?

The term “quality” has many interpretations, but by the ISO definition, quality is: “The totality of features and characteristics of a product or service that bear on its ability to satisfy stated or implied needs”.

If we read between the lines, the definition varies with the reference frame we use to define “quality”. The reference frames we are using here are the manufacturer (who supplies the product) and the customer (who uses the product). Hence, quality can be defined with respect to these two reference frames as

This “goal post” approach to quality is graphically presented below, where a product is deemed either a pass or a fail. It doesn’t matter even if the quality is on the borderline (the football just missed the goalpost and, luckily, a goal was scored).

This definition was applicable as long as manufacturers had a monopoly or faced limited competition in the market. The manufacturers were not worried about failures, as they could easily pass the cost on to the customer. Having no choice, the customer had to bear the cost. This is because of the traditional definition of profit shown below.

Coming to the current business scenario, manufacturers don’t have the luxury of defining the selling price. The market is now very competitive and the price of goods and services is dictated by the market; hence it is called the market price instead of the selling price. This led to a change in the perception of quality: quality is now defined as producing goods and services that meet the customer’s specification at the right price. Manufacturers are now forced to sell their goods and services at the market rate. As a result, profit is now defined as the difference between the market rate and the cost of goods sold (COGS).

In the current scenario, if a manufacturer wants to make a profit, the only option is to reduce COGS. To do so, one has to understand the components that make up COGS, shown below. COGS consists of the genuine COGS and the cost of quality. The genuine COGS will be nearly the same for all manufacturers, so the real differentiator is the cost of quality. The manufacturer with the lowest cost of quality would enjoy the highest profit and can influence the market price to keep the competition at bay. But in order to keep the cost of quality at its lowest possible level, the manufacturer has to hit the football right at the center of the goalpost every time!

The cost of quality involves the cost incurred to monitor and ensure the quality (cost of conformance) and the cost of non-conformance or cost of poor quality (COPQ). The cost of conformance is a necessary evil whereas the COPQ is a waste or opportunity lost.

Coming to the present scenario, with increasing demand for goods and services, manufacturers are required to fulfill their delivery commitments on time; otherwise their customers would lose market share to competitors. Manufacturers have realized that their business depends on the business prospects of their customers; hence, timely supply of products and services is very important. This can be understood much better using the pharmaceutical industry as an example.

The responsibility of any regulator (say the FDA) towards its country is not only to ensure acceptable (quality, safety and efficacy) and affordable medicines but also to ensure their availability (no shortage) in the country at all times. Even that is not enough; those medicines must be easily accessible to patients at their local pharmacies. These may be called the 4A’s and are the KRA of any regulatory body. If they miss any one of the 4A’s, they will be held accountable by their government for endangering the lives of patients. The point that needs to be emphasized here is the importance of TIMELY SUPPLY of medicines, besides other parameters like quality and price.

Hence, the definition of quality was modified again to “producing goods and services in the desired quantity, delivered on time and meeting all of the customer’s specifications of quality and price.” In operational excellence, the term OTIF is an acronym for “on time in full”, meaning delivering goods and services that meet the customer’s specification, on time and in full quantity.

Coming once again to the definition of profit in the present-day scenario:

Profit = MP − COGS

We have seen that the selling price is driven by the market, so the manufacturer can’t control it beyond a point. So what can he do to increase his margin or profit? The only option is to reduce COGS. We have seen that COGS has two components, genuine COGS and COPQ. Manufacturers have little scope to reduce the genuine COGS, as it is a necessary cost of producing goods and services. We will see later, in LEAN manufacturing, how the genuine COGS can be reduced to some extent (wait till then!), e.g. if the throughput or yield of the process is improved, there is less scrap, which decreases the raw material (RM) cost per unit of goods produced.
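Written out term by term, the cost relations described in this section are as follows. This is only a restatement of the definitions given in the text (MP is the market price), not an additional accounting rule:

$$
\begin{aligned}
\text{Profit} &= \text{MP} - \text{COGS} \\
\text{COGS} &= \text{genuine COGS} + \text{cost of quality} \\
\text{cost of quality} &= \text{cost of conformance} + \text{COPQ}
\end{aligned}
$$

With MP fixed by the market and the genuine COGS nearly the same for every manufacturer, the term with the most room for improvement is COPQ.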

But the real culprit for the high COGS is the unwarranted high COPQ.

The main reasons for high COPQ are

1. Low throughput or yield
2. More out-of-specification (OOS) products, which have to be either
   1. Reprocessed,
   2. Reworked, or
   3. Scrapped
3. Inconsistent quality, leading to higher after-sales service and warranty costs
4. The biggest loss of all is the customer’s confidence in you, which is intangible.

If we look at the outcomes of COPQ (discussed above), we can conclude one thing: “the process is not robust enough to meet the customer’s specifications”, and because of this manufacturers face the problem of COPQ. All these wastages are called “mudas” in Lean terminology and will be dealt with in detail later. But the important question is:

What causes COPQ?

Before we can answer this important question, we need to understand the concept of variance. Let’s take a simple example: say you leave home for the office at exactly the same time every day. Do you reach the office at exactly the same time every day? The answer will be a big no; a better answer would be that it takes anywhere between 40 and 45 minutes to reach the office if I start at exactly 7:30 AM. This variation in arrival time can be attributed to many causes, like variation in the starting time itself (I just can’t start at exactly 7:30 every day), variation in traffic conditions, etc. There will always be variation in any process, and we need to control that variation. Even in a manufacturing environment there are sources of variation, like wear and tear of machines, change of operators, etc. Because of this, there will always be variation in the output (the goods and services produced by the process). Hence, we will not get a product with a fixed quality attribute; that quality attribute will have a range (called the process control limits), which needs to be compared with the customer’s specification limits (the goal post).

If my process control limits are close to the goal post (the boundaries of the customer’s specification limits), then my failure rate would be quite high, resulting in more failures, scrap, rework and warranty cost. This is nothing but COPQ.

Alternatively, if my aim (process limits) is well within the goal posts (case-2), my success rate is much higher and I would have less scrap and rework, thereby decreasing my COPQ.
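The two cases can be compared numerically. The sketch below assumes a normally distributed quality attribute; the specification limits, means and standard deviation are hypothetical numbers chosen only for illustration, and `frac_out_of_spec` is a made-up helper name.

```r
# Fraction of output outside the specification limits for a normally
# distributed quality attribute. All numbers below are hypothetical.
frac_out_of_spec <- function(mu, sigma, lsl, usl) {
  pnorm(lsl, mean = mu, sd = sigma) +
    pnorm(usl, mean = mu, sd = sigma, lower.tail = FALSE)
}

lsl <- 90    # lower specification limit (one goal post)
usl <- 110   # upper specification limit (the other goal post)

# Case 1: process running close to the upper goal post
case1 <- frac_out_of_spec(mu = 108, sigma = 2, lsl = lsl, usl = usl)

# Case 2: process centred well within the goal posts
case2 <- frac_out_of_spec(mu = 100, sigma = 2, lsl = lsl, usl = usl)

c(case1 = case1, case2 = case2)
```

With the same spread, moving the process from the edge to the center of the specification takes the out-of-specification fraction from roughly 16% to well under one in a million in this example.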

###### Taguchi Loss Function

A paradigm shift in the definition of quality came from Taguchi, who introduced the concept of producing products with quality targeted at the center of the customer’s specifications (a mutually agreed target). He stated that as we move away from the center of the specification, we incur a cost either at the producer’s end or at the consumer’s end in the form of rework and reprocessing. Holistically, it is a loss to society. Even producing goods and services beyond the customer’s specification is a loss to society, as the customer will not be willing to pay for it. There is a sharp increase in COGS as we try to improve the quality of goods and services beyond the specification.

For example:

The purity of the medicine I am producing is > 99.5 (say, the specification). If I try to improve it to 99.8, my throughput will decrease, as one extra purification is needed, which results in yield loss and increased COGS.

When buying a readymade suit, it is very difficult to find one that perfectly matches your body’s contour; hence you end up going for alterations, which incurs cost. Whereas, if you get a suit stitched by a tailor to fit your body contour (specification), it would not incur any extra cost in rework.
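The text does not give the formula of the loss function. The sketch below assumes the commonly quoted quadratic ("nominal-the-best") form, L(y) = k (y − target)², where the cost constant `k`, the target and the measured values are all hypothetical.

```r
# Quadratic Taguchi loss: L(y) = k * (y - target)^2.
# k converts squared deviation into money; its value here is hypothetical.
taguchi_loss <- function(y, target, k) k * (y - target)^2

target <- 100                 # agreed target at the center of the specification
k <- 0.5                      # hypothetical cost constant
y <- c(100, 101, 102, 104)    # hypothetical measured values

loss <- taguchi_loss(y, target, k)
loss   # losses of 0, 0.5, 2 and 8: doubling the deviation quadruples the loss
```

Unlike the pass/fail goal post view, the loss here grows continuously as soon as the product leaves the target, even while it is still inside the specification.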

###### Six Sigma and COPQ

It is apparent from the above discussion that “variability in the process” is the single biggest culprit behind the failures that result in a high cost of goods produced. Variability is the single most important concept in six sigma and needs to be understood very well. We will encounter this monster (variability) everywhere when dealing with six sigma tools like the histogram, normal distribution, sampling distribution of the mean, ANOVA, DoE, regression analysis and, most importantly, statistical process control (SPC).

Hence, the industry needed a tool to study variability and to find ways to reduce it. The six sigma methodology was developed to fulfill this requirement. We will look later at why it is called six sigma and not five or seven sigma.

Before we go any further, we must understand and always remember one very important thing: “any goods and services produced are the outcome of a process”, and “there are many inputs that go into the process, like raw materials, technical procedures, men, etc.”

Hence, any variation in an input (x) to a given process will cause a variation in the quality of the output (y).

Another important aspect is that variance has an additive property, i.e. the variances from all (independent) inputs add up to give the variance in the output.
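The additive property can be checked with a small simulation. This is a sketch under stated assumptions: two independent, normally distributed inputs with made-up standard deviations of 2 and 3, and an output that is simply their sum.

```r
set.seed(2307)
n <- 100000

# Two independent sources of variation (hypothetical values)
x1 <- rnorm(n, mean = 0, sd = 2)   # variance 4
x2 <- rnorm(n, mean = 0, sd = 3)   # variance 9

# Output whose variation is the sum of the two input variations
y <- x1 + x2

var_y <- var(y)   # close to 4 + 9 = 13
sd_y <- sd(y)     # close to sqrt(13), about 3.6, and not 2 + 3 = 5

round(c(var_y = var_y, sd_y = sd_y), 2)
```

Note that it is the variances that add, not the standard deviations, which is why reducing the largest source of variation pays off the most.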

###### How Six Sigma works?

Six sigma works by decreasing the variation coming from the different sources to reduce the overall variance in the system as shown below. It is a continuous improvement journey.

###### Summary:
1. The definition of quality has changed drastically over time; it is no longer just “fit for purpose” but also includes on time and in full (OTIF).
2. In this world of globalization, the marketplace determines the selling price, and manufacturers either have to reduce their COPQ or perish.
3. There is a customer specification and a process capability. The aim is to bring the process capability well within the customer’s specifications.
4. The main culprit behind out-of-specification product is an unstable process, which in turn is caused by variability in the process coming from different sources.
5. Variance has an additive property.
6. Lean is a tool to eliminate wastage in the system, and six sigma is a tool to reduce defects from the process.

References

2. For different definitions of quality, see http://www.qualitydigest.com/magazine/2001/nov/article/definition-quality.html#

## 7QC Tools: Interpretation of Control Charts Made Easy

Visual Inspection of the Control Charts for Unnatural Patterns

Besides the famous rules above, there are patterns on control charts that need to be understood by every quality professional. Let’s understand these patterns using the following examples. It is easier to understand them if we can imagine the type of distribution of the data displayed on the control chart.

###### Case-1: Non-overlapping distribution

As a production-in-charge, I am using two different grades of raw material with different quality attributes (non-overlapping, but at the edges of the specification limits), and I am assuming that the quality attributes of the final product will be normally distributed, i.e. that most of the final product will hit the center of the process control limits.

If the quality of the raw material determines the quality of the final product, then my assumption about the output is wrong, because the distribution of the final product quality would take a bimodal shape with only a few data points at the junction of the two distributions. The same information would be reflected on the control chart, with a high concentration of data points near the control limits and few or no points near the center. Here is the control chart of the final product

In this completely non-overlapping distribution, there will be unusually long connecting arms in the control chart, and an absence of points near the central line.

If we plot the histogram of this data set and go on increasing the number of classes, the two distributions would separate.

So, whenever we see a control chart with the data points concentrated towards the control limits and no points at the center, we should immediately suspect a mixture of two non-overlapping distributions. Remember: long connecting arms and few data points at the center of the control chart.
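A small simulation can reproduce this pattern. The sketch below uses made-up values for the two raw-material grades and computes the limits of an individuals chart by hand from the average moving range (with d2 = 1.128 for a moving range of two points), rather than calling a charting package.

```r
set.seed(2307)

# Output from two raw-material grades, non-overlapping (hypothetical values)
grade_a <- rnorm(25, mean = 95, sd = 0.5)
grade_b <- rnorm(25, mean = 105, sd = 0.5)
x <- sample(c(grade_a, grade_b))   # batches in random time order

# Individuals chart limits from the average moving range;
# d2 = 1.128 for a moving range of two consecutive points.
center <- mean(x)
sigma_short <- mean(abs(diff(x))) / 1.128
ucl <- center + 3 * sigma_short
lcl <- center - 3 * sigma_short

# Points close to the central line (within 2 units, an arbitrary band)
n_near_center <- sum(abs(x - center) < 2)

# Long connecting arms: consecutive points that jump between the two grades
n_long_arms <- sum(abs(diff(x)) > 5)

c(center = center, lcl = lcl, ucl = ucl,
  near_center = n_near_center, long_arms = n_long_arms)
```

Plotting `x` in time order with `plot(x, type = "b")` should show the same picture as described above: no points around the central line and many long connecting arms.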

###### Case-2: Partially overlapping distribution

Assume this scenario: a product is being produced in my facility in two shifts by two different operators. Each day I have two batches, one in each shift. There is a well-written batch manufacturing record indicating that the temperature of the reactor should be between 50 and 60 °C. A quality attribute of the product is represented by the following control chart.

We can see that the data points on the control chart alternate around the central line. The first batch (from the 1st shift) is below the central line and the next batch (from the 2nd shift) is above it. This control chart shows that even though we are following the same manufacturing process, there is a slight difference in how it is run. It was found that the 1st shift in-charge was operating towards 50 °C and the 2nd shift in-charge towards 60 °C. This type of alternating arrangement is an indication of stratification (due to operators, machines, etc.) and is characterized by short connecting arms.

These are cases of partially overlapping distributions resulting in a bimodal distribution, which means that there will be a few points in the central region of the control chart, but the majority of the data points will be distributed in zone C or B. In such cases, it is appropriate to plot the histogram by group (operator, shift, etc.).
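The alternating pattern can be sketched in R as follows. This is an idealised illustration with hypothetical values in which the two shifts barely overlap; real data with more overlap would alternate less perfectly.

```r
set.seed(2307)
n_days <- 15

# Hypothetical quality attribute: shift 1 runs low, shift 2 runs high
shift1 <- rnorm(n_days, mean = 99, sd = 0.2)
shift2 <- rnorm(n_days, mean = 101, sd = 0.2)

# Interleave in time order: shift 1, shift 2, shift 1, ...
x <- as.vector(rbind(shift1, shift2))
shift <- rep(c("shift1", "shift2"), times = n_days)

# Side of the central line for each point: -1 below, +1 above
side <- sign(x - mean(x))
alternating <- all(diff(side) != 0)   # TRUE if every point switches side

# Grouping by shift exposes the two underlying distributions
group_means <- tapply(x, shift, mean)

alternating
group_means
```

Splitting the data by `shift` before plotting the histogram, as suggested above, is what makes the two underlying distributions visible.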

###### Case-3: Significant Overlapping distribution

If there is significant overlap between the two input distributions, it would be difficult to differentiate them in the final product, and the combined distribution would give the picture of a single normal distribution. Suppose the operators in case-2 above were performing the activity at 55 °C and 60 °C respectively. This would result in an overlapping distribution as shown below

###### Case-4: Mixture of unequal proportion

As a shift-in-charge, I am running short of the production target. To meet the target, I mixed the current batch with some material produced earlier for another customer with a slightly different specification. I hoped that it wouldn’t be caught by QA! The final control chart of the process looked like

We can see from the control chart that if two distributions are mixed in unequal proportions, the combined distribution is asymmetrical. In this case, one half of the control chart (here, the lower half) has most of the data points and the other half has fewer.

###### Case-5: Cyclic trends

If one observes a repetition of the trend on the control chart, then there is a cyclic effect, like sales per month of the year: sales in some specific months are higher than in others.

###### Case-6: Gradual shift in the trend

A gradual change in the process is indicated by the change in the location of the data points on the control charts. This chart is most commonly encountered during the continuous improvement programs when we compare the process performance before and after the improvement program.

If it is observed that this shift is gradual on the control charts, then there must be a reason for the same, like wear and tear of machine, problem with the calibration of the gauges etc.

###### Case-7: Trend

If one observes that the data points on the control chart are gradually moving up or down, then it is a case of trend. This is usually caused by a gradual shift in the operating procedure due to wear and tear of machines, gauges going out of calibration, etc.

###### Summary of unnatural pattern on the control charts
| Unnatural pattern | Pattern description | Symptom in control chart |
| --- | --- | --- |
| Large shift (strays, freaks) | Sudden and large change | Points near or beyond the control limits |
| Smaller sustained shift | Sustained smaller change | Series of points on the same side of the central line |
| Trends | Continuous change in one direction | Steadily increasing or decreasing run of points |
| Stratification | Small differences between values in a long run, absence of points near the control limits | A long run of points near the central line on both sides |
| Mixture | Saw-tooth effect, absence of points near the central line | A run of consecutive points on both sides of the central line, all far from the central line |
| Systematic variation or stratification | Regular alternation of high and low values | A long run of consecutive points alternating up and down |
| Cycle | Recurring periodic movement | Cyclic recurring patterns of points |
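Two of the symptoms in the table, a series of points on the same side of the central line and a steadily increasing or decreasing run, are easy to count programmatically. The helper functions below are hypothetical names written for illustration using base R's `rle()`; how long a run must be before it counts as a signal depends on the rule set you follow.

```r
# Longest run of consecutive points on the same side of the central line.
# Points exactly on the line break a run.
longest_run_same_side <- function(x, center = mean(x)) {
  r <- rle(sign(x - center))
  runs <- r$lengths[r$values != 0]
  if (length(runs) == 0) 0L else max(runs)
}

# Longest run of steadily increasing or decreasing points.
# Ties between consecutive points break a run.
longest_trend <- function(x) {
  r <- rle(sign(diff(x)))
  runs <- r$lengths[r$values != 0]
  # k consecutive steps in one direction span k + 1 points
  if (length(runs) == 0) 1L else max(runs) + 1L
}

# Hypothetical data with a central line at 12
x <- c(10, 11, 9, 12, 13, 14, 15, 16, 12, 11)
same_side <- longest_run_same_side(x, center = 12)   # 4 (points 13, 14, 15, 16)
trend <- longest_trend(x)                            # 6 (points 9 up to 16)
c(same_side = same_side, trend = trend)
```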

For the case study see next blog

## 7QC Tools: My bitter experience with statistical Process Control (SPC)!

I just want to share my experience in SPC.

In general, I have seen that people plot the control chart of the final critical quality attribute of a product (or simply the CQA). But the information displayed by these control charts is historical in nature, i.e. the entire process has already taken place. Hence, even if the control chart shows an out-of-control point, I can’t do anything about it except reprocessing and rework. We often forget that these CQAs are affected by some critical process parameters (CPPs), and I can’t go back in time to correct those CPPs. The only thing we can do is start an investigation.

HENCE PLOTTING CONTROL CHARTS IS LIKE DOING A POSTMORTEM OF A DEAD (FAILED) BATCH.

Instead, if we plot the control charts of the CPPs and these charts show any out-of-control points, WE CAN IMMEDIATELY FORECAST THAT THIS BATCH IS GOING TO FAIL or WE CAN TAKE CORRECTIVE ACTION THEN AND THERE. This is because the CPPs and the CQA are highly correlated, so if a CPP shows an out-of-control point on its control chart, we can be fairly sure that the batch is going to fail.

Hence, the control charts of the CPPs help us forecast the output quality (CQA) of the batch, because the CPP would fail first, before the batch fails. This also saves the time that goes into the investigation. This is very important for the pharmaceutical industry, as everyone in it knows how much time and resource goes into an investigation!

I feel that we need to plot the control charts of the CPPs along with the control chart of the CQA, with more focus on the CPPs. This will help us take timely corrective action (if available), or we can scrap the batch, saving downstream time and resources (if no corrective action is available).

Another advantage of plotting the CPP is that we can look for evidence that it is trending and will cross the control limits in the near future, as shown below. This warrants timely corrective action on the process or machine.
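One simple way to anticipate such a crossing is to fit a straight line to the recent CPP values and extrapolate it to the control limit. The sketch below uses simulated data; the CPP values, the drift of 0.2 units per batch and the upper control limit of 56 are all hypothetical, and a linear extrapolation is only one possible choice.

```r
set.seed(2307)
batch <- 1:20

# Hypothetical CPP (e.g. a reactor temperature) drifting upward by 0.2 per batch
cpp <- 50 + 0.2 * batch + rnorm(length(batch), mean = 0, sd = 0.1)
ucl <- 56   # assumed upper control limit of the CPP chart

# Straight-line fit of the CPP against the batch number
fit <- lm(cpp ~ batch)
intercept <- unname(coef(fit)[1])
slope <- unname(coef(fit)[2])

# Batch number at which the fitted line reaches the upper control limit
batch_at_ucl <- (ucl - intercept) / slope
batch_at_ucl   # roughly batch 30 for this simulated drift
```

All 20 observed points are still inside the limit here, yet the fitted drift already indicates roughly when corrective action will be needed.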

CQA: Critical Quality attribute

CPP: Critical Process Parameter

OOS: out of specification

## A Way to Establish Cause & Effect Relationship …..Design of Experiments or DoE

What mostly happens during an investigation is that we collect a lot of data to prove or disprove our assumption. The problem with this methodology is that we can find false correlations between variables,

e.g. the increase in internet connections and in deaths due to cancer over the last 4 decades!

Is there a relation between the two (internet connections and deaths due to cancer)? Absolutely not. To avoid such confusion, we need a way to establish such relationships. For this we use DoE, a statistical way of conducting experiments that establishes cause & effect relationships. The general sequence of events in DoE is as follows
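A quick simulation shows how such a false correlation can arise. The two series below are entirely made up (they are not real internet or cancer data): each one simply grows over time with its own random noise, and neither influences the other.

```r
set.seed(2307)
year <- 1:40

# Two made-up series that both grow over time but do not influence each other
series_a <- 1.0 * year + rnorm(40, sd = 1)
series_b <- 2.0 * year + rnorm(40, sd = 3)

# Raw correlation: high, driven purely by the shared time trend
r_raw <- cor(series_a, series_b)

# After removing the time trend from each series, little correlation remains
r_detrended <- cor(resid(lm(series_a ~ year)), resid(lm(series_b ~ year)))

round(c(raw = r_raw, detrended = r_detrended), 2)
```

The raw correlation is close to 1 even though the only thing the two series share is time.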

Why DoE is important at R&D stage?

Just remember these two quotes

“Development speed is not determined by how fast we complete the R&D but by how fast we can commercialize the process”

“Things we do before tech transfer are more important than what is there in the tech pack!”

In order to avoid the unnecessary learning curves, and to have a control on the COGS we need to deploy QbD as shown below

Details will be covered in DoE chapter
