## Why Is the Standard Normal Distribution Table So Important?

###### Abstract

Since we are entering the technical/statistical part of the subject, it is better to understand the concept first.

For many business decisions, we need to calculate the likelihood or probability of an event occurring. Histograms, along with the relative frequencies of a dataset, can be used to some extent. But then, for every problem we come across, we need to draw the histogram and tabulate the relative frequencies to find the probability using the area under the curve (AUC).

To overcome this limitation, the standard normal distribution (also called the Z-distribution, i.e. the Gaussian distribution with mean 0 and standard deviation 1) was developed. The AUC, or probability, between any two points on this distribution is well documented in statistical tables and can also be found easily using an Excel sheet.

But to use the standard normal distribution table, we first need to convert the parent dataset (irrespective of its unit of measurement) into the standard normal distribution using the Z-transformation. Once that is done, we can look up the probabilities in the standard normal distribution table.

In my experience, books in the category “statistics for business & economics” are much better for understanding the 6sigma concepts than a pure statistics book. Try any of these books as a reference guide.

Introduction

Let’s understand this with an example.

A company is preparing a job description for a manager-level position, and the most important criterion is the years of experience a person should possess. They collected a sample of ten managers from their company; the data is tabulated below along with its histogram.

As an HR person, I want to know the mean years of experience of a manager and the various probabilities discussed below.

Average experience = 3.9 years

What is the probability that X ≤ 4 years?

What is the probability that X ≤ 5 years?

What is the probability that 3 < X ≤ 5 years?

In order to calculate the above probabilities, we need to calculate the relative frequency and cumulative frequency

Now we can answer above questions

What is the probability that X ≤ 4 years? = 0.7 (see cumulative frequency)

What is the probability that X ≤ 5 years? = 0.9

What is the probability that 3 < X ≤ 5 years? = (probability X ≤ 5) – (probability X ≤ 3) = 0.9 - 0.3 = 0.6, i.e. 60% of the managers have more than 3 and up to 5 years of experience.

Area under the curve (AUC) as a measure of probability:

Width of a bar in the histogram = 1 unit

Height of the bar = frequency of the class

Area under the curve for a given bar = 1x frequency of the class

Total area under the curve (AUC) = total area under all bars = 1×1+1×2+1×4+1×2+1×1 = 10

Fraction of the total area for the class 3 < x ≤ 5 = (AUC of 3rd class + AUC of 4th class) / total AUC = (4+2)/10 = 0.6 = probability of finding x between 3 and 5 (excluding 3)
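The frequency arithmetic above can be sketched in a few lines of Python. This is only an illustration: the frequencies (1, 2, 4, 2, 1) come from the AUC sum above, while the class upper bounds (2 to 6 years) are an assumption chosen to agree with the cumulative values quoted in the answers.

```python
from fractions import Fraction

# Frequencies are taken from the AUC sum above.
# The class upper bounds are an ASSUMPTION (the original table is not shown).
upper_bounds = [2, 3, 4, 5, 6]
frequencies = [1, 2, 4, 2, 1]

total = sum(frequencies)
relative = [Fraction(f, total) for f in frequencies]

# Cumulative frequency = running sum of the relative frequencies.
cumulative = []
running = Fraction(0)
for r in relative:
    running += r
    cumulative.append(running)

cum_by_bound = dict(zip(upper_bounds, cumulative))

p_le_4 = cum_by_bound[4]                       # p(X <= 4)
p_le_5 = cum_by_bound[5]                       # p(X <= 5)
p_3_to_5 = cum_by_bound[5] - cum_by_bound[3]   # p(3 < X <= 5)

print(float(p_le_4), float(p_le_5), float(p_3_to_5))
```

Exact fractions are used so that the subtraction 0.9 - 0.3 does not pick up floating-point noise.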

Now, what about the probability of (3.2 < x ≤ 4.3)? It is difficult to calculate by this method, because the limits fall inside the bars and finding the area requires fitting a curve and using calculus.

Yes, we can use calculus to calculate the various probabilities or AUCs for this problem. But are we going to repeat this whole exercise for each and every problem we come across?

With God’s grace, our ancestors gave us the solution in the form of Z-distribution or Standard normal distribution or Gaussian distribution, where the AUC between any two points is already documented.

The standard normal distribution is widely used in scientific measurements and for drawing statistical inferences. It is represented by a perfectly symmetrical, bell-shaped curve.

The Standard normal probability distribution has following characteristics

1. The standard normal curve is defined by two parameters, µ = 0 and σ = 1. For any normal distribution, µ and σ determine the location and the shape of the curve.
2. The highest point on the normal curve is at the mean which is also the median and mode.
3. The normal distribution is symmetrical and tails of the curve extend to infinity i.e. it never touches the x-axis.
4. Probabilities of the normal random variable are given by the AUC. The total AUC for normal distribution is 1. The AUC to the right of the mean = AUC to the left of mean = 0.5.
5. Percentage of observations within a given interval around the mean in a standard normal distribution is shown below

The AUC for the standard normal distribution has been calculated for every value of z, as the cumulative probability p(Z ≤ z), and is available in tables that can be used for calculating probabilities.

Note: be careful whenever you use such a table, as some tables give the area for ≤ z and some give the area between two z-values.

Let’s try to calculate some of the probabilities using above table

Problem-1:

Probability p(z ≥ 1.25). This problem is depicted below

Look for z = 1.2 in the vertical column and then for 0.05 (the second decimal place) in the horizontal row of the z-table: p(z ≤ 1.25) = 0.8944

Note! The z-distribution table given above gives the cumulative probability p(z ≤ 1.25), but here we want p(z ≥ 1.25). Since the total probability or AUC = 1, p(z ≥ 1.25) is given by 1 - p(z ≤ 1.25)

Therefore

p(z ≥ 1.25) = 1 - p(z ≤ 1.25) = 1 - 0.8944 = 0.1056

Problem-2:

Probability p(z ≤ -1.25). This problem is depicted below

Note! The above z-distribution table doesn’t contain -1.25, but p(z ≤ -1.25) = p(z ≥ 1.25) because the standard normal curve is symmetrical.

Therefore

Probability p(z ≤ -1.25) = 0.1056

Problem-3:

Probability p(-1.25 ≤ z ≤ 1.25). This problem is depicted below

For the obvious reasons, this can be calculated by subtracting the AUC of yellow region from one.

p(-1.25 ≤ z ≤ 1.25) = 1- {p(z ≤ -1.25) + p(z ≥ 1.25)} = 1 – (2 x 0.1056) = 0.7888
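As a cross-check on the table lookups in Problems 1 to 3, here is a minimal sketch using `statistics.NormalDist` from the Python standard library (an alternative to the printed table or the Excel sheet, not something the original post uses). The computed values should agree with the table to about four decimal places; small differences come from the table's rounding.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

p_le = std_normal.cdf(1.25)                          # table entry p(z <= 1.25)
p1 = 1 - p_le                                        # Problem-1: p(z >= 1.25)
p2 = std_normal.cdf(-1.25)                           # Problem-2: p(z <= -1.25)
p3 = std_normal.cdf(1.25) - std_normal.cdf(-1.25)    # Problem-3: p(-1.25 <= z <= 1.25)

print(round(p_le, 4), round(p1, 4), round(p2, 4), round(p3, 4))
```

Note that `cdf` handles negative z directly, so the symmetry trick is only needed when working from a table that lists positive z-values.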

From the above discussion, we learnt that a standard normal distribution table (which is readily available) could be used for calculating the probabilities.

Now comes the real problem! Somehow I have to convert my original dataset into the standard normal distribution, so that calculating any probabilities becomes easy. In simple words, my original dataset has a mean of 3.9 years with σ = 1.37 years and we need to convert it into the standard normal distribution with a mean of 0 and σ = 1.

Any normal random variable x with mean µ and standard deviation σ can be converted to the standard normal distribution by the z-transformation; the value so obtained is called the z-score.

$$z = \frac{x - \mu}{\sigma}$$

Note that the numerator in the above equation is the distance of a data point from the mean. This distance is divided by σ, giving the distance of the data point from the mean in terms of σ, i.e. we can now say that a particular data point is 1.25σ away from the mean. The data has now become unitless!

Let’s do it for the above example discussed earlier

Note: The Z-distribution table is normally used only when the number of observations is ≥ 30. Here we are using it just to demonstrate the concept; with only ten observations we should actually be using the t-distribution.

We can say that the managers with 4 years of experience are 0.073σ away from the mean, on the right-hand side, whereas the managers with 3 years of experience are 0.657σ away from the mean, on the left-hand side (z = -0.657).
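The two z-scores quoted above follow directly from z = (x - µ)/σ with µ = 3.9 years and σ = 1.37 years. A minimal sketch (the function name `z_score` is just an illustrative choice):

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, expressed in units of the standard deviation."""
    return (x - mean) / sd

mean_years = 3.9   # sample mean from the example
sd_years = 1.37    # sample standard deviation from the example

z_4 = z_score(4, mean_years, sd_years)   # managers with 4 years of experience
z_3 = z_score(3, mean_years, sd_years)   # managers with 3 years of experience

print(round(z_4, 3), round(z_3, 3))
```

A positive z-score lies to the right of the mean and a negative one to the left.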

Now, if you look at the distribution of the z-scores, it has mean = 0 and standard deviation = 1, just like the standard normal distribution.

But one question still needs to be answered. What is the advantage of converting a given dataset into the standard normal distribution?

There are three advantages. First, it enables us to calculate the probability between any two points instantaneously. Secondly, once you convert your original data into the standard normal distribution, you end up with a unitless distribution (both the numerator and the denominator of the Z-transformation formula have the same units)! Hence, it becomes possible to compare an orange with an apple. For example, I may wish to compare the variation in the salary of the employees with the variation in their years of experience. Since salary and experience have different units of measurement, it is not possible to compare them directly, but once both distributions are converted to the standard normal distribution, we can compare them (now both are unitless).

The third advantage is that, while solving problems, we need not convert every data point to a z-score, only the limits we are interested in, as explained by the following example.

The historical data of 100 batches from the plant gives a mean yield of 88% with a standard deviation of 2.1. Now I want to know the following probabilities.

Probability of batches having yield between 85% and 90%

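Step-1: Transform the yield (x) data into z-scores

$$z_{85} = \frac{85 - 88}{2.1} \approx -1.43, \qquad z_{90} = \frac{90 - 88}{2.1} \approx 0.95$$

What we are looking for is the probability of a yield between 85% and 90%, i.e. p(85 ≤ x ≤ 90) = p(-1.43 ≤ z ≤ 0.95)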

Step-2: Always draw a rough standard normal curve and mark the area you are interested in

Step-3: Use the Z-distribution table for calculating probabilities.

The Z-distribution table given above can be used in following way to calculate p(-1.43 ≤ z ≤ 0.95)

Diagrammatically, p(-1.43 ≤ z ≤ 0.95) = p(z ≤ 0.95) – p(z ≤ -1.43), is represented below

p(-1.43 ≤ z ≤ 0.95) = p(z ≤ 0.95) – p(z ≤ -1.43)= 0.83-0.076 = 0.75

That is, 75% of the batches, or a probability of 0.75, will have a yield between 85% and 90%.

It can also be interpreted as “the probability of a batch yield falling between 85 and 90, given that the population mean is 88% with a standard deviation of 2.1”.

Probability of yield ≥ 90%

What we are looking for is the probability of yield ≥ 90% i.e. p(x ≥ 90)

= p(z ≥ 0.95)

Diagrammatically, p(z ≥ 0.95) = 1 - p(z ≤ 0.95) is represented below

p(x ≥ 90) = p(z ≥ 0.95) = 1 - p(z ≤ 0.95) = 1 - 0.83 = 0.17, so there is only a 17% probability of getting a yield ≥ 90%

Probability of yield ≤ 90%

This is very easy, just subtract p(x ≥ 90) from 1

Therefore,

p(x ≤ 90) = 1- p(x ≥ 90) = 1- 0.17 = 0.83 or 83% of the batches would be having yield ≤ 90%.
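The three yield probabilities above can be cross-checked without converting to z-scores by hand. The sketch below assumes, as the example does, that the yield is normally distributed with mean 88 and standard deviation 2.1, and uses the standard-library `statistics.NormalDist`. Because it does not round z to two decimals, the results differ from the table values in the third decimal place.

```python
from statistics import NormalDist

# Yield modelled as normal with the historical mean and standard deviation.
yield_dist = NormalDist(mu=88.0, sigma=2.1)

p_85_90 = yield_dist.cdf(90) - yield_dist.cdf(85)   # p(85 <= x <= 90)
p_ge_90 = 1 - yield_dist.cdf(90)                    # p(x >= 90)
p_le_90 = yield_dist.cdf(90)                        # p(x <= 90)

print(round(p_85_90, 2), round(p_ge_90, 2), round(p_le_90, 2))
```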

Now let’s work the problem in reverse: I want to know the yield below which 85% of the batches fall, i.e. the value x for which p(yield ≤ x) = 0.85.

Graphically it can be represented as

Since the table we are using gives the probability for ≤ z, we first need to find the z-value corresponding to a probability of 0.85. Let’s look into the z-distribution table and find the probability closest to 0.85

The probability of 0.8508 corresponds to a z-value of 1.04

Now we have a z-value of 1.04 and we need to find the corresponding x-value (i.e. yield) using the Z-transformation formula

Solving for x

$$1.04 = \frac{x - 88}{2.1} \;\Rightarrow\; x = 88 + 1.04 \times 2.1 = 90.18$$

Therefore, there is a 0.85 probability of getting a yield ≤ 90.18% (as the z-distribution table we are using gives the probability for ≤ z); hence, there is only a 0.15 probability that the yield will be greater than 90.18%.
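The reverse lookup can also be sketched in Python, where `NormalDist.inv_cdf` plays the role of reading the z-table backwards. The result differs slightly from 90.18 because the table only offers the nearest tabulated probability (0.8508) rather than exactly 0.85.

```python
from statistics import NormalDist

MEAN_YIELD = 88.0
SD_YIELD = 2.1

# z-value below which 85% of the standard normal distribution lies.
z_85 = NormalDist().inv_cdf(0.85)

# Invert the z-transformation: x = mean + z * sd.
x_85 = MEAN_YIELD + z_85 * SD_YIELD

print(round(z_85, 2), round(x_85, 2))
```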

Above problem can be represented by following diagram

Exercise:

The historical data shows that the average time taken to complete the BB exam is 135 minutes with a standard deviation of 15 minutes.

Find the probability that

1. Exam is completed in less than 140 minutes
2. Exam is completed between 135 and 145 minutes
3. Exam takes more than 150 minutes

Summary:

This article shows the limitations of the histogram and relative frequency methods for calculating probabilities, as they have to be redrawn for every problem. To overcome this challenge, a standardized method based on the standard normal distribution is adopted, where the AUC between any two points on the curve gives the corresponding probability, which can easily be found using an Excel sheet or a z-distribution table. The only thing we need to do is convert the given data into the standard normal distribution using the Z-transformation. This also enables us to compare two unrelated things, as the Z-distribution is unitless with mean = 0 and standard deviation = 1. If the population standard deviation is known, we can use the z-distribution; otherwise we have to work with the sample’s standard deviation and use Student’s t-distribution.

## How to provide a realistic range for CQAs during product development to avoid unwanted OOS-1.

###### It is very important to understand the concept of CI/PI/TI before we can understand the reasons for OOS.

Let’s start from following situation

You have to reach the office before 9:30 AM. Now tell me how confident are you about reaching the office exactly between

(A) 9:10 to 9:15 (hmm…, such a narrow range, I am ~90% confident)

(B) 9:05 to 9:20 (a-haa.., now I am 95% confident)

(C) 9:00 to 9:25 (this is very easy, I am almost 99% confident)

The point to be noted here is that your confidence increases as the time interval widens (remember this for the rest of the discussion).

More important thing is that, it is difficult to estimate the exact arrival time, but we can say with some confidence that my arrival time would be between some time interval.

Say my average arrival time for the last five days (assuming all other factors remain constant) was 9:17 AM, so I can say with a certain confidence (say 95%) that my arrival time would be given by

Average arrival time (over, say, 5 days) ± margin of error

The confidence we are showing is called the confidence level, and the interval estimated by the above equation at a given confidence level is called the CONFIDENCE INTERVAL (CI). This confidence interval may or may not contain my true mean arrival time.

Now let’s go to a manufacturing scenario

We all are aware of the diagram given below, the critical quality attribute (CQA or y) of any process is affected by many inputs like critical material attribute (CMA), critical process parameter (CPP) and other uncontrollable factors.

Since CQAs are affected by CPPs and CMAs, it is said that a CQA, or any output Y, is a function of X (X = CPPs/CMAs).

The relationship between Y and X is given by the following regression equation

$$Y = b_0 + b_1 X_1 + \text{error}$$

The following points are worth mentioning

1. The value of Y depends on the value of X, which means that if there is a deviation in X then there will be a corresponding deviation in Y. E.g. if the level of an impurity (y) is influenced by the temperature, then any deviation in the impurity level will be attributed to a change in temperature (x).
2. If you hold X constant at some value and perform the process many times (say 100), then all 100 products (Y) will not be of the same quality because of the inherent variation/noise in the system, which in turn is due to other uncontrollable factors. That’s why we have an error term in our regression equation. If the error term were zero, the relationship would be described perfectly by a straight line y = mx + C. In that condition the regression line gives the expected value of Y, represented by E(Y) = b0+b1X1.

As we have seen that there will be a variation in Y even if you hold X constant. Hence, the term ‘expected value of Y’ represents the average value of Y for a given value of X.

It’s fine that for a given value of X there will be a range of Y values because of the inherent variation/noise in the process, and that the average of these Y values is called the expected value of Y for that value of X. But how is this going to help me in investigating OOS/OOT?

Let’s come to the point. Assume that we have manufactured one million tablets of 500 mg strength with a mixing time of 15 minutes (= x). Now I want to know the exact mean strength of all the tablets in the entire batch.

In statistical terms, we want to know the population mean (µ).

It’s not possible to estimate the exact mean strength of all the tablets in the entire batch as it would require destructive analysis of the entire one million tablets.

Then, what is the way out? How can we estimate the mean strength of the entire batch?

The best thing we can do is to take out a sample, analyze it and, based on the sample mean strength, make an intelligent guess about the mean strength of the entire batch. But it will come with some error, as we are using a sample for the estimation. This error is called the sampling error. The sample data gives an interval that may contain the population mean:

Sample mean ± margin of error = confidence interval (CI)

The term “sample mean ± margin of error” is called the confidence interval, which may or may not contain the population mean.

It is unlikely that two samples from a given population will yield identical confidence intervals (CI); every sample provides a different interval. But if we repeat the sampling many times and calculate all the CIs, then a certain percentage of the resulting confidence intervals will contain the unknown population parameter. The percentage of these CIs that contain the parameter is called the confidence level of the interval. The interval estimated from a sample is called the confidence interval (CI). This CI is for a given value of X, and it will change when X changes.

Note: Don’t be afraid of the formulas, we will be covering them later

If 100 samples are withdrawn, then the confidence levels can be read as follows

A 90% confidence level indicates that the CIs generated by about 90 of the 100 samples would contain the unknown population parameter.

A 95% confidence level indicates that the CIs estimated by about 95 of the 100 samples would contain the unknown population parameter.
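This reading of the confidence level can be illustrated with a small simulation. The sketch below is built on assumptions that are not in the original post: a hypothetical batch with true mean 500 mg and known standard deviation 5 mg, samples of 30 tablets, and a z-based interval (known σ) so that only the Python standard library is needed. It draws many samples, builds a 95% CI from each, and counts how often the interval contains the true mean.

```python
import random
from statistics import NormalDist, mean

random.seed(42)  # fixed seed so the run is repeatable

# All numbers below are HYPOTHETICAL, chosen only for illustration.
TRUE_MEAN = 500.0    # population mean strength (mg)
TRUE_SD = 5.0        # population standard deviation (mg), assumed known
SAMPLE_SIZE = 30
N_SAMPLES = 2000
CONF_LEVEL = 0.95

# Margin of error for a z-based interval: z * sigma / sqrt(n).
z_crit = NormalDist().inv_cdf(1 - (1 - CONF_LEVEL) / 2)
margin = z_crit * TRUE_SD / SAMPLE_SIZE ** 0.5

hits = 0
for _ in range(N_SAMPLES):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    sample_mean = mean(sample)
    # Does this sample's CI contain the true population mean?
    if sample_mean - margin <= TRUE_MEAN <= sample_mean + margin:
        hits += 1

coverage = hits / N_SAMPLES
print(f"Fraction of CIs containing the true mean: {coverage:.3f}")
```

The printed fraction should land close to the chosen confidence level, while any single interval either contains the true mean or does not.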

To summarize, we can estimate the population mean by using confidence interval with certain degree of confidence level.

It’s fine that the CI helps me determine the range within which, with 95% or 99% confidence, the mean strength of the entire batch lies. But I have an additional issue: I am also interested in knowing the number of tablets (out of one million) that would be bracketed by this interval or any other interval, and how many fall outside it. This will help me determine the failure rate once we compare this interval with the customer’s specifications.

More precisely, we want to know the interval that would contain 99% of the tablets with the desired strength, and how confident we are that this interval really contains 99% of the population.

If we can get this interval, we can compare it with the customer’s specification, which in turn would tell us something about the process capability. How can this be resolved?

Let’s understand the problem once again

If we understood the issue correctly, then we want to estimate an interval (with required characteristics) based on the sample data that will cover say 99% or 95% of the population and then we want to overlap this interval with the customer’s specification to check the capability of the process. This is represented by scenario-1 and scenario-2 (ideal) in the figure given below.

Having understood the issue, the solution lies in calculating another interval, known as the Tolerance Interval, for the population with the desired characteristic (Y) at a given value of the process parameter X.

Tolerance Interval: this interval captures the values of a specified proportion of all future observations of the response variable for a particular combination of the values of the predictor variables with some high confidence level.

We have seen that CI width is entirely due to the sampling error. As the sample size increases and approaches the entire population size, the width of the confidence interval approaches zero. This is because the term “margin of error” would become zero.

In contrast, the width of a tolerance interval is due to both sampling error and variance in the population. As the sample size approaches the entire population, the sampling error diminishes and the estimated percentiles approach the true population percentiles.
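The contrast between the two widths can be sketched numerically. This is not a tolerance interval calculation (that needs a tolerance factor which is not covered here); it only shows the limiting behaviour described above, under the assumptions of a normal population, a hypothetical known standard deviation of 5 mg, a 95% confidence level and 99% population coverage.

```python
from statistics import NormalDist

SIGMA = 5.0  # HYPOTHETICAL population standard deviation (mg)

z_ci = NormalDist().inv_cdf(0.975)      # two-sided 95% confidence
z_cover = NormalDist().inv_cdf(0.995)   # central 99% of the population

sample_sizes = [10, 100, 1_000, 10_000, 1_000_000]

# CI half-width (margin of error) shrinks with the square root of n.
ci_half_widths = [z_ci * SIGMA / n ** 0.5 for n in sample_sizes]

# Half-width of the interval holding 99% of the population: this is the
# limit that a tolerance interval approaches as sampling error vanishes.
population_half_width = z_cover * SIGMA

for n, half_width in zip(sample_sizes, ci_half_widths):
    print(f"n = {n:>9,}: CI half-width = {half_width:.4f}")
print(f"99% population half-width = {population_half_width:.4f}")
```

The CI half-width keeps shrinking as n grows, whereas the population half-width is set by the process variability alone, which is why only a process change can narrow it.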

e.g. A 95% tolerance interval that captures 98 % of the population of a future batch of the tablets at a mixing time of 15 minutes is 485.221 to 505.579 (this is Y).

Now, if the customer’s specification for the tablet strength is 497 to 502, then we are in trouble (scenario-1 in the above figure), because the tolerance interval is wider than the specification. We need to work on the process (e.g. increase the mixing time) to reduce the variability.

Let’s assume that we increased the mixing time to 35 minutes and as a result, 95% tolerance interval which captures 99% of the population is given by 498.598 to 501.902. Now we are comfortable with the customer’s specification (scenario-2 in above figure). Hence, we need to blend the mixture for 35 minutes before compressing it into tablets.

We need to be careful while interpreting the tolerance interval, as it contains two types of percentage terms. The first one, 95%, is the confidence level, and the second one, 98%, is the proportion of the total population with the required quality attributes that we want to bracket with the tolerance interval for a constant mixing time of 15 minutes.

To summarize: in order to generate tolerance intervals, we must specify both the proportion of the population to be covered and a confidence level. The confidence level is the likelihood that the interval actually covers the proportion.

This is what we wanted during the product development.

Let’s calculate the 95% CI using excel sheet
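For readers without the Excel sheet at hand, the same calculation can be sketched in Python. The ten tablet strengths below are hypothetical values invented for illustration, and the critical value 2.262 is the standard t-table entry for a two-sided 95% interval with 9 degrees of freedom (hard-coded because the standard library has no t-distribution).

```python
from statistics import mean, stdev

# HYPOTHETICAL assay results (mg) for a sample of 10 tablets.
strengths_mg = [498.2, 501.1, 499.5, 500.3, 497.8,
                502.0, 500.9, 499.1, 500.6, 498.5]

# t-table value for a two-sided 95% CI with n - 1 = 9 degrees of freedom.
T_CRIT_95_DF9 = 2.262

n = len(strengths_mg)
sample_mean = mean(strengths_mg)
sample_sd = stdev(strengths_mg)            # sample standard deviation (n - 1)

# Margin of error = t * s / sqrt(n)
margin = T_CRIT_95_DF9 * sample_sd / n ** 0.5
ci_low, ci_high = sample_mean - margin, sample_mean + margin

print(f"mean = {sample_mean:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

If the sample size changes, the critical value must be looked up again for the new degrees of freedom.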

In the next post we will try to clear up the confusion we have created in this post with a real-life example. So, keep visiting us.

###### Note on Regression Equation:

The regression line represents the expected value of y, E(yp), for a given value of x = xp. Hence, the point estimate of y for a given value x = xp is given by

$$\hat{y}_p = b_0 + b_1 x_p$$

xp = given value of x

yp = value of the output y corresponding to xp

E(yp) = mean or expected value of y for the given value x = xp; it denotes the unknown mean value of all y’s where x = xp.

Theoretically, ŷp is the point estimate of E(yp), hence the two should be equal. But in general this seldom happens. If we want to measure how close the true mean value E(yp) is to the point estimator ŷp, then we need the standard deviation of ŷp for the given value xp.

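The confidence interval for the expected value E(yp) is given by

$$\hat{y}_p \pm t_{\alpha/2}\, s_{\hat{y}_p}, \qquad s_{\hat{y}_p} = s\sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}$$

where s is the standard error of the estimate and n is the number of observations.

Why do we need this equation right now? (I don’t want you to get terrified!) If you focus on the numerator of the second term in the standard deviation formula, one important observation is that if

$$x_p = \bar{x}$$

then the standard deviation is at its minimum, and as you move away from the mean it goes on increasing. This implies that the CI is narrowest at xp = x̄ and widens as you move away from the mean.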

Hence, the width of the CI depends on the value of CPP (x)
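How the CI width depends on x can be sketched with the standard simple-linear-regression expression for the standard deviation of the estimated mean response, s·sqrt(1/n + (xp − x̄)² / Σ(xi − x̄)²). The mixing times and the residual standard deviation below are hypothetical values, and the function name is only illustrative.

```python
from statistics import mean

def se_mean_response(x_p, xs, s):
    """Standard deviation of the estimated mean response at x_p
    for a simple linear regression fitted on the x-values xs."""
    n = len(xs)
    x_bar = mean(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)   # sum of squared deviations of x
    return s * (1 / n + (x_p - x_bar) ** 2 / sxx) ** 0.5

# HYPOTHETICAL design: mixing times (minutes) used in the experiments.
mixing_times = [5, 10, 15, 20, 25, 30, 35]
residual_sd = 2.0   # HYPOTHETICAL standard error of the estimate (s)

se_by_x = {x_p: se_mean_response(x_p, mixing_times, residual_sd)
           for x_p in mixing_times}

for x_p, se in se_by_x.items():
    print(f"x_p = {x_p:>2} min: standard deviation of y-hat = {se:.3f}")
```

The smallest value appears at the mean mixing time (20 minutes in this made-up design), and it grows symmetrically on either side, which is why the confidence band around a regression line flares out towards the ends.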

## Why Do We Have Out of Specifications (OOS) and Out of Trend (OOT) Batches

While developing a product, we are bound by the USP/EP/JP monographs for the product’s critical quality attributes (CQAs) or by the ICH guidelines, and yet we regularly see OOT/OOS in commercial batches. It’s fine that every generic company has developed expertise in investigating and providing corrective & preventive action (CAPA) for all OOT and OOS, but the question that remains in our hearts and minds is,

Why can’t we stop them from occurring?

The answer lies in the following inherent issues at each level of the product life cycle,

###### We assume the customer’s specification and the process control limits are the same thing during product development.

Let’s assume that the USP monograph gives an acceptable assay range for a drug product of 97% to 102%. The product development team immediately starts working on the process to meet this specification. The focus is entirely on developing a process that gives a drug product within this range. But we forget that even a 6sigma process has a failure rate of 3.4 ppm. Therefore, in the absence of statistical knowledge, we treat the customer’s specification as the target for product development.

The right approach would be to calculate the required process control limits so that a given proportion of the batches (say 95% or 99%) should be in between customer’s specifications.

Here, I would like to draw an analogy in which the customer’s specification is like the width of a garage and the process control limits are like the width of the car. The width of the car should be much less than the width of the garage to avoid any scratches. Hence the target process control limits for product development should be narrower than the specification.

For details, see the earlier blog post on car parking and 6sigma.

###### Inadequate statistical knowledge leads to a wrong target range for a given quality parameter during product development.

Take the above example once again. The customer’s specification limit for the assay is 97% to 102% (= garage width). Now the question is, what should be the width of the process (= car’s width) that we need to target during product development to reduce the number of failures during commercialization? One thing is clear at this point: we can’t take the customer’s specification as the target for product development.

Calculating the target range for the development team

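To keep it simple, I will take the formula for Cp

$$C_p = \frac{USL - LSL}{6\sigma} \geq 1.33$$

Where Cp = process capability, σ = standard deviation of the process, and USL & LSL are the customer’s upper and lower specification limits. The number 1.33 is the minimum desired Cp for a capable process, which corresponds to approximately a 4 sigma process (3 × 1.33 = 3.99).

Solving for σ

$$\sigma = \frac{USL - LSL}{6 \times 1.33}$$

Calculating σ for the above process

$$\sigma = \frac{102 - 97}{6 \times 1.33} \approx 0.63$$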

Centre of the specification = 99.5 hence the target range of the assay for the product development team is given by

Specification mean ± 3σ

= 99.5±3×σ = 99.5±1.89 = 97.61 to 101.39

Hence, product development team has to target an assay range of 97.61 to 101.39 instead of targeting the customers specifications.
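The target-range arithmetic above can be reproduced with a short sketch, starting from Cp = (USL − LSL)/(6σ) with a minimum Cp of 1.33. Because the code does not round σ to 0.63 before multiplying by 3, its limits differ from 97.61 and 101.39 in the second decimal place.

```python
USL = 102.0        # upper specification limit (% assay)
LSL = 97.0         # lower specification limit (% assay)
TARGET_CP = 1.33   # minimum desired process capability

# Largest process standard deviation that still achieves the target Cp.
sigma_max = (USL - LSL) / (6 * TARGET_CP)

# Development target: centre of the specification +/- 3 sigma.
centre = (USL + LSL) / 2
target_low = centre - 3 * sigma_max
target_high = centre + 3 * sigma_max

print(f"sigma = {sigma_max:.3f}")
print(f"target range = {target_low:.2f} to {target_high:.2f}")
```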

There is another side to the coin: whatever range we take as a target for development, there is an assumption that 100% of the population will fall within that interval. This is not true because even a 6 sigma process has a failure rate of 3.4 ppm. So the point I want to make here is that we should also provide an expected failure rate corresponding to the interval that we have chosen to work with.
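As a rough illustration of attaching an expected failure rate to a chosen interval, the sketch below assumes a normally distributed assay, perfectly centred at 99.5 with the σ implied by Cp = 1.33 and no drift in the mean. These are simplifying assumptions for illustration, not claims about any real process.

```python
from statistics import NormalDist

USL, LSL = 102.0, 97.0
sigma = (USL - LSL) / (6 * 1.33)           # sigma implied by Cp = 1.33
process = NormalDist(mu=99.5, sigma=sigma)  # ASSUMED centred, normal process

# Fraction of batches expected outside the customer's specification.
p_out_spec = process.cdf(LSL) + (1 - process.cdf(USL))
ppm_out_spec = p_out_spec * 1e6

# Fraction expected outside the development target (mean +/- 3 sigma).
p_out_target = 2 * NormalDist().cdf(-3)

print(f"outside specification: about {ppm_out_spec:.0f} ppm")
print(f"outside mean +/- 3 sigma: {p_out_target:.4f}")
```

Under these assumptions a small but non-zero fraction of batches is expected outside either interval, which is the point made above: no target range covers 100% of the population.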

For further discussion on this topic, keep visiting for the forthcoming article on Confidence, Prediction and Tolerance intervals

###### Not Giving Due Respect to the Quality by Design Principle and PAT tools

Companies without in-house QbD capability may have an excuse, but even companies with QbD capability witness failures during scale-up although they claim to have used QbD principles. They often think that QbD and DoE are the same thing. For the readers, I want to highlight that DoE is just a small portion of QbD. There is a sequence of events that constitutes QbD, and DoE is just one of those events.

I have seen people start DoE directly on the process; scientists used to come to me saying that these are the critical process parameters (CPPs) and ask for a DoE plan. These CPPs are selected mostly on the basis of chemistry knowledge, like moles, temperature, concentration, reaction time etc. The thing is, these variables will seldom vary in the plant, because the warehouse won’t issue you a smaller or larger quantity of raw material and solvents, and the temperature won’t deviate that much. What we miss are the process-related variables, like the heating and cooling gradient, the hold-up time of the reaction mass at a particular temperature, the work-up time in the plant (usually much higher than the lab work-up time), the type of agitator, exothermicity, the waiting time for analysis and other unit operations. We don’t understand the importance of these at the lab level, but these monsters raise their heads during commercialization.

Therefore, proper guidelines are required for conducting successful QbD studies in the lab (see the forthcoming article on DoE). In general, if we want a successful QbD, we need to make a dummy batch manufacturing record of the process in the lab and then perform a risk analysis of the whole process to identify the CPPs and CMAs. A brief QbD process is described below

###### Improper Control Strategy in the Developmental Report

Once the product is developed in the lab, there are some critical process parameters (CPPs) that can affect the CQAs. These CPPs are seldom deliberated in detail by the cross-functional team to mitigate the risk by providing adequate manual and engineering controls. This is because we are in a hurry to file the ANDA/DMF, among other reasons. We take action only once the failures become a chronic issue. Because of this, CPPs vary in the plant, resulting in OOS.

###### Monitoring of CQAs instead of CPPs during commercialization.

I like to call us “knowledgeable sinners”. This is because we know that a CQA is affected by the CPPs, yet we continue to monitor the CQA instead of the CPPs, even though the CQA has to be under control if the CPPs are under control. For example, we know that if the reaction temperature shoots up it will lead to impurities; even then we continue to monitor the impurity level using control charts but not the temperature itself. We can ask ourselves, what do we achieve by monitoring the impurities after the batch is complete? The answer is that we achieve nothing but a failed batch, an investigation, and a loss of raw material/energy/manpower/production time. To summarize, we can only do a postmortem of a failed batch and nothing else.

If, instead of the impurity, we had monitored the temperature, which was critical, we could have taken corrective action then and there. Knowing that the batch was going to fail, we could have terminated it, thereby saving manpower/energy/production time etc. (imagine, a single OOS investigation requires at least 5-6 people working for a week, which is equal to 30 man-days).

###### Role of QA is mistaken for policing and auditing rather than continuous improvement.

The QA department in every organization is frequently busy with audit preparation! Their main role has become restricted to documentation and keeping the facility ready for audits (mostly in the pharmaceutical field). What I feel is that, within QA, there has to be a statistical process control (SPC) group whose main function is to monitor the processes and suggest areas for improvement. This function should have sound knowledge of engineering and SPC so that they can foresee OOT and OOS by monitoring CPPs on control charts. So, the role of QA is not only policing but also assisting other departments in improving quality. I understand that at present SPC knowledge is very limited among QA and other departments, which we need to improve.

###### Lack of empowerment to the operators for reporting deviation occurred

You will all agree that the best process owners of any product are the shop-floor people or operators, but we seldom give importance to their contribution. The pressure on them is to deliver a given number of batches per month to meet the sales target. Due to this production target, they often don’t report deviations in CPPs, because they know that doing so will lead to an investigation by QA and the batch will be cleared only once the investigation is over. In my opinion, QA should empower operators to report deviations; the punishment should not be for the batch failure but for not asking for help. It is fine to miss the target by one or two batches, because the knowledge gained from those batches with deviations would improve the process.

###### Lack of basic statistical knowledge across the technical team (R&D, Production, QA, QC)

I am not saying that everyone should become a statistical expert, but at least we can train our people on the basic 7QC tools! That is not rocket science. This will help everyone to monitor and understand the process; shop-floor people can themselves use these tools (or QA can empower them after training and certification) to plot histograms, control charts etc. pertaining to the process and compile the report for QA.

What are Seven QC Tools & How to Remember them?

###### Other reasons for OOT/OOS are as follows which are self explanatory

1. Frequent vendor change (quality comes for a price). Someone has to bear the cost of poor quality.
2. Not linking vendors in your continuous improvement journey. The variation in their raw material can create havoc in your process.
3. Focusing on delivery at the cost of preventive maintenance of the hardware.

Related Topics

Proposal for Six Sigma Way of Investigating OOT & OOS in Pharmaceutical Products-1

Proposal for Six Sigma Way of Investigating OOT & OOS in Pharmaceutical Products-2

## You just can’t knock down this Monster “Variance” —- Part-3

If x & y are two variables, then irrespective of whether you add or subtract them the variance will always add up.

A store owner wants to know the mean and the variance of sales made by male and female customers in a day. The owner also wants to see the variance when the sales by both genders are added in randomly formed pairs. Lastly, the owner wants to analyze the mean and variance due to the gender effect (i.e. the difference of the means and its variance). Data of sales in hundreds of dollars is given below.

In Excel, the mean is calculated by typing the formula =AVERAGE(array)

The variance is calculated by typing the formula =VAR.S(array)

array = column of data, VAR.S = variance of a sample
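For readers who prefer a script to a spreadsheet, the same two calculations can be sketched in Python. The sales figures below are made up for illustration (they are not the data from the table above), and `summarize` is just a helper name chosen here.

```python
import statistics

# Hypothetical daily sales in hundreds of dollars (not the article's data).
male_sales = [4.0, 6.0, 5.0, 7.0, 3.0]
female_sales = [8.0, 5.0, 9.0, 6.0, 7.0]


def summarize(values):
    """Return (mean, sample variance), the counterparts of =AVERAGE and =VAR.S."""
    # statistics.variance divides the sum of squares by n - 1, like VAR.S.
    return statistics.mean(values), statistics.variance(values)


male_mean, male_var = summarize(male_sales)
female_mean, female_var = summarize(female_sales)

print("male:  ", male_mean, male_var)      # 5.0 2.5
print("female:", female_mean, female_var)  # 7.0 2.5
```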

But the most surprising element is that, irrespective of whether you add or subtract the data, the variances add up. This monster will always raise its head. For uncorrelated data this shows up as a resultant variance that is greater than either of the individual variances.

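In general, the variances get added irrespective of whether we are adding or subtracting the variables; only the sign of the correlation term changes:

$$\sigma^2_{x \pm y} = \sigma^2_x + \sigma^2_y \pm 2\rho\,\sigma_x\sigma_y$$

where ρ is the correlation coefficient between the two variables.

If the two random variables are uncorrelated (as independent variables are), then ρ = 0 and the above formula reduces to

$$\sigma^2_{x \pm y} = \sigma^2_x + \sigma^2_y$$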

Try calculating the variance of x+y and of x-y directly from the data. Are you getting a slightly different answer from the plain sum of the two variances? Then put the correlation coefficient into the equation!

Calculating the correlation coefficient (ρ) in Excel

Type the formula in a cell: =CORREL(array1, array2)

array1 = column x, array2 = column y
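The exercise above can also be checked with a short script. This is only a sketch on made-up paired data (not the store's figures); `pearson_r` is a helper written here to mirror what =CORREL returns. It compares the variance computed directly from x+y and x-y with the value predicted from the individual variances and the correlation coefficient.

```python
import math
import statistics

# Hypothetical paired sales in hundreds of dollars (not the article's data).
x = [4.0, 6.0, 5.0, 7.0, 3.0]
y = [8.0, 5.0, 9.0, 6.0, 7.0]


def pearson_r(a, b):
    """Sample correlation coefficient, the quantity =CORREL computes."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cross = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    ss_a = sum((ai - mean_a) ** 2 for ai in a)
    ss_b = sum((bi - mean_b) ** 2 for bi in b)
    return cross / math.sqrt(ss_a * ss_b)


var_x = statistics.variance(x)
var_y = statistics.variance(y)
rho = pearson_r(x, y)
cross_term = 2 * rho * math.sqrt(var_x * var_y)

# Variance measured directly on the summed and subtracted pairs.
var_sum = statistics.variance([xi + yi for xi, yi in zip(x, y)])
var_diff = statistics.variance([xi - yi for xi, yi in zip(x, y)])

# Variance predicted from the individual variances plus the correlation term.
predicted_sum = var_x + var_y + cross_term
predicted_diff = var_x + var_y - cross_term

print("rho =", rho)
print("x+y:", var_sum, "predicted", predicted_sum)
print("x-y:", var_diff, "predicted", predicted_diff)
```

With these made-up numbers the correlation happens to be negative, so the cross term reduces the variance of x+y and inflates the variance of x-y. With uncorrelated data both results would sit close to the plain sum of the two variances.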

Understanding the Monster “Variance” part-1

Why it is so Important to Know the Monster “Variance”? — part-2


## Why it is so Important to Know the Monster “Variance”? — part-2

Variance occupies the central role in the six-sigma methodology. Any process, whether from the manufacturing or the service industry, has many inputs, and the variance from each input adds up in the final product.

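Hence, for independent sources of variation, variance has an additive property:

$$\sigma^2_{\text{total}} = \sigma^2_1 + \sigma^2_2 + \dots + \sigma^2_n$$

Note: you can add two variances but not the standard deviations.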
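A small numeric sketch makes the note concrete. The two standard deviations below are invented values for two independent sources of variation; the variable names are chosen here for illustration.

```python
import math

# Hypothetical standard deviations of two independent sources of variation.
sd_input = 3.0
sd_process = 4.0

# Variances add for independent sources.
var_total = sd_input ** 2 + sd_process ** 2

# The combined standard deviation is the square root of the summed variance.
sd_total = math.sqrt(var_total)

# Adding the standard deviations directly overstates the combined spread.
naive_sd_sum = sd_input + sd_process

print(var_total, sd_total, naive_sd_sum)  # 25.0 5.0 7.0
```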

Consequence of the variance addition and six sigma

Suppose a product or service is the output of some process, which in turn has many inputs. Then the variance from the inputs ($\sigma^2_{\text{input}}$) and the variance from the process ($\sigma^2_{\text{process}}$) add up to give the final variance ($\sigma^2_{\text{product}}$) in the product or service.

The DMAIC methodology of 6Sigma tries to identify the inputs that contribute the most to the variance in the final product. Once identified, their effect is studied in detail so that their contribution can be minimized. This is done by reducing the variance in the input itself.

Example: if the quality of an input material used to manufacture a product is found to be critical, then steps would be taken to reduce the batch-to-batch fluctuation in the quality of that input material, either by requesting/threatening the vendor or by reworking the input material at your end.
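The idea of hunting for the largest contributor can be sketched as follows. The variance figures and source names are invented for illustration, and the sources are assumed to be independent so that their variances simply add; in a real DMAIC project these components would have to be estimated from data.

```python
# Hypothetical variance contributed by each independent source.
variance_by_source = {
    "raw material": 9.0,
    "operator": 1.0,
    "equipment": 2.0,
}


def variance_shares(components):
    """Return each source's fraction of the total variance."""
    total = sum(components.values())
    return {name: value / total for name, value in components.items()}


shares = variance_shares(variance_by_source)
largest = max(shares, key=shares.get)

# Halve the variance of the largest contributor and compare the totals.
improved = dict(variance_by_source)
improved[largest] = improved[largest] / 2
total_before = sum(variance_by_source.values())
total_after = sum(improved.values())

print(largest, shares[largest])   # raw material 0.75
print(total_before, total_after)  # 12.0 7.5
```

Halving the smallest contributor instead would barely change the total, which is why the effort goes into the dominant input first.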

Related articles:

Understanding the Monster “Variance” part-1

You just can’t knock down this Monster “Variance” —- Part-3


## Understanding the Monster “Variance” part-1

Variance is one of the ways of measuring the variability in a data set. It helps us understand how the data is arranged around the mean. In order to calculate it, we first need the deviation of each observation in the data set from the mean.

For example, the following is the time I took to reach the office on each day of the week. The deviation of each observation from the mean time is given below.

The next step is to calculate the average deviation from the mean using the well-known formula for an average: the sum of the deviations divided by the number of observations.

Note that the sum of all positive deviations equals the sum of all negative deviations, which shows that the mean is the balance point of the data set. As a result the sum of all deviations is zero, so we need some other way to calculate the average deviation about the mean.

In order to avoid the issue, a very simple idea was used

Negative number → Square of negative number → positive number → square root of this number → parent number

Hence the squares of all the deviations are calculated and summed up to give the sum of squares (simply SS) [1]. This SS is then divided by its degrees of freedom (n − 1 for a sample of n observations) to give the variance s² around the mean.[2] The square root of this variance gives the standard deviation s, the most common measure of variability.
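The steps above can be sketched in a few lines of Python. The commute times are made up for illustration (they are not the values from the table above), and the sample variance is taken as SS divided by n − 1, which is what Excel's =VAR.S does.

```python
import math

# Hypothetical commute times in minutes (not the article's data).
times = [30.0, 42.0, 35.0, 50.0, 43.0]

n = len(times)
mean = sum(times) / n

# Deviation of each observation from the mean; these always sum to zero.
deviations = [t - mean for t in times]
sum_of_deviations = sum(deviations)

# Squaring removes the signs, so the squared deviations can be accumulated.
ss = sum(d ** 2 for d in deviations)

# Sample variance: SS divided by its degrees of freedom (n - 1).
sample_variance = ss / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(mean, sum_of_deviations)  # 40.0 0.0
print(ss, sample_variance)      # 238.0 59.5
print(round(sample_sd, 2))      # 7.71
```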

What it physically means is that, on average, the data deviates 7.42 units, or simply one standard deviation (±1s), on either side of the mean in the given data set.

The above discussion is about the sample standard deviation, represented by s. For a population, the variance is represented by σ² and the standard deviation by σ.

The sample variance s² is the estimator of the population variance σ². The standard deviation is easier to interpret than the variance because the standard deviation is measured in the same units as the data.

[1] Popularly known as the sum of squares, this is the most widely used term in ANOVA and regression analysis.

[2] SS divided by its degrees of freedom → mean sum of squares (mean square, MS; called MSE for the error term). These concepts appear in ANOVA and regression analysis.

Related articles:

Why it is so Important to Know the Monster “Variance”? — part-2

You just can’t knock down this Monster “Variance” —- Part-3
