It is very important to understand the concept of CI/PI/TI before we can understand the reasons for OOS.
Let’s start from following situation
You have to reach the office before 9:30 AM. Now tell me how confident are you about reaching the office exactly between
(A) 9:10 to 9:15 (hmm…, such a narrow range, I am ~90% confident)
(B) 9:05 to 9:20 (a-haa.., now I am 95% confident)
(C) 9:00 to 9:25 (this is very easy, I am almost 99% confident)
The point to be noted here is that , your confidence increases with widening time interval (remember this for rest of the discussion).
More important thing is that, it is difficult to estimate the exact arrival time, but we can say with some confidence that my arrival time would be between some time interval.
Say my arrival time for last five days (assuming all other factors remains constant) was 9:17 AM, so I can say with certain confidence (say 95%) that my arrival time would be given by
Average arrival time on (say 5 days) ± margin of error
The confidence we are showing is called as confidence level and the interval estimated by above equation at a given confidence level is called as CONFIDENCE INTERVAL (CI). This confidence interval may or may not contain my mean arrival time.
Now let’s go a manufacturing scenario
We all are aware of the diagram given below, the critical quality attribute (CQA or y) of any process is affected by many inputs like critical material attribute (CMA), critical process parameter (CPP) and other uncontrollable factors.
Since, CQAs are affected by CPPs and CMAs, it is said that CQA or any output Y is a function of X (X = CPPs/CMAx).
The relationship between Y and X is given by following regression equation
Following points worth mentioning are
- Value of Y depends on the value of Y, it means that if there is deviation in X then there will be a corresponding deviation in Y. e.g. if the level of any impurity (y) is influenced by the temperature then any deviation in impurity level will be attributed to the change in temperature (x).
- If you hold X constant at some value and performs the process many times (say 100) then all 100 products (Y) would not be of same quality because of inherent variation/noise in the system which in turn is because of other uncontrollable factor. That’s why we have error term in our regression equation. If error term becomes zero, then the relationship would be described perfectly by a straight line y = mx + C. In this condition the regression line gives expected value of Y, represented by E(Y) = b0+b1X1.
As we have seen that there will be a variation in Y even if you hold X constant. Hence, the term ‘expected value of Y’ represents the average value of Y for a given value of X.
It’s fine that for a given value of X, there will be a range of Y values because of inherent variation/noise in the process and the average of Y values is called as expected value of Y for a given value of X, but, tell how this is going to help me in investigating OOS/OOT?
Let’s come to the point, assume that we have manufactured one million tablets of 500 mg strength with a mixing time of 15 minutes (= x), Now I want to know the exact mean strength of all the tablets in the entire batch?
In statistical terms,
It’s not possible to estimate the exact mean strength of all the tablets in the entire batch as it would require destructive analysis of the entire one million tablets.
Then, what is the way out? How we can estimate the mean strength of the entire batch?
Best thing we can do is to take out a sample and analyze it and based on the sample mean strength, we can make an intelligent guess about the mean strength of the entire batch … but it would be with some error, as we are using sample for the estimation. This error is called as sampling error. The sample data would give an interval that may contain the population mean is given by
Sample mean ± margin of error = confidence interval (CI)
The term “Sample mean ± margin of error ” is called as confidence interval which may or may not contains the population mean.
It is unlikely that two samples from a given population will yield identical confidence intervals (CI), it means that every sample would provide a different interval but, if we repeat the sampling many times and calculate all CI, then a certain percentage of the resulting confidence intervals would contain the unknown population parameter. The percentage of these CI that contain the parameter is called as confidence level of the interval. The interval estimated by the sample is called as confidence interval (CI). This CI is for a given value of X. This CI will change, with change in X.
Note: Don’t get afraid of the formulas, we will we covering it latter
If 100 samples are withdrawn then we can have following confidence level
A 90% confidence level would indicate that the confidence interval (CI) generated by 90 samples (out of 100) would contain the unknown population parameter.
A 95% confidence level indicates that the CI estimated by 95 samples (out of 100) would contain the unknown population parameter.
To summarize, we can estimate the population mean by using confidence interval with certain degree of confidence level.
It’s fine that CI helps me in determining the range within which there is 95% or 99% probability of finding the mean strength of the entire batch. But I have an additional issue, I am also interested in knowing the number of tablets (out of one million tablets) that would be bracketed by this interval or any other interval and how many are outside this interval? This will help me in determining the failure rate once we compare this interval with customer’s specifications.
More precisely we want to know the interval which would contain the 99% of the tablets with desired strength and how confident we are about this interval that it will contain 99% of the population?
If we can get this interval, we can compare it with the customer’s specification which in turn would tell me something about the process capability. How this can be resolved?
Let’s understand the problem once again
If we understood the issue correctly, then we want to estimate an interval (with required characteristics) based on the sample data that will cover say 99% or 95% of the population and then we want to overlap this interval with the customer’s specification to check the capability of the process. This is represented by scenario-1 and scenario-2 (ideal) in the figure given below.
Having understood the issue, the solution lies in calculating another interval known as Tolerance Interval for the population with a desired characteristics (Y) for a given value of process parameter X.
Tolerance Interval: this interval captures the values of a specified proportion of all future observations of the response variable for a particular combination of the values of the predictor variables with some high confidence level.
We have seen that CI width is entirely due to the sampling error. As the sample size increases and approaches the entire population size, the width of the confidence interval approaches zero. This is because the term “margin of error” would become zero.
In contrast, the width of a tolerance interval is due to both sampling error and variance in the population. As the sample size approaches the entire population, the sampling error diminishes and the estimated percentiles approach the true population percentiles.
e.g. A 95% tolerance interval that captures 98 % of the population of a future batch of the tablets at a mixing time of 15 minutes is 485.221 to 505.579 (this is Y).
Now, if customer’s specification for the tablet strength is 497 to 502 then we are in trouble (representing scenario-1 in above figure) because, we need to work on the process (increase the mixing time) to reduce the variability.
Let’s assume that we increased the mixing time to 35 minutes and as a result, 95% tolerance interval which captures 99% of the population is given by 498.598 to 501.902. Now we are comfortable with the customer’s specification (scenario-2 in above figure). Hence, we need to blend the mixture for 35 minutes before compressing it into tablets.
We need to be careful while understanding the tolerance interval as it contains two types of percentage terms. The first one, 95% is the confidence level and the second term i.e. 98% is the proportion of the total population with required quality attributes that we want to bracket by the tolerance interval for a constant mixing time of 5 minutes.
To summarize: in order to generate tolerance intervals, we must specify both the proportion of the population to be covered and a confidence level. The confidence level is the likelihood that the interval actually covers the proportion.
This is what we wanted during the product development.
In next post we try to clarify the confusion that we have created in this post by a real time example. So, keep visiting us
Note on Regression Equation:
Regression line represents the expected value of y = E(yp) for a given value of x = xn. Hence, the point estimate of y for given value of x = xn s given by
xn = given value of x
yn = Value of output y corresponding to xn
E(yp) = mean or expected value of y for given value of x = xn, it denotes the unknown mean value of all y’s where x = xn.
Theoretically, is the point estimate of E(yp) hence should be equal. But in general it seldom happens. If we want to measure, how close the true mean value E(yp) is to the point estimator, then we need to measure the standard deviation of for given value xp.
Confidence interval for the expected value E(yp) is given by
Why we need this equation right now? (I don’t want you to get terrified!)but, if you focus on the numerator part of the standard deviation formula, then one important observation is that if
then the standard deviation would be minimum and as you move away from the mean, the standard deviation goes on increasing. It implies that the CI would be narrower at and it would widen as you move away from the mean.
Hence, the width of the CI depends on the value of CPP (x)