###### Abstract

**Since we are entering the technical/statistical part of the subject, it is better to understand the concept first.**

For many business decisions, we need to calculate the likelihood or probability that an event occurs. A histogram together with the relative frequencies of a dataset can be used for this to some extent. But then, for every problem we come across, we need to draw the histogram and compute the relative frequencies in order to find the probability from the area under the curve (AUC).

To overcome this limitation, the standard normal distribution (also called the Z-distribution, a special case of the Gaussian distribution) was developed. The AUC, or probability, between any two points on this distribution is well documented in statistical tables and can easily be found using an Excel sheet.

But in order to use the standard normal distribution table, we need to convert the parent dataset (irrespective of its unit of measurement) into the standard normal distribution using the Z-transformation. Once that is done, we can look up the probabilities in the standard normal distribution table.

*From my experience, books in the category “statistics for business & economics” are much better for understanding Six Sigma concepts than pure statistics books. Try any of these books as a reference guide.*

**Introduction**

Let’s understand this with an example.

A company is drafting a job description for a manager-level position, and the most important criterion is the years of experience a candidate should possess. They collected a sample of ten managers from their company; the data is tabulated below along with its histogram.

As an HR person, I want to know the mean years of experience of a manager and the various probabilities discussed below.

Average experience = 3.9 years

What is the probability that X ≤ 4 years?

What is the probability that X ≤ 5 years?

What is the probability that 3 < X ≤ 5 years?

In order to calculate the above probabilities, we need to calculate the relative frequency and cumulative frequency

Now we can answer above questions

What is the probability that X ≤ 4 years? = 0.7 (see cumulative frequency)

What is the probability that X ≤ 5 years? = 0.9

What is the probability that 3 < X ≤ 5 years? = (probability X ≤ 5) – (probability X ≤ 3) = 0.9 - 0.3 = 0.6, i.e. 60% of the managers have more than 3 and up to 5 years of experience.

**Area under the curve (AUC) as a measure of probability:**

Width of a bar in the histogram = 1 unit

Height of the bar = frequency of the class

Area under the curve for a given bar = 1 × frequency of the class

Total area under the curve (AUC) = total area under all bars = 1×1+1×2+1×4+1×2+1×1 = 10

Fraction of the total area for the class 3 < x ≤ 5 = (AUC of 3^{rd} class + AUC of 4^{th} class) / total AUC = (4+2)/10 = 0.6 = probability of finding x between 3 and 5 (excluding 3)
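The frequency-based probabilities above can be reproduced with a few lines of Python. This is only a sketch: the class frequencies (1, 2, 4, 2, 1) come from the area calculation above, while the class upper bounds and the helper name `prob_at_most` are assumptions made for illustration.

```python
from itertools import accumulate

# Frequencies of the five histogram classes, as used in the area calculation.
frequencies = [1, 2, 4, 2, 1]
# ASSUMPTION: upper bound (in years) of each class, chosen for illustration.
upper_bounds = [2, 3, 4, 5, 6]

total = sum(frequencies)  # total AUC when every bar has width 1
relative = [f / total for f in frequencies]
cumulative = [c / total for c in accumulate(frequencies)]


def prob_at_most(x):
    """Return P(X <= x) for x equal to one of the class upper bounds."""
    return cumulative[upper_bounds.index(x)]


p_le_4 = prob_at_most(4)
p_le_5 = prob_at_most(5)
p_3_to_5 = prob_at_most(5) - prob_at_most(3)

print(round(p_le_4, 2), round(p_le_5, 2), round(p_3_to_5, 2))
```

Note that the helper only works at class boundaries, which is exactly the limitation of the histogram approach.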

Now, what about the probability p(3.2 < x ≤ 4.3)? It is difficult to calculate by this method, as it requires the use of calculus.

*Yes, we can use calculus for calculating various probabilities or AUC for this problem. Are we going to do this whole exercise again and again for each and every problem we come across?*

*With God’s grace, our ancestors gave us the solution in the form of the Z-distribution or standard normal distribution, where the AUC between any two points is already documented.*

The standard normal distribution is widely used in scientific measurements and for drawing statistical inferences. It is represented by a perfectly symmetrical, bell-shaped curve.

The standard normal probability distribution has the following characteristics:

- A normal curve is defined by two parameters, the mean µ and the standard deviation σ, which determine the location and shape of the distribution. For the standard normal curve, µ = 0 and σ = 1.
- The highest point on the normal curve is at the mean which is also the median and mode.
- The normal distribution is symmetrical and tails of the curve extend to infinity i.e. it never touches the x-axis.
- Probabilities of the normal random variable are given by the AUC. The total AUC for normal distribution is 1. The AUC to the right of the mean = AUC to the left of mean = 0.5.
- Percentage of observations within a given interval around the mean in a standard normal distribution is shown below
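As a rough check on these properties, the area within ±1σ, ±2σ and ±3σ of the mean can be computed with Python's standard-library `statistics.NormalDist`. The helper name `area_within` is an illustrative assumption, not something from the original article.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


coverage = {k: area_within(k) for k in (1, 2, 3)}
for k, area in coverage.items():
    print(f"within ±{k}σ of the mean: {area:.4f}")

# Symmetry: half of the total area lies on each side of the mean.
left_half = std_normal.cdf(0)
```

The three values are the well-known figures of roughly 68%, 95% and 99.7%.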

The AUC for the standard normal distribution has been calculated for a range of z-values, as the cumulative probability p(Z ≤ z), and is available in tables that can be used for calculating probabilities.

*Note: be careful whenever you use such a table, as some tables give the area for ≤ z and some give the area between two z-values.*

Let’s try to calculate some of the probabilities using above table

**Problem-1:**

Probability p(z ≥ 1.25). This problem is depicted below

Look for z = 1.2 in the vertical column and then for 0.05 (the second decimal place) in the horizontal row of the z-table: p(z ≤ 1.25) = 0.8944

*Note! The z-distribution table given above gives the cumulative probability p(z ≤ 1.25), but here we want p(z ≥ 1.25). Since the total probability or AUC = 1, p(z ≥ 1.25) is given by 1 - p(z ≤ 1.25).*

*Therefore*

p(z ≥ 1.25) = 1 - p(z ≤ 1.25) = 1 - 0.8944 = 0.1056

**Problem-2:**

Probability p(z ≤ -1.25). This problem is depicted below

*Note! The z-distribution table above doesn’t contain -1.25, but p(z ≤ -1.25) = p(z ≥ 1.25) because the standard normal curve is symmetrical.*

*Therefore*

Probability p(z ≤ -1.25) = 0.1056

**Problem-3:**

Probability p(-1.25 ≤ z ≤ 1.25). This problem is depicted below

Since the total AUC is 1, this can be calculated by subtracting the AUC of the two tails (the yellow region) from one.

p(-1.25 ≤ z ≤ 1.25) = 1- {p(z ≤ -1.25) + p(z ≥ 1.25)} = 1 – (2 x 0.1056) = 0.7888

From the above discussion, we learnt that a standard normal distribution table (which is readily available) can be used for calculating probabilities.
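For readers without a table at hand, the three problems can be checked with `statistics.NormalDist` from the Python standard library, whose `cdf` method plays the role of the "≤ z" table lookup. This is a sketch for verification only; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Table lookup equivalent: cumulative probability p(z <= 1.25).
p_le_125 = z.cdf(1.25)

# Problem 1: upper tail via the complement rule.
p_ge_125 = 1 - p_le_125

# Problem 2: lower tail; equals the upper tail by symmetry.
p_le_minus_125 = z.cdf(-1.25)

# Problem 3: central area between -1.25 and 1.25.
p_between = z.cdf(1.25) - z.cdf(-1.25)

print(f"{p_le_125:.4f} {p_ge_125:.4f} {p_le_minus_125:.4f} {p_between:.4f}")
```

The central area may differ from the table-based 0.7888 in the fourth decimal place, because the table values are rounded before they are subtracted.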

*Now comes the real problem! Somehow I have to convert my original dataset into the standard normal distribution, so that calculating any probability becomes easy. In simple words, my original dataset has a mean of 3.9 years with σ = 1.37 years, and we need to convert it into the standard normal distribution with a mean of 0 and σ = 1.*

Any normal random variable x with mean µ and standard deviation σ is converted to the standard normal distribution by the z-transformation, and the value so obtained is called the z-score:

$$z = \frac{x - \mu}{\sigma}$$

*Note that the numerator in the above equation is the distance of a data point from the mean. Dividing this distance by σ gives the distance of the data point from the mean in units of σ, i.e. we can now say that a particular data point is 1.25σ away from the mean. The data has become unitless!*

Let’s do it for the above example discussed earlier

*Note: The Z-distribution table is used only when the number of observations is ≥ 30. Here we are using it to demonstrate the concept; strictly, we should be using the t-distribution in this case.*

We can say that managers with 4 years of experience are 0.073σ away from the mean, on the right-hand side (z = 0.073), whereas managers with 3 years of experience are 0.657σ away from the mean, on the left-hand side (z = -0.657).
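The two z-scores quoted above follow directly from the z-transformation with the stated mean of 3.9 years and σ = 1.37 years. A minimal sketch, where the function name `z_score` is an illustrative assumption:

```python
def z_score(x, mu, sigma):
    """Distance of x from the mean mu, expressed in units of sigma."""
    return (x - mu) / sigma


MEAN_YEARS = 3.9  # mean experience stated in the article
SD_YEARS = 1.37   # standard deviation stated in the article

z_4_years = z_score(4, MEAN_YEARS, SD_YEARS)
z_3_years = z_score(3, MEAN_YEARS, SD_YEARS)

print(round(z_4_years, 3), round(z_3_years, 3))
```

A positive result lies to the right of the mean and a negative one to the left.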

Now, if you look at the distribution of the Z-scores, it resembles the standard normal distribution with mean = 0 and standard deviation =1.

*But one question still needs to be answered: what is the advantage of converting a given dataset into the standard normal distribution?*

There are three advantages. First, it enables us to calculate the probability between any two points almost instantly. Second, once you convert your original data into the standard normal distribution, you end up with a unitless distribution (the numerator and denominator of the Z-transformation formula have the same units). This makes it possible to compare an orange with an apple. For example, suppose I wish to compare the variation in employees’ salaries with the variation in their years of experience. Since salary and experience have different units of measurement, they cannot be compared directly, but once both distributions are converted to the standard normal distribution, we can compare them (both are now unitless).

The third advantage is that, while solving problems, we needn’t convert every data point to a z-score, as explained by the following example.

*One hundred historical batches from the plant have given a mean yield of 88% with a standard deviation of 2.1. Now I want to know various probabilities.*

*Probability of batches having a yield between 85% and 90%*

Step-1: Transform the yield (x) values into z-scores

What we are looking for is the probability of a yield between 85% and 90%, i.e. p(85 ≤ x ≤ 90). With µ = 88 and σ = 2.1, z = (85 - 88)/2.1 ≈ -1.43 and z = (90 - 88)/2.1 ≈ 0.95, so this is the same as p(-1.43 ≤ z ≤ 0.95).

Step-2: Always draw a rough standard normal curve and mark the area you are interested in

Step-3: Use the Z-distribution table for calculating probabilities.

The Z-distribution table given above can be used in following way to calculate p(-1.43 ≤ z ≤ 0.95)

Diagrammatically, p(-1.43 ≤ z ≤ 0.95) = p(z ≤ 0.95) – p(z ≤ -1.43), is represented below

p(-1.43 ≤ z ≤ 0.95) = p(z ≤ 0.95) – p(z ≤ -1.43)= 0.83-0.076 = 0.75

That is, about 75% of the batches have a yield between 85% and 90%, or there is a probability of 0.75 that the yield of a batch will be between 85% and 90%.

*It can also be interpreted as “the probability of an individual batch yield falling between 85 and 90, given that the population mean is 88% with a standard deviation of 2.1”.*

*Probability of yield ≥ 90%*

What we are looking for is the probability of yield ≥ 90% i.e. p(x ≥ 90)

= p(z ≥ 0.95)

Diagrammatically, p(z ≥ 0.95) = 1 - p(z ≤ 0.95) is represented below

p(x ≥ 90) = p(z ≥ 0.95) = 1 - p(z ≤ 0.95) = 1 - 0.83 = 0.17, i.e. there is only a 17% probability of getting a yield ≥ 90%

*Probability of yield ≤ 90%*

This is very easy, just subtract p(x ≥ 90) from 1

Therefore,

p(x ≤ 90) = 1 - p(x ≥ 90) = 1 - 0.17 = 0.83, i.e. 83% of the batches would have a yield ≤ 90%.
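The three yield probabilities above can also be obtained without a table. In the sketch below, `statistics.NormalDist` is given the stated mean (88) and standard deviation (2.1) directly, so the z-transformation happens inside `cdf`; the variable names are illustrative assumptions.

```python
from statistics import NormalDist

# Yield modelled as normal with the stated mean and standard deviation.
yield_dist = NormalDist(mu=88, sigma=2.1)

# p(85 <= x <= 90): difference of two cumulative probabilities.
p_85_to_90 = yield_dist.cdf(90) - yield_dist.cdf(85)

# p(x >= 90): complement of the cumulative probability at 90.
p_ge_90 = 1 - yield_dist.cdf(90)

# p(x <= 90): the cumulative probability itself.
p_le_90 = yield_dist.cdf(90)

print(round(p_85_to_90, 2), round(p_ge_90, 2), round(p_le_90, 2))
```

Because no intermediate z-values are rounded here, the unrounded results can differ slightly from the table-based ones, but they agree to two decimal places.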

*Now let’s work the problem in reverse: I want to know the yield that corresponds to a cumulative probability of 0.85.*

Graphically it can be represented as

Since the table we are using gives the probability for values ≤ z, we first need to find the z-value corresponding to a probability of 0.85. Let’s look into the z-distribution table and find the probability closest to 0.85.

The probability of 0.8508 corresponds to a z-value of 1.04.

Now that we have the z-value of 1.04, we need to find the corresponding x-value (i.e. yield) using the Z-transformation formula, 1.04 = (x - 88)/2.1

Solving for x: x = 88 + 1.04 × 2.1

*x = 90.18*

Therefore, there is a 0.85 probability of getting a yield ≤ 90.18% (as the z-distribution table we are using gives the probability for ≤ z); hence, there is only a 0.15 probability that the yield will be greater than 90.18%.
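The reverse lookup can be sketched in Python with `NormalDist.inv_cdf`, which returns the value below which a given fraction of the distribution lies. The variable names are illustrative; the mean and standard deviation are the ones stated for the yield example.

```python
from statistics import NormalDist

MEAN_YIELD = 88.0
SD_YIELD = 2.1

# z-value whose cumulative probability p(Z <= z) is 0.85.
z_85 = NormalDist().inv_cdf(0.85)

# Invert the z-transformation: x = mu + z * sigma.
x_85 = MEAN_YIELD + z_85 * SD_YIELD

# Equivalent one-step lookup on the yield distribution itself.
x_85_direct = NormalDist(MEAN_YIELD, SD_YIELD).inv_cdf(0.85)

print(round(z_85, 2), round(x_85, 2))
```

The exact z-value is slightly below the table's 1.04 (the table entry 0.8508 is only the closest match to 0.85), but the yield still rounds to 90.18.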

Above problem can be represented by following diagram

**Exercise:**

Historical data shows that the average time taken to complete the BB exam is 135 minutes with a standard deviation of 15 minutes.

Find the probability that

- Exam is completed in less than 140 minutes
- Exam is completed between 135 and 145 minutes
- Exam takes more than 150 minutes

**Summary:**

This article shows the limitations of the histogram and relative frequency methods for calculating probabilities: they have to be redrawn for every problem. To overcome this, the **standard normal distribution** is adopted as a standardized method, where the AUC between any two points on the curve gives the corresponding probability, which can easily be found using an Excel sheet or a z-distribution table. The only thing we need to do is convert the given data into the standard normal distribution using the Z-transformation. This also enables us to compare two unrelated things, as the Z-distribution is unitless with mean = 0 and standard deviation = 1. If the population standard deviation is known, we can use the z-distribution; otherwise we have to work with the sample’s standard deviation and use Student’s t-distribution.