Click here to Skip to main content
15,888,461 members
Articles / Productivity Apps and Services / Microsoft Office / Microsoft Excel
Tip/Trick

Statistical Outliers Detection: Worksheet Solution

Rate me:
Please Sign up or sign in to vote.
4.89/5 (7 votes)
23 Mar 2019CPOL3 min read 85K   7   8
Statistical Outliers detection in Microsoft Excel worksheet using Median() and array formula

Theoretical Background

"Outliers" are defined as numeric values in any random data set, which have an unusually high deviation from either the statistical mean (average) or the median value. In other words, these numbers are either relatively very small or too big. Detecting the outliers in a data set represents a complex statistical problem, with a corresponding variety of different methodologies and computational techniques as described, for example, in the NIST publication [1]. In general, finding the "Outliers" in a data set could be done by calculating the deviation for each number, expressed as either a "Z-score" or "modified Z-score" and testing it against certain predefined threshold. Z-score typically refers to number of standard deviation relative to the statistical average (in other words, it's measured in "Sigmas"). Modified Z-score applies the median computation technique to measure the deviation and in many cases provides more robust statistical detection of outliers. Mathematically, the Modified Z-score could be written (as suggested by Iglewicz and Hoaglin [1]) as:

Mi=0.6745 * (Xi -Median(Xi)) / MAD,

where MAD stands for Median Absolute Deviation. Any number in a data set with the absolute value of modified Z-score exceeding 3.5 is considered an "Outlier". Modified Z-score could be used to detect outliers in Microsoft Excel worksheet as described below.

Sample Computation of Outliers in Excel Worksheet Using Media/MAD

Step 1. Open Microsoft Excel worksheet and enter a sample set of 10 randomly selected numbers in column A, starting with the first row: 3, 1, -23, 7, 0, 12, -2, 7, 2, 1 (Note: don’t enter commas)

Step 2. In the first row of column C (in other words, C1), enter the formula:

=MEDIAN(A1:A10)

The value in this cell corresponds to the median calculated on a data set entered at step 1.

Step 3. In the second row of column C (in other words, C2), enter the array formula:

{=MEDIAN(ABS(MEDIAN(A1:A10)-A1:A10))}

The value in this cell (C2) corresponds to MAD [1].

Step 4. Enter the formula:

=IF(0.6745*(ABS(C$1-A1)>3.5*C$2), "OUTLIER", "NORMAL")

in the first row of column B and extend it down to the 10th row. Final result of “outlier’s detection” should appear in column B, indicating two "outliers" numbers (-23 and 12) as shown below:

3 NORMAL
1 NORMAL
-23 OUTLIER
7 NORMAL
0 NORMAL
12 OUTLIER
-2 NORMAL
7 NORMAL
2 NORMAL
1 NORMAL

Sample Computation of Outliers Using STD/MEAN Values

The first method described above in based on Modified Z-score and underlying Median/MAD computation. In many cases (particularly for small sample sizes), it provides more robust statistical outliers detection than a "traditional" z-score, which implies the detection based on Standard Deviation and Mean (average). Just for reference purposes, the sample computation below describes the outliers detection based on "2-Sigma" z-score; in other words, any numeric value in a data set considered an "outlier" if it's absolute value of deviation from the statistical average exceeds 2*STD. Note: As mentioned above, the first method is typically more robust and recommended for small sample sizes:

Step 1. Use the same Excel Worksheet with sample set of 10 sample numbers in column A as described in the previous example.

Step 2. Enter the formula: =AVERAGE(A1:A10) into the cell F1. This corresponds to the statistical mean calculated on a data set in column A.

Step 3. Enter the formula: =STDEV(A1:A10) into the cell F2. This corresponds to the Standard Deviation.

Step 4. Enter the formula: =IF(ABS(A1-F$1)>2*F$2,"OUTLIER", "NORMAL") in the first row of column E and extend it down to the 10th row. Final result of the “outlier’s detection” using statistical STD/AVERAGE on z-score level of 2 should appear in column E, indicating only one "outlier" number: -23 (see below); this method is missing a second outliers, which is 12, as found in the previous method.

NORMAL
NORMAL
OUTLIER
NORMAL
NORMAL
NORMAL
NORMAL
NORMAL
NORMAL
NORMAL

Notes

Both methods of statistical detection of Outliers discussed above could be easily implemented within .NET Framework. Suggested MS Excel implementation based upon worksheet functions was chosen mostly to serve the didactic/demo purposes, providing the maximum level of simplicity and clarity. For more details on statistical outliers detection methodology and computation technique, you could refer to the reference [1].

References

  1. Detection of Outliers

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Engineer
United States United States
Dr. Alexander Bell (aka DrABell), a seasoned full-stack DevOps and Data Engineer holds PhD in Electrical and Computer Engineering, authored 37 inventions and published 100+ technical articles, including those popular at CodeProject totaling 4M+ views. Alex pioneered AI/NLP, Cloud development, .NET/Java technology stacks, advanced SQL extensions, HTML5/CSS3 and other important Web technologies; developed multiple award-winning Web/Win apps submitted to App Innovation Contests (AIC 2012/2013). Currently focused on Microsoft Azure Cloud and GitHub Copilot AI-enabled DevOps.

  1. Quiz Engine powered by Azure Cloud (Dev Challenge article)
  2. 'enRoute': Real-time NY City Bus Tracking Web App (IoT on Azure)
  3. Azure web app: Engineering Calculator VOLTMATTER
  4. Azure: NYC real-time bus tracking app
  5. HTML5/CSS3 graphic enhancement: buttons, inputs
  6. Aggregate Product function extends SQL
  7. HTML5 Tables Formatting: Alternate Rows, Color Gradients, Shadows
  8. YouTube™ API for ASP.NET

Comments and Discussions

 
QuestionIs it possible to define "small sample size"? Pin
Burnhard29-Jun-21 1:30
Burnhard29-Jun-21 1:30 
QuestionSmall error Pin
walobu18-Feb-15 14:37
walobu18-Feb-15 14:37 
GeneralRe: Thanks a bunch, ASP-DOT-NET-DEV-U-THE-BEST! :) Actually, I t... Pin
DrABELL30-Jun-11 13:37
DrABELL30-Jun-11 13:37 
GeneralReason for my vote of 5 A robust, simple and efficient metho... Pin
PivotTableCell28-Jun-11 9:22
PivotTableCell28-Jun-11 9:22 
GeneralRe: Hi Alex, You are absolutely right: this very robust yet simp... Pin
DrABELL28-Jun-11 14:03
DrABELL28-Jun-11 14:03 
GeneralHi Walt, Thanks for your note. Let's me clarify couple thing... Pin
DrABELL28-Jun-11 3:20
DrABELL28-Jun-11 3:20 
GeneralIt appears that you are assuming a normal distribution, with... Pin
Dr.Walt Fair, PE25-Jun-11 16:22
professionalDr.Walt Fair, PE25-Jun-11 16:22 
It appears that you are assuming a normal distribution, with the median and mean equivalent and looking at the mean absolute deviation as an indicator of an outlier. Correct me if I misunderstand.

If we're going to use Excel, what is the advantage over using the AVERAGE() and STDEV() functions and comparing to a standardized normal variable? If we aren't using Excel, I can see that it's easier to find a median than an average and standard deviation, but in Excel the effort isn't much different for reasonable sized data sets.

I think you should also mention that this probably would be very misleading for skewed distributions, such as log-normal.
GeneralRe: FYI, Walt, DrABELL posted a comment but did not post it as a... Pin
AspDotNetDev30-Jun-11 8:58
protectorAspDotNetDev30-Jun-11 8:58 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.