Statistics for Beginners in Excel – Outliers and Robustness

(Basic Statistics for Citizen Data Scientist)

Outliers and Robustness

One problem that we face in analyzing data is the presence of outliers, i.e. a data element that is much bigger or much smaller than the other data elements.

For example, the mean of the sample {2, 3, 4, 5, 6} is 4, while the mean of {2, 3, 4, 5, 60} is 14.4. The appearance of the 60 completely distorts the mean in the second sample. Some statistics, such as the median, are more resistant to such outliers. In fact, the median for both samples is 4.

For this example it is obvious that 60 is a potential outlier.

Excel Function: One approach for dealing with outliers is to throw away data that is either too big or too small. Excel provides the TRIMMEAN function for dealing with this issue.

TRIMMEAN(R, p) – calculates the mean of the data in the range R after first throwing away p% of the data, half from the top and half from the bottom. If R contains n data elements and k = the largest whole number ≤ np/2, then the k largest items and the k smallest items are removed before calculating the mean.

For example, suppose R = {5, 4, 3, 20, 1, 4, 6, 4, 5, 6, 7, 1, 3, 7, 2}. Then TRIMMEAN(R, 0.2) works as follows. Since R has 15 elements, k = INT(15 * .2 / 2) = 1. Thus the largest element (20) and the smallest element (1) are removed from R to get R′ = {5, 4, 3, 4, 6, 4, 5, 6, 7, 1, 3, 7, 2}. TRIMMEAN now returns the mean of this range, namely 4.385 instead of the mean of R which is 5.2.

A related approach is to use Winsorized samples, in which the trimmed values are replaced by the remaining highest and lowest values. Consider the following sample:

4, 6, 10, 14, 16, 19, 22, 23, 25, 27, 27, 31, 37, 38, 40, 44, 45, 48, 50, 80

A 10% trimmed sample would simply remove the two lowest and two highest elements (i.e. 4, 6, 50, 80). A 10% Winsorized sample replaces the two lowest elements by the third lowest and the two highest by the 3rd highest, resulting in the following data set:

10, 10, 10, 14, 16, 19, 22, 23, 25, 27, 27, 31, 37, 38, 40, 44, 45, 48, 48, 48

Observation: Since 4 data elements have been replaced, the degrees of freedom of any statistical test needs to be reduced by 4.

Real Statistics Functions: The Real Statistics Resource Pack supplies the following functions:

TRIMDATA(R1, p): array function which returns a column range equivalent to R1 after removing the lowest and highest 100p/2 % of the data values.

WINSORIZE(R1, p): array function which returns a column range which is the Winsorized version of R1 replacing the lowest and highest 100p/2 % of the data values.

WINMEAN(R1, p) = Winsorized mean of the data in range R1 replacing the lowest and highest 100p/2 % of the data values.

Example 1: Find the trimmed and Winsorized data for p = 30% for the data in range A4:A23 of Figure 1. Also find the trimmed and Winsorized means.

 

Trimmed and Winsorized data

Figure 1 – Trimmed and Winsorized Data

 

Range C4:C23 contains the trimmed data in range A4:A23 using the formula

=TRIMDATA(A4:A23,.3)

The trimmed mean (cell C24) can be calculated using either of the formulas

=TRIMMEAN(A4:A23,.3) or =AVERAGE(C4:C17)

Range E4:E23 contains the Winsorized data in range A4:A23 using the formula

=WINSORIZE(A4:A23,.3)

The Winsorized mean (cell E24) can be calculated using either of the formulas

=WINMEAN(A4:A23,.3) or =AVERAGE(E4:E23)

Real Statistics Functions: Each of the function described above can optionally take a third argument p1. In this case, the action on the lowest data values is governed by p and the action on the highest data values is governed by p1.

TRIMDATA(R1, p, p1): array function which returns a column range equivalent to R1 after removing the lowest 100p % of the data values and the highest 100p1 % of the data values.

WINSORIZE(R1, p, p1): array function which returns a column range which is the Winsorized version of R1 replacing the lowest 100p % of the data values and the highest 100p1 % of the data values.

WINMEAN(R1, p, p1) = AVERAGE(WINSORIZE(R1, p, p1))

In addition, there is a new Real Statistics function which extends the Excel function TRIMMEAN, defined as follows:

TRIM_MEAN(R1, pp1) = AVERAGE(TRIMDATA(R1, p, p1))

Note that TRIM_MEAN(R1, p) = TRIMMEAN(R1, p)

 

 

 

Statistics for Beginners in Excel – Outliers and Robustness

Personal Career & Learning Guide for Data Analyst, Data Engineer and Data Scientist

Applied Machine Learning & Data Science Projects and Coding Recipes for Beginners

A list of FREE programming examples together with eTutorials & eBooks @ SETScholars

95% Discount on “Projects & Recipes, tutorials, ebooks”

Projects and Coding Recipes, eTutorials and eBooks: The best All-in-One resources for Data Analyst, Data Scientist, Machine Learning Engineer and Software Developer

Topics included: Classification, Clustering, Regression, Forecasting, Algorithms, Data Structures, Data Analytics & Data Science, Deep Learning, Machine Learning, Programming Languages and Software Tools & Packages.
(Discount is valid for limited time only)

Disclaimer: The information and code presented within this recipe/tutorial is only for educational and coaching purposes for beginners and developers. Anyone can practice and apply the recipe/tutorial presented here, but the reader is taking full responsibility for his/her actions. The author (content curator) of this recipe (code / program) has made every effort to ensure the accuracy of the information was correct at time of publication. The author (content curator) does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from accident, negligence, or any other cause. The information presented here could also be found in public knowledge domains.

Learn by Coding: v-Tutorials on Applied Machine Learning and Data Science for Beginners

Please do not waste your valuable time by watching videos, rather use end-to-end (Python and R) recipes from Professional Data Scientists to practice coding, and land the most demandable jobs in the fields of Predictive analytics & AI (Machine Learning and Data Science).

The objective is to guide the developers & analysts to “Learn how to Code” for Applied AI using end-to-end coding solutions, and unlock the world of opportunities!