(Basic Statistics for Citizen Data Scientist)
Outliers and Robustness
One problem that we face in analyzing data is the presence of outliers, i.e. a data element that is much bigger or much smaller than the other data elements.
For example, the mean of the sample {2, 3, 4, 5, 6} is 4, while the mean of {2, 3, 4, 5, 60} is 14.4. The appearance of the 60 completely distorts the mean in the second sample. Some statistics, such as the median, are more resistant to such outliers. In fact, the median for both samples is 4.
For this example it is obvious that 60 is a potential outlier.
Excel Function: One approach for dealing with outliers is to throw away data that is either too big or too small. Excel provides the TRIMMEAN function for dealing with this issue.
TRIMMEAN(R, p) – calculates the mean of the data in the range R after first throwing away p% of the data, half from the top and half from the bottom. If R contains n data elements and k = the largest whole number ≤ np/2, then the k largest items and the k smallest items are removed before calculating the mean.
For example, suppose R = {5, 4, 3, 20, 1, 4, 6, 4, 5, 6, 7, 1, 3, 7, 2}. Then TRIMMEAN(R, 0.2) works as follows. Since R has 15 elements, k = INT(15 * .2 / 2) = 1. Thus the largest element (20) and the smallest element (1) are removed from R to get R′ = {5, 4, 3, 4, 6, 4, 5, 6, 7, 1, 3, 7, 2}. TRIMMEAN now returns the mean of this range, namely 4.385 instead of the mean of R which is 5.2.
A related approach is to use Winsorized samples, in which the trimmed values are replaced by the remaining highest and lowest values. Consider the following sample:
4, 6, 10, 14, 16, 19, 22, 23, 25, 27, 27, 31, 37, 38, 40, 44, 45, 48, 50, 80
A 10% trimmed sample would simply remove the two lowest and two highest elements (i.e. 4, 6, 50, 80). A 10% Winsorized sample replaces the two lowest elements by the third lowest and the two highest by the 3rd highest, resulting in the following data set:
10, 10, 10, 14, 16, 19, 22, 23, 25, 27, 27, 31, 37, 38, 40, 44, 45, 48, 48, 48
Observation: Since 4 data elements have been replaced, the degrees of freedom of any statistical test needs to be reduced by 4.
Real Statistics Functions: The Real Statistics Resource Pack supplies the following functions:
TRIMDATA(R1, p): array function which returns a column range equivalent to R1 after removing the lowest and highest 100p/2 % of the data values.
WINSORIZE(R1, p): array function which returns a column range which is the Winsorized version of R1 replacing the lowest and highest 100p/2 % of the data values.
WINMEAN(R1, p) = Winsorized mean of the data in range R1 replacing the lowest and highest 100p/2 % of the data values.
Example 1: Find the trimmed and Winsorized data for p = 30% for the data in range A4:A23 of Figure 1. Also find the trimmed and Winsorized means.
Figure 1 – Trimmed and Winsorized Data
Range C4:C23 contains the trimmed data in range A4:A23 using the formula
=TRIMDATA(A4:A23,.3)
The trimmed mean (cell C24) can be calculated using either of the formulas
=TRIMMEAN(A4:A23,.3) or =AVERAGE(C4:C17)
Range E4:E23 contains the Winsorized data in range A4:A23 using the formula
=WINSORIZE(A4:A23,.3)
The Winsorized mean (cell E24) can be calculated using either of the formulas
=WINMEAN(A4:A23,.3) or =AVERAGE(E4:E23)
Real Statistics Functions: Each of the function described above can optionally take a third argument p1. In this case, the action on the lowest data values is governed by p and the action on the highest data values is governed by p1.
TRIMDATA(R1, p, p1): array function which returns a column range equivalent to R1 after removing the lowest 100p % of the data values and the highest 100p1 % of the data values.
WINSORIZE(R1, p, p1): array function which returns a column range which is the Winsorized version of R1 replacing the lowest 100p % of the data values and the highest 100p1 % of the data values.
WINMEAN(R1, p, p1) = AVERAGE(WINSORIZE(R1, p, p1))
In addition, there is a new Real Statistics function which extends the Excel function TRIMMEAN, defined as follows:
TRIM_MEAN(R1, p, p1) = AVERAGE(TRIMDATA(R1, p, p1))
Note that TRIM_MEAN(R1, p) = TRIMMEAN(R1, p)
Statistics for Beginners in Excel – Outliers and Robustness
Disclaimer: The information and code presented within this recipe/tutorial is only for educational and coaching purposes for beginners and developers. Anyone can practice and apply the recipe/tutorial presented here, but the reader is taking full responsibility for his/her actions. The author (content curator) of this recipe (code / program) has made every effort to ensure the accuracy of the information was correct at time of publication. The author (content curator) does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from accident, negligence, or any other cause. The information presented here could also be found in public knowledge domains.
Learn by Coding: v-Tutorials on Applied Machine Learning and Data Science for Beginners
Latest end-to-end Learn by Coding Projects (Jupyter Notebooks) in Python and R:
All Notebooks in One Bundle: Data Science Recipes and Examples in Python & R.
End-to-End Python Machine Learning Recipes & Examples.
End-to-End R Machine Learning Recipes & Examples.
Applied Statistics with R for Beginners and Business Professionals
Data Science and Machine Learning Projects in Python: Tabular Data Analytics
Data Science and Machine Learning Projects in R: Tabular Data Analytics
Python Machine Learning & Data Science Recipes: Learn by Coding
R Machine Learning & Data Science Recipes: Learn by Coding
Comparing Different Machine Learning Algorithms in Python for Classification (FREE)
There are 2000+ End-to-End Python & R Notebooks are available to build Professional Portfolio as a Data Scientist and/or Machine Learning Specialist. All Notebooks are only $29.95. We would like to request you to have a look at the website for FREE the end-to-end notebooks, and then decide whether you would like to purchase or not.