(Basic Statistics for Citizen Data Scientist)
Symmetry, Skewness and Kurtosis
We consider a random variable x and a data set S = {x1, x2, …, xn} of size n which contains possible values of x. The data set can represent either the population being studied or a sample drawn from the population.
Viewing S as a distribution, the skewness of S is a measure of symmetry, while the kurtosis is a measure of how heavy the tails of the data in S are.
Symmetry and Skewness
Definition 1: We use skewness as a measure of symmetry. If the distribution represented by S is perfectly symmetric, then the skewness of S is zero. If the skewness is negative, then the distribution is skewed to the left, while if the skewness is positive then the distribution is skewed to the right (see Figure 1 below for an example).
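Excel calculates the skewness of a sample S as follows:
$$\text{skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3$$
where x̄ is the mean and s is the sample standard deviation of S. To avoid division by zero, this formula requires that n > 2.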
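As a cross-check outside Excel, the sample skewness can be sketched in plain Python. This is a minimal sketch, assuming the adjusted form n/((n-1)(n-2)) · Σ((x_i − x̄)/s)^3 with s the sample standard deviation, which is the form Excel documents for SKEW. The function name `sample_skewness` is illustrative and not part of any library.

```python
import math

def sample_skewness(values):
    """Sample skewness with the n / ((n-1)(n-2)) adjustment (assumed to match SKEW)."""
    n = len(values)
    if n < 3:
        raise ValueError("sample skewness needs at least 3 values")
    mean = sum(values) / n
    # Sample standard deviation: divide the sum of squares by n - 1.
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("all values are identical; skewness is undefined")
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)

data = [2, 5, -1, 3, 4, 5, 0, 2]
print(round(sample_skewness(data), 2))  # -0.43
```

The guard on n mirrors the n > 2 requirement: with fewer than three values the factor (n-1)(n-2) is zero.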
Observation: When a distribution is symmetric, the mean = median. As a rule of thumb, when the distribution is positively skewed the mean > median, and when it is negatively skewed the mean < median, although some data sets break this pattern.
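The relationship between mean and median can be checked quickly with Python's standard `statistics` module. The two skewed samples below are hypothetical values chosen only for illustration. The third list is the data set used in Example 1, whose mean and median coincide even though its skewness is negative, so the comparison is best treated as a tendency rather than a guarantee.

```python
from statistics import mean, median

right_skewed = [1, 2, 2, 3, 10]      # hypothetical sample with a long right tail
left_skewed = [-10, -3, -2, -2, -1]  # mirror image: long left tail
example_1 = [2, 5, -1, 3, 4, 5, 0, 2]

print(mean(right_skewed), median(right_skewed))  # 3.6 2
print(mean(left_skewed), median(left_skewed))    # -3.6 -2
print(mean(example_1), median(example_1))        # 2.5 2.5
```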
Excel Function: Excel provides the SKEW function as a way to calculate the skewness of S, i.e. if R is a range in Excel containing the data elements in S then SKEW(R) = the skewness of S.
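Excel 2013 Function: There is also a population version of the skewness given by the formula
$$\text{skewness} = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma}\right)^3$$
where μ is the mean and σ is the population standard deviation of S. This version has been implemented in Excel 2013 as the function SKEW.P.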
It turns out that for range R consisting of the data in S = {x1, …, xn}, SKEW.P(R) = SKEW(R)*(n–2)/SQRT(n(n–1)) where n = COUNT(R).
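The conversion between the two versions can be verified numerically. The sketch below assumes the population skewness is the third central moment divided by the cube of the population standard deviation (divisor n). Both function names are illustrative.

```python
import math

def population_skewness(values):
    """Third central moment divided by the cubed population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sum((x - mean) ** 3 for x in values) / (n * sigma ** 3)

def sample_skewness(values):
    """Sample skewness with the n / ((n-1)(n-2)) adjustment."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)

data = [2, 5, -1, 3, 4, 5, 0, 2]
n = len(data)

# Apply the conversion factor (n - 2) / sqrt(n (n - 1)) to the sample value.
converted = sample_skewness(data) * (n - 2) / math.sqrt(n * (n - 1))

print(round(population_skewness(data), 2))                  # -0.34
print(math.isclose(converted, population_skewness(data)))   # True
```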
Real Statistics Function: Alternatively, you can calculate the population skewness using the SKEWP(R) function, which is contained in the Real Statistics Resource Pack.
Example 1: Suppose S = {2, 5, -1, 3, 4, 5, 0, 2}. The skewness of S = -0.43, i.e. SKEW(R) = -0.43 where R is a range in an Excel worksheet containing the data in S. Since this value is negative, the curve representing the distribution is skewed to the left (i.e. the fatter part of the curve is on the right). Also SKEW.P(R) = -0.34. See Figure 1.
Figure 1 – Examples of skewness and kurtosis
Observation: SKEW(R) and SKEW.P(R) ignore any empty cells or cells with non-numeric values.
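When reproducing these calculations outside Excel, the same filtering has to be done by hand. The sketch below is one way to do it, assuming empty cells arrive as `None` or empty strings and non-numeric cells as text. Dropping booleans is an additional assumption of this sketch, and `numeric_cells` is a hypothetical helper name.

```python
def numeric_cells(cells):
    """Keep only int/float entries; drop None, text and (by assumption) booleans."""
    return [
        c for c in cells
        if isinstance(c, (int, float)) and not isinstance(c, bool)
    ]

cells = [2, 5, None, -1, "n/a", 3, 4, 5, "", 0, 2]
print(numeric_cells(cells))  # [2, 5, -1, 3, 4, 5, 0, 2]
```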
Kurtosis
Definition 2: Kurtosis provides a measurement about the extremities (i.e. tails) of the distribution of data, and therefore provides an indication of the presence of outliers.
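Excel calculates the kurtosis of a sample S as follows:
$$\text{kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$
where x̄ is the mean and s is the sample standard deviation of S. To avoid division by zero, this formula requires that n > 3.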
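The sample kurtosis can be sketched in Python in the same way. This is a minimal sketch, assuming the bias-adjusted excess-kurtosis form that Excel documents for KURT: a scaled sum of fourth powers of standardized deviations minus 3(n-1)^2/((n-2)(n-3)). The name `sample_kurtosis` is illustrative.

```python
import math

def sample_kurtosis(values):
    """Bias-adjusted sample excess kurtosis (assumed to match Excel's KURT)."""
    n = len(values)
    if n < 4:
        raise ValueError("sample kurtosis needs at least 4 values")
    mean = sum(values) / n
    # Sample standard deviation: divide the sum of squares by n - 1.
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("all values are identical; kurtosis is undefined")
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

data = [2, 5, -1, 3, 4, 5, 0, 2]
print(round(sample_kurtosis(data), 2))  # -0.94
```

The guard on n mirrors the n > 3 requirement, since the factor (n-3) appears in both denominators.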
Observation: It is commonly thought that kurtosis provides a measure of peakedness (or flatness), but this is not true. Kurtosis pertains to the extremities and not to the center of a distribution.
Excel Function: Excel provides the KURT function as a way to calculate the kurtosis of S, i.e. if R is a range in Excel containing the data elements in S then KURT(R) = the kurtosis of S.
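Observation: The population kurtosis is calculated via the formula
$$\text{kurtosis} = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma}\right)^4 - 3$$
where μ is the mean and σ is the population standard deviation of S. With n = COUNT(R), this can be calculated in Excel via the formula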
=(KURT(R)*(n-2)*(n-3)/(n-1)-6)/(n+1)
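The conversion from sample to population kurtosis can be checked numerically. The sketch below assumes the population kurtosis is the fourth central moment divided by the squared population variance, with 3 subtracted when an `excess` flag is set. The function names and the flag are illustrative choices for this sketch.

```python
import math

def population_kurtosis(values, excess=True):
    """Fourth central moment over squared population variance, minus 3 if excess."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    kurt = sum((x - mean) ** 4 for x in values) / (n * variance ** 2)
    return kurt - 3 if excess else kurt

def sample_kurtosis(values):
    """Bias-adjusted sample excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

data = [2, 5, -1, 3, 4, 5, 0, 2]
n = len(data)

# Same arithmetic as the worksheet formula, with KURT(R) replaced by sample_kurtosis.
converted = (sample_kurtosis(data) * (n - 2) * (n - 3) / (n - 1) - 6) / (n + 1)

print(round(population_kurtosis(data), 3))                  # -1.114
print(math.isclose(converted, population_kurtosis(data)))   # True
```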
Real Statistics Function: Excel does not provide a population kurtosis function, but you can use the following Real Statistics function for this purpose:
KURTP(R, excess) = kurtosis of the distribution for the population in range R. If excess = TRUE (default) then 3 is subtracted from the result (the usual approach so that a normal distribution has kurtosis of zero).
Example 2: Suppose S = {2, 5, -1, 3, 4, 5, 0, 2}. The kurtosis of S = -0.94, i.e. KURT(R) = -0.94 where R is a range in an Excel worksheet containing the data in S. The population kurtosis is -1.114. See Figure 1.
Observation: KURT(R) ignores any empty cells or cells with non-numeric values.
Graphical Illustration
We now look at an example of these concepts using the chi-square distribution.
Figure 2 – Example of skewness and kurtosis
Figure 2 contains the graphs of two chi-square distributions (with different degrees of freedom df). We study the chi-square distribution elsewhere, but for now note the following values for the kurtosis and skewness:
Figure 3 – Comparison of skewness and kurtosis
Both curves are asymmetric, and skewed to the right (i.e. the fat part of the curve is on the left). This is consistent with the fact that the skewness for both is positive. But the blue curve is more skewed to the right, which is consistent with the fact that the skewness of the blue curve is larger.
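The link between degrees of freedom and skewness can be made concrete with the closed-form moments of the chi-square distribution: skewness is sqrt(8/df) and excess kurtosis is 12/df. The df values in the sketch below are hypothetical and are not necessarily the ones plotted in Figure 2.

```python
import math

def chi2_skewness(df):
    """Skewness of a chi-square distribution with df degrees of freedom."""
    return math.sqrt(8 / df)

def chi2_excess_kurtosis(df):
    """Excess kurtosis of a chi-square distribution with df degrees of freedom."""
    return 12 / df

# Hypothetical degrees of freedom, chosen only to show the trend.
for df in (2, 5, 10):
    print(df, round(chi2_skewness(df), 3), round(chi2_excess_kurtosis(df), 3))
```

Both quantities are positive for every df and shrink as df grows, so the curve with fewer degrees of freedom is the more strongly right-skewed one.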