Statistics for Beginners – Measures of Central Tendency

(Basic Statistics for Citizen Data Scientist)

Measures of Central Tendency

We consider a random variable x and a data set S = {x1, x2, …, xn} of size n which contains possible values of x. The data set can represent either the population being studied or a sample drawn from the population.

We seek a single measure (i.e. a statistic) which somehow represents the center of the entire data set S. The commonly used measures of central tendency are the meanmedian and mode. Besides the normally studied mean (also called the arithmetic mean) we also consider two other types of mean: the geometric mean and the harmonic mean.

Excel Functions: If R is an Excel range which contains the data elements in S then the Excel formula which calculates each of these statistics is shown in Figure 1.

Statistic Excel 2007 Excel 2010/2013/2016
Arithmetic Mean AVERAGE(R) AVERAGE(R)
Median MEDIAN(R) MEDIAN(R)
Mode MODE(R) MODE.SNGL(R), MODE.MULT(R)
Geometric Mean GEOMEAN(R) GEOMEAN(R)
Harmonic Mean HARMEAN(R) HARMEAN(R)

Figure 1 – Measures of central tendency

Observation: All these functions ignore any empty or non-numeric cells.

While formulas such as AVERAGE(R1) (as well as VAR(R), STDEV(R), etc. described on other webpages) ignore any empty or non-numeric cells, they return an error value if R contains an error value such as #NUM or #DIV/0!. This limitation can often be overcome by using the following approach:

=AVERAGE(IF(ISERROR(R),””,R))

This array formula returns the mean of all the cells in R1 ignoring any cells that contain an error value. Since this is an array formula, you must press Ctrl-Shft-Enter. An alternative approach is to use the following function.

Real Statistics Function: The Real Statistics Resource Pack provides the following array function:

DELErr(R) = the array of the same size and shape as R consisting of all the elements in R where any cells with an error value are replaced by a blank (i.e. an empty cell).

E.g. to find the average of a range R which may contain error cells, you can use the formula

=AVERAGE(DELErr(R))

Real Statistics Data Analysis Tool: The Remove error cells option of the Reformatting a Data Range data analysis tool described in Reformatting Tools makes a copy of the inputted range where all cells that contain error values are replaced by empty cells.

To use this capability, press Ctrl-m and double click on Reformatting a Data Range. When the dialog box shown in Figure 2 of Reformatting Tools, fill in the Input Range, choose the Remove error cells option and leave the # of Rows and # of Columns fields blank. The output will have the same size and shape as the input range.

Mean

We begin with the most commonly used measure of central tendency, the mean.

Definition 1: The mean (also called the arithmetic mean) of the data set S is defined by

mean

Excel Function: The mean is calculated in Excel using the function AVERAGE.

Example 1: The mean of S = {5, 2, -1, 3, 7, 5, 0, 2} is (2 + 5 – 1 + 3 + 7 + 5 + 0 + 2) / 8 = 2.875. We achieve the same result by using the formula =AVERAGE(C3:C10) in Figure 2.

Measures central tendency examples

Figure 2 – Excel examples of central tendency

Observation: When the data set S is a population the Greek letter µ is used for the mean. When S is a sample, then the symbol  is used.

Observation: When data is expressed in the form of frequency tables then the following property is useful.

Property 1: If  is the mean of sample {x1, x2, …, xm} and ȳ is the mean of sample  {y1, y2, …, yn} then the mean of the combined sample is

Combined sample mean

Similarly, if µx is the mean of population {x1, x2, …, xm}  and µy is the mean of population {y1, y2, …, yn} then the mean of the combined population is

Combined population mean

Real Statistics Functions: The Real Statistics Resource Pack furnishes the following array functions:

COUNTCOL(R1) = a row range which contains the number of numeric elements in each of the columns in R1

SUMCOL(R1) = a row range which contains the sums of each of the columns in R1

MEANCOL(R1) = a row range which contains the means of each of the columns in R1

COUNTROW(R1) = a column range which contains the number of numeric elements in each of the rows in R1

SUMROW(R1) = a column range which contains the sums of each of the rows in R1

MEANROW(R1) = a column range which contains the means of each of the rows in R1

Example 2: Use the COUNTCOL and MEANCOL functions to calculate the number of cells in each of the three columns in the range L4:N11 of Figure 3 as well as their means.

Means by column

Figure 3 – Count, Sum and Mean by Column

The array formula =COUNTCOL(L4:N11) produces the first result (in range L13:N13), while the formula =MEANCOL(L4:N11) produces the second result (in range L14:N14) and the formula =SUMCOL(L4:N11) produces the third result (in range L15:N15).

Remember that after entering any of these formulas you must press Ctrl-Shft-Enter.

Median

Definition 2: The median of the data set S is the middle value in S. If you arrange the data in increasing order the middle value is the median. When S has an even number of elements there are two such values; the average of these two values is the median.

Excel Function: The median is calculated in Excel using the function MEDIAN.

Example 3: The median of S = {5, 2, -1, 3, 7, 5, 0} is 3 since 3 is the middle value (i.e the 4th of 7 values) in -1, 0, 2, 3, 5, 5, 7. We achieve the same result by using the formula =MEDIAN(B3:B10) in Figure 2.

Note that each of the functions in Figure 2 ignores any non-numeric values, including blanks. Thus the value obtained for =MEDIAN(B3:B10) is the same as that for =MEDIAN(B3:B9).

The median of S = {5, 2, -1, 3, 7, 5, 0, 2} is 2.5 since 2.5 is the average of the two middle value 2 and 3 of -1, 0, 2, 2, 3, 5, 5, 7. This is the same result as =MEDIAN(C3:C10) in Figure 2.

Mode

Definition 3: The mode of the data set S is the value of the data element that occurs most often.

Example 4: The mode of S = {5, 2, -1, 3, 7, 5, 0} is 5 since 5 occurs twice, more than any other data element. This is the result we obtain from the formula =MODE(B3:B10) in Figure 2. When there is only one mode, as in this example, we say that S is unimodal.

If S = {5, 2, -1, 3, 7, 5, 0, 2}, the mode of S consists of both 2 and 5 since they each occur twice, more than any other data element. When there are two modes, as in this case, we say that S is bimodal.

Excel Function: The mode is calculated in Excel by the formula MODE. If range R contains unimodal data then MODE(R) returns this unique mode. For the first data set in Example 3 this is 5. When R contains data with more than one mode, MODE(R) returns the first of these modes. For the second data set in Example 4 this is 5 (since 5 occurs before 2, the other mode, in the data set). Thus MODE(C3:C10) = 5.

As remarked above, if there is more than one mode, MODE returns only the first, although if all the values occur only once then MODE returns an error value. This is the case for S = {5, 2, -1, 3, 7, 4, 0, 6}. Thus MODE(D3:D10) = #N/A.

Excel 2010/2013 provide an array function MODE.MULT, which is useful for multimodal data by returning a vertical list of modes. When we highlight C19:C20 and enter the array formula =MODE.MULT(C3: C10) and then press Ctrl-Alt-Enter, we see that both modes are displayed.

Excel 2010/2013 also provide the function MODE.SNGL which is equivalent to MODE.

Geometric Mean

Definition 4: The geometric mean of the data set S is calculated by

Geometric mean

This statistic is commonly used to provide a measure of average rate of growth as described in Example 5.

Example 5: Suppose the sales of a certain product grow 5% in the first two years and 10% in the next two years, what is the average rate of growth over the 4 years?

If sales in year 1 are $1 then sales at the end of the 4 years are (1 + .05)(1 + .05)(1 + .1)(1 + .1) = 1.334. The annual growth rate r is that amount such that (1+r)4 = 1.334. Thus r = 1.3341/4 – 1 = .0747.

The same annual growth rate of 7.47% can be obtained in Excel using the formula GEOMEAN(H7:H10) – 1 = .0747.

Harmonic Mean

Definition 5: The harmonic mean of the data set S is calculated by the formula

Harmonic mean

The harmonic mean can be used to calculate an average speed, as described in Example 6.

Example 6: If you go to your destination at 50 mph and return at 70 mph, what is your average rate of speed?

Assuming the distance to your destination is d, the time it takes to reach your destination is d/50 hours and the time it takes to return is d/70, for a total of d/50 + d/70 hours. Since the distance for the whole trip is 2d, your average speed for the whole trip is

image5007

This is equivalent to the harmonic mean of 50 and 70, and so can be calculated in Excel as HARMEAN(50,70), which is HARMEAN(G7:G8) from Figure 2.

 

 

Statistics for Beginners – Measures of Central Tendency

Personal Career & Learning Guide for Data Analyst, Data Engineer and Data Scientist

Applied Machine Learning & Data Science Projects and Coding Recipes for Beginners

A list of FREE programming examples together with eTutorials & eBooks @ SETScholars

95% Discount on “Projects & Recipes, tutorials, ebooks”

Projects and Coding Recipes, eTutorials and eBooks: The best All-in-One resources for Data Analyst, Data Scientist, Machine Learning Engineer and Software Developer

Topics included: Classification, Clustering, Regression, Forecasting, Algorithms, Data Structures, Data Analytics & Data Science, Deep Learning, Machine Learning, Programming Languages and Software Tools & Packages.
(Discount is valid for limited time only)

Disclaimer: The information and code presented within this recipe/tutorial is only for educational and coaching purposes for beginners and developers. Anyone can practice and apply the recipe/tutorial presented here, but the reader is taking full responsibility for his/her actions. The author (content curator) of this recipe (code / program) has made every effort to ensure the accuracy of the information was correct at time of publication. The author (content curator) does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from accident, negligence, or any other cause. The information presented here could also be found in public knowledge domains.

Learn by Coding: v-Tutorials on Applied Machine Learning and Data Science for Beginners

Please do not waste your valuable time by watching videos, rather use end-to-end (Python and R) recipes from Professional Data Scientists to practice coding, and land the most demandable jobs in the fields of Predictive analytics & AI (Machine Learning and Data Science).

The objective is to guide the developers & analysts to “Learn how to Code” for Applied AI using end-to-end coding solutions, and unlock the world of opportunities!