Statistics for Beginners in Excel – Sampling using Real Statistics

(Basic Statistics for Citizen Data Scientist)

Sampling

Excel provides a Sampling data analysis tool that can be used to create samples. The tool works by defining the population as an array in an Excel worksheet and then using the following input parameters to determine how you would like to carry out the sampling.

Input Range – Specify the range of data that contains the population of values you want to sample. Excel draws samples from the first column, then the second column, and so on.

Sampling Method – Select one of the following two sampling intervals:

  • Periodic – In this case, you specify the Period n at which you want sampling to take place. The nth value in the input range and every nth value thereafter is copied to the output column. Sampling stops when the end of the input range is reached.
  • Random – In this case, you specify the Random Number of Samples.  This number of values is drawn from random positions in the input range. A value can be selected more than once. (i.e. sampling is with replacement).

Example 1: From a population of 10 women and 10 men as given in the table in Figure 1 on the left below, create a random sample of 6 people for Group 1 and a periodic sample consisting of every 3rd woman for Group 2.

 

Random periodic samples Excel

Figure 1 – Creating random and periodic samples

 

You need to run the sampling data analysis tool twice, once to create Group 1 and again to create Group 2. For Group 1 you select all 20 population cells as the Input Range and Random as the Sampling Method with 6 for the Random Number of Samples.  For Group 2 you select the 10 cells in the Women column as Input Range and Periodic with Period 3.

Observation: The Sampling data analysis tool has a number of limitations which unfortunately reduces its usefulness. These include:

  • Only numeric data (including blank) can be used.
  • If in the example above the number of women is not equal to the number of men any blank cells will simply be treated as data and can be chosen for inclusion in a sample.
  • The Label option does not function properly and so should not be used
  • Random sampling is with replacement. As you can see from the example, the number 2 is chosen twice in the Group 1 sample.

As a result, it often better to use other approaches to create a sample. We now show how to create the Group 1 sample above without duplicates.

Example 2: Recreate Group 1 from Example 1 without allowing any duplicates.

We accomplish this by creating a worksheet as in Figure 2.

 

Random sample without replacement

Figure 2 – Creating a random sample without replacement

 

Column A consists of the data elements in the population (as taken from Figure 1). Column B consists of random numbers between 0 and 1. These are generated using the Excel function RAND(). Simply enter =RAND() in cell B4 and then highlight the range B4:B23 and enter Ctrl-D. This will place the formula =RAND() in every cell in the range B4:B23.

Finally create column C by putting the following formula in cell C4 and then copying it down (using Ctrl-D as described above) for as many rows as you want items in the sample.

=INDEX(A$4:A$23,RANK(B4,B$4:B$23))

Observation: If we wanted to generate a sample of size 6 with replacement, we would use the following formula in cell C4 instead (column B would not be necessary):

=INDEX(A$4:A$23,RANDBETWEEN(1,COUNT(A$4:A$23)))

Real Statistics Excel Functions: The Real Statistics Resource Pack provides the following useful array functions that allow you to avoid the complex syntax described above.

SHUFFLE(R1, filler, nrows, ncols): returns an nrows ⨯ ncols array with elements from R1 drawn at random without replacement (i.e. it shuffles the elements in R1). The string filler is used as a filler in case the output array has more cells than R1. This second argument is optional and defaults to the error value #N/A; ncols defaults to 1.

RANDOMIZE(R1, nrows, ncols): returns an nrows ⨯ ncols array with elements from R1 drawn at random with replacement; ncols defaults to 1.

If nrows = 0 or is omitted then the output array is the highlighted range on the active worksheet. If nrows < 0 then the output is an array of the same size and shape as R1; in both these cases, ncols is not used and can be omitted.

The Real Statistics Resource Pack also provides the following functions which support dynamic arrays:

SHUFFLES(R1): returns an array of the same size and shape as R1 with elements from R1 drawn at random without replacement (i.e. it shuffles the elements in R1).

RANDOMIZES(R1): returns an array of the same size and shape as R1 with elements from R1 drawn at random with replacement

 

Statistics for Beginners – Ranking Functions in Excel

 

Statistics for Beginners in Excel – Sampling using Real Statistics

Personal Career & Learning Guide for Data Analyst, Data Engineer and Data Scientist

Applied Machine Learning & Data Science Projects and Coding Recipes for Beginners

A list of FREE programming examples together with eTutorials & eBooks @ SETScholars

95% Discount on “Projects & Recipes, tutorials, ebooks”

Projects and Coding Recipes, eTutorials and eBooks: The best All-in-One resources for Data Analyst, Data Scientist, Machine Learning Engineer and Software Developer

Topics included: Classification, Clustering, Regression, Forecasting, Algorithms, Data Structures, Data Analytics & Data Science, Deep Learning, Machine Learning, Programming Languages and Software Tools & Packages.
(Discount is valid for limited time only)

Disclaimer: The information and code presented within this recipe/tutorial is only for educational and coaching purposes for beginners and developers. Anyone can practice and apply the recipe/tutorial presented here, but the reader is taking full responsibility for his/her actions. The author (content curator) of this recipe (code / program) has made every effort to ensure the accuracy of the information was correct at time of publication. The author (content curator) does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from accident, negligence, or any other cause. The information presented here could also be found in public knowledge domains.

Learn by Coding: v-Tutorials on Applied Machine Learning and Data Science for Beginners

Please do not waste your valuable time by watching videos, rather use end-to-end (Python and R) recipes from Professional Data Scientists to practice coding, and land the most demandable jobs in the fields of Predictive analytics & AI (Machine Learning and Data Science).

The objective is to guide the developers & analysts to “Learn how to Code” for Applied AI using end-to-end coding solutions, and unlock the world of opportunities!