(Basic Statistics for Citizen Data Scientist)
Sampling
Excel provides a Sampling data analysis tool that can be used to create samples. The tool works by defining the population as an array in an Excel worksheet and then using the following input parameters to determine how you would like to carry out the sampling.
Input Range – Specify the range of data that contains the population of values you want to sample. Excel draws samples from the first column, then the second column, and so on.
Sampling Method – Select one of the following two sampling intervals:
- Periodic – In this case, you specify the Period n at which you want sampling to take place. The nth value in the input range and every nth value thereafter is copied to the output column. Sampling stops when the end of the input range is reached.
- Random – In this case, you specify the Random Number of Samples. This number of values is drawn from random positions in the input range. A value can be selected more than once. (i.e. sampling is with replacement).
Example 1: From a population of 10 women and 10 men as given in the table in Figure 1 on the left below, create a random sample of 6 people for Group 1 and a periodic sample consisting of every 3rd woman for Group 2.
Figure 1 – Creating random and periodic samples
You need to run the sampling data analysis tool twice, once to create Group 1 and again to create Group 2. For Group 1 you select all 20 population cells as the Input Range and Random as the Sampling Method with 6 for the Random Number of Samples. For Group 2 you select the 10 cells in the Women column as Input Range and Periodic with Period 3.
Observation: The Sampling data analysis tool has a number of limitations which unfortunately reduces its usefulness. These include:
- Only numeric data (including blank) can be used.
- If in the example above the number of women is not equal to the number of men any blank cells will simply be treated as data and can be chosen for inclusion in a sample.
- The Label option does not function properly and so should not be used
- Random sampling is with replacement. As you can see from the example, the number 2 is chosen twice in the Group 1 sample.
As a result, it often better to use other approaches to create a sample. We now show how to create the Group 1 sample above without duplicates.
Example 2: Recreate Group 1 from Example 1 without allowing any duplicates.
We accomplish this by creating a worksheet as in Figure 2.
Figure 2 – Creating a random sample without replacement
Column A consists of the data elements in the population (as taken from Figure 1). Column B consists of random numbers between 0 and 1. These are generated using the Excel function RAND(). Simply enter =RAND() in cell B4 and then highlight the range B4:B23 and enter Ctrl-D. This will place the formula =RAND() in every cell in the range B4:B23.
Finally create column C by putting the following formula in cell C4 and then copying it down (using Ctrl-D as described above) for as many rows as you want items in the sample.
=INDEX(A$4:A$23,RANK(B4,B$4:B$23))
Observation: If we wanted to generate a sample of size 6 with replacement, we would use the following formula in cell C4 instead (column B would not be necessary):
=INDEX(A$4:A$23,RANDBETWEEN(1,COUNT(A$4:A$23)))
Real Statistics Excel Functions: The Real Statistics Resource Pack provides the following useful array functions that allow you to avoid the complex syntax described above.
SHUFFLE(R1, filler, nrows, ncols): returns an nrows ⨯ ncols array with elements from R1 drawn at random without replacement (i.e. it shuffles the elements in R1). The string filler is used as a filler in case the output array has more cells than R1. This second argument is optional and defaults to the error value #N/A; ncols defaults to 1.
RANDOMIZE(R1, nrows, ncols): returns an nrows ⨯ ncols array with elements from R1 drawn at random with replacement; ncols defaults to 1.
If nrows = 0 or is omitted then the output array is the highlighted range on the active worksheet. If nrows < 0 then the output is an array of the same size and shape as R1; in both these cases, ncols is not used and can be omitted.
The Real Statistics Resource Pack also provides the following functions which support dynamic arrays:
SHUFFLES(R1): returns an array of the same size and shape as R1 with elements from R1 drawn at random without replacement (i.e. it shuffles the elements in R1).
RANDOMIZES(R1): returns an array of the same size and shape as R1 with elements from R1 drawn at random with replacement
Statistics for Beginners in Excel – Sampling using Real Statistics
Disclaimer: The information and code presented within this recipe/tutorial is only for educational and coaching purposes for beginners and developers. Anyone can practice and apply the recipe/tutorial presented here, but the reader is taking full responsibility for his/her actions. The author (content curator) of this recipe (code / program) has made every effort to ensure the accuracy of the information was correct at time of publication. The author (content curator) does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from accident, negligence, or any other cause. The information presented here could also be found in public knowledge domains.
Learn by Coding: v-Tutorials on Applied Machine Learning and Data Science for Beginners
Latest end-to-end Learn by Coding Projects (Jupyter Notebooks) in Python and R:
All Notebooks in One Bundle: Data Science Recipes and Examples in Python & R.
End-to-End Python Machine Learning Recipes & Examples.
End-to-End R Machine Learning Recipes & Examples.
Applied Statistics with R for Beginners and Business Professionals
Data Science and Machine Learning Projects in Python: Tabular Data Analytics
Data Science and Machine Learning Projects in R: Tabular Data Analytics
Python Machine Learning & Data Science Recipes: Learn by Coding
R Machine Learning & Data Science Recipes: Learn by Coding
Comparing Different Machine Learning Algorithms in Python for Classification (FREE)
There are 2000+ End-to-End Python & R Notebooks are available to build Professional Portfolio as a Data Scientist and/or Machine Learning Specialist. All Notebooks are only $29.95. We would like to request you to have a look at the website for FREE the end-to-end notebooks, and then decide whether you would like to purchase or not.