(Basic Statistics for Citizen Data Scientist)
Histograms
A histogram is a graphical representation of the output of the FREQUENCY function (as described in Frequency Tables).
Example 1: Create a histogram for the data and bin selection for Example 1 from Frequency Tables.
We start by replicating the data and bin section for Example 1 in Figure 1.
Figure 1 – Data for Example 1
You can use Excel’s chart tool to graph the data in Figure 1, or alternatively you can use the Histogram data analysis tool to accomplish this directly, as described next.
Excel Data Analysis Tool: To use Excel’s Histogram data analysis tool, you must first establish a bin array (as for the FREQUENCY function described in Frequency Tables) and then select the Histogram data analysis tool. In the dialog box that is displayed you next specify the input data (Input Range) and bin array (Bin Range). You can optionally include the labels for these ranges (in which case you check the Labels check box).
For Example 1, the Input Range is A4:B14 and the Bin Range is D4:D7 (with the Labels check box unchecked). The output is displayed in Figure 2.
Observation: Caution must be exercised when creating histograms to present the data in a clear and accurate way. For most purposes it is important that the intervals be equal in size (except for an unbounded first and/or last interval). Otherwise a distorted picture of the data may be presented.
To avoid this problem equally-spaced intervals can be used. This is the approach illustrated in Example 4 of Frequency Tables using the FREQTABLE supplemental function. Alternatively, the Real Statistics Histogram with Normal Curve Overlay data analysis tool can be used.
Real Statistics Data Analysis Tool: The Histogram with Normal Curve Overlay data analysis tool provided in the Real Statistics Resource Pack can be used to create a frequency table and histogram as illustrated in the following example.
Example 2: Create a frequency table and histogram for the 22 data elements in the range A4:B14 of Figure 1 based on bins of size 15.
Enter Ctrl-m and select the Histogram with Normal Curve Overlay option. Fill in the the dialog box that appears as shown in Figure 3.
Figure 3 – Dialog box for Histogram data analysis tool
After clicking on the OK button, the output as shown in Figure 4 is displayed.
Figure 4 – Frequency Table and Histogram
Note that if 100 is inserted in the Maximum bin value (or blank) field of Figure 3 then the output would be as shown in Figure 5.
Figure 5 – Frequency Table and Histogram (revised)
If the Curve chart option is chosen as shown in the dialog box in Figure 3 then the output is as in Figure 6.
Figure 6 – Smoothed Line Curve
Observation: What is the ideal number of bins or bin width? Historically, Sturgis’s Rule was used.
This approach works pretty well for small n (up to about 200). Better approaches were supplied by Scott, namely
where s = the standard deviation, or Freedman and Diaconis, namely
where IQR is the interquartile range. The latter is probably more appropriate for data that is not normally distributed.
For Example 1, since n = 22, s = 22.43 and IQR = 21, the appropriate bin width is 28 per Scott or 15 per Freedman and Diaconis; or about 5.5 bins per Sturgis.
Observation: The histogram (or straight line chart or smoothed line curve) produced by this data analysis tool can also be overlayed by a normal curve to help determine whether the data is normally distributed. See Histogram with Normal Curve Overlay for more details.
Observation: You can also produce a frequency table and histogram of the type described in Example 3 of Discrete Probability Distributions (i.e. without specifying any bins) via the Histogram with Normal Curve Overlay data analysis tool. In this case you would leave the Bin size field blank in the dialog box shown in Figure 3.
Statistics for Beginners with Excel – Histograms
Disclaimer: The information and code presented within this recipe/tutorial is only for educational and coaching purposes for beginners and developers. Anyone can practice and apply the recipe/tutorial presented here, but the reader is taking full responsibility for his/her actions. The author (content curator) of this recipe (code / program) has made every effort to ensure the accuracy of the information was correct at time of publication. The author (content curator) does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from accident, negligence, or any other cause. The information presented here could also be found in public knowledge domains.
Learn by Coding: v-Tutorials on Applied Machine Learning and Data Science for Beginners
Latest end-to-end Learn by Coding Projects (Jupyter Notebooks) in Python and R:
All Notebooks in One Bundle: Data Science Recipes and Examples in Python & R.
End-to-End Python Machine Learning Recipes & Examples.
End-to-End R Machine Learning Recipes & Examples.
Applied Statistics with R for Beginners and Business Professionals
Data Science and Machine Learning Projects in Python: Tabular Data Analytics
Data Science and Machine Learning Projects in R: Tabular Data Analytics
Python Machine Learning & Data Science Recipes: Learn by Coding
R Machine Learning & Data Science Recipes: Learn by Coding
Comparing Different Machine Learning Algorithms in Python for Classification (FREE)
There are 2000+ End-to-End Python & R Notebooks are available to build Professional Portfolio as a Data Scientist and/or Machine Learning Specialist. All Notebooks are only $29.95. We would like to request you to have a look at the website for FREE the end-to-end notebooks, and then decide whether you would like to purchase or not.