Statistics for Beginners in Excel – Goodness of Fit

(Basic Statistics for Citizen Data Scientist)

Goodness of Fit

Basic Concepts

Observation: Suppose the random variable x has binomial distribution B(n, p) and define z as

$$z = \frac{x - np}{\sqrt{np(1-p)}}$$

By Corollary 1 of Relationship between Binomial and Normal Distributions, provided n is large enough (generally when np ≥ 5 and n(1–p) ≥ 5), z is approximately normally distributed with mean 0 and standard deviation 1.

Thus by Corollary 1 of Chi-square Distribution, z2 ~ χ2(1), where

$$z^2 = \frac{(x - np)^2}{np(1-p)}$$

Example 1: Suppose we flip a coin 10 times and obtain 9 heads and 1 tail. Is the coin fair?

We have already studied problems like this in Binomial Distribution and Hypothesis Testing using the Binomial Distribution. We can even use the normal approximation of the binomial distribution to solve such problems. This time we will use the chi-square distribution.

Let x be the random variable that counts the number of heads in n trials and define the null hypothesis as follows:

H0: the coin is fair, i.e. p = .5 (against the two-sided alternative p ≠ .5)

Since np = 10 ∙ .5 = 5 ≥ 5 and n(1–p) = 10 ∙ .5 = 5 ≥ 5, we can apply the chi-square distribution as described above.

$$z^2 = \frac{(x - np)^2}{np(1-p)} = \frac{(9 - 10 \cdot 0.5)^2}{10 \cdot 0.5 \cdot (1 - 0.5)} = \frac{16}{2.5} = 6.4$$

Now p-value = CHIDIST(6.4, 1) = 0.011412 < .05 = α, and so we reject the null hypothesis and conclude with 95% confidence that the coin is not fair. This analysis is summarized in Figure 1.
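If you want to cross-check this calculation outside Excel, here is a minimal Python sketch (assuming NumPy/SciPy are available); chi2.sf gives the right-tail probability that CHIDIST returns:

```python
from scipy.stats import chi2

n, p, x = 10, 0.5, 9                        # trials, hypothesized probability, observed heads
z2 = (x - n * p) ** 2 / (n * p * (1 - p))   # chi-square statistic with 1 degree of freedom
p_value = chi2.sf(z2, df=1)                 # right-tail probability, the analogue of CHIDIST(z2, 1)
print(z2, p_value)                          # 6.4 and about 0.0114
```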

 


Figure 1 – Chi-square test

 

Note that although the null hypothesis is two-sided (p = .5 vs. p ≠ .5), it is tested using a one-tailed chi-square test.

Observation: For each trial, let E1 = success and E2 = failure, and let obs1 = the number of observed successes and obs2 = the number of observed failures in n trials. Furthermore, let exp1 = the number of expected successes and exp2 = the number of expected failures in n trials. Using the terminology of Example 1, this means that obs1 = x, obs2 = n – x, exp1 = np and exp2 = n(1–p).

By algebra, z2 can be expressed as

$$z^2 = \frac{(x - np)^2}{np(1-p)} = \frac{(obs_1 - exp_1)^2}{exp_1} + \frac{(obs_2 - exp_2)^2}{exp_2}$$

As we have observed above, z2 ~ χ2(1). This observation amounts to what is known as the chi-square goodness of fit test for two mutually exclusive outcomes. We now look at the situation for more than two outcomes.

Observation: Suppose we have an experiment consisting of n independent trials, each with k mutually exclusive outcomes Ei, such that for each trial the probability of outcome Ei is pi. Suppose further that for each i the observed number of occurrences of outcome Ei is ni. These are the conditions for the multinomial distribution.

Our objective is to determine whether the observed frequencies n1, …, nk are consistent with the expected frequencies based on the multinomial distribution, namely np1, …, npk.

This problem is equivalent to determining whether to accept the following null hypothesis:

H0: the probability of outcome Ei is pi for all i = 1, …, k (i.e. the observed proportions ni/n are consistent with the pi)

For small samples the problem can be solved using Fisher's Exact Test, but for larger samples this becomes computationally difficult. Instead we use the maximum likelihood ratio λ, as described in Maximum Likelihood Function. The idea behind this approach is to create a model for which the probability of obtaining the observed data is maximized, and then compare this with the probability of obtaining the observed data under the null hypothesis.

We now use a theorem from advanced theoretical statistics that under certain conditions the random variable –2 ln λ, where λ is as defined in Definition 2 of Maximum Likelihood Function, has a distribution which approaches χ2(k) as n → ∞, where n = sample size and k = the number of parameters determined by the null hypothesis.

For a multinomial distribution the random variable -2 ln λ can be expressed as

$$y = -2 \ln \lambda = 2 \sum_{i=1}^{k} n_i \ln \frac{n_i}{np_i}$$

Here the ni are the observed values and the npi are the expected values (based on the multinomial distribution). Since

$$\sum_{i=1}^{k} p_i = 1$$

the null hypothesis H0 references only k – 1 parameters, and so by the theorem above, y ~ χ2(k–1) for sufficiently large n.

We summarize the above observations in Theorem 1 using the following alternative way of expressing the random variable y.

Definition 1: The maximum likelihood statistic can be expressed as:

$$y = 2 \sum_{i=1}^{k} obs_i \ln \frac{obs_i}{exp_i}$$

Theorem 1: For sufficiently large values of n, the maximum likelihood test statistic has an approximately chi-square distribution with k – 1 degrees of freedom, i.e. χ2(k–1).

Definition 2: Pearson's chi-square test statistic is defined as:

$$\chi^2 = \sum_{i=1}^{k} \frac{(obs_i - exp_i)^2}{exp_i}$$

Observation: In general, the maximum likelihood test statistic is not used directly. Instead a further approximation, Pearson's chi-square test statistic, is commonly used. For large samples the results are similar, but for small samples the maximum likelihood statistic yields better results. In the case where k = 2, Pearson's chi-square test statistic is the z2 statistic we looked at earlier on this webpage.
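To make the comparison concrete, the following Python sketch (an illustration, not part of the original example) computes both the maximum likelihood statistic of Definition 1 and the Pearson statistic of Definition 2 for the coin data of Example 1:

```python
import numpy as np
from scipy.stats import chi2

obs = np.array([9.0, 1.0])   # observed successes and failures from Example 1
exp = np.array([5.0, 5.0])   # expected counts under H0: p = .5

g  = 2 * np.sum(obs * np.log(obs / exp))   # maximum likelihood statistic (Definition 1)
x2 = np.sum((obs - exp) ** 2 / exp)        # Pearson chi-square statistic (Definition 2)

print(g,  chi2.sf(g,  df=1))   # about 7.36; differs from the Pearson value for a sample this small
print(x2, chi2.sf(x2, df=1))   # 6.4 and p = .0114, as computed earlier
```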

Theorem 2: For sufficiently large values of n, Pearson's chi-square test statistic has approximately a chi-square distribution with k – 1 degrees of freedom, i.e. χ2(k–1).

Observation: Theorem 2 is used to perform what is called goodness of fit testing, where we check to see whether the observed data correspond sufficiently well to the expected values. In order to apply such tests, the following assumptions must be met (otherwise the chi-square approximation in Theorem 2 may not be accurate), namely:

  • Random sample: Data must come from a random sampling of a population.
  • Independence: The observations must be independent of each other. This means chi-square cannot be used to test correlated data (e.g. matched pairs).
  • Cell size: k ≥ 5 and the expected frequencies expi ≥ 5. When k < 5 it is better to have even larger values for the expi. These assumptions are similar to those for the normal approximation to the binomial distribution.

Since the data is usually organized in the form of a table, the last assumption means that there must be at least 5 cells in the table and the expected frequency for each cell should be at least 5. For large values of k, a small percentage of cells with expected frequency of less than 5 can be acceptable. Even for smaller values of k this may not cause big problems, but it is probably a better choice to use Fisher's Exact Test in this case. In any event, you should avoid using the chi-square test where there is an expected frequency of less than 1 in any cell.

If the expected frequency for one or more cells is less than 5, it may be beneficial to combine one or more cells so that this condition can be met (although this must be done in such a way as to not bias the results).

Example 2: We have a die which we suspect is loaded to favor one or more numbers over the others. To test this we throw the die 60 times and get the following count for each of the 6 possible throws (as shown in the upper part of the worksheet in Figure 2):

 


Figure 2 – Data for Example 2

 

Essentially we are testing the following hypothesis about the multinomial distribution:

H0: the probability of throwing any of the six faces on the die is 1/6.

We calculate the chi-square test statistic to be 12.4 (using the formula =SUM(B7:G7) in cell H7 of Figure 2). Here cell B7 contains the formula =(B4-B5)^2/B5 (and similarly for the other cells in range B7:G7). We now apply the chi-square test with k = 6 (and so df = 5) as follows:

p-value = CHIDIST(χ2, df) = CHIDIST(12.4, 5) = .0297 < .05 = α

Since p-value < α, we reject the null hypothesis, and conclude (with 95% confidence) that the die is loaded. We can reach the same conclusion by looking at the critical value of the test statistic:

χ2-crit = CHIINV(α, df) = CHIINV(.05,5) = 11.07 < 12.4 = χ2-obs
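The same computation can be reproduced outside Excel. The Python sketch below (NumPy/SciPy assumed) uses illustrative counts, since the actual values from Figure 2 appear only in the figure; these particular counts sum to 60 and happen to give the same statistic of 12.4:

```python
import numpy as np
from scipy.stats import chi2

obs = np.array([18, 3, 7, 11, 11, 10])   # illustrative counts for the 6 faces (not the Figure 2 data)
exp = np.full(6, 60 / 6)                 # 10 expected throws per face under H0

x2 = np.sum((obs - exp) ** 2 / exp)      # Pearson chi-square statistic (12.4 here)
p_value = chi2.sf(x2, df=5)              # analogue of CHIDIST(x2, 5)
crit    = chi2.isf(0.05, df=5)           # analogue of CHIINV(.05, 5), about 11.07
print(x2, p_value, crit)
```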

Excel Function: Excel provides the following function which automates the above calculations:

CHITEST(R1, R2) = CHIDIST(χ2, df) where R1 = the array of observed data, R2 = the array of expected values, χ2 is calculated from R1 and R2 as in Definition 2 and df = the number of elements in R1 (or R2) minus 1.

The ranges R1 and R2 must both have either one row or one column, they must contain the same number of elements and all the cells in R1 and R2 must contain only numeric values.

For Example 2 we can calculate p-value = CHITEST(B4:G4,B5:G5) = .0297.
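Outside Excel, scipy.stats.chisquare plays a similar role to CHITEST: given observed and expected frequencies it returns Pearson's statistic and the p-value with df = number of cells – 1. A short sketch, using the same illustrative counts as above:

```python
from scipy.stats import chisquare

obs = [18, 3, 7, 11, 11, 10]     # illustrative counts for the 6 faces (not the Figure 2 data)
exp = [10, 10, 10, 10, 10, 10]   # expected counts under a fair die

stat, p_value = chisquare(f_obs=obs, f_exp=exp)   # df = 6 - 1 = 5, like CHITEST
print(stat, p_value)                              # 12.4 and roughly .0297
```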

Example 3: A safari park in Africa is divided into 8 zones, each containing a known population of elephants.  A sample is taken of the number of elephants found in each zone to determine whether the distribution of elephants is significantly different from what would be expected based on the known population in each zone. The table on the left of Figure 3 (columns A-C) summarizes the data:

 


Figure 3 – Data for Example 3

 

The sample consists of the 55 elephants actually recorded (obsi by zone). If we scale the known population of 205 elephants down to 55 (by multiplying the population in each zone by 55/205), we arrive at the expected number of elephants (expi) in each zone (column E). For the analysis we use the following null hypothesis:

H0: there is no significant difference between the distribution of the sample and the population distribution

We now calculate the p-value = CHITEST(B4:B11, E4:E11) = 0.82

Since p-value = .82 > .05 = α, we don’t reject the null hypothesis, and conclude there is no significant difference between the distribution of elephants in the sample among the zones compared with the known population.
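A sketch of the same analysis in Python follows; the zone populations and observed counts below are hypothetical placeholders (the actual values appear only in Figure 3). The key step is scaling the known populations down so the expected counts sum to the sample size:

```python
import numpy as np
from scipy.stats import chisquare

population = np.array([25, 25, 30, 20, 25, 30, 25, 25])  # hypothetical known population per zone (sums to 205)
observed   = np.array([ 7,  6,  9,  5,  7,  8,  7,  6])  # hypothetical sampled elephants per zone (sums to 55)

# scale the population down to the sample size to get the expected count per zone
expected = population * observed.sum() / population.sum()

stat, p_value = chisquare(f_obs=observed, f_exp=expected)  # df = 8 - 1 = 7
print(stat, p_value)
```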

Fitting data to a distribution

Observation: The chi-square goodness of fit test (as well as the maximum likelihood test) can also be applied to determine whether observed data fit a certain distribution (or curve). For this purpose a modified version of Theorem 1 or 2 can be employed as follows.

Theorem 3: Where there are m unknown parameters in the distribution or curve being fitted, the test statistic in Theorem 2 has approximately the chi-square distribution χ2(k–m–1).

Thus when fitting data to a Poisson distribution m = 1 (the mean parameter), while if fitting data to a normal distribution m = 2 (the mean and standard deviation parameters).

Example 4: A substance is bombarded with radioactive particles for 200 minutes. It is observed that between 0 and 7 hits are made in any one minute interval, as summarized in columns A and B of the worksheet in Figure 4. Thus for 8 of the one minute intervals there were no hits, for 33 one minute intervals there was 1 hit, etc.

 


Figure 4 – Data for Example 4 plus calculation of chi-square

 

We hypothesize that the data follows a Poisson distribution whose mean is the weighted average of the observed number of hits per minute, which we calculate to be 612/200 = 3.06 (cell B14).

H0: the observed data follows a Poisson distribution

For each row in the table we next calculate the probability of x (where x = hits per minute) for x = 0 through 7 using the Poisson pdf, i.e. f(x) = POISSON(x, 3.06, FALSE), and then multiply this probability by 200 to get the expected number of hits per interval assuming the null hypothesis is true (column D).

We would like to proceed as in Example 3, except that this time we can't use CHITEST since the correct df is no longer the number of elements minus 1. In fact, df = k – m – 1 = 8 – 1 – 1 = 6 since there are 8 intervals (k) and the Poisson distribution has 1 unknown parameter (m), namely the mean. We therefore proceed as in Example 2 and explicitly calculate the chi-square test statistic (in column F) to be 3.085. We next calculate the following:

p-value = CHIDIST(χ2, df) = CHIDIST(3.085, 6) = .798 > .05 = α

χ2-crit = CHIINV(α, df) = CHIINV(.05,6) = 12.59 > 3.09 = χ2-obs

Based on either of the above inequalities, we retain the null hypothesis and conclude that the observed data are consistent with a Poisson distribution.
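The same fit can be checked in Python. Because the actual per-minute counts appear only in Figure 4, the sketch below uses illustrative counts consistent with the stated totals (8 intervals with 0 hits, 33 with 1 hit, 200 intervals, 612 hits in all), so its statistic will not match the 3.085 above exactly; the point is the procedure, including the reduced degrees of freedom df = k – m – 1:

```python
import numpy as np
from scipy.stats import poisson, chi2

# illustrative counts of minutes with 0..7 hits (consistent with 200 minutes and 612 hits,
# but not the actual Figure 4 data)
obs = np.array([8, 33, 37, 45, 36, 25, 11, 5])
hits = np.arange(8)

mean = np.sum(hits * obs) / obs.sum()       # weighted average = 612/200 = 3.06
exp = poisson.pmf(hits, mean) * obs.sum()   # expected counts under the fitted Poisson

x2 = np.sum((obs - exp) ** 2 / exp)         # Pearson chi-square statistic
df = len(obs) - 1 - 1                       # k - m - 1 = 8 - 1 - 1 = 6 (one estimated parameter)
p_value = chi2.sf(x2, df)                   # analogue of CHIDIST(x2, 6)
print(mean, x2, p_value)
```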

Real Statistics Function: The Real Statistics Resource Pack provides the following function to handle analyses such as that used for Example 4:

FIT_TEST(R1, R2, par) = CHISQ.DIST.RT(χ2, df) where R1 = the array of observed data, R2 = the array of expected values, par = the number of unknown parameters as in Theorem 3 (default = 0) and χ2 is calculated from R1 and R2 as in Definition 2 with df = the number of elements in R1 (or R2) – par – 1.

For Example 2, FIT_TEST(B4:G4,B5:G5) = CHITEST(B4:G4,B5:G5) = .0297 and for Example 4, FIT_TEST(B4:B11,D4:D11,1) = .798.

Testing using the index of dispersion

As we saw above, Theorem 3 can be used to determine whether data follows a Poisson distribution. Note that if we want to test whether data follows a Poisson distribution with a predefined mean then we can use Theorem 2 instead, and so don’t need to reduce the degrees of freedom of the chi-square test by one.

The index of dispersion can also be used to test whether a data set follows a Poisson distribution. This test is especially useful with small data sets where the approach based on Theorem 2 or 3 is impractical.

Definition 3: The Poisson index of dispersion is defined as

$$D = \frac{(n-1)s^2}{\bar{x}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{\bar{x}}$$

Since the index of dispersion is the variance divided by the mean, the Poisson index of dispersion is simply the index of dispersion multiplied by n−1. The Poisson index of dispersion for the data in R1 can be calculated by the Excel formula =DEVSQ(R1)/AVERAGE(R1).

Property 1: For a sufficiently large sample size n and mean ≥ 4, the Poisson index of dispersion follows a chi-square distribution with n – 1 degrees of freedom.

Observation: The estimate is pretty good when the mean ≥ 4, even for values of n as low as 5. Thus the property is especially useful with small samples, where we don't have sufficient data to use the goodness-of-fit test described previously.
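As a sketch of how Property 1 is used (Python, with hypothetical raw counts rather than the Figure 5 data, which is not reproduced in the text):

```python
import numpy as np
from scipy.stats import chi2

x = np.array([3, 5, 2, 4, 6, 4])             # hypothetical small sample of counts (n = 6, mean = 4)
n = len(x)

d = np.sum((x - x.mean()) ** 2) / x.mean()   # Poisson index of dispersion, like =DEVSQ(R1)/AVERAGE(R1)
p_value = chi2.sf(d, df=n - 1)               # compare with chi-square with n - 1 degrees of freedom
print(d, p_value)                            # a large p-value is consistent with a Poisson distribution
```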

Example 5: Use Property 1 to determine whether the data in range A3:B8 of Figure 5 follows a Poisson distribution.

As we can see from the analysis in Figure 5, we don’t have sufficient reason to reject the null hypothesis that the data follows a Poisson distribution.

 


Figure 5 – Testing using Poisson Index of Dispersion

 
