(Basic Statistics for Citizen Data Scientist)
Poisson Distribution
Basic Concepts
Definition 1: The Poisson distribution has a probability distribution function (pdf) given by

$$f(x) = \frac{e^{-\mu}\,\mu^{x}}{x!}, \qquad x = 0, 1, 2, \ldots$$
The parameter μ is often replaced by λ. A chart of the pdf of the Poisson distribution for λ = 3 is shown in Figure 1.
Figure 1 – Poisson Distribution
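As a quick check of Definition 1, the following sketch evaluates the pdf in plain Python for λ = 3, the case charted in Figure 1. The function name `poisson_pmf` is our own choice for illustration (it is not an Excel or Real Statistics function), and the code assumes μ > 0. It works in log space so that large values of x do not overflow.

```python
import math


def poisson_pmf(x: int, mu: float) -> float:
    """Return f(x) = exp(-mu) * mu**x / x! for integer x and mu > 0."""
    if x < 0:
        return 0.0
    # Evaluate exp(log f(x)); math.lgamma(x + 1) equals log(x!)
    return math.exp(-mu + x * math.log(mu) - math.lgamma(x + 1))


if __name__ == "__main__":
    # Tabulate the pdf for lambda = 3, the case charted in Figure 1
    for x in range(11):
        print(f"{x:2d}  {poisson_pmf(x, 3.0):.6f}")
```

With FALSE as the last argument, Excel's POISSON(x, μ, FALSE) should return the same values.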
Observation: Some key statistical properties of the Poisson distribution are:
- Mean = µ
- Variance = µ
- Skewness = 1/√µ
- Kurtosis (excess) = 1/µ
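These properties can be checked numerically by summing over the pdf. The sketch below is only an illustration: the helper names and the truncation point of the sum are our own assumptions, and the kurtosis it reports is the excess kurtosis (the fourth standardized moment minus 3).

```python
import math


def poisson_pmf(x: int, mu: float) -> float:
    """Poisson pdf f(x) for integer x >= 0 and mu > 0, computed in log space."""
    return math.exp(-mu + x * math.log(mu) - math.lgamma(x + 1))


def poisson_summary(mu: float):
    """Approximate (mean, variance, skewness, excess kurtosis) by summation.

    The sum is truncated far in the right tail (an arbitrary but generous
    cut-off), where the remaining probability is negligible.
    """
    upper = int(mu + 40 * math.sqrt(mu) + 40)
    xs = range(upper + 1)
    probs = [poisson_pmf(x, mu) for x in xs]
    mean = sum(x * p for x, p in zip(xs, probs))
    var = sum((x - mean) ** 2 * p for x, p in zip(xs, probs))
    third = sum((x - mean) ** 3 * p for x, p in zip(xs, probs))
    fourth = sum((x - mean) ** 4 * p for x, p in zip(xs, probs))
    return mean, var, third / var ** 1.5, fourth / var ** 2 - 3.0


if __name__ == "__main__":
    mu = 3.0
    mean, var, skew, kurt = poisson_summary(mu)
    print("mean     ", mean, "expected", mu)
    print("variance ", var, "expected", mu)
    print("skewness ", skew, "expected", 1 / math.sqrt(mu))
    print("kurtosis ", kurt, "expected", 1 / mu)
```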
Excel Function: Excel provides the following function for the Poisson distribution:
POISSON(x, μ, cum) where μ = the mean of the distribution and cum takes the values TRUE and FALSE
POISSON(x, μ, FALSE) = probability density function value f(x) at the value x for the Poisson distribution with mean μ.
POISSON(x, μ, TRUE) = cumulative probability distribution function F(x) at the value x for the Poisson distribution with mean μ.
Excel 2010/2013/2016 provide the additional function POISSON.DIST which is equivalent to POISSON.
Real Statistics Function: Excel doesn’t provide a worksheet function for the inverse of the Poisson distribution. Instead you can use the following function provided by the Real Statistics Resource Pack.
POISSON_INV(p, μ) = smallest integer x such that POISSON(x, μ, TRUE) ≥ p
Note that the maximum value of x is 1,024,000,000. A value higher than this indicates an error.
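Outside Excel, the same definition can be sketched in a few lines of Python. This is our own illustration of "smallest integer x whose cumulative probability is at least p", not the Real Statistics implementation; the name `poisson_inv` and the `max_x` guard are assumptions made for this example.

```python
import math


def poisson_pmf(x: int, mu: float) -> float:
    """Poisson pdf f(x) for integer x >= 0 and mu > 0, computed in log space."""
    return math.exp(-mu + x * math.log(mu) - math.lgamma(x + 1))


def poisson_inv(p: float, mu: float, max_x: int = 1_000_000) -> int:
    """Smallest integer x such that P(X <= x) >= p, for 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    x = 0
    cumulative = poisson_pmf(0, mu)
    # Add one pdf term at a time until the cumulative probability reaches p
    while cumulative < p:
        x += 1
        if x > max_x:  # arbitrary guard against a runaway loop
            raise OverflowError("no solution found below max_x")
        cumulative += poisson_pmf(x, mu)
    return x
```

For example, `poisson_inv(0.5, 3)` returns 3, since the cumulative probability at x = 2 is about 0.423 and at x = 3 is about 0.647.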
Poisson Process
If the average number of occurrences of a particular event in an hour (or some other unit of time) is μ, and the arrival times are random without any tendency to bunch up (these are the assumptions for what is called a Poisson process), then the probability of x events occurring in an hour is given by

$$P(x) = \frac{e^{-\mu}\,\mu^{x}}{x!}$$
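One way to build intuition for a Poisson process is to simulate it. The sketch below assumes the standard construction in which the gaps between successive arrivals are independent and exponentially distributed with rate μ; the function name, seed and sample size are our own choices for illustration.

```python
import random


def simulate_counts(rate: float, n_intervals: int, seed: int = 0) -> list:
    """Count arrivals in each of n_intervals unit-length time intervals.

    Assumes independent exponential gaps between arrivals with the given rate.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_intervals):
        t = rng.expovariate(rate)  # time of the first arrival
        k = 0
        while t <= 1.0:  # count arrivals that fall inside the interval
            k += 1
            t += rng.expovariate(rate)
        counts.append(k)
    return counts


if __name__ == "__main__":
    counts = simulate_counts(rate=4.0, n_intervals=20000, seed=1)
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
    # Both values should be close to the rate if the counts are Poisson
    print("sample mean:", mean)
    print("sample variance:", var)
```

The simulated counts should have a sample mean and a sample variance that are both close to μ, as expected for a Poisson distribution.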
Example 1: A large department store sells on average 100 MP3 players a week. Assuming that purchases follow a Poisson process as described above, what is the probability that the store will have to turn away potential buyers before the end of the week if it stocks 120 players? How many MP3 players should the store stock to have a 99% probability of being able to supply a week’s demand?
The probability that they will sell ≤ 120 MP3 players in a week is
POISSON(120, 100, TRUE) = 0.977331
Thus, the answer to the first question is 1 – 0.977331 = 0.022669, or about 2.3%. We can answer the second question by successive approximation. For example, we could try x = 130, which gives a cumulative Poisson probability of 0.998293. This exceeds 0.99, so a smaller stock may suffice. We then pick x = 125 (halfway between 120 and 130), which yields 0.993202, still above 0.99, and so we try 123. This yields 0.988756, which is a little too low, and so we finally arrive at 124, which has a cumulative Poisson probability of 0.991226.
Alternatively, you can arrive at the same answer (124) by using the Real Statistics formula =POISSON_INV(0.99,100).
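The calculations in Example 1 can also be reproduced outside Excel. The following Python sketch is one possible approach, not the method used by the Real Statistics Resource Pack: it sums the pdf to get the cumulative probability and then halves the search interval between 120 and 130, much like the successive approximations described above. The function names are hypothetical.

```python
import math


def poisson_cdf(x: int, mu: float) -> float:
    """P(X <= x) for a Poisson variable with mean mu > 0."""
    return sum(
        math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
        for k in range(x + 1)
    )


def smallest_stock(target: float, mu: float, lo: int, hi: int) -> int:
    """Smallest x in (lo, hi] whose cumulative probability is >= target.

    Assumes poisson_cdf(lo, mu) < target <= poisson_cdf(hi, mu).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if poisson_cdf(mid, mu) >= target:
            hi = mid  # mid is enough stock, so look lower
        else:
            lo = mid  # mid is too little stock, so look higher
    return hi


if __name__ == "__main__":
    mu = 100.0
    print("P(demand > 120) =", round(1.0 - poisson_cdf(120, mu), 6))
    for x in (120, 123, 124, 125, 130):
        print(x, round(poisson_cdf(x, mu), 6))
    print("stock needed:", smallest_stock(0.99, mu, 120, 130))
```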
Confidence Intervals
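The 1–α confidence interval for the mean μ based on x events occurring (in a unit of time) is given by

$$\frac{\chi^2_{\alpha/2,\;2x}}{2} \;\le\; \mu \;\le\; \frac{\chi^2_{1-\alpha/2,\;2(x+1)}}{2}$$

Here $\chi^2_{p,df}$ is the value at which the chi-square distribution with df degrees of freedom has cumulative (left-tail) probability p. For Excel 2007, $\chi^2_{p,df}$ = CHIINV(1−p, df).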
Example 2: Suppose the number of radioactive particles that hit a screen per second follows a Poisson process, and suppose that 5 hits occurred in one second. Find the 95% confidence interval for the mean number of hits per second.
Figure 2 shows the confidence intervals for various values of x and α.
Figure 2 – Confidence intervals for the Poisson mean
The requested confidence interval is
1.623486 ≤ μ ≤ 11.66833
as calculated by the formulas in cells C9 and D9:
=CHISQ.INV(B9/2,2*A9)/2
=CHISQ.INV.RT(B9/2,2*(A9+1))/2
Note that CHISQ.INV(p,0) = #NUM! for any value of p, and so we cannot use this formula to calculate the lower bound when x = 0 (cell C4). In any case, this value is zero.
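The interval in Example 2 can be cross-checked without chi-square functions. The sketch below relies on the standard identity linking the two distributions, P(X ≤ x) for a Poisson variable with mean μ equals the right-tail probability of a chi-square variable with 2(x+1) degrees of freedom at 2μ, so the bounds are the values of μ at which each Poisson tail probability equals α/2. The function names and the bisection routine are our own assumptions, not part of Excel or the Real Statistics Resource Pack.

```python
import math


def poisson_cdf(x: int, mu: float) -> float:
    """P(X <= x) for a Poisson variable with mean mu >= 0 (0 if x < 0)."""
    if x < 0:
        return 0.0
    if mu == 0.0:
        return 1.0
    return sum(
        math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
        for k in range(x + 1)
    )


def root_increasing(func, lo: float, hi: float, iterations: int = 100) -> float:
    """Bisection for an increasing func with func(lo) < 0 < func(hi)."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if func(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def poisson_ci(x: int, alpha: float = 0.05):
    """Confidence interval for the Poisson mean given x observed events."""
    half = alpha / 2.0
    # Lower bound: P(X >= x | mu) = alpha/2; it is zero when x = 0
    if x == 0:
        lower = 0.0
    else:
        lower = root_increasing(
            lambda m: (1.0 - poisson_cdf(x - 1, m)) - half, 0.0, float(x)
        )
    # Upper bound: P(X <= x | mu) = alpha/2; first widen the bracket
    hi = max(1.0, 2.0 * x)
    while poisson_cdf(x, hi) > half:
        hi *= 2.0
    upper = root_increasing(lambda m: half - poisson_cdf(x, m), float(x), hi)
    return lower, upper


if __name__ == "__main__":
    print(poisson_ci(5, 0.05))
```

For x = 5 and α = 0.05 this should agree with the interval 1.623486 ≤ μ ≤ 11.66833 obtained above, up to rounding. Because the case x = 0 is handled separately, the lower bound of zero is returned directly rather than producing an error.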
Relationship with Binomial and Normal Distributions
Theorem 1: If the probability p of success of a single trial approaches 0 while the number of trials n approaches infinity and the value μ = np stays fixed, then the binomial distribution B(n, p) approaches the Poisson distribution with mean μ.
Observation: Based on Theorem 1 the Poisson distribution can be used to estimate the binomial distribution when n ≥ 50 and p ≤ .01, preferably with np ≤ 5.
Example 3: A company produces high precision bolts so that the probability of a defect is .05%. In a sample of 4,000 units what is the probability of having more than 3 defects?
We can solve this problem using the distribution B(4000, .0005), namely the desired probability is
1 – BINOMDIST(3, 4000, .0005, TRUE) = 1 – 0.857169 = 0.142831
We can also use the Poisson approximation as follows:
μ = np = 4000(.0005) = 2
1 – POISSON(3, 2, TRUE) = 1 – 0.857123 = 0.142877
As you can see the approximation is quite accurate.
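The comparison in Example 3 can be repeated in Python using only the standard library. This is a minimal sketch; the function names are ours, and `math.comb` requires Python 3.8 or later.

```python
import math


def binomial_cdf(x: int, n: int, p: float) -> float:
    """P(X <= x) for a binomial variable with n trials and success probability p."""
    return sum(
        math.comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(x + 1)
    )


def poisson_cdf(x: int, mu: float) -> float:
    """P(X <= x) for a Poisson variable with mean mu."""
    return sum(
        math.exp(-mu) * mu ** k / math.factorial(k) for k in range(x + 1)
    )


if __name__ == "__main__":
    n, p = 4000, 0.0005
    exact = 1.0 - binomial_cdf(3, n, p)
    approx = 1.0 - poisson_cdf(3, n * p)  # Poisson with mu = np = 2
    print("binomial:", round(exact, 6))
    print("Poisson: ", round(approx, 6))
    print("difference:", abs(exact - approx))
```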
Observation: The Poisson distribution can be approximated by the normal distribution, as shown in the following theorem.
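Theorem 2: For μ sufficiently large (usually μ ≥ 20), if x has a Poisson distribution with mean μ, then approximately x ~ N(μ, √μ), i.e. x is approximately normally distributed with mean μ and standard deviation √μ.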
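To get a feel for how close the normal approximation is, the sketch below compares the exact cumulative Poisson probability with the normal value for μ = 100 and x = 120. The continuity-corrected variant (evaluating the normal distribution at x + 0.5) is a common refinement that we add here as an assumption; it is not part of the theorem as stated, and the function names are our own.

```python
import math


def poisson_cdf(x: int, mu: float) -> float:
    """P(X <= x) for a Poisson variable with mean mu > 0."""
    return sum(
        math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
        for k in range(x + 1)
    )


def normal_cdf(x: float, mean: float, sd: float) -> float:
    """Cumulative distribution function of N(mean, sd), sd = standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def compare(x: int, mu: float):
    """Return (exact Poisson, normal, normal with continuity correction)."""
    sd = math.sqrt(mu)
    return (
        poisson_cdf(x, mu),
        normal_cdf(x, mu, sd),
        normal_cdf(x + 0.5, mu, sd),
    )


if __name__ == "__main__":
    exact, plain, corrected = compare(120, 100.0)
    print("exact Poisson:        ", round(exact, 6))
    print("normal:               ", round(plain, 6))
    print("normal, x + 0.5:      ", round(corrected, 6))
```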
Test for a Poisson Distribution
The index of dispersion of a data set or distribution is the variance divided by the mean.
Since the mean and variance of a Poisson distribution are equal, data that conform to a Poisson distribution should have an index of dispersion approximately equal to 1. This fact can be used to test whether a data set has a Poisson distribution, as described in Goodness of Fit.
Goodness of Fit also shows how to use the chi-square goodness-of-fit test to determine whether a data set follows a Poisson distribution.
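Computing the index of dispersion for a sample is straightforward. The sketch below uses made-up counts purely for illustration and assumes the sample variance (with n − 1 in the denominator) is the intended estimate; it computes the index only and does not perform a formal test.

```python
import statistics


def index_of_dispersion(data) -> float:
    """Sample variance divided by sample mean."""
    mean = statistics.mean(data)
    if mean == 0:
        raise ValueError("index of dispersion is undefined for a zero mean")
    return statistics.variance(data) / mean


if __name__ == "__main__":
    # Hypothetical event counts per time unit (not real data)
    counts = [0, 2, 3, 1, 5, 2, 4, 1, 3, 2, 6, 1]
    print("index of dispersion:", index_of_dispersion(counts))
```

A value well above 1 suggests more variability than a Poisson distribution would produce, and a value well below 1 suggests less.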