(Basic Statistics for Citizen Data Scientist)
Discrete Probability Distributions
We now define the concept of probability distributions for discrete random variables, i.e. random variables that take a discrete set of values. Such random variables generally take a finite set of values (heads or tails, people who live in London, scores on an IQ test), but they can also include random variables that take a countable set of values (0, 1, 2, 3, …).
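Definition 1: The (probability) frequency function, also called the probability density function (abbreviated pdf), of a discrete random variable x is defined so that for any value t in the domain of the random variable (i.e. in its sample space):
f(t) = P(x = t)
i.e. the probability that x assumes the value t. (For discrete random variables this function is more commonly called the probability mass function, abbreviated pmf.)
The corresponding (cumulative) distribution function F(x) is defined at value t by
F(t) = P(x ≤ t) = Σ_{u ≤ t} f(u)
where the sum runs over all values u in the domain of x with u ≤ t.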
Property 1: For any discrete random variable defined over the range S with frequency function f and distribution function F, 0 ≤ f(t) ≤ 1 and 0 ≤ F(t) ≤ 1 for all t in S, and Σ_{t ∈ S} f(t) = 1.
Proof: These are characteristics of the probability function P(E) per Property 1 of Basic Probability Concepts.
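As a minimal sketch of Definition 1, the frequency function can be stored as a mapping from values to probabilities, and the distribution function obtained by summing. The fair six-sided die used here is a hypothetical example (it is not the data of Figure 1), and the names f and F are chosen only to mirror the notation above.

```python
import math

# Hypothetical frequency function: a fair six-sided die (not the data in Figure 1).
f = {t: 1 / 6 for t in range(1, 7)}


def F(t):
    """Distribution function: sum of f(u) over all u in the domain with u <= t."""
    return sum(p for u, p in f.items() if u <= t)


# Standard sanity checks for a frequency function:
# each value lies in [0, 1] and all values together sum to 1.
assert all(0 <= p <= 1 for p in f.values())
assert math.isclose(sum(f.values()), 1.0)

print(f[3])  # P(x = 3)
print(F(3))  # P(x <= 3)
```

Floating-point sums are compared with math.isclose rather than ==, since values such as 1/6 are not represented exactly.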
Observation: If f is the frequency function of a discrete random variable x with distribution function F, then f(t) is the probability that x takes the value t and F(t) is the probability that x takes a value less than or equal to t. Thus, the probability that x takes a value t such that t1 < t ≤ t2 is
P(t1 < x ≤ t2) = F(t2) – F(t1) = Σ_{u1 ≤ t ≤ t2} f(t)
Here u1 is the first value in the domain of f which is larger than t1. Such a u1 exists since x is a discrete random variable (when the values are consecutive integers, u1 = t1 + 1).
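The interval rule in this Observation can be checked numerically. The sketch below uses made-up probabilities on the values 1 to 5 (an assumption for illustration only) and compares F(t2) – F(t1) with the direct sum of f over the values greater than t1 and at most t2.

```python
import math

# Hypothetical frequency function on the values 1..5 (illustrative numbers only).
f = {1: 0.1, 2: 0.2, 3: 0.4, 4: 0.2, 5: 0.1}


def F(t):
    """Distribution function: P(x <= t)."""
    return sum(p for u, p in f.items() if u <= t)


def interval_prob(t1, t2):
    """P(t1 < x <= t2), computed as F(t2) - F(t1)."""
    return F(t2) - F(t1)


t1, t2 = 2, 4
# Direct sum f(3) + f(4); here u1 = 3 is the first value larger than t1.
direct = sum(p for u, p in f.items() if t1 < u <= t2)

assert math.isclose(interval_prob(t1, t2), direct)
print(interval_prob(t1, t2))
```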
A frequency function can be expressed as a table or a bar chart, as described in the following example.
Example 1: Find the distribution function for the frequency function given in columns A and B below. Also show the graph of the frequency and distribution functions.
Figure 1 – Table of frequency and distribution functions
Given the frequency function defined by the table in the range B4:B11, we can define the distribution function in the range C4:C11 by putting the formula =B4 in cell C4 and the formula =B5+C4 in cell C5 and then copying this formula into cells C6 to C11 (e.g. by highlighting the range C5:C11 and pressing Ctrl-D).
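For readers working outside Excel, the same running-total construction can be sketched in Python with itertools.accumulate. The four values and probabilities below are hypothetical stand-ins for columns A and B; they are not the numbers shown in Figure 1.

```python
from itertools import accumulate

values = [1, 2, 3, 4]          # hypothetical x values (the role of column A)
freq = [0.1, 0.2, 0.3, 0.4]    # hypothetical f(x) values (the role of column B)

# Running total: the first entry equals freq[0] (like =B4), and each later
# entry adds the current frequency to the previous total (like =B5+C4).
cumulative = list(accumulate(freq))

for x, p, c in zip(values, freq, cumulative):
    print(f"{x}\t{p:.2f}\t{c:.2f}")
```

The last cumulative entry should come out as 1 (up to floating-point rounding), which is a quick check that the frequencies form a valid frequency function.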
Using the approach described in Example 2.1, we can generate the graphs of the frequency and distribution functions as follows:
Figure 2 – Charts of frequency and distribution functions
Excel Function: Excel provides the function PROB. Where R1 is the range defining the discrete values of the random variable x (e.g. A4:A11 in Figure 1) and R2 is the range consisting of the frequency values f(x) corresponding to the x values in R1 (e.g. B4:B11 in Figure 1), PROB is defined as follows:
PROB(R1, R2, c) = the frequency value f(c)
PROB(R1, R2,, c) = the cumulative distribution value F(c)
PROB(R1, R2, a, b) = the probability that x takes a value t between a and b, inclusive, i.e. P(a ≤ x ≤ b) = Σ_{a ≤ t ≤ b} f(t)
Thus in Example 1, we can put the formula =PROB(A4:A11,B4:B11,,A8) in cell C8, and similarly for the other values in column C. Also for the frequency function in Example 1,
P(3 ≤ x ≤ 5) = PROB(A4:A11,B4:B11,A6,A8)
For Example 1 it also follows that P(3 ≤ x ≤ 5) = f(3) + f(4) + f(5) = F(5) – F(2) = 0.31.
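A rough Python analogue of the four-argument form of PROB is sketched below. The helper name prob is hypothetical (it is not part of Excel or of any library), and the eight probabilities are invented for illustration, so the result differs from the 0.31 obtained for Figure 1.

```python
import math


def prob(xs, ps, lower, upper=None):
    """Sum the probabilities ps[i] for which lower <= xs[i] <= upper.

    If upper is omitted, return the probability that x equals lower.
    """
    if upper is None:
        upper = lower
    return sum(p for x, p in zip(xs, ps) if lower <= x <= upper)


# Hypothetical data: values 1..8 with made-up probabilities that sum to 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ps = [0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05]

p_3_to_5 = prob(xs, ps, 3, 5)                                     # f(3) + f(4) + f(5)
via_cdf = prob(xs, ps, min(xs), 5) - prob(xs, ps, min(xs), 2)     # F(5) - F(2)

assert math.isclose(p_3_to_5, via_cdf)
print(p_3_to_5)
```

The final assertion mirrors the identity P(3 ≤ x ≤ 5) = F(5) – F(2) used in the text.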
Example 2: Determine the frequency function for the data in column A of Figure 3.
Figure 3 – Constructing a frequency function
First create a list of unique data values. This can be obtained by first copying the raw data scores in column A to a new place in the worksheet (e.g. in column C in the example above) and selecting Data > Data Tools|Remove Duplicates. The highlighted data can then optionally be sorted via Data > Sort & Filter|Sort. The result appears in cell range C4:C8 above. Alternatively use the Real Statistics QSORT and NoDupes functions as described in Supplemental Functions.
Then use the COUNTIF function to count how many times each score appears in the sample data. E.g. cell D4 contains the formula =COUNTIF($A$3:$A$15,C4), which has value 2 since the data element 12 (the value in cell C4) appears twice in the raw data. Since there are 12 data elements, the value of the frequency function for data element 12 is 2/12 = 0.167, which can be calculated via the formula =D4/D$9 in cell E4, where D9 contains the formula =SUM(D4:D8).
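The same two steps (list the unique values, then count and divide by the total) can be sketched in Python with collections.Counter. The 12 scores below are hypothetical; they were chosen only so that there are five unique values and the score 12 appears twice, as in the description above, and they are not the actual data of Figure 3.

```python
from collections import Counter

# Hypothetical raw scores: 12 data elements with 5 unique values.
scores = [12, 15, 12, 18, 20, 15, 18, 18, 25, 20, 15, 18]

counts = Counter(scores)   # plays the role of COUNTIF for every unique value
n = len(scores)            # total number of data elements (the SUM cell)

# One row per unique value: (value, count, relative frequency).
freq_table = [(value, counts[value], counts[value] / n) for value in sorted(counts)]

for value, count, rel in freq_table:
    print(f"{value}\t{count}\t{rel:.3f}")
```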
Real Statistics Function: The Real Statistics Resource Pack supplies the following supplemental array function to create the frequency function.
FREQTABLE(R1) = an n × 3 array which contains the frequency table for the data in range R1, where n = the number of unique values in R1 (i.e. the number of data elements in R1 without duplicates)
To use the function you must highlight an array with 3 columns and at least as many rows as unique elements in R1. You can highlight more rows than you need; any extra rows will take value #N/A.
Example 3: Repeat Example 2 using the FREQTABLE function.
Figure 4 – Using the FREQTABLE function
The output from =FREQTABLE(A3:A14) (where A3:A14 is as in Figure 3) is shown in range M4:O8 of Figure 4 (the headings in row 3 have been added manually).
Real Statistics Data Analysis Tool: The resource pack also contains a data analysis tool called Histogram with Normal Curve Overlay. This works just like the FREQTABLE function except that you don’t need to specify the size of the frequency table. The analysis tool sizes the output automatically.
Observation: See Histograms for examples of the use of the FREQTABLE function and Frequency Table data analysis tool.
Observation: The notion of probability function can be extended to multiple random variables. We now give the definition for two random variables.
Definition 2: f(x, y) is a joint probability density function (pdf) of random variables x, y if for any values of a and b in the domains of x and y respectively
f(a, b) = P(x = a and y = b)
In this case the cumulative distribution function is given by
F(a, b) = P(x ≤ a and y ≤ b) = Σ_{s ≤ a} Σ_{t ≤ b} f(s, t)
Property 2: If x is a random variable with pdf f and y is a random variable with pdf g, then x and y are independent if and only if the function f(x) ∙ g(y) is a joint pdf for x, y.
Proof: Follows from Definition 3 of Basic Probability Concepts.
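Property 2 can be illustrated with a small numerical check. In the sketch below a joint pdf is stored as a dictionary keyed by (a, b) pairs; the helper names marginals and is_independent and all of the probabilities are assumptions made for this example.

```python
import math
from itertools import product


def marginals(joint):
    """Recover the individual pdfs of x and y by summing the joint pdf."""
    fx, gy = {}, {}
    for (a, b), p in joint.items():
        fx[a] = fx.get(a, 0.0) + p
        gy[b] = gy.get(b, 0.0) + p
    return fx, gy


def is_independent(joint):
    """True if joint(a, b) equals f(a) * g(b) for every pair (a, b)."""
    fx, gy = marginals(joint)
    return all(
        math.isclose(joint.get((a, b), 0.0), fx[a] * gy[b], abs_tol=1e-12)
        for a in fx
        for b in gy
    )


# Hypothetical pdfs f and g, combined as the product f(a) * g(b).
f = {0: 0.5, 1: 0.5}
g = {1: 0.2, 2: 0.3, 3: 0.5}
joint_product = {(a, b): f[a] * g[b] for a, b in product(f, g)}

# Hypothetical dependent pair: y always equals x.
joint_dependent = {(0, 0): 0.5, (1, 1): 0.5}

print(is_independent(joint_product))    # product form, so independent
print(is_independent(joint_dependent))  # f(0) * g(0) = 0.25, but joint(0, 0) = 0.5
```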