Statistics for Beginners – Discrete Probability Distributions

(Basic Statistics for Citizen Data Scientist)

Discrete Probability Distributions

We now define the concept of probability distributions for discrete random variables, i.e. random variables that take a discrete set of values. Such random variables generally take a finite set of values (heads or tails, people who live in London, scores on an IQ test), but they can also include random variables that take a countable set of values (0, 1, 2, 3, …).

Definition 1: The (probability) frequency function f, also called the probability density function (abbreviated pdf), of a discrete random variable x is defined so that for any value t in the domain of the random variable (i.e. in its sample space):

$$f(t) = P(x = t)$$

i.e. the probability that x assumes the value t.

The corresponding (cumulative) distribution function F(x) is defined at value t by

$$F(t) = P(x \le t) = \sum_{u \le t} f(u)$$

Property 1: For any discrete random variable defined over the range S with frequency function  f and distribution function F

$$0 \le f(t) \le 1 \qquad \sum_{t \in S} f(t) = 1 \qquad F(t) = \sum_{u \le t} f(u)$$

for all t in S.

Proof: These are characteristics of the probability function P(E) per Property 1 of Basic Probability Concepts.

Observation: If f is the frequency function of a discrete random variable x with distribution function F, then f(t) is the probability that x takes the value t and F(t) is the probability that x takes a value less than or equal to t. Thus, the probability that x takes a value t such that t1 < t ≤ t2 is F(t2) – F(t1).

$$P(t_1 < x \le t_2) = F(t_2) - F(t_1) = \sum_{t = u_1}^{t_2} f(t)$$

Here u1 is the first value in the domain of f which is larger than t1. Such a u1 exists since x is a discrete random variable (usually  u1 = t1+1).
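The relationship between f and F in the observation above can be checked numerically. The following is a minimal sketch in Python, assuming a hypothetical frequency function (a fair six-sided die, which is not taken from any example in this text); the helper names cdf and interval_prob are illustrative only.

```python
from fractions import Fraction

# Hypothetical frequency function f: a fair six-sided die (illustrative only).
f = {t: Fraction(1, 6) for t in range(1, 7)}

def cdf(f, t):
    """F(t) = P(x <= t): sum f(u) over all u in the domain with u <= t."""
    return sum(p for u, p in f.items() if u <= t)

def interval_prob(f, t1, t2):
    """P(t1 < x <= t2), computed directly by summing f over the interval."""
    return sum(p for u, p in f.items() if t1 < u <= t2)

# The frequencies must sum to 1 (Property 1).
assert sum(f.values()) == 1

direct = interval_prob(f, 2, 5)      # f(3) + f(4) + f(5)
via_cdf = cdf(f, 5) - cdf(f, 2)      # F(5) - F(2)
print(direct, via_cdf)
```

Exact fractions are used here so that the two results can be compared with == rather than with a floating-point tolerance.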

A frequency function can be expressed as a table or a bar chart, as described in the following example.

Example 1: Find the distribution function for the frequency function given in columns A and B below. Also show the graph of the frequency and distribution functions.

Figure 1 – Table of frequency and distribution functions

Given the frequency function f defined by the table in the range B4:B11, we can define the distribution function F in the range C4:C11 by putting the formula =B4 in cell C4 and the formula =B5+C4 in cell C5 and then copying this formula into cells C6 to C11 (e.g. by highlighting the range C5:C11 and pressing Ctrl-D).
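The same running-total construction can be sketched outside Excel. The Python snippet below assumes hypothetical frequency values for x = 1 to 8 (the actual values in Figure 1 are not reproduced here) and uses itertools.accumulate to play the role of the =B4 and =B5+C4 formulas.

```python
from itertools import accumulate

# Hypothetical frequency values f(x) for x = 1..8 (illustrative only;
# these are NOT the values shown in Figure 1).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
fs = [0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05]

# Running total: the first entry equals fs[0] and each later entry adds
# f(x) to the previous total, like filling =B5+C4 down the column.
Fs = list(accumulate(fs))

for x, fx, Fx in zip(xs, fs, Fs):
    print(f"{x}  {fx:.2f}  {Fx:.2f}")
```

The last entry of Fs should equal 1 (up to floating-point rounding), which is a quick check that the frequency values form a valid frequency function.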

Using the approach described in Example 2.1, we can generate the graphs of the frequency and distribution functions as follows:

Figure 2 – Charts of frequency and distribution functions

Excel Function: Excel provides the function PROB. Where R1 is the range defining the discrete values of the random variable x (e.g. A4:A11 in Figure 1) and R2 is the range consisting of the frequency values f(x) corresponding to the x values in R1 (e.g. B4:B11 in Figure 1), PROB is defined as follows:

PROB(R1, R2, c) = the frequency value f(c)
PROB(R1, R2,, c) = the cumulative distribution value F(c)
PROB(R1, R2, a, b) = the probability that x takes a value t between a and b, inclusive, i.e.

$$P(a \le x \le b) = \sum_{a \le t \le b} f(t)$$

Thus in Example 1, we can put the formula =PROB(A4:A11,B4:B11,,A8) in cell C8, and similarly for the other values in column C. Also for the frequency function in Example 1,

P(3 ≤ x ≤ 5) = PROB(A4:A11,B4:B11,A6,A8)

For Example 1 it also follows that P(3 ≤ x ≤ 5) = f(3) + f(4) + f(5) = F(5) – F(2) = 0.31.
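For readers working outside Excel, the behaviour of PROB described above can be approximated in a few lines of Python. This is only a sketch: the function name prob and the frequency values are hypothetical (they are not the values of Example 1, so the result differs from 0.31), and it covers only the one-limit and two-limit forms.

```python
import math

def prob(xs, fs, lower, upper=None):
    """Sum f(t) over the values t in xs with lower <= t <= upper.

    If upper is omitted, return f(lower), i.e. the one-limit form.
    """
    if upper is None:
        upper = lower
    return sum(p for t, p in zip(xs, fs) if lower <= t <= upper)

# Hypothetical frequency function for x = 1..8 (illustrative only).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
fs = [0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05]

p_3_to_5 = prob(xs, fs, 3, 5)                       # f(3) + f(4) + f(5)
F5_minus_F2 = prob(xs, fs, 1, 5) - prob(xs, fs, 1, 2)  # F(5) - F(2)

# Both routes give the same probability, up to floating-point rounding.
print(math.isclose(p_3_to_5, F5_minus_F2))
```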

Example 2: Determine the frequency function for the data in column A of Figure 3.

Figure 3 – Constructing a frequency function

First create a list of unique data values. This can be obtained by copying the raw data scores in column A to a new place in the worksheet (e.g. column C in the example above) and selecting Data > Data Tools|Remove Duplicates. The highlighted data can then optionally be sorted via Data > Sort & Filter|Sort. The result appears in cell range C4:C8 above. Alternatively, use the Real Statistics QSORT and NoDupes functions as described in Supplemental Functions.

Then use the COUNTIF function to count how many times each score appears in the sample data. E.g. cell D4 contains the formula =COUNTIF($A$3:$A$15,C4), which has value 2 since the data element 12 (the value in cell C4) appears twice in the raw data. Since there are 12 data elements, the correct value of the frequency function for data element 12 is 2/12 = 0.167, which can be calculated via the formula =D4/D$9 in cell E4, where D9 contains the formula =SUM(D4:D8).
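The unique-values, count and divide steps above can also be sketched in Python. The sample below is hypothetical (12 scores chosen for illustration, not the data in Figure 3); collections.Counter stands in for COUNTIF and len(data) for the SUM of the counts.

```python
from collections import Counter

# Hypothetical sample of 12 scores (illustrative only; not the data in Figure 3).
data = [12, 15, 12, 18, 20, 15, 22, 18, 15, 20, 18, 15]

counts = Counter(data)   # number of occurrences of each unique value
n = len(data)            # total number of data elements

# Frequency function: count divided by the total, with values in sorted order.
freq = {v: counts[v] / n for v in sorted(counts)}

for v, p in freq.items():
    print(v, counts[v], round(p, 3))
```

In this hypothetical sample the value 12 appears twice, so its frequency is 2/12, in line with the calculation described in the text.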

Real Statistics Function: The Real Statistics Resource Pack supplies the following supplemental array function to create the frequency function.

FREQTABLE(R1)  = an n × 3 array which contains the frequency table for the data in range R1, where n = the number of unique values in R1 (i.e. the number of data elements in R1 without duplicates)

To use the function you must highlight an array with 3 columns and at least as many rows as unique elements in R1. You can highlight more rows than you need; any extra rows will take value #N/A.

Example 3: Repeat Example 2 using the FREQTABLE function.

Figure 4 – Using the FREQTABLE function

The output from =FREQTABLE(A3:A14) (where A3:A14 is as in Figure 3) is shown in range M4:O8 of Figure 4 (the headings in row 3 have been added manually).

Real Statistics Data Analysis Tool: The resource pack also contains a data analysis tool called Histogram with Normal Curve Overlay. This works just like the FREQTABLE function except that you don’t need to specify the size of the frequency table. The analysis tool sizes the output automatically.

Observation: See Histograms for examples of the use of the FREQTABLE function and Frequency Table data analysis tool.

Observation: The notion of probability function can be extended to multiple random variables. We now give the definition for two random variables.

Definition 2: f(x, y) is a joint probability density function (pdf) of random variables x, y if for any values a and b in the domains of x and y respectively

$$f(a, b) = P(x = a, y = b)$$

In this case the cumulative distribution function is given by

$$F(a, b) = P(x \le a, y \le b) = \sum_{s \le a} \sum_{t \le b} f(s, t)$$

Property 2: If x is a random variable with pdf f and y is a random variable with pdf g, then x and y are independent if and only if the function f(x) ∙ g(y) is a joint pdf for x, y.

Proof: Follows from Definition 3 of Basic Probability Concepts.
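Property 2 can be illustrated with a small numerical check. The sketch below uses hypothetical distributions (a fair coin and a fair three-valued variable, not taken from this text) and assumes the joint pdf is given as a table over every pair (a, b), with zero entries listed explicitly; the helper names marginals and is_independent are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Hypothetical marginal pdfs (illustrative only).
f = {0: Fraction(1, 2), 1: Fraction(1, 2)}                     # pdf of x
g = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}  # pdf of y

# Joint pdf built as the product f(a) * g(b): the independent case.
joint = {(a, b): f[a] * g[b] for a, b in product(f, g)}

def marginals(joint):
    """Recover the pdfs of x and y by summing the joint pdf over the other variable."""
    fx, gy = {}, {}
    for (a, b), p in joint.items():
        fx[a] = fx.get(a, 0) + p
        gy[b] = gy.get(b, 0) + p
    return fx, gy

def is_independent(joint):
    """True if the joint pdf equals the product of its marginals at every listed pair."""
    fx, gy = marginals(joint)
    return all(p == fx[a] * gy[b] for (a, b), p in joint.items())

# A dependent case: y always equals x, so the joint pdf is not a product.
dependent = {(0, 0): Fraction(1, 2), (0, 1): Fraction(0),
             (1, 0): Fraction(0), (1, 1): Fraction(1, 2)}

print(is_independent(joint), is_independent(dependent))
```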

 

 
