Hits: 42 (Basic Statistics for Citizen Data Scientist) Negative Binomial and Geometric Distributions Negative Binomial Distribution Definition 1: Under the same assumptions as for the binomial distribution, let x be a discrete random variable. The probability density function (pdf) for the negative binomial distribution is the probability of getting x failures before k successes where p = the probability of success on any single …
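
The definition quoted in this excerpt (x failures before k successes, with success probability p on each trial) can be sketched in Python as follows. This is a minimal illustration using the standard closed form for the negative binomial; the function names `neg_binomial_pmf` and `geometric_pmf` are hypothetical and do not come from the source, and the geometric case is taken to be the special case k = 1.

```python
from math import comb


def neg_binomial_pmf(x: int, k: int, p: float) -> float:
    """Probability of exactly x failures before the k-th success."""
    if x < 0:
        return 0.0
    # The final trial is the k-th success; the x failures can fall
    # anywhere among the first x + k - 1 trials.
    return comb(x + k - 1, x) * p**k * (1 - p) ** x


def geometric_pmf(x: int, p: float) -> float:
    """Probability of exactly x failures before the first success (k = 1)."""
    return neg_binomial_pmf(x, 1, p)
```

For instance, `neg_binomial_pmf(2, 3, 0.4)` gives the probability of seeing two failures before the third success when each trial succeeds with probability 0.4.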

Hits: 7 (Basic Statistics for Citizen Data Scientist) Two-sample Proportion Testing Theorem 1: Let x1 and x2 be random variables with proportional distributions with mean π1 and π2 respectively. Let p1 be the proportion of successes in n1 trials of the first distribution and let p2 be the proportion of successes in n2 trials of the second distribution. When the number of trials n1 and n2 are sufficiently large, usually when ni πi ≥ 5 and ni (1 –πi) ≥ 5, the …

Hits: 9 (Basic Statistics for Citizen Data Scientist) One-sample Proportion Testing From the theorem, we know that when sufficiently large samples of size n are taken, the distribution of sample proportions is approximately normal, distributed around the true population proportion mean π, with standard deviation (i.e. the standard error) $\sqrt{\pi(1-\pi)/n}$. We can use this fact to do hypothesis testing …
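
As a rough sketch of the test this excerpt describes, the snippet below computes the standard error of a sample proportion under a hypothesized π and the resulting z statistic. The function name `one_sample_proportion_z` and the choice of a two-tailed p-value are assumptions made for illustration; the excerpt does not specify them.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_proportion_z(successes: int, n: int, pi0: float) -> tuple[float, float]:
    """Return (z, two-tailed p-value) for H0: population proportion == pi0.

    Relies on the normal approximation, so n should be large enough
    (commonly n * pi0 >= 5 and n * (1 - pi0) >= 5).
    """
    p_hat = successes / n
    # Standard error of the sample proportion under the null hypothesis.
    std_err = sqrt(pi0 * (1 - pi0) / n)
    z = (p_hat - pi0) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

Calling `one_sample_proportion_z(60, 100, 0.5)` tests whether 60 successes in 100 trials are consistent with a true proportion of 0.5.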

Hits: 24 (Basic Statistics for Citizen Data Scientist) Hypothesis Testing for Binomial Distribution Example 1: Suppose you have a die and suspect that it is biased towards the number three, and so run an experiment in which you throw the die 10 times and count that the number three comes up 4 times. Determine whether the die …
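
The die example in this excerpt can be checked with an exact binomial tail probability. The sketch below assumes a one-tailed test of the null hypothesis that the die is fair (probability 1/6 of a three on each throw); the helper name `binomial_upper_tail` is hypothetical, and the excerpt is truncated before it states which test the original article uses.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Probability of seeing at least 4 threes in 10 throws of a fair die.
p_value = binomial_upper_tail(4, 10, 1 / 6)
```

Comparing `p_value` with the chosen significance level then decides whether the observed count is surprising enough to reject fairness.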

Hits: 48 (Basic Statistics for Citizen Data Scientist) Binomial Distribution Definition 1: Suppose an experiment has the following characteristics: the experiment consists of n independent trials, each with two mutually exclusive outcomes (success and failure); for each trial, the probability of success is p (and so the probability of failure is 1 – p). Each such trial is called a Bernoulli trial. Let x be …

Hits: 96 (Basic Statistics for Citizen Data Scientist) Tolerance Interval As described in Confidence Intervals, a confidence interval provides a way of estimating a population parameter by a corresponding sample statistic to a given level of confidence. We show how to estimate the population mean (the parameter) by the sample mean (the statistic). In particular, if the …

Hits: 25 (Basic Statistics for Citizen Data Scientist) Identifying Outliers and Missing Data The Real Statistics Resource Pack provides an option for identifying potential outliers in a sample. Assuming the sample is normally distributed (based on the Central Limit Theorem), we know that NORM.S.DIST(-2.5,TRUE) = 0.621% of the data should have a z-score less than …
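
The figure quoted in this excerpt can be reproduced outside Excel. The sketch below uses Python's standard-library `statistics.NormalDist` as a stand-in for `NORM.S.DIST(z, TRUE)`; the function name `fraction_below_z` is hypothetical.

```python
from statistics import NormalDist


def fraction_below_z(z: float) -> float:
    """Cumulative standard normal probability, like NORM.S.DIST(z, TRUE)."""
    return NormalDist(mu=0.0, sigma=1.0).cdf(z)


# Share of a normally distributed sample expected to have a z-score below -2.5.
lower_tail = fraction_below_z(-2.5)
print(f"{lower_tail:.3%}")
```

By symmetry, the same fraction is expected above a z-score of +2.5, so under this assumption roughly twice `lower_tail` of the data would be flagged by a ±2.5 cutoff.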

Hits: 9 (Basic Statistics for Citizen Data Scientist) Power and Sample Size using Real Statistics Real Statistics Functions: The Real Statistics Resource Pack supplies the following functions for calculating the power and sample size requirements for one-sample and two-sample hypothesis testing of the mean using the normal distribution. NORM1_POWER(d, n, tails, α) = the power …

Hits: 20 (Basic Statistics for Citizen Data Scientist) Sampling Excel provides a Sampling data analysis tool that can be used to create samples. The tool works by defining the population as an array in an Excel worksheet and then using the following input parameters to determine how you would like to carry out the sampling. Input Range …

Hits: 67 (Basic Statistics for Citizen Data Scientist) Simulation It is often useful to create a model using simulation. Usually, this takes the form of generating a series of random observations (often based on a specific statistical distribution) and then studying the resulting observations using techniques described throughout the rest of this website. This approach is …

Hits: 15 (Basic Statistics for Citizen Data Scientist) Comparing two means when variances are known Theorem 1: Let x̄ and ȳ be the means of two samples of size nx and ny respectively. If x and y are normal or nx and ny are sufficiently large for the Central Limit Theorem to hold, then x̄ – ȳ has normal distribution with mean μx – μy and standard deviation $\sqrt{\sigma_x^2/n_x + \sigma_y^2/n_y}$. Proof: Since the samples are random, x̄ and ȳ are normally and independently distributed. By …
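
A minimal sketch of how the theorem in this excerpt is typically applied: with known population standard deviations, the difference of sample means is standardized and compared with the standard normal distribution. The function name `two_sample_z`, its parameter names, and the two-tailed p-value are assumptions for illustration, not taken from the source.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean_x: float, mean_y: float,
                 sigma_x: float, sigma_y: float,
                 n_x: int, n_y: int,
                 hypothesized_diff: float = 0.0) -> tuple[float, float]:
    """Return (z, two-tailed p-value) for H0: mu_x - mu_y == hypothesized_diff."""
    # Standard deviation of the difference of two independent sample means.
    std_err = sqrt(sigma_x**2 / n_x + sigma_y**2 / n_y)
    z = (mean_x - mean_y - hypothesized_diff) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

For example, `two_sample_z(105, 100, 15, 15, 50, 50)` compares two samples of 50 observations each, drawn from populations whose standard deviations are both known to be 15.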

Hits: 52 (Basic Statistics for Citizen Data Scientist) Descriptive Statistics Tools Excel provides a data analysis tool called Descriptive Statistics which produces a summary of the key statistics for a data set. Example 1: Provide a table of the most common descriptive statistics for the scores in column A of Figure 1. Figure 1 – Output from Descriptive …

Hits: 60 (Basic Statistics for Citizen Data Scientist) Measures of Variability We consider a random variable x and a data set S = {x1, x2, …, xn} of size n which contains possible values of x. The data set can represent either the population being studied or a sample drawn from the population. The mean is the statistic used most often to characterize the center …

Hits: 59 (Basic Statistics for Citizen Data Scientist) Basic Probability Concepts Definition 1: Typically in the field of statistics we study data that results from experiments. An experiment can be considered to be a series of trials, each with a particular outcome. An event is a collection of outcomes corresponding to some result in the experiment. The number of outcomes …

Hits: 30 (Basic Statistics for Citizen Data Scientist) Basic Statistical Concepts Statistics plays a central role in research in the social sciences, pure sciences and medicine. A simplified view of experimental research is as follows: You make some observations about the world and then create a theory consisting of a hypothesis and possible alternative hypotheses …