Statistics for Beginners in Excel – Null and Alternative Hypothesis

(Basic Statistics for Citizen Data Scientist)

Null and Alternative Hypothesis

Generally, to understand some characteristic of a population, we take a random sample and study the corresponding property of the sample. We then determine whether any conclusions we reach about the sample are representative of the population.

This is done by choosing an estimator function for the characteristic (of the population) we want to study and then applying this function to the sample to obtain an estimate. By using the appropriate statistical test we then determine whether this estimate is based solely on chance.

The hypothesis that the estimate is based solely on chance is called the null hypothesis. Thus, the null hypothesis is true if the observed data (in the sample) do not differ from what would be expected on the basis of chance alone. The complement of the null hypothesis is called the alternative hypothesis.

The null hypothesis is typically abbreviated as H0 and the alternative hypothesis as H1. Since the two are complementary (i.e. H0 is true if and only if H1 is false), it is sufficient to define the null hypothesis.

Since our sample usually only contains a subset of the data in the population, we cannot be absolutely certain as to whether the null hypothesis is true or not. We can merely gather information (via statistical tests) to determine whether it is likely or not. We therefore speak about rejecting or not rejecting (aka retaining) the null hypothesis on the basis of some test, but not of accepting the null hypothesis or the alternative hypothesis. Often in an experiment we are actually testing the validity of the alternative hypothesis by testing whether to reject the null hypothesis.

When performing such tests, there is some chance that we will reach the wrong conclusion. There are two types of errors:

  • Type I – H0 is rejected even though it is true (false positive)
  • Type II – H0 is not rejected even though it is false (false negative)

 

The acceptable level of a Type I error is designated by alpha (α), while the acceptable level of a Type II error is designated by beta (β).

We use the following terminology:

Significance level is the acceptable level of type I error, denoted α. Typically, a significance level of α = .05 is used (although sometimes other levels such as α = .01 may be employed). This means that we are willing to tolerate a type I error rate of up to 5%, i.e. we accept that, in the long run, in about 1 out of every 20 samples drawn when the null hypothesis is true we will reject it anyway.
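The 1-in-20 reading of α = .05 can be checked by simulation. The following is a minimal sketch, not a definitive implementation: it assumes a right-tailed z-test, and the population values (mean 20, standard deviation 5, sample size 25) as well as the function name are hypothetical choices made only for illustration.

```python
import random
from statistics import NormalDist, fmean

def type_i_error_rate(alpha=0.05, mu0=20.0, sigma=5.0, n=25, trials=4000, seed=1):
    """Estimate how often a right-tailed z-test rejects a null hypothesis that is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # critical value for the right tail
    std_err = sigma / n ** 0.5
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population in which H0 holds (mean equals mu0).
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / std_err
        if z > z_crit:
            rejections += 1  # H0 is true here, so every rejection is a type I error
    return rejections / trials

rate = type_i_error_rate()
print(f"Estimated type I error rate: {rate:.3f}")
```

With enough trials the estimated rate should settle close to α. The simulation draws from the boundary case (population mean exactly 20); for a population mean below 20 the rejection rate would be smaller.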

P-value (the probability value) is the probability p, calculated assuming the null hypothesis is true, of obtaining a value of the test statistic at least as extreme as the one actually observed. If p < α then we reject the null hypothesis.
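As a small illustration, the p-value can be computed as a tail probability under the null hypothesis. The sketch below assumes the test statistic follows a standard normal distribution when H0 is true and that large values count as extreme (a right-tailed test); the observed value 1.75 and the function name are hypothetical.

```python
from statistics import NormalDist

def right_tail_p_value(z):
    """P(Z >= z) for a standard normal Z: the p-value of a right-tailed z-test."""
    return 1 - NormalDist().cdf(z)

alpha = 0.05
z_observed = 1.75  # hypothetical value of the test statistic
p = right_tail_p_value(z_observed)

print(f"p = {p:.4f}")  # about 0.0401
print("reject H0" if p < alpha else "do not reject H0")
```

Other tests use other reference distributions (t, chi-square, F and so on), but the comparison of p with α works the same way.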

Critical region is the part of the sample space that corresponds to the rejection of the null hypothesis, i.e. the set of possible values of the test statistic which are better explained by the alternative hypothesis. The significance level is the probability that the test statistic will fall within the critical region when the null hypothesis is assumed.

Usually the critical region is depicted as a region under a curve for continuous distributions (or a portion of a bar chart for discrete distributions).

The typical approach for testing a null hypothesis is to select a statistic based on a sample of fixed size, calculate the value of the statistic for the sample and then reject the null hypothesis if and only if the statistic falls in the critical region.

One-tailed hypothesis testing specifies a direction of the statistical test. For example, to test whether cloud seeding increases the average annual rainfall in an area which usually has an average annual rainfall of 20 cm, we define the null and alternative hypotheses as follows, where µ represents the average rainfall after cloud seeding.

H0: µ ≤ 20 (i.e. average rainfall does not increase after cloud seeding)

H1: µ > 20 (i.e. average rainfall increases after cloud seeding)

Here the experimenters are quite sure that the cloud seeding will not significantly reduce rainfall, and so a one-tailed test is used where the critical region is as in the shaded area in Figure 1. The null hypothesis is rejected only if the test statistic falls in the critical region, i.e. the test statistic has a value larger than the critical value.

 

Figure 1 – Critical region is the right tail
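The right-tailed decision rule can be sketched in a few lines. This is only an illustration under stated assumptions: a z-test is used, and the population standard deviation (4 cm), the sample size (16 years) and the observed sample mean (22 cm) are made-up numbers, not values from the cloud seeding example.

```python
from statistics import NormalDist

mu0 = 20.0          # average rainfall under H0, in cm (from the example)
sigma = 4.0         # assumed known population standard deviation, in cm
n = 16              # assumed number of seeded years in the sample
sample_mean = 22.0  # hypothetical observed mean rainfall after seeding, in cm
alpha = 0.05

# Test statistic: distance of the sample mean from mu0 in standard errors.
z = (sample_mean - mu0) / (sigma / n ** 0.5)

# Critical value: the point that leaves an area of alpha in the right tail.
z_crit = NormalDist().inv_cdf(1 - alpha)

reject = z > z_crit
print(f"z = {z:.2f}, critical value = {z_crit:.3f}")
print("reject H0" if reject else "do not reject H0")
```

Here the test statistic is 2.0 and the critical value is about 1.645, so the statistic falls in the critical region and H0 is rejected.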

 

The critical region here is the right (or upper) tail. It is quite possible to have one-sided tests where the critical region is the left (or lower) tail. For example, suppose the cloud seeding is expected to decrease rainfall. Then the null and alternative hypotheses could be as follows:

H0: µ ≥ 20 (i.e. average rainfall does not decrease after cloud seeding)

H1: µ < 20 (i.e. average rainfall decreases after cloud seeding)

Figure 2 – Critical region is the left tail
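For the left tail, it can be helpful to express the critical value in the original units (cm of rainfall) rather than as a z-score. The sketch below does this under the same kind of assumptions as before: a z-test with a hypothetical known standard deviation of 4 cm and a hypothetical sample of 16 years; the helper name reject_h0 is ours.

```python
from statistics import NormalDist

mu0, sigma, n, alpha = 20.0, 4.0, 16, 0.05  # sigma and n are assumed values

std_err = sigma / n ** 0.5            # standard error of the sample mean
z_crit = NormalDist().inv_cdf(alpha)  # negative: leaves alpha in the left tail
mean_crit = mu0 + z_crit * std_err    # critical value on the rainfall scale, in cm

def reject_h0(sample_mean):
    """Reject H0: mu >= 20 when the sample mean falls in the left tail."""
    return sample_mean < mean_crit

print(f"Reject H0 if the sample mean is below {mean_crit:.2f} cm")
print(reject_h0(18.0), reject_h0(19.0))
```

With these numbers the cut-off is a little under 18.4 cm, so a sample mean of 18.0 cm leads to rejection while 19.0 cm does not.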

 

Two-tailed hypothesis testing doesn’t specify a direction of the test. For the cloud seeding example, it is more common to use a two-tailed test. Here the null and alternative hypotheses are as follows.

H0: µ = 20

H1: µ ≠ 20

The reason for using a two-tailed test is that even though the experimenters expect cloud seeding to increase rainfall, it is possible that the reverse occurs and, in fact, a significant decrease in rainfall results. To take care of this possibility, a two-tailed test is used with the critical region consisting of both the upper and lower tails.

 

Figure 3 – Two-tailed hypothesis testing

 

In this case we reject the null hypothesis if the test statistic falls in either tail of the critical region. To achieve a significance level of α, the critical region in each tail must have size α/2.
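The α/2 split can be illustrated as follows. This is a sketch assuming a z-test (standard normal test statistic under H0); the function name and the two example statistics are hypothetical.

```python
from statistics import NormalDist

def two_tailed_test(z, alpha=0.05):
    """Return (critical value, p-value, reject?) for a two-tailed z-test."""
    std_normal = NormalDist()
    # Each tail gets alpha / 2, so the cut-offs are -z_crit and +z_crit.
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Two-tailed p-value: probability of a statistic at least this far from 0
    # in either direction.
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return z_crit, p_value, abs(z) > z_crit

for z in (2.0, -1.75):
    z_crit, p_value, reject = two_tailed_test(z)
    verdict = "reject H0" if reject else "do not reject H0"
    print(f"z = {z:+.2f}: critical values = ±{z_crit:.2f}, p = {p_value:.4f}, {verdict}")
```

For α = .05 the critical values are about ±1.96. Note that a statistic of magnitude 1.75 would exceed the one-tailed critical value of about 1.645 but does not reach 1.96, which is the price paid for guarding against an effect in either direction.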

Statistical power is 1 – β. Thus power is the probability that you find an effect when one exists, i.e. the probability of correctly rejecting a false null hypothesis. While a significance level for type I error of α = .05 is typically used, generally the target for β is .20 or .10, and so .80 or .90 is used as the target value for power.
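To make power concrete, the following sketch computes it for a right-tailed z-test and then finds the smallest sample size that reaches a target power. Everything numeric here is an assumption for illustration: a true mean of 22 cm under the alternative, a known standard deviation of 4 cm, and the two hypothetical function names.

```python
from math import ceil
from statistics import NormalDist

def power_right_tailed(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a right-tailed z-test when the true mean is mu1 (> mu0)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)
    shift = (mu1 - mu0) / (sigma / n ** 0.5)  # true effect in standard errors
    # Probability that the test statistic exceeds the critical value.
    return 1 - std_normal.cdf(z_crit - shift)

def sample_size_for_power(mu0, mu1, sigma, power=0.80, alpha=0.05):
    """Smallest n for which the right-tailed z-test reaches the target power."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)
    z_beta = std_normal.inv_cdf(power)
    return ceil(((z_alpha + z_beta) * sigma / (mu1 - mu0)) ** 2)

power_16 = power_right_tailed(20, 22, 4, n=16)
n_needed = sample_size_for_power(20, 22, 4, power=0.80)
print(f"power with n = 16: {power_16:.3f}, beta = {1 - power_16:.3f}")
print(f"n needed for power .80: {n_needed}")
```

Under these assumptions, 16 observations give a power of roughly .64 (β of roughly .36), and 25 observations are needed to reach the usual .80 target.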

The general procedure for null hypothesis testing is as follows:

  • State the null and alternative hypotheses
  • Specify α and the sample size
  • Select an appropriate statistical test
  • Collect data (note that the previous steps should be done prior to collecting data)
  • Compute the test statistic based on the sample data
  • Determine the p-value associated with the statistic
  • Decide whether to reject the null hypothesis by comparing the p-value to α (i.e. reject the null hypothesis if p < α)
  • Report your results, including effect sizes (as described in Effect Size)
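The steps above can be sketched end to end for the cloud seeding example. This is a hedged illustration rather than a recommended analysis: the rainfall figures are invented, the population standard deviation is assumed to be known (4 cm) so that a simple z-test can be used, and the effect size shown is a standardized mean difference.

```python
from statistics import NormalDist, fmean

# 1. Hypotheses: H0: mu <= 20, H1: mu > 20 (right-tailed).
mu0 = 20.0

# 2. Significance level and sample size, fixed before collecting data.
alpha = 0.05
n = 10

# 3. Test: one-sample z-test, assuming a known population standard deviation.
sigma = 4.0  # assumed value, in cm

# 4. Data: hypothetical annual rainfall (cm) for n seeded years.
rainfall = [23.1, 19.4, 25.0, 21.7, 22.3, 18.9, 24.6, 20.8, 23.9, 22.3]
assert len(rainfall) == n

# 5. Test statistic.
mean = fmean(rainfall)
z = (mean - mu0) / (sigma / n ** 0.5)

# 6. p-value: right-tail probability under H0.
p = 1 - NormalDist().cdf(z)

# 7. Decision.
reject = p < alpha

# 8. Report, including an effect size (difference in standard deviation units).
effect_size = (mean - mu0) / sigma
print(f"mean = {mean:.1f} cm, z = {z:.2f}, p = {p:.3f}, effect size = {effect_size:.2f}")
print("reject H0" if reject else "do not reject H0")
```

When the population standard deviation is not known, which is the usual situation, a t-test based on the sample standard deviation would normally replace the z-test in step 3.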

 

Observation: Suppose we perform a statistical test of the null hypothesis with α = .05 and obtain a p-value of p = .04, thereby rejecting the null hypothesis. This does not mean that there is a 4% probability of the null hypothesis being true, i.e. P(H0) = .04. What we have shown instead is that, assuming the null hypothesis is true, the conditional probability that the sample data exhibit a test statistic at least as extreme as the one obtained is .04; i.e. P(D|H0) = .04 where D = the event that the sample data exhibit a test statistic at least as extreme as the one observed.

 
