# Hypothesis Testing – Interpreting Data with Statistical Models

## Introduction

Building predictive models and carrying out data science research depend on formulating a hypothesis and drawing conclusions using statistical tests. In this guide, you will learn how to perform these tests using the statistical programming language ‘R’.

This guide covers the most widely used inferential statistical techniques, as listed below:

• One sample T-test

• Independent T-test

• Chi-square Test

• Correlation Test

• Analysis of Variance (ANOVA)

## Data

In this guide, we will be using a fictitious dataset of loan applicants containing 200 observations and ten variables, as described below:

• Marital_status: Whether the applicant is married (“Yes”), unmarried (“No”), or divorced (“Divorced”).

• Is_graduate: Whether the applicant is a graduate (“Yes”) or not (“No”).

• Income: Annual Income of the applicant (in USD).

• Loan_amount: Loan amount (in USD) for which the application was submitted.

• Credit_score: Whether the applicant’s credit score was good (“Good”) or not (“Bad”).

• approval_status: Whether the loan application was approved (“Yes”) or not (“No”).

• Investment: Investments in stocks and mutual funds (in USD), as declared by the applicant.

• gender: Whether the applicant is “Female” or “Male”.

• age: The applicant’s age in years.

• work_exp: work experience in years.

```
library(dplyr)

# 'df' is assumed to already hold the loan-applicant data,
# e.g. df <- read.csv("loan_applicants.csv")
glimpse(df)
```

Output:

```
Observations: 200
Variables: 10
Marital_status  <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
Is_graduate     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", ...
Income          <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000...
Loan_amount     <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61...
approval_status <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
Investment      <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9...
gender          <chr> "Female", "Female", "Female", "Female", "Female", "Fem...
age             <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33...
work_exp        <dbl> 9.0, 8.0, 10.0, 9.5, 9.0, 7.0, 6.0, 9.0, 9.0, 11.0, 9....
```

## Key Terms

Before moving on to the statistical tests, it is good to understand a few important terms.

• Null and Alternative Hypotheses

The statistical tests in this guide rely on testing a null hypothesis, which is specific for each case.

The null hypothesis assumes the absence of a relationship between the variables under study. For example, for two variables, the null hypothesis assumes that there is no correlation or association between them.

The alternative hypothesis is simply the contrary of the null hypothesis.

• P-value

For any statistical test, the p-value is the statistic used to decide whether to reject or fail to reject the null hypothesis. It is defined as the probability, assuming the null hypothesis is true, of obtaining a result equal to or more extreme than the one observed in the data.

• Decision Rule

The p-value, determined by conducting the statistical test, is then compared to a predetermined value ‘alpha’, which is often taken as 0.05.

The decision rule is: if the p-value for the test is less than 0.05, we reject the null hypothesis, but if it is greater than or equal to 0.05, we fail to reject the null hypothesis.
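As a sketch, the decision rule can be written directly in R; the alpha and p-value below are illustrative placeholders, not values from any actual test:

```
alpha <- 0.05
p_value <- 0.03  # placeholder; in practice this comes from a test's output

# compare the p-value against alpha to reach a decision
if (p_value < alpha) {
  "Reject the null hypothesis"
} else {
  "Fail to reject the null hypothesis"
}
```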

## One Sample T-test

The idea behind the one-sample t-test is to compare the mean of a numeric vector against a theoretical mean. In our data, we will take the ‘Income’ variable and evaluate it against a theoretical mean.

As per the United States Census Bureau’s annual mid year population estimates, the average per capita personal income in the United States, in the year 2018, was USD 53,820. We will be testing a claim that the mean income of the applicants is USD 53,820.

An important assumption of the one-sample t-test is that the variable ‘Income’ should be normally distributed. The line of code below creates a histogram, which suggests that the distribution is approximately normal.

```
hist(df$Income, main = 'Annual Income of Loan Applicants in USD', xlab = 'Income (USD)')
```

Output:

![image name](https://i.imgur.com/AEpv8FF.png)
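Beyond eyeballing the histogram, a formal normality check such as the Shapiro-Wilk test (base R’s shapiro.test) could complement it. The sketch below runs on simulated income values, since it assumes nothing about the actual dataset:

```
set.seed(42)
income <- rnorm(200, mean = 62750, sd = 10600)  # simulated stand-in for df$Income

# Shapiro-Wilk test: the null hypothesis is that the data are normally distributed,
# so a p-value above 0.05 gives no evidence against normality
shapiro.test(income)
```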

Since the normality assumption is satisfied, we will go ahead with the t-test. In ‘R’, the t.test function performs this task, as shown in the line of code below. The first argument is the vector of numbers, ‘Income’, while the second argument, ‘mu’, is the theoretical mean.

```
t.test(df$Income, mu = 53820)
```

Output:

```
	One Sample t-test

data:  df$Income
t = 11.871, df = 199, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 53820
95 percent confidence interval:
 61266.55 64233.45
sample estimates:
mean of x 
    62750 
```

#### Interpretation of the Output

The output above prints the t-statistic (t = 11.871) and the degrees of freedom, which is 199 (n – 1). The p-value here is close to 0, and less than 0.05, which means that we would reject the null hypothesis that the population mean is equal to USD 53,820.

Another point to notice is the line “alternative hypothesis: true mean is not equal to 53820.” This corresponds to a two-sided alternative hypothesis. To make it a one-sided t-test, we would set the alternative argument to “less” or “greater”, which defines the direction of the alternative hypothesis.
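For instance, to test the one-sided claim that the mean income exceeds USD 53,820, the alternative argument would be set to “greater”. The sketch below runs on simulated incomes, since the loan dataset itself is not bundled here:

```
set.seed(1)
income <- rnorm(200, mean = 62750, sd = 10600)  # simulated stand-in for df$Income

# one-sided alternative: true mean is greater than 53820
t.test(income, mu = 53820, alternative = "greater")
```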

## Independent T-test

In this test, we are going to compare two independent groups and see if their means are equal. The variable under study is the ‘work_exp’ variable, and we will test whether the work experience is the same across the male and the female applicants.

The first and second lines of code below create two vectors containing the work experience of the female and male applicants, respectively. We must also check the assumption that both groups are normally distributed. This is done by the remaining lines of code, which create two histograms. The histograms suggest that both variables are approximately normally distributed.

```
f_workexp <- df$work_exp[df$gender == 'Female']
m_workexp <- df$work_exp[df$gender == 'Male']

# histograms of the two groups, side by side
par(mfrow = c(1, 2))
hist(f_workexp)
hist(m_workexp)
```

Output:

![image name](https://i.imgur.com/5dnhPCi.png)

Since the normality assumption is satisfied, we will perform the t-test using the line of code below.

``````
t.test(f_workexp, m_workexp)
``````

Output:

```
	Welch Two Sample t-test

data:  f_workexp and m_workexp
t = -0.29465, df = 25.088, p-value = 0.7707
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.7904954  0.5925894
sample estimates:
mean of x mean of y 
 7.832865  7.931818 
```

### Interpretation of the Output

Since the p-value of 0.7707 is greater than 0.05, we fail to reject the null hypothesis that the means of these two groups are equal. In other words, there is no significant difference in the work experience of the male and female applicants.
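Note that t.test defaults to the Welch test, which does not assume equal variances in the two groups. If equal variances were justified, the classical pooled (Student’s) two-sample t-test could be requested with var.equal = TRUE; a sketch on simulated work-experience vectors:

```
set.seed(7)
f_workexp <- rnorm(178, mean = 7.8, sd = 1.2)  # simulated female work experience
m_workexp <- rnorm(22,  mean = 7.9, sd = 1.2)  # simulated male work experience

# pooled two-sample t-test instead of the default Welch test
t.test(f_workexp, m_workexp, var.equal = TRUE)
```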

## Chi-square Test of Independence

The Chi-square test of independence is used to determine if there is an association between two categorical variables. In our case, we would like to test if the marital status of the applicants has any association with the approval status.

The first step is to create a two-way table between the variables under study, which is done in the lines of code below.

```
mar_approval <- table(df$Marital_status, df$approval_status)
mar_approval
```

Output:

```
          No  Yes
Divorced  31   29
No        66   10
Yes       52   12
```

The next step is to generate the expected counts using the line of code below.

```
chisq.test(mar_approval, correct = FALSE)$expected
```

Output:

```
            No   Yes
Divorced 44.70 15.30
No       56.62 19.38
Yes      47.68 16.32
```

We are now ready to run the test of independence using the chisq.test function, as shown in the line of code below.

``````
chisq.test(mar_approval, correct=FALSE)
``````

Output:

```
	Pearson's Chi-squared test

data:  mar_approval
X-squared = 24.095, df = 2, p-value = 5.859e-06
```

### Interpretation of the Output

Since the p-value is less than 0.05, we reject the null hypothesis of no association and conclude that the marital status of the applicants is associated with the approval status.
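To see which cells drive the association, the Pearson residuals of the fitted test object can be inspected; the sketch below rebuilds the two-way table from the counts shown earlier:

```
# two-way table of counts, as printed above
mar_approval <- as.table(rbind(
  Divorced = c(No = 31, Yes = 29),
  No       = c(No = 66, Yes = 10),
  Yes      = c(No = 52, Yes = 12)
))

test <- chisq.test(mar_approval, correct = FALSE)
round(test$residuals, 2)  # cells with large absolute residuals drive the association
</test>
```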

## Correlation Test

Correlation Tests are used to determine the presence and extent of a linear relationship between two quantitative variables. In our case, we would like to statistically test if there is a correlation between the applicant’s investment and the work experience.

The first step is to visualize the relationship with the scatter plot, which is done in the line of code below.

```
plot(df$Investment, df$work_exp,
     main = "Correlation between Investment Levels and Work Experience",
     xlab = "Investment in USD", ylab = "Work experience in years")
```

Output:

![image name](https://i.imgur.com/pip40R6.png)

The above plot suggests the absence of a linear relationship between the two variables. We can quantify this impression by calculating the correlation coefficient, which is done below.

```
cor(df$Investment, df$work_exp)
```

Output:

``````
 0.06168653
``````

The value of 0.06 indicates a weak positive linear relationship between the two variables. Let us further confirm this with a correlation test, which is done in ‘R’ with the cor.test() function.

The basic syntax is cor.test(var1, var2, method = “method”), with the default method being “pearson”. This is done with the line of code below.

```
cor.test(df$Investment, df$work_exp)
```

Output:

```
	Pearson's product-moment correlation

data:  df$Investment and df$work_exp
t = 0.86966, df = 198, p-value = 0.3855
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.07771964  0.19872675
sample estimates:
       cor 
0.06168653 
```

### Interpretation of the Output

Since the p-value of 0.3855 is greater than 0.05, we fail to reject the null hypothesis that the true correlation is zero. In other words, the weak positive correlation between the applicant’s investment and work experience is not statistically significant.
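If the variables were skewed or the relationship monotone rather than strictly linear, a rank-based alternative could be obtained by passing method = “spearman” to cor.test; a sketch on simulated vectors:

```
set.seed(3)
investment <- rlnorm(200, meanlog = 11, sdlog = 0.5)  # simulated, right-skewed investments
work_exp   <- rnorm(200, mean = 8, sd = 1.5)          # simulated work experience

# Spearman's rank correlation does not assume normality or linearity
cor.test(investment, work_exp, method = "spearman")
```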

## Analysis of Variance (ANOVA)

The Analysis of Variance (ANOVA) test is used to determine if the categorical group (‘Marital_status’) has any impact on the numerical variable (‘Income’). In our case, the null hypothesis to test is that the applicant’s marital status has no impact on their income level.

The first step is to calculate the applicants’ average income in each category of the variable ‘Marital_status’. The line of code below performs this task.

``````
aggregate(Income~Marital_status,df,mean)
``````

Output:

```
Marital_status    Income
Divorced        62166.67
No              63052.63
Yes             62937.50
```

The next step is to calculate the standard deviation of income levels within each group, which is done in the line of code below.

``````
aggregate(Income~Marital_status,df,sd)
``````

Output:

```
Marital_status    Income
Divorced        11213.14
No              10345.88
Yes             10576.82
```
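Comparing the group standard deviations is an informal check of the equal-variance assumption; a formal alternative is Bartlett’s test (base R’s bartlett.test). The sketch below uses simulated incomes and hypothetical group labels:

```
set.seed(5)
income <- rnorm(200, mean = 62800, sd = 10700)                             # simulated incomes
marital_status <- sample(c("Divorced", "No", "Yes"), 200, replace = TRUE)  # hypothetical labels

# Bartlett's test: the null hypothesis is that the group variances are equal
bartlett.test(income ~ as.factor(marital_status))
```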

The standard deviations are calculated to check whether an assumption of ANOVA, roughly equal variability across groups, is satisfied. Since the largest standard deviation (11,213, for the ‘Divorced’ group) is not more than twice the smallest (10,346), we can conclude that the assumption is satisfied and go ahead with the test. The final step is to run the ANOVA and print the summary result, which is done in the lines of code below.

```
anova_1 <- aov(df$Income ~ df$Marital_status)
summary(anova_1)
```

Output:

```
                   Df    Sum Sq   Mean Sq F value Pr(>F)
df$Marital_status   2 2.963e+07  14813596    0.13  0.878
Residuals         197 2.249e+10 114182095
```

### Interpretation of the Output

Since the p-value of 0.878 is greater than 0.05, we fail to reject the null hypothesis that the applicants’ marital status has no impact on their income levels.

Since the ANOVA result is not significant, there is no need to conduct Tukey’s HSD post-hoc test to examine the differences in the group means of income. However, if the result were significant, we could run the post-hoc test as shown in the line of code below.

``````
TukeyHSD(anova_1)
``````

Output:

```
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = df$Income ~ df$Marital_status)

$`df$Marital_status`
                   diff       lwr      upr     p adj
No-Divorced    885.9649 -3472.022 5243.952 0.8807915
Yes-Divorced   770.8333 -3763.821 5305.487 0.9150500
Yes-No        -115.1316 -4396.338 4166.075 0.9977789
```

All the p-values are greater than 0.05, which suggests that the differences in the applicants’ income levels across marital-status groups are not significant.

## Conclusion

In this guide, you have learned several techniques for performing hypothesis testing for data interpretation. You also learned how to interpret the results of the statistical tests in the context of the null hypothesis.
