Interpreting Data Using Statistical Models with R
Introduction
Statistical models are useful not only in machine learning, but also in interpreting data and understanding the relationships between the variables. In this guide, the reader will learn how to fit and analyze statistical models on the quantitative (linear regression) and qualitative (logistic regression) target variables. The reader will also learn how to create and interpret the correlation matrix of the numerical variables.
We will begin by understanding the data.
Data
In this guide, we will be using the fictitious data of loan applicants containing 600 observations and nine variables, as described below:
- Marital_status: Whether the applicant is married (“Yes”) or not (“No”).
- Is_graduate: Whether the applicant is a graduate (“Yes”) or not (“No”).
- Income: Annual Income of the applicant (in USD).
- Loan_amount: Loan amount (in USD) for which the application was submitted.
- Credit_score: Whether the applicant’s credit score is good (“Good”) or not (“Bad”).
- Age: The applicant’s age in years.
- Sex: Whether the applicant is female (F) or male (M).
- approval_status: Whether the loan application was approved (“Yes”) or not (“No”).
- Investment: Investments in stocks and mutual funds (in USD), as declared by the applicant.
Let us start by loading the required libraries and the data.
library(readr)
library(dplyr)
library(mlbench)
dat <- read_csv("data_r.csv")
glimpse(dat)
Output:
Observations: 600
Variables: 9
$ Marital_status  <chr> "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes...
$ Is_graduate     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No...
$ Income          <int> 586700, 426700, 735500, 327200, 240000, 683200, 800000, 4...
$ Loan_amount     <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61000...
$ Credit_score    <chr> "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "...
$ approval_status <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No...
$ Age             <int> 76, 76, 75, 75, 75, 74, 72, 72, 71, 71, 71, 70, 70, 69, 6...
$ Sex             <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M...
$ Investment      <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9568...
The output shows that the dataset has five categorical variables (labelled as ‘chr’) while the remaining four are numerical variables (labelled as ‘int’).
Linear Regression
Regression models predict a continuous target. Linear regression is a type of regression model that assumes a linear relationship between the target and the predictor variables.
Simple Linear Regression
Simple linear regression is the simplest form of regression which uses only one covariate for predicting the target variable. In our case, ‘Investment’ is the covariate variable, while ‘Income’ is the target variable.
The first line of code below fits the univariate linear regression model, while the second line prints the summary of the fitted model. Note that we are using the lm command, which is used for fitting linear models in R.
fit_lin <- lm(Income ~ Investment, data = dat)
summary(fit_lin)
Output:
Call:
lm(formula = Income ~ Investment, data = dat)
Residuals:
Min 1Q Median 3Q Max
-4940996 -93314 -33441 78990 3316423
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.393e+05 2.091e+04 11.45 <2e-16 ***
Investment 2.895e+00 8.071e-02 35.87 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 401100 on 598 degrees of freedom
Multiple R-squared: 0.6827, Adjusted R-squared: 0.6821
F-statistic: 1286 on 1 and 598 DF, p-value: < 2.2e-16
Interpretation of the Output
- Investment is a significant predictor of Income, as is evident from the significance code ‘***’ printed next to the p-value of the variable.
- The p-value, shown under the column Pr(>|t|), is far below the 0.05 significance level, which indicates a statistically significant relationship between ‘Investment’ and ‘Income’.
- The coefficient estimate indicates that each one-dollar increase in ‘Investment’ is associated with an average increase of about 2.895 dollars in ‘Income’.
- R-squared value: This represents the proportion of the variation in the dependent variable (Income) that is explained by the independent variable (Investment). In our case, the R-squared value of 0.68 means that 68 percent of the variation in ‘Income’ is explained by ‘Investment’.
All the above factors indicate that there is a strong linear relationship between the two variables.
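The quantities read off the printed summary can also be extracted from the fitted object, which is handy when they feed into later steps. The following is a minimal sketch on simulated data, since data_r.csv is not bundled with this guide; the data frame `sim`, the true slope of 3, and the noise level are assumptions chosen only for illustration.

```r
# Simulated stand-in for the loan data (hypothetical values, not data_r.csv)
set.seed(42)
n <- 200
sim <- data.frame(Investment = runif(n, 10000, 200000))
sim$Income <- 200000 + 3 * sim$Investment + rnorm(n, sd = 50000)

fit_sim <- lm(Income ~ Investment, data = sim)
fit_summary <- summary(fit_sim)

# Coefficient table: estimate, standard error, t value and p-value per term
coef_table <- coef(fit_summary)
slope <- coef_table["Investment", "Estimate"]
slope_p <- coef_table["Investment", "Pr(>|t|)"]

# Proportion of the variance in Income explained by Investment
r_squared <- fit_summary$r.squared

# 95% confidence interval for the slope
slope_ci <- confint(fit_sim, "Investment", level = 0.95)

print(coef_table)
print(r_squared)
print(slope_ci)
```

The confidence interval is a useful complement to the p-value: it shows the range of slopes that are compatible with the data, not just whether the slope differs from zero.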
Correlation
For numerical attributes, an excellent way to think about relationships is to calculate the correlation.
Correlation Coefficient Between Two Variables
The Pearson correlation coefficient, calculated using the cor function, measures the direction and strength of the linear relationship between two variables. It ranges from -1 to 1. The line of code below prints the correlation coefficient, which comes out to be about 0.83. This indicates a strong positive correlation between the two variables.
cor(dat$Income, dat$Investment)
Output:
[1] 0.8262401
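In a simple linear regression, the R-squared value equals the square of the Pearson correlation between the target and the single predictor. The sketch below checks this on simulated data (the vectors `income` and `investment` and their generating values are assumptions, not the guide's dataset) and also uses cor.test to attach a p-value and confidence interval to the coefficient.

```r
# Simulated vectors standing in for dat$Income and dat$Investment (hypothetical)
set.seed(1)
n <- 200
investment <- runif(n, 10000, 200000)
income <- 200000 + 3 * investment + rnorm(n, sd = 50000)

# Pearson correlation (the default method of cor)
r <- cor(income, investment)

# R-squared from the corresponding simple linear regression
r2_lm <- summary(lm(income ~ investment))$r.squared

# The two quantities agree up to floating-point tolerance
print(all.equal(r^2, r2_lm))

# Significance test and 95% confidence interval for the correlation
ct <- cor.test(income, investment)
print(ct$estimate)
print(ct$conf.int)
```

The same relationship holds in the guide's output: 0.8262401 squared is approximately 0.6827, the Multiple R-squared reported for the model of Income on Investment.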
Correlation Between Multiple Variables
It is also possible to create a correlation matrix for multiple variables, which is a symmetrical table of all pairs of attribute correlations for numerical variables. The first line of code below calculates the correlation between the numerical variables, while the second line displays the correlation matrix.
correl_dat <- cor(dat[,c(3,4,7,9)])
print(correl_dat)
Output:
                Income Loan_amount        Age Investment
Income      1.00000000  0.76643958 0.02787282  0.8262401
Loan_amount 0.76643958  1.00000000 0.05791348  0.7202692
Age         0.02787282  0.05791348 1.00000000  0.1075841
Investment  0.82624011  0.72026924 0.10758414  1.0000000
The matrix above shows that Income has a high positive correlation with ‘Loan_amount’ and ‘Investment’.
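Selecting columns by position, as in c(3,4,7,9), breaks silently if the column order changes. One alternative is to select the numeric columns by type. The sketch below shows the idea on a small simulated data frame; `sim` and its values are hypothetical and only mimic the structure of the loan data.

```r
# Small simulated data frame with mixed column types (hypothetical values)
set.seed(7)
n <- 100
sim <- data.frame(
  Marital_status = sample(c("Yes", "No"), n, replace = TRUE),
  Income = rnorm(n, mean = 500000, sd = 100000),
  Age = sample(21:75, n, replace = TRUE),
  stringsAsFactors = FALSE
)
sim$Investment <- 0.2 * sim$Income + rnorm(n, sd = 5000)

# Logical vector marking which columns are numeric (integer or double)
numeric_cols <- sapply(sim, is.numeric)

# Correlation matrix of the numeric columns only
correl_sim <- cor(sim[, numeric_cols])
print(round(correl_sim, 2))
```

Rounding the printed matrix to two decimals makes large tables easier to scan without changing the stored values.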
Multiple Linear Regression
As the name suggests, multiple linear regression tries to predict the target variable using multiple predictors. In our case, we will build the model for ‘Income’ using all the other variables except ‘approval_status’. But before doing the modelling, it is better to convert the character variables into the factor type. The first line of code below creates a vector holding the column positions of the character variables in the dataset. The second line uses the lapply function to convert these columns, whose positions are stored in ‘names’, into factor variables. The third line prints the structure of the data, where the categorical variables have been converted to the ‘factor’ type.
names <- c(1,2,5,6,8)
dat[,names] <- lapply(dat[,names] , factor)
glimpse(dat)
Output:
Observations: 600
Variables: 9
$ Marital_status  <fct> Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, Yes, Yes, Yes, ...
$ Is_graduate     <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes, Yes,...
$ Income          <int> 586700, 426700, 735500, 327200, 240000, 683200, 800000, 4...
$ Loan_amount     <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61000...
$ Credit_score    <fct> Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Go...
$ approval_status <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, N...
$ Age             <int> 76, 76, 75, 75, 75, 74, 72, 72, 71, 71, 71, 70, 70, 69, 6...
$ Sex             <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, F, M, M, M, M, ...
$ Investment      <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9568...
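The conversion above relies on fixed column positions. A variant that detects the character columns by type is sketched below on a toy data frame; the rows in `sim` are hypothetical and exist only to make the example self-contained.

```r
# Toy data frame with two character columns and one integer column (hypothetical rows)
sim <- data.frame(
  Is_graduate = c("No", "Yes", "No", "Yes"),
  Income = c(586700L, 426700L, 735500L, 327200L),
  Sex = c("M", "F", "M", "M"),
  stringsAsFactors = FALSE
)

# Identify the character columns by type and convert only those to factors
char_cols <- sapply(sim, is.character)
sim[char_cols] <- lapply(sim[char_cols], factor)

str(sim)

# factor() orders levels alphabetically, so the first level is the reference
print(levels(sim$Is_graduate))
```

Because the first level acts as the reference category in lm and glm, a factor with levels "No" and "Yes" yields a coefficient labelled with the "Yes" level, such as Is_graduateYes.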
Now we are ready to fit the multiple linear regression model. The lines of code below fit the model and print the result summary.
fit_mlr <- lm(Income ~ Marital_status + Is_graduate + Loan_amount + Credit_score + Age + Sex + Investment, data = dat)
summary(fit_mlr)
Output:
Call:
lm(formula = Income ~ Marital_status + Is_graduate + Loan_amount + Credit_score + Age + Sex + Investment, data = dat)
Residuals:
Min 1Q Median 3Q Max
-4184641 -133867 -37001 92469 2852369
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.055e+05 6.802e+04 4.491 8.53e-06 ***
Marital_statusYes 2.341e+04 3.299e+04 0.710 0.4782
Is_graduateYes 8.032e+04 3.671e+04 2.188 0.0291 *
Loan_amount 3.419e-01 2.925e-02 11.688 < 2e-16 ***
Credit_scoreGood -5.012e+04 3.196e+04 -1.568 0.1174
Age -2.426e+03 1.006e+03 -2.412 0.0162 *
SexM 4.793e+04 4.048e+04 1.184 0.2370
Investment 2.021e+00 1.043e-01 19.379 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 357700 on 592 degrees of freedom
Multiple R-squared: 0.7502, Adjusted R-squared: 0.7473
F-statistic: 254 on 7 and 592 DF, p-value: < 2.2e-16
Interpretation of the Output
- The R-squared value increased from 0.68 to 0.75. Because R-squared never decreases when predictors are added, the adjusted R-squared is the better guide: it also rose, from 0.6821 to 0.7473, which suggests that the additional variables have improved the explanatory power of the model.
- ‘Investment’ and ‘Loan_amount’ are the highly significant predictors, while ‘Age’ and ‘Is_graduate’ are moderately significant. The degree of significance can also be understood from the number of stars, if any, printed next to the p-value of the variable.
- The p-value for all four variables discussed above is less than the significance level of 0.05, as shown under the column labeled Pr(>|t|). This reinforces our inference that these variables have a statistically significant relationship with the ‘Income’ variable.
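Comparing R-squared values by eye can be backed up with a formal test. For nested linear models, the anova function performs an F-test of whether the extra predictors improve the fit. The sketch below uses simulated data; the data frame `sim` and the coefficients used to generate it are assumptions for illustration only.

```r
# Simulated data with two numeric predictors (hypothetical values)
set.seed(123)
n <- 300
sim <- data.frame(
  Investment = runif(n, 10000, 200000),
  Loan_amount = runif(n, 20000, 300000)
)
sim$Income <- 150000 + 2 * sim$Investment + 0.5 * sim$Loan_amount +
  rnorm(n, sd = 60000)

# Nested models: the larger one adds Loan_amount to the smaller one
fit_small <- lm(Income ~ Investment, data = sim)
fit_large <- lm(Income ~ Investment + Loan_amount, data = sim)

# Adjusted R-squared penalises the extra parameter
adj_small <- summary(fit_small)$adj.r.squared
adj_large <- summary(fit_large)$adj.r.squared
print(c(small = adj_small, large = adj_large))

# F-test for the improvement in fit from the added predictor
comparison <- anova(fit_small, fit_large)
print(comparison)
f_p_value <- comparison[["Pr(>F)"]][2]
```

A small p-value in the second row of the anova table indicates that the larger model fits significantly better than the smaller one.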
Logistic Regression
Logistic Regression is a type of generalized linear model which is used for classification problems. While a linear regression model predicts a continuous outcome, the idea of a logistic regression model is to extend it to situations where the outcome variable is categorical. In this guide, we will perform two-class classification using logistic regression. We will be using the same dataset, but this time, the target variable will be ‘approval_status’, which indicates whether the loan application was approved (“Yes”) or not (“No”).
Univariate Logistic Regression
We will start with only one covariate, ‘Credit_score’, to predict ‘approval_status’. The function being used is the glm command, which is used for fitting generalized linear models in R. The lines of code below fit the univariate logistic regression model and print the model summary. The argument family="binomial" specifies that we are building a logistic regression model for predicting binary outcomes.
mod_log <- glm(approval_status ~ Credit_score, data = dat, family = "binomial")
summary(mod_log)
Output:
Call:
glm(formula = approval_status ~ Credit_score, family = "binomial", data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3197 -0.6550 0.3748 0.3748 1.8137
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.4302 0.1783 -8.023 1.03e-15 ***
Credit_scoreGood 4.0506 0.2674 15.147 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 749.20 on 599 degrees of freedom
Residual deviance: 395.64 on 598 degrees of freedom
AIC: 399.64
Number of Fisher Scoring iterations: 5
Using the Pr(>|z|) result above, we can conclude that the variable ‘Credit_score’ is an important predictor for ‘approval_status’, as the p-value is less than 0.05. The significance code also supports this inference. It is also intuitive that applicants with a good credit score are more likely to get their loan applications approved, and vice versa.
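Logistic regression coefficients are on the log-odds scale, so they are easier to interpret once converted to odds ratios or probabilities. The worked example below uses only the two estimates printed in the summary above. It assumes that ‘approval_status’ is a factor with levels "No" and "Yes", so that glm models the probability of "Yes"; the variable names `b0`, `b1`, `p_bad` and `p_good` are introduced here for illustration.

```r
# Estimates copied from the printed summary of the univariate model
b0 <- -1.4302  # (Intercept): log-odds of approval when Credit_score is "Bad"
b1 <- 4.0506   # Credit_scoreGood: change in log-odds for "Good" versus "Bad"

# Odds ratio: how many times larger the odds of approval are with a good score
odds_ratio <- exp(b1)

# plogis is the inverse logit, 1 / (1 + exp(-x)), mapping log-odds to probability
p_bad <- plogis(b0)
p_good <- plogis(b0 + b1)

print(round(c(odds_ratio = odds_ratio, p_bad = p_bad, p_good = p_good), 3))
```

Under these estimates the predicted probability of approval is roughly 0.19 for applicants with a bad credit score and roughly 0.93 for those with a good one, which puts a concrete size on the effect that the p-value only flags as significant.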
Multivariate Logistic Regression
We can also include multiple variables in a logistic regression model by using the formula approval_status ~ ., where the dot stands for all the other variables in the data. Below we will fit a multivariate logistic regression model for ‘approval_status’ using all the other variables.
mod_log2 <- glm(approval_status ~ ., data = dat, family = "binomial")
summary(mod_log2)
Output:
Call:
glm(formula = approval_status ~ ., family = "binomial", data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7715 -0.2600 0.1995 0.2778 1.8321
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.160e+00 7.062e-01 -4.474 7.67e-06 ***
Marital_statusYes 7.360e-01 3.265e-01 2.254 0.02418 *
Is_graduateYes 2.469e+00 3.809e-01 6.484 8.95e-11 ***
Income 1.949e-07 5.013e-07 0.389 0.69746
Loan_amount -9.635e-07 3.128e-07 -3.080 0.00207 **
Credit_scoreGood 4.649e+00 3.612e-01 12.869 < 2e-16 ***
Age -1.379e-02 1.002e-02 -1.377 0.16841
SexM -3.306e-01 3.941e-01 -0.839 0.40158
Investment 1.784e-06 1.923e-06 0.928 0.35360
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 749.20 on 599 degrees of freedom
Residual deviance: 327.75 on 591 degrees of freedom
AIC: 345.75
Number of Fisher Scoring iterations: 6
Interpretation of the Output
- The variables ‘Is_graduate’, with label “Yes”, and ‘Credit_score’, with label “Good”, are the two most significant variables. This is indicated by their lower p-values and higher significance codes. ‘Loan_amount’ and ‘Marital_status’ are the next two most significant variables for predicting ‘approval_status’.
- The Akaike information criterion (AIC) value decreased from 399.64 in the univariate model to 345.75 in the multivariate model. In simple terms, the AIC is an estimator of the relative quality of statistical models for a given set of data, with lower values indicating a better trade-off between fit and complexity. The decrease in AIC suggests that adding more variables has improved the model, even after accounting for the penalty on the extra parameters.
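The AIC values can be reproduced from the printed deviances, and the same numbers support a likelihood-ratio test between the two nested models. The arithmetic below uses only figures from the two summaries above. It relies on the fact that, for a binary response, the AIC equals the residual deviance plus twice the number of estimated coefficients; the variable names are introduced here for illustration.

```r
# Residual deviances and coefficient counts taken from the two printed summaries
dev_uni <- 395.64    # univariate model: intercept + Credit_scoreGood
k_uni <- 2
dev_multi <- 327.75  # multivariate model: intercept + 8 predictor terms
k_multi <- 9

# AIC = residual deviance + 2 * number of coefficients (binary response)
aic_uni <- dev_uni + 2 * k_uni
aic_multi <- dev_multi + 2 * k_multi
print(c(univariate = aic_uni, multivariate = aic_multi))

# Likelihood-ratio test: the drop in deviance is compared to a chi-squared
# distribution with degrees of freedom equal to the number of added coefficients
lr_stat <- dev_uni - dev_multi
lr_df <- k_multi - k_uni
lr_p <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p_value = lr_p))
```

With the fitted model objects available, the same test can be run directly with anova(mod_log, mod_log2, test = "Chisq").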
Conclusion
In this guide, you have learned about interpreting data using statistical models. You also learned about building the correlation matrix for numerical variables and interpreting the output to identify statistically significant variables.