Coping with Missing, Invalid and Duplicate Data in R
Introduction
A vital component of data science is cleaning the data and getting it ready for predictive modeling. The most common problem related to data cleaning is coping with missing data, invalid records and duplicate values.
In this guide, you will learn about techniques for dealing with missing, invalid and duplicate data in the statistical programming language R.
Data
In this guide, we will be using a fictitious dataset of loan applications containing 600 observations and 12 variables:
1. UID – Unique identifier for an applicant
2. Marital_status – Whether the applicant is married (“Yes”) or not (“No”)
3. Dependents – Number of dependents of the applicant
4. Is_graduate – Whether the applicant is a graduate (“Yes”) or not (“No”)
5. Income – Annual income of the applicant (in USD)
6. Loan_amount – Loan amount (in USD) for which the application was submitted
7. Term_months – Tenure of the loan
8. Credit_score – Whether the applicants credit score is good (“Satisfactory”) or not (“Not Satisfactory”)
9. Approval_status – Whether the loan application was approved (“1”) or not (“0”)
10. Age – The applicant’s age in years
11. Sex – Whether the applicant is a male (“M”) or a female (“F”)
12. Purpose – Purpose of applying for the loan
Let’s start by loading the required libraries and the data.
library(readr) dat <- read_csv("data_cleaning.csv") glimpse(dat)
Output:
Observations: 600 Variables: 12 $ UID <chr> "UIDA7", "UIDA354", "UIDA402", "UIDA467", "UIDA78", "U... $ Marital_status <chr> "No", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "Y... $ Dependents <int> NA, 0, 0, 0, NA, NA, NA, NA, 0, 2, 0, 2, 0, 0, NA, NA,... $ Is_graduate <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No", ... $ Income <int> 333300, 532400, 454700, 368500, 611100, 266700, 243700... $ Loan_amount <int> 53500, 115000, 49000, 61000, 62000, 70000, 55500, 6300... $ Term_months <int> 504, 384, 384, 384, 384, 384, 384, 204, 384, 384, 384,... $ Credit_score <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis... $ approval_status <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, ... $ Age <int> -3, -3, -10, -12, -12, -12, -12, 23, 23, 23, 23, 23, 2... $ Sex <chr> "M", "M", "F", "F", "F", "M", "M", "M", "M", "M", "M",... $ Purpose <chr> "Education", "Travel", "Personal", "Business", "Educat...
The output shows that the dataset has six numerical variables (labeled as ‘int’), while the remaining six are categorical variables (labelled as ‘chr’). We will convert these into ‘factor’ variables, except for the ‘UID’ variable, using the line of code below.
names <- c(2,4,8,9,11,12) dat[,names] <- lapply(dat[,names] , factor) glimpse(dat)
Output:
Observations: 600 Variables: 12 $ UID <chr> "UIDA7", "UIDA354", "UIDA402", "UIDA467", "UIDA78", "U... $ Marital_status <fct> No, Yes, No, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes... $ Dependents <int> NA, 0, 0, 0, NA, NA, NA, NA, 0, 2, 0, 2, 0, 0, NA, NA,... $ Is_graduate <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, No, Yes, Yes, Yes, Y... $ Income <int> 333300, 532400, 454700, 368500, 611100, 266700, 243700... $ Loan_amount <int> 53500, 115000, 49000, 61000, 62000, 70000, 55500, 6300... $ Term_months <int> 504, 384, 384, 384, 384, 384, 384, 204, 384, 384, 384,... $ Credit_score <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory... $ approval_status <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, ... $ Age <int> -3, -3, -10, -12, -12, -12, -12, 23, 23, 23, 23, 23, 2... $ Sex <fct> M, M, F, F, F, M, M, M, M, M, M, M, M, F, F, M, M, M, ... $ Purpose <fct> Education, Travel, Personal, Business, Education, Educ...
We are now ready to carry out the data cleaning steps in the following sections.
Duplicate Values
The first step is to check for duplicate records, one of the most common errors in real world data. Duplicate records increase computation time and decrease model accuracy, and hence must be removed. In our dataset, ‘UID’ is the unique identifier variable and will be used to drop the duplicate records. The first line of code below uses the duplicated() function to find duplicates, while the second line prints the number of duplicates.
dup_records <- duplicated(dat$UID) sum(dup_records)
Output:
[1] 3
The output shows that there are three duplicate records. We will drop these records using the first line of code below. The second line prints the dimension of the resulting dataset — 597 observations and 12 variables.
dat <- dat[!duplicated(dat$UID), ] dim(dat)
Output:
[1] 597 12
Invalid Values
When we looked at the data using the glimpse() function in the previous section, we realized that the age variable has incorrect entries. Let’s look at the summary of the age variable.
summary(dat$Age)
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's -12.00 36.50 51.00 49.03 61.00 76.00 2
The output shows that the minimum value of the variable ‘Age’ is -12. This is impossible and brings us to the next common problem in real world datasets: the presence of inaccurate records. It is safe to assume that for loan applications, the minimum age should be 18 years. This means that we will remove records of applicants below 18 years of age.
The first two of lines of code below give us the number of records in the dataset for which the age is below 18 years. The number of such records is seven, and they are removed with the third line of code. The fourth line prints the dimensions of the new data — 590 observations and 12 variables.
Finally, we relook at the summary of the age variable. This shows that the range of age is now 23 to 76 years, indicating that the correction has been made.
age_18 <- dat[which(dat$Age<18),]
dim(age_18)
dat <- dat[-which(dat$Age<18),]
dim(dat)
summary(dat$Age)
Output:
[1] 590 12
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
23.00 37.00 51.00 49.72 61.00 76.00 2
It was easy to detect incorrect entries in the age variable. In other cases, invalid values are in the form of outliers. Let’s look at an example of the ‘Income’ variable. The summary of the variable shows that the minimum and maximum income levels are USD 136700, and 3321001, respectively. This is a highly skewed range, indicating some extreme values. To better understand the distribution, we use the quantile function, which gives us the first to hundredth percentile values of the variable in the sequence of unit percentile.
summary(dat$Income) quantile(dat$Income,seq(0,1,0.01))
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max. 136700 386700 512800 687874 775300 3321001 0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 136700 136700 210676 240000 244288 254540 263684 274664 286108 295100 10% 11% 12% 13% 14% 15% 16% 17% 18% 19% 301560 311100 317700 320000 329100 332220 333300 335288 344400 346700 20% 21% 22% 23% 24% 25% 26% 27% 28% 29% 352860 358800 363084 371396 383332 386700 391172 397980 401508 405556 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 410220 412724 421052 422244 424804 431960 437016 444400 448620 454972 40% 41% 42% 43% 44% 45% 46% 47% 48% 49% 458920 465068 468448 476468 479696 486180 491380 495548 500000 506956 50% 51% 52% 53% 54% 55% 56% 57% 58% 59% 512800 515552 523184 532012 536480 551820 555504 563080 572852 577700 60% 61% 62% 63% 64% 65% 66% 67% 68% 69% 585380 607584 611276 620300 625904 633300 648308 656708 666700 683156 70% 71% 72% 73% 74% 75% 76% 77% 78% 79% 700000 721040 733300 753968 761484 775300 788132 800000 807740 821696 80% 81% 82% 83% 84% 85% 86% 87% 88% 89% 834660 853300 880008 914712 963752 1010680 1058180 1111100 1149276 1219460 90% 91% 92% 93% 94% 95% 96% 97% 98% 99% 1262060 1333300 1392412 1502676 1664032 1944400 2064768 2223884 2608396 3197268 100% 3321001
We can remove the outliers using the method described in the previous section. We can also address them through a different method of flooring and capping the extreme values. The first line of code below does the flooring of the lower outliers at the first percentile value, i.e., USD 136700. Similarly, the second line performs the capping of the higher outliers at the 99th percentile value, i.e., USD 3321001.
The third line of code prints the new summary of the variable, indicating that the correction has been done.
dat$Income[which(dat$Income<136700)]<- 136700 dat$Income[which(dat$Income > 3321001)]<- 3321001 summary(dat$Income)
Min. 1st Qu. Median Mean 3rd Qu. Max. 136700 386175 508650 685301 772650 3321001
Missing Values
Missing value treatment is the most common data cleaning step performed in a data science project. The line of code below prints the number of missing values in each of the variables.
sapply(dat, function(x) sum(is.na(x)))
Output:
UID Marital_status Dependents Is_graduate Income 0 0 3 6 0 Loan_amount Term_months Credit_score approval_status Age 7 6 0 0 2 Sex Purpose 0 0
The output above shows the presence of missing values across the variables, most of which are numerical variables, except ‘Is_graduate’, which is a categorical variable.
Missing Value Imputation for Numerical Variables
The most widely used technique for imputing values for a numerical variable is to replace the missing values with the mean or the median value. In the lines of code below, we replace missing values in ‘Loan_amount’ with the median value, while the missing values in ‘Term_months’ are replaced by the mean value. The output shows that the missing values have been imputed.
dat$Loan_amount[is.na(dat$Loan_amount)] <- median(dat$Loan_amount, na.rm = TRUE)
table(is.na(dat$Loan_amount))
dat$Term_months[is.na(dat$Term_months)] <- mean(dat$Term_months, na.rm = TRUE)
table(is.na(dat$Term_months))
Output:
FALSE 590 FALSE 590
Missing Value Imputation for Categorical Variables
For categorical variables, it is important to understand the frequency distribution, which can be printed with the line of code below.
table(dat$Is_graduate)
Output:
No Yes 127 457
The output shows that most applicants were graduates, identified with the label ‘Yes’. The lines of code below replace the missing values with the highest frequency label, ‘Yes’.
dat$Is_graduate[is.na(dat$Is_graduate)] <- "Yes" table(dat$Is_graduate)
Output:
No Yes 127 463
Removing Rows with Missing Values
We have imputed missing values using measures of central tendency: mean, median and mode. Another technique is to delete rows where any variable has missing values. This is performed using the na.omit() function, which removes all the rows containing missing values.
dat <- na.omit(dat)
dim(dat)
Output:
[1] 585 12
The resulting data has 585 observations of 12 variables. We can do a final check to see if all the missing values have been removed using the command below.
sapply(dat, function(x) sum(is.na(x)))
Output:
UID Marital_status Dependents Is_graduate Income 0 0 0 0 0 Loan_amount Term_months Credit_score approval_status Age 0 0 0 0 0 Sex Purpose 0 0
We can look at the data and post all these transformations using the glimpse command below.
glimpse(dat)
Output:
Observations: 585
Variables: 12
$ UID <chr> "UIDA209", "UIDA238", "UIDA256", "UIDA274", "UIDA430",...
$ Marital_status <fct> Yes, Yes, Yes, Yes, No, Yes, No, Yes, Yes, No, No, Yes...
$ Dependents <int> 0, 2, 0, 2, 0, 0, 0, 1, 3, 0, 0, 1, 0, 0, 1, 0, 0, 0, ...
$ Is_graduate <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, ...
$ Income <dbl> 1984000, 1066700, 834100, 775900, 421100, 402300, 1111...
$ Loan_amount <int> 1070000, 111000, 89000, 1330000, 49500, 56500, 104000,...
$ Term_months <dbl> 384, 384, 384, 384, 384, 384, 384, 384, 384, 384, 384,...
$ Credit_score <fct> Satisfactory, Satisfactory, Not _satisfactory, Satisfa...
$ approval_status <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, ...
$ Age <int> 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 25, 25, 25, 25...
$ Sex <fct> M, M, M, M, M, F, M, M, M, M, F, M, M, M, M, F, M, M, ...
$ Purpose <fct> Personal, Personal, Personal, Travel, Personal, Person...
Conclusion
In this guide, you have learned methods of dealing with missing, invalid and duplicate data in R. These techniques will help you in cleaning data and making it ready for machine learning.
Python Example for Beginners
Two Machine Learning Fields
There are two sides to machine learning:
- Practical Machine Learning:This is about querying databases, cleaning data, writing scripts to transform data and gluing algorithm and libraries together and writing custom code to squeeze reliable answers from data to satisfy difficult and ill defined questions. It’s the mess of reality.
- Theoretical Machine Learning: This is about math and abstraction and idealized scenarios and limits and beauty and informing what is possible. It is a whole lot neater and cleaner and removed from the mess of reality.
Data Science Resources: Data Science Recipes and Applied Machine Learning Recipes
Introduction to Applied Machine Learning & Data Science for Beginners, Business Analysts, Students, Researchers and Freelancers with Python & R Codes @ Western Australian Center for Applied Machine Learning & Data Science (WACAMLDS) !!!
Latest end-to-end Learn by Coding Recipes in Project-Based Learning:
Applied Statistics with R for Beginners and Business Professionals
Data Science and Machine Learning Projects in Python: Tabular Data Analytics
Data Science and Machine Learning Projects in R: Tabular Data Analytics
Python Machine Learning & Data Science Recipes: Learn by Coding
R Machine Learning & Data Science Recipes: Learn by Coding
Comparing Different Machine Learning Algorithms in Python for Classification (FREE)
Disclaimer: The information and code presented within this recipe/tutorial is only for educational and coaching purposes for beginners and developers. Anyone can practice and apply the recipe/tutorial presented here, but the reader is taking full responsibility for his/her actions. The author (content curator) of this recipe (code / program) has made every effort to ensure the accuracy of the information was correct at time of publication. The author (content curator) does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from accident, negligence, or any other cause. The information presented here could also be found in public knowledge domains.