Splitting and Combining Data with R

Splitting and Combining Data with R

 

Introduction

In real-world data science projects, it is often necessary to divide data into two or more subsets or to combine multiple sets into one. This is an integral part of the data wrangling process for predictive modeling.

In this guide, you will learn techniques for splitting and combining data using the statistical programming language R.

Combining Data: Columns

Combining, or joining, data is a common data preparation task. The merge function can be used to perform all four standard join functions:

  • Inner join

  • Full join

  • Left join

  • Right join

Let’s start by creating sample datasets for carrying out these operations.

 
per_data <- data.frame(cust_id   = c("ID1", "ID2", "ID3"), married = c("Yes", "No", "Yes"), Age = c(44, 23, 35))
 
inc_data <- data.frame(cust_id   = c("ID2", "ID1", "ID3"), income = c(60000, 80000, 45000))
 
head(per_data)
 
head(inc_data)
 

Output:

 
cust_id   married   Age
ID1 	     Yes        44
ID2  	    No          23
ID3 	     Yes        35 
 
 
cust_id income
ID2       60000
ID1  	 80000
ID3  	 45000
 

Inner Join

The two datasets can be combined horizontally using the merge function. In our case, we will inner join the two datasets using the common key variable ‘UID’.

The first line of code below merges the two data frames, while the second line displays the resultant dataset, ‘merge1’.

 
merge1 = merge(per_data,inc_data,by="cust_id")
head(merge1)
 

Output:

 
|           	| cust_id  | married  | Age   	| income  |
|---       	|---------   |---------   |-----    	|--------	|
| 1        	| ID1 	   | Yes 	   | 44      	| 80000    |
| 2        	| ID2 	   | No  	   | 23      	| 60000    |
| 3        	| ID3 	   | Yes 	   | 35      	| 45000    |
 

Full Join

The ‘full join’ results in keeping all the observations in either of the datasets. The first line of code below creates a data frame ‘df3’, while the second line performs the full join and saves the object as data frame, ‘df4’.

The output will create a data frame where all the observations are maintained in the resultant data. The missing cells will be marked as ‘NA’.

 
df3 =  data.frame(cust_id   = c("ID4", "ID1", "ID5"), approval = c("Yes", "No", "Yes"))
 
df4 = merge(x = merge1, y = df3,  by = "cust_id", all = TRUE)
head(df4)
 

Output:


|   	| cust_id   	| married   	| Age   | income    	| approval  	|
|---	|---------  	|---------  	|-----  |--------   	|---------- 	|
| 1 	| ID1 	  	| Yes 	  	| 44    | 80000     	| No   	 	|
| 2 	| ID2 	  	| No  	  	| 23    | 60000     	| NA   	 	|
| 3 	| ID3 	  	| Yes       	| 35    | 45000     	| NA   	 	|
| 4 	| ID4 	  	| NA  	  	| NA    | NA 	   	| Yes  	 	|
| 5 	| ID5 	  	| NA  	  	| NA    | NA 	   	| Yes  	 	|
 

Left Join

The ‘left join’ function adds information from the right data frame to the left data frame. Let’s perform the left join on the data frames ‘merge1’ and ‘df3’, as shown below.

 
ljoin = merge(x = merge1, y = df3, by = "cust_id", all.x = TRUE)
 
head(ljoin)
 

Output:


|   	| cust_id   	| married   	| Age   | income    	| approval  	|
|---	|---------  	|---------  	|-----  |--------   	|---------- 	|
| 1 	| ID1 	  	| Yes 	  	| 44    | 80000     	| No   	 	|
| 2 	| ID2 	  	| No  	  	| 23    | 60000     	| NA   	 	|
| 3 	| ID3 	  	| Yes 	  	| 35    | 45000     	| NA   	 	|
  

The output shows that all the records in the left dataset, ‘merge1’, are maintained while the right dataset, ‘df3’, is mapped to it.

Right Join

The ‘right join’ function adds information from the left data frame to the right data frame. Let’s perform the right join on the data frames ‘merge1’ and ‘df3’, as shown below.


 
rjoin = merge(x = merge1, y = df3, by = "cust_id", all.y = TRUE)
head(rjoin)
 

Output:



|   	| cust_id   	| married   	| Age   | income    	| approval  	|
|---	|---------  	|---------  	|-----  |--------   	|---------- 	|
| 1 	| ID1 	  	| Yes 	  	| 44    | 80000     	| No   	 	|
| 2 	| ID4 	  	| NA  	  	| NA    | NA 	   	| Yes  	 	|
| 3 	| ID5 	  	| NA  	  	| NA    | NA 	   	| Yes  	 	|
 
 

Combining Data: Rows

It’s possible to combine the two data frames based on their rows using the rbind function. Let’s create two data frames and join them vertically using the lines of code below.

 
r1 <- data.frame(cust_id   = c("ID11", "ID12", "ID13"), married = c("Yes", "No", "Yes"), Age = c(44, 23, 35))
 
r2 <- data.frame(cust_id   = c("ID14", "ID15"), married = c("Yes", "Yes"), Age = c(40, 35))
 
r_tot <- rbind(r1, r2)
glimpse(r_tot)
 

Output:

 
Observations: 5
Variables: 3
$ cust_id <fct> ID11, ID12, ID13, ID14, ID15
$ married <fct> Yes, No, Yes, Yes, Yes
$ Age 	<dbl> 44, 23, 35, 40, 35
 

The resultant data now has five observations and three variables. To carry out this operation, the two data frames must have the same variables, but they do not need to be in the same order.

Splitting Data: Rows

One of the most common data partitioning steps in machine learning is dividing the data into training and test sets for evaluating model performance. This is called the holdout approach to model validation. There are many techniques to perform this task depending upon the target variable.

We will use fictitious loan application data containing 600 observations and 4 variables—’UID’, ‘Income’, ‘Credit_score’, and ‘approval_status’. The lines of code below load the libraries and the data.

 
library(readr)
library(dplyr)
library(caret)
library(caTools)
 
d1 <- read_csv("data_1.csv")
glimpse(d1)
 

Output:


Observations: 600
Variables: 4
$ UID         	<chr> "UIDA7", "UIDA354", "UIDA402", "UIDA467", "UIDA78", "U...
$ Income 	     <int> 333300, 532400, 454700, 368500, 611100, 266700, 243700...
$ Credit_score	<chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, ...
 

Since the target variable, ‘approval_status’, is categorical, we will convert it into a factor variable.

 
d1$approval_status = as.factor(d1$approval_status)
str(d1$approval_status)
 

Output:


Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
 

We can now divide the dataset into training and test datasets using the ‘caTools’ package.

The first line of code below loads the ‘caTools’ library, while the second line sets the random seed for reproducibility of the results. The third line uses the sample.split function to divide the data in the ratio of 70 to 30. This ensures that 70 percent of the data is allocated to the training set, while the remaining 30 percent gets allocated to the test set. This is done in the fourth and fifth lines of code.


library(caTools)
set.seed(150)
split = sample.split(d1$approval_status, SplitRatio = 0.70)
 
# Create training and testing sets
train = subset(d1, split == TRUE)
test = subset(d1, split == FALSE)
 
dim(train); dim(test)
 
Output:

[1] 420   4
[1] 180   4
 

The output confirms that the data has been divided into training data and test data, containing 420 and 180 observations, respectively.

Conclusion

In this guide, you have learned techniques for combining and splitting data frames using the base R packages. These techniques allow you to join and divide data in real-world datasets.

 

Python Example for Beginners

Two Machine Learning Fields

There are two sides to machine learning:

  • Practical Machine Learning:This is about querying databases, cleaning data, writing scripts to transform data and gluing algorithm and libraries together and writing custom code to squeeze reliable answers from data to satisfy difficult and ill defined questions. It’s the mess of reality.
  • Theoretical Machine Learning: This is about math and abstraction and idealized scenarios and limits and beauty and informing what is possible. It is a whole lot neater and cleaner and removed from the mess of reality.

 

Data Science Resources: Data Science Recipes and Applied Machine Learning Recipes

Introduction to Applied Machine Learning & Data Science for Beginners, Business Analysts, Students, Researchers and Freelancers with Python & R Codes @ Western Australian Center for Applied Machine Learning & Data Science (WACAMLDS) !!!

Latest end-to-end Learn by Coding Recipes in Project-Based Learning:

Applied Statistics with R for Beginners and Business Professionals

Data Science and Machine Learning Projects in Python: Tabular Data Analytics

Data Science and Machine Learning Projects in R: Tabular Data Analytics

Python Machine Learning & Data Science Recipes: Learn by Coding

R Machine Learning & Data Science Recipes: Learn by Coding

Comparing Different Machine Learning Algorithms in Python for Classification (FREE)

Disclaimer: The information and code presented within this recipe/tutorial is only for educational and coaching purposes for beginners and developers. Anyone can practice and apply the recipe/tutorial presented here, but the reader is taking full responsibility for his/her actions. The author (content curator) of this recipe (code / program) has made every effort to ensure the accuracy of the information was correct at time of publication. The author (content curator) does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from accident, negligence, or any other cause. The information presented here could also be found in public knowledge domains.  

Google –> SETScholars