Very often, we have data from multiple sources. To perform an analysis, we need to merge two dataframes together with one or more common key variables.

In this tutorial, you will learn

Full match
Partial match

Full match

A full match returns values that have a counterpart in the destination table. The values that are not match won’t be return in the new data frame. The partial match, however, return the missing values as NA.

We will see a simple inner join. The inner join keyword selects records that have matching values in both tables. To join two datasets, we can use merge() function. We will use three arguments :

merge(x, y, by.x = x, by.y = y)

Arguments: -x: The origin data frame -y: The data frame to merge -by.x: The column used for merging in x data frame. Column x to merge on -by.y: The column used for merging in y data frame. Column y to merge on

Example:

Create First Dataset with variables

surname
nationality

Create Second Dataset with variables

surname
movies

The common key variable is surname. We can merge both data and check if the dimensionality is 7×3.

We add stringsAsFactors=FALSE in the data frame because we don’t want R to convert string as factor, we want the variable to be treated as character.

# Create origin dataframe(

producers <- data.frame(   
    surname =  c("Spielberg","Scorsese","Hitchcock","Tarantino","Polanski"),    
    nationality = c("US","US","UK","US","Poland"),    
    stringsAsFactors=FALSE)

# Create destination dataframe
movies <- data.frame(    
    surname = c("Spielberg",
		"Scorsese",
                "Hitchcock",
              	"Hitchcock",
                "Spielberg",
                "Tarantino",
                "Polanski"),    
    title = c("Super 8",
    		"Taxi Driver",
    		"Psycho",
    		"North by Northwest",
    		"Catch Me If You Can",
    		"Reservoir Dogs","Chinatown"),                
     		stringsAsFactors=FALSE)

# Merge two datasets
m1 <- merge(producers, movies, by.x = "surname")
m1
dim(m1)

Output:

surname		nationality		title
1 Hitchcock		UK		Psycho
2 Hitchcock		UK		North by Northwest
3 Polanski		Poland		Chinatown
4 Scorsese		US		Taxi Driver
5 Spielberg		US		Super 8
6 Spielberg		US		Catch Me If You Can
7 Tarantino		US		Reservoir Dogs

Let’s merge data frames when the common key variables have different names.

We change surname to name in the movies data frame. We use the function identical(x1, x2) to check if both dataframes are identical.

# Change name of ` movies ` dataframe
colnames(movies)[colnames(movies) == 'surname'] <- 'name'
# Merge with different key value
m2 <- merge(producers, movies, by.x = "surname", by.y = "name")
# Print head of the data
head(m2)

Output:

##surname     nationality		title
## 1 Hitchcock          UK		Psycho
## 2 Hitchcock          UK		North by Northwest
## 3 Polanski          Poland		Chinatown
## 4 Scorsese           US		Taxi Driver
## 5 Spielberg          US		Super 8
## 6 Spielberg          US		Catch Me If You Can

# Check if data are identical
identical(m1, m2)

Output:

## [1] TRUE

This shows that merge operation is performed even if the column names are different.

Partial match

It is not surprising that two dataframes do not have the same common key variables. In the full matching, the dataframe returns only rows found in both x and y data frame. With partial merging, it is possible to keep the rows with no matching rows in the other data frame. These rows will have NA in those columns that are usually filled with values from y. We can do that by setting all.x= TRUE.

For instance, we can add a new producer, Lucas, in the producer data frame without the movie references in movies data frame. If we set all.x= FALSE, R will join only the matching values in both data set. In our case, the producer Lucas will not be join to the merge because it is missing from one dataset.

Let’s see the dimension of each output when we specify all.x= TRUE and when we don’t.

# Create a new producer
add_producer <-  c('Lucas', 'US')
#  Append it to the ` producer` dataframe
producers <- rbind(producers, add_producer)
# Use a partial merge 
m3 <-merge(producers, movies, by.x = "surname", by.y = "name", all.x = TRUE)
m3

Output:

# Compare the dimension of each data frame
dim(m1)

Output:

## [1] 7 3

dim(m2)

Output:

## [1] 7 3

dim(m3)

Output:

## [1] 8 3

As we can see, the dimension of the new data frame 8×3 compared with 7×3 for m1 and m2. R includes NA for the missing author in the books data frame.

Disclaimer: The information and code presented within this recipe/tutorial is only for educational and coaching purposes for beginners and developers. Anyone can practice and apply the recipe/tutorial presented here, but the reader is taking full responsibility for his/her actions. The author (content curator) of this recipe (code / program) has made every effort to ensure the accuracy of the information was correct at time of publication. The author (content curator) does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from accident, negligence, or any other cause. The information presented here could also be found in public knowledge domains.

Learn by Coding: v-Tutorials on Applied Machine Learning and Data Science for Beginners

Latest end-to-end Learn by Coding Projects (Jupyter Notebooks) in Python and R:

All Notebooks in One Bundle: Data Science Recipes and Examples in Python & R.

End-to-End Python Machine Learning Recipes & Examples.

End-to-End R Machine Learning Recipes & Examples.

Applied Statistics with R for Beginners and Business Professionals

Data Science and Machine Learning Projects in Python: Tabular Data Analytics

Data Science and Machine Learning Projects in R: Tabular Data Analytics

Python Machine Learning & Data Science Recipes: Learn by Coding

R Machine Learning & Data Science Recipes: Learn by Coding

Comparing Different Machine Learning Algorithms in Python for Classification (FREE)

There are 2000+ End-to-End Python & R Notebooks are available to build Professional Portfolio as a Data Scientist and/or Machine Learning Specialist. All Notebooks are only $29.95. We would like to request you to have a look at the website for FREE the end-to-end notebooks, and then decide whether you would like to purchase or not.

Linear Regression in R – partial least squares regression in R

Pandas Example – Write a Pandas program to merge two given dataframes with different columns

MySQL Tutorials for Business Analyst: MySQL DELETE Query

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

Towards Advanced Analytics Specialist & Analytics Engineer

R tutorials for Business Analyst – Merge Data Frames in R: Full and Partial Match

(R Tutorials for Business Analyst)

Merge Data Frames in R: Full and Partial Match

Full match

Partial match

Personal Career & Learning Guide for Data Analyst, Data Engineer and Data Scientist

Applied Machine Learning & Data Science Projects and Coding Recipes for Beginners

95% Discount on “Projects & Recipes, tutorials, ebooks”

Projects and Coding Recipes, eTutorials and eBooks: The best All-in-One resources for Data Analyst, Data Scientist, Machine Learning Engineer and Software Developer

Learn by Coding: v-Tutorials on Applied Machine Learning and Data Science for Beginners

Related Posts

Decoding Biomedical Insights: Mastering Analysis of Variance (ANOVA) in Biomedical Science

Revolutionizing Predictive Modeling in R with Boosting and AdaBoost

Optimizing Predictive Analysis: Linear Regression with Gradient Descent in Machine Learning