Data Wrangling in Python – How to Join And Merge Pandas Dataframe

Join And Merge Pandas Dataframe

import modules


import pandas as pd
from IPython.display import display
from IPython.display import Image

Create a dataframe

raw_data = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name'])
df_a
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches

Create a second dataframe

raw_data = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}

df_b = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name'])
df_b
subject_id first_name last_name
0 4 Billy Bonder
1 5 Brian Black
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan

Create a third dataframe

raw_data = {
        'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

df_n = pd.DataFrame(raw_data, columns = ['subject_id','test_id'])
df_n
subject_id test_id
0 1 51
1 2 15
2 3 15
3 4 61
4 5 16
5 7 14
6 8 15
7 9 1
8 10 61
9 11 16

Join the two dataframes along rows

df_new = pd.concat([df_a, df_b])
df_new
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
0 4 Billy Bonder
1 5 Brian Black
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan

Join the two dataframes along columns

pd.concat([df_a, df_b], axis=1)
subject_id first_name last_name subject_id first_name last_name
0 1 Alex Anderson 4 Billy Bonder
1 2 Amy Ackerman 5 Brian Black
2 3 Allen Ali 6 Bran Balwner
3 4 Alice Aoni 7 Bryce Brice
4 5 Ayoung Atiches 8 Betty Btisan

Merge two dataframes along the subject_id value

pd.merge(df_new, df_n, on='subject_id')
subject_id first_name last_name test_id
0 1 Alex Anderson 51
1 2 Amy Ackerman 15
2 3 Allen Ali 15
3 4 Alice Aoni 61
4 4 Billy Bonder 61
5 5 Ayoung Atiches 16
6 5 Brian Black 16
7 7 Bryce Brice 14
8 8 Betty Btisan 15

Merge two dataframes with both the left and right dataframes using the subject_id key

pd.merge(df_new, df_n, left_on='subject_id', right_on='subject_id')
subject_id first_name last_name test_id
0 1 Alex Anderson 51
1 2 Amy Ackerman 15
2 3 Allen Ali 15
3 4 Alice Aoni 61
4 4 Billy Bonder 61
5 5 Ayoung Atiches 16
6 5 Brian Black 16
7 7 Bryce Brice 14
8 8 Betty Btisan 15

Merge with outer join

“Full outer join produces the set of all records in Table A and Table B, with matching records from both sides where available. If there is no match, the missing side will contain null.” – source

pd.merge(df_a, df_b, on='subject_id', how='outer')
subject_id first_name_x last_name_x first_name_y last_name_y
0 1 Alex Anderson NaN NaN
1 2 Amy Ackerman NaN NaN
2 3 Allen Ali NaN NaN
3 4 Alice Aoni Billy Bonder
4 5 Ayoung Atiches Brian Black
5 6 NaN NaN Bran Balwner
6 7 NaN NaN Bryce Brice
7 8 NaN NaN Betty Btisan

Merge with inner join

“Inner join produces only the set of records that match in both Table A and Table B.” – source

pd.merge(df_a, df_b, on='subject_id', how='inner')
subject_id first_name_x last_name_x first_name_y last_name_y
0 4 Alice Aoni Billy Bonder
1 5 Ayoung Atiches Brian Black

Merge with right join

pd.merge(df_a, df_b, on='subject_id', how='right')
subject_id first_name_x last_name_x first_name_y last_name_y
0 4 Alice Aoni Billy Bonder
1 5 Ayoung Atiches Brian Black
2 6 NaN NaN Bran Balwner
3 7 NaN NaN Bryce Brice
4 8 NaN NaN Betty Btisan

Merge with left join

“Left outer join produces a complete set of records from Table A, with the matching records (where available) in Table B. If there is no match, the right side will contain null.” – source

pd.merge(df_a, df_b, on='subject_id', how='left')
subject_id first_name_x last_name_x first_name_y last_name_y
0 1 Alex Anderson NaN NaN
1 2 Amy Ackerman NaN NaN
2 3 Allen Ali NaN NaN
3 4 Alice Aoni Billy Bonder
4 5 Ayoung Atiches Brian Black

Merge while adding a suffix to duplicate column names

pd.merge(df_a, df_b, on='subject_id', how='left', suffixes=('_left', '_right'))
subject_id first_name_left last_name_left first_name_right last_name_right
0 1 Alex Anderson NaN NaN
1 2 Amy Ackerman NaN NaN
2 3 Allen Ali NaN NaN
3 4 Alice Aoni Billy Bonder
4 5 Ayoung Atiches Brian Black

Merge based on indexes

pd.merge(df_a, df_b, right_index=True, left_index=True)
subject_id_x first_name_x last_name_x subject_id_y first_name_y last_name_y
0 1 Alex Anderson 4 Billy Bonder
1 2 Amy Ackerman 5 Brian Black
2 3 Allen Ali 6 Bran Balwner
3 4 Alice Aoni 7 Bryce Brice
4 5 Ayoung Atiches 8 Betty Btisan

Python Example for Beginners

Two Machine Learning Fields

There are two sides to machine learning:

  • Practical Machine Learning:This is about querying databases, cleaning data, writing scripts to transform data and gluing algorithm and libraries together and writing custom code to squeeze reliable answers from data to satisfy difficult and ill defined questions. It’s the mess of reality.
  • Theoretical Machine Learning: This is about math and abstraction and idealized scenarios and limits and beauty and informing what is possible. It is a whole lot neater and cleaner and removed from the mess of reality.

Data Science Resources: Data Science Recipes and Applied Machine Learning Recipes

Introduction to Applied Machine Learning & Data Science for Beginners, Business Analysts, Students, Researchers and Freelancers with Python & R Codes @ Western Australian Center for Applied Machine Learning & Data Science (WACAMLDS) !!!

Latest end-to-end Learn by Coding Recipes in Project-Based Learning:

Applied Statistics with R for Beginners and Business Professionals

Data Science and Machine Learning Projects in Python: Tabular Data Analytics

Data Science and Machine Learning Projects in R: Tabular Data Analytics

Python Machine Learning & Data Science Recipes: Learn by Coding

R Machine Learning & Data Science Recipes: Learn by Coding

Comparing Different Machine Learning Algorithms in Python for Classification (FREE)

Disclaimer: The information and code presented within this recipe/tutorial is only for educational and coaching purposes for beginners and developers. Anyone can practice and apply the recipe/tutorial presented here, but the reader is taking full responsibility for his/her actions. The author (content curator) of this recipe (code / program) has made every effort to ensure the accuracy of the information was correct at time of publication. The author (content curator) does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from accident, negligence, or any other cause. The information presented here could also be found in public knowledge domains.  

Google –> SETScholars