Pandas Merge

The merge operation in Pandas merges two DataFrames based on their indexes or a specified column.

The merge() in Pandas works similar to JOINs in SQL.

Let's see an example.

 import pandas as pd # create dataframes from the dictionaries data1 = { 'EmployeeID' : ['E001', 'E002', 'E003', 'E004', 'E005'], 'Name' : ['John Doe', 'Jane Smith', 'Peter Brown', 'Tom Johnson', 'Rita Patel'], 'DeptID': ['D001', 'D003', 'D001', 'D002', 'D003'], } employees = pd.DataFrame(data1) data2 = { 'DeptID': ['D001', 'D002', 'D003'], 'DeptName': ['Sales', 'HR', 'Admin'] } departments = pd.DataFrame(data2) 
# merge dataframes employees and departments merged_df = pd.merge(employees, departments)
# display DataFrames print("Employees:") print(employees) print() print("Departments:") print(departments) print() print("Merged DataFrame:") print(merged_df)

Output

 Employees: EmployeeID Name DeptID 0 E001 John Doe D001 1 E002 Jane Smith D003 2 E003 Peter Brown D001 3 E004 Tom Johnson D002 4 E005 Rita Patel D003 Departments: DeptID DeptName 0 D001 Sales 1 D002 HR 2 D003 Admin Merged DataFrame: EmployeeID Name DeptID DeptName 0 E001 John Doe D001 Sales 1 E003 Peter Brown D001 Sales 2 E002 Jane Smith D003 Admin 3 E005 Rita Patel D003 Admin 4 E004 Tom Johnson D002 HR

In this example, we merged the DataFrames employees and departments using the merge() method.

Notice that the two DataFrames are merged based on the DeptID column as it's common to both the DataFrames.


merge() Syntax in Pandas

The syntax of the merge() method in Pandas is:

 pd.merge(left, right, on=None, how='inner', left_on=None, right_on=None, sort=False)

Here,

  • left: specifies the left DataFrame to be merged
  • right: specifies the right DataFrame to be merged
  • on (optional): specifies column(s) to join on
  • how (optional): specifies the type of join to perform
  • left_on (optional): specifies column(s) from the left DataFrame to use as key(s) for merging
  • right_on (optional): specifies column(s) from the right DataFrame to use as key(s) for merging
  • sort (optional): if True, sort the result DataFrame by the join keys

Example: Merge DataFrames Based on Keys

When there are no common columns between two DataFrames, we can merge them by specifying the columns (as keys) in the left_on and right_on arguments. For example,

 import pandas as pd # create dataframes from the dictionaries data1 = { 'EmployeeID': ['E001', 'E002', 'E003', 'E004', 'E005'], 'Name': ['John Doe', 'Jane Smith', 'Peter Brown', 'Tom Johnson', 'Rita Patel'], 'DeptID1': ['D001', 'D003', 'D001', 'D002', 'D006'], } employees = pd.DataFrame(data1) data2 = { 'DeptID2': ['D001', 'D002', 'D003', 'D004'], 'DeptName': ['Sales', 'HR', 'Admin', 'Marketing'] } departments = pd.DataFrame(data2) 
# merge the dataframes df_merge = pd.merge(employees, departments, left_on='DeptID1', right_on = 'DeptID2', sort = True)
print(df_merge)

Output

  EmployeeID Name DeptID1 DeptID2 DeptName 0 E001 John Doe D001 D001 Sales 1 E003 Peter Brown D001 D001 Sales 2 E004 Tom Johnson D002 D002 HR 3 E002 Jane Smith D003 D003 Admin

In the above example, we performed a merge operation on two DataFrames employees and departments using the merge() method with various arguments.

Here, we used DeptID1 and DeptID2 as the key for merging the DataFrames. Then, we sorted the resulting DataFrame using sort = True.


Types of Join Operations In merge()

So far, we've not defined how to merge the dataframes, thus it defaults to an inner join.

However, we can specify the join type in the how argument. Here are the 5 join types we can use in the merge() method:

  • Left Join
  • Right Join
  • Outer Join
  • Inner Join (Default)
  • Cross Join

Left Join

A left join combines two DataFrames based on a common key and returns a new DataFrame that contains all rows from the left DataFrame and the matched rows from the right DataFrame.

If values are not found in the right dataframe, it fills the space with NaN. For example,

 import pandas as pd # create dataframes from the dictionaries data1 = { 'EmployeeID': ['E001', 'E002', 'E003', 'E004', 'E005'], 'Name': ['John Doe', 'Jane Smith', 'Peter Brown', 'Tom Johnson', 'Rita Patel'], 'DeptID': ['D001', 'D003', 'D001', 'D002', 'D006'], } employees = pd.DataFrame(data1) data2 = { 'DeptID': ['D001', 'D002', 'D003', 'D004'], 'DeptName': ['Sales', 'HR', 'Admin', 'Marketing'] } departments = pd.DataFrame(data2) 
# left merge the dataframes df_merge = pd.merge(employees, departments, on = 'DeptID', how = 'left', sort = True)
print(df_merge)

Output

  EmployeeID Name DeptID DeptName 0 E001 John Doe D001 Sales 1 E003 Peter Brown D001 Sales 2 E004 Tom Johnson D002 HR 3 E002 Jane Smith D003 Admin 4 E005 Rita Patel D006 NaN

Right Join

A right join is the opposite of a left join. It returns a new DataFrame that contains all rows from the right DataFrame and the matched rows from the left DataFrame.

If values are not found in the left dataframe, it fills the space with NaN. For example,

 import pandas as pd # create dataframes from the dictionaries data1 = { 'EmployeeID': ['E001', 'E002', 'E003', 'E004', 'E005'], 'Name': ['John Doe', 'Jane Smith', 'Peter Brown', 'Tom Johnson', 'Rita Patel'], 'DeptID': ['D001', 'D003', 'D001', 'D002', 'D006'], } employees = pd.DataFrame(data1) data2 = { 'DeptID': ['D001', 'D002', 'D003', 'D004'], 'DeptName': ['Sales', 'HR', 'Admin', 'Marketing'] } departments = pd.DataFrame(data2) 
# right merge the dataframes df_merge = pd.merge(employees, departments, on = 'DeptID', how = 'right', sort = True)
print(df_merge)

Output

  EmployeeID Name DeptID DeptName 0 E001 John Doe D001 Sales 1 E003 Peter Brown D001 Sales 2 E004 Tom Johnson D002 HR 3 E002 Jane Smith D003 Admin 4 NaN NaN D004 Marketing

Inner Join

An inner join combines two DataFrames based on a common key and returns a new DataFrame that contains only rows that have matching values in both of the original DataFrames.

For example,

 import pandas as pd # create dataframes from the dictionaries data1 = { 'EmployeeID': ['E001', 'E002', 'E003', 'E004', 'E005'], 'Name': ['John Doe', 'Jane Smith', 'Peter Brown', 'Tom Johnson', 'Rita Patel'], 'DeptID': ['D001', 'D003', 'D001', 'D002', 'D006'], } employees = pd.DataFrame(data1) data2 = { 'DeptID': ['D001', 'D002', 'D003', 'D004'], 'DeptName': ['Sales', 'HR', 'Admin', 'Marketing'] } departments = pd.DataFrame(data2) 
# inner merge the dataframes df_merge = pd.merge(employees, departments, on = 'DeptID', how = 'inner', sort = True)
print(df_merge)

Output

  EmployeeID Name DeptID DeptName 0 E001 John Doe D001 Sales 1 E003 Peter Brown D001 Sales 2 E004 Tom Johnson D002 HR 3 E002 Jane Smith D003 Admin

Outer Join

An outer join combines two DataFrames based on a common key. Unlike an inner join, an outer join returns a new DataFrame that contains all rows from both original DataFrames.

If values are not found in the DataFrames, it fills the space with NaN.

For example,

 import pandas as pd # create dataframes from the dictionaries data1 = { 'EmployeeID': ['E001', 'E002', 'E003', 'E004', 'E005'], 'Name': ['John Doe', 'Jane Smith', 'Peter Brown', 'Tom Johnson', 'Rita Patel'], 'DeptID': ['D001', 'D003', 'D001', 'D002', 'D006'], } employees = pd.DataFrame(data1) data2 = { 'DeptID': ['D001', 'D002', 'D003', 'D004'], 'DeptName': ['Sales', 'HR', 'Admin', 'Marketing'] } departments = pd.DataFrame(data2) 
# outer merge the dataframes df_merge = pd.merge(employees, departments, on = 'DeptID', how = 'outer', sort = True)
print(df_merge)

Output

  EmployeeID Name DeptID DeptName 0 E001 John Doe D001 Sales 1 E003 Peter Brown D001 Sales 2 E004 Tom Johnson D002 HR 3 E002 Jane Smith D003 Admin 4 NaN NaN D004 Marketing 5 E005 Rita Patel D006 NaN

Cross Join

A cross join in Pandas creates the cartesian product of both DataFrames while preserving the order of the left DataFrame.

For example,

 import pandas as pd # create dataframes from the dictionaries data1 = { 'EmployeeID': ['E001', 'E002', 'E003', 'E004', 'E005'], 'Name': ['John Doe', 'Jane Smith', 'Peter Brown', 'Tom Johnson', 'Rita Patel'], 'DeptID': ['D001', 'D003', 'D001', 'D002', 'D006'], } employees = pd.DataFrame(data1) data2 = { 'DeptID': ['D001', 'D002', 'D003', 'D004'], 'DeptName': ['Sales', 'HR', 'Admin', 'Marketing'] } departments = pd.DataFrame(data2) 
# merge the dataframes df_merge = pd.merge(employees, departments, how = 'cross')
print(df_merge)

Output

  EmployeeID Name DeptID_x DeptID_y DeptName 0 E001 John Doe D001 D001 Sales 1 E001 John Doe D001 D002 HR 2 E001 John Doe D001 D003 Admin 3 E001 John Doe D001 D004 Marketing 4 E002 Jane Smith D003 D001 Sales 5 E002 Jane Smith D003 D002 HR 6 E002 Jane Smith D003 D003 Admin 7 E002 Jane Smith D003 D004 Marketing 8 E003 Peter Brown D001 D001 Sales 9 E003 Peter Brown D001 D002 HR 10 E003 Peter Brown D001 D003 Admin 11 E003 Peter Brown D001 D004 Marketing 12 E004 Tom Johnson D002 D001 Sales 13 E004 Tom Johnson D002 D002 HR 14 E004 Tom Johnson D002 D003 Admin 15 E004 Tom Johnson D002 D004 Marketing 16 E005 Rita Patel D006 D001 Sales 17 E005 Rita Patel D006 D002 HR 18 E005 Rita Patel D006 D003 Admin 19 E005 Rita Patel D006 D004 Marketing

Join vs Merge vs Concat

There are three different methods to combine DataFrames in Pandas:

  • join(): joins two DataFrames based on their indexes, performs left join by default
  • merge(): joins two DataFrames based on any specified columns, performs inner join by default
  • concat(): stacks two DataFrames along the vertical or horizontal axis

Our premium learning platform, created with over a decade of experience and thousands of feedbacks.

Learn and improve your coding skills like never before.

Try Programiz PRO
  • Interactive Courses
  • Certificates
  • AI Help
  • 2000+ Challenges