Python Pandas - Concatenation



Concatenation in Pandas refers to the process of joining two or more Pandas objects (like DataFrames or Series) along a specified axis. This operation is very useful when you need to merge data from different sources or datasets.

The primary tool for this operation is the pd.concat() function, which works with both Series and DataFrame objects and can combine them either row-wise or column-wise.

In this tutorial, we'll explore how to concatenate Pandas objects using the pd.concat() function. We'll cover several scenarios: concatenating along rows, using keys to distinguish the concatenated DataFrames, ignoring indexes during concatenation, and concatenating along columns.

Understanding the pd.concat() Function

The pandas.concat() function is the primary method used for concatenation in Pandas. It allows you to concatenate pandas objects along a particular axis with various options for handling indexes.
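Since the function accepts Series as well as DataFrames, a minimal sketch with two Series may help. The names s1, s2, 'first' and 'second' are illustrative and not part of the original tutorial.

```python
import pandas as pd

# Two small Series with illustrative names
s1 = pd.Series([98, 90], name='first')
s2 = pd.Series([89, 80], name='second')

# Along rows (axis=0, the default) the result is a longer Series
stacked = pd.concat([s1, s2])

# Along columns (axis=1) the result is a DataFrame, one column per Series
side_by_side = pd.concat([s1, s2], axis=1)

print(stacked)
print(side_by_side)
```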

The syntax of the pd.concat() function is as follows −

 pandas.concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=None)

Where,

  • objs: This is a sequence or mapping of Series or DataFrame objects. (The Panel type has been removed from Pandas.)

  • axis: {0 or "index", 1 or "columns"}, default 0. This is the axis to concatenate along.

  • join: {"inner", "outer"}, default "outer". How to handle indexes on other axis(es). Outer for union and inner for intersection.

  • ignore_index: boolean, default False. If True, do not use the index values on the concatenation axis. The resulting axis will be labeled 0, ..., n - 1.

  • keys: Used to create a hierarchical index along the concatenation axis.

  • levels: Specific levels to use for the MultiIndex in the result.

  • names: Names for the levels in the resulting hierarchical index.

  • verify_integrity: If True, checks for duplicate entries in the new axis and raises an error if duplicates are found.

  • sort: boolean, default False. If True, sorts the non-concatenation axis when it is not already aligned across the inputs (for example, when combining DataFrames whose columns differ).

  • copy: default None. If False, do not copy data unnecessarily.
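The effect of the join parameter described above can be sketched as follows. The frames left and right are made-up data that share only the Name column; they are assumptions for illustration.

```python
import pandas as pd

# Two small frames that share only the 'Name' column (illustrative data)
left = pd.DataFrame({'Name': ['Alex', 'Amy'], 'Marks_scored': [98, 90]})
right = pd.DataFrame({'Name': ['Billy', 'Brian'], 'subject_id': ['sub2', 'sub4']})

# join='outer' (the default) keeps the union of the columns;
# cells with no source value are filled with NaN
outer = pd.concat([left, right], join='outer')

# join='inner' keeps only the columns present in every input
inner = pd.concat([left, right], join='inner')

print(outer)
print(inner)
```

With an outer join no column is lost, at the cost of missing values; an inner join avoids the missing values but drops every column that is not common to all inputs.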

The concat() function does all of the heavy lifting of performing concatenation operations along an axis. Let us create different objects and do concatenation.

Example: Concatenating DataFrames

In this example, the two DataFrames are concatenated along rows, with the resulting DataFrame having duplicated indices.

    import pandas as pd

    # Creating two DataFrames
    one = pd.DataFrame({
        'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
        'Marks_scored': [98, 90, 87, 69, 78]},
        index=[1, 2, 3, 4, 5])

    two = pd.DataFrame({
        'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
        'Marks_scored': [89, 80, 79, 97, 88]},
        index=[1, 2, 3, 4, 5])

    # Concatenating DataFrames
    result = pd.concat([one, two])
    print(result)

Its output is as follows −

         Name subject_id  Marks_scored
    1    Alex       sub1            98
    2     Amy       sub2            90
    3   Allen       sub4            87
    4   Alice       sub6            69
    5  Ayoung       sub5            78
    1   Billy       sub2            89
    2   Brian       sub4            80
    3    Bran       sub3            79
    4   Bryce       sub6            97
    5   Betty       sub5            88
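Duplicated index labels like these are allowed by default. If they should be treated as an error instead, the verify_integrity parameter can be used. The following is a minimal sketch with two small, made-up frames a and b; the overlap_detected flag is only there for illustration.

```python
import pandas as pd

# Two frames whose index labels overlap (illustrative data)
a = pd.DataFrame({'Marks_scored': [98, 90]}, index=[1, 2])
b = pd.DataFrame({'Marks_scored': [89, 80]}, index=[1, 2])

try:
    # With verify_integrity=True, overlapping labels raise a ValueError
    pd.concat([a, b], verify_integrity=True)
    overlap_detected = False
except ValueError as err:
    overlap_detected = True
    print('Concatenation rejected:', err)
```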

Example: Concatenating with Keys

If you want to distinguish between the concatenated DataFrames, you can use the keys parameter to associate specific keys with each part of the DataFrame.

    import pandas as pd

    one = pd.DataFrame({
        'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
        'Marks_scored': [98, 90, 87, 69, 78]},
        index=[1, 2, 3, 4, 5])

    two = pd.DataFrame({
        'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
        'Marks_scored': [89, 80, 79, 97, 88]},
        index=[1, 2, 3, 4, 5])

    # Label the rows of each input with its own key
    print(pd.concat([one, two], keys=['x', 'y']))

Its output is as follows −

           Name subject_id  Marks_scored
    x 1    Alex       sub1            98
      2     Amy       sub2            90
      3   Allen       sub4            87
      4   Alice       sub6            69
      5  Ayoung       sub5            78
    y 1   Billy       sub2            89
      2   Brian       sub4            80
      3    Bran       sub3            79
      4   Bryce       sub6            97
      5   Betty       sub5            88

Here, the x and y keys create a hierarchical index, allowing easy identification of which original DataFrame each row came from.
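One practical use of the hierarchical index is selecting rows by key. The sketch below uses shortened, two-row versions of the frames; the level names 'source' and 'row' passed through the names parameter are assumptions chosen for illustration.

```python
import pandas as pd

# Shortened versions of the two frames (illustrative data)
one = pd.DataFrame({'Name': ['Alex', 'Amy'], 'Marks_scored': [98, 90]}, index=[1, 2])
two = pd.DataFrame({'Name': ['Billy', 'Brian'], 'Marks_scored': [89, 80]}, index=[1, 2])

# keys labels each input; names labels the two index levels
combined = pd.concat([one, two], keys=['x', 'y'], names=['source', 'row'])

# Selecting on the outer level returns the rows that came from 'two'
from_two = combined.loc['y']
print(from_two)

# A (key, original label) tuple addresses a single row
single = combined.loc[('x', 2)]
print(single['Name'])
```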

Example: Ignoring Indexes During Concatenation

If the resulting object should discard the original index labels and receive a fresh index running from 0 to n - 1, set ignore_index to True.

    import pandas as pd

    one = pd.DataFrame({
        'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
        'Marks_scored': [98, 90, 87, 69, 78]},
        index=[1, 2, 3, 4, 5])

    two = pd.DataFrame({
        'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
        'Marks_scored': [89, 80, 79, 97, 88]},
        index=[1, 2, 3, 4, 5])

    # keys is passed as well, to show that ignore_index=True overrides it
    print(pd.concat([one, two], keys=['x', 'y'], ignore_index=True))

Its output is as follows −

         Name subject_id  Marks_scored
    0    Alex       sub1            98
    1     Amy       sub2            90
    2   Allen       sub4            87
    3   Alice       sub6            69
    4  Ayoung       sub5            78
    5   Billy       sub2            89
    6   Brian       sub4            80
    7    Bran       sub3            79
    8   Bryce       sub6            97
    9   Betty       sub5            88

Observe that the original index labels are replaced by a fresh 0 to 9 range, and that the keys are discarded as well: ignore_index=True takes precedence over keys.

Example: Concatenating Along Columns

Instead of concatenating along rows, you can concatenate along columns by setting the axis parameter to 1.

    import pandas as pd

    one = pd.DataFrame({
        'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
        'Marks_scored': [98, 90, 87, 69, 78]},
        index=[1, 2, 3, 4, 5])

    two = pd.DataFrame({
        'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
        'Marks_scored': [89, 80, 79, 97, 88]},
        index=[1, 2, 3, 4, 5])

    # axis=1 places the frames side by side, aligned on the index
    print(pd.concat([one, two], axis=1))

Its output is as follows −

         Name subject_id  Marks_scored   Name subject_id  Marks_scored
    1    Alex       sub1            98  Billy       sub2            89
    2     Amy       sub2            90  Brian       sub4            80
    3   Allen       sub4            87   Bran       sub3            79
    4   Alice       sub6            69  Bryce       sub6            97
    5  Ayoung       sub5            78  Betty       sub5            88
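Column-wise concatenation leaves the result with repeated column names, and it aligns rows by index label rather than by position. The sketch below, using shortened frames with deliberately different index labels (an assumption made for illustration), combines axis=1 with keys so that each half stays addressable.

```python
import pandas as pd

# Shortened frames whose indexes only partly overlap (illustrative data)
one = pd.DataFrame({'Name': ['Alex', 'Amy'], 'Marks_scored': [98, 90]}, index=[1, 2])
two = pd.DataFrame({'Name': ['Billy', 'Brian'], 'Marks_scored': [89, 80]}, index=[2, 3])

# With axis=1, keys become the outer level of the column index
wide = pd.concat([one, two], axis=1, keys=['x', 'y'])
print(wide)

# Rows are matched by label; labels missing from one input are filled with NaN
# Each half can be selected by its key
print(wide['y'])
```

Because the default join is "outer", the result has a row for every label found in either input. Passing join='inner' would keep only the labels common to both.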