Pandas Select

Pandas select refers to the process of extracting specific portions of data from a DataFrame.

Data selection involves choosing specific rows and columns based on labels, positions, or conditions.

Pandas provides various methods, such as basic indexing, slicing, boolean indexing, and querying, to efficiently extract, filter, and transform data, enabling users to focus on relevant information for analysis and decision-making.


Select Data Using Indexing and Slicing

In Pandas, we can use square brackets and their labels or positions to select the data we want.

Let's look at an example.

 import pandas as pd # create a DataFrame data = { 'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'], 'Age': [25, 30, 22, 27, 29], 'Salary': [50000, 60000, 45000, 55000, 52000] } df = pd.DataFrame(data) 
# selecting a single column name_column = df['Name']
print("Selecting single column: Name") print(name_column) print()
# selecting multiple columns age_salary_columns = df[['Age', 'Salary']]
print("Selecting multiple columns: Age and Salary") print(age_salary_columns.to_string(index=False)) print()
# selecting rows using slicing selected_rows = df[1:4]
print("Selecting rows 1 to 3") print(selected_rows.to_string(index=False)) print()

Output

 Selecting single column: Name 0 Alice 1 Bob 2 Charlie 3 David 4 Eve Name: Name, dtype: object Selecting multiple columns: Age and Salary Age Salary 25 50000 30 60000 22 45000 27 55000 29 52000 Selecting rows 1 to 3 Name Age Salary Bob 30 60000 Charlie 22 45000 David 27 55000

In the above example, we have created a DataFrame named df that is using dictionary data containing three columns: Name, Age, and Salary. Each column is represented by a list of values.

Then we,

  1. selected a single column Name using df['Name']
  2. selected multiple columns Age and Salary using df[['Age', 'Salary']]
  3. selected rows from 1 to 3 using slicing df[1:4]

Note: The .to_string(index=False) is used to display values without the index.


Using loc and iloc to Select Data

The loc and iloc methods in Pandas are used to access data by using label or integer index.

  1. loc selects rows and columns with specific labels
  2. iloc selects rows and columns at specific index

Let's take a look at an example.

 import pandas as pd data = { 'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'], 'Age': [25, 30, 22, 27, 29], 'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'San Francisco'] } df = pd.DataFrame(data) print(f"Original DataFrame \n {df} \n") 
# loc to select rows and columns by labels # select rows 1 to 3 and columns Name and Age selected_data_loc = df.loc[1:3, ['Name', 'Age']]
print(selected_data_loc.to_string(index = False)) print()
# iloc to select rows and columns by index # select rows 1 to 3 and columns 0 and 2 selected_data_iloc = df.iloc[1:4, [0, 2]]
print(selected_data_iloc.to_string(index = False))

Output

 Original DataFrame Name Age City 0 Alice 25 New York 1 Bob 30 Los Angeles 2 Charlie 22 Chicago 3 David 27 Houston 4 Emily 29 San Francisco Name Age Bob 30 Charlie 22 David 27 Name City Bob Los Angeles Charlie Chicago David Houston

Here,

  • Using df.loc[1:3, ['Name', 'Age']] - selects rows 1 to 3 and columns Name and Age from df
  • Using df.iloc[1:4, [0, 2]] - selects rows 1 to 3 and columns at index positions 0 and 2 from df

Select Rows Based on Specific Criteria

In Pandas, we can use boolean conditions to filter rows based on specific criteria. For example,

 import pandas as pd # creating a DataFrame data = { 'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'], 'Age': [25, 30, 22, 28, 24], 'Gender': ['Female', 'Male', 'Male', 'Male', 'Female'] } df = pd.DataFrame(data) # select rows where Age is greater than 25 selected_rows = df[df['Age'] > 25] print(selected_rows)

Output

  Name Age Gender 1 Bob 30 Male 3 David 28 Male

In this example, we have selected the rows where the age is greater than 25.

The boolean indexing is done using the condition

 df['Age'] > 25

This creates a boolean mask. And when this mask is applied to the DataFrame, it selects only the rows where the condition is True.


query() to Select Data

The query() method in Pandas allows you to select data using a more SQL-like syntax.

Let's take a look at an example.

 import pandas as pd data = { 'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'], 'Age': [25, 30, 22, 28, 35], 'Score': [85, 90, 75, 80, 95] } df = pd.DataFrame(data) 
# select the rows where the age is greater than 25 selected_rows = df.query('Age > 25')
print(selected_rows.to_string(index = False))

Output

  Name Age Score Bob 30 90 David 28 80 Eva 35 95

In this example, the query Age > 25 selects the rows where the Age column's values are greater than 25.


Select Rows Based on a List of Values

Pandas provides us with the method named isin() to filter rows based on a list of values. For example,

 import pandas as pd data = { 'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'], 'Age': [25, 30, 22, 28, 24] } df = pd.DataFrame(data) # create a list of names to select names_to_filter = ['Bob', 'David'] 
# use isin() to select rows based on the 'Name' column selected_rows = df[df['Name'].isin(names_to_filter)]
print(selected_rows.to_string(index = False))

Output

 Name Age Bob 30 David 28

In this example, we want to select only the rows where the name is either Bob or David.

We created a list names_to_filter with the names we want to filter by and then used the isin() method to filter the rows based on the values in the Name column.

Our premium learning platform, created with over a decade of experience and thousands of feedbacks.

Learn and improve your coding skills like never before.

Try Programiz PRO
  • Interactive Courses
  • Certificates
  • AI Help
  • 2000+ Challenges