Boolean Indexing in Pandas

Boolean indexing lets you filter a pandas DataFrame or Series by one or more conditions. You build a boolean mask (one True/False value per row), and pandas keeps the rows where the mask is True. This tutorial walks through single and combined conditions, isin(), negation, and using the filtered result in further calculations.

1. Setup:

First, let's set up the environment and create a sample DataFrame:

    import pandas as pd

    # Sample DataFrame
    data = {
        'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50],
        'C': ['p', 'q', 'r', 's', 't']
    }
    df = pd.DataFrame(data)

    print("Original DataFrame:")
    print(df)

2. Basic Boolean Indexing:

2.1 Single Condition:

Filter rows where values in column 'A' are greater than 3:

    filtered_df = df[df['A'] > 3]

    print("\nFiltered DataFrame (A > 3):")
    print(filtered_df)
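It can help to look at the intermediate object. The comparison on its own returns a boolean Series with one value per row, and indexing with that Series keeps the rows marked True. The sketch below rebuilds the sample DataFrame from the Setup section; the variable name `mask` is only illustrative.

```python
import pandas as pd

# Same sample data as in the Setup section
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': ['p', 'q', 'r', 's', 't'],
})

# The comparison produces a boolean Series, one value per row
mask = df['A'] > 3
print(mask.tolist())  # [False, False, False, True, True]

# Indexing with the mask keeps only the rows marked True
filtered_df = df[mask]
print(filtered_df.index.tolist())  # [3, 4]
```

Storing the mask in a variable makes it easy to inspect or reuse before filtering.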

2.2 Multiple Conditions:

Filter rows where values in column 'A' are greater than 2 and values in column 'B' are less than 40:

    filtered_df = df[(df['A'] > 2) & (df['B'] < 40)]

    print("\nFiltered DataFrame (2 < A and B < 40):")
    print(filtered_df)

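Note: When combining conditions, use the element-wise operators & (and), | (or), and ~ (not), and wrap each condition in parentheses. Python's and, or, and not keywords do not work element-wise on a Series and raise a ValueError. Because & and | bind more tightly than comparison operators such as > and <, leaving out the parentheses changes how the expression is parsed.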
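As a small illustration of the note above, the following sketch combines two conditions with `|` and then shows what happens when Python's `and` keyword is used instead. The DataFrame reuses columns 'A' and 'B' from the Setup section; the variable name `either` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})

# Element-wise OR: rows where A < 2 or B > 40
either = df[(df['A'] < 2) | (df['B'] > 40)]
print(either['A'].tolist())  # [1, 5]

# Python's `and` asks for a single truth value of a whole Series,
# which pandas refuses to guess, so this raises ValueError
try:
    df[(df['A'] > 2) and (df['B'] < 40)]
except ValueError as exc:
    print(f"ValueError: {exc}")
```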

3. Using isin():

If you want to filter data based on a list of values:

    values = ['p', 's']
    filtered_df = df[df['C'].isin(values)]

    print("\nFiltered DataFrame (C in ['p', 's']):")
    print(filtered_df)

4. Using ~ for Negation:

To select rows where column 'C' is NOT in the list of values:

    values = ['p', 's']
    filtered_df = df[~df['C'].isin(values)]

    print("\nFiltered DataFrame (C not in ['p', 's']):")
    print(filtered_df)
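The ~ operator is not specific to isin(); it flips any boolean mask. A minimal sketch, assuming the same sample data, that negates a combined condition (the names `inside` and `outside_df` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': ['p', 'q', 'r', 's', 't'],
})

# Mask for rows that satisfy both conditions
inside = (df['A'] > 2) & (df['B'] < 40)

# ~ flips every value in the mask, so this selects all the other rows
outside_df = df[~inside]
print(outside_df['C'].tolist())  # ['p', 'q', 's', 't']
```

When negating a combined expression inline, wrap the whole expression in parentheses so that ~ applies to the combined result rather than only to the first condition.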

5. Combining Boolean Indexing with Other Operations:

You can combine boolean indexing with other DataFrame operations:

5.1 Count rows that meet a condition:

    count = (df[df['A'] > 2]).shape[0]

    print(f"\nNumber of rows where A > 2: {count}")
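An alternative way to count, sketched here on the same sample columns, is to sum the mask directly: True counts as 1 and False as 0, so no filtered DataFrame has to be built first. The names `mask`, `count`, and `share` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})

mask = df['A'] > 2

# Summing a boolean mask counts the True values
count = int(mask.sum())
print(count)  # 3

# The mean of the mask is the fraction of rows that match
share = mask.mean()
print(share)  # 0.6
```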

5.2 Calculate mean of a column based on a condition:

    mean_val = df[df['A'] > 2]['B'].mean()

    print(f"\nMean of column 'B' where A > 2: {mean_val}")
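The expression above filters rows first and then picks a column in a second step. As a sketch of a common alternative, `.loc` accepts the row mask and the column label together in a single indexing operation, which is also the form to prefer when assigning values to the selection.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': ['p', 'q', 'r', 's', 't'],
})

# Row mask and column label in one .loc call
mean_val = df.loc[df['A'] > 2, 'B'].mean()
print(mean_val)  # 40.0

# A list of labels selects several columns for the matching rows
subset = df.loc[df['A'] > 2, ['A', 'C']]
print(subset.shape)  # (3, 2)
```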

Full Code:

Combining all the steps, you'll get:

    import pandas as pd

    # Sample DataFrame
    data = {
        'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50],
        'C': ['p', 'q', 'r', 's', 't']
    }
    df = pd.DataFrame(data)

    print("Original DataFrame:")
    print(df)

    # Boolean Indexing
    print("\nFiltered DataFrame (A > 3):")
    print(df[df['A'] > 3])

    print("\nFiltered DataFrame (2 < A and B < 40):")
    print(df[(df['A'] > 2) & (df['B'] < 40)])

    print("\nFiltered DataFrame (C in ['p', 's']):")
    print(df[df['C'].isin(['p', 's'])])

    print("\nFiltered DataFrame (C not in ['p', 's']):")
    print(df[~df['C'].isin(['p', 's'])])

    print(f"\nNumber of rows where A > 2: {(df[df['A'] > 2]).shape[0]}")
    print(f"\nMean of column 'B' where A > 2: {df[df['A'] > 2]['B'].mean()}")

This tutorial offers a foundational understanding of boolean indexing in pandas. It's a versatile tool that can be combined with other functions and methods for more complex data manipulations.

Examples

  1. Python Pandas boolean indexing examples:

    import pandas as pd

    # Sample DataFrame
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # Boolean mask for filtering
    mask = df['A'] > 2

    # Apply boolean mask to filter data
    filtered_data = df[mask]

  2. Filtering data with boolean conditions in Pandas:

    import pandas as pd

    # Sample DataFrame
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # Filter data using boolean conditions
    filtered_data = df[df['A'] > 2]

  3. Indexing and selecting data with boolean arrays in Pandas:

    import pandas as pd

    # Sample DataFrame
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # Create a boolean array
    bool_array = [True, False, True, False, True]

    # Select data using boolean array
    selected_data = df[bool_array]

  4. Applying multiple boolean conditions to Pandas DataFrame:

    import pandas as pd

    # Sample DataFrame
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # Multiple boolean conditions
    condition1 = df['A'] > 2
    condition2 = df['B'] == 'X'

    # Combine conditions using logical operators
    combined_condition = condition1 & condition2

    # Apply combined condition to filter data
    filtered_data = df[combined_condition]

  5. Creating boolean masks for advanced data selection in Pandas:

    import pandas as pd

    # Sample DataFrame
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # Create boolean masks
    mask1 = df['A'] > 2
    mask2 = df['B'] == 'X'

    # Combine masks using logical operators
    combined_mask = mask1 & mask2

    # Apply combined mask to filter data
    filtered_data = df[combined_mask]

  6. Combining boolean indexing with other Pandas operations:

    import pandas as pd

    # Sample DataFrame
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # Boolean mask for filtering
    mask = df['A'] > 2

    # Select and perform operations on filtered data
    df.loc[mask, 'B'] = 'Z'

  7. Boolean indexing for missing data handling in Pandas:

    import pandas as pd

    # Sample DataFrame with missing values
    df = pd.DataFrame({'A': [1, 2, 3, None, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # Boolean mask for missing values
    mask = df['A'].isna()

    # Replace missing values based on the boolean mask
    df.loc[mask, 'A'] = 0

  8. Using boolean indexing with categorical data in Pandas:

    import pandas as pd

    # Sample DataFrame with categorical column
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})
    df['B'] = df['B'].astype('category')

    # Boolean mask for categorical values
    mask = df['B'] == 'X'

    # Apply boolean mask to filter data
    filtered_data = df[mask]

  9. Efficient boolean indexing techniques in Pandas:

    import pandas as pd

    # Sample DataFrame
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # Efficient boolean indexing using query method
    filtered_data = df.query('A > 2 and B == "X"')

  10. Pandas boolean indexing vs. traditional indexing:

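    import pandas as pd

    # Sample DataFrame
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # Traditional (position-based) indexing: select rows by where they sit
    traditional_data = df.iloc[2:]

    # Boolean indexing: select rows by a condition on their values
    boolean_data = df[df['A'] > 2]

    # query() expresses the same condition as a string
    query_data = df.query('A > 2')
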
  11. Applying boolean indexing to time-series data in Pandas:

    import pandas as pd
    import datetime

    # Sample DataFrame with time-series data
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']},
                      index=pd.date_range('2022-01-01', periods=5, freq='D'))

    # Boolean mask for time-based filtering
    mask = df.index > datetime.datetime(2022, 1, 3)

    # Apply boolean mask to filter time-series data
    filtered_data = df[mask]

  12. Code examples for effective boolean indexing in Pandas:

    import pandas as pd

    # Sample DataFrame
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # Boolean mask for filtering
    mask = (df['A'] > 2) & (df['B'] == 'X')

    # Apply boolean mask to filter data
    filtered_data = df[mask]
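Example 9 hard-codes its filter values inside the query string. When the values live in Python variables, query() can reference them with an @ prefix. The sketch below uses hypothetical variables `threshold` and `label` and checks that the result matches the explicit mask from example 12.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

# Hypothetical filter values held in ordinary Python variables
threshold = 2
label = 'X'

# @name refers to a local Python variable inside the query string
by_query = df.query('A > @threshold and B == @label')

# Equivalent filter written with explicit boolean masks
by_mask = df[(df['A'] > threshold) & (df['B'] == label)]

print(by_query.index.tolist())   # [2, 4]
print(by_query.equals(by_mask))  # True
```

Both forms select the same rows; which one reads better is largely a matter of taste and of how complex the condition is.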
