Sklearn train_test_split on pandas stratify by multiple columns

The train_test_split function in scikit-learn has no dedicated option for stratifying on several columns of a Pandas DataFrame; its stratify parameter simply takes an array-like of class labels. A straightforward way to stratify on multiple columns is to combine their values into a single key column and pass that column as stratify. Here's how you can do it:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Sample DataFrame: every (feature1, feature2) pair occurs 5 times,
    # so each stratum is large enough to appear in both train and test
    data = {
        'feature1': [1] * 10 + [2] * 10,
        'feature2': (['A'] * 5 + ['B'] * 5) * 2,
        'target': [0, 1] * 10
    }
    df = pd.DataFrame(data)

    # Combine the columns into a single stratification key, e.g. "1_A";
    # the separator keeps different value pairs from producing the same key
    df['combined_stratify'] = df['feature1'].astype(str) + '_' + df['feature2']

    # Split with stratification on the combined key
    train_df, test_df = train_test_split(
        df, test_size=0.2, stratify=df['combined_stratify'], random_state=42
    )

    # Drop the helper column (reassigning avoids pandas' SettingWithCopyWarning)
    train_df = train_df.drop(columns=['combined_stratify'])
    test_df = test_df.drop(columns=['combined_stratify'])

    print("Train Data:")
    print(train_df)
    print("Test Data:")
    print(test_df)

In this example, we first create a sample DataFrame with the columns feature1, feature2, and target. We then add a column, combined_stratify, that joins the values of feature1 and feature2, so that each distinct (feature1, feature2) pair becomes one stratum. This column is passed as stratify to train_test_split, which keeps the share of each pair roughly equal in the train and test sets. After the split, we remove the combined_stratify column.
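
The same idea can be wrapped in a small helper that builds the key on the fly, so the DataFrame never gains a temporary column. This is only a minimal sketch: the function name split_stratified_by, its parameters, and the toy data are hypothetical and not part of scikit-learn, and it assumes every combination of the chosen columns occurs at least twice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_stratified_by(df, columns, test_size=0.2, random_state=None):
    """Split df so each combination of `columns` keeps its share in both parts.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    # Build one string key per row, e.g. "a_0". The separator keeps
    # ("1", "12") and ("11", "2") from collapsing into the same key.
    key = df[columns].astype(str).agg("_".join, axis=1)
    return train_test_split(
        df, test_size=test_size, stratify=key, random_state=random_state
    )


# Toy data: 4 combinations of (group, label), 5 rows each.
df = pd.DataFrame({
    "group": ["a"] * 10 + ["b"] * 10,
    "label": ([0] * 5 + [1] * 5) * 2,
    "value": range(20),
})

train_df, test_df = split_stratified_by(df, ["group", "label"], random_state=0)
print(test_df.groupby(["group", "label"]).size())
```

Because the key is a separate Series, the returned frames contain only the original columns and nothing has to be dropped afterwards.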

Keep in mind that the combined column is only a helper for the split, and the approach is not suitable for every dataset. Stratifying on several columns multiplies the number of groups, so some combinations may contain very few rows. train_test_split raises a ValueError if any stratum has fewer than 2 members, or if the train or test set would be smaller than the number of strata. Check the group sizes before splitting and assess whether this approach is valid for your specific use case.
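
One way to act on this caveat is to count the rows per stratum before splitting and fold the rare strata into a shared bucket. The following is a sketch under assumptions: the threshold min_count, the "other" label, the column names, and the toy data are illustrative choices, not recommendations.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: four well-populated strata and two singletons (east, west).
df = pd.DataFrame({
    "region": ["north"] * 8 + ["south"] * 8 + ["east", "west"],
    "label": [0, 1] * 9,
})
key = df["region"] + "_" + df["label"].astype(str)

# Count rows per stratum; stratified splitting needs at least 2 in each.
counts = key.value_counts()
min_count = 2  # illustrative threshold, tune for your data
rare = counts[counts < min_count].index

# Fold rare strata into one shared bucket so the split does not fail.
safe_key = key.where(~key.isin(rare), "other")
print(safe_key.value_counts())

train_df, test_df = train_test_split(
    df, test_size=0.3, stratify=safe_key, random_state=0
)
```

If the shared bucket itself ends up with a single row, the split still fails; dropping that row or stratifying on fewer columns are possible alternatives.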

Examples

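Stratifying by the feature2 and target columns. scikit-learn also accepts a multi-column DataFrame as stratify and treats each distinct row as one stratum. The sample data has 5 rows for each of the 6 (feature2, target) pairs so that the split can succeed:

    from sklearn.model_selection import train_test_split
    import pandas as pd

    # Sample DataFrame: 30 rows, 5 for each (feature2, target) pair
    data = pd.DataFrame({
        'feature1': range(1, 31),
        'feature2': ['A', 'B', 'C'] * 10,
        'feature3': [i / 10 for i in range(1, 31)],
        'target': [0] * 15 + [1] * 15
    })

    # Split while stratifying by the 'feature2' and 'target' columns
    X_train, X_test, y_train, y_test = train_test_split(
        data[['feature1', 'feature3']], data['target'],
        test_size=0.2, stratify=data[['feature2', 'target']], random_state=42
    )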

Stratifying by three categorical columns. The sample data has 5 rows for each of the 12 (feature2, feature3, target) triples:

    from sklearn.model_selection import train_test_split
    import pandas as pd

    # Sample DataFrame: 60 rows, 5 for each (feature2, feature3, target) triple
    data = pd.DataFrame({
        'feature1': range(1, 61),
        'feature2': ['A', 'B', 'C'] * 20,
        'feature3': ['X', 'X', 'X', 'Y', 'Y', 'Y'] * 10,
        'target': [0] * 30 + [1] * 30
    })

    # Split while stratifying by multiple categorical columns
    X_train, X_test, y_train, y_test = train_test_split(
        data[['feature1']], data['target'],
        test_size=0.2, stratify=data[['feature2', 'feature3', 'target']],
        random_state=42
    )
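
Stratifying by multiple conditions that include a continuous column. feature3 is binned first, because stratifying directly on distinct float values would leave one row per stratum and the split would fail:

    from sklearn.model_selection import train_test_split
    import pandas as pd

    # Sample DataFrame: 60 rows with a continuous column, feature3
    data = pd.DataFrame({
        'feature1': range(1, 61),
        'feature2': ['A', 'B', 'C'] * 20,
        'feature3': [0.1, 0.2, 0.3, 0.7, 0.8, 0.9] * 10,
        'target': [0] * 30 + [1] * 30
    })

    # Bin the continuous column into 'low' and 'high' before stratifying
    data['feature3_bin'] = pd.cut(
        data['feature3'], bins=[0.0, 0.5, 1.0], labels=['low', 'high']
    ).astype(str)

    # Split while stratifying by multiple conditions
    X_train, X_test, y_train, y_test = train_test_split(
        data[['feature1', 'feature3']], data['target'],
        test_size=0.2, stratify=data[['feature2', 'feature3_bin', 'target']],
        random_state=42
    )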
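
A quick sanity check after a stratified split is to compare the share of each stratum in the full data with its share in the train and test parts. This is a minimal sketch with made-up column names and data; the strata are deliberately unequal in size (12, 8, 6 and 4 rows).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with strata of unequal size.
df = pd.DataFrame({
    "segment": ["A"] * 20 + ["B"] * 10,
    "target": [0] * 12 + [1] * 8 + [0] * 6 + [1] * 4,
})
key = df["segment"] + "_" + df["target"].astype(str)

train_df, test_df = train_test_split(
    df, test_size=0.5, stratify=key, random_state=0
)

# Share of each stratum in the full data, the train part and the test part.
shares = pd.DataFrame({
    "full": key.value_counts(normalize=True),
    "train": key[train_df.index].value_counts(normalize=True),
    "test": key[test_df.index].value_counts(normalize=True),
}).sort_index()
print(shares)
```

With real data the three columns rarely match exactly, since row counts have to be rounded, but large gaps suggest that the stratification key was not the one actually passed to the split.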
