Sklearn train_test_split on pandas stratify by multiple columns

The stratify parameter of scikit-learn's train_test_split is documented as taking a single array of class labels, so there is no dedicated option for naming several columns of a Pandas DataFrame to stratify on. A dependable workaround is to combine the values of those columns into a single key column and pass that column as the stratify argument. Here's how you can do it:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Create a sample DataFrame in which every (feature1, feature2)
    # combination occurs at least twice, as stratification requires
    data = {
        'feature1': [1, 1, 1, 1, 2, 2, 2, 2],
        'feature2': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
        'target': [0, 1, 0, 1, 0, 1, 0, 1]
    }
    df = pd.DataFrame(data)

    # Combine multiple columns into a single stratify column
    # (the separator keeps distinct value pairs from colliding)
    df['combined_stratify'] = df['feature1'].astype(str) + '_' + df['feature2']

    # Split the data using train_test_split with stratification;
    # the test set must hold at least one row per combination
    train_df, test_df = train_test_split(df, test_size=0.5, stratify=df['combined_stratify'], random_state=42)

    # Drop the combined column (reassign rather than modify the split results in place)
    train_df = train_df.drop(columns=['combined_stratify'])
    test_df = test_df.drop(columns=['combined_stratify'])

    print("Train Data:")
    print(train_df)
    print("Test Data:")
    print(test_df)

In this example, we first create a sample DataFrame with three columns (feature1, feature2, and target). We then add a column, combined_stratify, that joins the values of feature1 and feature2 so that each distinct pair gets its own label. That column is passed as the stratify argument of train_test_split, and it is dropped from both the train and test sets after the split.
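One way to confirm that the combined column did its job is to compare the share of each combination before and after the split. The sketch below assumes a small made-up DataFrame and a key joined with an underscore; the data and the names key and proportions are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: every (feature1, feature2) pair occurs four times.
df = pd.DataFrame({
    "feature1": [1, 1, 2, 2] * 4,
    "feature2": ["A", "B", "A", "B"] * 4,
    "target": [0, 1] * 8,
})

# One label per row; the "_" separator keeps different pairs from
# producing the same concatenated string.
key = df["feature1"].astype(str) + "_" + df["feature2"]

train_df, test_df = train_test_split(
    df, test_size=0.25, stratify=key, random_state=42
)

# Share of each combination in the full data, the train set, and the test set.
proportions = pd.DataFrame({
    "full": key.value_counts(normalize=True),
    "train": key.loc[train_df.index].value_counts(normalize=True),
    "test": key.loc[test_df.index].value_counts(normalize=True),
}).sort_index()
print(proportions)
```

If the three columns of the printed table are close to each other, the split has kept the joint distribution of the chosen columns.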

Keep in mind that the combined column is a synthetic key, so the split preserves the proportions of the joint combinations of the chosen columns. Stratifying on several columns also multiplies the number of groups: train_test_split raises a ValueError if any combination occurs only once, and the train and test sets must each contain at least as many rows as there are distinct combinations. Check the group sizes in your data before relying on this approach.
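The small-group problem can be checked before splitting by counting the rows per combination. The sketch below does that and then pools rare combinations into a single bucket; the pooling policy, the threshold min_count, and the label "other" are assumptions made for illustration, not something scikit-learn provides.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data in which (3, "A") and (3, "B") each occur only once.
df = pd.DataFrame({
    "feature1": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "feature2": ["A", "A", "A", "A", "B", "B", "B", "B", "A", "B"],
})

key = df["feature1"].astype(str) + "_" + df["feature2"]
counts = key.value_counts()
print(counts)

# Assumed policy: pool every combination with fewer than min_count rows.
min_count = 2
rare = counts[counts < min_count].index
safe_key = key.where(~key.isin(rare), "other")

# The pooled bucket can itself end up too small, so check again.
if safe_key.value_counts().min() < min_count:
    raise ValueError("some strata are still too small to split")

train_df, test_df = train_test_split(
    df, test_size=0.5, stratify=safe_key, random_state=0
)
```

Pooling trades precision for robustness: rows in the pooled bucket are no longer balanced by their individual combination, only as a group.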

Examples

    from sklearn.model_selection import train_test_split
    import pandas as pd

    # Sample DataFrame with features and target; each ('feature2', 'target')
    # pair occurs twice so that every stratum can be split
    data = pd.DataFrame({
        'feature1': [1, 2, 3, 4, 5, 6, 7, 8],
        'feature2': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
        'feature3': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
        'target': [0, 0, 0, 0, 1, 1, 1, 1]
    })

    # Combine 'feature2' and 'target' into one stratification key
    strata = data['feature2'] + '_' + data['target'].astype(str)

    # Splitting data while stratifying by 'feature2' and 'target' columns
    X_train, X_test, y_train, y_test = train_test_split(data[['feature1', 'feature3']], data['target'], test_size=0.5, stratify=strata, random_state=42)

    from sklearn.model_selection import train_test_split
    import pandas as pd

    # Sample DataFrame with categorical features and target; each
    # ('feature2', 'feature3', 'target') combination occurs twice
    data = pd.DataFrame({
        'feature1': [1, 2, 3, 4, 5, 6, 7, 8],
        'feature2': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
        'feature3': ['X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y'],
        'target': [0, 0, 1, 1, 1, 1, 0, 0]
    })

    # Combine the categorical columns and the target into one key
    strata = data['feature2'] + '_' + data['feature3'] + '_' + data['target'].astype(str)

    # Splitting data while stratifying by multiple categorical columns
    X_train, X_test, y_train, y_test = train_test_split(data[['feature1']], data['target'], test_size=0.5, stratify=strata, random_state=42)

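    from sklearn.model_selection import train_test_split
    import pandas as pd

    # Sample DataFrame with features and target
    data = pd.DataFrame({
        'feature1': [1, 2, 3, 4, 5, 6, 7, 8],
        'feature2': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
        'feature3': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
        'target': [0, 0, 1, 1, 1, 1, 0, 0]
    })

    # A continuous column has (almost) only unique values, so bin it
    # before using it for stratification
    feature3_bin = pd.cut(data['feature3'], bins=[0, 0.4, 0.8], labels=['low', 'high']).astype(str)
    strata = data['feature2'] + '_' + feature3_bin + '_' + data['target'].astype(str)

    # Splitting data while stratifying by multiple conditions
    X_train, X_test, y_train, y_test = train_test_split(data[['feature1', 'feature3']], data['target'], test_size=0.5, stratify=strata, random_state=42)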
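The examples above all follow the same pattern, so it can be wrapped in a small function. The sketch below is one possible way to do that; split_by_columns is a hypothetical helper name, and it uses pandas groupby(...).ngroup() to number each distinct combination instead of concatenating strings, which avoids separator collisions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_by_columns(df, columns, test_size=0.25, random_state=None):
    """Hypothetical helper: stratify on the joint values of `columns`."""
    # ngroup() assigns one integer id to every distinct combination of values.
    strata = df.groupby(list(columns), sort=False).ngroup()
    return train_test_split(
        df, test_size=test_size, stratify=strata, random_state=random_state
    )


# Made-up data: each (feature2, target) pair occurs exactly twice.
data = pd.DataFrame({
    "feature1": range(12),
    "feature2": ["A", "B", "C"] * 4,
    "target": [0, 0, 0, 1, 1, 1] * 2,
})

train_df, test_df = split_by_columns(
    data, ["feature2", "target"], test_size=0.5, random_state=42
)
print(test_df.groupby(["feature2", "target"]).size())
```

Because the helper returns whole DataFrames, features and target can be selected from train_df and test_df afterwards.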
