ML | Kaggle Breast Cancer Wisconsin Diagnosis using Logistic Regression

The Breast Cancer Wisconsin (Diagnostic) dataset is a classic dataset available on Kaggle and in the UCI Machine Learning Repository. It contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. These features describe the characteristics of the cell nuclei present in the image.

The goal is to predict whether a tumor is malignant (cancerous) or benign (not cancerous) based on these features.

Let's walk through the steps of building a logistic regression model using this dataset:

1. Load the Dataset

    import pandas as pd

    # Load dataset from Kaggle or local path
    data = pd.read_csv('breast_cancer_wisconsin_data.csv')
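If the CSV is not at hand, scikit-learn bundles a copy of the same dataset. The sketch below is an alternative to the CSV route, not part of the original walkthrough: it loads the data with `sklearn.datasets.load_breast_cancer`. Two caveats: this copy has no `id` or `diagnosis` columns and uses different column names, and scikit-learn encodes the target as 0 = malignant, 1 = benign, so the sketch flips it to match the 1 = malignant convention used in the preprocessing step. The names `X_builtin` and `y_builtin` are illustrative.

```python
from sklearn.datasets import load_breast_cancer

# Load scikit-learn's bundled copy as a DataFrame (features) and a Series (target)
bunch = load_breast_cancer(as_frame=True)
X_builtin = bunch.data

# scikit-learn uses 0 = malignant, 1 = benign; flip so that 1 = malignant
y_builtin = 1 - bunch.target

print(X_builtin.shape)            # rows x feature columns
print(y_builtin.value_counts())   # class balance after flipping
```

With this route, the drop and mapping lines in the preprocessing step are not needed, since features and target are already separated.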

2. Data Preprocessing

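    # Drop the id column as it's not a feature
    data = data.drop('id', axis=1)

    # Drop any entirely empty column (the Kaggle CSV is known to include
    # an all-NaN 'Unnamed: 32' column, which would break model fitting)
    data = data.dropna(axis=1, how='all')

    # Convert diagnosis column to binary: M (Malignant) to 1 and B (Benign) to 0
    data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})

    # Split dataset into features (X) and target variable (y)
    X = data.drop('diagnosis', axis=1)
    y = data['diagnosis']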
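One detail of the mapping step is worth checking: `Series.map` with a dictionary turns any label that is not a key into NaN rather than raising an error. The sketch below shows this on a small made-up frame (the `toy` data and the stray lowercase `'m'` are hypothetical, chosen only to trigger the effect), along with a quick class-balance count.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real CSV; 'm' is a deliberate bad label
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "m"]})
mapped = toy["diagnosis"].map({"M": 1, "B": 0})

# Labels missing from the mapping silently become NaN
n_unmapped = int(mapped.isna().sum())

# Class balance among the successfully mapped rows
class_counts = mapped.value_counts().to_dict()

print("unmapped labels:", n_unmapped)
print("class counts:", class_counts)
```

Running the same two checks on `data['diagnosis']` after mapping is a cheap way to confirm the target column is clean before training.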

3. Splitting the Dataset

    from sklearn.model_selection import train_test_split

    # Hold out 20% of the rows for testing; fix the seed for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
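Because the two classes are not equally frequent, a plain random split can leave the test set with a slightly different malignant/benign ratio than the training set. As an optional variation (not what the walkthrough above does), `train_test_split` accepts a `stratify` argument that preserves the class ratio. The sketch uses made-up labels and placeholder features to make the effect easy to see.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 60 benign (0) and 40 malignant (1) samples
labels = [0] * 60 + [1] * 40
features = [[i] for i in range(100)]  # placeholder single-feature rows

# stratify=labels keeps the 60/40 ratio in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

test_counts = Counter(y_te)
print(test_counts)
```

In the walkthrough's own split, the equivalent change would be passing `stratify=y`.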

4. Scaling the Features

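Logistic regression is sensitive to the scale of the features: scikit-learn's LogisticRegression applies L2 regularization by default, which penalizes coefficients on large-valued features differently from those on small-valued ones, and the solver converges more reliably on standardized inputs. The scaler is fitted on the training set only and then applied to the test set, so no information from the test data leaks into training: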

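    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()

    # Learn mean and standard deviation from the training data, then scale it
    X_train = scaler.fit_transform(X_train)

    # Reuse the training statistics on the test data
    X_test = scaler.transform(X_test)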
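To make the scaling step concrete, here is a hand-rolled sketch of what standardization does to a single feature column: subtract the training mean and divide by the training standard deviation (`StandardScaler` uses the population standard deviation). The numbers are made up for illustration and are not taken from the dataset.

```python
from statistics import fmean, pstdev

# Hypothetical values of one feature in the training and test sets
train_col = [10.0, 12.0, 14.0, 16.0, 18.0]
test_col = [11.0, 20.0]

# "fit": statistics come from the training column only
mu = fmean(train_col)
sigma = pstdev(train_col)  # population standard deviation

# "transform": the same mu and sigma are applied to both columns
train_scaled = [(v - mu) / sigma for v in train_col]
test_scaled = [(v - mu) / sigma for v in test_col]

print(train_scaled)
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1; the test column generally does not, which is expected because it was scaled with the training statistics.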

5. Building the Logistic Regression Model

    from sklearn.linear_model import LogisticRegression

    # Initialize the model
    clf = LogisticRegression()

    # Train the model
    clf.fit(X_train, y_train)
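As a rough picture of what the fitted model computes at prediction time: it forms a weighted sum of the (scaled) features plus an intercept, passes that through the sigmoid function to get a probability for class 1 (malignant), and thresholds at 0.5. The sketch below is a minimal pure-Python illustration; the weights, bias, and sample are hypothetical and not learned from the dataset.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of class 1 for one sample under a logistic model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

# Hypothetical two-feature model and sample (illustrative values only)
weights = [1.5, -0.5]
bias = -0.2
sample = [1.0, 0.4]

p = predict_proba(weights, bias, sample)
label = int(p >= 0.5)  # 1 = malignant, 0 = benign

print(p, label)
```

In scikit-learn, the learned counterparts of `weights` and `bias` are exposed as `clf.coef_` and `clf.intercept_`, and `clf.predict_proba` returns the class probabilities.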

6. Model Evaluation

    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

    # Predictions
    y_pred = clf.predict(X_test)

    # Accuracy
    print("Accuracy:", accuracy_score(y_test, y_pred))

    # Classification Report
    print(classification_report(y_test, y_pred))

    # Confusion Matrix
    print(confusion_matrix(y_test, y_pred))
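To read the confusion matrix: for binary labels 0 and 1, `confusion_matrix` puts true classes in rows and predicted classes in columns, giving `[[TN, FP], [FN, TP]]`. The sketch below derives accuracy, precision, and recall for the malignant class from such a matrix. The counts are invented for illustration and are not results from this model.

```python
# Hypothetical counts in confusion_matrix layout: [[TN, FP], [FN, TP]]
cm = [[50, 5],
      [10, 35]]
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # of predicted malignant, how many truly are
recall = tp / (tp + fn)        # of truly malignant, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

For a diagnostic task, recall on the malignant class deserves particular attention, since a false negative (a missed malignant tumor) is usually costlier than a false positive.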

Conclusion

This is a basic walkthrough of building a logistic regression model for the Breast Cancer Wisconsin (Diagnostic) dataset. In real-world applications, you may want to consider additional steps like feature selection, hyperparameter tuning, and cross-validation to improve and validate the model's performance.
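As one possible way to combine cross-validation with hyperparameter tuning, the sketch below wraps scaling and the model in a pipeline and searches over the regularization strength `C` with 5-fold cross-validation. It is an illustration under assumptions, not a tuned recipe: it uses scikit-learn's bundled copy of the dataset instead of the CSV, and the `C` grid and `max_iter=1000` are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the dataset (note: here 0 = malignant, 1 = benign)
X_all, y_all = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Candidate regularization strengths (arbitrary illustrative grid)
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_all, y_all)

print("Best C:", search.best_params_["logisticregression__C"])
print("Mean CV accuracy:", search.best_score_)
```

Keeping the scaler inside the pipeline avoids leaking statistics from validation folds into training, which is the same concern that motivates fitting the scaler on the training split only.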

