ONLINE PAYMENT FRAUD DETECTION

Team Members:
▪ JAYASHREE S (23031717610422088)
▪ KABILNITHI R (2303717610421089)
▪ KANISHKA N (2303717610422090)
▪ KARTHI S (2303171610421091)

Department of Computer Science and Engineering
Guided by Dr. S. Nithya, M.E., Ph.D., Department of Computer Science and Engineering
AGENDA
• Introduction
• Problem Statement
• Literature Survey
• Proposed Work / Proposed Methodology
• Work Flow
• Project Modules
• 50% Project Implementation (Source Code)
• References
INTRODUCTION

Background
With the rapid growth of online transactions and digital payments, fraudulent activities have become increasingly common and sophisticated. Financial institutions and e-commerce platforms are especially vulnerable to fraud, which can result in significant financial losses and reputational damage. As a result, there is a pressing need for robust systems that can detect and prevent fraud in real time.

Project Domain
This project falls under the domain of Data Analysis and Machine Learning. The core objective is to build a fraud detection system using a pre-trained machine learning model (specifically, an XGBoost classifier) integrated into a Flask web application. Users input transaction-related information via a web form, and the backend model predicts whether the transaction is likely to be fraudulent.

Applications of the Project
• Banking and Financial Services: flagging suspicious credit card transactions before they are completed.
• E-commerce Platforms: real-time fraud checks during payment and order processing.
• Insurance Sector: detecting fraudulent insurance claims.
• Telecommunications: identifying abnormal calling patterns or identity theft.
• Government and Law Enforcement: monitoring suspicious activities in public benefit schemes and tax systems.
PROBLEM STATEMENT

• Detecting Fraudulent Transactions
The primary problem is to accurately identify whether a given financial transaction is fraudulent or legitimate based on user-provided and system-generated features. Fraudulent transactions often mimic legitimate ones, making detection a complex task that requires advanced machine learning techniques.

• Integrating Machine Learning with a User Interface
Another challenge is to effectively integrate the pre-trained machine learning model with a web application so that end users can easily input data and receive real-time predictions. This includes ensuring the model receives inputs in the right format and that the interface is user-friendly and responsive.

• Providing Explainable and Interpretable Results
It is important not just to make predictions, but also to present them in a way that is understandable. Users should be able to see not only whether a transaction is flagged as fraudulent but also the probability or confidence level of the prediction.
PROBLEM STATEMENT

• Handling Unexpected Inputs or Errors Gracefully
The system should be robust enough to handle invalid or incomplete inputs, backend errors, or model prediction failures without crashing or producing misleading results. A clear error-handling mechanism should be in place to maintain usability.

• Analyzing Dataset Trends and Feature Impact
Before model training and deployment, a thorough data analysis was conducted. This included:
- Identifying class imbalance in the dataset (few fraudulent vs. many legitimate transactions).
- Investigating key features that strongly correlate with fraudulent behavior (e.g., transaction amount, time patterns, geographic anomalies).
- Visualizing patterns through histograms, correlation heatmaps, and outlier detection.
- Understanding which features contributed the most to the model's decision-making, using feature importance scores from the XGBoost model (see the sketch below).
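The feature-importance scores mentioned above can be read back directly from the trained classifier. A minimal sketch, assuming the model and feature list saved by the training script shown later in this deck; the top-10 cut-off is illustrative.

import joblib
import pandas as pd

# Load the artifacts persisted by Model_training.py.
model = joblib.load("models/xgb_fraud_model.pkl")
features = joblib.load("models/features_list.pkl")

# Pair each importance score with its feature name and show the strongest ones.
importances = pd.Series(model.feature_importances_, index=features)
print(importances.sort_values(ascending=False).head(10))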
LITERATURE SURVEY
OBJECTIVES OF THE PROPOSED SYSTEM

• Develop an Accurate Fraud Detection Model
Train and fine-tune a machine learning model (XGBoost) capable of accurately identifying fraudulent transactions based on various input features.

• Perform Exploratory Data Analysis (EDA)
Analyze the transaction dataset to understand patterns, detect anomalies, and identify features that significantly contribute to fraud detection. This includes visualizing data distributions, correlations, and class imbalances.

• Design a User-Friendly Web Interface
Build a Flask-based web application where users can enter transaction details and receive instant fraud prediction feedback.

• Automate Feature Generation
Implement a system to auto-fill or simulate certain features that are not user-entered but are required by the model, ensuring complete input for reliable prediction.
OBJECTIVES OF THE PROPOSED SYSTEM

• Integrate Prediction with Probability Output
Display not only the predicted classification (fraudulent or legitimate) but also the associated probability/confidence level, to enhance transparency and user trust.

• Ensure Robust Error Handling
Implement error-catching mechanisms to gracefully handle invalid inputs, model issues, or system errors, thereby maintaining a smooth and reliable user experience.
PROPOSED METHODOLOGY

1. Data Collection and Preprocessing
• Dataset Source: The dataset used contains historical transaction records, including both legitimate and fraudulent entries.
• Cleaning: Missing values, duplicate records, and irrelevant features are removed to maintain data quality.
• Feature Engineering: New features may be derived from existing ones (e.g., time-related features, transaction frequency).
• Handling Imbalanced Data: Since fraud datasets are typically imbalanced, techniques such as SMOTE (Synthetic Minority Over-sampling Technique), undersampling the majority class, or class weights may be applied to improve model performance (see the sketch after this list).

2. Exploratory Data Analysis (EDA)
• Statistical Summary: Analyze distributions, averages, and ranges of key features.
• Visualization: Histograms, boxplots, and heatmaps are used to understand the relationships between features and to detect outliers.
• Correlation Analysis: Identify strong correlations between features and the fraud label.
• Fraud Pattern Identification: Look for trends or behavioral differences in fraudulent vs. legitimate transactions.
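A minimal, self-contained sketch of two of the imbalance-handling options named above, SMOTE and class weights. The synthetic data and the 98/2 class split are stand-ins for the real transaction dataset, and the imbalanced-learn package is assumed to be installed.

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Synthetic stand-in for the real transaction features: ~2% fraud.
X_train, y_train = make_classification(
    n_samples=10_000, n_features=10, weights=[0.98, 0.02], random_state=42
)

# Option 1: oversample the minority (fraud) class with SMOTE.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("fraud rows before:", y_train.sum(), "after:", y_res.sum())

# Option 2: keep the data as-is and weight the positive class instead;
# scale_pos_weight is conventionally set to (negative count / positive count).
ratio = (y_train == 0).sum() / (y_train == 1).sum()
model = XGBClassifier(scale_pos_weight=ratio, random_state=42).fit(X_train, y_train)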
PROPOSED METHODOLOGY

3. Model Building
• Model Selection: The XGBoost classifier is chosen for its efficiency, accuracy, and ability to handle imbalanced datasets effectively.
• Model Training: The cleaned and balanced dataset is split into training and testing sets. The model is trained on the training data and validated using cross-validation techniques.
• Hyperparameter Tuning: Techniques such as Grid Search or Random Search are used to find the best set of parameters for the XGBoost model (see the sketch below).

4. Model Evaluation
• Metrics Used: Accuracy; Precision, Recall, F1-Score; ROC-AUC Score.
• Special focus is placed on recall and AUC, as identifying true frauds is more critical than overall accuracy alone.
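A sketch of how the Grid Search step could look with scikit-learn's GridSearchCV, scoring on ROC-AUC as motivated above. The parameter grid and the synthetic data are illustrative, not the project's actual search space.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, roc_auc_score
from xgboost import XGBClassifier

# Synthetic stand-in for the real transaction features.
X, y = make_classification(n_samples=5_000, n_features=10,
                           weights=[0.97, 0.03], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

grid = GridSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=42),
    param_grid={"n_estimators": [100, 300],
                "max_depth": [3, 6],
                "learning_rate": [0.05, 0.1]},
    scoring="roc_auc",  # AUC matters more than raw accuracy for fraud
    cv=3,
    n_jobs=-1,
)
grid.fit(X_tr, y_tr)

best = grid.best_estimator_
print("Best params:", grid.best_params_)
print(classification_report(y_te, best.predict(X_te)))
print("ROC AUC:", roc_auc_score(y_te, best.predict_proba(X_te)[:, 1]))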
PROPOSED METHODOLOGY

5. Model Serialization
• Once the model is finalized, it is saved using joblib so that it can be loaded into the web application without retraining.

6. Flask Web Application
• Frontend: A simple HTML form is created where users can enter transaction details.
• Backend (Flask): Accepts form input via POST requests; uses helper functions to extract user data, generate system features (get_random_row_features), and encode inputs (encode_inputs); sends the processed input to the model for prediction; returns the result and probability, rendered on the same form page (sketched below).
• Error Handling: The application catches and displays any errors in a user-friendly way without crashing.

7. Result Interpretation
• Classification Output: The result shows whether a transaction is "Fraudulent" or "Legitimate".
• Confidence Score: Also displays the probability (%) that the transaction is fraudulent, giving users a better sense of certainty.
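A minimal sketch of the Flask wiring described in point 6. Only the helper names get_random_row_features and encode_inputs come from these slides; their bodies here, the template name form.html, and the route layout are assumptions.

import joblib
import pandas as pd
from flask import Flask, render_template, request

app = Flask(__name__)
model = joblib.load("models/xgb_fraud_model.pkl")
features = joblib.load("models/features_list.pkl")

def get_random_row_features():
    # Hypothetical stand-in: the real helper simulates the system-generated
    # feature values; only its name is given on the slides.
    return {f: 0 for f in features}

def encode_inputs(form):
    # Hypothetical stand-in: the real helper encodes the user's form fields
    # into the numeric representation the model expects.
    return {k: float(v) for k, v in form.items() if k in features}

@app.route("/", methods=["GET", "POST"])
def index():
    result, prob, error = None, None, None
    if request.method == "POST":
        try:
            row = get_random_row_features()           # system-generated values
            row.update(encode_inputs(request.form))   # user-entered values
            X = pd.DataFrame([row], columns=features)
            pred = int(model.predict(X)[0])
            prob = round(float(model.predict_proba(X)[0, 1]) * 100, 2)
            result = "Fraudulent" if pred == 1 else "Legitimate"
        except Exception as exc:                      # graceful error handling
            error = f"Prediction failed: {exc}"
    return render_template("form.html", result=result, prob=prob, error=error)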
WORKFLOW
MODULES
Completion status: completed
IMPLEMENTATION
dataanalysis.py

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the cleaned transaction dataset.
file_path = "e:/pyproject/data/processed/cleaned_transactions.csv"
df = pd.read_csv(file_path)

print("\nDataset Overview:")
print(df.info())
print("\nSummary Statistics:\n", df.describe())

# Class balance: fraud vs. non-fraud counts.
plt.figure(figsize=(6, 4))
sns.countplot(x="Fraud_Label", data=df, hue="Fraud_Label", palette="Set1", legend=False)
plt.title("Fraud vs Non-Fraud Transactions")
plt.xlabel("Fraud (1) vs Non-Fraud (0)")
plt.ylabel("Count")
plt.show()
IMPLEMENTATION

# Share of fraudulent transactions in the dataset.
fraud_percentage = df["Fraud_Label"].mean() * 100
print(f"\nFraud Cases: {fraud_percentage:.2f}% of total transactions.")

# Distribution of transaction amounts.
plt.figure(figsize=(8, 5))
sns.histplot(df["Transaction_Amount"], bins=50, kde=True)
plt.title("Transaction Amount Distribution")
plt.xlabel("Transaction Amount")
plt.ylabel("Frequency")
plt.show()

# Distribution of risk scores.
plt.figure(figsize=(8, 5))
sns.histplot(df["Risk_Score"], bins=50, kde=True, color='red')
plt.title("Risk Score Distribution")
plt.xlabel("Risk Score")
plt.ylabel("Frequency")
plt.show()
IMPLEMENTATION

plt.figure(figsize=(8, 5))
sns.boxplot(x="Fraud_Label", y="Transaction_Amount", data=df, hue="Fraud_Label", palette="coolwarm", legend=False)
plt.title("Transaction Amount by Fraud Status")
plt.xlabel("Fraud (1) vs Non-Fraud (0)")
plt.ylabel("Transaction Amount")
plt.show()

plt.figure(figsize=(8, 5))
sns.boxplot(x="Fraud_Label", y="Risk_Score", data=df, hue="Fraud_Label", palette="coolwarm", legend=False)
plt.title("Risk Score by Fraud Status")
plt.xlabel("Fraud (1) vs Non-Fraud (0)")
plt.ylabel("Risk Score")
plt.show()

plt.figure(figsize=(10, 5))
sns.countplot(x="Device_Type", hue="Fraud_Label", data=df, palette="Set1")
plt.xticks(rotation=45)
plt.title("Fraud Distribution Across Device Types")
plt.xlabel("Device Type")
plt.ylabel("Count")
plt.show()
IMPLEMENTATION

numeric_cols = df.select_dtypes(include=['number'])
plt.figure(figsize=(15, 8))
sns.heatmap(numeric_cols.corr(), annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)
plt.title("Feature Correlation Heatmap")
plt.show()

selected_features = ["Transaction_Amount", "Risk_Score", "Daily_Transaction_Count", "Fraud_Label"]
sns.pairplot(df[selected_features], hue="Fraud_Label", diag_kind="kde", palette="coolwarm")
plt.show()
IMPLEMENTATION
Model_training.py

import os
import joblib
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from xgboost import XGBClassifier
from src.data_preprocessing import load_data, preprocess_data

os.makedirs("models", exist_ok=True)
os.makedirs("outputs", exist_ok=True)

# Load the cleaned data; Risk_Score is excluded from the model inputs.
df = load_data("data/processed/cleaned_transactions.csv")
df = df.drop(columns=["Risk_Score"], errors="ignore")
X_train, X_test, y_train, y_test, label_col = preprocess_data(df, mode="train")

model = xgb.XGBClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Persist the trained model and the exact feature order it expects.
joblib.dump(model, "models/xgb_fraud_model.pkl")
joblib.dump(X_train.columns.tolist(), "models/features_list.pkl")
print("Model and feature list saved to 'models/' directory")

y_pred = model.predict(X_test)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
roc_score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
IMPLEMENTATION

print(f"\nROC AUC Score: {roc_score:.4f}")

plt.figure(figsize=(6, 4))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.tight_layout()
plt.savefig("outputs/conf_matrix.png")
print("Confusion matrix saved to outputs/conf_matrix.png")

# Alternative training entry point that tracks train/test logloss per round.
def train_xgboost(X_train, y_train, X_test, y_test):
    model = XGBClassifier(eval_metric='logloss')
    eval_set = [(X_train, y_train), (X_test, y_test)]
    model.fit(X_train, y_train, eval_set=eval_set, verbose=True)
    model_path = "models/xgb_fraud_model.pkl"
    joblib.dump(model, model_path)
    print(f"Model saved to {model_path}")
    return model
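Model_training.py imports load_data and preprocess_data from src/data_preprocessing, which these slides do not show. A plausible minimal sketch that matches only the call signature used above (a five-value return in train mode); the real module may differ.

import pandas as pd
from sklearn.model_selection import train_test_split

def load_data(path):
    # Read the cleaned CSV produced by the preprocessing stage.
    return pd.read_csv(path)

def preprocess_data(df, mode="train", label_col="Fraud_Label"):
    # Basic cleaning; the real module may do more (or different) work.
    df = df.dropna().drop_duplicates()
    # One-hot encode categoricals so XGBoost receives purely numeric input.
    X = pd.get_dummies(df.drop(columns=[label_col]))
    y = df[label_col]
    if mode != "train":
        return X, y, label_col
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    return X_train, X_test, y_train, y_test, label_col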
IMPLEMENTATION
Evaluation.py

import os
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import learning_curve

def evaluate_model(model, X_train, y_train, X_test, y_test):
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    print("\nClassification Report:\n", classification_report(y_test, y_pred))
    cm = confusion_matrix(y_test, y_pred)
    print("\nConfusion Matrix:\n", cm)
    roc_score = roc_auc_score(y_test, y_proba)
    print(f"\nROC AUC Score: {roc_score:.4f}")
    plot_learning_curve(model, X_train, y_train)
    return y_pred

def plot_confusion_matrix(y_test, y_pred, output_path="outputs/conf_matrix.png"):
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title("Confusion Matrix")
    plt.xlabel("Predicted")
IMPLEMENTATION

    plt.ylabel("Actual")
    plt.tight_layout()
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    plt.savefig(output_path)
    print(f"Confusion matrix saved to {output_path}")

def plot_learning_curve(model, X_train, y_train):
    train_sizes, train_scores, test_scores = learning_curve(
        model, X_train, y_train, cv=5, n_jobs=-1,
        train_sizes=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    )
    train_mean = train_scores.mean(axis=1)
    test_mean = test_scores.mean(axis=1)
    plt.figure(figsize=(8, 6))
    plt.plot(train_sizes, train_mean, label="Training score", color="blue")
    plt.plot(train_sizes, test_mean, label="Cross-validation score", color="green")
    plt.xlabel("Training Size")
    plt.ylabel("Score")
    plt.title("Learning Curve")
    plt.legend(loc="best")
    plt.grid(True)
    plt.tight_layout()
    os.makedirs("outputs", exist_ok=True)
    learning_curve_path = "outputs/learning_curve.png"
    plt.savefig(learning_curve_path)
    print(f"Learning curve saved to {learning_curve_path}")
    plt.show()
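A short usage example for Evaluation.py, assuming it runs in the same module (or that evaluate_model and plot_confusion_matrix are imported) and that Model_training.py has already saved its artifacts at the paths shown earlier.

import joblib
from src.data_preprocessing import load_data, preprocess_data

# Rebuild the same split used during training.
df = load_data("data/processed/cleaned_transactions.csv")
df = df.drop(columns=["Risk_Score"], errors="ignore")
X_train, X_test, y_train, y_test, label_col = preprocess_data(df, mode="train")

# Load the persisted model and run the full evaluation pipeline.
model = joblib.load("models/xgb_fraud_model.pkl")
y_pred = evaluate_model(model, X_train, y_train, X_test, y_test)
plot_confusion_matrix(y_test, y_pred)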
REFERENCES
1. Data Science and Machine Learning Using Python, Dr. Reema Thareja
2. Data Analysis Using Python, Dr. Dante
