Synopsis ML Projectpdf

The project report focuses on developing an online payment fraud detection system using machine learning techniques to analyze transaction data and identify fraudulent activities. It details the methodology, including data collection, preprocessing, model building, and evaluation, with a final model achieving a prediction accuracy of 99.92% using a Decision Tree algorithm. The report emphasizes the need for robust detection mechanisms to combat the increasing threat of online payment fraud in e-commerce.

Uploaded by

17guptam

A PROJECT REPORT

ON
“Online Payment Fraud Detection”

SESSION: 2024
SCHOOL OF COMPUTER SCIENCE & ENGINEERING
GREATER NOIDA, UTTAR PRADESH, INDIA

Submitted To:
DR. PRASHANT JOHRI

Submitted By:
Divy Anant Varshney - (23SCSE2030392) (MCA sec-04)

Yadvendra Singh Rathaur - (23SCSE2030453) (MCA sec-04)

Monika Gupta - (23SCSE2030438) (MCA sec-04)


TABLE OF CONTENTS

● Literature Review

● The Problem Statement

● Exploring Data

● Statistics

● Proposed System

● Flowchart

● Methodology

● Result

● References
Literature Review

Online transactions are an easy target for fraud. E-commerce and
other online sites have increased the number of online payment
methods, raising the risk of online fraud. As fraud rates rise,
machine learning approaches can be used to identify and evaluate
fraud in online transactions. The primary goal of this project is to
implement supervised machine learning models for fraud detection by
analyzing prior transaction information. Transactions are classified
into distinct groups based on the type of transaction. Various
classifiers are then trained independently, and the models are
assessed for accuracy. The classifier with the highest score can then
be picked as one of the best approaches for predicting fraud. We
worked with the Kaggle Synthetic Financial Datasets for Fraud
Detection dataset collected by Edgar Lopez-Rojas. In this project, K
Nearest Neighbor, Logistic Regression, Support Vector Machine
(SVM), Decision Tree, and Random Forest machine learning models
are implemented for the detection of fraudulent transactions. A
comparative analysis of these algorithms is performed to identify an
optimal solution.
The Problem Statement
Online payment fraud poses a persistent challenge in e-commerce,
threatening financial losses and eroding consumer trust. The evolving tactics
of fraudsters necessitate advanced detection mechanisms to safeguard
online transactions. Despite existing preventive measures, the detection and
mitigation of fraudulent activities remain a significant concern. The problem
statement underscores the imperative for a robust and adaptive fraud
detection system capable of identifying suspicious transactions in real-time,
thereby minimizing risks for merchants and consumers alike. Addressing
this challenge requires a comprehensive understanding of fraudulent
patterns and behaviors, coupled with the implementation of sophisticated
data analysis techniques and machine learning algorithms.
Exploring Data
The first step of our project work was determining the right data set.
Many online resources offer a plethora of financial fraud analysis
datasets containing transaction information without personal user
information. We came across many data sets, such as the datahack and
dataworld data sets. We selected the Synthetic Financial Datasets for
Fraud Detection dataset collected by Edgar Lopez-Rojas for our task.
The dataset is generated by PaySim, which simulates mobile money
transactions based on a sample of real transactions extracted from one
month of financial logs from a mobile money service implemented in
an African country. The original logs were provided by a multinational
company, the provider of the mobile financial service, which is
currently running in more than 14 countries all around the world. This
synthetic dataset is scaled down to a quarter of the original dataset
and was created just for Kaggle, from which we obtained it for the
detection of fraudulent online transactions. At present it consists of
6,362,620 records covering 5 different types of transactions, in
11 columns. Of the total, 6,354,407 (99.87%) are legal transactions
and 8,213 (0.13%) are fraudulent transactions, which is expected, as
only a very small percentage of all transactions are fraudulent.
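The class proportions quoted above can be re-derived with a few lines of arithmetic. The sketch below also shows why accuracy alone can mislead on data this imbalanced: a hypothetical baseline that labels every transaction as legal (not a model used in this project) would already reach the legal share as its accuracy while catching no fraud at all.

```python
# Class counts as reported in the text above.
TOTAL = 6_362_620
FRAUD = 8_213
LEGAL = TOTAL - FRAUD

fraud_share = FRAUD / TOTAL
legal_share = LEGAL / TOTAL

# Hypothetical baseline: predict "legal" for every transaction.
# It is correct exactly when the transaction is legal, so its accuracy
# equals the legal share, and its recall on the fraud class is zero.
baseline_accuracy = legal_share
baseline_fraud_recall = 0 / FRAUD

print(f"legal: {LEGAL:,} ({legal_share:.2%})")
print(f"fraud: {FRAUD:,} ({fraud_share:.2%})")
print(f"always-legal baseline accuracy: {baseline_accuracy:.2%}")
```

This is why recall on the fraud class is worth reading alongside any accuracy figure for this dataset.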

The 11 columns of the dataset and what each column represents:

1. step: represents a unit of time where 1 step equals 1 hour

2. type: type of online transaction

3. amount: the amount of the transaction

4. nameOrig: customer starting the transaction


5. oldbalanceOrg: balance before the transaction

6. newbalanceOrig: balance after the transaction

7. nameDest: recipient of the transaction

8. oldbalanceDest: initial balance of recipient before the transaction

9. newbalanceDest: the new balance of recipient after the transaction

10. isFraud: marks whether the transaction is fraudulent

11. isFlaggedFraud: flags a transfer of more than 200,000 in a single
transaction
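As an illustration only, the 11 columns can be written down as a typed record, together with a check that mirrors the isFlaggedFraud description. The helper name `exceeds_flag_threshold`, the type value "TRANSFER", and the sample values are assumptions made for this sketch; the field types are inferred from the column descriptions, not read from the file.

```python
from dataclasses import dataclass

# Threshold taken from the isFlaggedFraud description above.
FLAG_THRESHOLD = 200_000


@dataclass
class Transaction:
    step: int              # unit of time; 1 step equals 1 hour
    type: str              # type of online transaction
    amount: float          # amount of the transaction
    nameOrig: str          # customer starting the transaction
    oldbalanceOrg: float   # sender balance before the transaction
    newbalanceOrig: float  # sender balance after the transaction
    nameDest: str          # recipient of the transaction
    oldbalanceDest: float  # recipient balance before the transaction
    newbalanceDest: float  # recipient balance after the transaction
    isFraud: int           # 1 if fraudulent, else 0
    isFlaggedFraud: int    # 1 if flagged, else 0


def exceeds_flag_threshold(tx: Transaction) -> bool:
    """Hypothetical helper: True for a transfer above the flag threshold."""
    return tx.type == "TRANSFER" and tx.amount > FLAG_THRESHOLD


# Made-up example record, not taken from the dataset.
example = Transaction(1, "TRANSFER", 250_000.0, "C1", 250_000.0, 0.0,
                      "C2", 0.0, 0.0, 1, 1)
print(exceeds_flag_threshold(example))
```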
Statistics

In online payment fraud detection, statistics are instrumental in
analyzing transactional data and identifying fraudulent patterns.
Descriptive statistics, such as mean (μ), median, and standard
deviation (σ), provide a summary of transaction attributes like amount
and frequency. For instance, μ and σ help assess whether a transaction
amount deviates significantly from the norm, indicating potential
fraud. The correlation coefficient (ρ) quantifies relationships between
variables, like transaction amount and time, aiding in anomaly
detection. Hypothesis tests, such as the t-test or z-test, evaluate
the significance of differences in transaction patterns between normal
and fraudulent activities. Moreover, predictive models like logistic
regression estimate the probability of fraud based on historical data,
enhancing fraud detection accuracy. These statistical tools empower
fraud detection systems to combat online payment fraud effectively.
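As a minimal sketch of how μ and σ can be used to judge whether an amount deviates from the norm, the example below computes a z-score, (amount - μ) / σ, for each transaction amount. The amounts are made up, the population standard deviation is an assumption, and the cut-off of 2.0 is an arbitrary illustrative choice rather than a value used in this project.

```python
from statistics import mean, pstdev


def z_scores(amounts):
    """Return (amount - mean) / standard deviation for every amount."""
    mu = mean(amounts)
    sigma = pstdev(amounts)  # population standard deviation (assumption)
    if sigma == 0:
        # All amounts are identical, so nothing deviates from the mean.
        return [0.0 for _ in amounts]
    return [(a - mu) / sigma for a in amounts]


# Hypothetical transaction amounts; the last one is unusually large.
amounts = [120.0, 95.0, 130.0, 110.0, 105.0, 9_800.0]
scores = z_scores(amounts)

THRESHOLD = 2.0  # illustrative cut-off, not a tuned value
outliers = [a for a, z in zip(amounts, scores) if abs(z) > THRESHOLD]
print(outliers)
```

A large z-score only marks an amount as unusual; on its own it does not establish that the transaction is fraudulent.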
Proposed System
The proposed system aims to bolster the security of online
payment systems by employing advanced data analysis techniques
and machine learning algorithms to detect and prevent fraudulent
transactions effectively.

Data Collection:
The system collects comprehensive transactional data from
various sources, including payment gateways, merchants, and
financial institutions. This data encompasses transaction
amounts, timestamps, user demographics, device information,
and transaction histories.
Data Preprocessing:
Upon collection, the raw transactional data undergoes
preprocessing, including data cleaning, normalization, and feature
engineering. Missing values are handled, outliers are identified
and treated, and relevant features are extracted or transformed to
enhance model performance.
Feature Selection:
Feature selection techniques, such as correlation analysis and
feature importance ranking, are employed to identify the most
discriminative features for fraud detection. This step helps reduce
dimensionality and improve model efficiency.
Model Building:
The system utilizes machine learning algorithms, including
supervised and unsupervised techniques, to build robust fraud
detection models. Supervised algorithms such as logistic
regression, decision trees, and ensemble methods learn from
labeled data to classify transactions as either legitimate or
fraudulent. Unsupervised algorithms such as clustering and
anomaly detection identify unusual patterns indicative of
fraudulent activities without the need for labeled data.
Model Training and Evaluation:
The selected models are trained on historical transaction data and
evaluated using appropriate performance metrics such as
accuracy, precision, recall, and F1-score. Cross-validation
techniques ensure the generalizability of the models, while
hyperparameter tuning optimizes their performance.
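The four metrics named above follow directly from the counts of true and false positives and negatives. The sketch below computes them for a small made-up set of labels, with 1 standing for a fraudulent transaction; the function name `classification_metrics` and the labels are illustrative assumptions.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 2 fraud cases among 10 transactions.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case the accuracy is 0.8 even though only one of the two fraud cases is caught, which is why precision and recall are reported alongside it.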
Real-time Monitoring:
The trained models are deployed in a real-time monitoring system
that continuously analyzes incoming transactions for signs of
fraud. Transactions flagged as suspicious trigger immediate alerts
for further investigation by fraud analysts or automated response
mechanisms.
Adaptive Learning:
The system incorporates adaptive learning mechanisms to
continuously update and refine the fraud detection models based
on new data and emerging fraud trends. Feedback loops enable the
system to adapt to evolving fraud tactics and maintain high
detection accuracy over time.
Reporting and Visualization:
Comprehensive reports and visualizations are generated to
provide insights into the effectiveness of the fraud detection
system. Key performance indicators, trends, and patterns are
communicated to stakeholders to support decision-making and
strategic planning.
Flowchart

The flowchart shown in figure 3 provides a visual representation of the steps involved
in training and evaluating the decision tree model, aiding in understanding the overall process
and facilitating communication between different stakeholders.

Here is a brief explanation of the flowchart of model training in figure 3:

1. Start: The flowchart begins with the start symbol, indicating the beginning of the decision
tree algorithm.

2. Load Dataset: The algorithm loads the dataset, which contains the input features and target
variable.

3. Define Features and Target: The feature columns and target column are defined, specifying
the variables to be used for training the decision tree.

4. Split Data: The dataset is split into training and testing sets using the train_test_split
function, allocating a portion of the data for model evaluation.

5. Data Imputation: The SimpleImputer object is used to handle missing values in the dataset,
replacing them with the mean value of the respective feature.

6. Build Decision Tree: The DecisionTreeClassifier object is created, representing the decision
tree model. It is trained on the training data using the fit function.

7. Predictions: The trained decision tree is utilized to make predictions on the test set, using
the predict function.

The flowchart concludes with the end symbol, indicating the completion of the decision tree
algorithm.
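The flowchart steps can be sketched with scikit-learn as follows. This is not the project's actual code: the data is a synthetic stand-in with two hypothetical numeric features, and the 20% test share and the fixed random seeds are assumptions made so that the sketch runs on its own.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Steps 2-3: synthetic stand-in for the dataset (features X, target y).
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 1.5).astype(int)
# Blank out roughly 5% of the feature values to mimic missing data.
X[rng.random(X.shape) < 0.05] = np.nan

# Step 4: split into training and testing sets (test share is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 5: replace missing values with the mean of each feature.
# The means are learned from the training set only and reused on the test set.
imputer = SimpleImputer(strategy="mean")
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Step 6: create the decision tree and train it with fit.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Step 7: predict on the test set.
y_pred = model.predict(X_test)
print(y_pred[:10])
```

Fitting the imputer on the training set alone keeps information from the test set out of the training step.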

Methodology

The methodology covers the algorithm used, the dataset used, and the
flowchart of how the data is used. A step-by-step explanation of the
algorithm follows.

Algorithm Used: The decision tree algorithm is a widely used supervised
learning technique employed for both classification and regression tasks. It
constructs a structured model resembling a flowchart, driven by input features.

Tree Construction: The algorithm commences by considering the entire dataset
as the root node, and selects the optimal feature for partitioning the data.

Feature Split: The chosen feature is utilized to divide the data into subsets,
thereby creating branches or paths within the decision tree.

Recursive Splitting: The process of feature splitting is applied recursively to
each subset until a predefined stopping criterion is satisfied.

Leaf Node Assignment: Leaf nodes are assigned class labels or regression values
based on the majority class or mean value of the target variable within each
respective subset.

Prediction: To make predictions, the algorithm traverses the decision tree by
evaluating feature values and ultimately reaching a leaf node to obtain the final
prediction.

Advantages: Decision trees are easy to comprehend and interpret, accommodate
numerical and categorical data, handle missing values gracefully, and capture
nonlinear relationships effectively.

Disadvantages: Decision trees are prone to overfitting, necessitating proper
regularization techniques. They can be sensitive to changes in the dataset,
leading to instability, and they exhibit bias towards features with high
cardinality or many levels.

In conclusion, decision trees offer versatility and transparency in model
interpretation. However, caution must be exercised to address overfitting issues
and effectively manage the algorithm's limitations.
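To make the feature split step concrete, the sketch below scores candidate thresholds on a single numeric feature and keeps the one with the lowest weighted impurity. Gini impurity is assumed here as the splitting criterion, since the report does not name the criterion used; the function names and the toy data are illustrative.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'.

    Returns (None, gini(labels)) when no split improves on the unsplit node.
    """
    n = len(labels)
    best = (None, gini(labels))
    # The largest value is skipped: splitting there leaves the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best


# Toy data: small amounts are legal (0), large amounts are fraudulent (1).
amounts = [10, 20, 30, 400, 500, 600]
labels = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(amounts, labels)
print(threshold, impurity)
```

A full tree repeats this search over every feature and then recurses into the two resulting subsets, which is the recursive splitting step described above.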

Result
The goal was to predict whether a transaction is a legal transaction or a
fraudulent transaction, which falls under the scope of a classification
problem. We used supervised machine learning models in order to
achieve the highest prediction accuracy. K Nearest Neighbor,
Logistic Regression, Support Vector Machine, Decision Tree, and
Random Forest models were trained using the k-fold technique. Training
used 5 folds in total, and the accuracy of the model kept increasing
with each fold up to the 5th fold. Beyond 5 folds, accuracy started
decreasing because our dataset was not large enough for more than 5 folds.
The final model was therefore trained on 5 folds, with 88.55% average
accuracy. This suggests that if Random Forest were trained on a bigger
data set using the k-fold technique, the average accuracy of the model
could be even higher.
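For reference, the k-fold technique splits the data into k parts, holds out each part once for testing while training on the rest, and averages the k scores. The sketch below only generates the index splits, in pure Python and without shuffling; the function name `k_fold_indices` and the 10-sample example are illustrative, and model training is deliberately left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds receive one extra sample each.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = (list(range(0, start))
                     + list(range(start + size, n_samples)))
        yield train_idx, test_idx
        start += size


# Example: 10 samples split into 5 folds, as in the 5-fold setup above.
folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every sample appears in exactly one test fold, so the averaged score uses each record once for evaluation.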

As a result, the Decision Tree model had the greatest prediction
accuracy, 99.92%, with a recall of 86.96%.
Due to the huge amount of data, the Support Vector Machine and
Random Forest models could not finish training, even on Google Colab.
Further work can be done by undersampling the data to a 50:50 class
ratio, which would reduce the data size further so that SVM and Random
Forest results can be compiled accurately. These are initial results;
final results could not be compiled due to insufficient computing power.
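The 50:50 undersampling proposed as further work could look roughly like the sketch below, which keeps every fraudulent row and draws an equally sized random sample of legal rows. This was not carried out in the project; the function name `undersample_balanced`, the fixed seed, and the toy rows are assumptions.

```python
import random


def undersample_balanced(rows, label_key="isFraud", seed=0):
    """Keep all fraud rows plus an equal-sized random sample of legal rows."""
    fraud = [r for r in rows if r[label_key] == 1]
    legal = [r for r in rows if r[label_key] == 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    kept_legal = rng.sample(legal, k=min(len(fraud), len(legal)))
    balanced = fraud + kept_legal
    rng.shuffle(balanced)  # avoid leaving all fraud rows at the front
    return balanced


# Toy data: 100 rows, of which every tenth row is labelled as fraud.
rows = [{"amount": float(i), "isFraud": 1 if i % 10 == 0 else 0}
        for i in range(100)]
balanced = undersample_balanced(rows)
print(len(balanced), sum(r["isFraud"] for r in balanced))
```

Applied to the 8,213 fraudulent records reported earlier, this would leave 16,426 rows in total. Metrics measured on such a balanced sample no longer reflect the original 0.13% fraud rate.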

References

1. Design and development of financial fraud detection using
machine learning. (2024). International Journal of Emerging
Trends in Engineering Research, 8(9), 5838–5843.
https://doi.org/10.30534/ijeter/2020/152892020

2. Rucco, M., Giannini, F., Lupinetti, K., & Monti, M. (2019). A
methodology for part classification with supervised machine
learning. Artificial Intelligence for Engineering Design, Analysis
and Manufacturing, 33(1), 100–113.
https://doi.org/10.1017/S0890060418000197

3. Saarikoski, J., Joutsijoki, H., Järvelin, K., Laurikkala, J., & Juhola,
M. (2015). On the influence of training data quality on text
document classification using machine learning methods.
International Journal of Knowledge Engineering and Data Mining,
3(2), 143. https://doi.org/10.1504/IJKEDM.2015.071284
