Abstract
In a previous article, we saw the ease with which we could install and use Apache Spark within the SingleStore notebook environment. Continuing our series on Spark, we'll now use it to classify fraudulent credit card transactions.
The notebook file used in this article is available on GitHub.
Fraud dataset selection
We can find actual credit card data on Kaggle. The data are anonymised credit card transactions containing genuine and fraudulent cases.
The transactions occurred over two days during September 2013, and the dataset includes a total of 284,807 transactions, of which 492 are fraudulent, representing just 0.172% of the total.
This dataset, therefore, presents some challenges for analysis as it is highly unbalanced.
The dataset consists of the following fields:
- Time: The number of seconds elapsed between a transaction and the first transaction in the dataset
- V1 to V28: Details not available due to confidentiality reasons
- Amount: The monetary value of the transaction
- Class: The response variable (0 = no fraud, 1 = fraud)
One method to prepare the data for analysis is to keep all the fraudulent transactions and randomly sample 1% of the non-fraudulent transactions without replacement. The data would be sorted on the Time column and provide a total of 3265 rows. However, many other approaches are possible.
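The sampling approach described above could be sketched in pandas roughly as follows. This is an assumption about how such a reduced file might be produced, not the code used to build the dataset in this article; the function name `undersample`, the seed, and the small synthetic DataFrame are hypothetical.

```python
import pandas as pd


def undersample(df, frac=0.01, seed=42):
    """Keep every fraud row and a random fraction of the genuine rows.

    Hypothetical helper: `frac` and `seed` are illustrative defaults.
    """
    fraud = df[df["Class"] == 1]
    # Sample the genuine rows without replacement
    genuine = df[df["Class"] == 0].sample(
        frac=frac, replace=False, random_state=seed
    )
    # Combine both groups and restore chronological order
    combined = pd.concat([fraud, genuine])
    return combined.sort_values("Time").reset_index(drop=True)


# Tiny synthetic stand-in for the full Kaggle file (not real data)
demo = pd.DataFrame({
    "Time": range(1010),
    "Amount": [1.0] * 1010,
    "Class": [0] * 1000 + [1] * 10,
})

reduced = undersample(demo, frac=0.1)
print(reduced.groupby("Class").size())
```

With the real file, the same call with `frac=0.01` would keep all 492 fraudulent rows and about 1% of the genuine ones.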
We'll show the following metrics:
                  Predicted
                | Positive | Negative |
Actual          |          |          |
----------------+----------+----------+
Positive        |    TP    |    FN    |
----------------+----------+----------+
Negative        |    FP    |    TN    |
----------------+----------+----------+
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
Where
- Accuracy: Measures the proportion of correctly classified instances among all instances
- Precision: Quantifies the proportion of correctly identified positive cases out of all cases identified as positive
- Recall: Evaluates the proportion of correctly identified positive cases out of all actual positive cases
- F1 Score: Combines precision and recall into a single metric, balancing both measures to provide a comprehensive evaluation of a model's performance
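The formulas above translate directly into a few lines of Python. The sketch below is illustrative only: the function name `classification_metrics` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention. It also shows why accuracy alone can mislead on the full, unbalanced dataset: a model that labels every one of the 284,807 transactions as genuine would still score an accuracy of roughly 0.998 while finding none of the 492 frauds.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    Assumed convention: a ratio with a zero denominator is reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Baseline on the full dataset: predict "genuine" for every transaction,
# so all 492 frauds become false negatives
baseline = classification_metrics(tp=0, fp=0, fn=492, tn=284_807 - 492)

for name, value in baseline.items():
    print(f"{name}: {value:.4f}")
```

This is why precision, recall and F1 are reported alongside accuracy.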
Create a SingleStore Cloud account
A previous article showed the steps to create a free SingleStore Cloud account. We'll use the following settings:
- Workspace Group Name: Spark Demo Group
- Cloud Provider: AWS
- Region: US East 1 (N. Virginia)
- Workspace Name: spark-demo
- Size: S-00
Create a new notebook
From the left navigation pane in the cloud portal, we'll select DEVELOP > Data Studio.
In the top right of the web page, we'll select New Notebook > New Notebook, as shown in Figure 1.
We'll call the notebook spark_fraud_demo, select a Blank notebook template from the available options, and save it in the Personal location.
Fill out the notebook
First, let's install Java:
!conda install -y --quiet -c conda-forge openjdk=8
Next, we'll obtain the reduced dataset, already prepared, and load it into a Pandas DataFrame:
import pandas as pd

url = "https://raw.githubusercontent.com/VeryFatBoy/gpt-workshop/main/data/creditcard.csv"
pandas_df = pd.read_csv(url)
We can check the number of rows:
pandas_df.shape[0]
The output should be:
3265
We can check the number of rows in each Class:
pandas_df.groupby("Class").size()
The output should be:
Class
0    2773
1     492
dtype: int64
We can also output the first 5 rows, as follows:
pandas_df.head(5)
Since the details for the columns V1 to V28 are not available, we can only check the Amount:
pandas_df["Amount"].describe()
The output should be:
count    3265.000000
mean       86.715210
std       195.568876
min         0.000000
25%         4.490000
50%        21.900000
75%        80.310000
max      2917.640000
Name: Amount, dtype: float64
We can produce a quick plot of the Amount values using the following:
import plotly.express as px

fig = px.scatter(
    pandas_df,
    y = "Amount",
    color = pandas_df["Class"].astype(str),
    hover_data = ["Amount"]
)

fig.update_layout(
    # yaxis_type = "log",
    title = "Amount and Class"
)

fig.show()
The output is shown in Figure 2.
Another way we can look at the data is as a histogram:
fig = px.histogram(
    pandas_df,
    x = "Amount",
    nbins = 50
)

fig.show()
The output is shown in Figure 3.
Figures 2 and 3, together with the summary statistics above (a median of 21.90 and a 75th percentile of 80.31), show that the vast majority of transactions were small in value.
Next, let's create a SparkSession:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("Fraud Detection").getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
and then use Logistic Regression:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler

# Convert pandas DataFrame to Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)

# Select features and labels (columns 1 to 29, which skips Time at index 0
# and Class at the end, assuming the field order listed earlier)
features = spark_df.columns[1:30]
labels = "Class"

# Assemble features into vector
assembler = VectorAssembler(
    inputCols = features,
    outputCol = "features"
)

spark_df = assembler.transform(spark_df).select("features", labels)

# Split the data into training and testing sets
train, test = spark_df.cache().randomSplit([0.7, 0.3], seed = 42)

# Initialise logistic regression model
lr = LogisticRegression(
    maxIter = 1000,
    featuresCol = "features",
    labelCol = labels
)

# Train the logistic regression model
train_model = lr.fit(train)

# Make predictions on the test set
predictions = train_model.transform(test)

# Calculate the accuracy, precision, recall, and F1 score of the model
accuracy = predictions.filter(predictions.Class == predictions.prediction).count() / float(test.count())

evaluator = MulticlassClassificationEvaluator(
    labelCol = labels,
    predictionCol = "prediction"
)

# Note: the "...ByLabel" metrics are computed for the evaluator's metricLabel,
# which defaults to 0.0 (the genuine class) unless it is set otherwise
precision = evaluator.evaluate(
    predictions,
    {evaluator.metricName: "precisionByLabel"}
)

recall = evaluator.evaluate(
    predictions,
    {evaluator.metricName: "recallByLabel"}
)

f1 = evaluator.evaluate(
    predictions,
    {evaluator.metricName: "fMeasureByLabel"}
)
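One point worth keeping in mind when reading per-label metrics: to the best of our knowledge, Spark's MulticlassClassificationEvaluator computes precisionByLabel, recallByLabel and fMeasureByLabel for the class given by its metricLabel parameter, which defaults to 0.0, the genuine class in this dataset. The plain-Python sketch below does not use Spark at all; it only illustrates how much the same predictions can differ depending on which label is treated as positive. The helper name `metrics_for_label` and the counts are made up for illustration and are not results from the model.

```python
def metrics_for_label(pairs, label):
    """Precision, recall and F1 when `label` is treated as the positive class.

    `pairs` is a list of (actual, predicted) tuples. Hypothetical helper.
    """
    tp = sum(1 for actual, pred in pairs if actual == label and pred == label)
    fp = sum(1 for actual, pred in pairs if actual != label and pred == label)
    fn = sum(1 for actual, pred in pairs if actual == label and pred != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up (actual, predicted) pairs: 92 genuine and 8 fraudulent transactions
pairs = [(0, 0)] * 90 + [(0, 1)] * 2 + [(1, 0)] * 3 + [(1, 1)] * 5

genuine = metrics_for_label(pairs, label=0)
fraud = metrics_for_label(pairs, label=1)

print("Label 0: precision=%.4f recall=%.4f f1=%.4f" % genuine)
print("Label 1: precision=%.4f recall=%.4f f1=%.4f" % fraud)
```

In Spark, adding evaluator.metricLabel: 1.0 to the parameter map passed to evaluate() should report the metrics for the fraud class instead.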
Next, we'll create a Confusion Matrix:
# Create confusion matrix
cm = predictions.select("Class", "prediction")
cm = cm.groupBy("Class", "prediction").count()
cm = cm.toPandas()

# Pivot the confusion matrix
cm = cm.pivot(
    index = "Class",
    columns = "prediction",
    values = "count"
)

# Generate and plot the confusion matrix
fig = px.imshow(
    cm,
    x = ["Genuine (0)", "Fraudulent (1)"],
    y = ["Genuine (0)", "Fraudulent (1)"],
    color_continuous_scale = "Reds",
    labels = dict(x = "Predicted Label", y = "True Label")
)

# Add annotations to the heatmap
for i in range(len(cm)):
    for j in range(len(cm)):
        fig.add_annotation(
            x = j,
            y = i,
            text = str(cm.iloc[i, j]),
            font = dict(color = "white" if cm.iloc[i, j] > cm.values.max() / 2 else "black"),
            showarrow = False
        )

fig.update_layout(
    title_text = "Confusion Matrix - Logistic Regression",
    coloraxis_showscale = False
)

fig.show()
The output is shown in Figure 4.
Overall, most test transactions fall on the diagonal of the matrix, so the model has classified them correctly with relatively few errors.
We can also print some metrics:
# Print the accuracy, precision, recall and f1 of the model
print("Accuracy: %.4f" % accuracy)
print("Precision: %.4f" % precision)
print("Recall: %.4f" % recall)
print("F1: %.4f" % f1)
Example output:
Accuracy: 0.9817
Precision: 0.9862
Recall: 0.9924
F1: 0.9893
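As a quick sanity check, the F1 value in the example output can be recomputed from the printed precision and recall using the formula given earlier. The two inputs below are simply copied from the example output, so the result is only accurate to rounding.

```python
# Values copied from the example output above (already rounded to 4 places)
precision = 0.9862
recall = 0.9924

# F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
f1_check = 2 * (precision * recall) / (precision + recall)

print("F1: %.4f" % f1_check)  # F1: 0.9893
```

The recomputed value agrees with the F1 reported by the evaluator.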
Finally, we'll stop Spark:
spark.stop()
Summary
In this short article, we've been able to use Apache Spark to build the first iteration of a fraud detection model using SingleStore notebooks. In the next article in this series, we'll use the SingleStore Spark Connector to read and write data using the SingleStore Data Platform. Stay tuned.