Posted on Mar 31, 2024 • Edited on Dec 18, 2024

Quick tip: Using Apache Spark with SingleStore Notebooks for Fraud Detection

#singlestoredb #apachespark #frauddetection

Abstract

In a previous article, we saw the ease with which we could install and use Apache Spark within the SingleStore notebook environment. Continuing our series on Spark, we'll now use it to classify fraudulent credit card transactions.

The notebook file used in this article is available on GitHub.

Fraud dataset selection

We can find actual credit card data on Kaggle. The data are anonymised credit card transactions containing genuine and fraudulent cases.

The transactions occurred over two days during September 2013, and the dataset includes a total of 284,807 transactions, of which 492 are fraudulent, representing just 0.172% of the total.

This dataset, therefore, presents some challenges for analysis as it is highly unbalanced.
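To see why the imbalance matters, here is a small sketch using only the figures quoted above. The "baseline" is a hypothetical model that labels every transaction as genuine; it is not part of the original notebook.

```python
# Figures quoted in the article for the full Kaggle dataset
total_transactions = 284_807
fraud_transactions = 492

fraud_rate = fraud_transactions / total_transactions

# Hypothetical baseline: a "model" that labels every transaction as genuine
baseline_accuracy = (total_transactions - fraud_transactions) / total_transactions
baseline_recall = 0 / fraud_transactions  # it never catches a fraud case

print(f"Fraud rate:        {fraud_rate:.4%}")
print(f"Baseline accuracy: {baseline_accuracy:.4%}")
print(f"Baseline recall:   {baseline_recall:.4%}")
```

Such a baseline would score very high on accuracy while detecting no fraud at all, which is why we shouldn't rely on accuracy alone with data like this.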

The dataset consists of the following fields:

Time: The number of seconds elapsed between a transaction and the first transaction in the dataset
V1 to V28: Details not available due to confidentiality reasons
Amount: The monetary value of the transaction
Class: The response variable (0 = no fraud, 1 = fraud)

One method to prepare the data for analysis is to keep all the fraudulent transactions and randomly sample 1% of the non-fraudulent transactions without replacement. The data would be sorted on the Time column and provide a total of 3265 rows. However, many other approaches are possible.
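The sampling approach described above might be sketched as follows. This is an illustration in plain Python on synthetic stand-in rows, not the code used to prepare the reduced dataset; the `downsample` helper and the synthetic data are assumptions.

```python
import random

def downsample(rows, fraction=0.01, seed=42):
    """Keep every fraud row plus a random sample of the genuine rows."""
    fraud = [r for r in rows if r["Class"] == 1]
    genuine = [r for r in rows if r["Class"] == 0]
    rng = random.Random(seed)
    k = round(len(genuine) * fraction)
    sampled = rng.sample(genuine, k)  # sampling without replacement
    # Restore chronological order, as in the original dataset
    return sorted(fraud + sampled, key=lambda r: r["Time"])

# Synthetic stand-in data: 10,000 genuine rows and 20 fraud rows
rows = [{"Time": t, "Class": 0} for t in range(10_000)]
rows += [{"Time": t * 500 + 0.5, "Class": 1} for t in range(20)]

reduced = downsample(rows)
print(len(reduced), sum(r["Class"] for r in reduced))
```

With this synthetic input, the reduced set keeps all 20 fraud rows and 100 of the genuine rows. Other sampling methods, such as drawing each row independently with a given probability, would give slightly different row counts.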

We'll show the following metrics:

                 Predicted
                | Positive | Negative |
Actual          |          |          |
----------------+----------+----------+
Positive        |    TP    |    FN    |
----------------+----------+----------+
Negative        |    FP    |    TN    |
----------------+----------+----------+

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

Where

Accuracy: Measures the proportion of correctly classified instances among all instances
Precision: Quantifies the proportion of correctly identified positive cases out of all cases identified as positive
Recall: Evaluates the proportion of correctly identified positive cases out of all actual positive cases
F1 Score: Combines precision and recall into a single metric, balancing both measures to provide a comprehensive evaluation of a model's performance
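As a quick worked example, the four formulas above can be written as a small Python function. The counts passed in are hypothetical and chosen only to illustrate the arithmetic; `classification_metrics` is not a Spark function.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * (precision * recall) / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Here the accuracy is 0.97, while recall is only 0.80, showing how the measures can diverge when the positive class is rare.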

Create a SingleStore Cloud account

A previous article showed the steps to create a free SingleStore Cloud account. We'll use the following settings:

Workspace Group Name: Spark Demo Group
Cloud Provider: AWS
Region: US East 1 (N. Virginia)
Workspace Name: spark-demo
Size: S-00

Create a new notebook

From the left navigation pane in the cloud portal, we'll select DEVELOP > Data Studio.

In the top right of the web page, we'll select New Notebook > New Notebook, as shown in Figure 1.

Figure 1. New Notebook.

We'll call the notebook spark_fraud_demo, select a Blank notebook template from the available options, and save it in the Personal location.

Fill out the notebook

First, let's install Java:

!conda install -y --quiet -c conda-forge openjdk=8

Next, we'll obtain the reduced dataset, already prepared, and load it into a Pandas DataFrame:

import pandas as pd

# Reduced dataset: all fraudulent rows plus a sample of the genuine rows
url = "https://raw.githubusercontent.com/VeryFatBoy/gpt-workshop/main/data/creditcard.csv"
pandas_df = pd.read_csv(url)

We can check the number of rows:

pandas_df.shape[0]

The output should be:

3265

We can check the distribution of the Class column:

pandas_df.groupby("Class").size()

The output should be:

Class
0    2773
1     492
dtype: int64
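Using the counts above, we can quickly work out how balanced the reduced dataset is. This is a small sketch in plain Python that uses only the numbers printed by the groupby.

```python
# Class counts reported for the reduced dataset
class_counts = {0: 2773, 1: 492}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

for label, share in shares.items():
    print(f"Class {label}: {share:.2%}")
```

Fraudulent transactions now make up roughly 15% of the rows, compared with a fraction of a percent in the full dataset.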

We can also output the first 5 rows, as follows:

pandas_df.head(5)

Since the details for the columns V1 to V28 are not available, we can only check the Amount:

pandas_df["Amount"].describe()

The output should be:

count    3265.000000
mean       86.715210
std       195.568876
min         0.000000
25%         4.490000
50%        21.900000
75%        80.310000
max      2917.640000
Name: Amount, dtype: float64

We can produce a quick plot of the Amount values using the following:

import plotly.express as px

# Scatter plot of Amount by row index, coloured by Class
fig = px.scatter(
    pandas_df,
    y = "Amount",
    color = pandas_df["Class"].astype(str),
    hover_data = ["Amount"]
)

fig.update_layout(
    # yaxis_type = "log",
    title = "Amount and Class"
)

fig.show()

The output is shown in Figure 2.

Figure 2. Amount and Class.

Another way we can look at the data is as a histogram:

fig = px.histogram(
    pandas_df,
    x = "Amount",
    nbins = 50
)

fig.show()

The output is shown in Figure 3.

Figure 3. Histogram.

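Figures 2 and 3 show that the vast majority of transactions were small in value: the median Amount is 21.90 and three-quarters of the transactions are at or below 80.31, against a maximum of 2917.64.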

Next, let's create a SparkSession:

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("Fraud Detection").getOrCreate()

# Only show errors in the log output
spark.sparkContext.setLogLevel("ERROR")

Then we'll train and evaluate a Logistic Regression model:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler

# Convert pandas DataFrame to Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)

# Select features and labels
# Columns 1 to 29: V1 to V28 and Amount, assuming the original Kaggle column order
features = spark_df.columns[1:30]
labels = "Class"

# Assemble features into vector
assembler = VectorAssembler(
    inputCols = features,
    outputCol = "features"
)

spark_df = assembler.transform(spark_df).select("features", labels)

# Split the data into training and testing sets
train, test = spark_df.cache().randomSplit([0.7, 0.3], seed = 42)

# Initialise logistic regression model
lr = LogisticRegression(
    maxIter = 1000,
    featuresCol = "features",
    labelCol = labels
)

# Train the logistic regression model
train_model = lr.fit(train)

# Make predictions on the test set
predictions = train_model.transform(test)

# Calculate the accuracy, precision, recall, and F1 score of the model
accuracy = predictions.filter(predictions.Class == predictions.prediction).count() / float(test.count())

evaluator = MulticlassClassificationEvaluator(
    labelCol = labels,
    predictionCol = "prediction"
)

# Note: the *ByLabel metrics are computed for the evaluator's metricLabel,
# which defaults to 0.0 (the genuine class in this dataset)
precision = evaluator.evaluate(
    predictions,
    {evaluator.metricName: "precisionByLabel"}
)

recall = evaluator.evaluate(
    predictions,
    {evaluator.metricName: "recallByLabel"}
)

f1 = evaluator.evaluate(
    predictions,
    {evaluator.metricName: "fMeasureByLabel"}
)
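One detail worth keeping in mind is that the weights passed to randomSplit are target proportions, so the resulting split sizes are approximate rather than exact. The following plain-Python sketch illustrates the general idea, under the assumption that each row is assigned to a split by an independent random draw. It is not Spark's implementation, and `bernoulli_split` is a hypothetical helper.

```python
import random

def bernoulli_split(rows, train_weight=0.7, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in train with probability train_weight
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(10_000))  # stand-in row identifiers
train_rows, test_rows = bernoulli_split(rows)

print(len(train_rows), len(test_rows))
```

The training share should come out close to 70% without being exactly 7,000 rows, and fixing the seed makes the split reproducible from run to run.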

Next, we'll create a Confusion Matrix:

# Create confusion matrix
cm = predictions.select("Class", "prediction")
cm = cm.groupBy("Class", "prediction").count()
cm = cm.toPandas()

# Pivot the confusion matrix
cm = cm.pivot(
    index = "Class",
    columns = "prediction",
    values = "count"
)

# Generate and plot the confusion matrix
fig = px.imshow(
    cm,
    x = ["Genuine (0)", "Fraudulent (1)"],
    y = ["Genuine (0)", "Fraudulent (1)"],
    color_continuous_scale = "Reds",
    labels = dict(x = "Predicted Label", y = "True Label")
)

# Add annotations to the heatmap
for i in range(len(cm)):
    for j in range(len(cm)):
        fig.add_annotation(
            x = j,
            y = i,
            text = str(cm.iloc[i, j]),
            font = dict(color = "white" if cm.iloc[i, j] > cm.values.max() / 2 else "black"),
            showarrow = False
        )

fig.update_layout(
    title_text = "Confusion Matrix - Logistic Regression",
    coloraxis_showscale = False
)

fig.show()
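The group-and-pivot step above can be illustrated without Spark. In this plain-Python sketch, the (actual, predicted) pairs are made up and `confusion_counts` is a hypothetical helper; the point is only to show how the four cells are counted.

```python
from collections import Counter

def confusion_counts(pairs, labels=(0, 1)):
    """Build a matrix of counts: rows are actual labels, columns are predicted labels."""
    counts = Counter(pairs)
    return [
        [counts.get((actual, predicted), 0) for predicted in labels]
        for actual in labels
    ]

# Made-up (actual, predicted) pairs, for illustration only
pairs = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (0, 0)]

matrix = confusion_counts(pairs)
for actual, row in zip((0, 1), matrix):
    print(f"actual {actual}: {row}")
```

Note that this sketch fills a missing combination with 0. With the pandas pivot, a combination that never occurs in the predictions would appear as NaN, so a fillna(0) on the pivoted DataFrame may be worth adding if that can happen.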

The output is shown in Figure 4.

Figure 4. Confusion Matrix.

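Overall, most of the test transactions sit on the diagonal of the matrix, meaning the model classified them correctly, with relatively few false positives and false negatives.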

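We can also print some metrics. Since the evaluator's metricLabel defaults to 0, the precision, recall and F1 values describe the genuine class (0) rather than the fraud class (1):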

# Print the accuracy, precision, recall and f1 of the model
print("Accuracy: %.4f" % accuracy)
print("Precision: %.4f" % precision)
print("Recall: %.4f" % recall)
print("F1: %.4f" % f1)

Example output:

Accuracy: 0.9817
Precision: 0.9862
Recall: 0.9924
F1: 0.9893
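The ByLabel metrics are calculated for one label at a time, so the numbers depend on which class we treat as positive. To our knowledge, the evaluator selects that label through its metricLabel parameter, which defaults to 0.0. The sketch below shows the effect in plain Python; the matrix values are hypothetical and are not the results from this notebook, and `metrics_for_label` is an illustrative helper.

```python
def metrics_for_label(matrix, label):
    """Precision and recall for one label of a 2x2 matrix (rows: actual, columns: predicted)."""
    other = 1 - label
    tp = matrix[label][label]   # actual == label, predicted == label
    fp = matrix[other][label]   # actual != label, predicted == label
    fn = matrix[label][other]   # actual == label, predicted != label
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical confusion matrix, for illustration only
matrix = [
    [820, 5],    # actual genuine: predicted genuine, predicted fraud
    [12, 140],   # actual fraud:   predicted genuine, predicted fraud
]

for label in (0, 1):
    precision, recall = metrics_for_label(matrix, label)
    print(f"label {label}: precision={precision:.4f} recall={recall:.4f}")
```

In this made-up case, the scores for the fraud class are noticeably lower than those for the genuine class. For fraud detection, we would usually want to look at the metrics for label 1 as well.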

Finally, we'll stop Spark:

spark.stop()

Summary

In this short article, we've been able to use Apache Spark to build the first iteration of a fraud detection model using SingleStore notebooks. In the next article in this series, we'll use the SingleStore Spark Connector to read and write data using the SingleStore Data Platform. Stay tuned.