Posted on Aug 13, 2019 • Edited on Aug 18, 2019

A short brief about Classification

#python #machinelearning #classification #datascience

What is Classification?

This is the first question I did when I heard the term Classification. The definition says it is, fundamentally, a model to predict labels. It brought me a new question. What are labels? Well, in a dataset for a classification models we will find features and labels, where a feature is a column used as an input data and the label is the value we want to predict.
So, when we know what value we want to predict with a Machine Learning model we have a Classification Problem.

Testing Classification Models

Let's get a NBA Log dataset. The goal is to predict if a player will last longer than 5 years in league. This data contains a target column, the TARGET_5Yrs column can be 0 (< 5 years) or 1 (>= 5 years). As we know our target (label), we can say for sure this is a Classification problem.

This dataset can be found here.

Requirements

Here are the libraries we will use in this example.

import os import numpy as np import pandas as pd import matplotlib.pyplot as plt from pandas.plotting import scatter_matrix from sklearn import model_selection from sklearn.metrics import classification_report from sklearn.metrics import confusion_matrix from sklearn.metrics import accuracy_score from sklearn.linear_model import LogisticRegression from sklearn.tree import DecisionTreeClassifier from sklearn.neighbors import KNeighborsClassifier from sklearn.discriminant_analysis import LinearDiscriminantAnalysis from sklearn.naive_bayes import GaussianNB from sklearn.svm import SVC # define the dataset path DATASET_PATH = os.path.join("datasets") SEED = 7

Loading the dataset

The first thing to do is load the dataset, let's use Pandas to do it and check how data is available.

# create a function to load the dataset def load_nba_data(dataset_path=DATASET_PATH): csv_path = os.path.join(dataset_path, "nba_logreg.csv") return pd.read_csv(csv_path) # load the dataset nba_data = load_nba_data() # replace NaN fields with 0 nba_data.fillna(0, inplace=True) # show 10 first rows nba_data.head(10)

	Name	GP	MIN	PTS	FGM	FGA	FG%	3P Made	3PA	3P%	...	FTA	FT%	OREB	DREB	REB	AST	STL	BLK	TOV	TARGET_5Yrs
0	Brandon Ingram	36	27.4	7.4	2.6	7.6	34.7	0.5	2.1	25.0	...	2.3	69.9	0.7	3.4	4.1	1.9	0.4	0.4	1.3	0.0
1	Andrew Harrison	35	26.9	7.2	2.0	6.7	29.6	0.7	2.8	23.5	...	3.4	76.5	0.5	2.0	2.4	3.7	1.1	0.5	1.6	0.0
2	JaKarr Sampson	74	15.3	5.2	2.0	4.7	42.2	0.4	1.7	24.4	...	1.3	67.0	0.5	1.7	2.2	1.0	0.5	0.3	1.0	0.0
3	Malik Sealy	58	11.6	5.7	2.3	5.5	42.6	0.1	0.5	22.6	...	1.3	68.9	1.0	0.9	1.9	0.8	0.6	0.1	1.0	1.0
4	Matt Geiger	48	11.5	4.5	1.6	3.0	52.4	0.0	0.1	0.0	...	1.9	67.4	1.0	1.5	2.5	0.3	0.3	0.4	0.8	1.0
5	Tony Bennett	75	11.4	3.7	1.5	3.5	42.3	0.3	1.1	32.5	...	0.5	73.2	0.2	0.7	0.8	1.8	0.4	0.0	0.7	0.0
6	Don MacLean	62	10.9	6.6	2.5	5.8	43.5	0.0	0.1	50.0	...	1.8	81.1	0.5	1.4	2.0	0.6	0.2	0.1	0.7	1.0
7	Tracy Murray	48	10.3	5.7	2.3	5.4	41.5	0.4	1.5	30.0	...	0.8	87.5	0.8	0.9	1.7	0.2	0.2	0.1	0.7	1.0
8	Duane Cooper	65	9.9	2.4	1.0	2.4	39.2	0.1	0.5	23.3	...	0.5	71.4	0.2	0.6	0.8	2.3	0.3	0.0	1.1	0.0
9	Dave Johnson	42	8.5	3.7	1.4	3.5	38.3	0.1	0.3	21.4	...	1.4	67.8	0.4	0.7	1.1	0.3	0.2	0.0	0.7	0.0

10 rows × 21 columns

Here we have a small sample of our dataset. Let's discard the Name and the TARGET_5Yrs columns, all the others are the features of every player, these will tell us if the player will last longer than 5 years in a league or not. The TARGET_5Yrs has the answer for every combination of features.

Let's check a quick description of our dataset with the info() function.

# first let's remove the uneeded Name column, 'cause it's not relevant for this experiment nba_data = nba_data.drop('Name', 1) nba_data.info()

# first let's remove the uneeded Name column, 'cause it's not relevant for this experiment nba_data = nba_data.drop('Name', 1) nba_data.info()

<class 'pandas.core.frame.DataFrame'> RangeIndex: 1340 entries, 0 to 1339 Data columns (total 20 columns): GP 1340 non-null int64 MIN 1340 non-null float64 PTS 1340 non-null float64 FGM 1340 non-null float64 FGA 1340 non-null float64 FG% 1340 non-null float64 3P Made 1340 non-null float64 3PA 1340 non-null float64 3P% 1340 non-null float64 FTM 1340 non-null float64 FTA 1340 non-null float64 FT% 1340 non-null float64 OREB 1340 non-null float64 DREB 1340 non-null float64 REB 1340 non-null float64 AST 1340 non-null float64 STL 1340 non-null float64 BLK 1340 non-null float64 TOV 1340 non-null float64 TARGET_5Yrs 1340 non-null float64 dtypes: float64(19), int64(1) memory usage: 209.5 KB

We can also check some statistics information on our dataset with the describe function.

nba_data.describe()

	GP	MIN	PTS	FGM	FGA	FG%	3P Made	3PA	3P%	FTM	FTA	FT%	OREB	DREB	REB	AST	STL	BLK	TOV	TARGET_5Yrs
count	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000	1340.000000
mean	60.414179	17.624627	6.801493	2.629104	5.885299	44.169403	0.247612	0.779179	19.149627	1.297687	1.821940	70.300299	1.009403	2.025746	3.034478	1.550522	0.618507	0.368582	1.193582	0.620149
std	17.433992	8.307964	4.357545	1.683555	3.593488	6.137679	0.383688	1.061847	16.051861	0.987246	1.322984	10.578479	0.777119	1.360008	2.057774	1.471169	0.409759	0.429049	0.722541	0.485531
min	11.000000	3.100000	0.700000	0.300000	0.800000	23.800000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.200000	0.300000	0.000000	0.000000	0.000000	0.100000	0.000000
25%	47.000000	10.875000	3.700000	1.400000	3.300000	40.200000	0.000000	0.000000	0.000000	0.600000	0.900000	64.700000	0.400000	1.000000	1.500000	0.600000	0.300000	0.100000	0.700000	0.000000
50%	63.000000	16.100000	5.550000	2.100000	4.800000	44.100000	0.100000	0.300000	22.200000	1.000000	1.500000	71.250000	0.800000	1.700000	2.500000	1.100000	0.500000	0.200000	1.000000	1.000000
75%	77.000000	22.900000	8.800000	3.400000	7.500000	47.900000	0.400000	1.200000	32.500000	1.600000	2.300000	77.600000	1.400000	2.600000	4.000000	2.000000	0.800000	0.500000	1.500000	1.000000
max	82.000000	40.900000	28.200000	10.200000	19.800000	73.700000	2.300000	6.500000	100.000000	7.700000	10.200000	100.000000	5.300000	9.600000	13.900000	10.600000	2.500000	3.900000	4.400000	1.000000

Let's take a look at our target is distributed over the dataset.

nba_data.groupby('TARGET_5Yrs').size()

TARGET_5Yrs 0.0 509 1.0 831 dtype: int64

In resume, we have 1340 objects in our dataset where 509 will not last longer than 5 years in league and the others 831 will.

Machine Learning Models Evaluation

As we saw in the beginning of this post, this is a Classification Problem. We will create some models with different ML algorithms and check their accuracy.

Spliting Data

Let's split our dataset into two new datasets. We will use 80% of the dataset to train our classification models and 20% of it to perform the validation.

data = nba_data.values # data = np.array(data)  # now let's separate the features columns from the target column X = data[:, 0:19] Y = data[:, 19] # as said before, we will use 20% of the dataset for validation validation_size = 0.20 # split the data into traning and testing X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=SEED)

Now that we have our training and testing set, we are going to create an array with the models we want to evaluate. We will use each model with the default settings.

models = [] models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr'))) models.append(('LDA', LinearDiscriminantAnalysis())) models.append(('KNN', KNeighborsClassifier())) models.append(('CART', DecisionTreeClassifier())) models.append(('NB', GaussianNB())) models.append(('SVM', SVC(gamma='auto')))

To evaluate the models we will user K-Fold cross-validation and measure the accuracy for each model. This techinique randomly splits the training set into K distincts subsets (folds), then it trains and evaluates the model K times picking a different fold for every evaliation. The result will be an array with the K evaluation scores. For this example we will a Cross-Validation using the StratifiedKFold from SKLearn. We will use the Mean of the accuracies of each model to determinate which one has the best results.

scoring = 'accuracy' models_results = [] for name, model in models: results = [] skfolds = model_selection.StratifiedKFold(n_splits=10, random_state=SEED) for train_index, test_index in skfolds.split(X_train, Y_train): X_train_folds = X_train[train_index] Y_train_folds = (Y_train[train_index]) X_test_folds = X_train[test_index] Y_test_folds = (Y_train[test_index]) model.fit(X_train_folds, Y_train_folds) pred = model.predict(X_test_folds) correct = sum(pred == Y_test_folds) results.append(correct / len(pred)) models_results.append((name, results)) names = [] scores = [] # the snippet bellow calculates the mean of the accuracies for name, results in models_results: mean = np.array(results).mean() std = np.array(results).std() print("Model: %s, Accuracy Mean: %f (%f)" % (name, mean, std)) names.append(name) scores.append(results)

Model: LR, Accuracy Mean: 0.705244 (0.026186) Model: LDA, Accuracy Mean: 0.706205 (0.027503) Model: KNN, Accuracy Mean: 0.674429 (0.026029) Model: CART, Accuracy Mean: 0.634372 (0.047236) Model: NB, Accuracy Mean: 0.632433 (0.040794) Model: SVM, Accuracy Mean: 0.619384 (0.021099)

The results above show us that the Linear Discriminant Analysis has the best accuracy score among the models we tested. The boxplot below shows the accuracy scores spread accross each fold.

fig = plt.figure() fig.suptitle('Models Comparison') ax = fig.add_subplot(111) plt.boxplot(scores) ax.set_xticklabels(names) plt.show()

Making Predictions

Now we will check the accuracy of the LDA model by making some predictions with the validation set we've prepared before. To do so, we will create an instance of the model and use the method predict.

model = LinearDiscriminantAnalysis() model.fit(X_train, Y_train) predictions = model.predict(X_test) print("Accuracy: {}".format(accuracy_score(Y_test, predictions)))

Accuracy: 0.6902985074626866

We can also check the Confusion Matrix for this model

print(confusion_matrix(Y_test, predictions))

[[ 52 45] [ 38 133]]

Each row in a confusion matrix represents an actual target and each column represents a predicted target. The first row of this matrix contains the true negatives and the false positives. Which means that 52 samples were correctly classified and 45 were wrongly classified. The second row shows us the false negatives and the true positives, wich means that 38 samples were wrongly classified and 133 were classified correctly.

The confusion matrix provides a lot of information, but if you want to get a more concise metrics you can use the classification_report function of Scikit-Learn. It will provide the precision, recall and f1-score metrics.

print(classification_report(Y_test, predictions))

 precision recall f1-score support 0.0 0.58 0.54 0.56 97 1.0 0.75 0.78 0.76 171 accuracy 0.69 268 macro avg 0.66 0.66 0.66 268 weighted avg 0.69 0.69 0.69 268

The accuracy of the positive predictions is called precision. It's defined by the formula: TP/(TP + FP), where TP is the number of True Positives and FP is the number of False Positives. This metric is tipically used along the recall which is the true positive rate - the ratio of positive instances that are correctly detected by the model. It's equation is: TP / (TP + FN) where FN is the False Negatives.

Conclusion

This is a short brief about Classification with Python and Scikit-Learning. There is a lot more to cover, we can improve our models results by normalizing the data for example. There's also others metrics to cover. But the firts steps into Machine Learning world can be done with this tutorial. Hope you enjoy it!!

You can access the notebook for this example here.

References

Your First Machine Learning Project in Python Step-By-Step - https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: Concepts, tools, and techniques to build intelligent systems.