
Multiclass classification evaluation with ROC Curves and ROC AUC

Adapting the most used classification evaluation metric to the multiclass classification problem with OvR and OvO strategies

Image by author

When evaluating multiclass classification models, we sometimes need to adapt the metrics used in binary classification to work in this setting. We can do that by using OvR and OvO strategies.

In this article I will show how to adapt the ROC Curve and the ROC AUC metrics for multiclass classification.


The ROC Curve and the ROC AUC score are important tools for evaluating binary classification models. In summary, they show us the separability of the classes across all possible decision thresholds, or in other words, how well the model separates each class.

As I already explained in another article, we can compare the ROC Curves (top image) with their respective histograms (bottom image). The more separate the histograms are, the better the ROC Curves are as well.

ROC Curves comparison. Image by author.
Class separation histograms comparison. Image by author.

But this concept is not immediately applicable to multiclass classifiers. In order to use ROC Curves and ROC AUC in this scenario, we need another way to compare classes: OvR and OvO.

In the following sections I will explain both strategies in more detail, and you can also check the code on my GitHub:

Articles/ROC Curve and ROC AUC at main · vinyluis/Articles

OvR or OvO?

OvR – One vs Rest

OvR stands for "One vs Rest", and as the name suggests, it is a method for evaluating multiclass models by comparing each class against all the others at once. In this scenario we take one class and consider it our "positive" class, while all the others (the rest) are grouped into a single "negative" class.

By doing this, we reduce the multiclass classification problem to a binary one, and so it is possible to use all the known binary classification metrics to evaluate this scenario.
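
As a quick illustration, here is a minimal sketch of that binarization in plain Python (the labels below are made up):

# "apple" as the positive class, everything else (the rest) as negative
y = ['apple', 'banana', 'orange', 'apple']
y_ovr = [1 if label == 'apple' else 0 for label in y]  # -> [1, 0, 0, 1]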

We must repeat this for each class present in the data, so for a 3-class dataset we get 3 different OvR scores. In the end, we can average them (with a simple or weighted average) to get a final OvR score for the model.

OvR combinations for a three-class setting. Image by author.
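
If you only need that averaged number, scikit-learn can compute it directly. A short sketch, assuming y_true holds the true labels and y_proba holds the model's per-class predicted probabilities:

from sklearn.metrics import roc_auc_score

# Simple (macro) average of the per-class OvR AUCs; 'weighted' is also available
roc_auc_score(y_true, y_proba, multi_class='ovr', average='macro')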

OvO – One vs One

Now, as you might imagine, OvO stands for "One vs One", and it is really similar to OvR, but instead of comparing each class with the rest, we compare all possible two-class combinations of the dataset.

Let's say we have a 3-class scenario and we choose the combination "Class1 vs Class2" as the first one. The first step is to take a copy of the dataset that contains only the two classes, discarding all the others. Then we define observations whose real class is "Class1" as our positive class and those whose real class is "Class2" as our negative class. Now that the problem is binary, we can use the same metrics we use for binary classification.

Note that "Class1 vs Class2" is different from "Class2 vs Class1", so both cases should be counted. Because of that, a 3-class dataset yields 6 OvO scores and a 4-class dataset yields 12: in general, n(n-1) ordered pairs for n classes.

As in OvR, we can average all the OvO scores to get a final OvO model score.

OvO combinations for a three-class setting. Image by author.

OvR ROC Curves and ROC AUC

I will use the functions from my Binary Classification ROC article to plot the curves, with only a few adaptations; they are available here. You can also use the scikit-learn version, if you want.

In this example I will use a synthetic dataset with three classes: "apple", "banana" and "orange". The classes overlap in every pairwise combination, to make it hard for the classifier to learn every instance correctly. The dataset has only two features, "x" and "y", and looks like this:

Multiclass scatterplot. Image by author.

For the model, I trained a default instance of scikit-learn's RandomForestClassifier.
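
The exact dataset and training code are on the repo; a comparable setup might look like the sketch below (the class centers, spread, sample sizes and the df / model / X_COLS names are made-up assumptions, and a proper train/test split is omitted for brevity):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
classes = ['apple', 'banana', 'orange']
centers = [(0, 0), (2, 2), (2, 0)]       # close enough to overlap

# Build a two-feature dataframe with three overlapping Gaussian blobs
frames = []
for c, (cx, cy) in zip(classes, centers):
    frames.append(pd.DataFrame({'x': rng.normal(cx, 1.0, 200),
                                'y': rng.normal(cy, 1.0, 200),
                                'class': c}))
df = pd.concat(frames, ignore_index=True)

X_COLS = ['x', 'y']
model = RandomForestClassifier()          # default hyperparameters
model.fit(df[X_COLS], df['class'])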

In the code below we:

  • Iterate over all classes
  • Prepare an auxiliary dataframe using one class as "1" and all the others as "0"
  • Plot the histograms of the class distributions
  • Plot the ROC Curve for each case
  • Calculate the AUC for that specific class
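
The full code is on my GitHub (linked above); a minimal sketch of that loop, using the df, model, classes and X_COLS names from the setup sketch (and assuming classes matches the order of model.classes_), could look like this:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

roc_auc_ovr = {}
fig, axes = plt.subplots(2, len(classes), figsize=(15, 8))

for i, c in enumerate(classes):
    # Binarize: the current class is "1", all the others (the rest) are "0"
    df_aux = df.copy()
    df_aux['prob'] = model.predict_proba(df_aux[X_COLS])[:, i]
    df_aux['class'] = (df_aux['class'] == c).astype(int)

    # Histogram of the predicted probability, split by the binarized class
    for label in (0, 1):
        axes[0, i].hist(df_aux.loc[df_aux['class'] == label, 'prob'],
                        bins=20, alpha=0.5, label=str(label))
    axes[0, i].set_title(f"{c} vs Rest")
    axes[0, i].legend()

    # ROC Curve and AUC for this class against the rest
    fpr, tpr, _ = roc_curve(df_aux['class'], df_aux['prob'])
    axes[1, i].plot(fpr, tpr)
    roc_auc_ovr[c] = auc(fpr, tpr)

plt.tight_layout()
plt.show()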

This outputs the histograms and the ROC Curves for each class vs the rest:

ROC Curves and histograms OvR. Image by author.

As we can see, the scores for the "orange" class were a little lower than for the other two classes, but in all cases the classifier did a good job of predicting every class. We can also see in the histograms that the overlap present in the real data also exists in the predictions.

To display each OvR AUC score we can simply print them. We can also take the average score of the classifier:

# Displays the ROC AUC for each class
avg_roc_auc = 0
i = 0
for k in roc_auc_ovr:
    avg_roc_auc += roc_auc_ovr[k]
    i += 1
    print(f"{k} ROC AUC OvR: {roc_auc_ovr[k]:.4f}")
print(f"average ROC AUC OvR: {avg_roc_auc/i:.4f}")

And the output is:

apple ROC AUC OvR: 0.9425
banana ROC AUC OvR: 0.9525
orange ROC AUC OvR: 0.9281
average ROC AUC OvR: 0.9410

The average ROC AUC OvR in this case is 0.9410, a really good score that reflects how well the classifier predicted each class.

OvO ROC Curves and ROC AUC

With the same setup as in the previous experiment, the first thing we need to do is build a list with all possible pairs of classes:

classes_combinations = []
class_list = list(classes)
for i in range(len(class_list)):
    for j in range(i+1, len(class_list)):
        classes_combinations.append([class_list[i], class_list[j]])
        classes_combinations.append([class_list[j], class_list[i]])

The classes_combinations list will have all combinations:

[['apple', 'banana'], ['banana', 'apple'],
 ['apple', 'orange'], ['orange', 'apple'],
 ['banana', 'orange'], ['orange', 'banana']]
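
Equivalently, since the ordered pairs are just the 2-permutations of the class list, the standard library can build the same list (in a different order):

from itertools import permutations

classes_combinations = [list(p) for p in permutations(class_list, 2)]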

Then we iterate over all combinations and, similarly to the OvR case, we:

  • Prepare an auxiliary dataframe with only the instances of the two classes
  • Define instances of Class 1 as "1" and instances of Class 2 as "0"
  • Plot the histograms of the class distributions
  • Plot the ROC Curve for each case
  • Calculate the AUC for that specific combination
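
Again, the full code is on the repo; a minimal sketch of this loop, under the same naming assumptions as before, could be:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

roc_auc_ovo = {}
fig, axes = plt.subplots(2, len(classes_combinations), figsize=(20, 7))

for idx, (c1, c2) in enumerate(classes_combinations):
    title = f"{c1} vs {c2}"

    # Keep only the instances of the two classes; c1 is the positive ("1") class
    df_aux = df[df['class'].isin([c1, c2])].copy()
    df_aux['prob'] = model.predict_proba(df_aux[X_COLS])[:, list(classes).index(c1)]
    df_aux['class'] = (df_aux['class'] == c1).astype(int)

    # Histograms of the predicted probability for the pair
    for label in (0, 1):
        axes[0, idx].hist(df_aux.loc[df_aux['class'] == label, 'prob'],
                          bins=20, alpha=0.5, label=str(label))
    axes[0, idx].set_title(title)
    axes[0, idx].legend()

    # ROC Curve and AUC for this ordered pair
    fpr, tpr, _ = roc_curve(df_aux['class'], df_aux['prob'])
    axes[1, idx].plot(fpr, tpr)
    roc_auc_ovo[title] = auc(fpr, tpr)

plt.tight_layout()
plt.show()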

This plots all the histograms and ROC Curves:

ROC Curves and histograms OvO. Image by author.

Notice that, as expected, the "apple vs banana" plots are different from the "banana vs apple" ones. As in the previous case, we can evaluate each combination individually and check for model inconsistencies.

We can also display the AUCs and calculate the average OvO AUC:

# Displays the ROC AUC for each class
avg_roc_auc = 0
i = 0
for k in roc_auc_ovo:
    avg_roc_auc += roc_auc_ovo[k]
    i += 1
    print(f"{k} ROC AUC OvO: {roc_auc_ovo[k]:.4f}")
print(f"average ROC AUC OvO: {avg_roc_auc/i:.4f}")

And the output is:

apple vs banana ROC AUC OvO: 0.9561
banana vs apple ROC AUC OvO: 0.9547
apple vs orange ROC AUC OvO: 0.9279
orange vs apple ROC AUC OvO: 0.9231
banana vs orange ROC AUC OvO: 0.9498
orange vs banana ROC AUC OvO: 0.9336
average ROC AUC OvO: 0.9409

The average ROC AUC in this case is 0.9409, close to the score obtained in the OvR scenario (0.9410).
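
As a cross-check, scikit-learn's roc_auc_score can compute both averages directly. A sketch reusing the df, model and X_COLS names assumed earlier (the numbers will only match the manual loops if the same data and predicted probabilities are used):

from sklearn.metrics import roc_auc_score

y_proba = model.predict_proba(df[X_COLS])
ovr_macro = roc_auc_score(df['class'], y_proba, multi_class='ovr', average='macro')
ovo_macro = roc_auc_score(df['class'], y_proba, multi_class='ovo', average='macro')
print(f"OvR macro-average: {ovr_macro:.4f}")
print(f"OvO macro-average: {ovo_macro:.4f}")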

Conclusion

OvR and OvO strategies can (and should) be used to adapt any binary classification metric to the multiclass classification task.

Evaluating OvO and OvR results can also help us understand which classes the model is struggling to describe, and which features we can add or remove to improve its results.


If you liked this post, you can also read my article covering the ROC Curve and ROC AUC for the binary case:

Interpreting ROC Curve and ROC AUC for Classification Evaluation

