Comparing Performance of Support-Vector Machine and Logistic Regression

Description of Task

This notebook is a solution to the following task:

Compare the performance of a Support-Vector Machine with that of Logistic Regression. Try to optimize both algorithms' parameters and determine which one is best for thisdataset. At the end of the analysis, you should have chosen an algorithm and its optimal set of parameters: write this choice explicitly in the conclusions of your notebook.

Description of Dataset

This dataset is composed of 1300 samples with 25 features each. The first column is the sample id. The second column in the dataset represents the label. There are 2 possible values for the labels. The remaining columns are numeric features.

Importing Dataset:

#Observations #Features #NA Values #Duplicates label ---------------------------------------------------------------- 1300 25 None None 49.7 %

This quickly verifies that we have 1300 observations and 25 features. In addition, we see that there are no NA value or duplicated rows and that there is a ~50:50 proportion between the two labels in the dataset. Next, let's perform some dataset analysis.

Exploratory Data Analysis

In this section we will perform brief exploration into our data, looking at summary statistics, plotting the histogram spread of each feature, and looking at the correlation between each feature. This will help us to understand what the nature of the data and give us the opportunity to notice relationships in the data.

	feature_1	feature_2	feature_3	feature_4	feature_5	feature_6	feature_7	feature_8	feature_9	feature_10	...	feature_16	feature_17	feature_18	feature_19	feature_20	feature_21	feature_22	feature_23	feature_24	feature_25
count	1300.000000	1300.000000	1300.000000	1300.000000	1300.000000	1300.000000	1300.000000	1300.000000	1300.000000	1300.000000	...	1300.000000	1300.000000	1300.000000	1300.000000	1300.000000	1300.000000	1300.000000	1300.000000	1300.000000	1300.000000
mean	-0.338439	-0.151954	-0.185151	0.311299	0.279897	0.003881	0.178295	-0.542319	0.002054	-0.004932	...	-0.429254	-0.347421	-0.502387	0.433979	0.204415	-0.582756	-1.053712	0.107390	-0.032372	-0.374981
std	1.427770	1.670306	0.981991	2.404899	1.223137	1.136962	1.051852	1.701212	1.200918	1.043793	...	1.861464	1.435908	1.197181	2.039115	1.199795	0.666962	2.567385	1.100071	1.963108	1.321986
min	-5.577567	-6.132911	-3.953686	-8.213834	-3.473453	-4.393443	-3.724541	-5.207074	-4.245091	-3.596323	...	-8.235007	-4.290421	-5.107172	-5.532940	-4.604010	-2.660711	-11.418329	-3.258889	-6.199655	-4.451162
25%	-1.211653	-1.213953	-0.827902	-1.101432	-0.557659	-0.720631	-0.580719	-1.774042	-0.809177	-0.701703	...	-1.537929	-1.293797	-1.256219	-1.062781	-0.572558	-1.034037	-2.328044	-0.625966	-1.419217	-1.263837
50%	-0.248385	-0.031338	-0.155223	0.535549	0.231154	-0.039458	0.139593	-0.575434	0.039064	-0.010828	...	-0.384438	-0.331566	-0.463897	0.551644	0.237767	-0.570126	-0.624524	0.079981	-0.016846	-0.378328
75%	0.640682	0.916374	0.530636	2.052738	1.067619	0.772724	0.886920	0.562960	0.798489	0.686624	...	0.775868	0.562251	0.306079	1.949728	1.016510	-0.132454	0.632363	0.838287	1.310980	0.439124
max	3.720374	5.298651	2.773804	6.133258	4.615081	4.220863	3.983645	5.448051	3.626921	3.492904	...	5.370952	4.217335	3.630840	5.730333	4.078821	1.407933	5.377527	3.244948	8.961113	6.407959

8 rows × 25 columns

From the histograms and summary statistics we can see that we have symmetrical data. In addition, we can see that our data is normally distributed with mean near 0 and a standard deviation of most 2.6. Next, the correlation heatmap will tell us how our features relate to one another.

Unsupervised Learning

Next, we can perform Principal Component Analysis to look at how much we can reduce dimensions and retain the information of our data. We will look at how many components are needed to retain the data's variance, specifically looking at 80, 90, and 95 % variance. We will also then plot the variance explained by each component to see a visual representation of this.

 Variance Explained #Components Dimensionality Kept --------------------------------------------------------------- 80% 7 28.00% 90% 13 52.00% 95% 16 64.00%

Now that we have reduced the data, we can take our first three principal components and plot the points in the space spanned by these three components. We are also going to color the points according to their label to see if there is a clear split between the two labels of the data.

Supervised Learning

Now that we have some idea of the nature of our data, we can move on to our models.

Let us prepare and scale the data and define our models. We will be using logistic regression, support vector machine with a linear kernel, and then we will extend our analysis to SVM with other kernels too.

Logistic Regression

In our parameter grid we will experiment with different values for C, varying the regularization of the classifier, as well as the l1_ratio, which weighs our l1 vs l2 norms regularization.

 param_C param_l1_ratio mean_test_score rank_test_score 0 0.01 0.5 0.811538 10 1 0.01 0.7 0.791346 11 2 0.01 0.9 0.782692 12 3 0.10 0.5 0.823077 1 4 0.10 0.7 0.822115 2 5 0.10 0.9 0.817308 8 6 1.00 0.5 0.821154 6 7 1.00 0.7 0.818269 7 8 1.00 0.9 0.817308 8 9 10.00 0.5 0.822115 2 10 10.00 0.7 0.822115 2 11 10.00 0.9 0.822115 2 Best parameters: {'C': 0.1, 'l1_ratio': 0.5} Best score: 0.823076923076923

The logistic regression gives us a score of 0.82, which is a good start, but let's try to make use of polynomial features to see if this will give us a clearer split in our data. Logistic regression (and LSVC) works best with linearly separable data. Through polynomial features, we perform a nonlinear transformation of the data which might facilitate finding this linear separation.

Indeed, we notice an improved performance of 0.94. Note that the initial values for the C parameter were [0.01, 0.1, 1, 10], yet upon seeing that 0.1 performed the best, I decided to look around 0.1, changing to [0.1, 0.3, 0.5].

Support Vector Machine (linear kernel)

Next, let's move to the linear kernel Support Vector Machine, where again we experiment with the C parameter, and this time with the l1 and l2 penalties separately.

 param_C param_penalty mean_test_score rank_test_score 0 0.01 l1 0.811538 10 1 0.01 l2 0.818269 9 2 0.10 l1 0.819231 8 3 0.10 l2 0.824038 1 4 1.00 l1 0.822115 7 5 1.00 l2 0.824038 1 6 10.00 l1 0.824038 1 7 10.00 l2 0.824038 1 8 100.00 l1 0.824038 1 9 100.00 l2 0.824038 1 Best parameters: {'C': 0.1, 'penalty': 'l2'} Best score: 0.8240384615384615

As expected, given both classifiers are geared towards linearly separable data, we see a similar performance for the LinearSVC on the basic data.As we did for Logistic Regression, let's try performing LinearSVC on polynomial features as well.

 param_C param_penalty mean_test_score rank_test_score 0 0.01 l1 0.909615 3 1 0.01 l2 0.932692 2 2 0.10 l1 0.944231 1 3 0.10 l2 0.896154 4 4 1.00 l1 0.888462 5 5 1.00 l2 0.874038 6 6 10.00 l1 0.867308 7 7 10.00 l2 0.865385 9 8 100.00 l1 0.866346 8 9 100.00 l2 0.864423 10 Best parameters: {'C': 0.1, 'penalty': 'l1'} Best score: 0.9442307692307692

Again, a better performance on the polynomial features. Given that so far we have only looked at linear kernel, let's expand our exploration to the Gaussian and sigmoid kernels.

Support Vector Machine

In our parameter grid we will define the regularization parameter and the kernel, we will use the default values for gamma.

 param_C param_kernel mean_test_score rank_test_score 0 0.01 rbf 0.707692 14 1 0.01 sigmoid 0.799038 10 2 0.01 poly 0.630769 15 3 0.10 rbf 0.902885 4 4 0.10 sigmoid 0.802885 9 5 0.10 poly 0.821154 8 6 1.00 rbf 0.934615 1 7 1.00 sigmoid 0.740385 11 8 1.00 poly 0.874038 5 9 10.00 rbf 0.929808 2 10 10.00 sigmoid 0.712500 13 11 10.00 poly 0.850962 6 12 100.00 rbf 0.914423 3 13 100.00 sigmoid 0.720192 12 14 100.00 poly 0.828846 7 Best parameters: {'C': 1, 'kernel': 'rbf'} Best score: 0.9346153846153846

Not quite as good as the linear kernel, put a good performance nonetheless. Next, let's take the better performing kernel, rbf, and let's try a few different gamma values:

 param_C param_kernel param_gamma mean_test_score rank_test_score 0 0.01 rbf 0.01 0.635577 9 1 0.01 rbf 0.10 0.581731 10 2 0.01 rbf 1.00 0.499038 14 3 0.10 rbf 0.01 0.843269 8 4 0.10 rbf 0.10 0.891346 7 5 0.10 rbf 1.00 0.499038 14 6 1.00 rbf 0.01 0.898077 6 7 1.00 rbf 0.10 0.928846 2 8 1.00 rbf 1.00 0.562500 11 9 10.00 rbf 0.01 0.931731 1 10 10.00 rbf 0.10 0.924038 3 11 10.00 rbf 1.00 0.562500 11 12 100.00 rbf 0.01 0.922115 5 13 100.00 rbf 0.10 0.924038 3 14 100.00 rbf 1.00 0.562500 11 Best parameters: {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'} Best score: 0.9317307692307694

Performance

Having performed GridSearch to find the best parameters for different variations of our models, let's plot all of our models' ROC curve:

Finally let's rank all of our models:

Position	| Classifier	| Accuracy -------------------------------------------------------------------------------- 1	| Support Vector Machine	| 0.95 2	| SVM w/ RBF Kernel Detailed	| 0.95 3	| LinearSVC Polynomial Features	| 0.94 4	| Logistic Reg Polynomial Features	| 0.93 5	| Logistic Regression	| 0.82 6	| LinearSVC	| 0.82

Test Set

Given that my task was to look at Logistic Regression and Linear SVC, I will be implementing the best model out of that category. As seen above, the best performing model was the LinearSVC with polynomial features. This means that before using the model to predict the labels, we need to first create the polynomial features.

Conclusion

The task was to compare the performance of Logistic Regression and LinearSVC. I did this through a GridSearchCV in which I tried different parameters. First I split the data into a test and a train portion, and performed cross-validation through the training data, with 5 partitions.The initial choice of parameters gave me a basis to then explore with more care. In the Logistic Regression, I experimented with different C and l1ratio values. Starting with a range of different magnitudes for C, I was able to pinpoint the range which translated into the best performance. I then tested different parameters closer to this range, which provided marginal returns but still improved the accuracy. I also noticed that it was useful to run the data through polynomial features, because it was providing a nonlinear transformation, increasing the range of my classifiers. For LinearSVC, I experimented with different C parameters and penalties. I also used polynomial features for the LinearSVC. For the last model, since LinearSVC only uses the linear kernel, I tried the Support Vector Machine with the radial, sigmoid, and polynomial kernel. The rbf kernel performed the best, therefore a final extension was done, looking at different values for gamma.

Finally, in order to visualize the accuracy, the ROC curve was plotted for each best model, which tells us how well binary classification models perform. By plotting sensitivity by (1 - specificity) across different thresholds, we get a visual showcase of the comparitive perfomance of the models. The area under the curve, AUC, gives us the overall performance of the model, which is also an indicator that can be compared.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
README_files		README_files
.gitignore		.gitignore
README.md		README.md
main.ipynb		main.ipynb
mldata.TEST_FEATURES.csv		mldata.TEST_FEATURES.csv
mldata.TEST_LABELS.csv		mldata.TEST_LABELS.csv
mldata.csv		mldata.csv
mldata.description.txt		mldata.description.txt
pca_plot.py		pca_plot.py
test_predictions.txt		test_predictions.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Comparing Performance of Support-Vector Machine and Logistic Regression

Description of Task

Description of Dataset

Importing Dataset:

Exploratory Data Analysis

Unsupervised Learning

Supervised Learning

Logistic Regression

Support Vector Machine (linear kernel)

Support Vector Machine

Performance

Test Set

Conclusion

About

Uh oh!

Releases

Packages

Languages

tferracina/ml-project

Folders and files

Latest commit

History

Repository files navigation

Comparing Performance of Support-Vector Machine and Logistic Regression

Description of Task

Description of Dataset

Importing Dataset:

Exploratory Data Analysis

Unsupervised Learning

Supervised Learning

Logistic Regression

Support Vector Machine (linear kernel)

Support Vector Machine

Performance

Test Set

Conclusion

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages