Facial Expression Recognition via Python

Data Science with Python
Facial Expression Recognition
Final Project Report - OPIM 5894 - Data Science with Python
Team Brogrammers: Santanu Paul, Sree Inturi, Saurav Gupta, Vibhuti Upadhyay, Sunender Pothula
Nov 30 2017

Table of Contents
Facial Expression Recognition
1. Introduction
   1.1 Background - What is Representation Learning?
   1.2 Research Objectives
2. Data Description and Exploration
   2.1 About the Dataset
   2.2 Data Exploration
   2.3 Data Preprocessing
3. Dimensionality Reduction
   3.1 Curse of Dimensionality
   3.2 Principal Component Analysis
       Takeaway from the plot
       Visualizing the Eigen Value
       Interactive visualizations of PCA representation
       Improvements
   3.3 Linear Discriminant Analysis
4. Modeling
   4.1 Support Vector Machine
   4.2 Neural Networks
   4.3 Conclusion
5. Scope for Improvement
   5.1 CNN (Convolutional Neural Network) And Parameter Tuning
Attachments - Python Notebooks and Code

1. Introduction

1.1 Background - What is Representation Learning?

Older machine learning algorithms rely on the input being a set of features and then learn a classifier, regressor, etc. on top of it. Most of these features are hand-crafted, i.e. designed by humans. Classical examples of features in computer vision include SIFT, LBP, etc. The problem with these is that they are designed by humans based on heuristics. Images can be represented using these features and ML algorithms can be applied on top of that. However, they may not be optimal in terms of the objective function, i.e. it may be possible to design better features that lead to lower objective function values.

Instead of hand-crafting these image representations, we can learn them. That is known as representation learning. We can have a neural network which takes the image as an input and outputs a vector, which is the feature representation of the image. This is the representation learner. It can be followed by another neural network that acts as the classifier, regressor, etc.

For example: A wheel has a geometric shape, but its image may be complicated by shadows falling on the wheel, the sun glaring off the metal parts of the wheel, the fender of the car or an object in the foreground obscuring part of the wheel, and so on. We can try to manually describe what a wheel should look like and how it can be represented. Say, it should be circular, be black in color, have treads, etc. But these are all hand-crafted features and may not generalize to all situations. For example, if you look at the wheel from a different angle, it might be oval in shape. Or the lighting may cause it to have lighter and darker patches. These kinds of variations are hard to account for manually. Instead, we can let the representation learning neural network learn them from data by giving it several positive and negative examples of a wheel and training it end to end.

1.2 Research Objectives

The major objective of this project is to classify an image by the facial expression it shows.

(1) Image Classification

We present a method for classifying facial expressions from the analysis of facial deformations. The classification process is based on a neural network which classifies an image as “Happy” or “Sad”. Our neural network model extracts an expression skeleton of facial features. We also demonstrate the efficiency of our classifier by comparing it with SVM classifiers trained on PCA and LDA representations of the same data.

2. Data Description and Exploration

The data set used in this project comes from Challenges in Representation Learning: Facial Expression Recognition Challenge, which contains 48x48 pixel grayscale images of faces. The faces have been automatically registered so that the face is more or less centered and occupies about the same amount of space in each image. The task is to categorize each face, based on the emotion shown in the facial expression, into one of two categories (3=Happy, 4=Sad).

2.1 About the Dataset

The training set consists of 15,066 examples (Happy: 8,989, Sad: 6,077) and two columns, "emotion" and "pixels". The "emotion" column contains a numeric code, either 3 or 4, for the emotion that is present in the image. The "pixels" column contains a quoted string for each image. The contents of this string are space-separated pixel values in row-major order. Similarly, the test set used for the leaderboard consists of 3,589 examples and contains only the "pixels" column; our task is to predict the emotion (Happy or Sad).

There were no missing values in our data set; it was a clean dataset.

Value counts of our data points: our data set is reasonably balanced (roughly 60% Happy and 40% Sad).
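As a rough reference point for the accuracies reported later, the class counts quoted above can be turned into class shares and a majority-class baseline. This is a small arithmetic sketch and not part of the original notebooks; the "always predict the majority class" classifier is a hypothetical yardstick.

```python
# Class counts quoted in the report for the training set.
counts = {"Happy": 8989, "Sad": 6077}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a hypothetical classifier that always predicts the majority class.
majority_label = max(counts, key=counts.get)
majority_baseline = shares[majority_label]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority-class baseline ({majority_label}): {majority_baseline:.1%}")
```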

[Figure: screenshot of the data]

2.2 Data Exploration

As the pixel information is stored in the "pixels" column, our first goal is to split that column into multiple fields, so that we get a rough idea of what a final 48x48 picture looks like.
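The splitting step can be sketched as follows. This is a minimal illustration assuming only what the report states (a space-separated, row-major string of 48x48 values); the helper name `pixels_to_image` and the synthetic example string are hypothetical, not taken from the notebooks.

```python
import numpy as np

IMAGE_SIDE = 48  # images are 48x48 grayscale


def pixels_to_image(pixel_string: str, side: int = IMAGE_SIDE) -> np.ndarray:
    """Convert a space-separated, row-major pixel string into a 2-D array."""
    values = np.array([int(v) for v in pixel_string.split()], dtype=np.uint8)
    if values.size != side * side:
        raise ValueError(f"expected {side * side} pixel values, got {values.size}")
    # NumPy reshapes in row-major (C) order by default, matching the data layout.
    return values.reshape(side, side)


# Synthetic stand-in for one "pixels" entry: the values 0..2303 modulo 256.
example_string = " ".join(str(i % 256) for i in range(IMAGE_SIDE * IMAGE_SIDE))
image = pixels_to_image(example_string)

print(image.shape)
print(image[0, :5])  # first five pixels of the first row
```

The resulting array can be passed to an image plotting routine to inspect individual faces.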

Let’s see what the emotions look like.

[Figure: Happy Face, Sad Face]

2.3 Data Preprocessing

Standardization: Standardization is a good practice for many machine learning algorithms. Although our data is already on the same scale, i.e. pixel values from 0 to 255, we still chose to standardize it.

3. Dimensionality Reduction

3.1 Curse of Dimensionality

This term is often thrown about, especially when PCA and LDA are thrown into the mix. The phrase refers to how our perfectly good and reliable machine learning methods may suddenly perform badly when we are dealing with a very high-dimensional space. But what exactly do these two acronyms do? They are essentially transformation methods used for dimensionality reduction. If we are able to project our data from a higher-dimensional space to a lower one while keeping most of the relevant information, that makes life a lot easier for our learning methods. In our data, each 48 x 48 pixel image contributes 2,304 columns. A model fitted in such a high-dimensional space could perform badly, so it is the perfect time to introduce dimensionality reduction methods.
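The standardization step described above can be sketched with NumPy. This is an illustrative version under the assumption that each pixel column is scaled to zero mean and unit variance; the data here is random and only stands in for the real pixel matrix.

```python
import numpy as np


def standardize(X: np.ndarray) -> np.ndarray:
    """Scale each column to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns become zeros instead of dividing by 0
    return (X - mean) / std


rng = np.random.default_rng(0)
# Synthetic stand-in: 200 "images" of 2,304 pixel values in [0, 255].
X = rng.integers(0, 256, size=(200, 48 * 48)).astype(np.float64)
X_std = standardize(X)

print(X_std.shape)
print(X_std.mean(axis=0)[:3], X_std.std(axis=0)[:3])
```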

3.2 Principal Component Analysis

In a nutshell, PCA is a linear transformation algorithm that seeks to project the original features of our data onto a smaller set of features (or subspace) while still retaining most of the information. To do this, the algorithm tries to find the most appropriate directions/angles (the principal components) that maximize the variance in the new subspace. We know that principal components are orthogonal to each other. As such, when generating the covariance matrix in our new subspace, the off-diagonal values of the covariance matrix will be zero and only the diagonals (the eigenvalues) will be non-zero. It is these diagonal values that represent the variances of the principal components, i.e. the information about the variability of our features.

This is what our final preprocessed data looks like:

The method is as follows:
1. Standardize the data (already done)
2. Calculate the Eigen Vectors and Eigen Values of the Covariance matrix
3. Create a list of (Eigen Value, Eigen Vector) tuples
4. Sort the (Eigen Value, Eigen Vector) pairs from high to low
5. Calculate the explained variance from the Eigen Values

Takeaway from the plot:

There are two plots above, a smaller one embedded within the larger plot. The smaller plot (green and red) shows the distribution of the individual and explained variances across all features, while the larger plot (golden and black) portrays a zoomed section of the explained variances only. As we can see, out of our 2,304 features or columns, approximately 90% of the explained variance can be described by just over 107 features. So, if we wanted to implement PCA on this, extracting the top 107 components would be a very logical choice, as they already account for the majority of the variance in the data.

Visualizing the Eigen Value:

As alluded to above, the PCA method seeks to obtain the optimal directions (or eigenvectors) that capture the most variance (spread out the data points the most). Therefore, it may be informative to visualize these directions and their associated eigenvalues. For the purposes of this notebook and for speed, we invoke PCA to extract only the top 28 components. Of interest is that when one compares the first component "Eigenvalue 1" to the 28th component "Eigenvalue 28", it is obvious that more complicated directions or components are being generated in the search to maximize variance in the new feature subspace.

Interactive visualizations of PCA representation

When it comes to these dimensionality reduction methods, scatter plots are most commonly used because they allow for convenient visualization of clustering (if any exists), and this is exactly what we do by plotting the first 2 principal components. We found no observable clusters in the first two principal components.
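The five PCA steps listed in this section (standardize, eigendecomposition of the covariance matrix, pairing, sorting, explained variance) can be sketched with NumPy as below. This is a minimal illustration on synthetic data, not the notebook code; the matrix sizes and the three latent factors are assumptions chosen only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the pixel matrix (rows = images):
# 300 samples, 20 features driven by 3 latent factors plus a little noise.
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 20))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # step 1: standardize

cov = np.cov(X, rowvar=False)              # covariance matrix (20 x 20)
eig_vals, eig_vecs = np.linalg.eigh(cov)   # step 2: eigh suits symmetric matrices

# Steps 3-4: pair each eigenvalue with its eigenvector, sort high to low.
pairs = [(eig_vals[i], eig_vecs[:, i]) for i in range(len(eig_vals))]
pairs.sort(key=lambda pair: pair[0], reverse=True)

# Step 5: individual and cumulative explained variance.
sorted_vals = np.array([val for val, _ in pairs])
explained = sorted_vals / sorted_vals.sum()
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative share reaches 90%.
n_components_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print("components needed for 90% variance:", n_components_90)

# Project onto the first two principal components (the axes of a scatter plot).
W = np.column_stack([pairs[0][1], pairs[1][1]])
X_2d = X @ W
print(X_2d.shape)
```

The covariance of the projected data is diagonal, which matches the statement that off-diagonal values vanish in the new subspace.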

Improvements:

Looking at the reconstruction of the original image versus the image generated after PCA, the reconstructed images are not similar enough to the originals to tell the expressions apart. Facial expressions can be subtle, and a lot more information is needed to detect them. Sometimes even the naked eye fails to recognize the emotion in a reconstructed image. Hence, 90% of the variance is not enough information. Let's move to 95% variance (259 components).
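The trade-off between retained variance and reconstruction quality can be sketched as follows. This is an illustrative NumPy version based on the SVD of the centered data; the helper `pca_reconstruct` and the synthetic matrix are assumptions, and the component counts it prints are unrelated to the 107 and 259 found on the real images.

```python
import numpy as np


def pca_reconstruct(X: np.ndarray, variance_target: float):
    """Keep the fewest components reaching variance_target, then reconstruct X."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are principal directions; squared singular values are
    # proportional to the variance explained by each direction.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cumulative, variance_target) + 1)
    components = Vt[:k]
    scores = Xc @ components.T           # reduced representation
    X_hat = scores @ components + mean   # back to the original feature space
    return k, X_hat


rng = np.random.default_rng(1)
# Synthetic stand-in for image rows, with variance decaying across 64 directions.
X = rng.normal(size=(500, 64)) * np.linspace(3.0, 0.1, 64)

for target in (0.90, 0.95):
    k, X_hat = pca_reconstruct(X, target)
    mse = np.mean((X - X_hat) ** 2)
    print(f"target={target:.2f}  components={k}  reconstruction MSE={mse:.4f}")
```

Raising the variance target keeps more components and lowers the reconstruction error, which is the motivation for moving from 90% to 95%.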

But as we know, PCA is meant to be an unsupervised method and is therefore not optimized for separating different class labels. Classifying more accurately is what we try to accomplish with the very next method, i.e. LDA.

3.3 Linear Discriminant Analysis

LDA, much like PCA, is a linear transformation method commonly used in dimensionality reduction tasks. However, unlike the latter, which is an unsupervised learning algorithm, LDA falls into the class of supervised learning methods. As such, the goal of LDA is to use the available information about class labels to maximize the separation between the different classes by computing the component axes (linear discriminants) that do this.

LDA Implementation from Scratch

The objective of LDA is to preserve the class separation information whilst still reducing the dimensions of the dataset. As such, implementing the method from scratch can roughly be split into 4 distinct stages, as below.

A. Projected Means

Since this method was designed to take class labels into account, we first need to establish a suitable metric with which to measure the 'distance' or separation between different classes. Let's assume that we have a set of data points x that belong to one particular class w. In LDA, the first step is to project these points onto a new line, Y, that contains the class-specific information via the transformation

$$Y = \omega^\intercal x$$

With this, the idea is to find some method that maximizes the separation of these new projected variables. To do so, we first calculate the projected mean.

B. Scatter Matrices and their solutions

Having introduced our projected means, we now need to find a function that can represent the difference between the means and then maximize it. As in linear regression, where the most basic case is to find the line of best fit, we need to find the equivalent of the variance in this context. This is where we introduce scatter matrices, where the scatter is the equivalent of the variance.

$$\tilde{S}^{2} = (y - \tilde{\mu})^{2}$$

C. Selecting Optimal Projection Matrices

D. Transforming features onto new subspace

LDA Implementation via Sklearn:

We used Sklearn's inbuilt LDA function to invoke an LDA model. The syntax for the LDA implementation is very much like that of PCA: one calls the fit and transform methods, which fit the LDA model to the data and then apply the LDA dimensionality reduction to it. However, since LDA is a supervised learning algorithm, there is a second argument that the user must provide, namely the class labels, which in this case are the emotion labels of the images.
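The Sklearn call pattern described above can be sketched as follows. This is a hedged illustration on synthetic data rather than the notebook code; the feature count, class means and variable names are assumptions. With two classes, LDA yields at most one discriminant axis, which corresponds to the "LD 1" input used later for the SVM.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Synthetic stand-in for the image features: two classes labelled 3 (Happy)
# and 4 (Sad) whose means differ slightly in every one of 30 features.
n_per_class, n_features = 200, 30
X = np.vstack([
    rng.normal(loc=0.0, size=(n_per_class, n_features)),
    rng.normal(loc=0.5, size=(n_per_class, n_features)),
])
y = np.array([3] * n_per_class + [4] * n_per_class)

# With two classes LDA yields at most one discriminant axis (LD 1).
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, y)  # class labels are the second argument

mean_happy = X_lda[y == 3].mean()
mean_sad = X_lda[y == 4].mean()
print(X_lda.shape)
print("LD 1 mean for class 3:", mean_happy)
print("LD 1 mean for class 4:", mean_sad)
```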

Interactive visualizations of LDA representation:

From the scatter plot above, we can see that the data points are more clearly clustered when using LDA with class labels than when using PCA. This is an inherent advantage of having class labels to supervise the method with.

4. Modeling

4.1 Support Vector Machine

SVM can be considered an extension of the perceptron. Using the perceptron algorithm, we minimize misclassification errors. In SVMs, however, our optimization objective is to maximize the margin between the classes. The margin is defined as the distance between the separating hyperplane (decision boundary) and the training samples that are closest to this hyperplane (the support vectors).
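The margin definition above can be made concrete with a small calculation. This sketch only measures the margin of one given hyperplane on toy points; the points, `w` and `b` are hypothetical, and an actual SVM would additionally search for the hyperplane that makes this quantity as large as possible.

```python
import numpy as np


def distances_to_hyperplane(X, w, b):
    """Perpendicular distance of each row of X to the hyperplane w.x + b = 0."""
    return np.abs(X @ w + b) / np.linalg.norm(w)


# Toy 2-D points and a hypothetical separating hyperplane x1 + x2 - 3 = 0.
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [2.0, 2.5]])
w = np.array([1.0, 1.0])
b = -3.0

dist = distances_to_hyperplane(X, w, b)
margin = dist.min()  # distance to the closest sample(s)
closest = np.flatnonzero(np.isclose(dist, margin))

print("distances:", np.round(dist, 3))
print("margin:", round(float(margin), 3), "closest samples:", closest)
```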

Input X: the first 107 components from PCA. Running an SVM classifier with default parameters on it, we get an accuracy of 62%.

Input X: the first 259 components from PCA. Running an SVM classifier with default parameters on it, we get an accuracy of 65%.
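A PCA-then-SVM setup of this kind can be sketched with scikit-learn as below. This is an illustration on synthetic data, not the notebook code: the 20 components, the train/test split and the data itself are assumptions, so the printed accuracy says nothing about the 62% and 65% obtained on the real images.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for the image matrix: 600 samples, 100 features,
# labels 3 (Happy) / 4 (Sad) with a mean shift in the first 10 features.
y = rng.choice([3, 4], size=600)
X = rng.normal(size=(600, 100))
X[:, :10] += (y == 4)[:, None] * 1.5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Standardize, keep 20 principal components, then fit an SVC with defaults.
model = make_pipeline(StandardScaler(), PCA(n_components=20), SVC())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Putting the scaler and PCA inside the pipeline means they are fitted on the training split only, which avoids leaking test data into the reduction step.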

Input X: output from LDA, i.e. LD 1. Running an SVM classifier with default parameters, we get an accuracy of 66.4%.

The misclassification rate is 33.6%, so we will try to fit a neural network model so that our model classifies with more accuracy.

4.2 Neural Networks

A neural network is a computational model that works in a similar way to the neurons in the human brain. Each neuron takes an input, performs some operations, then passes the output to the following neuron.

As we are done pre-processing and splitting our dataset, we can start implementing our neural network. We have designed a simple neural network with one hidden layer, i.e. a vanilla NN with 50 nodes and the hyperbolic tangent activation function.

We have used a simple neural network with one hidden layer of 50 nodes. The learning rate used is quite low in order to find the optimum solution. A mix of gradient descent and the momentum method is used. The hyperbolic tangent function is applied in the hidden layer, and a cross-entropy loss function is computed on the softmax output. An accuracy of 65.8% was achieved.
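The architecture described above (one tanh hidden layer of 50 nodes, softmax output, cross-entropy loss, gradient descent with momentum) can be sketched in NumPy as follows. This is a minimal illustration on synthetic data, not the notebook implementation; the learning rate, momentum, epoch count and data are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the image features: two Gaussian blobs, labels 0/1.
n, d, hidden, classes = 400, 20, 50, 2
y = rng.integers(0, classes, size=n)
X = rng.normal(size=(n, d)) + 1.5 * y[:, None]
T = np.eye(classes)[y]  # one-hot targets

# Parameters and momentum buffers.
W1 = rng.normal(scale=0.1, size=(d, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, classes))
b2 = np.zeros(classes)
params = [W1, b1, W2, b2]
velocity = [np.zeros_like(p) for p in params]
learning_rate, momentum = 0.01, 0.9  # illustrative values


def forward(X):
    H = np.tanh(X @ W1 + b1)  # hidden layer, tanh activation
    logits = H @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)  # softmax probabilities
    return H, P


losses = []
for epoch in range(200):
    H, P = forward(X)
    losses.append(-np.mean(np.sum(T * np.log(P + 1e-12), axis=1)))  # cross-entropy

    # Backpropagation of the mean cross-entropy loss.
    d_logits = (P - T) / n
    d_hidden = (d_logits @ W2.T) * (1.0 - H**2)  # tanh' = 1 - tanh^2
    grads = [X.T @ d_hidden, d_hidden.sum(axis=0), H.T @ d_logits, d_logits.sum(axis=0)]

    # Gradient descent with momentum; in-place updates keep W1..b2 in sync.
    for p, v, g in zip(params, velocity, grads):
        v *= momentum
        v -= learning_rate * g
        p += v

_, P = forward(X)
accuracy = np.mean(P.argmax(axis=1) == y)
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, accuracy {accuracy:.3f}")
```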

The maximum accuracy is achieved rather quickly with this method using gradient descent and momentum.

4.3 Conclusion

Our model misclassifies about 33 images out of every 100. We looked at the original images to see which features it is not able to predict correctly. Pictures like the following are what our model is not able to predict correctly, maybe because of the hair or the eyes, or maybe because of the lighting. As the image set is very diverse, there may be some error there. Because of time constraints we were not able to run a CNN (Convolutional Neural Network) on the dataset, but that would be our next step.

Many pictures in our data had watermarks, just like this one, and were misclassified. The majority of our training data doesn’t have watermarks, which is another reason the model is not able to classify at its full capacity.

5. Scope for Improvement

5.1 CNN (Convolutional Neural Network) And Parameter Tuning

We were not able to tune the parameters of our neural network model because of the time crunch, and training on this huge dataset took a lot of time. So, going forward, not for the grades but for our own learning, we will be focusing on TensorFlow and CNNs. Traditional neural networks that are very good at image classification have many more parameters and take a lot of time if trained on a CPU. CNNs are faster and are applied heavily in image and video recognition, recommender systems and natural language processing. CNNs share weights in convolutional layers, which means that the same filter bank is used for each receptive field in the layer; this reduces the memory footprint and improves performance.
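The effect of weight sharing on the number of parameters can be illustrated with a short count. This is a back-of-the-envelope sketch: the dense layer mirrors the 50-node hidden layer described earlier, while the convolutional layer (32 filters of size 3x3) is a hypothetical example, not a design from the project.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


def conv2d_params(kernel: int, in_channels: int, n_filters: int) -> int:
    """Weights plus biases of a conv layer; independent of the image size."""
    return kernel * kernel * in_channels * n_filters + n_filters


# A 48x48 grayscale image flattened into a 50-unit hidden layer.
dense = dense_params(48 * 48, 50)

# Hypothetical first conv layer: 32 filters of size 3x3 on one input channel.
conv = conv2d_params(kernel=3, in_channels=1, n_filters=32)

print("dense layer parameters:", dense)
print("conv layer parameters:", conv)
```

Because the same filters are reused at every image position, the convolutional layer's parameter count does not grow with the image size, whereas the dense layer's does.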

Attachments - Python Notebooks and Code

1. Python Project_Image Classification.ipynb: Initial data exploration and preprocessing; dimensionality reduction by PCA and LDA
2. Python Project_Image Classification2.ipynb: SVM implementation on top of PCA and LDA (comparison)
3. Vanilla Neural Network.ipynb: Neural network implementation
