
Linear classification, multi-class, based on a 2D (2 features) problem

Following the binary (2-class) classification (HTML / Jupyter), let's generalize to more classes.

Two model-generated datasets are used, based on the Czech Republic and Norway flags. Both have three colors corresponding to the three classes we want to separate. The Czech flag is linearly separable; Norway's is not.

Learning goals:

  • Multiclass classification with softmax activation
  • Multilayer neural network to handle non-linear functions
  • Use regularization to improve score and convergence
In [1]:
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as pltcolors
from sklearn import metrics as skMetrics
import pandas
import seaborn as sns
In [2]:
usingTensorBoard = False 

Helpers

In [3]:
# Helper to plot the flag as a 2D parametric label
def flagPlot(x0, x1, y, title, colors):
    ax = plt.gca()
    ax.set_facecolor((0.8, 0.8, 0.8))
    plt.scatter(x0, x1, c=y, cmap=pltcolors.ListedColormap(colors), marker='x', alpha=0.5);
    plt.xlabel('x0')
    plt.ylabel('x1')
    plt.grid()
    plt.title(title)
    cb = plt.colorbar()
    loc = np.arange(0, 2.1, 1)
    cb.set_ticks(loc)
    cb.set_ticklabels([0, 1, 2]);

def plotHeatMap(X, classes, title=None, fmt='.2g', ax=None, xlabel=None, ylabel=None):
    """ Fix heatmap plot from Seaborn with pyplot 3.1.0, 3.1.1
    https://stackoverflow.com/questions/56942670/matplotlib-seaborn-first-and-last-row-cut-in-half-of-heatmap-plot
    """
    ax = sns.heatmap(X, xticklabels=classes, yticklabels=classes, annot=True, fmt=fmt,
                     cmap=plt.cm.Blues, ax=ax)  # notation: "annot" not "annote"
    bottom, top = ax.get_ylim()
    ax.set_ylim(bottom + 0.5, top - 0.5)
    if title:
        ax.set_title(title)
    if xlabel:
        ax.set_xlabel(xlabel)
    if ylabel:
        ax.set_ylabel(ylabel)

def plotConfusionMatrix(yTrue, yEst, classes, title=None, fmt='.2g', ax=None):
    plotHeatMap(skMetrics.confusion_matrix(yTrue, yEst), classes, title, fmt, ax,
                xlabel='Estimations', ylabel='True values');

Czech flag split of the 2D plane

Let's use the Czech flag as a pattern to fit with a 3-class linear (softmax) classifier. The Czech flag is made of three areas separated by straight segments, so it should be separable with linear boundaries.

Data model for the Czech flag

In [4]:
# Czech flag colors
czechColors = np.array(['blue', 'red', 'white'])
nFeatures = 2
nClasses = 3

# Generate a multi class sample based on the Czech banner geometry. Tip of the triangle is at the origin
def generateCzechBatch(n, noiseSigma=0.1):
    # xMax adjusted such that the 3 classes are quasi equiprobable
    xMin = np.array([-1, -1])
    xMax = np.array([0.5, 1])
    #
    X = np.random.uniform(xMin, xMax, (n, nFeatures))
    noisyX = X + np.random.normal(0, noiseSigma, X.shape)
    y = np.zeros(n)
    y[noisyX[:,1] > 0] = 2                                                  # White
    y[noisyX[:,1] <= 0] = 1                                                 # Red
    y[(noisyX[:,0] <= 0) & (np.abs(noisyX[:,1]) < np.abs(noisyX[:,0]))] = 0 # Blue triangle
    return X, y
In [5]:
N = 1000
xTrain, yTrain = generateCzechBatch(N)
plt.figure(figsize=(15,4))
plt.subplot(1,2,1)
flagPlot(xTrain[:,0], xTrain[:,1], yTrain, 'Generated', czechColors)
plt.subplot(1,2,2)
plt.hist(yTrain, 10, density=True, alpha=0.5)
plt.title('Generated histogram');

Test data

In [6]:
xTest, yTest = generateCzechBatch(N) 

Softmax regression

Maximum a posteriori estimation with softmax is the generalization of logistic regression to the case where the number of classes is greater than 2.

The softmax normalizes the set of output values so that they sum to 1; the outputs can then be interpreted as probabilities.

With K the number of classes: \begin{align} y \in \left\{ 1 ... K \right\} \end{align}

The softmax probability law of Y is: \begin{align} p (Y=c \mid x, w) & = \frac{e^{w_{c} x }}{\sum_{j=1}^{K}e^{w_{j} x}} \end{align}

We are looking for the value of $w$ that maximizes the likelihood: \begin{align} \hat{w} & = \arg \max_{w}{\prod_{i=1}^N{p(y_i \mid x_i, w)}} \\ & = \arg \max_{w}{\sum_{i=1}^N{\log \bigl(p(y_i \mid x_i, w)\bigr)} } \\ & = \arg \max_{w}{\sum_{c=1}^K\sum_{i=1}^N{\mathbb{1}_{y_i=c} \cdot \log \biggl(\frac{e^{w_{c} x_i }}{\sum_{j=1}^{K}e^{w_{j} x_i}} \biggr)} } \\ \end{align}

Where $\mathbb{1}_{y_i=c}$ is the indicator function, which takes the value 1 if $y_i=c$ and 0 otherwise.

When using an iterative gradient descent, the cost function is then defined as: \begin{align} J_w = - \sum_{c=1}^K\sum_{i=1}^N{\mathbb{1}_{y_i=c} \cdot \log \biggl(\frac{e^{w_{c} x_i }}{\sum_{j=1}^{K}e^{w_{j} x_i}}\biggr)} \end{align}
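
As a quick numerical check of these formulas, here is a minimal NumPy sketch (illustration only, not part of the models fitted below; the names softmax, crossEntropyCost, Wtmp, btmp, Xtmp and ytmp are ours) computing the softmax probabilities and the cost $J_w$:

# Softmax of a score matrix, row-wise (subtracting the max for numerical stability)
def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Negative log-likelihood cost J_w for a linear score X.W + b and integer labels y
def crossEntropyCost(W, b, X, y):
    p = softmax(X @ W + b)                                # shape (n, K)
    # The indicator 1_{y_i=c} selects, for each sample, the probability of its true class
    return -np.sum(np.log(p[np.arange(len(y)), y.astype(int)]))

# Tiny example: 3 samples, 2 features, 3 classes, zero weights => uniform probabilities of 1/3
Wtmp, btmp = np.zeros((2, 3)), np.zeros(3)
Xtmp = np.array([[0.1, -0.5], [0.3, 0.2], [-0.8, 0.4]])
ytmp = np.array([1, 2, 0])
print(crossEntropyCost(Wtmp, btmp, Xtmp, ytmp))           # 3 * log(3), about 3.296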


Using Keras with TensorFlow backend

Gradient descent with softmax output using Keras

Keras is used to build a single-layer perceptron with softmax activation.


The simplest linear model is based on a single layer with as many neurons as the number of classes (= 3). The activation of the layer is a softmax; the reference values (the training labels y) are integers in the range [0, 3).

In [7]:
# Number of epochs
nEpoch = 1024
nBatch = 128 # 32 is default
# Model
model0 = keras.models.Sequential([
    keras.layers.Dense(nClasses, activation=keras.activations.softmax,
                       kernel_regularizer=keras.regularizers.l1(0.001),
                       input_shape=[nFeatures])
])
model0.compile(optimizer='adam', loss=keras.losses.sparse_categorical_crossentropy, metrics=['accuracy'])
# Tensor board display
callbacks = []
if usingTensorBoard:
    ks = keras.callbacks.TensorBoard(log_dir="./logs/", histogram_freq=1, write_graph=True,
                                     write_grads=True, batch_size=1)
    callbacks = [ks]
# Fit
hist0 = model0.fit(xTrain, yTrain, epochs=nEpoch, batch_size=nBatch, verbose=0, callbacks=callbacks)
In [8]:
weights0, biases0 = model0.get_weights()
print('Est W=', weights0)
print('b=', biases0)
Est W= [[-2.8781772   3.9222672   3.9757454 ]
 [ 0.27168787 -5.429385    6.0552115 ]]
b= [-0.4286471   0.38129994  0.06073063]
In [9]:
# Parametric plot of the linear model
t = np.linspace(-1, 1)
markers = ('.', 'v', '4')
for i in range(0,3):
    plt.scatter(weights0[0, i] * t + biases0[i], weights0[1, i]*t, marker=markers[i], c=t, cmap="RdBu_r")
plt.legend(['Blue (0)', 'Red (1)', 'White (2)']);
plt.title('Parametric plot of the linear model, red = high, blue = low');
In [10]:
plt.figure(figsize=(15,4))
plt.subplot(1,3,1)
plt.plot(hist0.history['loss'])
plt.grid()
plt.title('Loss')
plt.subplot(1,3,2)
plt.plot(hist0.history['accuracy'])
plt.grid()
plt.title('Accuracy');

Testing the model

In [11]:
yEst0 = model0.predict(xTest)
labelEst0 = np.argmax(yEst0, axis=1)
In [12]:
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
flagPlot(xTest[:,0], xTest[:,1], labelEst0, 'Estimated', czechColors)
plt.subplot(1,3,2)
plt.hist(labelEst0, 10, density=True, alpha=0.5)
plt.title('Estimated histo')
plt.subplot(1,3,3)
flagPlot(xTest[:,0], xTest[:,1], yTest, 'Generated', czechColors)
In [13]:
plotConfusionMatrix(yTest, labelEst0, czechColors); 
In [14]:
print(skMetrics.classification_report(yTest, labelEst0)) 
              precision    recall  f1-score   support

         0.0       0.89      0.91      0.90       322
         1.0       0.92      0.94      0.93       339
         2.0       0.96      0.92      0.94       339

    accuracy                           0.92      1000
   macro avg       0.92      0.92      0.92      1000
weighted avg       0.92      0.92      0.92      1000

Conclusions on initial model

Fitting is very good: the separation of the classes is accurate, with about 92% accuracy on the test set.

This verifies that the Czech flag pattern is linearly separable (up to the added noise).
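
To make these boundaries visible, the class predicted by the trained model can be plotted on a dense grid (a small sketch reusing model0, flagPlot and czechColors defined above; the grid resolution is an arbitrary choice). The frontiers between the colored regions are straight lines:

# Sketch: decision regions of the linear model on a dense grid of the feature plane
gx, gy = np.meshgrid(np.linspace(-1, 0.5, 100), np.linspace(-1, 1, 100))
grid = np.column_stack((gx.ravel(), gy.ravel()))
gridLabels = np.argmax(model0.predict(grid), axis=1)
plt.figure(figsize=(6, 4))
flagPlot(grid[:,0], grid[:,1], gridLabels, 'Decision regions of the linear model', czechColors)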

Norway flag regression

Norway's flag is more challenging than the Czech one as it contains a double white and navy cross.

It is clearly not linearly separable. Actually, this class of problems, of which the simplest example is the XOR truth table, is one of the motivations for multi-layer neural networks.
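
As a side illustration of that remark (a standalone sketch, separate from the flag experiment; the layer sizes, epoch count and variable names are arbitrary choices of ours), a single softmax layer cannot classify all four XOR points correctly, while a small two-layer network usually can:

# Sketch: XOR truth table, the simplest problem that is not linearly separable
xXor = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
yXor = np.array([0, 1, 1, 0])
# Single (linear) softmax layer: at best 3 of the 4 points are classified correctly
linXor = keras.models.Sequential([
    keras.layers.Dense(2, activation=keras.activations.softmax, input_shape=[2])
])
linXor.compile(optimizer='adam', loss=keras.losses.sparse_categorical_crossentropy, metrics=['accuracy'])
linXor.fit(xXor, yXor, epochs=2000, verbose=0)
# Two-layer network: the hidden ReLU layer bends the decision boundary, usually reaching 100%
mlpXor = keras.models.Sequential([
    keras.layers.Dense(4, activation=keras.activations.relu, input_shape=[2]),
    keras.layers.Dense(2, activation=keras.activations.softmax)
])
mlpXor.compile(optimizer='adam', loss=keras.losses.sparse_categorical_crossentropy, metrics=['accuracy'])
mlpXor.fit(xXor, yXor, epochs=2000, verbose=0)
print('Linear model accuracy   :', linXor.evaluate(xXor, yXor, verbose=0)[1])
print('Two-layer model accuracy:', mlpXor.evaluate(xXor, yXor, verbose=0)[1])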

Data model for the Norway flag

The geometry is adjusted so that the three categories are quasi-equiprobable.

In [15]:
# Norway flag colors
norwayColors = ['red', 'white', 'navy']

# Generate a multi class sample based on the Norway banner geometry
def generateNorwayBatch(n, noiseSigma=0.1):
    # xMax adjusted such that the 3 classes are quasi equiprobable
    xMin = np.array([-1, -1])
    xMax = np.array([1, 1])
    #
    X = np.random.uniform(xMin, xMax, (n, nFeatures))
    a1 = 0.43
    a2 = 0.18
    noisyX = X + np.random.normal(0, noiseSigma, X.shape)
    y = np.zeros(n) # Red = background
    y[((noisyX[:,0] > -a1) & (noisyX[:,0] < a1)) | ((noisyX[:,1] > -a1) & (noisyX[:,1] < a1))] = 1 # White cross
    y[((noisyX[:,0] > -a2) & (noisyX[:,0] < a2)) | ((noisyX[:,1] > -a2) & (noisyX[:,1] < a2))] = 2 # Navy cross over white
    return X, y
In [16]:
N = 1000
xTrain1, yTrain1 = generateNorwayBatch(N)
plt.figure(figsize=(15,4))
plt.subplot(1,2,1)
flagPlot(xTrain1[:,0], xTrain1[:,1], yTrain1, 'Generated', norwayColors)
plt.subplot(1,2,2)
plt.hist(yTrain1, 10, density=True, alpha=0.5)
plt.title('Generated histogram');

Test data

In [17]:
xTest1, yTest1 = generateNorwayBatch(N) 

Single layer model

In [18]:
nEpoch = 128
nBatch = 128
# Model
model1 = keras.models.Sequential([
    keras.layers.Dense(nClasses, activation=keras.activations.softmax, input_shape=[nFeatures],
                       kernel_regularizer=keras.regularizers.l1(0.001))
])
model1.compile(optimizer='adam', loss=keras.losses.sparse_categorical_crossentropy, metrics=['accuracy'])
# Fit single layer model on Norway data
hist1 = model1.fit(xTrain1, yTrain1, epochs=nEpoch, batch_size=nBatch, verbose=0, callbacks=callbacks)
In [19]:
weights1, biases1 = model1.get_weights()
print('Est W=', weights1)
print('b=', biases1)
Est W= [[-0.38265452 -0.40274367 -0.5217999 ]
 [ 0.08239781 -0.30017456  0.03173269]]
b= [-0.025961    0.01384929  0.00867589]
In [20]:
plt.figure(figsize=(15,4))
plt.subplot(1,3,1)
plt.plot(hist1.history['loss'])
plt.grid()
plt.title('Loss')
plt.subplot(1,3,2)
plt.plot(hist1.history['accuracy'])
plt.grid()
plt.title('Accuracy');

Testing the model

In [21]:
yEst1 = model1.predict(xTest1)
labelEst1 = np.argmax(yEst1, axis=1)
In [22]:
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
flagPlot(xTest1[:,0], xTest1[:,1], labelEst1, 'Estimated', norwayColors)
plt.subplot(1,3,2)
plt.hist(labelEst1, 10, density=True, alpha=0.5)
plt.title('Estimated histo')
plt.subplot(1,3,3)
flagPlot(xTest1[:,0], xTest1[:,1], yTest1, 'Generated', norwayColors)

As expected, the linear model is not able to separate the classes.
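
For comparison with the deeper models below, the same scoring helpers can be applied to this single-layer model (a short sketch reusing plotConfusionMatrix, labelEst1 and yTest1 defined above):

# Confusion matrix and classification report of the single-layer model on the Norway test data
plotConfusionMatrix(yTest1, labelEst1, norwayColors)
print(skMetrics.classification_report(yTest1, labelEst1))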

Two layer model on Norway

In [23]:
nEpoch = 512
nBatch = 128
# Model
model2 = keras.models.Sequential([
    keras.layers.Dense(16, activation=keras.activations.relu, input_shape=[nFeatures],
                       kernel_regularizer=keras.regularizers.l1(0.001),
                       bias_regularizer=keras.regularizers.l1(0.001)),
    keras.layers.Dense(nClasses, activation=keras.activations.softmax,
                       kernel_regularizer=keras.regularizers.l1(0.0001))
])
model2.compile(optimizer='adam', loss=keras.losses.sparse_categorical_crossentropy, metrics=['accuracy'])
# Fit two layer model on Norway data
hist2 = model2.fit(xTrain1, yTrain1, epochs=nEpoch, batch_size=nBatch, verbose=0, callbacks=callbacks)
In [24]:
plt.figure(figsize=(15,4))
plt.subplot(1,3,1)
plt.plot(hist2.history['loss'])
plt.grid()
plt.title('Loss')
plt.subplot(1,3,2)
plt.plot(hist2.history['accuracy'])
plt.grid()
plt.title('Accuracy');

Testing the model

In [25]:
yEst2 = model2.predict(xTest1)
labelEst2 = np.argmax(yEst2, axis=1)
In [26]:
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
flagPlot(xTest1[:,0], xTest1[:,1], labelEst2, 'Estimated', norwayColors)
plt.subplot(1,3,2)
plt.hist(labelEst2, 10, density=True, alpha=0.5)
plt.title('Estimated histo')
plt.subplot(1,3,3)
flagPlot(xTest1[:,0], xTest1[:,1], yTest1, 'Generated', norwayColors)
In [27]:
plotConfusionMatrix(yTest1, labelEst2, norwayColors); 

The two-layer neural network improves on the single-layer (linear) model.

Accuracy is greatly improved, ranging from roughly 60% to 95% depending on the fitting convergence.
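
That spread can be observed by refitting the same architecture several times with different random initializations (a rough sketch; fitTwoLayerOnce is a hypothetical helper of ours and the number of trials is arbitrary):

# Sketch: refit the two-layer architecture a few times to observe the spread of the test accuracy
def fitTwoLayerOnce():
    m = keras.models.Sequential([
        keras.layers.Dense(16, activation=keras.activations.relu, input_shape=[nFeatures],
                           kernel_regularizer=keras.regularizers.l1(0.001),
                           bias_regularizer=keras.regularizers.l1(0.001)),
        keras.layers.Dense(nClasses, activation=keras.activations.softmax,
                           kernel_regularizer=keras.regularizers.l1(0.0001))
    ])
    m.compile(optimizer='adam', loss=keras.losses.sparse_categorical_crossentropy, metrics=['accuracy'])
    m.fit(xTrain1, yTrain1, epochs=nEpoch, batch_size=nBatch, verbose=0)
    return m.evaluate(xTest1, yTest1, verbose=0)[1]

print([round(fitTwoLayerOnce(), 3) for _ in range(5)])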

In [28]:
print(skMetrics.classification_report(yTest1, labelEst2)) 
              precision    recall  f1-score   support

         0.0       0.79      0.83      0.81       343
         1.0       0.58      0.51      0.54       336
         2.0       0.72      0.78      0.75       321

    accuracy                           0.70      1000
   macro avg       0.70      0.71      0.70      1000
weighted avg       0.70      0.70      0.70      1000

Three layer neural network on Norway

In [29]:
nEpoch = 512
nBatch = 128
# Model
model3 = keras.models.Sequential([
    keras.layers.Dense(8, activation=keras.activations.linear, input_shape=[nFeatures],
                       kernel_regularizer=keras.regularizers.l1(0.001),
                       bias_regularizer=keras.regularizers.l1(0.001)),
    keras.layers.Activation(keras.activations.relu),
    keras.layers.Dense(8, activation=keras.activations.linear,
                       kernel_regularizer=keras.regularizers.l1(0.001),
                       bias_regularizer=keras.regularizers.l1(0.001)),
    keras.layers.Activation(keras.activations.relu),
    keras.layers.Dense(nClasses, activation=keras.activations.softmax)
])
model3.compile(optimizer='adam', loss=keras.losses.sparse_categorical_crossentropy, metrics=['accuracy'])
# Fit on Norway data
hist3 = model3.fit(xTrain1, yTrain1, epochs=nEpoch, batch_size=nBatch, verbose=0, callbacks=callbacks)
In [30]:
plt.figure(figsize=(15,4))
plt.subplot(1,3,1)
plt.plot(hist3.history['loss'])
plt.grid()
plt.title('Loss')
plt.subplot(1,3,2)
plt.plot(hist3.history['accuracy'])
plt.grid()
plt.title('Accuracy');

Testing the model

In [31]:
yEst3 = model3.predict(xTest1)
labelEst3 = np.argmax(yEst3, axis=1)
In [32]:
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
flagPlot(xTest1[:,0], xTest1[:,1], labelEst3, 'Estimated', norwayColors)
plt.subplot(1,3,2)
plt.hist(labelEst3, 10, density=True, alpha=0.5)
plt.title('Estimated histo')
plt.subplot(1,3,3)
flagPlot(xTest1[:,0], xTest1[:,1], yTest1, 'Generated', norwayColors)
In [33]:
plotConfusionMatrix(yTest1, labelEst3, norwayColors); 
In [34]:
print(skMetrics.classification_report(yTest1, labelEst3)) 
              precision    recall  f1-score   support

         0.0       0.85      0.85      0.85       343
         1.0       0.68      0.61      0.65       336
         2.0       0.78      0.86      0.82       321

    accuracy                           0.77      1000
   macro avg       0.77      0.77      0.77      1000
weighted avg       0.77      0.77      0.77      1000

The three-layer neural network further improves the accuracy and F1-score.

A large added value of the (deeper) neural net is shown... at the expense of a more complex model:

In [35]:
weights3 = model3.get_weights()
model3.summary()
Model: "sequential_3" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense_4 (Dense) (None, 8) 24 _________________________________________________________________ activation (Activation) (None, 8) 0 _________________________________________________________________ dense_5 (Dense) (None, 8) 72 _________________________________________________________________ activation_1 (Activation) (None, 8) 0 _________________________________________________________________ dense_6 (Dense) (None, 3) 27 ================================================================= Total params: 123 Trainable params: 123 Non-trainable params: 0 _________________________________________________________________ 
In [36]:
t = np.linspace(-1, 1)
width3_1 = len(weights3[1])
for i in range(0,width3_1):
    plt.scatter(weights3[0][0, i] * t + weights3[1][i], weights3[0][1, i] * t, marker='.', c=t, cmap="RdBu_r")
plt.title('First stage weights parametric view ({:d} components)'.format(width3_1));

The direction graph above shows that the two main directions (along the axes) are identified by the learnt features of the neural network, using pairs of neurons (one for each direction). Some other neurons are used within the top-left quadrant.

In [37]:
weights3[0] 
Out[37]:
array([[-1.6474309e+00,  4.1438375e-02, -1.1377184e-01,  1.5975128e+00,
         4.5577183e-01, -1.3892711e+00,  5.1281884e-02,  1.7773986e+00],
       [ 3.3753177e-03,  1.7328085e+00, -1.8991382e+00,  5.9235338e-02,
         2.3455155e+00,  8.3160086e-04, -5.8959264e-01,  1.4065911e-01]],
      dtype=float32)
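
The same observation can be made numerically from the raw weights above (a small sketch; the angle and norm computation is ours, for illustration): the directions of the first-layer weight vectors cluster around the two axes.

# Sketch: direction (angle in degrees) and norm of each first-layer weight vector
w0 = weights3[0]                                  # kernel of the first Dense layer, shape (nFeatures, 8)
angles = np.degrees(np.arctan2(w0[1], w0[0]))     # angle of each column vector in the feature plane
norms = np.linalg.norm(w0, axis=0)
print(pandas.DataFrame({'angle (deg)': np.round(angles, 1), 'norm': np.round(norms, 2)}))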

Where to go from here

  • Same multi-class classification problem solved with Decision Trees (HTML / Jupyter)