Local approximation of a univariate quartic expression using linear regression

Learning goals:

  • Mathematical statement of linear regression using the simplest data: a univariate continuous function (single feature)
  • Simple implementation using the closed-form solution
  • Definition of the usual metrics
  • Move on to existing libraries from SciPy or scikit-learn to perform linear fitting and to analyze the model
  • Increase the capacity of the model with powers of the features (polynomial)
  • Understand iterative methods based on gradient descent, including the stochastic gradient descent
In [1]:
import math
import numpy as np
from numpy import random
import matplotlib.pyplot as plt
from sklearn import metrics as skMetrics
from sklearn.linear_model import LinearRegression as skLinearRegression
import scipy as sy
import scipy.stats  # make sure scipy.stats is loaded for sy.stats.linregress below
import sympy
import pandas

Generator Model

In [2]:
def generateBatch(N, stochastic=False):
    # Domain of the feature
    xMin = 0
    xMax = 0.5
    # Intercept and noise standard deviation
    b = 0.35
    std = 0.01
    # Random or regularly spaced samples
    if stochastic:
        x = random.uniform(xMin, xMax, N)
    else:
        x = np.linspace(xMin, xMax, N)
    yClean = x**4 + (x - 0.3)**3 + b
    y = yClean + random.normal(0, std, N)
    return (x, y, yClean)

Train data

In [3]:
xs = sympy.Symbol('x')
sympy.expand(xs**4 + (xs - 0.3)**3 + 0.35)
Out[3]:
$\displaystyle x^{4} + x^{3} - 0.9 x^{2} + 0.27 x + 0.323$
In [4]:
N = 100000
xTrain, yTrain, yClean = generateBatch(N)
plt.plot(xTrain, yTrain, xTrain, yClean);
plt.title('Training data');

Test data

In [5]:
xTest, yTest, yTestClean = generateBatch(N) 

Closed form / analytical solution

Linear, 1st degree approximation of y: \begin{align} y = x w + b \end{align}

Given $N_{feature} = 1$:

  • $x$ is a $N_{sample}$ vector
  • $w$ is a scalar
  • $y$ is a $N_{sample}$ vector
  • $b$ is a scalar

Using the mean squared error (Euclidean norm), we are looking for $\hat{\theta} = \{\hat{b}, \hat{w}\}$ such that:

\begin{align} \hat{\theta} \in \underset{\theta}{\operatorname{argmin}} \lVert x w + b - y \rVert_2^2 \end{align}\begin{align} g(\theta) & = \lVert x w + b - y \rVert_2^2 \\ & = \sum_{i=0}^n \left(x_i^2 w^2 + y_i^2 + b^2 + 2x_iwb - 2x_iwy_i - 2y_ib \right)\\ \end{align}

Look for the minimum through the partial derivatives:

\begin{cases} \frac{\partial{g(\theta)}}{\partial b} = \sum_{i=0}^n 2(x_iw + b - y_i) \\ \frac{\partial{g(\theta)}}{\partial w} = \sum_{i=0}^n 2(x_iw + b - y_i)x_i \end{cases}
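As a quick symbolic check of these derivatives (not part of the original notebook, the variable names are illustrative), sympy can differentiate a single term of $g(\theta)$:

import sympy

# Symbolic check of the partial derivatives of one term of g(theta)
xi, yi, wv, bv = sympy.symbols('x_i y_i w b')
g_i = (xi * wv + bv - yi)**2
print(sympy.factor(sympy.diff(g_i, bv)))   # 2*(b + w*x_i - y_i)
print(sympy.factor(sympy.diff(g_i, wv)))   # 2*x_i*(b + w*x_i - y_i)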

Let's define for any variable z: \begin{align} \overline{z_n} & = \frac 1n \sum_{i=0}^n z_i \end{align}

Then: \begin{align} & \begin{cases} \frac{\partial{g(\theta)}}{\partial b} = 0 \\ \frac{\partial{g(\theta)}}{\partial w} = 0 \end{cases} \\ \iff & \begin{cases} \overline{x_n} w + b = \overline{y_n} & Eq1\\ \overline{x_n^2 }w + \overline{x_n} b = \overline{x_n y_n} & Eq2\\ \end{cases} \end{align}

$Eq2 - Eq1 \overline{x_n}$ :

\begin{align} & w \left( \overline{x_n^2} - \overline{x_n}^2 \right) = \overline{x_n y_n} - \overline{x_n}.\overline{y_n}\\ \end{align}

Leading to:

\begin{align} w &= \frac{\overline{x_n y_n} - \overline{x_n}.\overline{y_n}}{\overline{x_n^2} - \overline{x_n}^2}\\ &= \frac{\sum_{i=0}^n \left(x_i - \overline{x_n} \right) \left(y_i - \overline{y_n} \right)}{\sum_{i=0}^n \left(x_i - \overline{x_n} \right)^2 } \\ b &= \frac 1n \sum_{i=0}^n \left(y_i - x_i w\right) = \overline{y_n} - \overline{x_n} w \\ \end{align}
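A minimal numeric sanity check of these formulas (not in the original notebook): on a small random sample, the closed-form $w$ and $b$ should coincide with numpy's 1st degree polynomial fit. The variable names below are illustrative only.

import numpy as np

# Sanity check: closed-form slope and intercept vs np.polyfit
rng = np.random.default_rng(0)
xChk = rng.uniform(0, 0.5, 50)
yChk = xChk**4 + (xChk - 0.3)**3 + 0.35 + rng.normal(0, 0.01, 50)

wChk = (np.mean(xChk * yChk) - np.mean(xChk) * np.mean(yChk)) / (np.mean(xChk**2) - np.mean(xChk)**2)
bChk = np.mean(yChk) - np.mean(xChk) * wChk

print(wChk, bChk)
print(np.polyfit(xChk, yChk, 1))   # [slope, intercept], should match closely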

Remove biases (center the data)

In [6]:
xUnB = xTrain - np.mean(xTrain)
yUnB = yTrain - np.mean(yTrain)
In [7]:
# In case x is univariate, the matrix product x^T x is a scalar
wEst = np.dot(xUnB, yUnB) / np.dot(xUnB, xUnB)
bEst = np.mean(yTrain - wEst * xTrain)
print('Linear regression estimate: y = {:.3f} x + {:.3f}'.format(wEst, bEst))
Linear regression estimate: y = 0.145 x + 0.323 

Test model

In [8]:
yEst1 = wEst * xTest + bEst 
In [9]:
plt.plot(xTest, yTest, xTest, yEst1);
res = yTest - yEst1
mse1 = np.dot(res, res) / N
print('Closed form, MSE = {:.3e}'.format(mse1));
Closed form, MSE = 1.502e-04 
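For reference, the mean squared error computed above, and used as the metric in the rest of the notebook, is defined as:

\begin{align} MSE = \frac 1N \sum_{i=1}^{N} \left(y_i - \hat{y}_i \right)^2 \end{align}

where $\hat{y}_i$ is the model prediction for sample $i$.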

Numpy polyfit, 1st degree

In [10]:
fit2 = np.polyfit(xTrain, yTrain, 1)
fit2
Out[10]:
array([0.14488612, 0.32303991])

Test model

In [11]:
yEst2 = fit2[0] * xTest + fit2[1] 
In [12]:
plt.plot(xTest, yTest, xTest, yEst2);
mse2 = skMetrics.mean_squared_error(yTest, yEst2)
print('Numpy polyfit 1st degree, MSE =', "{:.3e}".format(mse2));
Numpy polyfit 1st degree, MSE = 1.502e-04 

Numpy polyfit, 4th degree

In [13]:
fit3 = np.polyfit(xTrain, yTrain, 4)
fit3
Out[13]:
array([ 1.03565332, 0.96438978, -0.88586569, 0.26726443, 0.3231845 ])

Test model

In [14]:
# Evaluate the 4th degree polynomial (coefficients from highest to lowest degree)
yEst3 = xTest**4 * fit3[0] + xTest**3 * fit3[1] + xTest**2 * fit3[2] + xTest * fit3[3] + fit3[4]
In [15]:
plt.plot(xTest, yTestClean, xTest, yEst3);
mse3 = skMetrics.mean_squared_error(yTest, yEst3)
plt.legend(['ori', 'fitted'])
print('Numpy polyfit 4th degree, MSE =', "{:.3e}".format(mse3));
Numpy polyfit 4th degree, MSE = 1.003e-04

Numpy least squares

In [16]:
# rcond=None avoids the FutureWarning about the default rcond value
fit4, residues, rank, s = np.linalg.lstsq(np.reshape(xUnB, (N, 1)), yUnB, rcond=None)
fit4
Out[16]:
array([0.14488612])

Test model

In [17]:
yEst4 = fit4 * xTest + bEst 
In [18]:
plt.plot(xTest, yTest, xTest, yEst4);
mse4 = skMetrics.mean_squared_error(yTest, yEst4)
print('Numpy Least square, MSE =', "{:.3e}".format(mse4));
Numpy Least square, MSE = 1.502e-04 

Scipy linregress

In [19]:
fit5 = sy.stats.linregress(xTrain, yTrain)
fit5
Out[19]:
LinregressResult(slope=0.14488612244983226, intercept=0.3230399081789981, rvalue=0.8633622269474659, pvalue=0.0, stderr=0.0002677762899562608)

Test model

In [20]:
yEst5 = fit5.slope * xTest + fit5.intercept 
In [21]:
plt.plot(xTest, yTest, xTest, yEst5)
mse5 = skMetrics.mean_squared_error(yTest, yEst5)
print('Scipy, MSE =', "{:.3e}".format(mse5));
Scipy, MSE = 1.502e-04 

SK Learn linear regression

In [22]:
model6 = skLinearRegression()  # the 'normalize' parameter (default False) was removed in recent scikit-learn
model6.fit(xTrain.reshape(-1, 1), yTrain.reshape(-1, 1))
print('SciKit Learn linear regression, b =', model6.intercept_[0], ', w =', model6.coef_[0][0])
SciKit Learn linear regression, b = 0.323039908178998 , w = 0.1448861224498326 
In [23]:
print('SciKit Learn, R^2-score =', model6.score(xTest.reshape(-1,1), yTest)) 
SciKit Learn, R^2-score = 0.7448545267081639 
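For reference, the score returned by scikit-learn's `score` method is the coefficient of determination:

\begin{align} R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i \right)^2}{\sum_i \left(y_i - \overline{y} \right)^2} \end{align}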

Test model

In [24]:
yEst6 = model6.predict(xTest.reshape(-1,1)) 
In [25]:
plt.plot(xTest, yTest, xTest, yEst6)
mse6 = skMetrics.mean_squared_error(yTest, yEst6)
print('SciKit Learn, MSE =', "{:.3e}".format(mse6));
SciKit Learn, MSE = 1.502e-04 

Gradient descent

Let's first plot the MSE for several values of the slope in order to verify the convex shape of the cost function.

In [26]:
Ns = 100
slope = np.linspace(-1, 1, Ns)
sx = np.matmul(np.reshape(xUnB, (N, 1)), np.reshape(slope, (1, Ns)))
er_sx_y = sx - np.reshape(yUnB, (N, 1))
mse = np.mean(er_sx_y.T**2, axis=1)
In [27]:
plt.plot(slope, mse)
plt.xlabel('w')
plt.ylabel('MSE')
plt.title('Min MSE = %.3e' % np.min(mse))
plt.grid();

The plot confirms the convexity of the MSE as a function of the linear regression slope (the $w$ coefficient).
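This is expected from the partial derivative computed above: the second derivative of the cost with respect to $w$ is non-negative, hence the cost is convex in $w$:

\begin{align} \frac{\partial^2{g(\theta)}}{\partial w^2} = \sum_{i=0}^n 2 x_i^2 \geq 0 \end{align}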

Gradient descent computation

In [28]:
def calcGradient(X, Y, b, w):
    # Residuals of the current linear model
    A = w * X + b - Y
    # Partial derivatives of the cost (up to the factor 2, absorbed into the learning rate) and their norm
    F_db = np.sum(A)
    F_dw = np.dot(A, X)
    return (F_db, F_dw, math.sqrt(F_db**2 + F_dw**2))
In [29]:
# Initial coef
b7, w7 = 1, 1
threshold = 1e-1
learningRate = 1e-6
nIterMax = 1e5
# Init
gradient_b, gradient_w, gradientNorm = calcGradient(xTrain, yTrain, b7, w7)
print('START: b = %.3e' % b7, ', w = %.3e' % w7, ', Gradient norm = %.3e' % gradientNorm)
w7Learn = [np.array([b7, w7, gradientNorm])]
nIter = 0
while (gradientNorm > threshold) & (nIter < nIterMax):
    b7 = b7 - learningRate * gradient_b
    w7 = w7 - learningRate * gradient_w
    gradient_b, gradient_w, gradientNorm = calcGradient(xTrain, yTrain, b7, w7)
    w7Learn.append(np.array([b7, w7, gradientNorm]))
    nIter += 1
print('END : b = %.3e' % b7, ', w = %.3e' % w7, ', Gradient norm = %.3e' % gradientNorm, ', num iteration =', len(w7Learn))
df7 = pandas.DataFrame(w7Learn, columns=('b', 'w', 'Gradient norm'))
START: b = 1.000e+00 , w = 1.000e+00 , Gradient norm = 9.226e+04
END : b = 3.230e-01 , w = 1.449e-01 , Gradient norm = 9.981e-02 , num iteration = 4832
In [30]:
fig = plt.figure(figsize=(16, 12))
plt.subplot(2, 2, 1)
plt.plot(df7['b'])
plt.grid()
plt.title('b');
plt.subplot(2, 2, 2)
plt.plot(df7['w'])
plt.grid()
plt.title('w');
plt.subplot(2, 2, 3)
plt.semilogy(df7['Gradient norm'])
plt.grid()
plt.title('Gradient norm');
In [31]:
fig = plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
plt.semilogy(df7['b'], df7['Gradient norm'])
plt.grid()
plt.xlabel('b')
plt.ylabel('gradient norm');
plt.title('Trajectory of b and Gradient norm');
plt.subplot(1, 2, 2)
plt.semilogy(df7['w'], df7['Gradient norm'])
plt.grid()
plt.xlabel('w')
plt.ylabel('gradient norm');
plt.title('Trajectory of w and Gradient norm');

Test model

In [32]:
yEst7 = w7*xTest + b7 
In [33]:
plt.plot(xTest, yTest, xTest, yEst7);
mse7 = skMetrics.mean_squared_error(yTest, yEst7)
print('Gradient descent, MSE =', '{:.3e}'.format(mse7));
Gradient descent, MSE = 1.502e-04 

Stochastic gradient descent

Using a new batch of data at each iteration.

Alternatively, on a finite data set, shuffle the samples between epochs (a minimal sketch of this variant follows).
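The sketch below illustrates the shuffle-based variant on the fixed training set, reusing calcGradient, xTrain and yTrain defined above. The epoch count, batch size and learning rate are arbitrary illustrative values; this is not the approach used in the next cell, which draws a fresh random batch at each step.

# Sketch: epoch-based minibatch SGD over the fixed training set, with shuffling
nEpochs = 5
batchSize = 100
learningRateS = 1e-3
bS, wS = 2.0, 2.0
for epoch in range(nEpochs):
    perm = random.permutation(len(xTrain))        # reshuffle the samples at each epoch
    for start in range(0, len(xTrain), batchSize):
        idx = perm[start:start + batchSize]
        gb, gw, _ = calcGradient(xTrain[idx], yTrain[idx], bS, wS)
        bS -= learningRateS * gb
        wS -= learningRateS * gw
print('Shuffled minibatch SGD: b = %.3f, w = %.3f' % (bS, wS))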

In [34]:
nBatch = 100
b8, w8 = 2, 2
threshold = 1e-3
learningRate = 1e-3
nIterMax = 1e5
# Initial batch
xBatch, yBatch, yBC = generateBatch(nBatch, True)
gradient_b, gradient_w, gradientNorm = calcGradient(xBatch, yBatch, b8, w8)
print('START: b = %.3e' % b8, ', w = %.3e' % w8, ', Gradient norm = %.3e' % gradientNorm)
w8Learn = [np.array([b8, w8, gradientNorm])]
# Continue
nIter = 0
while (gradientNorm > threshold) & (nIter < nIterMax):
    b8 = b8 - learningRate * gradient_b
    w8 = w8 - learningRate * gradient_w
    xBatch, yBatch, yBC = generateBatch(nBatch, True)
    gradient_b, gradient_w, gradientNorm = calcGradient(xBatch, yBatch, b8, w8)
    w8Learn.append(np.array([b8, w8, gradientNorm]))
    learningRate = learningRate * 0.9999  # Decreasing learning rate
    nIter += 1
print('END : b = %.3e' % b8, ', w = %.3e' % w8, ', Gradient norm = %.3e' % gradientNorm, ', num iteration =', len(w8Learn))
df8 = pandas.DataFrame(w8Learn, columns=('b', 'w', 'Gradient norm'))
START: b = 2.000e+00 , w = 2.000e+00 , Gradient norm = 2.230e+02
END : b = 3.219e-01 , w = 1.492e-01 , Gradient norm = 9.648e-04 , num iteration = 3495
In [35]:
fig = plt.figure(figsize=(16, 12))
plt.subplot(2, 2, 1)
plt.plot(df8['b'])
plt.grid()
plt.title('b');
plt.subplot(2, 2, 2)
plt.plot(df8['w'])
plt.grid()
plt.title('w');
plt.subplot(2, 2, 3)
plt.semilogy(df8['Gradient norm'])
plt.grid()
plt.title('Gradient norm');

The gradient norm becomes very noisy once its value is below $10^{-1}$. A more adaptive learning rate would be needed there, together with an average of the gradient norm to improve the stopping condition of the gradient descent (a sketch follows).
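One possible mitigation, sketched below with arbitrary illustrative values for the smoothing factor and the threshold: replace the raw gradient norm in the stopping test by an exponential moving average (EMA), which filters out the batch-to-batch noise. Note that the threshold must then stay above the noise floor of the batch-gradient norm, since the averaged norm does not reach the very small values a single lucky batch can.

# Sketch: rerun the stochastic descent, stopping on an EMA-smoothed gradient norm
beta = 0.9               # EMA smoothing factor (arbitrary choice)
thresholdE = 0.2         # must stay above the noise floor of the batch-gradient norm
learningRateE = 1e-3
bE, wE = 2, 2
xB, yB, _ = generateBatch(nBatch, True)
gb, gw, gn = calcGradient(xB, yB, bE, wE)
emaNorm = gn             # initialize the EMA with the first norm
nIter = 0
while (emaNorm > thresholdE) & (nIter < nIterMax):
    bE -= learningRateE * gb
    wE -= learningRateE * gw
    xB, yB, _ = generateBatch(nBatch, True)
    gb, gw, gn = calcGradient(xB, yB, bE, wE)
    emaNorm = beta * emaNorm + (1 - beta) * gn   # smoothed stopping signal
    learningRateE *= 0.9999                      # decreasing learning rate, as above
    nIter += 1
print('EMA-stopped SGD: b = %.3f, w = %.3f, smoothed norm = %.3f' % (bE, wE, emaNorm))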

In [36]:
fig = plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
plt.semilogy(df8['b'], df8['Gradient norm'])
plt.grid()
plt.xlabel('b')
plt.ylabel('gradient norm');
plt.title('Trajectory of b and Gradient norm');
plt.subplot(1, 2, 2)
plt.semilogy(df8['w'], df8['Gradient norm'])
plt.grid()
plt.xlabel('w')
plt.ylabel('gradient norm');
plt.title('Trajectory of w and Gradient norm');

Test model

In [37]:
yEst8 = w8*xTest + b8 
In [38]:
plt.scatter(xBatch, yBatch, marker='.', color='black');
plt.plot(xTrain, yClean)
plt.plot(xTest, yEst8)
plt.legend(['Generator model', 'Estimated model', 'Last batch'])
mse8 = skMetrics.mean_squared_error(yTest, yEst8)
print('Stochastic gradient descent, MSE =', "{:.3e}".format(mse8));
Stochastic gradient descent, MSE = 1.506e-04 
In [39]:
from sklearn.linear_model import SGDRegressor as skSGDRegressor
# Note: in recent scikit-learn versions the squared loss is named 'squared_error'
model9 = skSGDRegressor(alpha=0.0001, average=False, early_stopping=True, epsilon=0.1,
                        eta0=0.0, fit_intercept=True, learning_rate='optimal',
                        loss='squared_loss', max_iter=1000, n_iter_no_change=5,
                        penalty='l2', power_t=0.5, random_state=None, shuffle=True,
                        tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
# l1_ratio=0.15,

Notes:

  • The regularizer is called 'penalty' and is parameterized by 'alpha' (and 'l1_ratio'), see the illustrative snippet after this list
  • Early stopping is available and parameterized by 'early_stopping', 'max_iter', 'tol' and 'n_iter_no_change'
  • Shuffling between epochs is enabled by 'shuffle'
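For instance, an elastic-net regularized variant mixing L1 and L2 penalties could be configured as below; the values are illustrative only and this model is not used in the rest of the notebook.

# Illustrative only: elastic-net regularization mixing L1 and L2 penalties
model9b = skSGDRegressor(loss='squared_loss', penalty='elasticnet',
                         alpha=1e-4, l1_ratio=0.15, max_iter=1000, tol=1e-3)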
In [40]:
model9.fit(xTrain.reshape(-1, 1), yTrain)
print('Y = {0} X + {1}'.format(model9.coef_, model9.intercept_))
Y = [0.14393348] X + [0.32146107] 
In [41]:
yEst9 = model9.predict(xTest.reshape(-1,1)) 
In [42]:
plt.plot(xTrain, yClean)
plt.plot(xTest, yEst9)
plt.legend(['Generator model', 'Estimated model'])
mse9 = skMetrics.mean_squared_error(yTest, yEst9)
print('Gradient descent with SK Learn, MSE =', "{:.3e}".format(mse9));
Gradient descent with SK Learn, MSE = 1.536e-04 

Main take-aways

Linear regression has a closed-form solution leading to the best fit in the least-squares sense. Many Python libraries provide this linear or polynomial fitting.

We have also learned gradient descent on this simple case; it will be very useful for upcoming projects based on neural networks.

Where to go from here?

Other single-feature linear implementation using TensorFlow (Notebook)

More complex bivariate models using "raw" Python (Notebook), up to gradient descent with a regularizer, or using Keras (Notebook)

Compare with the single-feature binary classification using logistic regression, in "raw" Python or with libraries (Notebook)