Local approximation of a univariate polynomial expression using linear regression¶
Learning goals:¶
- Mathematical statement of linear regression using the simplest data: a univariate continuous function (single feature)
- Simple implementation using the closed-form solution
- Definition of the usual metrics (recalled right after this list)
- Move on to existing libraries from SciPy or SciKit Learn to perform the linear fit and analyze the model
- Increase the capacity of the model with powers of the feature (polynomial regression)
- Understand iterative methods based on gradient descent, including the stochastic gradient descent
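For reference (a reminder added here, since the formulas are used but not restated later), the two usual metrics relied on below are the mean squared error and the coefficient of determination:
\begin{align} MSE(y, \hat{y}) &= \frac 1n \sum_{i=1}^n \left(y_i - \hat{y}_i\right)^2 \\ R^2(y, \hat{y}) &= 1 - \frac{\sum_{i=1}^n \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^n \left(y_i - \overline{y_n}\right)^2} \end{align}The MSE is computed below either directly or with sklearn.metrics.mean_squared_error; the $R^2$ score is what the score method of the SciKit Learn regressors returns.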
import math
import numpy as np
from numpy import random
import matplotlib.pyplot as plt
from sklearn import metrics as skMetrics
from sklearn.linear_model import LinearRegression as skLinearRegression
import scipy as sy
import scipy.stats   # makes sy.stats available for the linear regression section below
import sympy
import pandas
Generator Model¶
def generateBatch(N, stochastic = False):
    """Generate N samples of y = x^4 + (x - 0.3)^3 + 0.35 plus Gaussian noise."""
    xMin = 0
    xMax = 0.5
    b = 0.35
    std = 0.01
    if stochastic:
        x = random.uniform(xMin, xMax, N)
    else:
        x = np.linspace(xMin, xMax, N)
    yClean = x**4 + (x - 0.3)**3 + b
    y = yClean + random.normal(0, std, N)
    return (x, y, yClean)
Train data¶
xs = sympy.Symbol('x')
sympy.expand(xs**4 + (xs - 0.3)**3 + 0.35)

N = 100000
xTrain, yTrain, yClean = generateBatch(N)
plt.plot(xTrain, yTrain, xTrain, yClean);
plt.title('Training data');
Test data¶
xTest, yTest, yTestClean = generateBatch(N)
Closed form / analytically¶
Linear, 1st degree approximation of y: \begin{align} y = x w + b \end{align}
Given $N_{feature} = 1$:
- $x$ is a vector of size $N_{sample}$
- $w$ is a scalar
- $y$ is a vector of size $N_{sample}$
- $b$ is a scalar
Using the mean squared error (squared Euclidean norm), we are looking for $\hat{\theta} = \{\hat{b}, \hat{w}\}$ such that:
\begin{align} \hat{\theta} \in \underset{\theta}{\arg\min} \lVert x w + b - y \rVert_2^2 \end{align}\begin{align} g(\theta) & = \lVert x w + b - y \rVert_2^2 \\ & = \sum_{i=1}^n \left(x_i^2 w^2 + y_i^2 + b^2 + 2x_iwb - 2x_iwy_i - 2y_ib \right)\\ \end{align}Let's look for the minimum through the partial derivatives:
\begin{cases} \frac{\partial{g(\theta)}}{\partial b} = \sum_{i=1}^n 2(x_iw + b - y_i) \\ \frac{\partial{g(\theta)}}{\partial w} = \sum_{i=1}^n 2(x_iw + b - y_i)x_i \end{cases}Let's define, for any variable $z$: \begin{align} \overline{z_n} & = \frac 1n \sum_{i=1}^n z_i \end{align}
Then: \begin{align} & \begin{cases} \frac{\partial{g(\theta)}}{\partial b} = 0 \\ \frac{\partial{g(\theta)}}{\partial w} = 0 \end{cases} \\ \iff & \begin{cases} \overline{x_n} w + b = \overline{y_n} & Eq1\\ \overline{x_n^2 }w + \overline{x_n} b = \overline{x_n y_n} & Eq2\\ \end{cases} \end{align}
$Eq2 - Eq1 \cdot \overline{x_n}$:
\begin{align} & w \left( \overline{x_n^2} - \overline{x_n}^2 \right) = \overline{x_n y_n} - \overline{x_n}\cdot\overline{y_n}\\ \end{align}Leading to:
\begin{align} w &= \frac{\overline{x_n y_n} - \overline{x_n}\cdot\overline{y_n}}{\overline{x_n^2} - \overline{x_n}^2}\\ &= \frac{\sum_{i=1}^n \left(x_i - \overline{x_n} \right) \left(y_i - \overline{y_n} \right)}{\sum_{i=1}^n \left(x_i - \overline{x_n} \right)^2 } \\ b &= \frac 1n \sum_{i=1}^n \left(y_i - x_i w \right) = \overline{y_n} - \overline{x_n} w \\ \end{align}
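As an optional cross-check (an addition, not part of the original derivation), SymPy can solve the two normal equations Eq1 and Eq2 symbolically and recover the same expressions; the symbols xbar, ybar, x2bar, xybar stand for $\overline{x_n}$, $\overline{y_n}$, $\overline{x_n^2}$ and $\overline{x_n y_n}$:
w_s, b_s, xbar, ybar, x2bar, xybar = sympy.symbols('w b xbar ybar x2bar xybar')
sol = sympy.solve([xbar * w_s + b_s - ybar,            # Eq1
                   x2bar * w_s + xbar * b_s - xybar],  # Eq2
                  [w_s, b_s], dict=True)[0]
sympy.simplify(sol[w_s]), sympy.simplify(sol[b_s])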
Remove the biases (center the data):
xUnB = xTrain - np.mean(xTrain)
yUnB = yTrain - np.mean(yTrain)
# In case x is univariate, the matrix product x^T x is a scalar
wEst = np.dot(xUnB, yUnB) / np.dot(xUnB, xUnB)
bEst = np.mean(yTrain - wEst * xTrain)
print('Linear regression estimate: y = {:.3f} x + {:.3f}'.format(wEst, bEst))
Test model¶
yEst1 = wEst * xTest + bEst
plt.plot(xTest, yTest, xTest, yEst1);
res = yTest - yEst1
mse1 = np.dot(res, res) / N
print('Closed form, MSE = {:.3e}'.format(mse1));
Numpy polyfit, 1st degree¶
http://www.python-simple.com/python-numpy-scipy/fitting-regression.php
fit2 = np.polyfit(xTrain, yTrain, 1)
fit2
Test model¶
yEst2 = fit2[0] * xTest + fit2[1]
plt.plot(xTest, yTest, xTest, yEst2);
mse2 = skMetrics.mean_squared_error(yTest, yEst2)
print('Numpy polyfit 1st degree, MSE =', "{:.3e}".format(mse2));
Numpy polyfit, 4th degree¶
fit3 = np.polyfit(xTrain, yTrain, 4)
fit3
Test model¶
# Equivalent to np.polyval(fit3, xTest)
yEst3 = xTest**4 * fit3[0] + xTest**3 * fit3[1] + xTest**2 * fit3[2] + xTest * fit3[3] + fit3[4]
plt.plot(xTest, yTestClean, xTest, yEst3);
mse3 = skMetrics.mean_squared_error(yTest, yEst3)
plt.legend(['ori', 'fitted'])
print('Numpy polyfit 4th degree, MSE =', "{:.3e}".format(mse3));
NumPy least square¶
http://www.python-simple.com/python-numpy-scipy/fitting-regression.php
# rcond=None selects the new NumPy default and avoids a FutureWarning
fit4, residues, rank, s = np.linalg.lstsq(np.reshape(xUnB, (N, 1)), yUnB, rcond=None)
fit4
Test model¶
yEst4 = fit4 * xTest + bEst
plt.plot(xTest, yTest, xTest, yEst4);
mse4 = skMetrics.mean_squared_error(yTest, yEst4)
print('Numpy Least square, MSE =', "{:.3e}".format(mse4));
Scipy linear regression¶
http://www.python-simple.com/python-numpy-scipy/fitting-regression.php
fit5 = sy.stats.linregress(xTrain, yTrain)
fit5
Test model¶
yEst5 = fit5.slope * xTest + fit5.intercept
plt.plot(xTest, yTest, xTest, yEst5)
# fit5 also exposes rvalue, pvalue and stderr to analyse the fit
mse5 = skMetrics.mean_squared_error(yTest, yEst5)
print('Scipy, MSE =', "{:.3e}".format(mse5));
SK Learn linear regression¶
model6 = skLinearRegression()  # the 'normalize' argument (False, its old default) has been removed from recent scikit-learn
model6.fit(xTrain.reshape(-1, 1), yTrain.reshape(-1, 1))
print('SciKit Learn linear regression, b =', model6.intercept_[0], ', w =', model6.coef_[0][0])
print('SciKit Learn, R^2-score =', model6.score(xTest.reshape(-1, 1), yTest))
Test model¶
yEst6 = model6.predict(xTest.reshape(-1, 1))
plt.plot(xTest, yTest, xTest, yEst6)
mse6 = skMetrics.mean_squared_error(yTest, yEst6)
print('SciKit Learn, MSE =', "{:.3e}".format(mse6));
Gradient descent¶
Let's first plot the MSE for several values of the slope in order to verify the convex shape of the cost function.
Ns = 100
slope = np.linspace(-1, 1, Ns)
sx = np.matmul(np.reshape(xUnB, (N, 1)), np.reshape(slope, (1, Ns)))
er_sx_y = sx - np.reshape(yUnB, (N, 1))
mse = np.mean(er_sx_y.T**2, axis = 1)
plt.plot(slope, mse)
plt.xlabel('w')
plt.ylabel('MSE')
plt.title('Min MSE = %.3e' % np.min(mse))
plt.grid();
The convexity of the MSE as a function of the linear regression slope (the $w$ coefficient) is clearly visible.
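Indeed, with the centered data (xUnB, yUnB), the MSE is a second-degree polynomial in $w$ with a positive leading coefficient, hence convex (a short justification added for completeness, the means being taken over the $N$ training samples):
\begin{align} MSE(w) = \frac 1N \sum_{i=1}^N \left(w x_i - y_i\right)^2 = \overline{x^2}\, w^2 - 2\, \overline{xy}\, w + \overline{y^2} \end{align}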
Gradient descent computation¶
def calcGradient(X, Y, b, w):
    # Gradient of the squared error summed (not averaged) over the samples, and its norm
    A = w * X + b - Y
    F_db = np.sum(A)
    F_dw = np.dot(A, X)
    return (F_db, F_dw, math.sqrt(F_db**2 + F_dw**2))

# Initial coef
b7, w7 = 1, 1
threshold = 1e-1
learningRate = 1e-6  # small because the gradient is summed over the N = 100000 samples
nIterMax = 1e5

# Init
gradient_b, gradient_w, gradientNorm = calcGradient(xTrain, yTrain, b7, w7)
print('START: b = %.3e' % b7, ', w = %.3e' % w7, ', Gradient norm = %.3e' % gradientNorm)
w7Learn = [np.array([b7, w7, gradientNorm])]

nIter = 0
while (gradientNorm > threshold) & (nIter < nIterMax):
    b7 = b7 - learningRate * gradient_b
    w7 = w7 - learningRate * gradient_w
    gradient_b, gradient_w, gradientNorm = calcGradient(xTrain, yTrain, b7, w7)
    w7Learn.append(np.array([b7, w7, gradientNorm]))
    nIter += 1

print('END : b = %.3e' % b7, ', w = %.3e' % w7, ', Gradient norm = %.3e' % gradientNorm, ', num iteration =', len(w7Learn))

df7 = pandas.DataFrame(w7Learn, columns = ('b', 'w', 'Gradient norm'))

fig = plt.figure(figsize=(16, 12))
plt.subplot(2, 2, 1)
plt.plot(df7['b'])
plt.grid()
plt.title('b');
plt.subplot(2, 2, 2)
plt.plot(df7['w'])
plt.grid()
plt.title('w');
plt.subplot(2, 2, 3)
plt.semilogy(df7['Gradient norm'])
plt.grid()
plt.title('Gradient norm');

fig = plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
plt.semilogy(df7['b'], df7['Gradient norm'])
plt.grid()
plt.xlabel('b')
plt.ylabel('gradient norm');
plt.title('Trajectory of b and Gradient norm');
plt.subplot(1, 2, 2)
plt.semilogy(df7['w'], df7['Gradient norm'])
plt.grid()
plt.xlabel('w')
plt.ylabel('gradient norm');
plt.title('Trajectory of w and Gradient norm');
Test model¶
yEst7 = w7 * xTest + b7
plt.plot(xTest, yTest, xTest, yEst7);
mse7 = skMetrics.mean_squared_error(yTest, yEst7)
print('Gradient descent, MSE =', '{:.3e}'.format(mse7));
Stochastic gradient descent¶
Using a new data batch at each iteration.
Alternatively, on a finite data set, shuffle the samples between epochs.
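As an aside, here is a minimal sketch of that shuffle-based variant on the finite training set (the epoch count, batch size, learning rate and the names bS, wS are arbitrary illustrative choices); the notebook itself draws a fresh batch at every iteration below:
# Sketch of mini-batch SGD with per-epoch shuffling of the finite training set
bS, wS = 2, 2
lr = 1e-3
nEpochs = 5
batchSize = 100
for epoch in range(nEpochs):
    perm = random.permutation(N)               # reshuffle the training samples at each epoch
    for start in range(0, N, batchSize):
        idx = perm[start:start + batchSize]
        gb, gw, _ = calcGradient(xTrain[idx], yTrain[idx], bS, wS)
        bS = bS - lr * gb
        wS = wS - lr * gw
print('Shuffle-based SGD estimate: y = {:.3f} x + {:.3f}'.format(wS, bS))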
nBatch = 100
b8, w8 = 2, 2
threshold = 1e-3
learningRate = 1e-3
nIterMax = 1e5

# Initial batch
xBatch, yBatch, yBC = generateBatch(nBatch, True)
gradient_b, gradient_w, gradientNorm = calcGradient(xBatch, yBatch, b8, w8)
print('START: b = %.3e' % b8, ', w = %.3e' % w8, ', Gradient norm = %.3e' % gradientNorm)
w8Learn = [np.array([b8, w8, gradientNorm])]

# Continue, drawing a new batch at each iteration
nIter = 0
while (gradientNorm > threshold) & (nIter < nIterMax):
    b8 = b8 - learningRate * gradient_b
    w8 = w8 - learningRate * gradient_w
    xBatch, yBatch, yBC = generateBatch(nBatch, True)
    gradient_b, gradient_w, gradientNorm = calcGradient(xBatch, yBatch, b8, w8)
    w8Learn.append(np.array([b8, w8, gradientNorm]))
    learningRate = learningRate * 0.9999  # Decreasing learning rate
    nIter += 1

print('END : b = %.3e' % b8, ', w = %.3e' % w8, ', Gradient norm = %.3e' % gradientNorm, ', num iteration =', len(w8Learn))

df8 = pandas.DataFrame(w8Learn, columns = ('b', 'w', 'Gradient norm'))

fig = plt.figure(figsize=(16, 12))
plt.subplot(2, 2, 1)
plt.plot(df8['b'])
plt.grid()
plt.title('b');
plt.subplot(2, 2, 2)
plt.plot(df8['w'])
plt.grid()
plt.title('w');
plt.subplot(2, 2, 3)
plt.semilogy(df8['Gradient norm'])
plt.grid()
plt.title('Gradient norm');
The gradient norm gets very noisy once its value drops below $10^{-1}$; a more adaptive learning rate would be needed there, together with an average of the gradient norm to improve the stopping condition of the gradient descent.
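One simple option, sketched below as an illustration (the smoothing factor alpha=0.05 is an arbitrary choice, not from the notebook), is to monitor an exponential moving average of the gradient norm instead of its raw, noisy value:
# Smooth the recorded gradient norm with an exponential moving average (pandas ewm)
smoothed = df8['Gradient norm'].ewm(alpha=0.05).mean()
plt.semilogy(df8['Gradient norm'], alpha=0.5)
plt.semilogy(smoothed)
plt.legend(['Raw gradient norm', 'EMA-smoothed gradient norm'])
plt.grid();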
fig = plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
plt.semilogy(df8['b'], df8['Gradient norm'])
plt.grid()
plt.xlabel('b')
plt.ylabel('gradient norm');
plt.title('Trajectory of b and Gradient norm');
plt.subplot(1, 2, 2)
plt.semilogy(df8['w'], df8['Gradient norm'])
plt.grid()
plt.xlabel('w')
plt.ylabel('gradient norm');
plt.title('Trajectory of w and Gradient norm');
Test model¶
yEst8 = w8 * xTest + b8
plt.scatter(xBatch, yBatch, marker='.', color='black');
plt.plot(xTrain, yClean)
plt.plot(xTest, yEst8)
plt.legend(['Generator model', 'Estimated model', 'Last batch'])
mse8 = skMetrics.mean_squared_error(yTest, yEst8)
print('Stochastic gradient descent, MSE =', "{:.3e}".format(mse8));
Gradient descent with SciKit Learn¶
from sklearn.linear_model import SGDRegressor as skSGDRegressor

model9 = skSGDRegressor(alpha=0.0001, average=False, early_stopping=True, epsilon=0.1,
                        eta0=0.0, fit_intercept=True, learning_rate='optimal',
                        loss='squared_error',  # named 'squared_loss' in older scikit-learn releases
                        max_iter=1000, n_iter_no_change=5, penalty='l2', power_t=0.5,
                        random_state=None, shuffle=True, tol=0.001,
                        validation_fraction=0.1, verbose=0, warm_start=False)  # l1_ratio=0.15,
Notes:
- Regularizer is called 'penalty' and parameterized by 'alpha' (and 'l1_ratio')
- Early stopping is available and parameterized by 'early_stopping', 'max_iter', 'tol' and 'n_iter_no_change' (a short check follows these notes)
- Shuffling between epochs is enabled by 'shuffle'
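As a small illustrative check of the early-stopping note above (an added aside, demoModel is a hypothetical name), the fitted attribute n_iter_ of SGDRegressor reports how many epochs actually ran before the stopping criterion was met:
# Fit a regressor with default settings plus early stopping, then inspect n_iter_
demoModel = skSGDRegressor(early_stopping=True, n_iter_no_change=5, tol=0.001, max_iter=1000)
demoModel.fit(xTrain.reshape(-1, 1), yTrain)
print('Epochs run before early stopping:', demoModel.n_iter_)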
model9.fit(xTrain.reshape(-1, 1), yTrain)
print('Y = {0} X + {1}'.format(model9.coef_, model9.intercept_))

yEst9 = model9.predict(xTest.reshape(-1, 1))
plt.plot(xTrain, yClean)
plt.plot(xTest, yEst9)
plt.legend(['Generator model', 'Estimated model'])
mse9 = skMetrics.mean_squared_error(yTest, yEst9)
print('Gradient descent with SK Learn, MSE =', "{:.3e}".format(mse9));
Main take-aways¶
Linear regression has a closed-form solution leading to the best fit. Many Python libraries provide this linear or polynomial fitting.
We have also learnt gradient descent on this simple case; it will be very useful for the coming projects based on neural networks.
Where to go from here?¶
Another single-feature linear implementation using TensorFlow (Notebook)
More complex bivariate models using "raw" Python (Notebook), up to gradient descent with a regularizer, or using Keras (Notebook)
Compare with single-feature binary classification using logistic regression, with "raw" Python or libraries (Notebook)