Artificial neural networks have regained popularity in machine learning circles with recent advances in deep learning. Deep learning techniques trace their origins back to the concept of back-propagation in multi-layer perceptron (MLP) networks, the topic of this post. The complete code from this post is available on GitHub.
Multi-Layer Perceptron Networks for Regression
A MLP network consists of layers of artificial neurons connected by weighted edges. Neurons are denoted for the
-th neuron in the
-th layer of the MLP from left to right top to bottom. Inputs are fed into the leftmost layer and propagate through the network along weighted edges until reaching the final, or output, layer. An example of a MLP network can be seen below in Figure 1.

Figure 1: A Multi-Layer Perceptron Network
The input to a neuron , also known as the “net” denoted
, is the weighted sum of all incoming edges plus an optional bias. Its output
is computed by applying the
-th layer’s activation function
to
. By arranging the weights of each layer into a matrices , the output of the
-th layer of a MLP can be computed as:
.
where is vector of nets of the
-th layer,
is the weight matrix of the
-th layer,
is the bias vector of the
-th layer , and
is the activation function for the
-th layer. The activation function is applied component-wise to the resulting vector;
and the sigmoid function are commonly used. In a network with
layers, the final output
given an input
can be defined as:
.
Back-Propagation
Consider a supervised learning scenario where the -th input vector
is associated with the
-th target vector
with
input-target pairs in total. Further, assume the network has
layers of sizes:
. The L2 error of the network for this input at the
-th output neuron is then:
,
where the is for convenience,
is the
-th component of the target, and
is the
-th component of the predicted output. The total network error is simply:
.
Back-propagation uses gradient descent to minimize the network error by updating the weights and biases of the network. First, consider the case of the output layer of the MLP network. After computing the network error for the -th input sample, an update
is found for the entry in the
-th row and
-th column of the
-th weight matrix, that is
. By the definition of gradient descent:
,
where is the learning rate. Now, by the multivariate chain rule,
,
where is the
-th neuron’s net in the
-th layer of the network, and
is the
-th neuron’s output in layer
of the network. The final step is true since,
.
Only considering the -th component of this, as above, results in:
.
Continuing from earlier with the multivariate chain rule,
.
Consider the case of the output layer, that is layer . Using the definition of the component-wise error defined earlier, the RHS simplifies as follows:
.
If the activation function for a given layer is then
and thus can be easily computed. Thus, the update rule for output neurons is:
,
where is the
-th neuron’s delta value in the output layer ( layer
of the MLP) for the
-th input vector.
For hidden layers, can no longer be directed computed as the target values
are only provided for the output layer. The delta terms for these layers are obtained through back-propagation. That is, the delta terms for layer
are obtained using those for layer
as can be seen below in Figure 2.

Figure 2: Hidden Node Update Visualization
One difference in the derivation between hidden nodes and output nodes is that, since the layers in MLP networks are fully-connected, hidden neurons contribute to the error in all later neurons. Thus, the total network error term is used instead of the component-wise error
. So, similar to above,
.
Next, replace the total network error by its earlier definition,
.
The last step follows since the delta values for layer have already been computed. Thus, in summary, the weight update is computed as
with
for output layers and
for hidden layers. In order to update bias weights, use the appropriate formula above and replace with 1.
Implementation in Python with NumPy
The above formulas can be conveniently implemented using matrix and component-wise multiplications of matrices.
#bprop.py #Author: Nicholas Smith import numpy as np #Array of layer sizes ls = np.array([2, 4, 4, 1]) n = len(ls) #List of weight matrices (each a numpy array) W = [] #Initialize weights to small random values for i in range(n - 1): W.append(np.random.randn(ls[i], ls[i + 1]) * 0.1) #List of bias vectors initialized to small random values B = [] for i in range(1, n): B.append(np.random.randn(ls[i]) * 0.1) #List of output vectors O = [] for i in range(n): O.append(np.zeros([ls[i]])) #List of Delta vectors D = [] for i in range(1, n): D.append(np.zeros(ls[i])) #Input vectors (1 row per each) A = np.matrix([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]) #Target Vectors (1 row per each) y = np.matrix([[-0.5], [0.5], [0.5], [-0.5]]) #Activation function (tanh) for each layer #Linear activation for final layer actF = [] dF = [] for i in range(n - 1): actF.append(lambda (x) : np.tanh(x)) #Derivative of activation function in terms of itself dF.append(lambda (y) : 1 - np.square(y)) #Linear activation for final layer actF.append(lambda (x): x) dF.append(lambda (x) : np.ones(x.shape)) #Learning rate a = 0.5 #Number of iterations numIter = 250 #Loop for each iteration for c in range(numIter): #loop over all input vectors for i in range(len(A)): print(str(i)) #Target vector t = y[i, :] #Feed-forward step O[0] = A[i, :] for j in range(n - 1): O[j + 1] = actF[j](np.dot(O[j], W[j]) + B[j]) print('Out:' + str(O[-1])) #Compute output node delta values D[-1] = np.multiply((t - O[-1]), dF[-1](O[-1])) #Compute hidden node deltas for j in range(n - 2, 0, -1): D[j - 1] = np.multiply(np.dot(D[j], W[j].T), dF[j](O[j])) #Perform weight and bias updates for j in range(n - 1): W[j] = W[j] + a * np.outer(O[j], D[j]) B[j] = B[j] + a * D[j] print('\nFinal weights:') #Display final weights for i in range(n - 1): print('Layer ' + str(i + 1) + ':\n' + str(W[i]) + '\n') In the above code, the delta values for each layer are computed simultaneously by using matrix multiplication instead of evaluating each component separately. Similarly, the weight updates are computed simultaneously using an outer product between the delta and output vectors.