
At its most basic, a neural network takes input data and maps it to an output value. Whether you're looking at images, words or raw numerical data, all the network sees is numbers, and it's simply finding patterns in those numbers. The input data is filtered through matrices of weights, the parameters of the network, which can number in the thousands, millions or even billions. Tuning these weights by hand to recognize the patterns is obviously not a task any human wants to, or could, do, so a method to do it automatically was devised, several times in fact, but most notably in 1986 [1]. The method takes a neural network's output error and propagates it backwards through the network, determining which paths have the greatest influence on the output. This is, of course, backpropagation.
Backpropagation identifies which pathways are more influential in the final answer and allows us to strengthen or weaken connections to arrive at a desired prediction. It is such a fundamental component of deep learning that it will invariably be implemented for you in the package of your choosing, so you really could build something amazing without knowing a thing about it. In the same vein, you could also drive your car, as many of us do, without the faintest idea of how the engine really works. You don't have to think about it and can still drive without issue…that is, until you're broken down on the side of the road. Then you sure wish you understood something about what's going on under the hood.
That broken down car, or, dropping the analogy now, a broken model, is what brought me to this point. I needed to understand, so I started digging, and a quick Wikipedia search shed some light on the inner workings of how neural networks learn:
"Essentially, backpropagation evaluates the expression for the derivative of the cost function as a product of derivatives between each layer from left to right – "backwards" – with the gradient of the weights between each layer being a simple modification of the partial products (the "backwards propagated error)."
Come again?? I love Wikipedia, but that is a serious mouthful. It quickly becomes clear that backpropagation isn't an easy concept, and digesting the concepts and formulas that will be thrown at you requires some serious effort. While you will never fully understand something until you put pen to paper yourself, the goal here is to provide a resource that makes such a crucial concept more accessible to those already working with neural networks who wish to 'peek under the hood'. Fundamentals should not be hidden behind a veil of formulas; presented in a cohesive manner, those same formulas become a road map rather than a road block.
In the case of understanding backpropagation we are provided with a convenient visual tool, literally a map. This map will visually guide us through the derivation and deliver us to our final destination: the formulas of backpropagation. The map I refer to is the neural network itself. While it doesn't follow the same conventions as a computational graph, I will use it in much the same way and call it a computational map rather than a graph to distinguish it from the more formal graph structure. This visual method only really works when the reader can see the process unfold; writing it down as a series of steps runs into the common pitfall of generating a soup of formulas where the connections are not immediately tangible and the reader becomes overwhelmed. To be clear, we will still end up with many formulas that look intimidating on their own, but after seeing the process by which they evolve, each equation should make sense and things become very systematic.
The tool used here to convey this visual information is manim, a math animation library created by Grant Sanderson of the 3Blue1Brown YouTube channel. I must also attribute some code from the network class in his neural network series. If you're not familiar with his channel, do yourself a favor and check it out (3B1B Channel). While manim was my tool of choice, it's not the easiest, and at some point between 'I've gone too far to stop now' and 'I've bitten off way more than I can chew' I may have regretted this decision, but here we are. If you're beginning with neural networks and/or need a refresher on forward propagation, activation functions and the like, see the 3B1B video in [2] to get some footing. Some calculus and linear algebra will also greatly assist you, but I try to explain things at a fundamental level, so hopefully you can still grasp the basic concepts. While implementing a neural network in code can go a long way toward developing understanding, you could easily implement a backprop algorithm without really understanding it (at least I've done so). Instead, the point here is to get a detailed understanding of what backpropagation is actually doing, and that entails understanding the math.
Network and Notation
A simplified model is used to illustrate the concepts and to avoid overcomplicating the process. The network, shown in figure 1, has 2 inputs, 2 outputs and a hidden layer of 2 nodes. The output nodes are denoted e, indicating the error, though you may also commonly see them denoted C for the cost function; this would typically be a function like mean squared error (MSE) or binary cross entropy. The E node is the total error, the sum of _e_₁ and _e_₂. The main difference here compared to a typical neural network layout is that I've explicitly broken the hidden nodes into two separate functions: the weighted sum (z nodes) and the activation (a nodes). These are typically grouped under one node, but it's clearer, and required here, to show each function separately. Throughout I assume we are dealing with one training example; in reality you would average over all the training examples in your training set.

Now half the battle is getting the notation straight. Figure 2 indicates the notation for nodes and weights in the example network. Superscripts refer to layers and subscripts to nodes.
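Written out with that notation (writing the inputs as x⁰ as in the later figures, using σ for whichever activation you choose, and ignoring the bias for now since it gets its own section at the end), the split between the z and a nodes amounts to:

```latex
z^{1}_{j} = \sum_{i} w^{1}_{ji}\, x^{0}_{i}, \qquad
a^{1}_{j} = \sigma\!\left(z^{1}_{j}\right), \qquad
E = e_{1} + e_{2}
```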

The weight subscript indexes may appear backwards, but they will make more sense when we build the matrices. Indexing in this manner lines the rows of the weight matrix up with the rows of the neural network, and the weight indexes then agree with the typical (row, column) matrix indexing.
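Concretely, under my reading of that convention (the first subscript of each weight is the destination node, the second is the source node), the layer 2 weight matrix multiplies the hidden activations directly, with each row of the product lining up with one output node:

```latex
W^{2} =
\begin{bmatrix}
w^{2}_{11} & w^{2}_{12} \\
w^{2}_{21} & w^{2}_{22}
\end{bmatrix},
\qquad
Z^{2} = W^{2} A^{1} =
\begin{bmatrix}
w^{2}_{11} a^{1}_{1} + w^{2}_{12} a^{1}_{2} \\
w^{2}_{21} a^{1}_{1} + w^{2}_{22} a^{1}_{2}
\end{bmatrix}
```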
A final note before we get to the math: to convey the visual nature of the derivation I've used GIFs. Sometimes you may want to stop or slow down an animation, which a GIF is obviously not ideal for, so please also see the accompanying YouTube video for more control over the pace.
Final Layer Equations
The ultimate goal of backpropagation is to find the change in the error with respect to the weights in the network. If we're looking for the change of one value with respect to another, that's a derivative. For our computational map each node represents a function and each edge performs an operation on the attached node (multiplication by the weight). We start at the error node and move back one node at a time, taking the partial derivative of the current node with respect to the node in the preceding layer. Each term is chained onto the preceding term to get the total effect; this is, of course, the chain rule.

Note: the weights in this layer only affect one of the outputs, either _e_₁ or _e_₂, so only the relevant error appears in the end equations.
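As a concrete example of what the trace looks like for one final layer weight, here is my sketch of the chain for w²₁₁ in the notation of figure 2 (since E = e₁ + e₂ and w²₁₁ only touches e₁, only the e₁ term survives):

```latex
\frac{\partial E}{\partial w^{2}_{11}}
= \frac{\partial e_{1}}{\partial a^{2}_{1}}
  \cdot \frac{\partial a^{2}_{1}}{\partial z^{2}_{1}}
  \cdot \frac{\partial z^{2}_{1}}{\partial w^{2}_{11}}
```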
This tracing out of the edges and nodes is done for each path from the error node to each weight in the final layer; figure 4 runs through it quickly.

At this point we have a bunch of formulas that will be hard to keep track of if we don't do some bookkeeping. That means putting the terms into matrices so we can manage and track them more easily. Figure 5 shows how the terms are grouped; it's worth noting the use of the Hadamard operator (a circle with a dot inside). This denotes element wise matrix multiplication, which helps simplify the matrix operations.

The final equations in matrix notation are shown in figure 6, where I've used capital letters to denote the matrix/vector form of each variable. The first two terms on the right hand side are refactored into a delta term. As we trace out the preceding layers these terms will occur repeatedly, so it makes sense to calculate them once and store them as the delta variable for future use.
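In symbols, the grouping I believe figure 6 is expressing looks like the following (a sketch in my own notation rather than a transcription of the figure, with ⊙ the Hadamard product and the last factor an outer product with the hidden activations):

```latex
\delta^{2} = \frac{\partial E}{\partial A^{2}} \odot \frac{\partial A^{2}}{\partial Z^{2}},
\qquad
\frac{\partial E}{\partial W^{2}} = \delta^{2} \left( A^{1} \right)^{\mathsf{T}}
```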

Deeper Layers
For deeper layers the same methodology applies, with two key updates: 1) the delta terms from the final layer appear again in these equations, so we will make the appropriate substitutions; 2) there will now be two paths from the total error node to the weight of interest. When multiple branches converge on a single node we add those branches and then proceed with multiplying the remaining chain of functions. Figure 7 shows the process for one of the first layer weights.
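For example, here is my sketch of the trace for w¹₁₁ in the 2-2-2 layout of figure 1: the two branches arriving at a¹₁ (one through each output) are summed, and the rest of the chain is then multiplied on:

```latex
\frac{\partial E}{\partial w^{1}_{11}}
= \left(
    \frac{\partial e_{1}}{\partial a^{2}_{1}}
    \frac{\partial a^{2}_{1}}{\partial z^{2}_{1}}
    \frac{\partial z^{2}_{1}}{\partial a^{1}_{1}}
  + \frac{\partial e_{2}}{\partial a^{2}_{2}}
    \frac{\partial a^{2}_{2}}{\partial z^{2}_{2}}
    \frac{\partial z^{2}_{2}}{\partial a^{1}_{1}}
  \right)
  \frac{\partial a^{1}_{1}}{\partial z^{1}_{1}}
  \frac{\partial z^{1}_{1}}{\partial w^{1}_{11}}
```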

Now if we look closely you'll notice the repeated terms. Figure 8 recalls the previous layer's equations alongside our current equation for w₁₁, shows which terms are repeated, and factors them out as deltas.

All subsequent equations follow the same methodology. Because this gets quite repetitive, and because I can only cram so much length into a GIF, the process is repeated (very) quickly in figure 9 for all remaining weights.

Once again we have a mess of formulas that we'll tidy up by entering them into matrices. The general groupings follow the same pattern as the final layer.

The leftmost matrix can of course be broken down further; we'll want the delta values on their own so that we can simply plug in the values calculated from the previous layer. To follow figure 11 you'll have to recall how the dot product works, rows multiplied by columns, which is why we also add a transpose to the delta terms.
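However figure 11 arranges the product, the operation boils down to pulling the final layer deltas back through the layer 2 weights. In the arrangement I find easiest to read (transposing the weight matrix rather than the deltas, which is equivalent), the layer 1 equations become:

```latex
\delta^{1} = \left( \left( W^{2} \right)^{\mathsf{T}} \delta^{2} \right)
             \odot \frac{\partial A^{1}}{\partial Z^{1}},
\qquad
\frac{\partial E}{\partial W^{1}} = \delta^{1} \left( X^{0} \right)^{\mathsf{T}}
```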

Adding the matrix notation gives the final formula:

Now, some of the derivative terms in figure 12 are going to be the same no matter what activation function you use. Nodes connected by edges produce a linear output, which is then fed to the activation function to introduce the non-linearity. Given the linear nature of that function, finding the derivative of a node function with respect to the previous node is simple and can be determined visually. Taking the dz/da terms, this derivative tells us how the output (_z_₁) changes with respect to an input, _a_₁. These functions are only connected through the edge _w_₁₁, so the weight is the only way in which a₁ can change z₁, that is, dz/da = w₁₁. Figure 13 indicates the nodes connected by an edge and substitutes the edge value into the matrix as the solution to the derivative term.

It's a similar scenario for dz/dw. In this case the derivative is with respect to the edge rather than the node, but the same logic holds: there is only one function connected through the edge that affects the z node, namely the input (which I've labelled x⁰ but could also have labelled a⁰). The derivative solutions can then be substituted into the matrix equation per figure 14.
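Both shortcuts fall straight out of the weighted sum, assuming each z node is the usual linear combination of its inputs (biases omitted here since they get their own section below). For the top node of each layer, for example:

```latex
z^{2}_{1} = w^{2}_{11} a^{1}_{1} + w^{2}_{12} a^{1}_{2}
\;\Rightarrow\;
\frac{\partial z^{2}_{1}}{\partial a^{1}_{1}} = w^{2}_{11},
\qquad
z^{1}_{1} = w^{1}_{11} x^{0}_{1} + w^{1}_{12} x^{0}_{2}
\;\Rightarrow\;
\frac{\partial z^{1}_{1}}{\partial w^{1}_{11}} = x^{0}_{1}
```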

We finally have the equations for the final and initial layers.

Generalized Equations
If you're following Michael Nielsen's excellent online book [3], he notes some more general equations, as does 3B1B [4]. The equations here can likewise be generalized further. Here the superscript 1 represents the current layer (l) and the superscript 0 the previous layer (l-1). The superscript 2 in the top equation refers to the next layer (l+1), while in the bottom equation it refers to the final layer (L)…or just watch the substitutions in figure 16.
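For reference, making those substitutions gives the familiar layer-indexed form below. This is my summary in the notation used throughout, and it matches the shape of the equations in [3]; figure 16 may group the activation derivative slightly differently:

```latex
\delta^{L} = \frac{\partial E}{\partial A^{L}} \odot \frac{\partial A^{L}}{\partial Z^{L}},
\qquad
\delta^{l} = \left( \left( W^{l+1} \right)^{\mathsf{T}} \delta^{l+1} \right)
             \odot \frac{\partial A^{l}}{\partial Z^{l}},
\qquad
\frac{\partial E}{\partial W^{l}} = \delta^{l} \left( A^{l-1} \right)^{\mathsf{T}}
```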

The Bias
As tempting as it is to skip over the bias and tell you it’s simple and follows from the above, it really does help to see it worked out at least once. So I will cover one example here in figure 17.

As you can see, the process is the same. An important point to note is that, for a given layer, all but the last term will be the same as in the equations we just found with respect to a given weight. That last term is the derivative with respect to the bias, and since the bias input is assigned a value of 1 (the bias weight terms are what actually adjust the bias), it is simply 1. This allows us to simplify and generalize the bias equation relatively easily, as in figure 18.
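In other words, the bias gradients are just the deltas we have already computed; a quick sketch in the same notation:

```latex
\frac{\partial z^{l}_{j}}{\partial b^{l}_{j}} = 1
\;\Rightarrow\;
\frac{\partial E}{\partial b^{l}_{j}} = \delta^{l}_{j},
\qquad
\frac{\partial E}{\partial B^{l}} = \delta^{l}
```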

Combining all the equations gives us the final generalized set of equations in matrix form.
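To tie everything together, here is a minimal NumPy sketch of one forward and backward pass through the 2-2-2 network using these matrix equations. The sigmoid activation and squared error are my own assumptions for the sake of a runnable example; they are not fixed by the derivation, and the variable names are mine rather than anything from the accompanying source code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 2-2-2 network: inputs x0, one hidden layer (z1, a1), outputs (z2, a2).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros((2, 1))  # layer 1 weights and biases
W2, b2 = rng.normal(size=(2, 2)), np.zeros((2, 1))  # layer 2 weights and biases

x0 = np.array([[0.5], [0.1]])  # a single training example
y  = np.array([[1.0], [0.0]])  # its target

# Forward pass: z nodes are weighted sums, a nodes are activations.
z1 = W1 @ x0 + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)
E = 0.5 * np.sum((a2 - y) ** 2)  # total error E = e1 + e2

# Backward pass: the generalized matrix equations.
dE_dA2 = a2 - y                           # dE/dA2 for the squared error
delta2 = dE_dA2 * a2 * (1 - a2)           # dE/dA2 elementwise with the sigmoid derivative
dW2 = delta2 @ a1.T                       # dE/dW2 = delta2 (A1)^T
db2 = delta2                              # dE/dB2 = delta2

delta1 = (W2.T @ delta2) * a1 * (1 - a1)  # pull the deltas back through W2
dW1 = delta1 @ x0.T                       # dE/dW1 = delta1 (X0)^T
db1 = delta1                              # dE/dB1 = delta1
```

A handy sanity check is a finite difference: nudge a single weight by a small ε, rerun the forward pass, and confirm the change in E is roughly ε times the corresponding entry of dW1 or dW2.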

Final Remarks
If a picture is worth a thousand words, then surely over a dozen GIFs is worth a good deal more (or maybe you just never want to see another GIF again). I do hope this has helped shed some light on a tough concept. If you liked this, I hope to have more content soon. This article was actually a detour from my original project, building a single shot object detector, which broke at several points and led me down this rabbit hole.
Additional resources related to this article:
- Github link to source code
- Accompanying YouTube video
References
[1] D. Rumelhart, G. Hinton and R. Williams, Learning representations by back-propagating errors (1986), Nature
[2] G. Sanderson, But what is a Neural Network? Deep learning, chapter 1 (2017), 3Blue1Brown
[3] M. Nielsen, Neural Networks and Deep Learning, Chapter 2 (2015), Determination Press
[4] G. Sanderson, Backpropagation calculus – Deep learning, chapter 4 (2017), 3Blue1Brown
Theorem of Beethoven is a YouTube channel I made use of; it has many useful videos on using Manim.