Derivation of Backpropagation

Figure 1 shows conceptually how network processing takes place.

Figure 1: Neural network processing: Activation forward propagates and then error backward propagates. Weights on the connections mediate the passed values in both directions.

Figure 2 depicts the network components which affect a particular weight change. Notice that all the necessary components are locally related to the weight being updated. This is one feature of backpropagation that seems biologically plausible. However, brain connections appear to be unidirectional and not bidirectional as would be required to implement backpropagation.

Figure 2: The change to a hidden-to-output weight depends on error (depicted as a lined pattern) at the output node and activation (depicted as a solid pattern) at the hidden node. The change to an input-to-hidden weight depends on error at the hidden node (which in turn depends on error at all the output nodes) and activation at the input node.

Review of Calculus Rules


Chain rule:

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x}$$

Sum rule (the derivative of a sum is the sum of the derivatives):

$$\frac{\partial}{\partial x}\sum_i u_i = \sum_i \frac{\partial u_i}{\partial x}$$

Power rule:

$$\frac{\partial}{\partial x}\,x^n = n\,x^{n-1}$$
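
For example, applying the power rule and the chain rule to a squared-error term of the form used below (here $t$ is a constant target and $a$ an activation) gives:

$$\frac{\partial}{\partial a}\left[\frac{1}{2}(t - a)^2\right]
  = \frac{1}{2}\cdot 2\,(t - a)\cdot\frac{\partial (t - a)}{\partial a}
  = -(t - a)$$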

Gradient Descent on Error

Key to terms:

  $t_o$ : target (desired) activation of output node $o$
  $a_o$ : actual activation of output node $o$
  $a_h$ : activation of hidden node $h$
  $a_i$ : activation of input node $i$
  $w_{ho}$ : weight on the connection from hidden node $h$ to output node $o$
  $w_{ih}$ : weight on the connection from input node $i$ to hidden node $h$
  $net_o$ : net input to output node $o$, $net_o = \sum_h w_{ho}\,a_h$
  $net_h$ : net input to hidden node $h$, $net_h = \sum_i w_{ih}\,a_i$
  $f$ : activation function, so $a_o = f(net_o)$ and $a_h = f(net_h)$
  $\eta$ : learning rate
  $E$ : sum-squared error over the output nodes

We can motivate the backpropagation learning algorithm as gradient descent on sum-squared error (we square the error because we are interested in its magnitude and not its sign). The total error in a network is given by the following equation (the $\frac{1}{2}$ will simplify things later).


$$E = \frac{1}{2}\sum_o (t_o - a_o)^2$$
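
As a sketch in NumPy (the array names and values are purely illustrative), this error for one training pattern could be computed as:

    import numpy as np

    def sum_squared_error(targets, outputs):
        """E = 1/2 * sum_o (t_o - a_o)^2 over the output nodes."""
        diff = targets - outputs
        return 0.5 * np.sum(diff ** 2)

    # Example with three output nodes
    targets = np.array([1.0, 0.0, 1.0])      # t_o
    outputs = np.array([0.8, 0.2, 0.6])      # a_o
    E = sum_squared_error(targets, outputs)  # 0.5*(0.04 + 0.04 + 0.16) = 0.12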

We want to adjust the network's weights to reduce this overall error.


$$\Delta w = -\eta\,\frac{\partial E}{\partial w}$$
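
As a minimal sketch of this update rule in isolation (the one-parameter error function here is purely illustrative, not the network error):

    # Gradient descent on a toy one-parameter error E(w) = (w - 3)^2,
    # whose minimum is at w = 3 and whose gradient is dE/dw = 2(w - 3).
    def dE_dw(w):
        return 2.0 * (w - 3.0)

    w = 0.0      # initial weight
    eta = 0.1    # learning rate
    for _ in range(50):
        w += -eta * dE_dw(w)   # Delta w = -eta * dE/dw
    print(w)     # close to the minimum at 3.0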

We will begin at the output layer with a particular weight.


$$\Delta w_{ho} = -\eta\,\frac{\partial E}{\partial w_{ho}}$$

However, the error is not directly a function of a weight. We expand this derivative using the chain rule as follows.


$$\frac{\partial E}{\partial w_{ho}}
  = \frac{\partial E}{\partial a_o}\,
    \frac{\partial a_o}{\partial net_o}\,
    \frac{\partial net_o}{\partial w_{ho}}$$

Let's consider each of these partial derivatives in turn. Note that only one term of the E summation will have a non-zero derivative: the one associated with the particular weight we are considering.


$$\frac{\partial E}{\partial a_o}
  = \frac{\partial}{\partial a_o}\left[\frac{1}{2}\sum_{o'}(t_{o'} - a_{o'})^2\right]
  = \frac{\partial}{\partial a_o}\left[\frac{1}{2}(t_o - a_o)^2\right]
  = \frac{1}{2}\cdot 2\,(t_o - a_o)\cdot(-1)
  = -(t_o - a_o)$$

Now we see why the $\frac{1}{2}$ in the $E$ term was useful: it cancels the factor of 2 brought down by the power rule.


$$\frac{\partial a_o}{\partial net_o}
  = \frac{\partial f(net_o)}{\partial net_o}
  = f'(net_o)$$
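
The derivation keeps the activation function $f$ generic. If $f$ is the commonly used logistic sigmoid, $f(x) = 1/(1+e^{-x})$, this derivative has a particularly convenient form (assumed in the code sketches below):

$$f'(net_o) = f(net_o)\,\bigl(1 - f(net_o)\bigr) = a_o\,(1 - a_o)$$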

Note that only one term of the net summation will have a non-zero derivative: again the one associated with the particular weight we are considering.


$$\frac{\partial net_o}{\partial w_{ho}}
  = \frac{\partial}{\partial w_{ho}}\left[\sum_{h'} w_{h'o}\,a_{h'}\right]
  = a_h$$

Now substituting these results back into our original equation we have:


$$\Delta w_{ho}
  = -\eta\,\bigl[-(t_o - a_o)\bigr]\,f'(net_o)\,a_h
  = \eta\,(t_o - a_o)\,f'(net_o)\,a_h$$

This is typically simplified as shown below, where the $\delta_o$ term represents the product of the error $(t_o - a_o)$ with the derivative of the activation function.


$$\Delta w_{ho} = \eta\,\delta_o\,a_h
  \qquad\text{where}\qquad
  \delta_o = (t_o - a_o)\,f'(net_o)$$
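
As a sketch of this output-layer update in NumPy (the sizes and values are illustrative, and a sigmoid activation is assumed so that $f'(net_o) = a_o(1 - a_o)$):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    eta = 0.5                           # learning rate
    a_h = np.array([0.2, 0.9, 0.5])     # activations of 3 hidden nodes
    t_o = np.array([1.0, 0.0])          # targets for 2 output nodes
    w_ho = np.zeros((3, 2))             # hidden-to-output weights

    # Forward pass through the output layer
    net_o = a_h @ w_ho
    a_o = sigmoid(net_o)

    # delta_o = (t_o - a_o) * f'(net_o); for the sigmoid, f'(net_o) = a_o * (1 - a_o)
    delta_o = (t_o - a_o) * a_o * (1.0 - a_o)

    # Delta w_ho = eta * delta_o * a_h, one entry per (hidden, output) pair
    w_ho += eta * np.outer(a_h, delta_o)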

Now we have to determine the appropriate weight change for an input-to-hidden weight. This is harder, because the error at a hidden node is not directly observable: it depends on the error at every output node that the hidden node feeds, as depicted in Figure 2.


$$\Delta w_{ih} = -\eta\,\frac{\partial E}{\partial w_{ih}}$$

Again the error is not directly a function of the weight. This time, however, the hidden activation $a_h$ feeds into every output node, so the chain rule must sum the error contributions over all of the outputs:

$$\frac{\partial E}{\partial w_{ih}}
  = \left[\sum_o \frac{\partial E}{\partial a_o}\,
          \frac{\partial a_o}{\partial net_o}\,
          \frac{\partial net_o}{\partial a_h}\right]
    \frac{\partial a_h}{\partial net_h}\,
    \frac{\partial net_h}{\partial w_{ih}}$$

Each of these partial derivatives is evaluated just as before (note that $\partial net_o / \partial a_h = w_{ho}$):

$$\frac{\partial E}{\partial w_{ih}}
  = \left[\sum_o -(t_o - a_o)\,f'(net_o)\,w_{ho}\right] f'(net_h)\,a_i
  = -\left[\sum_o \delta_o\,w_{ho}\right] f'(net_h)\,a_i$$

Substituting back and using the same $\delta$ notation as before, the update for an input-to-hidden weight is:

$$\Delta w_{ih} = \eta\,\delta_h\,a_i
  \qquad\text{where}\qquad
  \delta_h = f'(net_h)\sum_o \delta_o\,w_{ho}$$
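
Putting the two update rules together, here is a minimal sketch of one backpropagation training step for a network with a single hidden layer (illustrative only: it assumes sigmoid activations so that $f'(net) = a(1-a)$, omits bias weights, and uses made-up layer sizes):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    eta = 0.5                                  # learning rate

    # Illustrative sizes: 2 inputs, 3 hidden nodes, 1 output
    w_ih = rng.normal(scale=0.5, size=(2, 3))  # input-to-hidden weights
    w_ho = rng.normal(scale=0.5, size=(3, 1))  # hidden-to-output weights

    def train_step(a_i, t_o):
        """One backpropagation step for a single training pattern."""
        global w_ih, w_ho

        # Forward pass
        net_h = a_i @ w_ih
        a_h = sigmoid(net_h)                   # hidden activations
        net_o = a_h @ w_ho
        a_o = sigmoid(net_o)                   # output activations

        # Output error term: delta_o = (t_o - a_o) * f'(net_o)
        delta_o = (t_o - a_o) * a_o * (1.0 - a_o)

        # Hidden error term: delta_h = f'(net_h) * sum_o delta_o * w_ho
        delta_h = a_h * (1.0 - a_h) * (w_ho @ delta_o)

        # Weight updates: Delta w = eta * delta * (incoming activation)
        w_ho += eta * np.outer(a_h, delta_o)
        w_ih += eta * np.outer(a_i, delta_h)

        return 0.5 * np.sum((t_o - a_o) ** 2)  # current sum-squared error

    # Example: one step on a single pattern
    error = train_step(np.array([0.0, 1.0]), np.array([1.0]))

Note that delta_h is computed from the already-computed delta_o values and the hidden-to-output weights, which is exactly the backward propagation of error depicted in Figure 1.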

