The Neural Network Chain Rule

Published August 9, 2021

Doug Rose

Author | Agility | Artificial Intelligence | Data Ethics

In a previous article, Neural Network Backpropagation, I defined backpropagation as a machine-learning technique that calculates the gradient of the cost function at the output and distributes it back through the layers of the artificial neural network. That gradient provides guidance on how to adjust the weights of the connections between neurons and the biases within certain neurons to increase the accuracy of the output.

Imagine backpropagation as creating a feedback loop in the neural network. As the machine tries to match inputs to labels, the cost function tells the neural network how wrong its answer is. Backpropagation then feeds the error back through the network to make adjustments via gradient descent, an optimization algorithm that minimizes cost by repeatedly and gradually adjusting the weights and biases in the direction that lowers the cost, as shown in the image below. The goal is to reach the global cost minimum, the lowest point on the cost curve, where the slope is as close to zero as possible.
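To make the idea of gradient descent concrete, here is a minimal sketch in Python. The cost curve (a simple bowl with its lowest point at a weight of 3), the starting weight, and the learning rate are all assumptions chosen for illustration; a real network has many weights and a far more complicated cost surface.

```python
def cost(weight):
    """Toy cost curve (an assumption): lowest point at weight = 3."""
    return (weight - 3.0) ** 2


def cost_slope(weight):
    """Slope of the toy cost curve: d/dw (w - 3)^2 = 2 * (w - 3)."""
    return 2.0 * (weight - 3.0)


def gradient_descent(start_weight, learning_rate=0.1, steps=50):
    """Repeatedly nudge the weight a small step downhill."""
    weight = start_weight
    for _ in range(steps):
        # Move against the slope: a positive slope lowers the weight,
        # a negative slope raises it.
        weight -= learning_rate * cost_slope(weight)
    return weight


start_weight = 0.0
final_weight = gradient_descent(start_weight)
print("cost before:", cost(start_weight))
print("cost after: ", cost(final_weight))
```

Each step is proportional to the slope, so the steps shrink as the curve flattens out near the minimum.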

Backpropagation relies on the neural network chain rule, a technique used to find the derivative of cost with respect to any variable in a nested equation. In math, a derivative expresses the rate of change at any given point on a graph; it is the slope of the tangent at that point. A tangent is a straight line that just touches a curve at a point and has the same slope as the curve there. As shown in the image above, the dotted line labeled "Gradient" and the solid line labeled "Global Cost Minimum" both represent tangents; their slopes represent derivatives.
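As a rough illustration of the chain rule on a nested equation, the sketch below multiplies the derivative of an outer function by the derivative of an inner function and compares the result with a slope measured numerically. The two functions and all of the names are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def inner(x):
    """Inner function (an assumption): g(x) = 3x + 1."""
    return 3.0 * x + 1.0


def outer(u):
    """Outer function (an assumption): f(u) = u squared."""
    return u ** 2


def nested(x):
    """The nested equation f(g(x))."""
    return outer(inner(x))


def chain_rule_derivative(x):
    """Chain rule: d f(g(x)) / dx = f'(g(x)) * g'(x)."""
    d_outer_d_inner = 2.0 * inner(x)  # derivative of u^2 is 2u
    d_inner_d_x = 3.0                 # derivative of 3x + 1 is 3
    return d_outer_d_inner * d_inner_d_x


def numeric_derivative(f, x, h=1e-6):
    """Approximate the slope of the tangent with a tiny step either side."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


x = 2.0
analytic = chain_rule_derivative(x)
numeric = numeric_derivative(nested, x)
print(analytic, numeric)
```

The two values should agree to several decimal places, which is a handy way to check a hand-derived chain-rule calculation.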

Note that gradient descent gradually moves the weight of a connection from a point where the slope of the tangent is very steep to a point where the slope is nearly flat. The chain rule can be used to calculate the derivative of cost with respect to any weight in the network. This enables the network to identify how much each weight contributes to the cost (and causes the wrong answer) and whether that weight needs to be increased or decreased (and by how much) to reduce the cost (and improve the odds of a right answer).
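Here is a small sketch of that calculation for a single neuron. It assumes a sigmoid activation and a squared-error cost, neither of which is specified in this article, and the input, target, and starting values are made up. The sign of the derivative says whether the weight should go up or down, and its size says by how much.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_cost(weight, bias, x, target):
    """Squared error of one sigmoid neuron (both choices are assumptions)."""
    prediction = sigmoid(weight * x + bias)
    return (prediction - target) ** 2


def cost_slope_for_weight(weight, bias, x, target):
    """Chain rule: d cost/d weight = d cost/d prediction * d prediction/d z * d z/d weight."""
    prediction = sigmoid(weight * x + bias)
    d_cost_d_prediction = 2.0 * (prediction - target)
    d_prediction_d_z = prediction * (1.0 - prediction)  # slope of the sigmoid
    d_z_d_weight = x
    return d_cost_d_prediction * d_prediction_d_z * d_z_d_weight


weight, bias, x, target = 0.8, 0.1, 1.5, 0.0
slope = cost_slope_for_weight(weight, bias, x, target)

# A positive slope means raising the weight raises the cost,
# so the weight is decreased (and vice versa for a negative slope).
learning_rate = 0.5
new_weight = weight - learning_rate * slope

print("slope:", slope)
print("cost before:", neuron_cost(weight, bias, x, target))
print("cost after: ", neuron_cost(new_weight, bias, x, target))
```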

A Bird's-Eye View

Here's how the cost function, the chain rule, gradient descent, and backpropagation of errors work together to enable the neural network to learn:

The cost function tells the network how wrong it is.
The chain rule enables the network to identify how much each weight contributes to the cost (error) and how much each weight needs to be adjusted.
Gradient descent tells the network the direction each weight needs to be adjusted to reduce the amount that weight contributes to the error.
Through backpropagation, the network works out the adjustments for the weights of the connections and the biases of the neurons one layer at a time, starting at the output layer and moving back through the hidden layers to the input layer.
The network recalculates its answer with the adjusted weights and biases, checks its answer against the correct answer, and backpropagates the error. The process continues until the network produces the correct answer with the desired level of certainty.
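The steps above can be sketched as a tiny training loop. This is a minimal illustration under stated assumptions, not a production implementation: the network has one input, one hidden neuron, and one output neuron, and it uses sigmoid activations, a squared-error cost, two made-up training examples, and an arbitrary learning rate. It also applies all of the adjustments together once the error has been passed back, which is a common arrangement.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Calculate the hidden activation and the network's answer."""
    w1, b1, w2, b2 = params
    hidden = sigmoid(w1 * x + b1)
    output = sigmoid(w2 * hidden + b2)
    return hidden, output


def total_cost(params, examples):
    """Cost function: how wrong the network is across all examples."""
    return sum((forward(params, x)[1] - y) ** 2 for x, y in examples)


def gradients(params, examples):
    """Backpropagation: apply the chain rule from the output layer back."""
    w1, b1, w2, b2 = params
    grads = [0.0, 0.0, 0.0, 0.0]  # same order as params
    for x, y in examples:
        hidden, output = forward(params, x)
        # Output layer first.
        delta_out = 2.0 * (output - y) * output * (1.0 - output)
        grads[2] += delta_out * hidden  # d cost / d w2
        grads[3] += delta_out           # d cost / d b2
        # Then the hidden layer, reusing the error passed back through w2.
        delta_hidden = delta_out * w2 * hidden * (1.0 - hidden)
        grads[0] += delta_hidden * x    # d cost / d w1
        grads[1] += delta_hidden        # d cost / d b1
    return grads


def train(params, examples, learning_rate=0.5, epochs=2000):
    """Gradient descent: step every weight and bias against its slope."""
    params = list(params)
    for _ in range(epochs):
        grads = gradients(params, examples)
        params = [p - learning_rate * g for p, g in zip(params, grads)]
    return params


examples = [(0.0, 0.0), (1.0, 1.0)]   # (input, correct answer), made up
start_params = [0.5, 0.0, 0.5, 0.0]   # w1, b1, w2, b2, made up
trained_params = train(start_params, examples)

print("cost before:", total_cost(start_params, examples))
print("cost after: ", total_cost(trained_params, examples))
```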

Stepping Back Through the Neural Network to Make Adjustments

With backpropagation, weights and biases are adjusted from the output layer back through the hidden layers to the input layer. For example, if the neural network has four layers — an input layer, an output layer, and two hidden layers — weights and biases are adjusted in the following order:

The network first adjusts the weights of the connections between the second hidden layer and the output layer and determines how those adjustments impact the output.
It then adjusts the connections between the first hidden layer and the second hidden layer and determines how those adjustments impact the output.
Finally, it adjusts the connections between the input layer and the first hidden layer and determines how those adjustments impact the output.

In other words, the network works out how to turn the dials, starting with the dials closest to the output and working back through the network. In the standard form of the algorithm, the network recalculates its output after all the dials have been turned rather than after each individual adjustment.

Using this technique, the network is able to reduce errors by fine-tuning the weights and biases one layer at a time.
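The backward order through a four-layer network can be sketched as follows. To keep the example short, it assumes one neuron per layer, sigmoid activations, and a squared-error cost, and the weights, biases, input, and target are made up. The list named visit_order is a hypothetical helper that simply records which set of connections is handled when.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, biases, x):
    """Return the activation of every layer, input layer first."""
    activations = [x]
    for w, b in zip(weights, biases):
        activations.append(sigmoid(w * activations[-1] + b))
    return activations


def backward(weights, activations, target):
    """Walk from the output layer back toward the input layer."""
    weight_grads = [0.0] * len(weights)
    bias_grads = [0.0] * len(weights)
    visit_order = []
    # Slope of the squared-error cost with respect to the final output.
    upstream = 2.0 * (activations[-1] - target)
    for layer in reversed(range(len(weights))):
        out = activations[layer + 1]
        delta = upstream * out * (1.0 - out)  # error at this layer's neuron
        weight_grads[layer] = delta * activations[layer]
        bias_grads[layer] = delta
        upstream = delta * weights[layer]     # error handed to the layer before
        visit_order.append(layer)
    return weight_grads, bias_grads, visit_order


# Index 0: input -> hidden 1, index 1: hidden 1 -> hidden 2,
# index 2: hidden 2 -> output.
weights = [0.4, -0.6, 0.9]
biases = [0.0, 0.1, -0.2]

activations = forward(weights, biases, x=1.0)
weight_grads, bias_grads, visit_order = backward(weights, activations, target=1.0)
print(visit_order)  # [2, 1, 0]
print(weight_grads)
```

Each layer reuses the error handed back from the layer after it, which is why the calculation runs from the output toward the input.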

A Cumulative Effect

Through the chain rule, backpropagation focuses less on individual adjustments and more on the cumulative effect of those adjustments. Therefore, it uses the following strategies to make adjustments:

Account for every layer an adjustment passes through. A change to a connection between the input layer and the first hidden layer is carried forward through every later layer, so its effect on the output is compounded by each of those layers. The chain rule calculates that compounded effect by multiplying the layer-by-layer rates of change. Imagine backpropagation and the chain rule applied to four megaphones aligned in a row. Cranking up the volume on the first megaphone changes what every later megaphone receives, so its effect on the output of the final megaphone is compounded by the three that follow it.
Make the biggest adjustments where the slope is steepest. Gradient descent changes each weight in proportion to its derivative, so a weight that contributes a lot to the cost gets a big twist of the dial and a weight that contributes little gets a small one. As the cost falls and the slopes flatten, the twists get smaller, which allows backpropagation to fine-tune the output of each layer.
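The megaphone picture can be turned into a few lines of Python. This sketch assumes that each megaphone simply multiplies the volume it receives by a gain, which is a simplification, and the gain values are made up. Under that assumption, the chain rule says the effect of turning any one dial equals the input volume multiplied by the gains of all the other megaphones.

```python
def final_volume(gains, input_volume=1.0):
    """Pass the sound through each megaphone in turn."""
    volume = input_volume
    for gain in gains:
        volume *= gain
    return volume


def effect_of_each_gain(gains, input_volume=1.0):
    """d(final volume) / d(gain i) = input volume * product of the other gains."""
    effects = []
    for i in range(len(gains)):
        others = 1.0
        for j, gain in enumerate(gains):
            if j != i:
                others *= gain
        effects.append(input_volume * others)
    return effects


gains = [2.0, 3.0, 1.5, 4.0]  # four megaphones in a row (made-up gains)
effects = effect_of_each_gain(gains)
print(final_volume(gains))  # 36.0
print(effects)              # [18.0, 12.0, 24.0, 9.0]
```

In this simplified model, how much a dial matters depends on every other dial in the chain, which is the cumulative effect that backpropagation accounts for.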

As you work with artificial neural networks and machine learning, keep in mind that no one factor is responsible for learning. Several elements must work together, including the cost function, the chain rule, gradient descent, and backpropagation of errors. Machine learning is essentially an exercise in testing answers and nudging the entire network toward reducing the likelihood of wrong answers.
