In my previous article, Neural Network Backpropagation, I explained the basics of how an artificial neural network learns. The cost function calculates how wrong the network's answer is, and then backpropagation goes back through the various layers of the neural network changing weights and biases to adjust the strengths of the connections between neurons and the calculations within certain neurons. I compare the process to that of turning the dials on a complex sound system to optimize the quality of sound being emitted by the speakers.

In that article, I focus mostly on how fine-tuning a neural network with backpropagation improves its accuracy, that is, its ability to arrive at the *correct* answer. Although coming up with the *correct* answer is certainly the top priority, backpropagation must also adjust the weights and biases to reduce the output that’s driving the *wrong* answers:

- **First**, it adjusts the connections and neurons that feed output to the neurons that are most confident in their wrong answers.
- **Then**, it moves on to adjust the inputs to neurons that are less confident in their wrong answers.

Keep in mind that the artificial neural network makes adjustments for every item in the training data set. In my previous post, I used an example of an artificial neural network for identifying dog breeds. To properly train the network to identify ten different dog breeds, the initial training data set would need at least one picture of a dog of each breed: German shepherd, Labrador retriever, Rottweiler, beagle, bulldog, golden retriever, Great Dane, poodle, Doberman, and dachshund.

This is a very small dataset, but what is important is that it contains at least one picture of each breed. If you fed the network only the picture of the beagle, you would end up with a neural network that classifies every image of a dog as a beagle. You need to feed the network more pictures — pictures of dogs of different breeds and pictures of different dogs of the same breed.

You would follow up the initial training with a test data set — a separate collection of at least one picture of a dog of each breed. During this test, you would feed a picture of a dog to the neural network without telling it the dog's breed and then check its answer. If the answer was wrong, you would correct the neural network, and it would use backpropagation to make additional adjustments. You would then feed it the next picture, and continue the process, working through the entire set of test data.

Even after testing, the neural network continues to learn as you feed it larger volumes of data. With every picture you feed into the network, it uses backpropagation to make tiny adjustments to weights and biases, fine-tuning its ability to distinguish between different breeds.

While the example from my previous post focused on improving the network's ability to identify a picture of a beagle, you want the network to achieve the same degree of accuracy for each and every breed. To achieve this feat, the neural network needs to make some trade-offs.

As you feed the network a diversity of pictures, it becomes a little *less* accurate in identifying beagles, so that it can do a better job identifying pictures of Labrador retrievers, Rottweilers, German shepherds, poodles, and so on. Your network tries to find the optimal weights and biases to minimize the cost regardless of the breed shown in the picture. The settings may not be the best for any one dog, but having well-balanced settings enables the network to make fewer mistakes, resulting in more accurate classification among different breeds of dogs overall.

The cost function, gradient descent, and backpropagation all work together to make this magic happen.

Although machine learning is a continuous process of optimizing accuracy, the goal of training and testing a neural network is to create a *model* — the product of the entire machine learning process that can be used to make predictions. Creating a model is actually a joint effort of humans and machines. A human being builds the artificial neural network and establishes hyperparameters to set the stage for the learning process; the machine then takes over, adjusting parameters, including weights and biases, to develop the desired skill.

As you begin to build your own neural networks, keep in mind that the process often involves some experimentation. Your first attempts may involve small experiments to test outcomes followed by adjustments to hyperparameters. In a sense, both you and the machine are involved in a learning process.

In my previous article, The Neural Network Cost Function, I describe the cost function and highlight the essential role it plays in machine learning. With the cost function, the machine pays a price for every mistake it makes. This provides the machine with a sort of incentive or motivation to learn; the machine's goal is to minimize the cost by becoming increasingly accurate.

Unfortunately, the cost function tells the network only how wrong it is; it doesn't provide a way for the network to become less wrong. This is where gradient descent comes into play. *Gradient descent* is an optimization algorithm that minimizes the cost by repeatedly and gradually adjusting the network's weights and biases in the direction opposite of that in which the slope of the cost curve increases, as shown here.

During the learning process, the neural network adjusts the weights of the connections between neurons, giving input from some neurons more or less emphasis than inputs from other neurons, as shown below. This is how the machine learns. With gradient descent, the neural network adjusts the initial weights a tiny bit at a time in the direction opposite of the steepest incline. The neural network performs this adjustment iteratively, continually pushing the weight down the slope toward the point at which it can no longer be moved downhill. This point is called a *local minimum*. It is the lowest cost the machine can reach from where it started, although a different starting point may lead to an even lower minimum elsewhere on the curve.
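The downhill walk described above can be sketched in a few lines. This is a minimal illustration, not a real network: the bowl-shaped `cost` function, its minimum at 3, the starting weight, and the learning rate are all assumptions chosen to keep the example small.

```python
def cost(w):
    """Hypothetical bowl-shaped cost with its minimum at w = 3."""
    return (w - 3.0) ** 2


def cost_slope(w):
    """Derivative of the cost above with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w_start, learning_rate=0.1, steps=100):
    """Repeatedly nudge w a small step against the slope."""
    w = w_start
    for _ in range(steps):
        w -= learning_rate * cost_slope(w)  # step in the downhill direction
    return w


w_final = gradient_descent(w_start=-4.0)
print(f"final weight: {w_final:.4f}, final cost: {cost(w_final):.6f}")
```

Each pass moves the weight a fraction of the way toward the bottom of the bowl, so the steps shrink as the slope flattens out.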

For example, suppose you are building a machine that can look at a picture of a dog and tell what breed it is. You would place a cost function at the output layer that would signal all the nodes in the hidden layer telling them how wrong the output was. The nodes in the hidden layer would then use gradient descent to move their outputs in the direction opposite of the steepest incline in an attempt to minimize the cost.

As the nodes make adjustments, they monitor the cost function to see whether the cost has increased or decreased and by how much, so they can determine whether the adjustments were beneficial or not. During this process, the neural network is learning from its mistakes. As the machine becomes more accurate and more confident in its output, the overall cost is diminished.

For example, suppose the neural network's output layer has 5 neurons representing five dog breeds — German shepherd, Doberman, poodle, beagle, and dachshund. The output neuron for the Doberman indicates a 40% probability the picture is of a Doberman; the German shepherd neuron is 35% sure it's a German shepherd; the poodle neuron is 25% sure it's a poodle; and the beagle and dachshund neurons each indicate a certainty of 15% the picture is of one of their breeds.

You already decided that you want the machine to be 90% certain in its analysis, so these numbers are not very good.

To improve the machine's accuracy, you can combine the cost function with gradient descent. With the cost function, the machine calculates the difference between each wrong answer and each correct answer and then averages them. So let’s say it was a picture of a Doberman. That means you want to nudge the network in a few places:

- +0.60 for the Doberman to get it to 1.0
- -0.35 for the German shepherd to get it to 0.0
- -0.25 for the poodle to get it to 0.0
- -0.15 for the beagle and dachshund to get those to 0.0

Then you want to average all of your nudges to get an overall sense of how accurate your network is at finding different dog breeds:

(+0.60 – 0.35 – 0.25 – 0.15 – 0.15)/5 = –0.30/5 = –0.06

But remember this is just one training example. The machine repeats this process on numerous pictures of dogs of different breeds:

(0.01 – 0.6 – 0.32 + 0.16 – 0.25)/5 = –1.00/5 = –0.20

(0.7 – 0.3 + 0.12 – 0.05 – 0.12)/5 = 0.35/5 = 0.07
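The averaging for the Doberman picture can be reproduced with a short script. This is only a sketch of the arithmetic: the `targets` list encodes the assumption that the correct breed should score 1.0 and every other breed 0.0, and the mean squared error at the end is an extra assumption added here (not part of the calculation above) to show one common way of keeping positive and negative nudges from cancelling each other out.

```python
breeds = ["Doberman", "German shepherd", "poodle", "beagle", "dachshund"]
outputs = [0.40, 0.35, 0.25, 0.15, 0.15]  # network's confidence per breed
targets = [1.0, 0.0, 0.0, 0.0, 0.0]       # the picture is a Doberman

# Nudge = how far each output must move to reach its target.
nudges = [t - o for t, o in zip(targets, outputs)]
mean_nudge = sum(nudges) / len(nudges)

# Assumption: squaring each nudge stops opposite signs from cancelling.
mean_squared_error = sum(n ** 2 for n in nudges) / len(nudges)

for breed, nudge in zip(breeds, nudges):
    print(f"{breed:16s} {nudge:+.2f}")
print(f"mean nudge: {mean_nudge:+.2f}")
print(f"mean squared error: {mean_squared_error:.3f}")
```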

With each iteration, the neural network calculates the cost and adjusts the weights, moving the network closer and closer to zero cost, the point at which the network has achieved optimum accuracy and you are confident in its output.

As you can see, the cost function and gradient descent make a powerful combination in machine learning, not only telling the machine when it has made a mistake and how far off it was, but also providing guidance on which direction to tune the network to increase the accuracy of its output.

Machines often learn the same way humans do — by making mistakes and paying the price for doing so. For example, when you’re first learning to drive, you merge onto the highway and are driving 55 mph in a 65 mph zone. Other drivers are beeping at you, passing you on the left and right, giving you dirty looks, and making rude gestures. You get the message and start driving the speed limit. Cars are still passing you on the left and right, and their drivers appear to be annoyed. You start driving 75 mph to blend in with the traffic. You are rewarded by feeling the excitement of driving faster and by reaching your destination more quickly. Soon, you are so comfortable driving 75 mph that you start driving 80 mph. One day, you hear a siren, and you see a state trooper’s car close behind you with a flashing red light. You get pulled over and issued a ticket for $200, so you slow it down and now routinely drive about 5 to 9 mph over the speed limit.

During this entire scenario, you learn through a process of trial and error by paying for your mistakes. You pay by being embarrassed for driving too slowly or you pay by getting pulled over and issued a warning or ticket or by getting into or causing an accident. You also learn by being rewarded, but since this article is about the cost function, I won’t get into that.

With machine learning, your goal is to make your machine as accurate as possible — whether the machine’s purpose is to make predictions, identify patterns in medical images, or drive a car. One way to improve accuracy in machine learning is to use a neural network cost function — a mathematical operation that compares the network’s output (the predicted answer) to the targeted output (the correct answer) to determine the accuracy of the machine.

In other words, the cost function tells the network how wrong it was, so the network can make adjustments to be less wrong (and more right) in the future. As a result, the network pays for its mistakes and learns by trial and error. The cost is higher when the network is making bad or sloppy classifications or predictions — typically early in its training phase.

Machines learn different lessons depending on the model. In a simple linear regression model, the machine learns the relationship between an independent variable and a dependent variable; for example, the relationship between the size of a home and its cost. With linear regression, the relationship can be graphed as a straight line, as shown in the figure.

During the learning process, the machine can adjust the model in several ways. It can move the line up or down, left or right, or change the line’s slope, so that it more accurately represents the relationship between home size and cost. The resulting model is what the machine learns. It can then use this model to predict the cost of a home when provided with the home’s size.
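Here is a rough sketch of a machine adjusting a line to fit home sizes and prices. The data is made up, and the adjustment method (small repeated corrections to the slope and intercept based on the mean squared error) is one common choice assumed for illustration, not the only way to fit a line.

```python
import numpy as np

# Made-up data: home size (thousands of sq ft) and price (thousands of dollars).
sizes = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
prices = np.array([150.0, 200.0, 250.0, 300.0, 350.0])

slope, intercept = 0.0, 0.0   # the line starts out flat
learning_rate = 0.05          # assumed step size

for _ in range(5000):
    predictions = slope * sizes + intercept
    errors = predictions - prices
    # Gradients of the mean squared error with respect to slope and intercept.
    slope -= learning_rate * 2.0 * np.mean(errors * sizes)
    intercept -= learning_rate * 2.0 * np.mean(errors)

predicted_price = slope * 2.2 + intercept
print(f"price = {slope:.1f} * size + {intercept:.1f}")
print(f"predicted price for a 2,200 sq ft home: {predicted_price:.1f}")
```

Changing `intercept` moves the line up or down, and changing `slope` tilts it. Together they are the whole "model" in this tiny case.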

The cost function has one major limitation — it does not tell the machine what to adjust, by how much, or in which direction. It only indicates the accuracy of the output. For the machine to be able to make the necessary adjustments, the cost function must be combined with another function that provides the necessary guidance, such as gradient descent, which just happens to be the subject of my next post.

Artificial neural networks learn through a combination of functions, weights, and biases. Each neuron receives weighted inputs from the outside world or from other neurons, adds bias to the sum of the weighted inputs, and then executes a function on the total to produce an output. During the learning process, the neural network weights are adjusted across the entire network to increase its overall accuracy in performing its task, such as deciding how likely a certain credit card transaction is fraudulent.
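The calculation inside a single neuron can be sketched as follows. The feature values, weights, and bias are hypothetical, and the sigmoid is just one possible choice for the function applied to the total.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs, plus bias, passed through a function."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(total)


# Hypothetical transaction features and hand-picked weights.
features = [0.9, 0.2, 0.7]
weights = [1.5, -0.8, 0.6]
bias = -1.0

fraud_score = neuron_output(features, weights, bias)
print(f"fraud score: {fraud_score:.3f}")
```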

Imagine weights and biases as dials on a sound system. Just as you can turn the dials to control the volume, balance, and tone to produce the desired sound quality, the machine can adjust its dials (weights and biases) to fine-tune its accuracy. (For more about functions, weights, and bias, see my previous article, Functions, Weights, and Bias in Artificial Neural Networks.)

When you’re setting up an artificial neural network, you have to start somewhere. You could start by cranking the dials all the way up or all the way down, but then you would have too much symmetry in the network, making it more difficult for the network to learn. Specifically, if neighboring nodes in the hidden layers of the neural network are connected to the same inputs and those connections have identical weights, the learning algorithm is unable to adjust the weights, and the model will be stuck — no learning will occur.
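The symmetry problem can be seen in a tiny example. This sketch assumes a made-up network with two inputs, two hidden neurons, and one output, sigmoid activations, and a squared-error cost. Because both hidden neurons start with identical weights, they receive exactly the same gradient, so any update keeps them copies of each other.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


x = np.array([0.3, 0.8])        # one hypothetical training input
target = 1.0                    # its correct output

W1 = np.full((2, 2), 0.5)       # two hidden neurons, identical weights
w2 = np.full(2, 0.5)            # output neuron weights, also identical

# Forward pass.
hidden = sigmoid(W1 @ x)
output = sigmoid(w2 @ hidden)

# Backward pass for the cost 0.5 * (output - target) ** 2.
d_output = (output - target) * output * (1.0 - output)
d_hidden = d_output * w2 * hidden * (1.0 - hidden)
grad_W1 = np.outer(d_hidden, x)  # one row of gradients per hidden neuron

print(grad_W1)  # both rows are the same
```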

Instead, you want to assign different values to the weights — typically small values, close to zero but not zero. (By default, the bias in each neuron is set to zero. The network can dial up the bias during the learning process and then dial it up or down to make additional adjustments.)

In the absence of any prior knowledge, a plausible solution is to assign totally random values to the weights. Techniques for generating random values include the following:

- Orthogonal random matrix initialization
- RandomNormal
- RandomUniform
- Zero-mean Gaussian

For now just think of random values as unrelated weights between zero and one but closer to zero. What’s important is that these random values provide a starting point that enables the network to adjust weights up and down to improve the artificial neural network’s accuracy. The network can also make adjustments by dialing the bias within each neuron up or down.
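As a rough illustration of random starting weights, the sketch below uses NumPy's random number generator rather than any particular deep learning library. The layer sizes, the seed, and the spread of 0.05 are assumptions. Note that a zero-mean Gaussian produces small negative values as well as small positive ones.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed chosen only for repeatability

n_inputs, n_hidden = 4, 3             # hypothetical layer sizes

# Zero-mean Gaussian weights with a small spread, so values sit near zero.
weights_normal = rng.normal(loc=0.0, scale=0.05, size=(n_hidden, n_inputs))

# Uniform alternative: small values drawn evenly from a narrow range.
weights_uniform = rng.uniform(low=-0.05, high=0.05, size=(n_hidden, n_inputs))

# Biases start at zero.
biases = np.zeros(n_hidden)

print(weights_normal)
print(weights_uniform)
print(biases)
```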

For an artificial neural network to learn, it requires a *machine learning algorithm* — a process or set of procedures that enables the machine to create a model that can process the data input in a way that achieves the network’s desired objective. Algorithms come in two types:

- **Deterministic**: Every time the algorithm is given the same problem, it takes the same steps in the same sequence to solve it, and produces the same outcome. An example of a deterministic algorithm is the sort feature in a word processor. Every time you use the feature to sort a list, it takes the same steps to arrange the items in the same order.
- **Non-deterministic**: Every time the algorithm is given the same problem, it takes the steps in a different sequence, which may produce a slightly different outcome. An example of a non-deterministic algorithm is an electronic card game that shuffles the cards before dealing them. The cards must be shuffled in a way that places them in a random order, so players cannot “guess” the order of the cards.

As a rule of thumb, use deterministic algorithms to solve problems with concrete answers, such as determining which route is shortest in a GPS program. Use non-deterministic algorithms when an approximate answer is good enough and too much processing power and time would be required for the computer to arrive at a more accurate answer or solution.

An artificial neural network uses a non-deterministic algorithm, so the network can experiment with different approaches and then adjust accordingly to optimize its accuracy.
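The difference between the two types of algorithm can be sketched with Python's standard library. The card names are arbitrary, and the seeded shuffle at the end is an added aside: fixing the seed makes a random procedure repeatable, which can be handy when you want to rerun an experiment.

```python
import random

cards = ["ace", "king", "queen", "jack", "ten"]

# Deterministic: sorting the same list always gives the same order.
first_sort = sorted(cards)
second_sort = sorted(cards)

# Non-deterministic: each shuffle can give a different order.
shuffled = cards.copy()
random.shuffle(shuffled)

# With a fixed seed, the "random" steps become repeatable.
seeded_a = cards.copy()
random.Random(7).shuffle(seeded_a)
seeded_b = cards.copy()
random.Random(7).shuffle(seeded_b)

print(first_sort == second_sort)  # same input, same result
print(shuffled)                   # order may change from run to run
print(seeded_a == seeded_b)       # same seed, same shuffle
```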

Suppose you are training an artificial neural network to distinguish among different dog breeds. As you feed your training data (pictures of dogs and labels of breeds) into the network, it adjusts the weights and biases to identify a relationship between each picture and label (dog breed), and it begins to distinguish between different breeds. Early in training, it may be a little unsure whether the dog in a certain picture is one breed or another. It may indicate that it’s 40% sure it’s a beagle, 30% sure it’s a dachshund, 20% sure it’s a Doberman, and 10% sure it’s a German shepherd.

Suppose it is a dachshund. You correct the machine, it adjusts the weights and biases, and tries again. This time, the machine indicates that it’s 80% sure it’s a dachshund, and 20% sure it’s a beagle. You tell the machine it is correct, and no further adjustment is needed. (Of course, the machine may need to make further adjustments later if it makes another mistake.)

The good news is that during the machine learning process, the artificial neural network does most of the heavy lifting. It turns the dials up and down to make the necessary adjustments. You just need to make sure that you give it a good starting point by assigning random weights and that you continue to feed it relevant input to enable it to make further adjustments.

In my previous article Artificial Neural Networks Regression and Classification, I introduced the three types of problems that machine learning is generally used to solve:

- Classification
- Regression
- Clustering

In that article, I focus on solving classification and regression problems. In this article, I turn my attention to neural network clustering problems — problems that can be solved by identifying common patterns among inputs.

Clustering has numerous applications in a wide variety of fields. Here are a few examples of how clustering may be used:

- In **biology**, clustering of genetic patterns can provide insight into how different organisms are related in terms of evolution.
- In **medicine**, clustering can be used to analyze patterns of antibiotic resistance among different bacteria and to identify patterns in x-ray images that signal a higher risk of certain diseases, as well as genetic patterns that may be at the root of certain hereditary illnesses.
- **Businesses** may apply clustering to *market segmentation*, aggregating prospective buyers into different groups so the business can more effectively target its marketing efforts, as shown below.
- In **social networks**, clustering can be used to identify “similar” communities within the social network and to introduce members who have shared interests.
- **Search engines** use clustering to present relevant results.
- **Law enforcement** can use clustering to identify locations that have a greater frequency of certain types of crime and to analyze online communications for patterns that may be related to a potential terrorist attack.
- **Educational institutions** may use clustering to identify conditions that place students at a greater risk of poor performance.

Unlike classification and regression problems, which employ supervised learning, clustering problems rely on unsupervised learning. With supervised learning, you have clearly labeled data or categories that you are trying to match inputs to. For example, you may want to classify homes by price or classify transactions as fraudulent or honest.

Unfortunately, supervised learning is not always an option. For example, if you do not have clearly labeled data or know the categories into which you want to sort the data inputs, you cannot engage your artificial neural network in supervised learning. In other applications, you may not be interested in classifying your data into categories created by humans; instead, you want to see how your neural network clusters the data to call your attention to patterns you may never have thought to look for.

In such cases, unsupervised learning is the better choice. With unsupervised learning, you let the neural network cluster your data into different groups.

One of the more interesting applications of clustering is its use by large retailers to decide whom to invite to their loyalty programs or when to offer promotions. With unsupervised learning, the machine may identify three clusters of customers: loyal, somewhat loyal, and not loyal. (The not loyal customers always buy from whichever retailer offers the lowest price.) Knowing these clusters, the large retailers create strategies to try to elevate somewhat loyal customers to loyal customers. Or they could invite their loyal customers to participate in special promotions.
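To make the idea of grouping unlabeled customers concrete, here is a minimal sketch that swaps in plain k-means clustering, a simple non-neural technique, in place of a neural network. The customer figures are made up, the two features are assumptions, and the starting centroids are picked by hand to keep the run repeatable (real runs usually start from random points).

```python
import numpy as np


def k_means(points, initial_centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recenter."""
    centroids = initial_centroids.astype(float)
    for _ in range(iterations):
        # Distance from every point to every centroid.
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return labels, centroids


# Made-up customers: (purchases per year, share of purchases at full price).
customers = np.array([
    [40, 0.90], [45, 0.85], [50, 0.95],   # frequent, mostly full-price buyers
    [20, 0.50], [22, 0.55], [18, 0.45],   # occasional buyers
    [3, 0.05], [5, 0.10], [4, 0.00],      # rare, discount-only buyers
])

# Simplification: start from three customers picked by hand.
labels, centroids = k_means(customers, customers[[0, 3, 6]])
print(labels)
```

The algorithm never sees the comments describing each group. It only sees the numbers, and the cluster numbers it returns carry no meaning until a person interprets them.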

Other companies use clustering to decide where to place new stores. For example, a seller of athletic footwear may feed demographic and sales data into an artificial neural network to find locations that have the highest concentration of active runners or locations where customers allocate a higher percentage of their income to outdoor recreation.

When you decide to use machine learning to solve a problem, what is most important is that you choose the right approach for the problem you are trying to solve. Classification is great when you know what you are looking for and can teach the machine the relationship between inputs and labels or between independent variables and a dependent variable. Clustering is a more powerful tool for gaining insight: for seeing things in a way you may never have considered, or for analyzing a high volume of unlabeled data. After all, there is much more unlabeled (unstructured) data available than there is labeled (structured) data.

When you’re trying to decide which approach to take — classification, regression, or clustering — first ask yourself what problem you’re trying to solve or what question you need to answer. Then ask yourself whether the problem or question is something that can best be addressed with classification, regression, or clustering. Finally, ask yourself whether the data you have is labeled or unlabeled. By answering these questions, you should have a clearer idea of which approach to take: classification or regression (with supervised learning) or clustering (with unsupervised learning).

Unlike human beings, who often learn for the intrinsic value of knowing something, machine learning is almost always purpose-driven. Your job as the machine's developer is to determine what that purpose is *before* you start development. With neural network regression and classification, you then need to decide on the capability that will serve that purpose best. Ask yourself, “Am I looking at a classification problem, a regression problem, or a clustering problem?” Those are the three things artificial neural networks do best: classification, regression, and clustering. Here’s how you choose:

1. **Classification** is best when you need to assign inputs to known (labeled) categories. There are two types of classification:

   - *Binary classification* is used when you have only two categories; for example, a machine learning algorithm may create a line graph that distinguishes between dogs and cats based on their size and type, as shown below.
   - *Multi-class classification* is used when you have three or more categories; for example, if you need a system that can place customers in four categories: very dissatisfied, dissatisfied, satisfied, and delighted.

2. **Regression** is best when you need to predict a *continuous response value* — a variable that can take on any value between its minimum and maximum value; for example, if you need a system that can predict the value of a home based on certain criteria, such as square footage, location, number of bedrooms and bathrooms, and so on.

3. **Clustering** is the right choice when you want to identify patterns in the data and have no idea what those patterns may be; for example, if you want to identify patterns among loyal, somewhat loyal, and un-loyal customers.

In this article, you gain a deeper understanding of classification and regression. In my next article, I focus on clustering problems. But first, let's take a look at how the approach to machine learning differs based on the type of problem you are trying to solve.

Classification and regression problems involve *supervised* learning, which uses training data to teach the machine (the artificial neural network) how to associate inputs with outputs. For example, you may feed the machine a picture of a cat and tell it, "This is a cat." You feed it a picture of a dog and tell it, "This is a dog." Then, you feed the machine test data; for example, a picture of a cat without telling the machine what the animal in the picture is, and the machine should be able to tell you it's a cat. If the machine gives the incorrect answer, you correct it, and the machine makes adjustments to improve its accuracy.

Clustering problems are in the realm of *unsupervised* learning. You feed the machine data inputs without labels, and the machine identifies common patterns among the inputs without labeling those patterns.

For more about supervised and unsupervised learning, see my previous post Supervised Versus Unsupervised Learning.

Classification is one of the most common ways to use an artificial neural network. For example, credit card companies use classification to detect and prevent fraudulent transactions. The human trainer will feed the machine an example of a fraudulent transaction and tell the machine, "This is fraud." The trainer then feeds the machine an example of an honest transaction and tells the machine, "This is not fraud." As the trainer feeds more and more labeled data into the machine, it learns the patterns in the data that distinguish fraudulent transactions from honest transactions.

The machine may be set up with three output nodes (one for each class). If a transaction is highly characteristic of fraud, the Fraud neuron fires to cancel the transaction and suspend the card. If a transaction is less characteristic of fraud, the Maybe Fraud neuron fires to notify the cardholder of suspicious activity. If the transaction is even less characteristic of fraud, the Not Fraud neuron fires and the transaction is processed.
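The three-node output described above might be turned into an action like this. The output values are hypothetical, and picking the neuron with the strongest output is one simple decision rule assumed for the sketch.

```python
def choose_action(outputs):
    """Pick the class whose output neuron fired most strongly."""
    actions = {
        "fraud": "cancel transaction and suspend card",
        "maybe_fraud": "notify cardholder of suspicious activity",
        "not_fraud": "process transaction",
    }
    predicted_class = max(outputs, key=outputs.get)
    return predicted_class, actions[predicted_class]


# Hypothetical output-layer activations for one transaction.
outputs = {"fraud": 0.08, "maybe_fraud": 0.71, "not_fraud": 0.21}
predicted_class, action = choose_action(outputs)
print(predicted_class, "->", action)
```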

In regression problems, the machine tries to come up with an approximation based on the data input. During the training session, instead of showing the machine how inputs are connected to labels, you show the machine the connection between a known outcome and the variables that impact that outcome. For example, the amount of time it takes to drive home from work varies depending on weather conditions, traffic conditions, and the time of day, as shown below.
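The commute example can be sketched as a small regression. To stay short, this sketch swaps in ordinary least squares from NumPy in place of a neural network. The commute records are made up, and the way each condition is encoded as a number is an assumption.

```python
import numpy as np

# Made-up commutes: [rain (0 or 1), traffic level (0-10), rush hour (0 or 1)].
conditions = np.array([
    [0, 2, 0],
    [1, 5, 1],
    [0, 8, 1],
    [1, 3, 0],
    [0, 4, 1],
    [1, 9, 1],
], dtype=float)
minutes = np.array([24.0, 45.0, 46.0, 31.0, 38.0, 53.0])  # known outcomes

# Add a column of ones so the fit can learn a baseline commute time.
design = np.column_stack([conditions, np.ones(len(conditions))])
coefficients, *_ = np.linalg.lstsq(design, minutes, rcond=None)

# Predict a new commute: rainy, moderate traffic, rush hour (plus the baseline 1).
tonight = np.array([1.0, 6.0, 1.0, 1.0])
predicted_minutes = float(tonight @ coefficients)
print(f"predicted commute: {predicted_minutes:.1f} minutes")
```

The output is a number of minutes rather than a category, which is what separates this from the fraud example.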

A stock price predictor would be another example of machine learning used to solve a regression problem. The stock price would be the dependent variable and would be driven by a host of independent variables, including earnings, profits, future estimated earnings, a change of management, accounting errors or scandals, and so forth.

One way to look at the difference between classification and regression is that with classification the output is a class label, whereas with regression the output is a continuous value, such as an estimated price or travel time.

In my next article, I examine an entirely different type of problem — those that can be solved not by classification or regression but by clustering.

In a previous article What Is Machine Learning? I define machine learning as "the science of getting computers to perform tasks they weren't specifically programmed to do." So what is Deep Learning? Deep learning is a subset of machine learning (ML), which is a subset of artificial intelligence (AI):

- *Artificial intelligence* is any technique or combination of techniques that enables computers to simulate human intelligence. Techniques may include logic (if-then statements), rules, decision trees, and machine learning, to name a few.
- *Machine learning* is a subset of AI that relies on statistical approaches to enable machines to perform a function on the data they are fed and then become progressively better at performing that function. An example of a machine learning product is a system that recommends products by associating a shopper's behaviors with those of other shoppers with similar interests and purchase histories.
- *Deep learning* is a subset of machine learning that involves the use of multi-layered artificial neural networks with a structure and function similar to biological brains, enabling the system to train itself to perform and improve at complex tasks. For example, Google's DeepMind created a program called AlphaGo, built on deep neural networks, that learned to play the board game Go by studying games played by human experts and by playing against itself.

In 1958 Cornell professor Frank Rosenblatt created an early version of an artificial neural network composed of interconnected perceptrons. Like the nodes in modern artificial neural networks, a perceptron takes in binary inputs and performs a calculation on those inputs to produce an output, as presented below. Note that with a perceptron both the inputs and outputs are binary — for example, zero/one, on/off, in/out.

Rosenblatt's machine, the Mark I Perceptron, had small cameras and was designed to learn how to tell the difference between two images. Unfortunately, it took thousands of tries, and even then the Mark I had difficulty distinguishing even basic images. In other words, the Mark I Perceptron wasn't a very good student. It could not develop a skill that is relatively easy for humans to learn.

The Mark I Perceptron had a couple of flaws: it had only one layer of perceptrons, and the perceptrons were equipped with binary functions. As a result, this artificial neural network could solve only linear problems and had no easy and effective way to adjust the strength of the connections between neurons, which is required for learning to take place.
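The "linear problems only" limitation can be demonstrated with a single software perceptron. This sketch uses the classic perceptron learning rule as an assumption for illustration; it is not a description of how the Mark I hardware worked. The perceptron learns the AND function, which a straight line can separate, but no setting of its weights gets every XOR case right.

```python
def predict(weights, bias, inputs):
    """Binary perceptron: fire (1) only if the weighted sum exceeds zero."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(examples, epochs=25):
    """Perceptron rule: shift the weights only when the answer is wrong."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            weights = [w + error * x for w, x in zip(weights, inputs)]
            bias += error
    return weights, bias


def accuracy(examples, weights, bias):
    hits = sum(predict(weights, bias, x) == t for x, t in examples)
    return hits / len(examples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_accuracy = accuracy(AND, *train_perceptron(AND))
xor_accuracy = accuracy(XOR, *train_perceptron(XOR))
print(f"AND accuracy: {and_accuracy:.2f}")  # linearly separable
print(f"XOR accuracy: {xor_accuracy:.2f}")  # not linearly separable
```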

These problems were solved primarily by the introduction of hidden layers in the mid-1980s by Carnegie Mellon professor Geoff Hinton and by replacing binary functions with the sigmoid function, which increased the variation in outputs while limiting those variations between zero and one.
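The difference between a binary function and the sigmoid function is easy to see side by side. In this small sketch, the input values are arbitrary, and the binary function is assumed to be a simple step at zero.

```python
import math


def step(z):
    """Binary activation: the output jumps straight from 0 to 1."""
    return 1 if z > 0 else 0


def sigmoid(z):
    """Smooth activation: outputs vary continuously between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


for z in [-4.0, -1.0, -0.1, 0.1, 1.0, 4.0]:
    print(f"z = {z:+.1f}  step = {step(z)}  sigmoid = {sigmoid(z):.3f}")
```

With the step function, a small change in input either does nothing or flips the output completely. With the sigmoid, a small change in input produces a small change in output, which gives the network something gradual to adjust.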

These additions enabled the artificial neural network to tackle much more complicated challenges. However, these early artificial neural networks continued to struggle; they were slow, having to review a problem several times before becoming "smart" enough to solve it.

Later, in the 1990s, Hinton started working in a new field called *deep learning* — an approach that added many more hidden layers between the input and output layers of the neural network.

The hidden layers of a neural network function like a black box, swirling together computation and data to find answers and solutions. No human knows how the network arrives at its decision. For example, in 2012, the Google Brain project wanted to see how a deep learning neural network might perceive video data. Developers fed 10 million random images from YouTube videos into a network that had over 1 billion neural connections running on 16,000 processors. They didn’t label any of the data. So the network didn’t know what it meant to be a cat, a human being, or a car. Instead, the network just looked through the images and came up with its own clusters.

It found that many of the videos contained a very similar cluster. To the network this cluster looked like this.

Now as a human being you might recognize this as the face of a cat, but to the neural network this was just a very common pattern that it recognized in many of the videos. In a sense it invented its own interpretation of a cat. After performing this exercise, the network was able to identify a cat in an image 74.8% of the time.

While it is certainly intriguing to see an artificial neural network recognize objects without ever being trained to do so, the real mystery is how the network accomplishes such a feat. We know that the machine adjusts strengths of the connections between neurons, but we cannot describe the "thought processes" in a way that supports any of the conclusions the machine draws.

The black box nature of hidden layers is important to keep in mind when designing artificial neural networks, because you may be "flying blind" when you're developing your initial design. Success depends a great deal on taking an empirical approach — trying different arrangements of neurons, starting with different weights and biases, trying different activation functions, and then looking at the results and making adjustments.

In a previous article, Neural Network Backpropagation, I define backpropagation as a machine-learning technique used to calculate the gradient of the cost function at output and distribute it back through the layers of the artificial neural network, providing guidance on how to adjust the weights of the connections between neurons and the biases within certain neurons to increase the accuracy of the output.

Imagine backpropagation as creating a feedback loop in the neural network. As the machine tries to match inputs to labels, the cost function tells the neural network how wrong its answer is. Backpropagation then feeds the error back through the network to make adjustments via *gradient descent*, an optimization algorithm that minimizes cost by repeatedly and gradually adjusting the weights in the direction in which the cost decreases, as shown below. The goal is to reach the *global cost minimum*: the point where the slope of the cost curve is as close to zero as possible.
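As a rough illustration of gradient descent, the sketch below walks a single weight down a bowl-shaped cost curve. The cost function, starting weight, and learning rate are all assumptions chosen for the example; a real network has a far more complicated cost surface.

```python
def cost(w):
    # Hypothetical bowl-shaped cost with its minimum at w = 3
    return (w - 3.0) ** 2

def slope(w):
    # Derivative of the cost with respect to w (the slope of the tangent)
    return 2.0 * (w - 3.0)

w = 0.0               # assumed starting weight
learning_rate = 0.1   # assumed step size
history = [cost(w)]

for _ in range(50):
    # Move the weight a small step in the direction that lowers the cost
    w -= learning_rate * slope(w)
    history.append(cost(w))

print(f"final weight: {w:.4f}, final cost: {cost(w):.2e}")
```

Each step is proportional to the slope, so the steps shrink as the curve flattens out near the minimum.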

Backpropagation relies on the *chain rule* from calculus, a technique used to find the derivative of cost with respect to any variable in a nested equation. In math, a derivative expresses the rate of change at any given point on a graph; it is the slope of the tangent at that point. A *tangent* is a line that just touches a curve at a given point. As shown in the image above, the dotted line labeled "Gradient" and the solid line labeled "Global Cost Minimum" both represent tangents; their slopes represent derivatives.

Note that gradient descent gradually moves the weight of a connection from a point where the slope of the tangent is very steep to a point where the slope is nearly flat. The chain rule can be used to calculate the derivative of cost with respect to any weight in the network. This enables the network to identify how much each weight contributes to the cost (and causes the wrong answer) and whether that weight needs to be increased or decreased (and by how much) to reduce the cost (and improve the odds of a right answer).
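To make the chain rule concrete, here is a small sketch for a single sigmoid neuron with a squared-error cost. It multiplies the three nested derivatives together and checks the result against a numerical estimate. The input, bias, target, and weight values are assumptions picked for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_for(w, x=1.5, b=0.1, y=1.0):
    """Squared-error cost of one neuron: (sigmoid(w*x + b) - y)^2."""
    a = sigmoid(w * x + b)
    return (a - y) ** 2

def chain_rule_gradient(w, x=1.5, b=0.1, y=1.0):
    """d(cost)/dw = d(cost)/da * da/dz * dz/dw."""
    z = w * x + b
    a = sigmoid(z)
    dcost_da = 2.0 * (a - y)   # how the cost changes with the neuron's output
    da_dz = a * (1.0 - a)      # derivative of the sigmoid function
    dz_dw = x                  # how the weighted sum changes with the weight
    return dcost_da * da_dz * dz_dw

w = 0.4
analytic = chain_rule_gradient(w)

# Numerical check: nudge the weight slightly in both directions
eps = 1e-6
numeric = (cost_for(w + eps) - cost_for(w - eps)) / (2 * eps)

print(f"chain rule: {analytic:.6f}  numerical: {numeric:.6f}")
```

The sign of the result indicates whether the weight should go up or down; a negative derivative here means that increasing the weight reduces the cost.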

Here's how the cost function, the chain rule, gradient descent, and backpropagation of errors work together to enable the neural network to learn:

- The cost function tells the network how wrong it is.
- The chain rule enables the network to identify how much each weight contributes to the cost (error) and how much each weight needs to be adjusted.
- Gradient descent tells the network the direction each weight needs to be adjusted to reduce the amount that weight contributes to the error.
- Through backpropagation, the network adjusts the weights of the connections and the bias of certain neurons one layer at a time, starting at the output layer and moving back through the hidden layers to the input layer.
- The network recalculates its answer with the adjusted weights and biases, checks its answer against the correct answer, and backpropagates the error. The process continues until the network produces the correct answer with the desired level of certainty.
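The steps above can be sketched as a loop for a single neuron. This is a simplified illustration under assumed values (one training example, a sigmoid activation, a squared-error cost, and a learning rate of 1.0), not a full network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, target = 1.0, 1.0       # one hypothetical training example
w, b = -0.5, 0.0           # arbitrary starting weight and bias
learning_rate = 1.0        # assumed step size
desired_certainty = 0.9    # stop once the output reaches this level

epochs = 0
output = sigmoid(w * x + b)
while output < desired_certainty and epochs < 10_000:
    # 1. Cost function: how wrong is the answer?
    #    (this is the derivative of 0.5 * (output - target)^2)
    error = output - target
    # 2. Chain rule: how much do the weight and bias contribute to the cost?
    dz = error * output * (1.0 - output)
    grad_w, grad_b = dz * x, dz
    # 3. Gradient descent: step each value against its gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    # 4. Recalculate the answer with the adjusted weight and bias
    output = sigmoid(w * x + b)
    epochs += 1

print(f"reached output {output:.3f} after {epochs} passes")
```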

With backpropagation, weights and biases are adjusted from the output layer back through the hidden layers to the input layer. For example, if the neural network has four layers — an input layer, an output layer, and two hidden layers — weights and biases are adjusted in the following order:

- The network first adjusts the weights of the connections between the second hidden layer and the output layer and determines how those adjustments impact the output.
- It then adjusts the connections between the first hidden layer and the second hidden layer and determines how those adjustments impact the output.
- Finally, it adjusts the connections between the input layer and the first hidden layer and determines how those adjustments impact the output.

In other words, the network works out how to turn the dials, starting with the dials closest to the output and working back through the network, and then tests the output again with the new settings before the next round of adjustments.

Using this technique, the network is able to reduce errors by fine-tuning the weights and biases one level at a time.
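Here is a minimal sketch of that output-to-input ordering for a four-layer network. To keep it short, it assumes just one neuron per layer, a sigmoid activation, and a cost of 0.5 * (output - target)^2; the starting weights and learning rate are made up. In this common formulation, the gradients for all layers are computed in a single backward pass and then applied together.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases):
    """Return the activation of every layer, input first."""
    activations = [x]
    for w, b in zip(weights, biases):
        activations.append(sigmoid(w * activations[-1] + b))
    return activations

def backward(activations, weights, target):
    """Walk from the output layer back to the input layer, collecting gradients."""
    n = len(weights)
    weight_grads, bias_grads, order = [0.0] * n, [0.0] * n, []
    out = activations[-1]
    # Error signal at the output layer
    delta = (out - target) * out * (1.0 - out)
    for layer in reversed(range(n)):
        order.append(layer)
        weight_grads[layer] = delta * activations[layer]
        bias_grads[layer] = delta
        if layer > 0:
            # Chain rule: pass the error signal back through this connection
            a_prev = activations[layer]
            delta = delta * weights[layer] * a_prev * (1.0 - a_prev)
    return order, weight_grads, bias_grads

weights = [0.5, -0.3, 0.8]   # input->hidden 1, hidden 1->hidden 2, hidden 2->output
biases = [0.0, 0.0, 0.0]
x, target = 1.0, 1.0         # one hypothetical training example
learning_rate = 0.5

initial_cost = 0.5 * (forward(x, weights, biases)[-1] - target) ** 2
for _ in range(200):
    acts = forward(x, weights, biases)
    order, wg, bg = backward(acts, weights, target)
    weights = [w - learning_rate * g for w, g in zip(weights, wg)]
    biases = [b - learning_rate * g for b, g in zip(biases, bg)]
final_cost = 0.5 * (forward(x, weights, biases)[-1] - target) ** 2

print("layers visited:", order)
print(f"cost went from {initial_cost:.4f} to {final_cost:.4f}")
```

The `order` list records the connections in the sequence they are visited: the last set of weights first, the first set of weights last.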

Through the chain rule, backpropagation focuses less on individual adjustments and more on the cumulative effect of those adjustments. Therefore, it uses the following strategies to make adjustments:

- Make adjustments from the front of the neural network back, because adjustments between the input layer and the first hidden layer will have a larger cumulative effect than adjustments between higher layers. Imagine backpropagation and the chain rule applied to four megaphones aligned in a row. Cranking up the volume on the first megaphone will have a bigger cumulative effect on the output in the final megaphone than if you were to merely crank up the volume on the final megaphone.
- First adjust the weights that have the most room for adjustment. A big twist of a dial is more likely to have a greater impact on the output than a small twist of the dial. By making big changes first followed by smaller changes, backpropagation can more effectively fine-tune the output of each layer.

As you work with artificial neural networks and machine learning, keep in mind that no one factor is responsible for learning. Several elements must work together, including the cost function, the chain rule, gradient descent, and backpropagation of errors. Machine learning is essentially an exercise in testing answers and nudging the entire network toward reducing the likelihood of wrong answers.

As I explain in my previous article, Machine Learning Gradient Descent, machine learning requires the use of a cost function along with gradient descent. As the machine learns to perform a task, the cost function tells the machine how wrong its output is, and gradient descent provides a way for the machine's neural network to adjust the strength of the connections between neurons to improve the machine's accuracy.

As you might imagine, this learning process can consume a lot of memory and processing power, especially when the neural network is trying to process hundreds or thousands of inputs. Just imagine your own brain trying to identify a suspect. A detective shows you 1,000 mug shots and asks you to point out the person who most closely resembles the person who committed the crime. You would need to remember all of those images and then go through them several times to pick out what you considered the closest match.

In the same way, if you feed 1,000 inputs into a neural network, it must store all those inputs in memory and then figure out, based on the collective inputs, how to adjust the weights and biases of all its neurons to arrive at the most accurate outputs for each and every one of those inputs.

To alleviate this processing burden, especially when dealing with massive datasets, data science teams will feed the machine one data input at a time, a technique referred to as *stochastic gradient descent*, or feed the machine a small batch of data inputs at a time, a technique referred to as *mini-batch gradient descent* (often loosely called *batch gradient descent*, although strictly that term means processing the entire data set as one batch).

Let's return to the example of the mug shots. Suppose that instead of showing you 1,000 mug shots at once, the detective broke them down into sets of 100 and asked you to point out, in each set, the person who most closely resembles the person who committed the crime. The process of picking out one person among 100 would be much easier than having to pick out one person among 1,000. You could then narrow down the list to 10 possibilities and pick one from the 10.

Stochastic means randomly determined. In statistics, this term often describes techniques used to analyze randomly selected data to approximate a distribution pattern. The goal is to arrive at a quick approximation instead of spending a lot of time establishing a precise pattern.

Suppose we build a neural network for classifying fruit. We have 1,000 pictures of different fruits — apples, pears, bananas, different types of melons, grapes, papaya, and so forth. We could feed the machine the entire batch of 1,000 pictures all at once, but the machine might take a very long time examining all those pictures and making all the necessary adjustments.

To speed the process, we could use mini-batch gradient descent, the small-batch approach described above. For example, we could shuffle the pictures to introduce some randomness and then divide them into 10 batches of 100 pictures each. We would then feed the pictures into the machine one batch at a time, so the network would have fewer pictures to process at one time. More importantly, the neural network would have far fewer adjustments to make at one time.
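The shuffle-and-split step might look something like the following sketch. The file names and the `make_batches` helper are hypothetical, and the fixed seed is there only to make the shuffle repeatable.

```python
import random

def make_batches(items, batch_size, seed=0):
    """Shuffle a copy of the items, then slice it into equal-sized batches."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]

# Hypothetical file names standing in for the 1,000 fruit pictures
pictures = [f"fruit_{i:04d}.jpg" for i in range(1000)]
batches = make_batches(pictures, batch_size=100)

print(len(batches), "batches of", len(batches[0]), "pictures each")
# prints: 10 batches of 100 pictures each
```

Setting `batch_size=1` would give the one-input-at-a-time approach, and setting it to the size of the whole data set would put every picture in a single batch.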

The benefit of stochastic gradient descent is that the network processes smaller batches of inputs much faster and consumes significantly less processing power than if it had to process all inputs at once. As a result, stochastic gradient descent is especially useful for massive datasets that the network can’t store in its memory at any one time. Potential drawbacks include the following:

- The neural network's accuracy may vary considerably with each batch. It may be highly accurate when processing one batch of inputs and not so accurate when processing others.
- You have to do ten training sessions instead of feeding the network all 1,000 pictures in a single session.

The key thing to remember is to not be overconfident with the results from each of these batches. You can get accurate results pretty quickly with a smaller batch of the training data, but you still need to run all the data through the network to enable the machine to build an accurate model.

Perhaps the biggest benefit of stochastic gradient descent or mini-batch gradient descent is that it enables you to train a neural network on large data sets even when the machine doesn't have enough memory and processing capacity to handle the entire data set at once.

An artificial neural network requires several components to drive its machine learning process, including the following:

- **Artificial neurons**: Commonly referred to as "nodes," artificial neurons are like brain cells. Each neuron receives one or more inputs and performs a calculation on those inputs to produce an output.
- **Weights**: Weights are added to the connections between neurons to control the relative importance of each neuron's output. For example, suppose you have an artificial neural network designed to tell whether a person is smiling or frowning. You would want to place more weight on inputs related to the person's mouth and eyes and less weight on inputs related to their nose, chin, and hair.
- **Biases**: Bias is similar to weight, but it is an adjustment made within a neuron to control its output.
- **Activation functions**: The activation function, within each neuron, is responsible for performing a calculation on the sum of the weighted inputs to produce the neuron's output.
- **Cost function**: The cost function resides at the end of the neural network and calculates the difference between the network's answer and the correct answer. In other words, it determines how wrong the artificial neural network is.
- **Gradient descent**: Gradient descent is a technique that tells the artificial neural network the adjustments required to bring the answer closer to the correct answer. See my previous article Fine Tuning Machine Learning with Gradient Descent for details.
- **Backpropagation**: Neural Network Backpropagation calculates the gradient of the cost function at output and distributes it back through the layers of the artificial neural network, providing guidance on how to adjust the weights to increase the accuracy of the output. Think of weights and biases as dials that can be turned to adjust each neuron's output. Backpropagation provides guidance on which dials to turn, in what direction, and by how much.
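To show how the first few components fit together, here is a small sketch of a single neuron for the smile example. The input values, weights, and bias are invented for illustration, and the sigmoid activation and squared-error cost are assumed choices.

```python
import math

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs, plus the bias, passed through a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def cost(output, target):
    """Squared difference between the neuron's answer and the correct answer."""
    return (output - target) ** 2

# Hypothetical inputs: [mouth curvature, eye crinkle, nose width]
inputs = [0.9, 0.7, 0.4]
weights = [2.0, 1.5, 0.1]   # mouth and eyes matter more than the nose
bias = -1.5

output = neuron_output(inputs, weights, bias)
print(f"smile score: {output:.3f}")
print(f"cost if the person is smiling: {cost(output, 1.0):.3f}")
```

Gradient descent and backpropagation are what would then adjust `weights` and `bias` to bring that cost down.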

To understand how backpropagation works, imagine standing in front of a control board that has a few hundred little dials like the ones you see in professional sound studios. You’re looking at a screen above these dials that has a number between zero and one. Your goal is to get that number as close to zero as possible, which means zero cost. You don't know anything about the purpose of each dial or how its setting might impact the value on the screen. All you do is turn dials while watching the screen.

When you look closely at these dials, you notice that each has a setting from 0 (zero) to 1 (one). Turning a dial clockwise brings the setting closer to one. Turning it counterclockwise brings the setting closer to zero. Each dial represents a weight, the strength of the connection between two neurons. It’s almost as though you’re tuning an instrument without actually knowing the notes. As you make adjustments, you get closer and closer to perfect pitch, at which point the cost is zero.

With an artificial neural network, the dials start with random settings, which allow them to be turned up or down. During the training process, the network looks for the dials with the greatest weights — the dials that are turned up higher than all the others. It turns all of these dials up a tiny bit to see if that lessens the cost. If that adjustment doesn’t work, the network turns them down a little.

Suppose we build an artificial neural network for identifying dog breeds. It is designed to distinguish among 10 breeds: German shepherd, Labrador retriever, Rottweiler, beagle, bulldog, golden retriever, Great Dane, poodle, Doberman, and dachshund. We feed a black-and-white image of a beagle into the machine.

This grayscale image is broken down into 625 pixels in the input layer, and that data is sent over 12,500 weighted connections to the 20 neurons in the first hidden layer (20 x 625 = 12,500). The first hidden layer neurons perform their calculations and send the results over 400 weighted connections to 20 neurons in the second hidden layer (20 x 20 = 400). Those second hidden layer neurons send their output over 200 weighted connections to the 10 neurons in the output layer (20 x 10 = 200). So our network has 13,100 dials to turn (12,500 + 400 + 200 = 13,100). On top of that it also has 50 settings to adjust the bias in the hidden and output layer neurons. All the weights start with random settings.
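The arithmetic above is easy to check in a few lines. The sketch assumes a fully connected network, with every neuron in one layer connected to every neuron in the next, which is what the numbers in the example imply.

```python
# Layer sizes from the example: 625 input pixels, two hidden layers of 20, 10 outputs
layer_sizes = [625, 20, 20, 10]

# One weight per connection between each pair of adjacent layers
weights_per_layer = [a * b for a, b in zip(layer_sizes, layer_sizes[1:])]
total_weights = sum(weights_per_layer)

# One bias per neuron in the hidden and output layers
total_biases = sum(layer_sizes[1:])

print(weights_per_layer, total_weights, total_biases)
# prints: [12500, 400, 200] 13100 50
```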

We send our beagle picture through the neural network, and the output layer delivers its results: it’s 0.3 sure it’s a German shepherd, 0.8 sure it’s a Labrador retriever, 0.5 sure it’s a Rottweiler, 0.2 sure it’s a beagle, 0.3 sure it’s a bulldog, 0.6 sure it’s a golden retriever, 0.3 sure it’s a Great Dane, 0.3 sure it’s a poodle, 0.4 sure it’s a Doberman, and 0.7 sure it’s a dachshund.

Obviously, those are lousy answers. The network is much more certain that the picture of the beagle represents a Labrador retriever, a Rottweiler, a golden retriever, or a dachshund than a beagle.

The neural network needs to use backpropagation to find out how to adjust its weights and minimize the cost. The best place to start is by dialing up the correct answer (beagle), because it’s the right answer and it has as much room for adjustment as any other output: it can move up by 0.8, from 0.2 to 1. The next priority is to dial down the wrong answers starting with the highest number, so you would start by dialing down the 0.8 (Labrador retriever) and the 0.7 (dachshund).
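The priorities described above can be worked out from the output values alone. This sketch assumes the ideal output is 1 for the correct breed and 0 for every other breed, and it uses the numbers from the beagle example.

```python
# Output layer values from the beagle example
outputs = {
    "German shepherd": 0.3, "Labrador retriever": 0.8, "Rottweiler": 0.5,
    "beagle": 0.2, "bulldog": 0.3, "golden retriever": 0.6,
    "Great Dane": 0.3, "poodle": 0.3, "Doberman": 0.4, "dachshund": 0.7,
}
correct = "beagle"

# Assumed targets: 1 for the correct breed, 0 for everything else
targets = {breed: 1.0 if breed == correct else 0.0 for breed in outputs}

# Positive means "dial up," negative means "dial down"
nudges = {breed: targets[breed] - outputs[breed] for breed in outputs}

# Wrong answers, most confident first
wrong = sorted((b for b in outputs if b != correct),
               key=lambda b: outputs[b], reverse=True)

print(f"dial up {correct} by {nudges[correct]:.1f}")
for breed in wrong[:3]:
    print(f"dial down {breed} by {-nudges[breed]:.1f}")
```

The three most confident wrong answers come out as Labrador retriever, dachshund, and golden retriever, in that order.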

So backpropagation looks at 0.2 and works its way back to the connections to this output neuron to identify which connections have the most room for adjustment, and it dials those up or down. It then looks back to the second hidden layer neurons to see which neurons have the most room to adjust the bias, and it dials those up or down. The network continues to work back through the connections and neurons and continues to make adjustments until it reaches the input layer.

As you can see, backpropagation is a powerful technique that enables machines to learn as we often do as humans — through trial and error. We make mistakes, analyze the outcome, and then make adjustments to improve the outcomes. If we don't, we pay the high cost incurred from continually making the same mistakes!