
Published August 9, 2021

Doug Rose

Author | Agility | Artificial Intelligence | Data Ethics

As I explain in my previous article, Machine Learning Gradient Descent, machine learning requires the use of a cost function along with gradient descent. As the machine learns to perform a task, the cost function tells the machine how wrong its output is, and gradient descent provides a way for the machine's neural network to adjust the strength of the connections between neurons to improve the machine's accuracy.
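To make the interplay between the cost function and gradient descent concrete, here is a minimal sketch in Python. It is not taken from the previous article: the toy data, the single weight `w`, and the `learning_rate` value are all assumptions chosen for illustration. The cost function measures how wrong the predictions are, and each gradient descent step nudges the weight in the direction that lowers that cost.

```python
# Toy data: the target relationship is y = 2 * x (an assumption for illustration).
inputs = [1.0, 2.0, 3.0, 4.0]
targets = [2.0, 4.0, 6.0, 8.0]


def cost(w):
    """Mean squared error: how wrong the predictions w * x are on average."""
    return sum((w * x - y) ** 2 for x, y in zip(inputs, targets)) / len(inputs)


def cost_gradient(w):
    """Derivative of the cost with respect to the weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(inputs, targets)) / len(inputs)


w = 0.0                # arbitrary starting weight
learning_rate = 0.05   # hypothetical step size

for step in range(100):
    # Move the weight a small step against the gradient to reduce the cost.
    w -= learning_rate * cost_gradient(w)

print(w, cost(w))
```

In this sketch the weight moves toward 2, the value that makes the cost smallest for this toy data. A real neural network repeats the same idea across many weights and biases at once.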

As you might imagine, this learning process can consume a lot of memory and processing power, especially when the neural network is trying to process hundreds or thousands of inputs. Just imagine your own brain trying to identify a suspect. The detective shows you 1,000 mug shots and asks you to point out the person who most closely resembles the person who committed the crime. You would need to remember all of those images and then have to go through the images several times to pick out what you considered the closest match.

In the same way, if you feed 1,000 inputs into a neural network, it must store all those inputs in memory and then figure out, based on the collective inputs, how to adjust the weights and biases of all its neurons to arrive at the most accurate outputs for each and every one of those inputs.

To ease this processing burden, especially when training a neural network on a massive dataset, data science teams can feed the machine one data input at a time, a technique referred to as *stochastic gradient descent*, or a small batch of data inputs at a time, a technique usually called *mini-batch gradient descent*. (Strictly speaking, *batch gradient descent* refers to using the entire dataset for every update, although the term is often used loosely for the small-batch approach as well.)
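The difference between these approaches comes down to how many inputs the machine sees before each adjustment. The following Python sketch illustrates this under stated assumptions: the `train` function, the one-weight model, the toy data, and the learning rate are all hypothetical and chosen only to keep the example short.

```python
import random


def train(samples, batch_size, learning_rate=0.01, epochs=20, seed=0):
    """Fit y = w * x by gradient descent, updating w once per batch."""
    rng = random.Random(seed)
    samples = list(samples)
    w = 0.0
    updates = 0
    for _ in range(epochs):
        rng.shuffle(samples)  # new random order on every pass
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            # Average gradient of the squared error over this batch only.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad
            updates += 1
    return w, updates


# Toy data set: 1,000 points on the line y = 3 * x (noise-free for simplicity).
data = [(x / 1000, 3 * x / 1000) for x in range(1, 1001)]

w_sgd, n_sgd = train(data, batch_size=1)       # one input per update
w_mini, n_mini = train(data, batch_size=100)   # 100 inputs per update
w_full, n_full = train(data, batch_size=1000)  # all inputs per update

print(n_sgd, n_mini, n_full)
```

Over the same 20 passes through the data, a batch size of 1 makes 20,000 weight updates, a batch size of 100 makes 200, and a batch size of 1,000 makes only 20. Smaller batches mean more frequent but noisier adjustments, each computed from less data.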

Let's return to the example of the mug shots. Suppose that instead of showing you all 1,000 mug shots at once, the detective broke them down into sets of 100 and asked you to pick, from each set, the person who most closely resembles the person who committed the crime. Picking out one person among 100 is much easier than picking out one person among 1,000. After going through all 10 sets, you would have narrowed the list down to 10 possibilities and could then pick one from those 10.

Stochastic means randomly determined. In statistics, this term often describes techniques used to analyze randomly selected data to approximate a distribution pattern. The goal is to arrive at a quick approximation instead of spending a lot of time establishing a precise pattern.

Suppose we build a neural network for classifying fruit. We have 1,000 pictures of different fruits — apples, pears, bananas, different types of melons, grapes, papaya, and so forth. We could feed the machine the entire batch of 1,000 pictures all at once, but the machine might take a very long time examining all those pictures and making all the necessary adjustments.

To speed up the process, we could use mini-batch gradient descent. For example, we could shuffle the pictures to introduce some randomness and then divide them into 10 batches of 100 pictures each. We would then feed the pictures into the machine one batch at a time, so the network would have fewer pictures to process at one time. More importantly, each round of adjustments would be based on only 100 pictures rather than all 1,000, so the network could start improving after the first batch.
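The shuffle-and-split step described above can be sketched in a few lines of Python. The `make_batches` helper and the picture file names are hypothetical stand-ins; a real project would more likely rely on the batching utilities of its machine learning framework.

```python
import random


def make_batches(items, batch_size, seed=None):
    """Shuffle a copy of items and split it into consecutive batches."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


# Hypothetical file names standing in for the 1,000 fruit pictures.
pictures = [f"fruit_{i:04d}.jpg" for i in range(1000)]
batches = make_batches(pictures, batch_size=100, seed=42)

print(len(batches))     # 10
print(len(batches[0]))  # 100
```

Every picture lands in exactly one batch, so feeding all 10 batches through the network still covers the full set of 1,000 pictures.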

The benefit of stochastic gradient descent is that the network processes each small batch of inputs much faster, and with significantly less memory and processing power, than if it had to process all inputs at once. As a result, stochastic gradient descent is especially useful for massive datasets that the network can’t hold in memory all at once. Potential drawbacks include the following:

- The neural network's accuracy may vary considerably with each batch. It may be highly accurate when processing one batch of inputs and not so accurate when processing others.
- You have to run ten training steps, one per batch, instead of feeding the network all 1,000 pictures in a single session.
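The first drawback is easy to see in a small experiment. The sketch below uses synthetic data and a fixed, arbitrarily chosen weight (both assumptions made for illustration) and measures error rather than accuracy, since the toy model predicts numbers instead of fruit labels. It scores the same model on each of 10 batches and on the full dataset.

```python
import random

rng = random.Random(0)

# Synthetic data: y = 2 * x plus Gaussian noise (made up for illustration).
data = []
for _ in range(1000):
    x = rng.uniform(0, 1)
    data.append((x, 2 * x + rng.gauss(0, 1)))


def mse(w, samples):
    """Mean squared error of the model y = w * x on the given samples."""
    return sum((w * x - y) ** 2 for x, y in samples) / len(samples)


w = 1.5  # a partially trained weight, chosen arbitrarily
batches = [data[i:i + 100] for i in range(0, len(data), 100)]
batch_errors = [mse(w, batch) for batch in batches]
overall_error = mse(w, data)

print(min(batch_errors), max(batch_errors), overall_error)
```

The same model scores better on some batches and worse on others, while the error over the whole dataset sits between those extremes. Because the batches here are equal in size, the average of the per-batch errors matches the overall error, which is one reason a single batch's result should not be read as the final verdict.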

The key thing to remember is not to be overconfident in the results from any one of these batches. You can get reasonably accurate results pretty quickly with a smaller batch of the training data, but you still need to run all the data through the network to enable the machine to build an accurate model.

Perhaps the biggest benefit of stochastic gradient descent and mini-batch gradient descent is that they enable you to train a neural network on large datasets even when the machine doesn't have enough memory and processing capacity to handle the entire dataset at once.
