Implementing Back Propagation Algorithm In A Neural Network

Neural Network

Math

What is a Neural Network?

Artificial Neural Networks (ANNs) are information processing systems that are inspired by the biological neural networks like a brain. They are a chain of algorithms which attempt to identify relationships between data sets.

In its simplest form, a biological brain is a huge collection of neurons. Each neuron takes electrical and chemical signals as inputs through its many dendrites and transmits the output signals through its axon (in a more specialized context, there are exceptions to this behavior like with multipolar neurons). Axons make contact with other neurons at specialized junctions called synapses where they pass on their output signals to other neurons to repeat the same process over and over millions and millions of times. Image source

Taking inspiration from the brain, an artificial neural network is a collection of connected units, also called neurons. The connection between the neurons can carry signals between them. Each connection carries a real number value which determines the weight/strength of the signal.

ANNs are universal function approximators who at there very essence attempt to approximate a function/relation between the input data and the output data.

Neurons

Neurons are the basic building blocks of ANNs. An artificial neuron receives vectorized input and operates on them to produce vectorized output. This is similar to dendrites in a biological neuron receiving input signals and sending out an output signal through its axon.

Each input to an artificial neuron is individually weighted to determine the strength of each input. A weighted summation of all the inputs with their weights forms the total net input to a neuron. For a neuron with n inputs i and the same number of corresponding weights w, the Total Net Input would be: Consider the following diagram of an artificial neuron. In the above artificial neuron, i1 and i2 are the inputs to the neuron where each input has weights w1 and w2 respectively. Hence, the Total Net Input for the neuron would be: The neuron then computes the output using an activation function on the Total Net Input. The result of the activation function is the output of the neuron. The output of a neuron is forwarded to the next neuron in line as input. In an ANN, the results from one layer of neurons are progressively forwarded to the neurons in the next layer until they exit from the output layer as the output of the network.

Activation Functions

The Total Net Input to a neuron could be anywhere between +Infinity to -Infinity. To decide whether the neuron should fire/activate (i.e. generate an output) researchers introduced activation functions inside neurons. Activation functions provide complex non-linear functional mapping between the net inputs and output of a neuron.

Without the activation functions, the output of a neuron will just be a linear function of the input parameters. An ANN without an activation function is a Linear Regression Model i.e. polynomial function of degree one. Though, it would be much easier to train such an ANN, they would be limited in their complexity and have less capability to learn complex functional mappings from complex non-linear data sets like images, audio, video etc.

Similarly, having linear activation functions would also give us a linear neural network of input parameters regardless of how many hidden layers we add in the network.

Unlike linear functions, non-linear functions are polynomials functions of a degree more than one and their plottings on a map are curved. For an ANN to be able to create non-linear arbitrary and complex mappings between the input and output, we need non-linear activation functions.

Another important requirement for an activation function is that it should be differentiable so that we can calculate the error gradient of the network with respect to its weights. This gradient is needed to perform back propagation optimization strategy as we will soon see.

Some of the most popular activation functions are:

Every activation function has its own characteristics, for example Sigmoid suffers from the vanishing gradient problem and the zero-centered problem. Tanh solves the zero-centered problem but suffers from the vanishing gradient problem. ReLU helps in solving the vanishing gradient problem but it too suffers from the zero-centered problem.

Recently, ReLU has become the popular choice as activation function for hidden layer neurons.

The choice of activation function inside a neuron heavily influences the learning techniques employed on the network and vice versa.

Though Sigmoid has fallen out of favor with neural network designers nowadays, we would be using it in the current implementation. The sigmoid function non-linearly squashes or normalizes the input to produce an output in a range of 0 to 1.

Topology Of an Artificial Neural Network

The neurons in an ANN are arranged in two layers vis hidden layer and output layer. In addition to these two layers, there is also a third layer called the input layer which doesn't has any neurons but instead just acts as a vector or matrix for the input data.

One of the simplest form of neural networks is a single hidden layer feed forward neural network. All the directed connections in a neural network are meant to carry output from one neuron to the next neuron as input. All these connections are weighted to determine the strength of the data they are carrying.

Multi Layer Perceptrons (MLP)

MLPs are one of the most popular topologies of neural networks being used today. MLPs have only one input and output layer but can have more than one hidden layers. Each layer in MLPs can have numerous nodes. In addition to the neurons, each layer except the output layer also has a bias node. The bias node in a layer also has weighted connections with neurons of the next layer but its input value is always 1, hence the weight of the connection is the effective input from the bias to the neuron. The Total Net Input to a neuron with n inputs and a bias is: Where w and i are the weight and input respectively while b is the weight from the bias node to the neuron.

All inputs from the input layer along with the bias are forwarded to each neuron in the hidden layer where each neuron performs a weighted summation of the input and sends the activation results as output to the next layer. The process repeats until the data final exits form the network through the output layer. This process is called the feed forward process.

Why do we need bias nodes?

As mentioned before, neural networks are universal function approximators and they assist us in finding a function/relationship between the input and the output data sets. In a neural network without bias nodes, the output is the result of a direct function of the inputs of the network without any constant to adjust the shift in the output.

To understand better, consider the geometric equation of a line passing through the origin on a 2D cartesian plane: Here, m is the slope of the line and x & y hold the coordinates of the points on the line.

For a line not passing through the origin, a new constant c (also called the y-intercept) is introduced in the equation to quantify the shift in the line away from the origin. The y-intercept is not related to any of the coordinates. For the line passing through the origin, we can train a neural network without bias nodes to estimate the y coordinate of a point if the x coordinate is given. After training, the weights in the neural network will get adjusted to emulate the value of the slope m. But such a network will not be able to learn, with reasonable accuracy, from a coordinates set of a line not passing through the origin because there are no weights unrelated to the input to emulate the behavior of the constant c. Each weight in the network is directly or indirectly related to the input of the network.

A neural network with bias nodes however can learn to predict the coordinates of a line not passing through the origin. With training, the weights of the bias nodes will also get adjusted to emulate the behavior of the y-intercept c.

Back Propagation Algorithm

Back propagation algorithm is a supervised learning algorithm which uses gradient descent to train multi-layer feed forward neural networks.

The back propagation algorithm involves calculating the gradient of the error in the network's output against each of the network's weights and adjusting the weights to reduce the error. In simpler words, it calculates how much effect each weight in the network has on the network's error and adjusts every weight to reduce the error.

The error value indicates how much the network's output (actual) is off the mark from the expected output (target). We use the Mean Squared Error function to calculate the error.

The error value of a single output neuron is a function of its actual value and the target value. The total error of the network is a sum of all error values from all output neurons. For a network with n neurons in the output layer, the Total Error is: To understand the back propagation algorithm, consider a single output neuron with only one input connection. The input data would follow the following flow to give us the total network error. To calculate the gradient of the Total Error against a weight, we calculate the partial differential of the Total Error with respect to the weight. We employ the chain rule to simplify the above equation by splitting it into smaller equations. Let's look into solving each of the individual differentials to get the initial desired differential.

The Total Error of a network is the sum of errors in all its output neurons. Since we are calculating the partial differential with respect to only one neuron's error, errors from other neurons are treated as constants since they are not a function of the selected neuron's error, hence they don't effect the partial differential. So essentially we are calculating the partial differential of an entity against itself, hence the partial differential of Total Error with respect to one neuron's error is 1. As mentioned earlier, we use the Mean Squared Error function to calculate the error in an output neuron's actual output. Since, actual is the output of the neuron, the derivative of the error with respect to the Output would be: As we are using the Sigmoid activation function to calculate output from the Total Net Input, which is a function of only one variable, we calculate the derivative instead of partial derivative of a neuron's output with respect to its input using this formula. As we already know, the Total Net Input of a neuron is a weighted summation of all its inputs and the bias. The derivative of the Total Net Input with respect to one of its weights is the corresponding input factor of that particular weight since all other weighted sums and bias will be treated as constants. After solving for each of the individual differentials, we can now calculate the partial differential of the Total Error with respect to a weight. Once we have that, we adjust our weight in proportion to the learning rate. The learning rate determines how drastically do the weights in a network get adjusted based on the error differential. In other words, how quickly or slowly will the neural network adapt to reduce its errors. The learning rate's value spans between 0 and 1. There are caveats for using both high and low learning rates.

If a neural network's learning rate is too high then it will quickly adjust it weights for any errors in the output. As a consequence, the latest data set in the training data set will have a much higher bearing on the network's output than the earlier data sets. In simpler words, the network will readily learn from the latest data but will keep forgetting lessons from the older data.

On the other hand, having a very low learning rate is also fraught with consequences. The network will have a higher inertia against learning i.e. it will be very slow to responding to any errors in the output. As a result, it will either only learn in the beginning and become prejudiced against the later training data sets or it will not learn at all and will remain stuck with the initial weights it was initialized with.

We just saw how back propagation of errors is used in MLP neural networks to adjust weights for the output layer to train the network. We use a similar process to adjust weights in the hidden layers of the network which we would see next with a real neural network's implementation since it will be easier to explain it with an example where we have actual numbers to play with.

Neural Network Implementation

Let's build a neural network and train it to approximate the function for a XOR gate. The following are the I/O sets for a XOR gate:

A B A XOR B
0 0 0
0 1 1
1 0 1
1 1 0

As we can see a XOR gate accepts two inputs and returns a single output. Hence we need a neural network with two input nodes and one output neuron. We will also opt to use two neurons in the hidden layer of our neural network. Determining the number of neurons needed in the hidden layer depends on a wide variety of variables and there are a number of hidden layer configuration designs to choose from.

The general consensus is that, if the number of inputs is greater than the outputs, the number of neurons in the hidden layer should be between the number of nodes in the input and the output layer but never greater than the number of input nodes. The last bit exists so that the network can generalize better instead of overfitting/memorizing the training data.

Since we know the expected output for our inputs, this will be called supervised learning. Also, since we will be predicting the output value for the input set, we are creating a regression neural network model.

Apart from neurons we will also add one bias node in the input and the hidden layer. Notice that we have labeled all the nodes and the weights in the network for easy of explanation.

The next step is to initialize all the connections between the neurons of different layers with weights. Initially these weights are just random decimal point numbers between 0 and 1.

Let us start training our network with the second row in the above table. The input nodes will have the values 0 & 1. Now that we have initialized our network let us see what output it spews out for the selected input.

Feed Forward

Hidden Layer

We need to calculate the Total Net Input to each hidden layer neuron, squash it using the activation function and then repeat the same process for the output layer.

First let's calculate the Total Net Input to the first hidden layer neuron, h1.

As described earlier, the Total Net Input to a neuron is a weighted summation of all of its inputs and its bias. Hence, the Total Net Input to h1 is:

TotalNetInput to h1 = w1 * i1 + w3 * i2 + w5 * b1
TotalNetInput to h1 = 0.604 * 0 + 0.940 * 1 + 0.664 * 1
TotalNetInput to h1 = 1.604 Now that we have the Total Net Input to h1 we can calculate the Output from h1 using the sigmoid function:

Output from h1 = 1/ (1 + e^(-TotalNetInput to h1))
Output from h1 = 1/ (1 + e^(-1.604))
Output from h1 = 0.8325766980765932 Similarly, the Total Net Input to the second hidden layer neuron, h2 will be:

TotalNetInput to h2 = w2 * i1 + w4 * i2 + w6 * b1
TotalNetInput to h2 = 0.437 * 0 + 0.424 * 1 + 0.686 * 1
TotalNetInput to h2 = 1.11 And the Output from h2 is:

Output from h2 = 1/ (1 + e^(-TotalNetInput to h2))
Output from h2 = 1/ (1 + e^(-1.11))
Output from h2 = 0.7521291114395702 As we know, in a feed forward neural network, output from one layer is input to the next layer. So now let's input our hidden layer outputs in the output layer.

Output Layer

We begin by calculating the Total Net Input to the output layer's neuron, o1.

TotalNetInput to o1 = w7 * Outputh1 + w8 * Outputh2 + w9 * b2
TotalNetInput to o1 = 0.065 * 0.8325766980765932 + 0.156 * 0.7521291114395702 + 0.096 * 1
TotalNetInput to o1 = 0.26744962675955153 The Output from o1 would be:

Output from o1 = 1/ (1 + e^(-TotalNetInput to o1))
Output from o1 = 1/ (1 + e^(-0.26744962675955153))
Output from o1 = 0.5664666852388589 0.5664666852388589 is the output from the neural network for the given inputs 0 and 1 while the expected output is 1. As expected from the first run of a neural network, the actual output is quite the way off from the target output since we are operating with random weights.

Back propagation

Now we will employ back propagation strategy to adjust weights of the network to get closer to the required output.

We will be using a relatively higher learning rate of 0.8 so that we can observe definite updates in weights after learning from just one row of the XOR gate's I/O table. Generally a starting learning rate of 0.5 is used in most applications. I say starting rate because many back propagation techniques nowadays also update the learning rate as the training progresses. Another interesting concept is to use a variable learning rate based on the actual output's deviation from the target output instead of a fixed learning rate. For the sake of simplicity we will only be using a fixed learning rate in our implementation.

As we know, the Total Error from a neural network is simply a sum of errors from all the output neurons. Since we have only one output neuron in our network, the Total Error is simply the error from output neuron o1.

Total Error = Erroro1 Let us begin updating weights to the output layer.

Output Layer

First we calculate the partial differential of Total Error from the network with respect to the Output of output layers only neuron o1. Hence, the partial differential of Total Error with respect to Output of o1 is:

pd(TotalError)/pd(Output) = pd(Erroro1)/pd(Outputo1)
pd(TotalError)/pd(Output) = actualo1 - targeto1
pd(TotalError)/pd(Output) = 0.5664666852388589 - 1
pd(TotalError)/pd(Output) = -0.4335333147611411 Next we need a derivative of Output of o1 with respect to the Total Net Input to o1. From what we earlier discussed:

d(Output)/d(TotalNetInput) = Outputo1 * (1 - Outputo1)
d(Output)/d(TotalNetInput) = 0.5664666852388589 * (1 - 0.5664666852388589)
d(Output)/d(TotalNetInput) = 0.24558217975335844 Now from our understanding of chain rule:

pd(TotalError)/pd(TotalNetInput) = pd(TotalError)/pd(Output) * d(Output)/d(TotalNetInput)
pd(TotalError)/pd(TotalNetInput) = -0.4335333147611411 * 0.24558217975335844
pd(TotalError)/pd(TotalNetInput) = -0.10646805643473987 We now have the partial differential of Total Error with respect to the Total Net Input to the output neuron. To adjust the weights from the hidden layer to the output layer we need to find partial differential of Total Error with respect to each of those individual weights and update them.

Next we calculate the partial differential of Total Net Input with respect to each of the weights (i.e. w7, w8 and w9). As we know from earlier explanations, partial differential of Total Net Input with respect to the weight is simply the input associated with that weight when calculating the Total Net Input. Therefore,

pd(TotalNetInput)/pd(w7) = Outputh1
pd(TotalNetInput)/pd(w7) = 0.8325766980765932 And according to the chain rule, we can express partial differential of Total Error with respect to weight as:

pd(TotalError)/pd(weight) = pd(TotalError)/pd(TotalNetInput) * pd(TotalNetInput)/pd(weight) Building on the above equation,

pd(TotalError)/pd(w7) = pd(TotalError)/pd(TotalNetInput) * pd(TotalNetInput)/pd(w7)
pd(TotalError)/pd(w7) = -0.10646805643473987 * 0.8325766980765932
pd(TotalError)/pd(w7) = -0.08864282287706811 Now that we know how much the Total Error of the network is affected by the weight w7, we can use this to set a new value for w7:

w7' = w7 - (learning rate * pd(TotalError)/pd(w7))
w7' = 0.065 - (0.8 * -0.08864282287706811)
w7' = 0.1359142583016545 Similarly, partial differential of Total Error with respect to w8 is:

pd(TotalError)/pd(w8) = pd(TotalError)/pd(TotalNetInput) * pd(TotalNetInput)/pd(w8)
pd(TotalError)/pd(w8) = -0.10646805643473987 * outputh2
pd(TotalError)/pd(w8) = -0.10646805643473987 * 0.7521291114395702
pd(TotalError)/pd(w8) = -0.08007772468295891 We can now update w8 as:

w8' = w8 - (learning rate * pd(TotalError)/pd(w8))
w8' = 0.156 - (0.8 * -0.08007772468295891)
w8' = 0.22006217974636713 There is also a third weight from the hidden layer to the output layer, w8. But unlike the above weights, this is not on a connection from a hidden layer neuron but instead from a bias node. Fortunately, updating this weight is also very much like updating the previous weights.

As we discussed earlier, the partial differential of Total Net Input with respect to a weight is simply the corresponding input associated with that weight. Since, the output from a bias node is always 1, the associated input in this case is also 1. Hence,

pd(TotalNetInput)/pd(w9) = 1 Now we can calculate the partial differential of Total Error with respect to this weight, which as per the chain rule can be expressed as:

pd(TotalError)/pd(w9) = pd(TotalError)/pd(TotalNetInput) * pd(TotalNetInput)/pd(w9)
pd(TotalError)/pd(w9) = -0.10646805643473987 * 1
pd(TotalError)/pd(w9) = -0.10646805643473987 Hence we can adjust the weight w8 on the connection from the bias node as:

w9' = w9 - (learning rate * pd(TotalError)/pd(w9))
w9' = 0.096 - (0.8 * -0.10646805643473987)
w9' = 0.1811744451477919 Updating the weights from the hidden layer to the output layer with the new weights should bring the output closer to the target output thereby reducing the network's error.

To further improve the network's output we also update the weights from the input layer to the hidden layer.

Hidden Layer

Like with the output layer, to update weights we calculate how much the change in the weights affects the Total Error of the network. i.e. we calculate the partial differential of error with respect to each of the hidden layer weights.

First we calculate the partial differential of Total Error with respect to a hidden neuron's output which can be expressed with the following equation for an output layer with n neurons.

pd(TotalError)/pd(output hidden neuron) = sigma i=0 to n (pd(Total Error)/pd(TotalNetInput output neuron) * pd(TotalNetInput output neuron)/pd(output hidden neuron)) Note that since the output from a hidden neuron is fed into each of the output layer neuron and the Total Error is a summation of errors from each of the output neuron, the partial differential of Total Error with respect to the output of a hidden neuron is a summation of chain rule calculations via Total Net Input of each of the output layer neuron.

Now since our network has only a single output neuron, we can calculate the partial differential of Total Error with respect to the output of h1 as:

pd(TotalError)/pd(Outputh1) = pd(Total Error)/pd(TotalNetInput o1) * pd(TotalNetInput o1)/pd(Outputh1) We have already calculated partial differential of Total Error with respect to the Total Net Input to o1 in the previous section. The output from h1 is one of the inputs to o1, and from our earlier discussions we know that partial differential of Total Net Input with respect to an input is the weight corresponding to the input while calculating the input. Therefore,

pd(TotalError)/pd(Output h1) = -0.10646805643473987 * w7
pd(TotalError)/pd(Output h1) = -0.10646805643473987 * 0.065
pd(TotalError)/pd(Output h1) = -0.006920423668258092 Notice that we used the old value of w7 not the new one because we are still calculating gradient for the error generated while the old weight was being used.

Moving on in our chain, we next calculate the derivative of output from h1 with respect to the Total Net Input to h1. The output, as we already know, is generated by squashing the input with the sigmoid function. Hence, the derivative is a derivative of the sigmoid function which as we have discussed earlier is:

pd(Output h1)/pd(TotalNetInput h1) = Output h1  * (1 - Output h1)
pd(Output h1)/pd(TotalNetInput h1) = 0.13939273989647058 Now that we have the necessary elements to calculate the partial differential of Total Error with respect to the Total Net Input to h1, we calculate it by first simplifying it using the chain rule:

pd(Total Error)/pd(TotalNetInput h1) = pd(TotalError)/pd(Output h1) * pd(Output h1)/pd(TotalNetInput h1)
pd(Total Error)/pd(TotalNetInput h1) = -0.006920423668258092 * 0.13939273989647058
pd(Total Error)/pd(TotalNetInput h1) = -0.000964656816362879 The Total Net Input to h1 is a function of inputs from i1, i2 and b1 with w1, w3 and w5 as the corresponding weights respectively. We will now find new values for weights w1 and w3 from input nodes and w5 from the bias node.

As we know, the partial differential of Total Net Input with respect to a weight is simply the input value corresponding to the weight. Therefore,

pd(TotalNetInput h1)/pd(w1) = i1
pd(TotalNetInput h1)/pd(w1) = 0 Again applying the chain rule,

pd(TotalError)/pd(w1) = pd(Total Error)/pd(TotalNetInput h1) * pd(TotalNetInput h1)/pd(w1)
pd(TotalError)/pd(w1) = -0.000964656816362879 * 0
pd(TotalError)/pd(w1) = 0 Now that we know how much the weight w1 affects the Total Error, we can adjust it to minimize the Total Error. Since the affect of w1 on the Total Error is zero the new weight should remain unchanged. This can be validated by applying the weight update equation.

w1' = w1 - (learning rate * pd(TotalError)/pd(w1))
w1' = 0.604 - (0.8 * 0)
w1' = 0.604 We can use a similar approach to calculate how much the partial differential of Total Net Input to h1 with respect to w3,

pd(TotalNetInput h1)/pd(w3) = i2
pd(TotalNetInput h1)/pd(w3) = 1 Hence,

pd(TotalError)/pd(w3) = pd(Total Error)/pd(TotalNetInput h1) * pd(TotalNetInput h1)/pd(w3)
pd(TotalError)/pd(w3) = -0.000964656816362879 * 1
pd(TotalError)/pd(w3) = -0.000964656816362879 Now similar to w1 we can calculate new weight for w3 as:

w3' = w3 - (learning rate * pd(TotalError)/pd(w3))
w3' = 0.94 - (0.8 * -0.000964656816362879)
w3' = 0.94 - (-0.0007717254530903032)
w3' = 0.9407717254530903 As with weights from the input nodes, we also need to update weight w5 from the bias node b1. First we find out how much the weight w5 affects the Total Net Input to h1. Since the corresponding input for the w5 in equation for calculating the Total Net Input is 1,

pd(TotalNetInput h1)/pd(w5) = 1 Hence,

pd(TotalError)/pd(w5) = pd(total error)/pd(totalnetinput h1) * pd(totalnetinput h1)/pd(w5)
pd(TotalError)/pd(w5) = -0.000964656816362879 * 1
pd(TotalError)/pd(w5) = -0.000964656816362879 Calculating the new value for weight w5:

w5' = w5 - (learning rate * pd(TotalError)/pd(w5))
w5' = 0.664 - (0.8 * -0.000964656816362879)
w5' = 0.664 - (-0.0007717254530903032)
w5' = 0.6647717254530904 Using the above techniques to calculate new values for weights from input layer to h2, we get the following values:

w2' = 0.437
w4' = 0.4264771473090286
w6' = 0.6884771473090286 Conclusion

If we again employ the feed forward technique to the neural network for the same input but with the new weights, we get 0.6130333654357785 as output. Which is slightly closer to the target output of 1 compared to the earlier output 0.5664666852388589. This is essentially how a neural network training works, over several iterations with randomly selected input data the network gradually updates its weight to reduce its total error. The choice of learning rate, number of hidden neurons, activation functions etc, dictate the speed and accuracy of the network's training life cycle.

Example

You can experiment with a basic GoLang based back propagation implementation I have on GitHub at: surenderthakran/back-propagation-demo where I train a neural network to learn the AND gate.

You can also checkout my attempt at writing a neural network library at surenderthakran/gomind which I hope will take a respectable shape in the coming weeks.