# Mathematics of Neural Networks

## Mathematics Underpins the World Around Us!

## by Rickesh Bedia

### Published 2017-05-31

This blog article contains a few worked examples and some exercises for you to try yourself. For maximum benefit, find a piece of paper and a pen and work through the problems as you go.

For a recap of the fundamentals of neural networks, see my Deep Learning blog, where I also covered the basics of the maths behind the neural network. In this blog, I'm going to go into more detail with the Maths, and attempt to explain some higher-level concepts.

We could use a pre-built library or framework, but no, we want a true understanding of what is happening. We are going to model our own neural network.

## Our Very Own Neural Network

The neural network we are going to model is a very simple case. It has 2 inputs ($i_1$, $i_2$), 1 hidden layer with 2 neurons ($h_1$, $h_2$) and 2 outputs ($o_1$, $o_2$).

This neural network could be modelling how to get from [1, 2] to [3, 4]. Or, say you gave a word a numerical value based on the position of its letters in the alphabet (a = 1, b = 2, ..., z = 26). You may want to predict the next words for a keyboard. For example, what are the next two words after "How are"? "How" has value 8 + 15 + 23 = 46. "are" has value 1 + 18 + 5 = 24. The inputs are 46 and 24, and you want to train your neural network to output "you today", with values 61 and 65. You then have a problem of decoding, as 65 could represent "today" or "wori", which, although it is not a word, still has the correct value.
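The letter-position encoding described above can be sketched in a few lines of Python. The function name `word_value` is made up for illustration; it simply sums the alphabet positions of the letters in a word.

```python
def word_value(word):
    """Sum of alphabet positions (a = 1, ..., z = 26), ignoring case."""
    return sum(ord(ch) - ord("a") + 1 for ch in word.lower() if "a" <= ch <= "z")

# The words used in the example above.
for word in ["How", "are", "you", "today", "wori"]:
    print(word, word_value(word))
```

Running it shows why decoding is a problem: "today" and "wori" both come out as 65.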

In order to create a neural network you need the following: (The values in the brackets relate to the above 2-2-2 neural network.)

- A set of inputs (1, ..., n) ($i_1$, $i_2$)
- A set of outputs (1, ..., m) ($o_1$, $o_2$)
- Number of Hidden Layers (1)
- Number of Neurons in each Hidden Layer (2: $h_1$, $h_2$)

Note: *n* does not need to equal *m*. Each Hidden Layer doesn't need to have the same number of neurons.

Optional properties:

- Bias for each neuron in the hidden layer(s) ($b_1$, $b_2$) and output layer ($b_3$, $b_4$)
- Weights of the biases in the hidden layer(s) ($v_{11}$, $v_{12}$) and output layer ($v_{21}$, $v_{22}$)
- Weights connecting neurons ($w_1$, $w_2$, $w_3$, $w_4$ for input to hidden; $w_5$, $w_6$, $w_7$, $w_8$ for hidden to output)

#### Why are these properties optional?

From the calculations you will see that the bias is only needed to calculate the net of the neuron. Once we start differentiating with respect to the connecting weights, the bias term doesn't depend on any of them, so its derivative is always 0. The same is true of the weights of the bias.

As a neural network is trained, the weights are updated to minimise the error between its outputs and your target outputs. The weights are therefore constantly changing, and the starting weights you specify are only a starting point: you have no idea what the final value of a weight may be, and there is a good chance it ends up negative. You may as well start with a random number between 0 and 1.
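As a minimal sketch of picking starting weights, assuming Python's standard `random` module: the helper name `init_weights` and its `seed` parameter are illustrative choices, not something the article prescribes.

```python
import random

def init_weights(count, seed=None):
    """Return `count` starting weights drawn uniformly from [0, 1)."""
    rng = random.Random(seed)  # passing a seed makes the run repeatable
    return [rng.random() for _ in range(count)]

# Eight connecting weights, w_1 ... w_8, for the 2-2-2 network.
weights = init_weights(8, seed=42)
print(weights)
```

Training is then free to push any of these values up, down or below zero.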

## Why do I want to go near Maths?

You're probably wondering to yourself, what possible reason could I have for going near Maths again? Didn't I leave that life behind at GCSE/A-Level where, by the end of the course, there were more letters than numbers? An English Degree probably had fewer letters than that. However, Maths is very important, and I am not biased because I am a Maths Graduate. For programming in general, Maths allows you to find optimizations for your processes, and gives you the obvious benefit when doing calculations. You don't need Maths to be a great programmer but it certainly helps.

Now, for Machine Learning, I would argue that Maths is more important. Looking at Google's server room, there are hundreds of servers running the calculations needed for search and all their other processes. Remember, Google employs very smart people to make sure that their programs are running as efficiently as possible given their resources. Imagine how many servers Google would need without a Maths basis.

You and I are not Google or Facebook or Microsoft. We need Maths for the Algorithms, building the Neural Networks, Linear Algebra and different Algorithms based on the data we are trying to model.

## Our First Neural Network

Let's take this one step at a time. First we are going to focus on the hidden layer. Let's calculate the value of $h_1$.

As we can see, $h_1$ depends on $i_1$, with weight $w_1$, $i_2$, with weight $w_3$, and $b_1$, with weight $v_{11}$. To calculate the net value of $h_1$, we multiply the value of each neuron by its weight and add the results together. This can be formulated by

$$\text{net} = \sum_{j} \text{value}_j \times \text{weight}_j$$

Therefore,

$$\text{net}(h_1) = w_1 i_1 + w_3 i_2 + v_{11} b_1$$
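To make the net calculation concrete, here is a small sketch. The article gives no numbers, so the values below are assumptions chosen purely for illustration, and the helper name `net` is my own.

```python
def net(values, weights):
    """Weighted sum: multiply each incoming value by its weight, then add up."""
    return sum(value * weight for value, weight in zip(values, weights))

# Illustrative values only (assumptions, not from the article).
i1, i2, b1 = 0.05, 0.10, 1.0
w1, w3, v11 = 0.15, 0.20, 0.35

# net(h_1) = w_1 * i_1 + w_3 * i_2 + v_11 * b_1
net_h1 = net([i1, i2, b1], [w1, w3, v11])
print(net_h1)
```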

We then put this value of $h_1$ into the Sigmoid Function. From my Deep Learning article, a sigmoid neuron outputs a smooth, continuous range of values between 0 and 1. Exponential functions are simple to handle mathematically and, since learning algorithms involve lots of differentiation (Spoiler Alert!), choosing a function that is computationally cheap to handle is great.

The sigmoid function is defined to be:

$$\text{sig}(x) = \frac{1}{1 + e^{-x}}$$

Therefore the output of $h_1$ is

$$\text{sig}(h_1) = \frac{1}{1 + e^{-\text{net}(h_1)}}$$
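In code, the sigmoid is a one-liner. This sketch uses Python's standard `math.exp`; the net value fed in is an assumed example figure, not one from the article.

```python
import math

def sig(x):
    """Sigmoid function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative net value for h_1 (an assumption, not from the article).
net_h1 = 0.3775
sig_h1 = sig(net_h1)
print(sig_h1)
```

Whatever net value goes in, the result always lands strictly between 0 and 1, with `sig(0)` sitting exactly in the middle at 0.5.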

We can repeat this same process with $h_2$.

$h_2$ depends on $i_1$, with weight $w_2$, $i_2$, with weight $w_4$, and $b_2$, with weight $v_{12}$.

$$\text{net}(h_2) = w_2 i_1 + w_4 i_2 + v_{12} b_2$$

$$\text{sig}(h_2) = \frac{1}{1 + e^{-\text{net}(h_2)}}$$

If you feel you have understood up to this point, firstly congratulations, and secondly see if you can work out the net() and sig() for $o_1$ and $o_2$.

**WARNING**: DO NOT READ PAST THIS POINT IF YOU WANT TO ATTEMPT THE ABOVE EXERCISE!!

**HINT!**:

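The results for $o_1$ and $o_2$ are as follows:

THIS IS YOUR LAST CHANCE TO TRY YOURSELF BEFORE I REVEAL MY SOLUTIONS:

$$\text{net}(o_1) = w_5 \, \text{sig}(h_1) + w_7 \, \text{sig}(h_2) + v_{21} b_3$$

$$\text{sig}(o_1) = \frac{1}{1 + e^{-\text{net}(o_1)}}$$

$$\text{net}(o_2) = w_6 \, \text{sig}(h_1) + w_8 \, \text{sig}(h_2) + v_{22} b_4$$

$$\text{sig}(o_2) = \frac{1}{1 + e^{-\text{net}(o_2)}}$$

Note that we use the sigmoid value of $h_1$, (sig($h_1$)), not the Net value.

Now that you have the solutions, I'm sure you can see that working out the sigmoid isn't nearly as scary as you imagined it might be. Just a simple case of plugging values into formulae.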
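Putting the hidden and output layers together, a forward pass through the whole 2-2-2 network might look like the sketch below. All numbers are assumptions for illustration, and the wiring of $w_7$, $w_8$, $b_3$ and $b_4$ follows the same pattern as the weights the article spells out.

```python
import math

def sig(x):
    """Sigmoid function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(i, w, b, v):
    """Forward pass through the 2-2-2 network.

    i: inputs {1: i_1, 2: i_2}; w: weights {1..8}; b: biases {1..4};
    v: bias weights keyed by (layer, neuron), e.g. v[(1, 1)] is v_11.
    Returns the sigmoid values of h_1, h_2, o_1 and o_2.
    """
    sig_h1 = sig(w[1] * i[1] + w[3] * i[2] + v[(1, 1)] * b[1])
    sig_h2 = sig(w[2] * i[1] + w[4] * i[2] + v[(1, 2)] * b[2])
    # The output layer is fed the sigmoid values of the hidden neurons.
    sig_o1 = sig(w[5] * sig_h1 + w[7] * sig_h2 + v[(2, 1)] * b[3])
    sig_o2 = sig(w[6] * sig_h1 + w[8] * sig_h2 + v[(2, 2)] * b[4])
    return sig_h1, sig_h2, sig_o1, sig_o2

# Illustrative values only (assumptions, not from the article).
i = {1: 0.05, 2: 0.10}
w = {1: 0.15, 2: 0.25, 3: 0.20, 4: 0.30, 5: 0.40, 6: 0.50, 7: 0.45, 8: 0.55}
b = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}
v = {(1, 1): 0.35, (1, 2): 0.35, (2, 1): 0.60, (2, 2): 0.60}

sig_h1, sig_h2, sig_o1, sig_o2 = forward(i, w, b, v)
print(sig_o1, sig_o2)
```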

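So we have our two sigmoid values for the outputs, $o_1$ and $o_2$. We can then compare sig($o_1$) and sig($o_2$) to the outputs we chose, say target($o_1$) and target($o_2$). To work out the total error, we use half the squared Euclidean norm of the difference between them.

$$E_{total} = \sum_{k} \frac{1}{2} \left( \text{target}(o_k) - \text{sig}(o_k) \right)^2$$

Therefore our Total Error is:

$$E_{total} = \frac{1}{2} \left( \text{target}(o_1) - \text{sig}(o_1) \right)^2 + \frac{1}{2} \left( \text{target}(o_2) - \text{sig}(o_2) \right)^2$$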
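A short sketch of the error calculation, assuming the total error is half the squared difference for each output, added together. The targets and outputs below are made-up numbers for illustration.

```python
def total_error(targets, outputs):
    """Half the squared difference for each output, summed over all outputs."""
    return sum(0.5 * (t - o) ** 2 for t, o in zip(targets, outputs))

# Illustrative numbers only (assumptions, not from the article).
targets = [0.01, 0.99]   # target(o_1), target(o_2)
outputs = [0.75, 0.77]   # sig(o_1), sig(o_2)

error = total_error(targets, outputs)
print(error)
```

The error is 0 only when every output hits its target exactly, and it grows quickly as they drift apart.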

## Maths is Fun, I Promise!

Time for the fun part, partial differentiation. I guess it depends on your definition of fun but let's just assume that we have the same definition. For the next section, all you need is some basic knowledge of partial differentiation and maybe a little chain rule. For those of you with a Maths background, or who know some partial differentiation, you may be able to figure out why, from this point, the bias becomes irrelevant.

In the next section I am going to throw a lot of Maths your way. If you understand the derivations, awesome, if not, that's also perfectly fine. You can simply use the results and your neural network model will be no less special.

We defined our total error to be

$$E_{total} = \frac{1}{2} \left( \text{target}(o_1) - \text{sig}(o_1) \right)^2 + \frac{1}{2} \left( \text{target}(o_2) - \text{sig}(o_2) \right)^2$$

I want to see how sig($o_1$) affects the total error. This is a simple way to think of partial differentiation. As you can see from $E_{total}$, it depends on 4 arguments, target($o_1$), target($o_2$), sig($o_1$) and sig($o_2$). Finding the Partial Derivative means differentiating with respect to only one variable, not all the variables. (This isn't a mathematically sound definition but I find it helps to think of it in this way.)

Naturally the question is, what is differentiation? Differentiation measures how sensitive a function is to a change in its arguments.

### Differentiating Your Mind

Differentiation is a massive subject in Mathematics, so for this article I am not going to go into how to differentiate. There are many resources online on learning to differentiate. I highly recommend working your way through the Khan Academy course (the first and last links especially), split into easily digestible bitesize chunks.

- Khan Academy - Comprehensive Guide, Chain Rule, Basic Differentiation, Partial Differentiation
- Derivative Calculator
- Bitesize Guide
- Chain Rule
- Partial Derivatives
- Partial Derivatives Calculator

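Let us see how sig($o_1$) affects the total error. As you can see from the equation for $E_{total}$, on the right hand side of the equals sign, the term after the '+' doesn't depend on sig($o_1$) as an argument, so its derivative is immediately *0*. Therefore we have:

$$\frac{\partial E_{total}}{\partial \text{sig}(o_1)} = -\left( \text{target}(o_1) - \text{sig}(o_1) \right) = \text{sig}(o_1) - \text{target}(o_1)$$

Similarly, sig($o_2$) affects the total error:

$$\frac{\partial E_{total}}{\partial \text{sig}(o_2)} = -\left( \text{target}(o_2) - \text{sig}(o_2) \right) = \text{sig}(o_2) - \text{target}(o_2)$$

We calculated these results using partial differentiation (ignoring the part of the equation that does not depend on our argument) and the chain rule, going from the squared term to its derivative:

$$\frac{\partial}{\partial \text{sig}(o_1)} \, \frac{1}{2} \left( \text{target}(o_1) - \text{sig}(o_1) \right)^2 = 2 \cdot \frac{1}{2} \left( \text{target}(o_1) - \text{sig}(o_1) \right) \cdot (-1)$$

(Not the worst but could be prettier)

That's the basics. From here on out, I am simply going to give the results, but once you have learned about partial derivatives and the chain rule, I encourage you to figure out these results yourself.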
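If you would rather trust a computer than your own differentiation, you can check a partial derivative numerically: nudge one argument a tiny amount, hold the others fixed, and see how much the error moves. The sketch below assumes the total error is half the squared difference per output; the numbers and function names are illustrative.

```python
def total_error(target_o1, target_o2, sig_o1, sig_o2):
    """Half the squared difference for each output, added together."""
    return 0.5 * (target_o1 - sig_o1) ** 2 + 0.5 * (target_o2 - sig_o2) ** 2

def d_error_d_sig_o1(target_o1, sig_o1):
    """Analytic partial derivative of the total error with respect to sig(o_1)."""
    return sig_o1 - target_o1

def nudge_check(target_o1, target_o2, sig_o1, sig_o2, step=1e-6):
    """Estimate the same derivative by nudging sig(o_1) only."""
    up = total_error(target_o1, target_o2, sig_o1 + step, sig_o2)
    down = total_error(target_o1, target_o2, sig_o1 - step, sig_o2)
    return (up - down) / (2 * step)

# Illustrative numbers only (assumptions, not from the article).
analytic = d_error_d_sig_o1(0.01, 0.75)
numeric = nudge_check(0.01, 0.99, 0.75, 0.77)
print(analytic, numeric)
```

The two numbers agree to many decimal places, and changing the values for $o_2$ makes no difference to either.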

## Updating Weights! Unlike my weight, these may go down as well as up!

The aim of this section is to see how the weights affect the *total* error.

We'll start with $w_5$. We want to calculate

$$\frac{\partial E_{total}}{\partial w_5}$$

You're probably looking at $E_{total}$ and thinking, "$E_{total}$ doesn't take $w_5$ as an argument so the answer is *0*, that was easy!" Before you pat yourself on the back for a great observation, you wouldn't have arrived at the correct conclusion. We have already seen that $E_{total}$ depends on sig($o_1$). sig($o_1$) depends on net($o_1$). net($o_1$) depends on $w_5$. So $E_{total}$ does depend on $w_5$.

Therefore to find how $E_{total}$ depends on $w_5$, the partial derivative we need to calculate is

$$\frac{\partial E_{total}}{\partial w_5} = \frac{\partial E_{total}}{\partial \text{sig}(o_1)} \cdot \frac{\partial \text{sig}(o_1)}{\partial \text{net}(o_1)} \cdot \frac{\partial \text{net}(o_1)}{\partial w_5}$$

If you remember multiplying fractions

$$\frac{3}{4} \times \frac{4}{5} = \frac{3}{5}$$

you know you can cancel the 4's. You can think of the equation above in a similar way and after "cancelling", you are back to

$$\frac{\partial E_{total}}{\partial w_5}$$

(You are not strictly cancelling, so unless you want a lecture on Maths from one of your Mathematically inclined friends, I wouldn't tell anyone that is what you are doing. But just between you and me, it's cancelling!)

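To find how $E_{total}$ depends on sig($o_1$), we calculate the partial derivative

$$\frac{\partial E_{total}}{\partial \text{sig}(o_1)} = \text{sig}(o_1) - \text{target}(o_1)$$

To find how sig($o_1$) depends on net($o_1$), we calculate the partial derivative of the sigmoid function

$$\frac{\partial \text{sig}(o_1)}{\partial \text{net}(o_1)} = \text{sig}(o_1) \left( 1 - \text{sig}(o_1) \right)$$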

This result is particularly tricky. If you have a good understanding of differentiation you should try and get this result. If you have no intention of touching this with a 10 foot pole, you can see a solution here, although you may also see your lunch again.
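For the curious, here is one way to reach the derivative of the sigmoid, assuming the usual definition $\text{sig}(x) = 1 / (1 + e^{-x})$. It is a sketch of the standard argument using the chain rule, with $x$ standing in for the net value.

$$
\begin{aligned}
\text{sig}(x) &= \left( 1 + e^{-x} \right)^{-1} \\
\frac{d\,\text{sig}(x)}{dx} &= -\left( 1 + e^{-x} \right)^{-2} \cdot \left( -e^{-x} \right) = \frac{e^{-x}}{\left( 1 + e^{-x} \right)^{2}} \\
&= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} \\
&= \text{sig}(x) \left( 1 - \text{sig}(x) \right)
\end{aligned}
$$

The last step uses the fact that $\frac{e^{-x}}{1 + e^{-x}} = 1 - \frac{1}{1 + e^{-x}}$.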

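To find how net($o_1$) depends on $w_5$, we calculate the partial derivative

$$\frac{\partial \text{net}(o_1)}{\partial w_5} = \text{sig}(h_1)$$

Look back at your definition of net($o_1$) and you'll quickly spot this.

Plugging these three results into the chain rule gives

$$\frac{\partial E_{total}}{\partial w_5} = \left( \text{sig}(o_1) - \text{target}(o_1) \right) \cdot \text{sig}(o_1) \left( 1 - \text{sig}(o_1) \right) \cdot \text{sig}(h_1)$$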
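Here is a sketch that multiplies the three pieces together for $w_5$ and checks the answer by nudging $w_5$ directly. The inputs, targets, weights and bias terms are assumptions for illustration, with every bias set to 1 and the error taken as half the squared difference per output.

```python
import math

def sig(x):
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative values only (assumptions, not from the article).
I1, I2 = 0.05, 0.10               # inputs
T1, T2 = 0.01, 0.99               # targets
B = 1.0                           # every bias b_1 ... b_4
V_HIDDEN, V_OUTPUT = 0.35, 0.60   # bias weights (v_11 = v_12, v_21 = v_22)
W = {1: 0.15, 2: 0.25, 3: 0.20, 4: 0.30, 5: 0.40, 6: 0.50, 7: 0.45, 8: 0.55}

def forward(w):
    sig_h1 = sig(w[1] * I1 + w[3] * I2 + V_HIDDEN * B)
    sig_h2 = sig(w[2] * I1 + w[4] * I2 + V_HIDDEN * B)
    sig_o1 = sig(w[5] * sig_h1 + w[7] * sig_h2 + V_OUTPUT * B)
    sig_o2 = sig(w[6] * sig_h1 + w[8] * sig_h2 + V_OUTPUT * B)
    return sig_h1, sig_h2, sig_o1, sig_o2

def total_error(w):
    _, _, sig_o1, sig_o2 = forward(w)
    return 0.5 * (T1 - sig_o1) ** 2 + 0.5 * (T2 - sig_o2) ** 2

def grad_w5(w):
    """dE/dsig(o_1) * dsig(o_1)/dnet(o_1) * dnet(o_1)/dw_5."""
    sig_h1, _, sig_o1, _ = forward(w)
    return (sig_o1 - T1) * sig_o1 * (1 - sig_o1) * sig_h1

def numeric_grad(w, k, step=1e-6):
    """Central-difference estimate of dE/dw_k, used only as a sanity check."""
    up, down = dict(w), dict(w)
    up[k] += step
    down[k] -= step
    return (total_error(up) - total_error(down)) / (2 * step)

print(grad_w5(W), numeric_grad(W, 5))
```

The bias terms appear in the forward pass but nowhere in `grad_w5`, which is the "bias becomes irrelevant" observation in action.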

Now that we have calculated how $w_5$ affects the total error, let's take a look at the neural network we are modelling, focusing on $w_5$.

You can see that $w_5$ effectively connects the sigmoid value of neuron $h_1$ to neuron $o_1$. That is why, in the calculation, we see sig($o_1$) and sig($h_1$).

The new value of $w_5$, $w_{51}$, is now ($w_{50}$ is the old value)

$$w_{51} = w_{50} - \eta \frac{\partial E_{total}}{\partial w_5}$$

Wait a second! What is that weird *n* and what is it doing in this equation? That "n" is an eta ($\eta$) and it's there to represent the learning rate. The higher the learning rate, the bigger each step, so the quicker your neural network will lower the error to get close to your target output. However, big steps can overshoot, so the neural network will be less accurate. For a small example like this one, a learning rate of 1/2 is a reasonable choice.
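The update rule itself is tiny. In this sketch the old weight and the gradient are made-up numbers, and `update_weight` is an illustrative name; the point is only to show how the learning rate scales the step.

```python
def update_weight(old_weight, gradient, eta=0.5):
    """Step against the gradient, scaled by the learning rate eta."""
    return old_weight - eta * gradient

# Illustrative numbers only (assumptions, not from the article).
w_50 = 0.40       # old value of w_5
gradient = 0.08   # dE_total / dw_5

w_51 = update_weight(w_50, gradient)                     # eta = 1/2
w_51_cautious = update_weight(w_50, gradient, eta=0.1)   # smaller step
print(w_51, w_51_cautious)
```

A positive gradient pushes the weight down and a negative one pushes it up, so these weights really can go down as well as up.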

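If you were to take a guess at the equation for $w_6$, what do you think it would be?

You can see that $w_6$ connects the sigmoid value of neuron $h_1$ to neuron $o_2$. Therefore a good guess would be

$$\frac{\partial E_{total}}{\partial w_6} = \left( \text{sig}(o_2) - \text{target}(o_2) \right) \cdot \text{sig}(o_2) \left( 1 - \text{sig}(o_2) \right) \cdot \text{sig}(h_1)$$

See if you can calculate this result, like we did with $w_5$, noticing that, to find how $E_{total}$ depends on $w_6$, $E_{total}$ depends on sig($o_2$), which depends on net($o_2$), which depends on $w_6$.

Calculate for practice and prove to yourself that

$$\frac{\partial E_{total}}{\partial \text{sig}(o_2)} = \text{sig}(o_2) - \text{target}(o_2), \qquad \frac{\partial \text{sig}(o_2)}{\partial \text{net}(o_2)} = \text{sig}(o_2) \left( 1 - \text{sig}(o_2) \right), \qquad \frac{\partial \text{net}(o_2)}{\partial w_6} = \text{sig}(h_1)$$

Therefore,

$$w_{61} = w_{60} - \eta \frac{\partial E_{total}}{\partial w_6}$$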
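The same pattern covers all four hidden-to-output weights: each gradient is "how wrong the output neuron is" multiplied by the sigmoid value of the hidden neuron feeding it. The sketch below assumes $w_7$ and $w_8$ leave $h_2$ in the same way that $w_5$ and $w_6$ leave $h_1$, and uses made-up sigmoid values.

```python
def output_layer_gradients(sig_h, sig_o, target):
    """Gradients of the total error for w_5 ... w_8.

    sig_h, sig_o and target are dicts keyed by neuron index (1 or 2).
    """
    # How the error changes with the net value of each output neuron.
    delta = {k: (sig_o[k] - target[k]) * sig_o[k] * (1 - sig_o[k]) for k in (1, 2)}
    # Which (hidden neuron, output neuron) pair each weight connects.
    connects = {5: (1, 1), 6: (1, 2), 7: (2, 1), 8: (2, 2)}
    return {n: delta[o] * sig_h[h] for n, (h, o) in connects.items()}

# Illustrative numbers only (assumptions, not from the article).
sig_h = {1: 0.593269992, 2: 0.596884378}
sig_o = {1: 0.75136507, 2: 0.772928465}
target = {1: 0.01, 2: 0.99}

grads = output_layer_gradients(sig_h, sig_o, target)
for n in sorted(grads):
    print(f"dE/dw_{n} =", grads[n])
```

Here $o_1$ overshoots its target, so the gradients for $w_5$ and $w_7$ are positive; $o_2$ undershoots, so those for $w_6$ and $w_8$ are negative.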

The value of eta is the same for every weight in the whole neural network (for weights 1-8, not just 5-8), but I see no reason why they can't be different. It will mean that your neural network weights are learning at different rates, but for some models this may be important. For example, you may care more about one output than the other. Say, in our example, $o_2$ was more important: the learning rate of $w_6$ and $w_8$ could be higher than that of $w_5$ and $w_7$.

We have successfully seen how $w_5$, $w_6$, $w_7$ and $w_8$ affect the total error of our network and calculated their new values.

But that is only one layer. How do $w_1$, $w_2$, $w_3$ and $w_4$ affect the total error?

## It's the Final Layer!

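We shall start with $w_1$.

Okay, we can see that $w_1$ connects $i_1$ and $h_1$. I'm going to try the same method as we employed for $w_5$. Umm... how does $h_1$ affect the total error?

We know that $E_{total}$ depends on sig($o_1$) and sig($o_2$). sig($o_1$) depends on net($o_1$) and sig($o_2$) depends on net($o_2$). net($o_1$) depends on sig($h_1$) and net($o_2$) depends on sig($h_1$). (Ah, there's the link!) sig($h_1$) depends on net($h_1$). net($h_1$) depends on $w_1$. That was only mildly inconvenient.

Therefore our formula is

$$\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial \text{sig}(h_1)} \cdot \frac{\partial \text{sig}(h_1)}{\partial \text{net}(h_1)} \cdot \frac{\partial \text{net}(h_1)}{\partial w_1}$$

where

$$\frac{\partial E_{total}}{\partial \text{sig}(h_1)} = \frac{\partial E_{total}}{\partial \text{sig}(o_1)} \cdot \frac{\partial \text{sig}(o_1)}{\partial \text{net}(o_1)} \cdot \frac{\partial \text{net}(o_1)}{\partial \text{sig}(h_1)} + \frac{\partial E_{total}}{\partial \text{sig}(o_2)} \cdot \frac{\partial \text{sig}(o_2)}{\partial \text{net}(o_2)} \cdot \frac{\partial \text{net}(o_2)}{\partial \text{sig}(h_1)}$$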

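To find how $E_{total}$ depends on sig($h_1$), we calculate the partial derivative

$$\frac{\partial E_{total}}{\partial \text{sig}(h_1)} = \left( \text{sig}(o_1) - \text{target}(o_1) \right) \text{sig}(o_1) \left( 1 - \text{sig}(o_1) \right) w_5 + \left( \text{sig}(o_2) - \text{target}(o_2) \right) \text{sig}(o_2) \left( 1 - \text{sig}(o_2) \right) w_6$$

To find how sig($h_1$) depends on net($h_1$), we calculate the partial derivative of the sigmoid function

$$\frac{\partial \text{sig}(h_1)}{\partial \text{net}(h_1)} = \text{sig}(h_1) \left( 1 - \text{sig}(h_1) \right)$$

To find how net($h_1$) depends on $w_1$, we calculate the partial derivative

$$\frac{\partial \text{net}(h_1)}{\partial w_1} = i_1$$

Remind yourself of the value of net($h_1$).

I'll leave the following for you to figure out.

$$\frac{\partial E_{total}}{\partial w_2}, \qquad \frac{\partial E_{total}}{\partial w_3}, \qquad \frac{\partial E_{total}}{\partial w_4}$$

Use a similar formula with the learning rate to find the new values for $w_1$, $w_2$, $w_3$ and $w_4$.

$$w_{11} = w_{10} - \eta \frac{\partial E_{total}}{\partial w_1}$$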
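The longer chain for $w_1$ can be checked the same way as before: compute it from the formula, then compare with a direct nudge of $w_1$. Everything numeric here is an assumption for illustration (every bias is 1, the error is half the squared difference per output), so treat it as a sketch rather than the definitive implementation.

```python
import math

def sig(x):
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative values only (assumptions, not from the article).
I1, I2 = 0.05, 0.10
T1, T2 = 0.01, 0.99
BIAS_HIDDEN, BIAS_OUTPUT = 0.35, 0.60   # v * b for the hidden and output neurons
W = {1: 0.15, 2: 0.25, 3: 0.20, 4: 0.30, 5: 0.40, 6: 0.50, 7: 0.45, 8: 0.55}

def forward(w):
    sig_h1 = sig(w[1] * I1 + w[3] * I2 + BIAS_HIDDEN)
    sig_h2 = sig(w[2] * I1 + w[4] * I2 + BIAS_HIDDEN)
    sig_o1 = sig(w[5] * sig_h1 + w[7] * sig_h2 + BIAS_OUTPUT)
    sig_o2 = sig(w[6] * sig_h1 + w[8] * sig_h2 + BIAS_OUTPUT)
    return sig_h1, sig_h2, sig_o1, sig_o2

def total_error(w):
    _, _, sig_o1, sig_o2 = forward(w)
    return 0.5 * (T1 - sig_o1) ** 2 + 0.5 * (T2 - sig_o2) ** 2

def grad_w1(w):
    sig_h1, _, sig_o1, sig_o2 = forward(w)
    delta_o1 = (sig_o1 - T1) * sig_o1 * (1 - sig_o1)
    delta_o2 = (sig_o2 - T2) * sig_o2 * (1 - sig_o2)
    # sig(h_1) reaches the error through both outputs, via w_5 and w_6.
    d_error_d_sig_h1 = delta_o1 * w[5] + delta_o2 * w[6]
    return d_error_d_sig_h1 * sig_h1 * (1 - sig_h1) * I1

def numeric_grad(w, k, step=1e-6):
    """Central-difference estimate of dE/dw_k, used only as a sanity check."""
    up, down = dict(w), dict(w)
    up[k] += step
    down[k] -= step
    return (total_error(up) - total_error(down)) / (2 * step)

eta = 0.5
new_w1 = W[1] - eta * grad_w1(W)
print(grad_w1(W), numeric_grad(W, 1), new_w1)
```

The gradient for $w_1$ is much smaller than the one for $w_5$ with these numbers, because it is diluted by two sigmoid derivatives and a small input.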

So, that's the Maths. Whether you followed it or not, I'm sure you have a clearer picture of what is happening along those weights. Personally, when working through the neural network, a visual representation of how the weights are changing, and what affects them, helped me.

## Your Training is now Complete young Padawan!

I'm glad you've made it this far and I hope you now have the Maths ability to write your own neural network in the language of your choice, even Java!

If you found the neural network example puzzling, I would advise you to try working through some neural networks for yourself.

### Remember, Remember the five (plus 1) tips for Neural Networks! Gunpowder, Treason and Plot ....

- Start with your inputs
- Calculate your nets for each layer and their relevant sigmoid
- Find the error
- Differentiate everything or make an educated guess at the updated weights
- Put everything together like a jigsaw
- Relax
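As a rough sketch of how the tips fit together in code, here is one training step for the 2-2-2 network, repeated in a loop. The inputs, targets, starting weights and bias terms are assumptions for illustration; the bias terms are held fixed, the error is half the squared difference per output, and every weight shares the same learning rate.

```python
import math

def sig(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(w, inputs, targets, bias_terms, eta=0.5):
    """One forward and backward pass. Returns (new weights, error before the update)."""
    i1, i2 = inputs
    t1, t2 = targets
    bh1, bh2, bo1, bo2 = bias_terms  # v * b for each neuron, held fixed

    # Tips 1 and 2: start with the inputs, then nets and sigmoids layer by layer.
    h1 = sig(w[1] * i1 + w[3] * i2 + bh1)
    h2 = sig(w[2] * i1 + w[4] * i2 + bh2)
    o1 = sig(w[5] * h1 + w[7] * h2 + bo1)
    o2 = sig(w[6] * h1 + w[8] * h2 + bo2)

    # Tip 3: find the error.
    error = 0.5 * (t1 - o1) ** 2 + 0.5 * (t2 - o2) ** 2

    # Tip 4: differentiate everything (all gradients use the old weights).
    d_o1 = (o1 - t1) * o1 * (1 - o1)
    d_o2 = (o2 - t2) * o2 * (1 - o2)
    d_h1 = (d_o1 * w[5] + d_o2 * w[6]) * h1 * (1 - h1)
    d_h2 = (d_o1 * w[7] + d_o2 * w[8]) * h2 * (1 - h2)
    grads = {1: d_h1 * i1, 2: d_h2 * i1, 3: d_h1 * i2, 4: d_h2 * i2,
             5: d_o1 * h1, 6: d_o2 * h1, 7: d_o1 * h2, 8: d_o2 * h2}

    # Tip 5: put everything together with the learning rate.
    new_w = {k: w[k] - eta * grads[k] for k in w}
    return new_w, error

# Illustrative values only (assumptions, not from the article).
START = {1: 0.15, 2: 0.25, 3: 0.20, 4: 0.30, 5: 0.40, 6: 0.50, 7: 0.45, 8: 0.55}
INPUTS, TARGETS = (0.05, 0.10), (0.01, 0.99)
BIAS_TERMS = (0.35, 0.35, 0.60, 0.60)

weights = dict(START)
errors = []
for _ in range(1000):
    weights, error = train_step(weights, INPUTS, TARGETS, BIAS_TERMS)
    errors.append(error)

print("first error:", errors[0])
print("last error: ", errors[-1])
```

Tip 6 is left as an exercise for the reader.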

We'll start with the most simple model:

Now try a neural network that is a little more difficult. Don't let the bias scare you:

Are you getting the hang of this now? This one might be slightly more challenging (but not for you!):

Let's complete that picture you're building in your head:

To become a Maths Master (when it comes to modelling neural networks), your final challenge is:

You'll recognise this next neural network as the one we worked through together. I believe in you, that you can work through this yourself. No Cheating!

How about one more just for fun? Don't tell me you don't find Maths fun now!

Okay so you can calculate a 3 layered neural network. Time to try 4 layers:

As the old saying goes, once you can do 4 layers, you can do an arbitrary number of layers. Doesn't quite roll off the tongue, does it?

And you're done! Congratulations! Have a celebratory cookie!

If you have any questions when attempting the above models, or any questions in general, advice or improvements on the model, feel free to get in touch! You can find my contact details on my profile.
