This blog article contains a few worked examples and some exercises for you to try yourself. For maximum benefit, find a piece of paper and a pen and work through the problems as you go.

To recap on the fundamentals of Neural Networks, see my Deep Learning blog, where I also covered the basics of the maths behind a neural network. In this blog, I'm going to go into more detail with the Maths, and attempt to explain some higher-level concepts.

We could use a pre-built library or framework, but no, we want a true understanding of what is happening. We are going to model our own neural network.

The neural network we are going to model is a very simple case. It has 2 inputs (*i _{1}*, *i _{2}*), one hidden layer with 2 neurons (*h _{1}*, *h _{2}*) and 2 outputs (*o _{1}*, *o _{2}*).

This neural network could be modelling how to get from [1, 2] to [3, 4]. Or, say you gave a word a numerical value based on the position of each letter in the alphabet (a = 1, b = 2, ..., z = 26). You may want to predict the next words for a keyboard. For example, what are the next two words after "How are"? "How" has value 8 + 15 + 23 = 46. "are" has value 1 + 18 + 5 = 24. The inputs are 46 and 24, and you want to train your neural network to output "you today", with values 61 and 65. You then have a problem of decoding, as 65 could represent "today" or "wori", which, although not a word, still has the correct value.
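
As a quick illustration, here is that letter-scoring scheme as a Python sketch (the function name is my own):

```python
# Score a word by summing each letter's position in the alphabet
# (a = 1, b = 2, ..., z = 26), as in the "How are" example.
def word_value(word):
    return sum(ord(c) - ord('a') + 1 for c in word.lower())

print(word_value("How"))    # 8 + 15 + 23 = 46
print(word_value("are"))    # 1 + 18 + 5 = 24
print(word_value("you"))    # 25 + 15 + 21 = 61
print(word_value("today"))  # 65
print(word_value("wori"))   # also 65, hence the decoding problem
```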

In order to create a neural network you need the following: (The values in the brackets relate to the above 2-2-2 neural network.)

- A set of inputs (1, ..., n) (*i _{1}*, *i _{2}*)
- A set of outputs (1, ..., m) (*o _{1}*, *o _{2}*)
- Number of Hidden Layers (1)
- Number of Neurons in each Hidden Layer (2: *h _{1}*, *h _{2}*)

Note: *n* does not need to equal *m*. Each Hidden Layer doesn't need to have the same number of neurons.

Optional properties:

- Bias for each neuron in the hidden layer(s) (*b _{1}*, *b _{2}*) and output layer (*b _{3}*, *b _{4}*)
- Weights of the Biases in the hidden layer(s) (*v _{11}*, *v _{12}*) and output layer (*v _{21}*, *v _{22}*)
- Weights connecting neurons (*w _{1}*, *w _{2}*, *w _{3}*, *w _{4}* for input to hidden; *w _{5}*, *w _{6}*, *w _{7}*, *w _{8}* for hidden to output)

From the calculations you will see that the bias is only needed to calculate the net of a neuron. Once we start differentiating with respect to the weights, the bias term doesn't depend on any of them, so it always evaluates to 0. The same goes for the weights of the biases.

In a neural network, as it is trained, the weights are updated to minimise the error between the network's outputs and your target outputs; the weights are therefore constantly changing. By specifying the starting weights, you are only providing a starting point, as you have no idea what the final value of a weight may be. There is a high possibility that the value is negative. You may as well pick a random number between 0 and 1.
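
A minimal sketch of that starting point, assuming one random weight in [0, 1) for each of the 2-2-2 network's connections and bias weights (the dictionary keys are my own naming):

```python
import random

random.seed(42)  # fixed seed so the example is reproducible

# One random starting weight in [0, 1) per connection: w1-w4 (input to
# hidden), w5-w8 (hidden to output), plus the bias weights v11-v22.
weights = {name: random.random()
           for name in ["w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8",
                        "v11", "v12", "v21", "v22"]}

for name, value in weights.items():
    print(f"{name} = {value:.3f}")
```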

You're probably wondering to yourself, what possible reason could I have for going near Maths again? Didn't I leave that life behind at GCSE/A-Level, where by the end of the course there were more letters than numbers? An English degree probably involved fewer letters than that. However, Maths is very important, and I am not biased just because I am a Maths Graduate. For programming in general, Maths allows you to find optimizations for your processes, and gives you the obvious benefit when doing calculations. You don't need Maths to be a great programmer, but it certainly helps.

Now, for Machine Learning, I would argue that Maths is even more important. Look at Google's server rooms: there are hundreds of servers running the calculations needed for search and all their other processes. Remember, Google employs very smart people to make sure that their programs run as efficiently as possible given their resources. Imagine how many servers Google would need without a Maths basis.

You and I are not Google or Facebook or Microsoft. We need Maths for the algorithms, for building the neural networks, for the linear algebra, and for choosing different algorithms based on the data we are trying to model.

Let's take this one step at a time. First we are going to focus on the hidden layer. Let's calculate the value of *h _{1}*.

As we can see, *h _{1}* depends on the inputs *i _{1}* and *i _{2}* (through the weights *w _{1}* and *w _{2}*) and on the bias *b _{1}* (through its weight *v _{11}*).

Therefore,

net(h_{1}) = w_{1} i_{1} + w_{2} i_{2} + v_{11} b_{1}

We then put this value of net(*h _{1}*) into the Sigmoid Function. From my Deep Learning article, a sigmoid neuron outputs a smooth continuous range of values between 0 and 1. Exponential functions are simple to handle mathematically and, since learning algorithms involve lots of differentiation (Spoiler Alert!), choosing a function that is computationally cheap to differentiate is great.

The sigmoid function is defined to be:

sig(x) = 1 / (1 + e^{-x})

Therefore the output of *h _{1}* is

sig(h_{1}) = 1 / (1 + e^{-net(h_{1})})

We can repeat this same process with *h _{2}*:

net(h_{2}) = w_{3} i_{1} + w_{4} i_{2} + v_{12} b_{2}, and sig(h_{2}) = 1 / (1 + e^{-net(h_{2})})
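
To make the hidden-layer step concrete, here is a small Python sketch of net() and sig(). All numeric values are made up, and I'm assuming the wiring from the lists above: *w _{1}*, *w _{2}* and bias weight *v _{11}* feed *h _{1}*, while *w _{3}*, *w _{4}* and *v _{12}* feed *h _{2}*:

```python
import math

def sig(x):
    # The sigmoid function: squashes any real number into (0, 1).
    return 1 / (1 + math.exp(-x))

# Made-up example values for the inputs, weights and biases.
i1, i2 = 0.1, 0.5
w1, w2, w3, w4 = 0.15, 0.2, 0.25, 0.3
b1, b2 = 1.0, 1.0
v11, v12 = 0.35, 0.35

net_h1 = w1 * i1 + w2 * i2 + v11 * b1  # net(h1)
net_h2 = w3 * i1 + w4 * i2 + v12 * b2  # net(h2)
print(sig(net_h1), sig(net_h2))        # both land strictly between 0 and 1
```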

If you feel you have understood up to this point, firstly congratulations, and secondly see if you can work out the net() and sig() for *o _{1}* and *o _{2}*.

**WARNING**: DO NOT READ PAST THIS POINT IF YOU WANT TO ATTEMPT THE ABOVE EXERCISE!!

**HINT!**: The net() of an output neuron has the same form as the net() of a hidden neuron, but its inputs are the sigmoid values of the hidden layer.

The results for *o _{1}* and *o _{2}* are coming up.

THIS IS YOUR LAST CHANCE TO TRY YOURSELF BEFORE I REVEAL MY SOLUTIONS:

net(o_{1}) = w_{5} sig(h_{1}) + w_{6} sig(h_{2}) + v_{21} b_{3}, with sig(o_{1}) = 1 / (1 + e^{-net(o_{1})})

net(o_{2}) = w_{7} sig(h_{1}) + w_{8} sig(h_{2}) + v_{22} b_{4}, with sig(o_{2}) = 1 / (1 + e^{-net(o_{2})})

Note that we use the sigmoid value of *h _{1}*, sig(*h _{1}*), and not net(*h _{1}*), as the input when calculating net(*o _{1}*).

Now that you have the solutions, I'm sure you can see that working out the sigmoid isn't nearly as scary as you imagined it might be. Just a simple case of plugging values into formulae.

So we have our two sigmoid values for the outputs, *o _{1}* and *o _{2}*. To measure how far these are from the target outputs, t_{1} and t_{2}, we take the squared error of each output.

Therefore our Total Error is:

E_{total} = (1/2)(t_{1} - sig(o_{1}))^{2} + (1/2)(t_{2} - sig(o_{2}))^{2}
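
As a sanity check, the squared-error sum is one line of Python (the output and target values here are made up):

```python
def total_error(outputs, targets):
    # E_total = sum of 1/2 * (target - output)^2 over the outputs
    return sum(0.5 * (t - o) ** 2 for o, t in zip(outputs, targets))

# Made-up sigmoid outputs and target outputs.
print(total_error([0.75, 0.77], [0.01, 0.99]))
```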

Time for the fun part, partial differentiation. I guess it depends on your definition of fun, but let's just assume that we have the same definition. For the next section, all you need is some basic knowledge of partial differentiation and maybe a little chain rule. Those of you with a Maths background, or who know some partial differentiation, may be able to figure out why, from this point, the bias becomes irrelevant.

In the next section I am going to throw a lot of Maths your way. If you understand the derivations, awesome, if not, that's also perfectly fine. You can simply use the results and your neural network model will be no less special.

We defined our total error (where t_{1} and t_{2} are the target outputs) to be

E_{total} = (1/2)(t_{1} - sig(o_{1}))^{2} + (1/2)(t_{2} - sig(o_{2}))^{2}

I want to see how sig(*o _{1}*) affects the total error; this is a simple way to think of partial differentiation. As you can see from E_{total}, only the first term involves sig(*o _{1}*).

Naturally the question is, what is differentiation? Differentiation measures the sensitivity of change of a function with respect to its arguments.

Differentiation is a massive subject in Mathematics, so for this article I am not going to go into how to differentiate. There are many resources online on learning to differentiate. I highly recommend working your way through the Khan Academy course (the first and last links especially), split into easily digestible bitesize chunks.

- Khan Academy - Comprehensive Guide, Chain Rule, Basic Differentiation, Partial Differentiation
- Derivative Calculator
- Bitesize Guide
- Chain Rule
- Partial Derivatives
- Partial Derivatives Calculator

Let us see how sig(*o _{1}*) affects the total error. As you can see from the equation for E_{total}, only the first term depends on sig(*o _{1}*) (writing t_{1} for the first target output):

∂E_{total}/∂sig(o_{1}) = -(t_{1} - sig(o_{1})) = sig(o_{1}) - t_{1}

Similarly, sig(*o _{2}*) affects the total error:

∂E_{total}/∂sig(o_{2}) = sig(o_{2}) - t_{2}

We calculated these results using partial differentiation (ignoring the part of the equation that does not depend on our argument) and the chain rule to get from (1/2)(t_{1} - sig(o_{1}))^{2} to -(t_{1} - sig(o_{1})).

(Not the worst but could be prettier)

That's the basics, from here on out, I am simply going to give the results but once you have learned about partial derivatives and the chain rule, I encourage you to figure out these results yourself.

The aim of this section is to see how the weights affect the *total* error.

We'll start with *w _{5}*. We want to calculate ∂E_{total}/∂w_{5}.

You're probably looking at E_{total} and thinking, "E_{total} doesn't take *w _{5}* as an argument, so the answer is 0". However, E_{total} depends on sig(*o _{1}*), which depends on net(*o _{1}*), which in turn depends on *w _{5}*.

Therefore, to find how E_{total} depends on *w _{5}*, the partial derivative we need to calculate is

∂E_{total}/∂w_{5} = ∂E_{total}/∂sig(o_{1}) × ∂sig(o_{1})/∂net(o_{1}) × ∂net(o_{1})/∂w_{5}

If you remember multiplying fractions,

(3/4) × (4/5) = 3/5

you know you can cancel the 4's. You can think of the chain rule expression above in a similar way, and after "cancelling", you are back to ∂E_{total}/∂w_{5}.

(You are not strictly cancelling, so unless you want a lecture on Maths from one of your Mathematically inclined friends, I wouldn't tell anyone that is what you are doing. But just between you and me, it's cancelling!)

To find how E_{total} depends on sig(*o _{1}*), we use the partial derivative we found earlier:

∂E_{total}/∂sig(o_{1}) = sig(o_{1}) - t_{1}

To find how sig(*o _{1}*) depends on net(*o _{1}*), we differentiate the sigmoid function:

∂sig(o_{1})/∂net(o_{1}) = sig(o_{1})(1 - sig(o_{1}))

This result is particularly tricky. If you have a good understanding of differentiation you should try to get this result yourself. If you have no intention of touching it with a 10-foot pole, you can see a solution here, although you may also see your lunch again.
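
If you'd rather trust a computer than your stomach, here is a quick numerical check of the result sig'(x) = sig(x)(1 - sig(x)), using a finite difference:

```python
import math

def sig(x):
    return 1 / (1 + math.exp(-x))

x = 0.7    # arbitrary test point
h = 1e-6   # small step for the finite difference

numeric = (sig(x + h) - sig(x - h)) / (2 * h)  # approximate slope at x
analytic = sig(x) * (1 - sig(x))               # the claimed derivative
print(abs(numeric - analytic))                 # tiny, so the two agree
```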

To find how net(*o _{1}*) depends on *w _{5}*, look back at your definition of net(*o _{1}*):

net(o_{1}) = w_{5} sig(h_{1}) + w_{6} sig(h_{2}) + v_{21} b_{3}

Only the first term depends on *w _{5}*, so

∂net(o_{1})/∂w_{5} = sig(h_{1})

Plugging this into our chain rule expression gives

∂E_{total}/∂w_{5} = (sig(o_{1}) - t_{1}) × sig(o_{1})(1 - sig(o_{1})) × sig(h_{1})
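
Putting the three chain-rule factors together in code looks like this. The hidden-layer sigmoids, weights, bias and target value are all made up for illustration:

```python
import math

def sig(x):
    return 1 / (1 + math.exp(-x))

# Made-up values: hidden sigmoid outputs, output weights, bias, target.
sig_h1, sig_h2 = 0.59, 0.60
w5, w6 = 0.4, 0.45
b3, v21 = 1.0, 0.6
t1 = 0.01

net_o1 = w5 * sig_h1 + w6 * sig_h2 + v21 * b3
sig_o1 = sig(net_o1)

dE_dsig_o1 = sig_o1 - t1           # how E_total changes with sig(o1)
dsig_dnet = sig_o1 * (1 - sig_o1)  # how sig(o1) changes with net(o1)
dnet_dw5 = sig_h1                  # how net(o1) changes with w5

dE_dw5 = dE_dsig_o1 * dsig_dnet * dnet_dw5
print(dE_dw5)
```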

Now that we have calculated how *w _{5}* affects the total error, let's take a look at the neural network we are modelling, focusing on *w _{5}*.

You can see that *w _{5}* effectively connects the sigmoid value of neuron *h _{1}* to the output neuron *o _{1}*.

The new value of *w _{5}*, written w_{5}^{+}, is

w_{5}^{+} = w_{5} - η × ∂E_{total}/∂w_{5}

Wait a second! What is that weird *η* and what is it doing in this equation? That symbol is an eta, and it's there to represent the learning rate. The higher the learning rate, the quicker your neural network will lower the error to get close to your output; however, the neural network will be less accurate. Normally, the learning rate is set to 1/2.
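
The update rule itself is one line. With eta = 1/2 and some made-up numbers for the current weight and its gradient:

```python
eta = 0.5      # the learning rate (eta)
w5 = 0.4       # current weight (made-up value)
dE_dw5 = 0.08  # gradient from the previous section (made-up value)

# Step the weight downhill along its gradient.
w5_new = w5 - eta * dE_dw5
print(w5_new)
```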

If you were to take a guess at the equation for *w _{6}*, what do you think it would be?

You can see that *w _{6}* connects the sigmoid value of neuron *h _{2}* to the output neuron *o _{1}*.

See if you can calculate this result, like we did with *w _{5}*, noticing that, to find how E_{total} depends on *w _{6}*, only the last factor of the chain rule changes.

Calculate ∂net(o_{1})/∂w_{6} for practice and prove to yourself that

∂net(o_{1})/∂w_{6} = sig(h_{2})

Therefore,

∂E_{total}/∂w_{6} = (sig(o_{1}) - t_{1}) × sig(o_{1})(1 - sig(o_{1})) × sig(h_{2})

The value of eta is the same for every weight in the whole neural network (for weights 1-8, not just 5-8), but I see no reason why they can't be different. It will mean that your neural network weights are learning at different rates, which for some models may be important. If you care more about one output than the other, say *o _{2}* was more important in our example, then the learning rates of the weights feeding *o _{2}* could be set higher than the others.

We have successfully seen how *w _{5}*, *w _{6}*, *w _{7}* and *w _{8}* affect the total error, and how to update them.

But that is only one layer. How do *w _{1}*, *w _{2}*, *w _{3}* and *w _{4}* affect the total error?

We shall start with *w _{1}*.

Okay, we can see that *w _{1}* connects *i _{1}* and *h _{1}*.

We know that E_{total} depends on sig(*o _{1}*) and sig(*o _{2}*), and both of these depend on sig(*h _{1}*).

Therefore our formula is

∂E_{total}/∂w_{1} = ∂E_{total}/∂sig(h_{1}) × ∂sig(h_{1})/∂net(h_{1}) × ∂net(h_{1})/∂w_{1}

where

∂E_{total}/∂sig(h_{1}) = ∂E_{o1}/∂sig(h_{1}) + ∂E_{o2}/∂sig(h_{1})

To find how E_{total} depends on sig(*h _{1}*), we add up the contribution from each output's error:

∂E_{o1}/∂sig(h_{1}) = (sig(o_{1}) - t_{1}) × sig(o_{1})(1 - sig(o_{1})) × w_{5}

∂E_{o2}/∂sig(h_{1}) = (sig(o_{2}) - t_{2}) × sig(o_{2})(1 - sig(o_{2})) × w_{7}

To find how sig(*h _{1}*) depends on net(*h _{1}*), we use the same sigmoid derivative as before:

∂sig(h_{1})/∂net(h_{1}) = sig(h_{1})(1 - sig(h_{1}))

To find how net(*h _{1}*) depends on *w _{1}*, remind yourself of the value of net(*h _{1}*):

net(h_{1}) = w_{1} i_{1} + w_{2} i_{2} + v_{11} b_{1}

Only the first term depends on *w _{1}*, so

∂net(h_{1})/∂w_{1} = i_{1}
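
Here's how the two error contributions combine for ∂E_{total}/∂w_{1} in code. I'm assuming *w _{5}* and *w _{7}* are the weights carrying sig(*h _{1}*) to the two outputs, and every numeric value is made up for illustration:

```python
import math

def sig(x):
    return 1 / (1 + math.exp(-x))

# Made-up network values.
i1, i2 = 0.1, 0.5
w1, w2 = 0.15, 0.2
v11, b1 = 0.35, 1.0
w5, w7 = 0.4, 0.5   # weights carrying sig(h1) to o1 and o2
t1, t2 = 0.01, 0.99

net_h1 = w1 * i1 + w2 * i2 + v11 * b1
sig_h1 = sig(net_h1)
# Pretend the rest of the forward pass produced these output sigmoids:
sig_o1, sig_o2 = 0.75, 0.77

# Each output's error reaches sig(h1) through its own chain:
dEo1_dsig_h1 = (sig_o1 - t1) * sig_o1 * (1 - sig_o1) * w5
dEo2_dsig_h1 = (sig_o2 - t2) * sig_o2 * (1 - sig_o2) * w7
dE_dsig_h1 = dEo1_dsig_h1 + dEo2_dsig_h1

# Then the last two factors, exactly as for the output layer:
dE_dw1 = dE_dsig_h1 * sig_h1 * (1 - sig_h1) * i1
print(dE_dw1)
```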

I'll leave the following for you to figure out.

Use similar formulae with the learning rate to find the new values for *w _{1}*, *w _{2}*, *w _{3}* and *w _{4}*.

So, that's the Maths. Whether you followed it or not, I'm sure you have a clearer picture of what is happening along those weights. Personally, when working through the neural network, a visual representation of how the weights were changing, and what affected them, helped me.

I'm glad you've made it this far and I hope you now have the Maths ability to write your own neural network in the language of your choice, even Java!

If you found the neural network example puzzling, I would advise you to try working through some neural networks for yourself.

- Start with your inputs
- Calculate your nets for each layer and their relevant sigmoid
- Find the error
- Differentiate everything, or make an educated guess at the updated weights
- Put everything together like a jigsaw
- Relax
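
The checklist above can be sketched as a single training step in Python. This is a 2-2-2 network without biases to keep it short, and the starting weights, inputs and targets are all made up:

```python
import math

def sig(x):
    return 1 / (1 + math.exp(-x))

def train_step(inputs, targets, W1, W2, eta=0.5):
    # W1[j][k] connects input k to hidden neuron j;
    # W2[j][k] connects hidden neuron k to output neuron j.
    h = [sig(sum(w * x for w, x in zip(row, inputs))) for row in W1]
    o = [sig(sum(w * x for w, x in zip(row, h))) for row in W2]
    error = sum(0.5 * (t - y) ** 2 for y, t in zip(o, targets))

    # dE/dnet for each output neuron.
    d_out = [(y - t) * y * (1 - y) for y, t in zip(o, targets)]
    # Update hidden -> output weights.
    new_W2 = [[w - eta * d * hk for w, hk in zip(row, h)]
              for row, d in zip(W2, d_out)]
    # Propagate the deltas back through the OLD hidden -> output weights.
    d_hid = [sum(d * W2[j][k] for j, d in enumerate(d_out)) * h[k] * (1 - h[k])
             for k in range(len(h))]
    # Update input -> hidden weights.
    new_W1 = [[w - eta * d * x for w, x in zip(row, inputs)]
              for row, d in zip(W1, d_hid)]
    return new_W1, new_W2, error

# Made-up starting weights and a single training pair.
W1 = [[0.15, 0.20], [0.25, 0.30]]
W2 = [[0.40, 0.45], [0.50, 0.55]]
inputs, targets = [0.05, 0.10], [0.01, 0.99]

for step in range(3):
    W1, W2, error = train_step(inputs, targets, W1, W2)
    print(f"step {step}: error = {error:.4f}")
```

Keep repeating the loop and the error creeps downwards, which is exactly what your pen-and-paper runs of the exercises below should show.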

We'll start with the most simple model:

Now try, a neural network that is a little difficult. Don't let the bias scare you:

Are you getting the hang of this now? This one might be slightly more challenging (but not for you!):

Let's complete that picture you're building in your head:

To become a Maths Master (when it comes to modelling neural networks), your final challenge is:

You'll recognise this next neural network as the one we worked through together. I believe in you, that you can work through this yourself. No Cheating!

How about one more just for fun? Don't tell me you don't find Maths fun now!

Okay so you can calculate a 3 layered neural network. Time to try 4 layers:

As the old saying goes, once you can do 4 layers, you can do an arbitrary number of layers. Doesn't quite roll off the tongue, does it?

And you're done! Congratulations! Have a celebratory cookie! (insert cookie image)

If you have any questions when attempting the above models, or any questions in general, advice or improvements on the model, feel free to get in touch! You can find my contact details on my profile.

*Published: 2017-05-31*
