Recurrent Neural Network (RNN)
Say you live in an apartment and you are lucky enough to have a roommate who makes dinner every night. He makes one of three things: pizza, sushi, or waffles. You want to predict what's for dinner tonight, so you train a neural network using input parameters such as the day of the week, the month of the year, and whether your roommate has late meetings. But the model doesn't work very well, and you wonder why, since you built a great model. The reason is simple: your dinner doesn't depend on those inputs. Your roommate follows a pattern. He cooks pizza, sushi, and waffles on consecutive nights. That means if you had pizza last night, you are getting sushi tonight; if you had sushi last night, tonight is waffle night; and if you ate waffles last night, it will be pizza tonight. It's a cycle. The only input that matters for predicting tonight's dinner is what you had last night.
What if you were not at home last night? You can still predict tonight's dinner by remembering what you had two nights ago. So in this type of model, your previous prediction matters. We use new information and the previous prediction together to determine what comes next. This type of neural network is called a recurrent neural network (RNN). We will use one-hot encoding to represent the results.
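To make the idea concrete, here is a minimal sketch of the dinner cycle using one-hot vectors and a fixed transition matrix; the names and the matrix itself are just an illustration of the cycle described above, not a trained network.

```python
import numpy as np

# One-hot encoding for the three dinners (illustrative ordering).
dinners = ["pizza", "sushi", "waffles"]
one_hot = {name: np.eye(3)[i] for i, name in enumerate(dinners)}

# The cycle pizza -> sushi -> waffles -> pizza as a transition matrix:
# column j holds the one-hot vector of the dinner that follows dinner j.
transition = np.array([
    [0, 0, 1],   # pizza follows waffles
    [1, 0, 0],   # sushi follows pizza
    [0, 1, 0],   # waffles follow sushi
])

def predict_next(previous):
    """Feed the previous prediction back in to get tonight's dinner."""
    return transition @ previous

last_night = one_hot["pizza"]
tonight = predict_next(last_night)
print(dinners[int(np.argmax(tonight))])   # sushi

# If you missed last night, reuse the prediction from two nights ago:
two_nights_ago = one_hot["sushi"]
print(dinners[int(np.argmax(predict_next(predict_next(two_nights_ago))))])   # pizza
```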
Writing a children’s book
Let's write a very simple children's book with this idea. In the picture below, we can see our book. It has a really small vocabulary: a few names, the word 'saw', and a period. Five elements in total.

Our task is to arrange these words in the right order to make a nice children's book. Since it's a very small book, we can find a pattern just by looking at it. If we see a name, the next thing should be either 'saw' or a period. In other words, whenever a name was just predicted, we want it to vote for 'saw' or a period. Likewise, if we see 'saw' or a period, it should vote for a name.
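As a rough sketch, those voting rules can be written down as a tiny lookup table; the dictionary below is purely illustrative and assumes the five-element vocabulary of the book (the names Doug, Jane, and Spot, the word 'saw', and a period).

```python
# Which words vote for which next words in our five-element book.
# Purely illustrative: 1 means "votes for", anything missing means "votes against".
vocab = ["Doug", "Jane", "Spot", "saw", "."]
votes = {
    "Doug": {"saw": 1, ".": 1},   # a name votes for 'saw' or a period
    "Jane": {"saw": 1, ".": 1},
    "Spot": {"saw": 1, ".": 1},
    "saw":  {"Doug": 1, "Jane": 1, "Spot": 1},   # 'saw' votes for a name
    ".":    {"Doug": 1, "Jane": 1, "Spot": 1},   # a period votes for a name
}

def candidates(previous_word):
    """Return the words the previous word votes for."""
    return [w for w in vocab if votes[previous_word].get(w, 0) == 1]

print(candidates("Jane"))   # ['saw', '.']
print(candidates("saw"))    # ['Doug', 'Jane', 'Spot']
```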
As we discussed before, a recurrent neural network (RNN) uses the previous prediction and new information together to make a new prediction. Look at the picture below: we pass the new information and a copy of the previous prediction through a neural network, and then through a new sign that represents a squashing function. This function helps the network behave well.
You take all the votes coming out and you pass them through this squashing function. The picture below shows the tanh squashing function. If something receives a total vote of 0.5, you draw a vertical line up from 0.5 to the curve and then a horizontal line across, which gives you the squashed version. For small numbers, the squashed version is close to the original. As the number grows bigger and bigger, the squashed version gets closer and closer to 1. For negative numbers, it gets closer and closer to -1. No matter what the original value is, the squashed version stays between -1 and 1.
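Here is a quick sketch of tanh squashing a few example vote totals (the numbers are arbitrary):

```python
import math

# tanh keeps every output strictly between -1 and 1,
# and barely changes values that are already small.
for vote in [-10.0, -2.0, -0.5, 0.0, 0.5, 2.0, 10.0]:
    print(f"vote {vote:6.1f} -> squashed {math.tanh(vote):6.3f}")
```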

In a recurrent network, the same values get processed again and again. It is possible that something gets voted for twice; in that case it becomes twice as big on each pass and soon blows up to be astronomical. By ensuring that every value stays between -1 and 1, you can pass it around the loop as many times as you want without it exploding.
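A minimal illustration of why that bound matters, assuming a value that gets doubled on every pass around the loop:

```python
import math

raw, squashed = 0.5, 0.5
for step in range(20):
    raw = 2.0 * raw                        # voted for twice, no squashing
    squashed = math.tanh(2.0 * squashed)   # same vote, squashed on every pass

print(raw)        # 524288.0 -- blows up after 20 passes
print(squashed)   # stays between -1 and 1
```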
This neural network can still make mistakes very easily. It can come up with a sentence like "Doug saw Doug." or "Doug saw Jane saw Doug saw Spot.", because it only looks at the previous prediction and nothing before that. To correct this problem, we can add a memory layer to it, like this.

Let's see how it works. For that we need to understand a few more new signs: the squashing sign with a flat bottom, the circled x, and the circled plus. First, the circled plus: it is element-by-element addition of two vectors, as the picture below shows.

In the same way, the circled x is element-by-element multiplication of two vectors. This element-by-element multiplication lets you do something cool. Think of a signal as a bunch of pipes, each with a certain amount of water trying to flow through it. Each of those pipes has a faucet. You can open it all the way, close it all the way, or set it somewhere in the middle to let the signal through or block it.

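As a sketch, element-by-element addition and multiplication look like this in code, with the multiplication acting as a row of faucets; the vectors and gate values are made up for illustration.

```python
import numpy as np

signal = np.array([0.8, -0.5, 0.3])   # water trying to flow down each pipe
other  = np.array([0.1,  0.4, -0.2])

# Circled plus: element-by-element addition of two vectors.
print(signal + other)   # roughly [ 0.9 -0.1  0.1]

# Circled x: element-by-element multiplication, used as gating.
# A gate value of 1 opens the faucet, 0 closes it, 0.5 is halfway.
gate = np.array([1.0, 0.0, 0.5])
print(signal * gate)    # roughly [ 0.8  0.   0.15]
```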
Gating lets us control what passes through and what stays in. In order to use gating, you need a value that is always between 0 and 1. Here we need to introduce another squashing function, the one with the flat bottom. Its values fall between 0 and 1 instead of -1 and 1. It's also called the logistic function.

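A short sketch of the logistic function, which squashes any input into the range 0 to 1 and so can serve as a gate value:

```python
import math

def logistic(x):
    """The flat-bottomed squashing function: output is always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

for vote in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(f"vote {vote:6.1f} -> gate {logistic(vote):.3f}")
```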
Now that we know all the signs, let's look at what happens in this picture. Along the bottom line of the picture, the new information passes through the neural network and a squashing function to produce the possibilities. A copy of those possibilities loops back around to a gate, where part of it goes into the memory and part of it is forgotten; what stays in memory is added back into the predictions. You can see in the picture that a completely separate neural network works out which parts to forget: its output passes through the circled x, gating the memory, on its way to the element-wise addition. After that addition there is another squashing function, because the sum may have become bigger than 1 or smaller than -1. We may not want to release all of our memory; we may want to keep some of it inside, so we need another layer to do that selection. As the picture shows, there is yet another separate neural network just for this selection, and the new information and previous predictions vote on it. Its output then passes through a gate to determine what stays internal and what is released as the prediction.

Finally, there is an ignoring layer, with its own neural network, logistic squashing function, and gate, so that irrelevant information doesn't cloud the possibilities moving forward.
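Putting the pieces together, here is a minimal sketch of one pass around an LSTM-style cell in the spirit of the description above; the weight names, the exact wiring of the gates, and the random initialization are simplifying assumptions, not a faithful copy of the diagram.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(new_info, prev_prediction, memory, W):
    """One pass around the loop: each gate is a tiny network fed by the new
    information and the previous prediction (concatenated here for brevity)."""
    x = np.concatenate([new_info, prev_prediction])

    candidate = np.tanh(W["candidate"] @ x)   # possibilities from the bottom line
    forget    = sigmoid(W["forget"] @ x)      # which memories to keep (0..1)
    select    = sigmoid(W["select"] @ x)      # which parts to release as prediction
    ignore    = sigmoid(W["ignore"] @ x)      # which new possibilities to let in

    memory = forget * memory + ignore * candidate   # gated addition into memory
    prediction = select * np.tanh(memory)           # squash again, then gate the output
    return prediction, memory

# Toy usage with random weights for a 5-word vocabulary (purely illustrative).
rng = np.random.default_rng(0)
n = 5
W = {name: rng.normal(scale=0.5, size=(n, 2 * n))
     for name in ["candidate", "forget", "select", "ignore"]}
prediction, memory = np.zeros(n), np.zeros(n)
doug = np.eye(n)[0]   # one-hot vector for 'Doug'
prediction, memory = lstm_step(doug, prediction, memory, W)
print(prediction)
```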
Now let's go back to our children's book. At some point it contains "Jane saw Spot." Then, in the next line, our network sees 'Doug'. We pass this new information to the network, and two kinds of votes come out: a positive vote for the word 'saw' and a negative vote for the word 'Doug', because having just seen 'Doug', the network knows not to put 'Doug' again in the near future. So the possibility is that the next word is 'saw', not 'Doug'. For simplicity, let's assume there is no memory yet. The selection layer predicts that a 'saw' or a period can follow a name, so it blocks any other possibilities for now. The word 'saw' gets voted for and predicted, so 'saw' is our most recent prediction. The word 'saw' then passes through all the layers and networks, and we now predict that 'Doug', 'Jane', or 'Spot' might come next.
Now we see memory playing its role: the forgetting network knows to forget 'Doug', because it just occurred.
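A tiny illustration of that forgetting step, with hand-picked gate values rather than learned ones:

```python
import numpy as np

vocab = ["Doug", "Jane", "Spot", "saw", "."]

# Votes for the next word after 'saw': any of the three names could follow.
possibilities = np.array([0.9, 0.9, 0.9, 0.0, 0.0])

# The forgetting network just saw 'Doug', so its gate closes the 'Doug' pipe.
forget_gate = np.array([0.0, 1.0, 1.0, 1.0, 1.0])

kept = forget_gate * possibilities
print({w: round(float(v), 2) for w, v in zip(vocab, kept)})
# {'Doug': 0.0, 'Jane': 0.9, 'Spot': 0.9, 'saw': 0.0, '.': 0.0}
```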

LSTMs have a lot of applications. If we have text in one language and want to translate it into another, an LSTM works very well; even though translation is not a word-to-word process but a phrase-to-phrase or sentence-to-sentence process, an LSTM can handle it because it learns the grammar and patterns of a new language quite well. LSTMs can also transform speech to text. They are also very useful in robotics, where a model can look at previous actions and, based on that information, predict which action to take next.