Understanding LSTM Networks

LSTM networks are an extension of recurrent neural networks (RNNs), introduced mainly to handle situations where RNNs fail. An RNN is a network that processes the current input while taking into account what it has seen before (feedback), holding that in its memory for a short time (short-term memory). Among its many applications, the most popular are in speech processing, non-Markovian control, and music composition. Nevertheless, RNNs have drawbacks. Most importantly, they fail to store information for long periods of time. Sometimes, predicting the current output requires information that was seen a long time earlier.

Recurrent Neural Networks

Humans don't start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don't throw everything away and start thinking from scratch again. Your thoughts have persistence.

Traditional neural networks can't do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It's unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.

Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.

In the diagram above, a chunk of neural network, A, looks at some input \(x_t\) and outputs a value \(h_t\). A loop allows information to be passed from one step of the network to the next.

These loops make recurrent neural networks seem kind of mysterious. However, if you think a bit more, it turns out that they aren't all that different from a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:

This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They're the natural neural network architecture to use for such data.
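As a minimal sketch of this unrolling, the loop can be written as one step function applied once per time step, with the hidden state carried forward. The names `rnn_step`, `run_rnn`, `W` and `b`, the random weights, and the choice of a single tanh layer for the network A are illustrative assumptions rather than anything fixed by the text.

```python
import numpy as np

def rnn_step(h_prev, x_t, W, b):
    """One copy of the network A: combine the previous hidden state with the current input."""
    return np.tanh(W @ np.concatenate([h_prev, x_t]) + b)

def run_rnn(xs, W, b, hidden_size):
    """Unroll the loop: apply the same step (same W and b) to each input in order."""
    h = np.zeros(hidden_size)          # initial hidden state (assumed to be zeros)
    hs = []
    for x_t in xs:                     # one "copy" of the network per time step
        h = rnn_step(h, x_t, W, b)     # the message passed to the successor
        hs.append(h)
    return np.stack(hs)

# Hypothetical sizes and random weights, only to make the sketch runnable.
rng = np.random.default_rng(0)
hidden_size, input_size, steps = 4, 3, 5
W = rng.normal(scale=0.5, size=(hidden_size, hidden_size + input_size))
b = np.zeros(hidden_size)
xs = rng.normal(size=(steps, input_size))

hs = run_rnn(xs, W, b, hidden_size)
print(hs.shape)  # (5, 4): one hidden state per time step
```

Because every step reuses the same `W` and `b`, the unrolled chain is the same network as the looped one; only the hidden state changes from step to step.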

And they certainly are used! In recent years, RNNs have been applied with incredible success to a wide variety of problems: speech recognition, language modeling, translation, image captioning ... the list goes on. I'll leave discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy's excellent blog post, The Unreasonable Effectiveness of Recurrent Neural Networks. But they really are pretty amazing.

Essential to these successes is the use of "LSTMs," a very special kind of recurrent neural network which works, for many tasks, much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It's these LSTMs that this essay will explore.

The Problem of Long-Term Dependencies

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. If RNNs could do this, they'd be extremely useful. But can they? It depends.

Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in "the clouds are in the sky," we don't need any further context - it's pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place where it's needed is small, RNNs can learn to use the past information.

But there are also cases where we need more context. Consider trying to predict the last word in the text "I grew up in France ... I speak fluent French." Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It's entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

LSTM Networks

Long Short Term Memory networks - usually just called "LSTMs" - are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter and Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer.

LSTMs also have this chain-like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

Don't worry about the details of what's going on. We will go through the LSTM diagram step by step. For now, let's try to be comfortable with the notation we're using.

In the diagram above, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.

The Core Idea Behind LSTMs

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It's very easy for information to just flow along it unchanged.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means "let nothing through," while a value of one means "let everything through!"
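A small sketch of the gating idea, assuming hand-picked gate inputs rather than a learned layer (in an actual LSTM these values would come from a sigmoid layer with trained weights). The names `gate`, `signal` and `gate_logits` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def gate(signal, gate_logits):
    """Pointwise multiplication: scale each component of `signal` by a value in (0, 1)."""
    return sigmoid(gate_logits) * signal

signal = np.array([2.0, -1.0, 0.5])
# Very negative logit -> gate near 0 (blocked); very positive -> near 1 (passed); 0 -> exactly 0.5.
gate_logits = np.array([-20.0, 20.0, 0.0])

gated = gate(signal, gate_logits)
print(gated)  # first entry is almost 0, second is almost -1, third is half of 0.5
```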

An LSTM has three of these gates, to protect and control the cell state.

Step-by-Step LSTM Walk Through

The first step in our LSTM is to decide what information we're going to throw away from the cell state. This decision is made by a sigmoid layer called the "forget gate layer." It looks at \(h_{t-1}\) and \(x_t\), and outputs a number between \(0\) and \(1\) for each number in the cell state \(C_{t-1}\). A \(1\) represents "completely keep this" while a \(0\) represents "completely get rid of this."

Let us go back to our example of a language model that is trying to predict the next word based on all the previous ones. In such a problem, the cell state may include the gender of the current subject, so that the correct pronoun can be used. When we see a new subject, we want to forget the gender of the old subject.
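The forget gate is commonly written as \(f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\), where \([h_{t-1}, x_t]\) is the concatenation of the two vectors. The sketch below follows that convention; the weight names `W_f` and `b_f`, the sizes, and the random values are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    """f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f): one value in (0, 1) per cell-state entry."""
    return sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)

# Hypothetical sizes and random (untrained) weights, only to make the sketch runnable.
rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3
W_f = rng.normal(size=(hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)
h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)

f_t = forget_gate(h_prev, x_t, W_f, b_f)
print(f_t)  # values near 1 keep the matching entry of C_{t-1}; values near 0 drop it
```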

The next step is to decide what new information we're going to store in the cell state. This has two parts. First, a sigmoid layer called the "input gate layer" decides which values we'll update. Next, a tanh layer creates a vector of new candidate values, \(\tilde{C}_t\), that could be added to the state. In the next step, we'll combine these two to create an update to the state.

In the example of our language model, we'd want to add the gender of the new subject to the cell state, to replace the old one we're forgetting.
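The two parts of this step can be sketched as two separate layers reading the same concatenated input: \(i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\) and \(\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\). The parameter names `W_i`, `b_i`, `W_C`, `b_C` follow common convention and are assumptions here, as are the sizes and random weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate_and_candidate(h_prev, x_t, W_i, b_i, W_C, b_C):
    """Return the input gate i_t and the candidate values C_tilde for one time step."""
    hx = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ hx + b_i)        # which entries to update, each in (0, 1)
    C_tilde = np.tanh(W_C @ hx + b_C)    # proposed new values, each in (-1, 1)
    return i_t, C_tilde

# Hypothetical sizes and random (untrained) weights, only to make the sketch runnable.
rng = np.random.default_rng(1)
hidden_size, input_size = 4, 3
W_i = rng.normal(size=(hidden_size, hidden_size + input_size))
W_C = rng.normal(size=(hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
b_C = np.zeros(hidden_size)
h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)

i_t, C_tilde = input_gate_and_candidate(h_prev, x_t, W_i, b_i, W_C, b_C)
print(i_t)
print(C_tilde)
```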

It's now time to update the old cell state, \(C_{t-1}\), into the new cell state \(C_t\). The previous steps already decided what to do, we just need to actually do it.

We multiply the old state by \(f_t\), forgetting the things we decided to forget earlier. Then we add \(i_t * \tilde{C}_t\). These are the new candidate values, scaled by how much we decided to update each state value.

In the case of the language model, this is where we actually drop information about the gender of the old subject and add new information, as we decided in the previous steps.
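The update itself is \(C_t = f_t * C_{t-1} + i_t * \tilde{C}_t\), with `*` meaning pointwise multiplication. Here is a small worked example with hand-picked gate values (hypothetical numbers chosen to show the three typical cases, not the output of a trained model).

```python
import numpy as np

def update_cell_state(C_prev, f_t, i_t, C_tilde):
    """C_t = f_t * C_{t-1} + i_t * C_tilde, all operations pointwise."""
    return f_t * C_prev + i_t * C_tilde

C_prev  = np.array([0.8, -0.5, 0.3])   # old cell state
f_t     = np.array([1.0,  0.0, 0.5])   # keep fully / forget fully / keep half
i_t     = np.array([0.0,  1.0, 0.5])   # write nothing / write fully / write half
C_tilde = np.array([0.9,  0.7, -0.4])  # candidate values

C_t = update_cell_state(C_prev, f_t, i_t, C_tilde)
# Entry 0 keeps its old value 0.8, entry 1 is replaced by the candidate 0.7,
# and entry 2 becomes a blend: 0.5 * 0.3 + 0.5 * (-0.4), roughly -0.05.
print(C_t)
```

The second entry mirrors the language model example: the old value is dropped entirely and the new one is written in its place.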

Finally, we need to decide what we're going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we're going to output. Then, we put the cell state through \(\tanh\) (to push the values to be between \(-1\) and \(1\)) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that's what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that's what follows next.
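The output step can be sketched as \(o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\) followed by \(h_t = o_t * \tanh(C_t)\). As with the other sketches, the names `W_o` and `b_o`, the sizes, and the random values (including the stand-in for the updated cell state `C_t`) are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_step(h_prev, x_t, C_t, W_o, b_o):
    """Return the output gate o_t and the new hidden state h_t = o_t * tanh(C_t)."""
    o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)  # which parts to expose
    h_t = o_t * np.tanh(C_t)                                  # filtered view of the cell state
    return o_t, h_t

# Hypothetical sizes and random (untrained) values, only to make the sketch runnable.
rng = np.random.default_rng(2)
hidden_size, input_size = 4, 3
W_o = rng.normal(size=(hidden_size, hidden_size + input_size))
b_o = np.zeros(hidden_size)
h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)
C_t = rng.normal(size=hidden_size)   # stands in for the already-updated cell state

o_t, h_t = output_step(h_prev, x_t, C_t, W_o, b_o)
print(h_t)
```

Note that `h_t` can never be larger in magnitude than `tanh(C_t)`: the gate only attenuates what the cell state holds, it does not add anything.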

Variants on Long Short Term Memory

What I've described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it's worth mentioning some of them.

One popular LSTM variant, introduced by Gers and Schmidhuber (2000), adds "peephole connections." This means that we let the gate layers look at the cell state.

The above diagram adds peepholes to all gates, but many papers will give some peepholes and not others.
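One way to sketch a peephole is to feed the previous cell state into the gate alongside \(h_{t-1}\) and \(x_t\), shown here for the forget gate only. This concatenation form, the name `peephole_forget_gate`, and the weights are assumptions made for illustration; some formulations use elementwise peephole weights instead.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_forget_gate(C_prev, h_prev, x_t, W_f, b_f):
    """Forget gate that also looks at the previous cell state C_{t-1}."""
    return sigmoid(W_f @ np.concatenate([C_prev, h_prev, x_t]) + b_f)

# Hypothetical sizes and random (untrained) weights, only to make the sketch runnable.
rng = np.random.default_rng(3)
hidden_size, input_size = 4, 3
# The weight matrix is wider than in the plain gate: it has extra columns for C_{t-1}.
W_f = rng.normal(size=(hidden_size, 2 * hidden_size + input_size))
b_f = np.zeros(hidden_size)
C_prev = rng.normal(size=hidden_size)
h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)

f_t = peephole_forget_gate(C_prev, h_prev, x_t, W_f, b_f)
print(f_t)
```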

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what new information to add, we make those decisions together. We only forget when we're going to input something in its place. We only input new values to the state when we forget something older.
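With coupled gates the input gate is replaced by \(1 - f_t\), giving \(C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t\). A small worked example with hand-picked, hypothetical values:

```python
import numpy as np

def coupled_update(C_prev, f_t, C_tilde):
    """C_t = f_t * C_{t-1} + (1 - f_t) * C_tilde: whatever is forgotten is replaced."""
    return f_t * C_prev + (1.0 - f_t) * C_tilde

C_prev  = np.array([0.8, -0.5, 0.3])   # old cell state
f_t     = np.array([1.0,  0.0, 0.5])   # keep fully / forget fully / keep half
C_tilde = np.array([0.9,  0.7, -0.4])  # candidate values

C_t = coupled_update(C_prev, f_t, C_tilde)
# Entry 0 is kept (nothing written), entry 1 is fully overwritten by 0.7,
# and entry 2 is an even mix of old and new, roughly -0.05.
print(C_t)
```

Each entry of the new state is a weighted average of the old value and the candidate, so the single gate `f_t` controls both forgetting and writing.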

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single "update gate." It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.
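A rough sketch of one GRU step, assuming the commonly used formulation with an update gate \(z_t\) and a reset gate \(r_t\) (the reset gate is one of the "other changes" and is not described in the text above). Biases are omitted for brevity, and the names `W_z`, `W_r`, `W_h` and the random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x_t, W_z, W_r, W_h):
    """One GRU step: a single state h plays the role of both cell state and hidden state."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx)                                        # update gate
    r_t = sigmoid(W_r @ hx)                                        # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))   # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                    # blend old and new

# Hypothetical sizes and random (untrained) weights, only to make the sketch runnable.
rng = np.random.default_rng(4)
hidden_size, input_size = 4, 3
shape = (hidden_size, hidden_size + input_size)
W_z, W_r, W_h = (rng.normal(size=shape) for _ in range(3))
h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)

h_t = gru_step(h_prev, x_t, W_z, W_r, W_h)
print(h_t)
```

The last line of `gru_step` has the same shape as the coupled forget and input gates: one gate decides both how much of the old state to keep and how much of the candidate to write.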

These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they're all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.

Conclusion

Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really work a lot better for most tasks!

Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable.

LSTMs were a big step in what we can accomplish with RNNs. It's natural to wonder: is there another big step? A common opinion among researchers is: "Yes! There is a next step and it's attention!" The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs.

In fact, Xu, et al. (2015) do exactly this - it might be a fun starting point if you want to explore attention! There have been a number of really exciting results using attention, and it seems like a lot more are around the corner…

Attention isn't the only exciting thread in RNN research. For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in generative models - such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer and Osendorfer (2015) - also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!

Acknowledgments

I am grateful to many people for helping me understand LSTM networks better, commenting on the visualization, and providing feedback on this post.

I am extremely grateful to my colleagues at Google for their helpful feedback, especially Oriol Vinyals, Greg Corrado, Jon Shlens, Luke Vilnis, and Ilya Sutskever. I am also thankful to many other friends and colleagues who took time to help me, including Dario Amodei, and Jacob Steinhardt. I am especially grateful to Kyunghyun Cho for his extremely thoughtful correspondence about my diagrams.

Prior to this post, I practiced explaining LSTMs during two seminar series I taught on neural networks. Thanks to everyone who participated in those for their patience with me, and for their feedback.

In addition to the original authors, many people contributed to the modern LSTM. A non-comprehensive list is: Felix Gers, Fred Cummins, Santiago Fernandez, Justin Bayer, Daan Wierstra, Julian Togelius, Faustino Gomez, Matteo Gagliolo, and Alex Graves.

For more related articles and courses visit InsideAIML.
