Understanding LSTM Networks

The Long Short-Term Memory (LSTM) network is an extension of the recurrent neural network (RNN), designed primarily to handle situations where RNNs fail. An RNN processes the current input while taking the previous output into account (feedback), holding it in memory for a short time (short-term memory). Among its many applications, the most popular are in speech processing, non-Markovian control, and music composition. Nevertheless, RNNs have drawbacks. Chief among them, they fail to store information for long periods of time, yet estimating the current output often requires information that was seen long before.

Recurrent Neural Networks

Humans don't start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of the previous words. You don't throw everything away and start thinking from scratch again. Your thoughts have persistence.

Traditional neural networks can't do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a film. It's unclear how a traditional neural network could use its reasoning about earlier events in the film to inform later ones.

Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.

In the diagram above, a chunk of neural network, \(A\), looks at some input \(x_t\) and outputs a value \(h_t\). A loop allows information to be passed from one step of the network to the next.

These loops make recurrent neural networks seem somewhat mysterious. However, if you think a bit more, it turns out that they aren't all that different from a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:

This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They are the natural neural network architecture to use for such data.

And they certainly are used! In the last few years, there has been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning ... the list goes on. I'll leave discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy's excellent blog post, The Unreasonable Effectiveness of Recurrent Neural Networks. But they really are pretty amazing.

Essential to these successes is the use of "LSTMs", a very special kind of recurrent neural network that works, for many tasks, much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It is these LSTMs that this essay will explore.

The Problem of Long-Term Dependencies

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the current frame. If RNNs could do this, they would be extremely useful. But can they? It depends.

Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in "the clouds are in the sky", we don't need any further context - it's pretty obvious that the next word is going to be "sky". In such cases, where the gap between the relevant information and the place where it is needed is small, RNNs can learn to use the past information.

But there are also cases where we need more context. Consider trying to predict the last word in the text "I grew up in France ... I speak fluent French." Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It is entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

In theory, RNNs are absolutely capable of handling such "long-term dependencies". A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don't seem able to learn them. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.
Thankfully, LSTMs do not have this problem!
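To make the difficulty a little more concrete: one commonly cited reason (not spelled out in this essay) is that the gradient signal flowing back through many steps of a plain tanh RNN tends to shrink toward zero. The sketch below is only an illustration under assumptions of my own: the recurrent weight matrix `W_hh`, the sizes, and the random inputs are all hypothetical, and the weights are deliberately chosen small so the effect is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 4, 50          # illustrative sizes (assumptions)
W_hh = 0.5 * np.eye(hidden_size)    # hypothetical recurrent weights, small on purpose

h = np.zeros(hidden_size)
jacobian = np.eye(hidden_size)      # accumulates d h_t / d h_0 across steps
norms = []
for _ in range(steps):
    # Plain RNN update with a random input term standing in for x_t.
    h = np.tanh(W_hh @ h + rng.normal(size=hidden_size))
    # Jacobian of this step w.r.t. the previous hidden state: diag(1 - h^2) @ W_hh
    step_jacobian = (1.0 - h ** 2)[:, None] * W_hh
    jacobian = step_jacobian @ jacobian
    norms.append(np.linalg.norm(jacobian, 2))

print("sensitivity to h_0 after 1 step:  ", norms[0])
print("sensitivity to h_0 after 50 steps:", norms[-1])
```

With these particular weights, each step scales the sensitivity by at most 0.5, so after fifty steps almost nothing of the early state influences the result, which is one way to picture why distant context is hard to learn from.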

LSTM Networks

Long Short-Term Memory networks - usually just called "LSTMs" - are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter and Schmidhuber (1997), and were refined and popularized by many people in subsequent work. They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating neural network modules. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer.
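As a rough sketch of that repeating module, here is a standard RNN step written as a single tanh layer applied to the previous hidden state and the current input. The weight names `W` and `b`, the sizes, and the random inputs are assumptions made for illustration; they are not taken from the diagrams.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, b):
    """One step of a standard RNN: a single tanh layer over [h_prev, x_t]."""
    concat = np.concatenate([h_prev, x_t])
    return np.tanh(W @ concat + b)

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3      # illustrative sizes (assumptions)
W = rng.normal(scale=0.5, size=(hidden_size, hidden_size + input_size))
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)
xs = rng.normal(size=(5, input_size))   # a toy sequence of 5 inputs
for x_t in xs:
    # The same module (same W and b) is reused at every position in the chain.
    h = rnn_step(x_t, h, W, b)

print(h)
```

The loop is the "chain": each iteration is one copy of the module, passing its hidden state on to the next.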

LSTMs also have this chain-like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

Don't worry about the details of what's going on. We'll walk through the LSTM diagram step by step. For now, let's just try to get comfortable with the notation we'll be using.

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.

The Core Idea Behind LSTMs

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It's very easy for information to just flow along it unchanged.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means "let nothing through", while a value of one means "let everything through!"
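A tiny numeric sketch of a gate may help here. The values in `signal` and `gate_logits` are made up for illustration; in a real LSTM the logits would come from a learned layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

signal = np.array([2.0, -1.0, 0.5])          # information that wants to pass
gate_logits = np.array([-20.0, 0.0, 20.0])   # hypothetical pre-activations

gate = sigmoid(gate_logits)   # roughly [0, 0.5, 1]
gated = gate * signal         # pointwise multiplication

print(gate)
print(gated)
```

The first component is almost entirely blocked, the second is halved, and the third passes through nearly untouched.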

An LSTM has three of these gates, to protect and control the cell state.

Step-by-Step LSTM Walk Through

The first step in our LSTM is to decide what information we're going to throw away from the cell state. This decision is made by a sigmoid layer called the "forget gate layer". It looks at \(h_{t-1}\) and \(x_t\), and outputs a number between \(0\) and \(1\) for each number in the cell state \(C_{t-1}\). A \(1\) represents "completely keep this", while a \(0\) represents "completely get rid of this."

Let's go back to our example of a language model that is trying to predict the next word based on all the previous ones. In such a problem, the cell state may include the gender of the current subject, so that the correct pronoun can be used. When we see a new subject, we want to forget the gender of the old subject.
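The forget gate step could be sketched roughly as follows. The parameter names `W_f` and `b_f`, the sizes, and the random values are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    # f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f): one value per cell-state entry
    return sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3      # illustrative sizes (assumptions)
W_f = rng.normal(size=(hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)
h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)
C_prev = rng.normal(size=hidden_size)

f_t = forget_gate(h_prev, x_t, W_f, b_f)
kept = f_t * C_prev   # what survives of the old cell state

print(f_t)
print(kept)
```

Entries of `f_t` near 1 keep the corresponding piece of the old state (say, the old subject's gender), and entries near 0 erase it.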

The next step is to decide what new information we're going to store in the cell state. This has two parts. First, a sigmoid layer called the "input gate layer" decides which values we'll update. Next, a tanh layer creates a vector of new candidate values, \(\tilde{C}_t\), that could be added to the state. In the next step, we'll combine these two to create an update to the state.

In the example of our language model, we'd want to add the gender of the new subject to the cell state, to replace the old one we're forgetting.
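Here is a minimal sketch of those two parts, the input gate and the candidate values. The parameter names (`W_i`, `b_i`, `W_C`, `b_C`), sizes, and random values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate_and_candidate(h_prev, x_t, W_i, b_i, W_C, b_C):
    concat = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ concat + b_i)        # which entries to update, in [0, 1]
    C_tilde = np.tanh(W_C @ concat + b_C)    # candidate values, in [-1, 1]
    return i_t, C_tilde

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3      # illustrative sizes (assumptions)
shape = (hidden_size, hidden_size + input_size)
W_i, W_C = rng.normal(size=shape), rng.normal(size=shape)
b_i, b_C = np.zeros(hidden_size), np.zeros(hidden_size)
h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)

i_t, C_tilde = input_gate_and_candidate(h_prev, x_t, W_i, b_i, W_C, b_C)
print(i_t)
print(C_tilde)
```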

It's now time to update the old cell state, \(C_{t-1}\), into the new cell state \(C_t\). The previous steps already decided what to do; we just need to actually do it.

We multiply the old state by \(f_t\), forgetting the things we decided to forget earlier. Then we add \(i_t * \tilde{C}_t\). These are the new candidate values, scaled by how much we decided to update each state value.

In the case of the language model, this is where we actually drop information about the gender of the old subject and add new information, as we decided in the previous steps.
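A small worked example of the update \(C_t = f_t * C_{t-1} + i_t * \tilde{C}_t\), with hand-picked numbers (all of them made up for illustration):

```python
import numpy as np

C_prev  = np.array([2.0, -3.0, 4.0])   # old cell state
f_t     = np.array([1.0,  0.0, 0.5])   # forget gate: keep, drop, keep half
i_t     = np.array([0.0,  1.0, 0.5])   # input gate: ignore, write, write half
C_tilde = np.array([0.9,  0.7, -1.0])  # candidate values

# Forget part of the old state, then add the scaled candidates.
C_t = f_t * C_prev + i_t * C_tilde

print(C_t)
```

The first entry is carried over unchanged, the second is fully replaced by its candidate, and the third is a blend of old and new.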

Finally, we need to decide what we're going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we're going to output. Then, we put the cell state through \(\tanh\) (to push the values to be between \(-1\) and \(1\)) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that's what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that's what follows.
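Putting the whole walk-through together, one LSTM step might be sketched like this. It is only an illustration: the `params` dictionary, its key names, the sizes, and the random initialization are my own assumptions, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    concat = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])      # input gate
    C_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])  # candidate values
    C_t = f_t * C_prev + i_t * C_tilde                         # cell state update
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])      # output gate
    h_t = o_t * np.tanh(C_t)                                   # filtered output
    return h_t, C_t

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3      # illustrative sizes (assumptions)
params = {}
for name in ("f", "i", "C", "o"):
    params["W_" + name] = rng.normal(
        scale=0.5, size=(hidden_size, hidden_size + input_size))
    params["b_" + name] = np.zeros(hidden_size)

h = np.zeros(hidden_size)
C = np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):   # a toy sequence of 6 inputs
    h, C = lstm_step(x_t, h, C, params)

print(h)
print(C)
```

Note that the hidden state `h` is always bounded by the tanh and the output gate, while the cell state `C` is free to accumulate values over time.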

Variants on Long Short Term Memory

What I've described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it's worth mentioning some of them.

One popular LSTM variant, introduced by Gers and Schmidhuber (2000), adds "peephole connections". This means that we let the gate layers look at the cell state.

The above diagram adds peepholes to all gates, but many papers will give some peepholes and not others.
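As a hedged sketch, a peephole can be modeled by simply including the cell state among the inputs to a gate. The concatenation form, the parameter names, and the sizes below are assumptions for illustration; published formulations differ in detail (for example, in how the peephole weights are structured).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_gates(h_prev, x_t, C_prev, W_f, b_f, W_i, b_i):
    # The gates "peek" at the cell state by taking it as an extra input.
    concat = np.concatenate([C_prev, h_prev, x_t])
    f_t = sigmoid(W_f @ concat + b_f)
    i_t = sigmoid(W_i @ concat + b_i)
    return f_t, i_t

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3      # illustrative sizes (assumptions)
shape = (hidden_size, 2 * hidden_size + input_size)  # wider: includes C_prev
W_f, W_i = rng.normal(size=shape), rng.normal(size=shape)
b_f, b_i = np.zeros(hidden_size), np.zeros(hidden_size)

f_t, i_t = peephole_gates(
    rng.normal(size=hidden_size), rng.normal(size=input_size),
    rng.normal(size=hidden_size), W_f, b_f, W_i, b_i)
print(f_t)
print(i_t)
```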

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what new information to add, we make those decisions together. We only forget when we're going to input something in its place. We only input new values to the state when we forget something older.
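With coupled gates, the input gate is replaced by \(1 - f_t\), so the update becomes \(C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t\). A small numeric sketch, with made-up values:

```python
import numpy as np

C_prev  = np.array([2.0, -3.0, 4.0])    # old cell state
f_t     = np.array([1.0,  0.0, 0.25])   # forget gate
C_tilde = np.array([0.9,  0.7, -1.0])   # candidate values

# Whatever fraction we forget is exactly the fraction we write.
C_t = f_t * C_prev + (1.0 - f_t) * C_tilde

print(C_t)
```

Where `f_t` is 1 nothing new is written, where it is 0 the old value is fully replaced, and in between the two are interpolated.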

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single "update gate". It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.
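A rough sketch of one GRU step follows. The reset gate `r_t` is one of the "other changes"; biases are omitted for brevity, and the parameter names, sizes, and random values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ concat)   # update gate: merged forget/input decision
    r_t = sigmoid(W_r @ concat)   # reset gate: how much past state to expose
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))
    # A single state vector plays the role of both cell state and hidden state.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3      # illustrative sizes (assumptions)
shape = (hidden_size, hidden_size + input_size)
W_z, W_r, W_h = (rng.normal(scale=0.5, size=shape) for _ in range(3))

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):   # a toy sequence of 6 inputs
    h = gru_step(x_t, h, W_z, W_r, W_h)

print(h)
```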

These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they're all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.

Conclusion

Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really do work a lot better for most tasks!

Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable.

LSTMs were a big step in what we can accomplish with RNNs. It's natural to wonder: is there another big step? A common opinion among researchers is: "Yes! There is a next step and it's attention!" The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this - it might be a fun starting point if you want to explore attention! There have been a number of really exciting results using attention, and it seems like a lot more are around the corner…
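To give a flavor of "picking information to look at", here is a generic dot-product soft attention sketch. It is not the specific mechanism of Xu, et al.; the `regions` and `state` arrays are hypothetical stand-ins for image-region features and an RNN state, and the scoring function is an assumption chosen for simplicity.

```python
import numpy as np

def soft_attention(query, keys, values):
    scores = keys @ query                     # one relevance score per item
    weights = np.exp(scores - scores.max())   # softmax, shifted for stability
    weights /= weights.sum()
    context = weights @ values                # weighted average of the items
    return context, weights

rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 8))   # 5 hypothetical image regions, 8 features each
state = rng.normal(size=8)          # hypothetical RNN state at the current step

context, weights = soft_attention(state, regions, regions)
print(weights)   # how much each region is "looked at"
print(context)
```

At every step, the RNN state would produce a fresh set of weights, so different words can attend to different parts of the image.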

Attention isn't the only exciting thread in RNN research. For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in generative models - such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer and Osendorfer (2015) - also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!

Acknowledgments

I’m grateful to a number of people for helping me better understand LSTMs, commenting on the visualizations, and providing feedback on this post.
I’m very grateful to my colleagues at Google for their helpful feedback, especially Oriol Vinyals, Greg Corrado, Jon Shlens, Luke Vilnis, and Ilya Sutskever. I’m also thankful to many other friends and colleagues for taking the time to help me, including Dario Amodei, and Jacob Steinhardt. I’m especially thankful to Kyunghyun Cho for extremely thoughtful correspondence about my diagrams.
Before this post, I practiced explaining LSTMs during two seminar series I taught on neural networks. Thanks to everyone who participated in those for their patience with me, and for their feedback.
In addition to the original authors, a lot of people contributed to the modern LSTM. A non-comprehensive list is: Felix Gers, Fred Cummins, Santiago Fernandez, Justin Bayer, Daan Wierstra, Julian Togelius, Faustino Gomez, Matteo Gagliolo, and Alex Graves.