1. Intuition behind the RNN-based Sequence-to-Sequence Model:¶
- Both the Encoder and the Decoder are RNNs
- At every time step, the Encoder RNN takes a word vector (x_i) from the input sequence and the hidden state (H_i) from the previous time step
- The hidden state is updated at each time step
- The hidden state from the last unit is known as the context vector; it contains information about the entire input sequence
- This context vector is then passed to the decoder, which uses it to generate the target sequence (e.g., an English phrase)
- If we use the Attention mechanism, then a weighted sum of the hidden states is passed as the context vector to the decoder, as in the sketch below
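As a rough illustration, here is a minimal NumPy sketch of this encoder loop. The weight matrices `W_xh`, `W_hh`, the bias `b_h`, and the attention weights are hypothetical placeholders, not from any particular library:

```python
import numpy as np

def rnn_encoder(x_seq, W_xh, W_hh, b_h):
    """Vanilla RNN encoder: returns all hidden states and the final one,
    which serves as the context vector."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in x_seq:                             # one word vector per time step
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # update the hidden state
        states.append(h)
    return np.stack(states), h                  # h is the context vector

def attention_context(states, weights):
    """With attention, the context vector becomes a weighted sum of ALL
    hidden states rather than just the last one."""
    return weights @ states                     # weights: (T,), states: (T, d)
```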
Limitations of RNNs:¶
- Dealing with long-range dependencies is still challenging
- The sequential nature of the model architecture prevents parallelization.
- Vanishing Gradient Problem
2. Introduction to the Transformer:¶
The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolution. The idea behind the Transformer is to handle the dependencies between input and output with attention alone, doing away with recurrence entirely.
For now, let's focus on the Encoder and Decoder parts only. The Encoder block has one Multi-Head Attention layer followed by a Feed-Forward Neural Network layer. The Decoder, on the other hand, has an extra Masked Multi-Head Attention layer.
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
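As a hedged sketch, the sub-layer wrapper LayerNorm(x + Sublayer(x)) might look like this in NumPy (this `layer_norm` is a simplified stand-in; the full LayerNorm also has learned gain and bias parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's features to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    # Residual connection followed by layer normalization:
    # output = LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))
```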
Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
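A minimal sketch of this causal mask in NumPy, assuming `scores` is the (T, T) matrix of raw attention logits (the names here are illustrative, not from the paper):

```python
import numpy as np

def apply_causal_mask(scores):
    # scores: (T, T) attention logits; row i holds position i's scores
    T = scores.shape[0]
    blocked = np.triu(np.ones((T, T), dtype=bool), k=1)  # True = future position
    masked = scores.copy()
    masked[blocked] = -np.inf   # softmax turns -inf into zero attention weight
    return masked
```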
Let's see how this setup of the encoder and the decoder stack works:
An important thing to note here: in addition to the self-attention and feed-forward layers, the decoders also have an extra Encoder-Decoder Attention layer. This helps the decoder focus on the appropriate parts of the input sequence.
2.2 Understanding self-attention:¶
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of that sequence. Consider the sentence "The animal didn't cross the street because it was too tired." Is "it" referring to the street or to the animal? It's a simple question for us but not for an algorithm. When the model is processing the word "it", self-attention tries to associate "it" with "animal" in the same sentence. Self-attention allows the model to look at the other words in the input sequence to get a better understanding of a certain word in the sequence.
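To make the computation concrete, here is a minimal NumPy sketch of scaled dot-product self-attention; the projection matrices `W_q`, `W_k`, and `W_v` are illustrative placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Project the same sequence X (T, d_model) into queries, keys, values
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # each position scores every other
    weights = softmax(scores, axis=-1)   # e.g. "it" puts high weight on "animal"
    return weights @ V                   # weighted sum of value vectors
```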
Self-attention is computed not once but multiple times in the Transformer's architecture, in parallel and independently; it is therefore referred to as Multi-Head Attention. The outputs are concatenated and linearly transformed. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
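Continuing the sketch above (it reuses the `self_attention` function from the previous snippet; the per-head weights and the output projection `W_o` are again illustrative):

```python
import numpy as np

def multi_head_attention(X, heads, W_o):
    # `heads` is a list of (W_q, W_k, W_v) triples, one per head;
    # each head runs self-attention independently (in parallel in practice)
    outputs = [self_attention(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    # Concatenate the head outputs, then apply the final linear transformation
    return np.concatenate(outputs, axis=-1) @ W_o
```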
2.3 Limitations of the Transformer:¶
- Attention can only deal with fixed-length text strings. The text has to be split into a certain number of segments or chunks before being fed into the system as input
- This chunking of text causes context fragmentation. For example, if a sentence is split down the middle, then a significant amount of context is lost. In other words, the text is split without respecting sentence or any other semantic boundaries
But, undoubtedly, the Transformer inspired BERT and all the following breakthroughs in NLP.
Credits: http://jalammar.github.io/illustrated-transformer/
To be continued... stay tuned!