Why did the BERT model come about, and how is it helpful?


1. Intuition behind RNN-based Sequence-to-Sequence Models:


Sequence-to-sequence (seq2seq) models in NLP convert sequences of Type A into sequences of Type B, and they can be applied to many other such tasks. For example, translating English sentences into German sentences is a sequence-to-sequence task.




Let’s take a simple example of a sequence-to-sequence model. Check out the
above illustration:

A single Encoder-Decoder architecture with the cell state as an intermediate representation of the input.

  • Both Encoder and Decoder are RNNs
  • At every time step in the Encoder, the RNN takes a word vector (xi) from the input sequence and a hidden state (Hi) from the previous time step
  • The hidden state is updated at each time step
  • The hidden state from the last unit is known as the context vector. This contains information about the input sequence
  • This context vector is then passed to the decoder, which uses it to generate the target sequence (e.g., the translated phrase)
  • If we use the Attention mechanism, then the weighted sum of the hidden states is passed as the context vector to the decoder (a minimal sketch follows this list)
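To make the flow of hidden states and the context vector concrete, here is a minimal PyTorch sketch. The module choice (a GRU encoder), the toy dimensions, and the fake decoder state are assumptions for illustration only, not part of the original figure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sizes, assumed for illustration
vocab_size, emb_dim, hidden_dim = 1000, 32, 64

embedding = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)

# A fake input sequence of word indices: 1 sentence, 5 tokens
x = torch.randint(0, vocab_size, (1, 5))

# At each time step the RNN consumes a word vector x_i and the previous hidden state
enc_outputs, last_hidden = encoder(embedding(x))   # (1, 5, 64), (1, 1, 64)

# Plain seq2seq: the last hidden state IS the context vector
context = last_hidden

# With attention: the context is a weighted sum of ALL encoder hidden states.
# The decoder state is faked with the last hidden state here, just to show the math.
decoder_state = last_hidden.transpose(0, 1)                     # (1, 1, 64)
scores = torch.bmm(decoder_state, enc_outputs.transpose(1, 2))  # (1, 1, 5)
weights = F.softmax(scores, dim=-1)
attn_context = torch.bmm(weights, enc_outputs)                  # (1, 1, 64)
```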

Limitations of RNNs:

  • Dealing with long-range dependencies is still challenging
  • The sequential nature of the model architecture prevents parallelization.
  • Vanishing Gradient Problem

2. Introduction to the Transformer:

The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolution. The idea behind the Transformer is to handle the dependencies between input and output entirely with attention, doing away with recurrence completely.


Let's focus on the Encoder and Decoder parts only. Now look at the image below. The Encoder block has one layer of Multi-Head Attention followed by a layer of a Feed Forward Neural Network. The Decoder, on the other hand, has an extra Masked Multi-Head Attention layer.


Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
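To make the LayerNorm(x + Sublayer(x)) pattern concrete, here is a simplified PyTorch sketch of one encoder layer. It leans on torch.nn.MultiheadAttention and omits dropout and positional encodings from the paper, so treat it as an illustration of the residual + layer-norm structure rather than a faithful reimplementation:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: multi-head self-attention + position-wise feed-forward,
    each wrapped as LayerNorm(x + Sublayer(x)). Simplified sketch (no dropout)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)       # LayerNorm(x + Sublayer(x))
        x = self.norm2(x + self.ff(x))     # same residual pattern for the FFN
        return x

x = torch.randn(2, 10, 512)                # batch of 2 sequences, 10 positions, d_model = 512
print(EncoderLayer()(x).shape)             # torch.Size([2, 10, 512])
```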

Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
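The "prevent positions from attending to subsequent positions" part boils down to a triangular mask. A minimal sketch of how such a mask can be built (the helper name is mine, not from the paper):

```python
import torch

def subsequent_mask(size):
    """Boolean mask that blocks attention from position i to any position j > i.
    True marks entries that are masked out (set to -inf before the softmax)."""
    return torch.triu(torch.ones(size, size, dtype=torch.bool), diagonal=1)

print(subsequent_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```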

The encoder and decoder blocks are actually multiple identical encoders and decoders stacked on top of each other. Both the encoder stack and the decoder stack have the same number of units. The number of encoder and decoder units is a hyperparameter; as noted above, the paper uses N = 6 for both.
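Stacking is then just repeated application of identical layers. A small sketch, reusing the EncoderLayer from the snippet above (the decoder stack is built the same way):

```python
import torch.nn as nn

N = 6  # number of stacked layers: a hyperparameter (the paper uses 6)

encoder_stack = nn.ModuleList([EncoderLayer() for _ in range(N)])

def encode(x):
    for layer in encoder_stack:
        x = layer(x)      # the output of one encoder feeds the next
    return x              # the final output is what the decoder layers attend to
```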



Let’s see how this setup of the encoder and the decoder stack works:


  • The word embeddings of the input sequence are passed to the first encoder
  • These are then transformed and propagated to the next encoder
  • The output from the last encoder in the encoder-stack is passed to all the decoders in the decoder-stack as shown in the figure below

An important thing to note here: in addition to the self-attention and feed-forward layers, the decoders also have an extra Encoder-Decoder Attention layer. This helps the decoder focus on the appropriate parts of the input sequence.

2.2 Understanding Self-Attention:

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.

Take a look at the image above, which shows the sentence “The animal didn't cross the street because it was too tired.” Can you figure out what the term “it” in this sentence refers to?

Is it referring to the street or to the animal? It's a simple question for us, but not for an algorithm. When the model processes the word “it”, self-attention tries to associate “it” with “animal” in the same sentence.

Self-attention allows the model to look at the other words in the input sequence to get a better understanding of a certain word in the sequence.
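In code, this "looking at the other words" is a softmax over dot products between query and key projections of the same sequence. A bare-bones sketch of scaled dot-product self-attention (the function name and toy sizes are assumptions for illustration):

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a single sequence.
    x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projection matrices."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / math.sqrt(K.shape[-1])   # how strongly each word attends to every other word
    weights = F.softmax(scores, dim=-1)         # e.g. the row for "it" can put most weight on "animal"
    return weights @ V, weights

d_model, d_k, seq_len = 512, 64, 6
x = torch.randn(seq_len, d_model)               # embeddings of a 6-word toy sentence
out, w = self_attention(x, *(torch.randn(d_model, d_k) for _ in range(3)))
print(out.shape, w.shape)                       # torch.Size([6, 64]) torch.Size([6, 6])
```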

Self-attention is computed not once but multiple times in the Transformer's architecture, in parallel and independently. It is therefore referred to as Multi-Head Attention. The outputs are concatenated and linearly transformed, as shown in the figure below:


Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
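A sketch of that "concatenate and linearly transform" step, reusing the self_attention function from the previous snippet (again, the sizes and random weights are only for illustration):

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
d_k = d_model // n_heads                        # 64 dimensions per head

x = torch.randn(6, d_model)                     # 6 words, one vector each
Wq = [torch.randn(d_model, d_k) for _ in range(n_heads)]
Wk = [torch.randn(d_model, d_k) for _ in range(n_heads)]
Wv = [torch.randn(d_model, d_k) for _ in range(n_heads)]
Wo = nn.Linear(n_heads * d_k, d_model)          # the final linear transformation

# Each head runs self-attention in parallel and independently
heads = [self_attention(x, Wq[h], Wk[h], Wv[h])[0] for h in range(n_heads)]
concat = torch.cat(heads, dim=-1)               # (6, 8 * 64) = (6, 512)
output = Wo(concat)                             # (6, 512): concatenated, then linearly transformed
```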

2.3 Limitations of the Transformer:

Attention can only deal with fixed-length text strings. The text has to be split into a certain number of segments or chunks before being fed into the system as input. This chunking of text causes context fragmentation. For example, if a sentence is split down the middle, a significant amount of context is lost. In other words, the text is split without respecting sentence or any other semantic boundaries.
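A toy illustration of that fragmentation, with a naive fixed-length chunker (the function and the segment length are made up for the example):

```python
def chunk(tokens, max_len=8):
    """Split a token list into fixed-length segments, ignoring sentence boundaries."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

tokens = "the animal did not cross the street because it was too tired".split()
for segment in chunk(tokens):
    print(segment)
# ['the', 'animal', 'did', 'not', 'cross', 'the', 'street', 'because']
# ['it', 'was', 'too', 'tired']   <- "it" is now cut off from "animal": context fragmentation
```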

But, undoubtedly, the Transformer inspired BERT and all the breakthroughs in NLP that followed.

Credits : http://jalammar.github.io/illustrated-transformer/

 

To be continued... stay tuned!
