BERT for Classification with TensorFlow Hub

In one of my earlier posts I explained why the BERT model came into existence and how it is useful. If you have not read that yet, click here.
   
Classification is a simple task in NLP, but it is difficult to achieve good accuracy, and taking the code to production is even harder. In this blog we are going to see how to create a simple classification model using BERT, TensorFlow, and TensorFlow Hub.



In this blog post I am going to explain how to build a simple BERT model for a tweet classifier. The code explained here is what I used for an ongoing Kaggle competition named "Real or Not? NLP with Disaster Tweets", which helped me place in the top 10% on the leaderboard.

How does a BERT model work?
    
    BERT makes use of Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, Transformer includes two separate mechanisms — an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary. The detailed workings of Transformer are described in a paper by Google

The chart below is a high-level description of the Transformer encoder. The input is a sequence of tokens, which are first embedded into vectors and then processed in the neural network. The output is a sequence of vectors of size H, in which each vector corresponds to an input token with the same index.

Masked LM (MLM)

Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence. In technical terms, the prediction of the output words requires the following (a small code sketch follows the list):
  1. Adding a classification layer on top of the encoder output.
  2. Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
  3. Calculating the probability of each word in the vocabulary with softmax.
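
To make the three steps above concrete, here is a minimal, illustrative sketch of an MLM-style prediction head in TensorFlow. This is not BERT's actual implementation; the sizes and the random "embedding matrix" are made-up stand-ins, only meant to show the shapes involved.

import tensorflow as tf

hidden_size, vocab_size, seq_len = 128, 1000, 16

# Pretend encoder output: one vector of size hidden_size per input token
encoder_output = tf.random.normal((1, seq_len, hidden_size))

# 1. A classification (transform) layer on top of the encoder output
#    (BERT also applies an activation and layer norm here; omitted for brevity)
hidden = tf.keras.layers.Dense(hidden_size)(encoder_output)

# 2. Multiply each hidden vector by the (transposed) token-embedding matrix
#    to transform it into the vocabulary dimension
embedding_matrix = tf.random.normal((vocab_size, hidden_size))  # stand-in matrix
logits = tf.einsum('bsh,vh->bsv', hidden, embedding_matrix)

# 3. Softmax over the vocabulary gives a probability for every word
probs = tf.nn.softmax(logits, axis=-1)  # shape: (1, seq_len, vocab_size)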


Next Sentence Prediction (NSP)

In BERT training, the model receives pairs of sentences as input and learns to predict whether the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are pairs in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will be disconnected from the first sentence.

To help the model distinguish between the two sentences in training, the input is processed in the following way before entering the model (a small illustration follows the list):


  1. A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
  2. A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings with a vocabulary of 2.
  3. A positional embedding is added to each token to indicate its position in the sequence. The concept and implementation of positional embedding are presented in the Transformer paper.
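
As a quick illustration of these three points (the token strings are just examples; the real inputs are WordPiece ids):

sentence_a = ["my", "dog", "is", "cute"]
sentence_b = ["he", "likes", "playing"]

# [CLS] at the beginning, [SEP] after each sentence
tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]

# Sentence embedding ids: 0 for sentence A (and its special tokens), 1 for sentence B
segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)

# Positional ids: one per token, marking its position in the sequence
position_ids = list(range(len(tokens)))

print(tokens)        # ['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'likes', 'playing', '[SEP]']
print(segment_ids)   # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(position_ids)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]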
Without wasting much time, let's get started with the coding. Hang on with me, as it is going to get more technical.

BERT DOES NOT HAVE A DECODER BLOCK

Prerequisites
  • tensorflow_hub - for loading the BERT model from TensorFlow Hub
  • tensorflow - for building the model
  • pandas - for file-reading operations
  • numpy - for array operations
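
In a notebook environment such as Kaggle or Colab these usually come pre-installed; if not, something like the line below should cover them (the package names are the usual PyPI ones).

!pip install --quiet tensorflow tensorflow-hub pandas numpy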


Before getting started with the code, we need one file named tokenization.py, which helps us tokenize the text. To get it, run:

!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py
The above line downloads the script so that we can import it as a module named tokenization.

Import Statements
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub
import tokenization

Loading the BERT Layer from TensorFlow Hub

module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)
With the help of TensorFlow Hub we load the BERT model with trainable=True, so that the BERT weights are fine-tuned along with the rest of the model.


Read the data
train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")
submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")


Some Preprocessing Based on the BERT Model

def bert_encode(texts, tokenizer, max_len=512):
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)      
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
The above code converts each input text into token ids, attention masks, and segment ids. It tokenizes the text, truncates it to max_len - 2 tokens, adds the [CLS] token at the start of the sentence and the [SEP] token at the end, converts the tokens to ids, and then pads every sentence to the same length with zeros. The mask marks the real tokens with 1 and the padding with 0, and the segment ids are all 0 because each input is a single sentence.
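
As a quick illustrative sanity check (it assumes the tokenizer created in the "Getting Vocab and Tokenizer" section below), we can encode one sample tweet with a small max_len and inspect the three outputs:

# Illustrative only -- 'tokenizer' is built later in this post
tokens, masks, segments = bert_encode(["Forest fire near La Ronge Sask. Canada"],
                                      tokenizer, max_len=16)
print(tokens[0])    # the [CLS] id, the word-piece ids, the [SEP] id, then zero padding up to 16
print(masks[0])     # 1 for every real token, 0 for every padding position
print(segments[0])  # all zeros, because we feed a single sentence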

Define the Model

def build_model(bert_layer, max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")
    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    clf_output = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(clf_output)
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    model.compile(Adam(lr=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model
The model accepts the BERT layer and the max length as inputs. It defines the word ids, the mask, and the segment ids as Keras Input layers, then calls the BERT layer with the ids, mask, and segments. The BERT layer returns two outputs; we take the second one (the sequence output), keep the [CLS] vector, and pass it into a Dense layer with 1 output unit because this is a binary classification (change this value according to your problem statement). Finally we compile the model with binary_crossentropy as the loss, accuracy as the metric, and Adam as the optimizer.
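
For example, if your problem were multi-class rather than binary, the last lines of build_model could be adapted roughly as below (num_classes is a placeholder for your own number of labels, and the labels are assumed to be integer-encoded):

# Sketch of a multi-class variant of the output head (num_classes is a placeholder)
out = Dense(num_classes, activation='softmax')(clf_output)
model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
model.compile(Adam(lr=1e-5), loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])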

Getting Vocab and Tokenizer

vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)
We get the vocab file for the BERT model from the layer loaded from TensorFlow Hub, and we initialize the tokenizer that will tokenize the given input by passing it the vocab file and the lower-case flag.
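
To see what the tokenizer does, you can tokenize any sample sentence; it returns a list of lower-cased word pieces (rare words get split into sub-tokens marked with '##'), which convert_tokens_to_ids then maps to integer ids:

sample = "Just happened a terrible car crash"
print(tokenizer.tokenize(sample))
print(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sample)))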

Calling bert_encode on the train and test data

train_input = bert_encode(train.text.values, tokenizer, max_len=160)
test_input = bert_encode(test.text.values, tokenizer, max_len=160)
train_labels = train.target.values

We call the bert_encode function defined earlier on the train and test data, passing the data, the tokenizer, and the max_len of each sentence to be fed to the model.
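
A quick illustrative check of what bert_encode returned: each element is an array with one row per tweet and 160 columns.

for name, arr in zip(["token ids", "masks", "segment ids"], train_input):
    print(name, arr.shape)   # each should be (number of tweets, 160)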


Build the model and see the model summary
model = build_model(bert_layer, max_len=160)
model.summary()

Create the checkpoint and train the model on the train data for 5 epochs, with a batch_size of 16 and a validation_split of 0.2.


checkpoint = ModelCheckpoint('model.h5', monitor='val_loss', save_best_only=True)

train_history = model.fit(
    train_input, train_labels,
    validation_split=0.2,
    epochs=5,
    callbacks=[checkpoint],
    batch_size=16
)
Loading the saved model for prediction 
model.load_weights('model.h5')
test_pred = model.predict(test_input)

Create the Submission file 

submission['target'] = test_pred.round().astype(int)
submission.to_csv('bert_Model_submission1.csv', index=False)

Please feel free to play with the hyper-parameters to achieve higher accuracy. I got into the top 150 with this code.


Link to LeaderBoard

If you have any doubts about this blog, feel free to comment here or on my Kaggle kernel. The link to the Kaggle kernel is here; please do upvote the kernel if you find it useful, and please do visit my blog for further posts.
If you want to follow me personally, do follow me on


Comments

  1. Use the ktrain module for NLP-based problems. The ktrain module supports vision-related problems too. Ktrain also comprises pretrained NLP models such as BERT, DistilBERT, RoBERTa, etc. Simply put, in fewer than 5 lines of code we can build a state-of-the-art NLP model. Hope you use it!

     Reply: Yes, surely I will try it, and if possible I will try to do a blog on that library.