In one of my previous posts I explained why the BERT model came into existence and how it is useful; if you have not read that yet, click here to read it.
Classification is a simple task in NLP, but it is difficult to achieve good accuracy and to take the code to production. In this blog we are going to see how to create a simple classification model using BERT, TensorFlow, and TensorFlow Hub.
In this post I am going to explain how to build a simple BERT model for a tweet classifier. The code I am going to walk through is the one I used for an ongoing Kaggle competition named Real or Not? NLP with Disaster Tweets, which helped me place in the top 10% of the leaderboard.
How does a BERT model work?
BERT makes use of Transformer, an attention mechanism that learns
contextual relations between words (or sub-words) in a text. In its
vanilla form, Transformer includes two separate mechanisms — an encoder
that reads the text input and a decoder that produces a prediction for
the task. Since BERT’s goal is to generate a language model, only the
encoder mechanism is necessary. The detailed workings of Transformer are
described in a paper by Google.
At a high level, the Transformer encoder works as follows.
The input is a sequence of tokens, which are first embedded into vectors
and then processed in the neural network. The output is a sequence of
vectors of size H, in which each vector corresponds to an input token
with the same index.
Masked LM (MLM)
Before
feeding word sequences into BERT, 15% of the words in each sequence are
replaced with a [MASK] token. The model then attempts to predict the
original value of the masked words, based on the context provided by the
other, non-masked, words in the sequence. In technical terms, the
prediction of the output words requires:
- Adding a classification layer on top of the encoder output.
- Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
- Calculating the probability of each word in the vocabulary with softmax.
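To make the masking step concrete, here is a tiny illustrative sketch; it is not the actual BERT pre-training code (which also sometimes keeps or swaps the selected tokens instead of masking all of them), just a demonstration of the idea.

import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    # Illustrative only: randomly replace roughly 15% of tokens with [MASK]
    # and remember the original values the model would be trained to predict.
    masked = list(tokens)
    labels = {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok          # the word the model should recover
            masked[i] = mask_token   # what the model actually sees
    return masked, labels

masked, labels = mask_tokens("the quick brown fox jumps over the lazy dog".split())
print(masked)
print(labels)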
Next Sentence Prediction (NSP)
In BERT training, the model receives pairs of sentences as
input and learns to predict if the second sentence in the pair is the
subsequent sentence in the original document. During training, 50% of
the inputs are a pair in which the second sentence is the subsequent
sentence in the original document, while in the other 50% a random
sentence from the corpus is chosen as the second sentence. The
assumption is that the random sentence will be disconnected from the
first sentence.
To help the model distinguish between the two sentences in training, the input is processed in the following way before entering the model:
- A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
- A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings with a vocabulary of 2.
- A positional embedding is added to each token to indicate its position in the sequence. The concept and implementation of positional embedding are presented in the Transformer paper.
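As a concrete illustration (the sentence pair is the classic example from the BERT paper, not from the tweet data), the processed input looks roughly like this:

# Illustrative sentence-pair input for next sentence prediction
tokens      = ["[CLS]", "my", "dog", "is", "cute", "[SEP]",
               "he", "likes", "play", "##ing", "[SEP]"]
segment_ids = [0, 0, 0, 0, 0, 0,    # sentence A, including [CLS] and its [SEP]
               1, 1, 1, 1, 1]       # sentence B
# On top of this, a positional embedding (0, 1, 2, ...) is added for each position.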
Without wasting much time, let's get started with the coding. Hang on with me, as it is going to get more technical.
BERT DOES NOT HAVE A DECODER BLOCK.
Prerequisites
- tensorflow_hub - for loading the BERT model from TensorFlow Hub
- tensorflow - for building the model
- pandas - for file reading operations
- numpy - for array operations
Before getting started with the code, we need a file named tokenization.py, which helps to tokenize the text. To get it:
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py
The above line downloads the script so that we can import it as a module named tokenization.
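If wget is not available in your environment, the same file can be fetched directly from Python; this is just an equivalent alternative, not part of the original kernel.

import urllib.request

urllib.request.urlretrieve(
    "https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py",
    "tokenization.py",
)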
Import Statements
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub
import tokenization
Loading the BERT Layer from TensorFlow Hub
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)

With the help of TensorFlow Hub we load the BERT model with trainable=True, so that the BERT weights are fine-tuned while we train on our data.
Read the data
train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")
submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")
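As an optional sanity check (not part of the original kernel), you can take a quick look at the data; the text and target columns are the ones used below.

print(train.shape, test.shape)
print(train[['text', 'target']].head())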
Some pre-processing based on the BERT model
def bert_encode(texts, tokenizer, max_len=512):
    all_tokens = []
    all_masks = []
    all_segments = []

    for text in texts:
        # Tokenize and truncate so that [CLS] and [SEP] still fit within max_len
        text = tokenizer.tokenize(text)
        text = text[:max_len - 2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)

        # Convert tokens to ids, then pad every sequence to the same length
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len

        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)

    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
The function above converts each input text into the three arrays BERT expects: token ids, an attention mask, and segment ids. It tokenizes the text, truncates it so that the special tokens still fit within max_len, adds the [CLS] token at the start and the [SEP] token at the end of the sentence, converts the tokens to ids, and pads every sequence to the same length. The mask marks which positions hold real tokens (1) and which are padding (0), and the segment ids are all zeros because each tweet is a single sentence.
Define the Model

def build_model(bert_layer, max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    clf_output = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(clf_output)

    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    model.compile(Adam(lr=1e-5), loss='binary_crossentropy', metrics=['accuracy'])

    return model

The function accepts the BERT layer and max_len as inputs. It defines three input tensors for the token ids, the mask, and the segment ids, and passes them to the BERT layer, which returns two outputs: a pooled output and the full sequence output. We take the second output, slice out the vector for the [CLS] token, and pass it to a Dense layer with a single sigmoid unit, because this is a binary classification problem (change this value according to your problem statement). Finally, the model is compiled with "binary_crossentropy" as the loss, "accuracy" as the metric, and Adam as the optimizer.
Getting Vocab and Tokenizer
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)
We get the vocab file for the BERT model from the layer loaded from TensorFlow Hub, and initialize the tokenizer with that vocab file and the lowercase flag so that it can tokenize the given input.
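To see what the tokenizer and the bert_encode function from earlier actually produce, here is a small illustrative check (the sample tweet and max_len=20 are chosen just for the demo):

sample = "Forest fire near La Ronge Sask. Canada"
print(tokenizer.tokenize(sample))               # WordPiece sub-word tokens of the tweet
ids, masks, segments = bert_encode([sample], tokenizer, max_len=20)
print(ids.shape, masks.shape, segments.shape)   # (1, 20) for each array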
Calling bert_encode on the train and test data
train_input = bert_encode(train.text.values, tokenizer, max_len=160)
test_input = bert_encode(test.text.values, tokenizer, max_len=160)
train_labels = train.target.values
We call the bert_encode function defined earlier on the train and test data, passing the tokenizer and the max_len to which each sentence is padded before being fed to the model.
Build the model and view the model summary
model = build_model(bert_layer, max_len=160)
model.summary()

Create a checkpoint and train the model on the training data for 5 epochs, with a batch size of 16 and a validation split of 0.2.

checkpoint = ModelCheckpoint('model.h5', monitor='val_loss', save_best_only=True)
train_history = model.fit(train_input, train_labels, validation_split=0.2, epochs=5, callbacks=[checkpoint], batch_size=16)
Loading the saved model for prediction
model.load_weights('model.h5')
test_pred = model.predict(test_input)
Create the Submission file

submission['target'] = test_pred.round().astype(int)
submission.to_csv('bert_Model_submission1.csv', index=False)
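The model outputs sigmoid probabilities, so .round() is simply a 0.5 decision threshold. If you want to experiment with a different cut-off, a minimal variant looks like this (the threshold value is hypothetical and should be tuned on the validation split; this is not part of the original kernel):

threshold = 0.5   # hypothetical value; tune it on the validation split
submission['target'] = (test_pred.ravel() > threshold).astype(int)
submission.to_csv('bert_Model_submission1.csv', index=False)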
Please feel free to play with the hyper-parameters to achieve higher accuracy. This code placed me in the top 150 on the leaderboard.
Link to LeaderBoard
If you have any doubts about this blog, feel free to comment here or on my Kaggle kernel.
The link to the Kaggle kernel is here; please upvote the kernel if you find it useful, and do visit
my blog for further posts.
If you want to follow me personally, do follow me on
Use the ktrain module for NLP-based problems. The ktrain module supports vision-related problems too. ktrain also includes pretrained NLP models such as BERT, DistilBERT, RoBERTa, etc. Simply put, in less than 5 lines of code we can build a state-of-the-art NLP model. Hope you use it!
Yes, surely I will try it, and if possible I will write a blog post on that library.