Today we are going to see how we can use a BERT model for multi-label classification with PyTorch and the Hugging Face Transformers library.
Problem Statement
Topic Modeling for Research Articles
Researchers have access to large online archives of scientific articles. As a consequence, finding relevant articles has become more difficult. Tagging or topic modelling provides a way to attach identifying labels to research articles, which facilitates the recommendation and search process.
Given the abstract and title for a set of research articles, predict the topics for each article included in the test set.
Note that a research article can possibly have more than 1 topic. The research article abstracts and titles are sourced from the following 6 topics:
1. Computer Science
2. Physics
3. Mathematics
4. Statistics
5. Quantitative Biology
6. Quantitative Finance
Evaluation Metric
Submissions are evaluated on the micro F1 score between the predicted and observed topics for each article in the test set.
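For reference, scikit-learn can compute this metric directly. A minimal sketch with made-up label matrices (`y_true` and `y_pred` are illustrative assumptions, shaped `(n_articles, 6)`):

from sklearn.metrics import f1_score
import numpy as np

# Hypothetical example: 3 articles, 6 topics, one row of binary indicators per article
y_true = np.array([[1, 0, 0, 0, 0, 0],
                   [1, 0, 1, 1, 0, 0],
                   [0, 1, 0, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 0, 0, 0],
                   [1, 0, 0, 1, 0, 0],
                   [0, 1, 0, 1, 0, 0]])

# Micro F1 pools true/false positives and false negatives across all 6 labels before computing F1
print(f1_score(y_true, y_pred, average="micro"))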
Public and Private Split
The test data is further divided into Public (40%) and Private (60%) splits:
- Your initial submissions will be checked and scored on the Public data.
- The final rankings will be based on your Private score, which will be published once the competition is over.
Read the files
import pandas as pd

train = pd.read_csv('/kaggle/input/janatahack-independence-day-2020-ml-hackathon/train.csv')
test = pd.read_csv('/kaggle/input/janatahack-independence-day-2020-ml-hackathon/test.csv')
sub = pd.read_csv('/kaggle/input/janatahack-independence-day-2020-ml-hackathon/sample_submission_UVKGLZE.csv')
Let's look at the dataset
Train Data
| | ID | TITLE | ABSTRACT | Computer Science | Physics | Mathematics | Statistics | Quantitative Biology | Quantitative Finance |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Reconstructing Subject-Specific Effect Maps | Predictive models allow subject-specific inf... | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | Rotation Invariance Neural Network | Rotation invariance and translation invarian... | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | Spherical polyharmonics and Poisson kernels fo... | We introduce and develop the notion of spher... | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 4 | A finite element approximation for the stochas... | The stochastic Landau--Lifshitz--Gilbert (LL... | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 5 | Comparative study of Discrete Wavelet Transfor... | Fourier-transform infra-red (FTIR) spectra o... | 1 | 0 | 0 | 1 | 0 | 0 |
Test Data
| | ID | TITLE | ABSTRACT |
|---|---|---|---|
| 0 | 20973 | Closed-form Marginal Likelihood in Gamma-Poiss... | We present novel understandings of the Gamma... |
| 1 | 20974 | Laboratory mid-IR spectra of equilibrated and ... | Meteorites contain minerals from Solar Syste... |
| 2 | 20975 | Case For Static AMSDU Aggregation in WLANs | Frame aggregation is a mechanism by which mu... |
| 3 | 20976 | The -ESO Survey: the inner disk intermed... | Milky Way open clusters are very diverse in ... |
| 4 | 20977 | Witness-Functions versus Interpretation-Functi... | Proving that a cryptographic protocol is cor... |
Let's Get Started
Import the Necessary Libraries
import re
import random
import time
import numpy as np
from tqdm import tqdm
%matplotlib inline
from sklearn.model_selection import train_test_split
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
words = stopwords.words("english")
lemma = nltk.stem.WordNetLemmatizer()
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, RobertaTokenizer
from transformers import BertModel, RobertaModel
from transformers import AdamW, get_linear_schedule_with_warmup

# Use the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Select the Target Columns
targets = ['Computer Science', 'Physics', 'Mathematics', 'Statistics', 'Quantitative Biology', 'Quantitative Finance']
Since this is a multi-label classification problem, there are 6 target columns.
Concatenate the title and abstract into a new text column, which becomes the X feature for both the train and test data.
train["text"] = train["TITLE"]+""+train["ABSTRACT"]
test["text"] = test["TITLE"]+""+test["ABSTRACT"]
Split the Training Data into Train and Validation Sets
X = train.text.values
y = train[['Computer Science', 'Physics', 'Mathematics',
'Statistics', 'Quantitative Biology', 'Quantitative Finance']].values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=2020)
The dataset contains a lot of noise, such as special characters and math equations, since the texts are research-paper abstracts. We need to take care of these before feeding the text to the model.
I removed the special characters and LaTeX tags using regular expressions.
Text Pre-Processing
def text_preprocessing(text):
    """
    - Lowercase the text and expand common contractions
    - Remove entity mentions (e.g. '@united'), digits and extra whitespace
    - Correct HTML escapes (e.g. '&amp;' to '&')
    @param text (str): a string to be processed.
    @return text (str): the processed string.
    """
    text = text.lower()
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"won't", "will not ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "can not ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r"\'scuse", " excuse ", text)
    text = re.sub(r"\'\n", " ", text)
    text = re.sub(r"-", " ", text)
    text = re.sub(r"\'\xa0", " ", text)
    text = re.sub(r"\s+", " ", text)
    # Remove digits
    text = ''.join(c for c in text if not c.isnumeric())
    # Remove '@name' mentions
    text = re.sub(r'(@.*?)[\s]', ' ', text)
    # Replace '&amp;' with '&'
    text = re.sub(r'&amp;', '&', text)
    # Remove trailing whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text
To remove the LaTeX tags:
text = re.sub(r'(\$+)(?:(?!\1)[\s\S])*\1',' ',text)
To remove the words within brackets:
text = re.sub(r'\([^)]*\)', '', text)
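As a quick illustration, here is how those two regexes act on a made-up abstract snippet; the leftover double spaces are later collapsed by the `\s+` rule in `text_preprocessing`:

# Made-up abstract snippet, purely to illustrate the two regexes above
sample = "We prove that $E = mc^2$ holds (under mild assumptions) for all observers."
sample = re.sub(r'(\$+)(?:(?!\1)[\s\S])*\1', ' ', sample)  # drop inline LaTeX like $...$
sample = re.sub(r'\([^)]*\)', '', sample)                  # drop parenthesised asides
print(sample)  # extra spaces remain; text_preprocessing collapses them later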
Pre-processing the text into the format BERT expects
#Load the Bert tokenizer
# tokenizer = RobertaTokenizer.from_pretrained('roberta-base',do_lower_case=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased",do_lower_case=True)
# Create a function to tokenize a set of texts
def preprocessing_for_bert(data):
    """Perform required preprocessing steps for pretrained BERT.
    @param data (np.array): Array of texts to be processed.
    @return input_ids (torch.Tensor): Tensor of token ids to be fed to a model.
    @return attention_masks (torch.Tensor): Tensor of indices specifying which
            tokens should be attended to by the model.
    """
    # Create empty lists to store outputs
    input_ids = []
    attention_masks = []
    # For every sentence...
    for sent in data:
        # `encode_plus` will:
        #   (1) Tokenize the sentence
        #   (2) Add the `[CLS]` and `[SEP]` tokens to the start and end
        #   (3) Truncate/pad the sentence to max length
        #   (4) Map tokens to their IDs
        #   (5) Create the attention mask
        #   (6) Return a dictionary of outputs
        encoded_sent = tokenizer.encode_plus(
            text=text_preprocessing(sent),   # Preprocess the sentence
            add_special_tokens=True,         # Add `[CLS]` and `[SEP]`
            max_length=MAX_LEN,              # Max length to truncate/pad
            truncation=True,                 # Truncate sentences longer than max_length
            pad_to_max_length=True,          # Pad sentence to max length
            return_attention_mask=True       # Return attention mask
        )
        # Add the outputs to the lists
        input_ids.append(encoded_sent.get('input_ids'))
        attention_masks.append(encoded_sent.get('attention_mask'))
    # Convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)
    return input_ids, attention_masks
We need to pass fixed-length sequences to the BERT model, so I have fixed the maximum length at 500 tokens (BERT can handle at most 512).
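Before fixing the maximum length, it is worth checking how long the tokenized texts actually are. A rough sketch (sampling 1,000 training texts is an arbitrary choice, purely for speed):

# Rough check of tokenized lengths on a sample of the training texts
lengths = [len(tokenizer.encode(text_preprocessing(t), add_special_tokens=True))
           for t in X_train[:1000]]
print(max(lengths), np.mean(lengths))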
MAX_LEN = 500
# Pre-processing the text
train_inputs, train_masks = preprocessing_for_bert(X_train)
val_inputs, val_masks = preprocessing_for_bert(X_val)
# Convert other data types to torch.Tensor
train_labels = torch.tensor(y_train)
val_labels = torch.tensor(y_val)

# For fine-tuning BERT, the authors recommend a batch size of 16 or 32
batch_size = 16

# Create the DataLoader for our training set
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set
val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)
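A quick sanity check on one batch from the training DataLoader (purely illustrative): the ids and mask should be `(batch_size, MAX_LEN)` and the labels `(batch_size, 6)`.

# Peek at a single batch to confirm tensor shapes
b_ids, b_mask, b_labels = next(iter(train_dataloader))
print(b_ids.shape, b_mask.shape, b_labels.shape)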
Create a class for the BERT Model
# Create the BertClassifier class
class BertClassifier(nn.Module):
    """BERT model for classification tasks."""
    def __init__(self, freeze_bert=False):
        """
        @param bert: a BertModel object
        @param classifier: a torch.nn.Module classifier
        @param freeze_bert (bool): Set `False` to fine-tune the BERT model
        """
        super(BertClassifier, self).__init__()
        # Specify hidden size of BERT, hidden size of our classifier, and number of labels
        D_in, H, D_out = 768, 30, 6

        # self.bert = RobertaModel.from_pretrained('roberta-base')
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Sequential(
            nn.Linear(D_in, H),
            nn.ReLU(),
            nn.Linear(H, D_out))
        self.sigmoid = nn.Sigmoid()

        # Freeze the BERT model if requested
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False

    def forward(self, input_ids, attention_mask):
        """Feed input to BERT and the classifier to compute logits.
        @param input_ids (torch.Tensor): an input tensor with shape (batch_size, max_length)
        @param attention_mask (torch.Tensor): a tensor that holds attention mask information
               with shape (batch_size, max_length)
        @return logits (torch.Tensor): an output tensor with shape (batch_size, num_labels)
        """
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Extract the last hidden state of the `[CLS]` token for the classification task
        last_hidden_state_cls = outputs[0][:, 0, :]
        # Feed it to the classifier to compute logits
        logits = self.classifier(last_hidden_state_cls)
        # The sigmoid is applied inside BCEWithLogitsLoss, so raw logits are returned here
        # logits = self.sigmoid(logits)
        return logits
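As another illustrative check, the classifier can be instantiated and run on a dummy batch to confirm that it returns one logit per topic; the token ids and sequence length below are arbitrary:

# Dummy forward pass: 2 sequences of length 10, random token ids
clf = BertClassifier(freeze_bert=True)
dummy_ids = torch.randint(0, 30000, (2, 10))
dummy_mask = torch.ones(2, 10, dtype=torch.long)
with torch.no_grad():
    out = clf(dummy_ids, dummy_mask)
print(out.shape)  # torch.Size([2, 6]) -- one logit per topic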
Initialize the BERT Model
def initialize_model(epochs=4):
    """Initialize the Bert Classifier, the optimizer and the learning rate scheduler."""
    # Instantiate the Bert Classifier
    bert_classifier = BertClassifier(freeze_bert=False)
    bert_classifier.to(device)
    # Create the optimizer
    optimizer = AdamW(bert_classifier.parameters(),
                      lr=5e-5,   # Default learning rate
                      eps=1e-8   # Default epsilon value
                      )
    # Total number of training steps
    total_steps = len(train_dataloader) * epochs
    # Set up the learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_warmup_steps=0,  # Default value
                                                num_training_steps=total_steps)
    return bert_classifier, optimizer, scheduler
Define the Loss Function and Training Loop
# Specify the loss function
# loss_fn = nn.CrossEntropyLoss()
loss_fn = nn.BCEWithLogitsLoss()

def set_seed(seed_value=42):
    """Set seed for reproducibility."""
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

def train(model, train_dataloader, val_dataloader=None, epochs=4, evaluation=False):
    """Train the BertClassifier model."""
    # Start training loop
    print("Start training...\n")
    for epoch_i in range(epochs):
        # =======================================
        #               Training
        # =======================================
        # Print the header of the result table
        print(f"{'Epoch':^7} | {'Batch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9} | {'Elapsed':^9}")
        print("-"*70)

        # Measure the elapsed time of each epoch
        t0_epoch, t0_batch = time.time(), time.time()

        # Reset tracking variables at the beginning of each epoch
        total_loss, batch_loss, batch_counts = 0, 0, 0

        # Put the model into training mode
        model.train()

        # For each batch of training data...
        for step, batch in enumerate(train_dataloader):
            batch_counts += 1
            # Load batch to GPU
            b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

            # Zero out any previously calculated gradients
            model.zero_grad()

            # Perform a forward pass. This will return logits.
            logits = model(b_input_ids, b_attn_mask)

            # Compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels.float())
            batch_loss += loss.item()
            total_loss += loss.item()

            # Perform a backward pass to calculate gradients
            loss.backward()

            # Clip the norm of the gradients to 1.0 to prevent "exploding gradients"
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            # Update parameters and the learning rate
            optimizer.step()
            scheduler.step()

            # Print the loss values and time elapsed every 50000 batches and at the end of each epoch
            if (step % 50000 == 0 and step != 0) or (step == len(train_dataloader) - 1):
                # Calculate time elapsed since the last report
                time_elapsed = time.time() - t0_batch

                # Print training results
                print(f"{epoch_i + 1:^7} | {step:^7} | {batch_loss / batch_counts:^12.6f} | {'-':^10} | {'-':^9} | {time_elapsed:^9.2f}")

                # Reset batch tracking variables
                batch_loss, batch_counts = 0, 0
                t0_batch = time.time()

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)
        print("-"*70)

        # =======================================
        #               Evaluation
        # =======================================
        if evaluation:
            # After the completion of each training epoch, measure the model's
            # performance on our validation set.
            val_loss, val_accuracy = evaluate(model, val_dataloader)

            # Print performance over the entire training data
            time_elapsed = time.time() - t0_epoch
            print(f"{epoch_i + 1:^7} | {'-':^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")
            print("-"*70)
        print("\n")

    print("Training complete!")

def evaluate(model, val_dataloader):
    """After the completion of each training epoch, measure the model's
    performance on our validation set.
    """
    # Put the model into evaluation mode. The dropout layers are disabled during test time.
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []

    # For each batch in our validation set...
    for batch in val_dataloader:
        # Load batch to GPU
        b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)

        # Compute loss
        loss = loss_fn(logits, b_labels.float())
        val_loss.append(loss.item())

        # Get the predictions
        # preds = torch.argmax(logits, dim=1).flatten()
        # Calculate the accuracy rate
        # accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        accuracy = accuracy_thresh(logits.view(-1, 6), b_labels.view(-1, 6))
        val_accuracy.append(accuracy)

    # Compute the average accuracy and loss over the validation set.
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)

    return val_loss, val_accuracy

def accuracy_thresh(y_pred, y_true, thresh: float = 0.5, sigmoid: bool = True):
    """Compute element-wise accuracy when `y_pred` and `y_true` are the same size."""
    if sigmoid:
        y_pred = y_pred.sigmoid()
    return ((y_pred > thresh) == y_true.byte()).float().mean().item()
    # return np.mean(((y_pred>thresh).float()==y_true.float()).float().cpu().numpy(), axis=1).sum()
# Train the model
set_seed(42)    # Set seed for reproducibility
bert_classifier, optimizer, scheduler = initialize_model(epochs=1)
train(bert_classifier, train_dataloader, val_dataloader, epochs=1, evaluation=True)
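The walkthrough stops at training, but for completeness here is a minimal inference sketch for the test set and the submission file. The 0.5 threshold and the assumption that the sample submission uses the same six column names as the train data are mine, not something fixed by the original pipeline:

# Score the test set with the fine-tuned model (sketch; 0.5 threshold is an assumption)
test_inputs, test_masks = preprocessing_for_bert(test.text.values)
test_data = TensorDataset(test_inputs, test_masks)
test_dataloader = DataLoader(test_data, sampler=SequentialSampler(test_data), batch_size=16)

bert_classifier.eval()
all_preds = []
for batch in test_dataloader:
    b_input_ids, b_attn_mask = tuple(t.to(device) for t in batch)
    with torch.no_grad():
        logits = bert_classifier(b_input_ids, b_attn_mask)
    probs = torch.sigmoid(logits)                    # per-label probabilities
    all_preds.append((probs > 0.5).long().cpu().numpy())

preds = np.vstack(all_preds)
for i, col in enumerate(['Computer Science', 'Physics', 'Mathematics',
                         'Statistics', 'Quantitative Biology', 'Quantitative Finance']):
    sub[col] = preds[:, i]
sub.to_csv('submission.csv', index=False)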
That's it! You have fine-tuned BERT on your own dataset and can move the model towards production.
In the next blog we will see how to productionize your BERT model.