How to Build a Good, Generalized ML Model

In this blog post, we will look at how to approach building a good, generalized machine learning model, following the steps that many people use in industry and in competitions.

Let's dive in.



Steps to follow to create an ML model
  • Data collection
  • Data pre-processing
  • Feature Engineering
  • Model Building
  • Evaluation
  • Hyperparameter tuning
  • Model Testing

We will go through each of these steps in detail.

Data Collection

The first challenge in any problem is finding relevant datasets. Public sources such as Kaggle and the UCI Repository host many datasets. You can also collect data yourself, for example through web scraping or by building a pipeline that fetches data from a live stream.

Data Pre-Processing

After collecting the datasets, the next step is to pre-process the data. Typical tasks include:

1) Handling NaN values
2) Handling the target class distribution (classification problems)
3) Removing or otherwise handling outliers
4) Standardization
5) Normalization (optional)
6) Target column encoding
7) Handling categorical data, i.e. converting text data to numerical data
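
To make a few of these steps concrete, here is a minimal sketch using scikit-learn on a tiny, made-up dataset. The arrays `X_num`, `X_cat` and `y_raw` are hypothetical, and the choices of median imputation and one-hot encoding are assumptions for illustration, not the only options. For brevity the transformers are fitted on all rows; in a real project you would normally fit them on the training split only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns (one value missing),
# one categorical text column, and a text target column.
X_num = np.array([[25.0, 50000.0],
                  [32.0, np.nan],
                  [47.0, 82000.0],
                  [51.0, 61000.0]])
X_cat = np.array([["red"], ["blue"], ["red"], ["green"]])
y_raw = np.array(["yes", "no", "yes", "no"])

# Step 1: handle NaN values by filling each gap with the column median.
X_num_filled = SimpleImputer(strategy="median").fit_transform(X_num)

# Step 4: standardization, i.e. rescale each numeric column to mean 0 and std 1.
X_num_scaled = StandardScaler().fit_transform(X_num_filled)

# Step 7: convert the text column to numbers with one-hot encoding.
X_cat_encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(X_cat).toarray()

# Step 6: encode the target column as integers.
y = LabelEncoder().fit_transform(y_raw)

# Combine the processed numeric and categorical parts into one feature matrix.
X = np.hstack([X_num_scaled, X_cat_encoded])
```

Outlier handling and class balancing are left out here because the right approach depends heavily on the dataset.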

Feature Engineering

After pre-processing, try to create new features from the existing ones where possible, for example by adding or multiplying two features; many other combinations are possible. Then apply feature selection: try to keep the features that are most strongly correlated with the target.
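
As a rough sketch of this idea, the snippet below builds a sum and a product feature and ranks all features by their absolute Pearson correlation with the target. The data is synthetic and the target is deliberately constructed to depend on the product of `a` and `b`; all names here are hypothetical. Note that correlation only captures linear relationships, so treat it as one possible selection criterion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw features.
a = rng.normal(size=200)
b = rng.normal(size=200)
noise = rng.normal(size=200)

# Synthetic target (an assumption for this demo): depends on a * b.
y = a * b + 0.1 * rng.normal(size=200)

# Existing features plus two engineered ones (sum and product).
features = {
    "a": a,
    "b": b,
    "noise": noise,
    "a_plus_b": a + b,
    "a_times_b": a * b,
}

# Score each feature by its absolute Pearson correlation with the target.
scores = {name: abs(np.corrcoef(col, y)[0, 1]) for name, col in features.items()}

# Feature selection: keep the two highest-scoring features.
ranked = sorted(scores, key=scores.get, reverse=True)
top_k = ranked[:2]
```

Here the engineered product feature ranks first, even though neither `a` nor `b` on its own is strongly correlated with the target.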


Model Building

Important Point

Always use a cross-validation technique to help the model generalize. It may not help in every case, but it works well in many hackathons and is recommended.

The most widely used cross-validation techniques are:
  1. K-Fold Cross Validation
  2. Stratified K-Fold Cross Validation
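
Both techniques are available in scikit-learn. The sketch below runs them on a synthetic, imbalanced dataset; the dataset, the logistic regression model and the choice of 5 folds are assumptions for illustration. The default score for a classifier in `cross_val_score` is accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Synthetic imbalanced dataset: roughly 80% class 0 and 20% class 1.
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
model = LogisticRegression(max_iter=1000)

# K-Fold: split the rows into 5 folds without looking at the labels.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_scores = cross_val_score(model, X, y, cv=kfold)

# Stratified K-Fold: each fold keeps roughly the same class ratio as the full data.
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
skfold_scores = cross_val_score(model, X, y, cv=skfold)

# Share of class 1 in each stratified validation fold.
fold_ratios = [y[test_idx].mean() for _, test_idx in skfold.split(X, y)]

print("K-Fold mean accuracy:", kfold_scores.mean())
print("Stratified K-Fold mean accuracy:", skfold_scores.mean())
```

Stratification matters most for classification problems with an uneven target class distribution, because a plain K-Fold split can leave a fold with very few minority-class rows.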

There are multiple ways to build a model. The best approach is to aim for a model whose loss is low on both your training and validation data.
You can try several different models side by side by building pipelines with sklearn.
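
One possible way to do this is sketched below: each candidate model is wrapped in its own pipeline and scored with the same cross-validation setup. The three candidates, the synthetic dataset and the use of log loss as the metric are assumptions for illustration; swap in whatever fits your problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset standing in for your own data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# One pipeline per candidate: scaling followed by the model.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision_tree": make_pipeline(
        StandardScaler(), DecisionTreeClassifier(max_depth=3, random_state=0)),
    "random_forest": make_pipeline(
        StandardScaler(), RandomForestClassifier(n_estimators=50, random_state=0)),
}

# Evaluate every candidate with the same 5-fold cross validation.
# scikit-learn reports log loss negated, so flip the sign back.
results = {}
for name, pipe in candidates.items():
    scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_log_loss")
    results[name] = -scores.mean()

# Lowest mean validation loss wins.
best_name = min(results, key=results.get)
print(results)
print("Best candidate:", best_name)
```

Putting the scaler inside the pipeline means it is refitted on the training part of every fold, so no information from the validation fold leaks into pre-processing.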

There are multiple loss functions; choose one based on the problem statement.

If both your train and validation loss are very low and close to each other, the model is likely to generalize well.
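
To check this in practice, you can ask scikit-learn for both the train and validation loss during cross validation. The sketch below compares an unrestricted decision tree with a shallow one; the models, the synthetic dataset and the helper `train_valid_loss` are hypothetical choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset standing in for your own data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)


def train_valid_loss(model):
    """Return (mean train log loss, mean validation log loss) over 5 folds."""
    cv = cross_validate(model, X, y, cv=5, scoring="neg_log_loss",
                        return_train_score=True)
    # Scores are negated log loss, so flip the sign back.
    return -cv["train_score"].mean(), -cv["test_score"].mean()


# An unrestricted tree can memorize the training folds.
deep_train, deep_valid = train_valid_loss(
    DecisionTreeClassifier(random_state=0))

# A shallow tree is more constrained.
shallow_train, shallow_valid = train_valid_loss(
    DecisionTreeClassifier(max_depth=2, random_state=0))

print("deep tree    train/valid loss:", deep_train, deep_valid)
print("shallow tree train/valid loss:", shallow_train, shallow_valid)
```

A train loss near zero combined with a much higher validation loss, as the unrestricted tree shows here, is a sign of overfitting rather than of a well-generalized model.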

Many of us reach for ensemble techniques first. Instead of trying only ensemble models, start with simple models such as linear regression, logistic regression and other small algorithms, and only then try ensemble techniques.



Hyperparameter Tuning

After settling on one model that gives you a low loss, you can improve it further with hyperparameter tuning.
The disadvantage of hyperparameter tuning is its cost: a grid search on a large dataset can take many hours (for example, 8 to 9 hours) to complete.
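
A minimal grid search sketch with scikit-learn's `GridSearchCV` is shown below. The random forest, the parameter grid and the synthetic dataset are assumptions chosen to keep the run short. The cost grows quickly because every parameter combination is trained once per fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic dataset standing in for your own data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical grid: 2 x 2 = 4 parameter combinations.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

n_folds = 3
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=n_folds,
    scoring="neg_log_loss",
)
search.fit(X, y)

# Number of cross-validation fits: combinations x folds
# (the final refit on the full data is not counted here).
n_combinations = len(search.cv_results_["params"])
n_fits = n_combinations * n_folds

print("Best parameters:", search.best_params_)
print("Best mean validation log loss:", -search.best_score_)
print("Cross-validation fits:", n_fits)
```

Even this tiny grid needs 12 model fits, which is why adding more parameters or values on a large dataset can push a grid search into hours.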

Model Testing

After completing the model, test it by creating a web app for the model.
