In this post, we will look at how to approach building a well-generalized machine learning model, following the steps many people use in industry and in competitions.
Let's dive in.
Steps to follow to create an ML model:
- Data collection
- Data pre-processing
- Feature Engineering
- Model Building
- Evaluation
- Hyperparameter tuning
- Model Testing
We will go through each of these steps in detail.
Data Collection
The biggest challenge in most problems is finding a relevant dataset. Many sources are available, such as Kaggle and the UCI Machine Learning Repository. You can also collect data through web scraping, or by building a pipeline that fetches data from a live stream.
Data Pre-Processing
After collecting the dataset, the next step is to pre-process the data, for example:
1) Handling NaN values
2) Handling the target class distribution (classification problems)
3) Removing or otherwise handling outliers
4) Standardization
5) Normalization (optional)
6) Target column encoding
7) Handling categorical data, i.e. converting text data to numerical data
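Several of the steps above can be sketched with pandas and sklearn. This is only a minimal illustration on a made-up table: the column names (`age`, `city`, `label`), the median/mode imputation, and the 1.5 * IQR clipping rule are assumptions chosen for the example, not the only reasonable choices.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy data; column names and values are made up for illustration.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 29.0, 120.0],
    "city": ["Pune", "Delhi", "Pune", np.nan, "Delhi", "Mumbai"],
    "label": ["no", "yes", "no", "yes", "no", "yes"],
})

# 1) Handle NaN values: median for numeric columns, most frequent value for text.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# 2) Inspect the target class distribution before deciding how to handle it.
class_counts = df["label"].value_counts()
print(class_counts)

# 3) Handle outliers: clip values to the 1.5 * IQR fences (one common rule).
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df["age"] = df["age"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 4) Standardization: zero mean, unit variance.
# In a real project, fit the scaler on the training split only.
df["age"] = StandardScaler().fit_transform(df[["age"]]).ravel()

# 6) Target column encoding: map class names to integers.
y = LabelEncoder().fit_transform(df["label"])

# 7) Categorical data: one-hot encode the text column.
X = pd.get_dummies(df.drop(columns="label"), columns=["city"])
print(X.head())
```

The outlier value 120 is clipped rather than dropped here, which keeps the row; dropping the row is an equally valid option when the value is clearly an error.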
Feature Engineering
After pre-processing, try to create new features from the existing ones where possible, for example by adding or multiplying two features; many other combinations are possible. Then do feature selection: try to keep the features that are most correlated with the target.
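As a rough sketch of both ideas, the snippet below builds a sum and a product feature and then ranks all features by absolute correlation with the target. The data is synthetic, the column names are hypothetical, and the 0.3 cutoff is an arbitrary assumption for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data; the column names are hypothetical.
df = pd.DataFrame({
    "a": rng.normal(size=200),
    "b": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
# Assumed target that depends on the product of a and b, plus a little noise.
df["target"] = df["a"] * df["b"] + 0.1 * rng.normal(size=200)

# Feature creation: combine existing features.
df["a_plus_b"] = df["a"] + df["b"]
df["a_times_b"] = df["a"] * df["b"]

# Feature selection: rank features by absolute correlation with the target.
corr = (
    df.drop(columns="target")
    .corrwith(df["target"])
    .abs()
    .sort_values(ascending=False)
)
print(corr)

# Keep features above an arbitrary threshold (0.3 is an assumption).
selected = corr[corr > 0.3].index.tolist()
print(selected)
```

Correlation only captures linear relationships, so a feature with a low score can still be useful to a non-linear model.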
Model Building
Important Point
Always use some cross-validation technique to help the model generalize. It may or may not improve your results, but it works well in some hackathons and is recommended.
Two good cross-validation techniques are:
- K-Fold Cross Validation
- Stratified Cross Validation (keeps the class proportions the same in every fold)
These are the most widely used cross-validation techniques.
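Both techniques are available in sklearn. The sketch below runs each on a synthetic, imbalanced dataset; the dataset, the logistic regression model, and the choice of 5 folds are assumptions made only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Synthetic, imbalanced data (roughly 80% / 20%) used only for illustration.
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0
)
model = LogisticRegression(max_iter=1000)

# Plain K-Fold: splits the rows into 5 folds without looking at the labels.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_scores = cross_val_score(model, X, y, cv=kfold)

# Stratified K-Fold: each fold keeps roughly the same class proportions.
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
skfold_scores = cross_val_score(model, X, y, cv=skfold)

print("K-Fold accuracy:    ", kfold_scores.mean())
print("Stratified accuracy:", skfold_scores.mean())
```

For classification with an uneven class distribution, the stratified version is usually the safer default because no fold ends up with too few examples of the rare class.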
There are multiple ways to build a model. The best approach is to build one whose loss is low on both your training and validation data.
You can try several models at once by building a pipeline using sklearn.
There are multiple loss functions; choose one based on the problem statement.
The lower both the training and validation losses are, the better your model is likely to generalize.
Most of us reach for ensemble techniques first. Instead of trying only ensemble models, start with simple models such as linear regression, logistic regression, and other small algorithms, and then try ensemble techniques.
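One way to follow this advice is to put each candidate in an sklearn pipeline and compare them with the same cross-validation splits and the same loss. The sketch below compares a simple model with an ensemble; the synthetic dataset, the two candidates, and log loss as the metric are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Start with a simple model, then add an ensemble as a second candidate.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": make_pipeline(
        StandardScaler(), RandomForestClassifier(n_estimators=100, random_state=0)
    ),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, pipe in candidates.items():
    # sklearn reports negated log loss so that higher is always better;
    # flip the sign to get the usual "lower is better" loss.
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="neg_log_loss")
    results[name] = -scores.mean()

for name, loss in results.items():
    print(f"{name}: mean validation log loss = {loss:.4f}")

best = min(results, key=results.get)
print("Lowest validation loss:", best)
```

Because the scaler sits inside the pipeline, it is refit on the training part of every fold, so no information from the validation fold leaks into pre-processing.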
Hyperparameter Tuning
After settling on a model that gives you low loss, you can improve it further with hyperparameter tuning.
The disadvantage of hyperparameter tuning is its cost: a grid search on a large dataset can take many hours to complete, for example 8 to 9 hours.
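A small grid search shows where the cost comes from: every combination in the grid is trained once per cross-validation fold. The sketch below uses sklearn's `GridSearchCV`; the model, the grid values, and the 3 folds are assumptions kept deliberately tiny so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# A deliberately small grid: 2 x 3 = 6 combinations.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
}
n_folds = 3

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=n_folds,
    scoring="neg_log_loss",
)
search.fit(X, y)

# Each combination is fitted once per fold, so the cost grows multiplicatively.
n_combinations = len(search.cv_results_["params"])
n_fits = n_combinations * n_folds
print("Combinations tried:", n_combinations)
print("Model fits (excluding the final refit):", n_fits)
print("Best parameters:", search.best_params_)
print("Best mean validation log loss:", -search.best_score_)
```

Adding one more parameter with four values would multiply the number of fits by four, which is why large grids on large datasets become slow. sklearn's `RandomizedSearchCV` is a common alternative that samples a fixed number of combinations instead of trying all of them.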
Testing
Once the model is complete, test it by creating a web app for it.