What is PCA? Why and when should we use it?
Everyone who works in machine learning will have heard of this term, PCA. What exactly does it do? Let's see, one step at a time.
What is PCA?
PCA stands for Principal Component Analysis, and it is a linear dimensionality reduction technique. Many non-linear dimensionality reduction techniques exist, but linear methods are more mature.
Why and when should we use it?
If a dataset contains many features, say 50, 60, or even 100, we can use PCA to understand the data, for example by finding which features are most important for model building, without losing the main information about the data.
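For example, to see how much information a handful of components retains, we can look at the explained variance ratio. Here is a minimal sketch using scikit-learn; the random matrix X is just a stand-in for a real 100-feature dataset:

# a quick look at how much variance a few components capture
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(500, 100)                 # stand-in for a dataset with 100 features
pca = PCA(n_components=10)
pca.fit(X)
print(pca.explained_variance_ratio_)         # per-component share of the total variance
print(pca.explained_variance_ratio_.sum())   # total share kept by 10 components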
How to implement PCA from scratch for the MNIST dataset
Steps to implement PCA
Step 1: Standardize the dataset.
Step 2: Calculate the covariance matrix for the features in the dataset.
Step 3: Calculate the eigenvalues and eigenvectors for the covariance matrix.
Step 4: Sort eigenvalues and their corresponding eigenvectors.
Step 5: Pick k eigenvalues and form a matrix of their eigenvectors.
Step 6: Transform the original matrix.
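Before the MNIST walkthrough, here is a minimal generic sketch of these six steps in NumPy. The function name pca_from_scratch and the choice of np.linalg.eigh are my own, not part of the walkthrough below:

import numpy as np

def pca_from_scratch(X, k):
    # Step 1: standardize each feature (assumes no zero-variance columns)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the features
    cov = np.cov(X_std.T)
    # Step 3: eigenvalues and eigenvectors (eigh, since cov is symmetric)
    values, vectors = np.linalg.eigh(cov)
    # Step 4: sort eigenvalues (and their eigenvectors) in descending order
    order = np.argsort(values)[::-1]
    # Step 5: keep the eigenvectors of the top k eigenvalues
    top_vectors = vectors[:, order[:k]]
    # Step 6: project the standardized data onto the new axes
    return X_std @ top_vectors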
Let's see how we can implement PCA from scratch using Python.
# Necessary dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from numpy.linalg import eig
from scipy.linalg import eigh
import seaborn as sn
# reading the training data
data = pd.read_csv('train.csv')
# creating the target column (y)
target = data['label']
x = data.drop(['label'],axis=1)
print(x.shape)
print(target.shape)
# display or plot a random sample as an image
plt.figure(figsize=(7, 7))
idx = 100
grid_data = x.iloc[idx].to_numpy().reshape(28, 28)  # reshape from 1d row to 2d pixel array
plt.imshow(grid_data, interpolation="none", cmap="gray")
plt.show()
print(target[idx])
IMPORTANT STEP
Before applying PCA, the data has to be standardized. We can use the sklearn library to do this, or we can create our own class.
# pre-processing class, just like the library version
class StandardScaler(object):
    def __init__(self):
        pass

    def fit(self, X):
        self.mean_ = np.mean(X, axis=0)
        self.scale_ = np.std(X - self.mean_, axis=0)
        return self

    def transform(self, X):
        return (X - self.mean_) / self.scale_

    def fit_transform(self, X):
        return self.fit(X).transform(X)
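For comparison, the library route is a one-liner. sklearn's built-in scaler does the same job, and it also guards against zero-variance columns, which our hand-rolled version does not. The alias below is just to avoid shadowing the class we defined above:

# equivalent call using scikit-learn's built-in scaler
from sklearn.preprocessing import StandardScaler as SklearnScaler
standardized = SklearnScaler().fit_transform(x)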
# Calling the class on the feature matrix (pixels only, without the label column)
standardized_data = StandardScaler().fit_transform(x)
# Removing the NaN values (zero-variance pixels produce 0/0 during scaling)
standardized_data = np.nan_to_num(standardized_data)
sample_data = standardized_data
print(sample_data.T.shape)
# finding the covariance matrix
covar_matrix = np.cov(standardized_data.T)
covar_matrix.shape
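As a quick sanity check (my own addition, not part of the original walkthrough), np.cov on centered data should match the textbook formula S = X.T @ X / (n - 1), since the mean term vanishes when the data is already centered:

# sanity check: covariance matrix matches the textbook formula
n = sample_data.shape[0]
manual_cov = (sample_data.T @ sample_data) / (n - 1)
print(np.allclose(covar_matrix, manual_cov))   # expected: True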
# finding eigenvalues and eigenvectors of the covariance matrix
# subset_by_index=[782, 783] selects the two largest of the 784 eigenvalues
# (it replaces the older, deprecated eigvals keyword in SciPy)
values, vectors = eigh(covar_matrix, subset_by_index=[782, 783])
vectors.shape
values.shape
len(vectors)
vectors = vectors.T[::-1]   # transpose and flip so rows are eigenvectors in descending eigenvalue order
values = values[::-1]       # keep the eigenvalues in the matching (descending) order
print("Updated shape of eigen vectors = ", vectors.shape)
# now vectors[0] is the eigenvector of the 1st principal component
# and vectors[1] is the eigenvector of the 2nd principal component
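Another small check worth doing before projecting: each eigenpair returned by eigh should satisfy the defining equation C v = lambda v. A minimal sketch:

# sanity check: each eigenpair satisfies covar_matrix @ v = lambda * v
for lam, v in zip(values, vectors):
    print(np.allclose(covar_matrix @ v, lam * v))   # expected: True for both pairs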
# projecting the original data samples onto the plane
# formed by the two principal eigenvectors, via matrix multiplication
new_coordinates = np.matmul(vectors, sample_data.T)
print (" resultanat new data points' shape ", vectors.shape, "X", sample_data.T.shape," = ", new_coordinates.shape)
# appending the label to the 2d projected data (vertical stack)
new_coordinates = np.vstack((new_coordinates, target)).T
new_coordinates
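It is also worth asking how much of the total variance these two components actually capture. Since the eigenvalues of the covariance matrix sum to its trace, we can estimate this directly (this check is my own addition):

# fraction of the total variance captured by the top two components
total_variance = np.trace(covar_matrix)
print(values / total_variance)          # per-component share
print(values.sum() / total_variance)    # combined share of the top two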
# creating a new data frame for plotting the labeled points
dataframe = pd.DataFrame(data=new_coordinates, columns=("1st_principal", "2nd_principal", "label"))
Visualization of PCA components using Matplotlib
# plotting the 2d projection, one colour per digit class
plt.figure(figsize=(10, 10))
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Principal Component - 1', fontsize=20)
plt.ylabel('Principal Component - 2', fontsize=20)
plt.title("Principal Component Analysis MNIST Dataset", fontsize=20)
digits = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
colors = ['r', 'g', 'b', 'y', 'm', 'c', 'k', 'orange', 'tab:olive', 'tab:purple']
for digit, color in zip(digits, colors):
    indicesToKeep = dataframe['label'] == digit
    plt.scatter(dataframe.loc[indicesToKeep, '1st_principal'],
                dataframe.loc[indicesToKeep, '2nd_principal'],
                c=color, s=50, label=digit)
plt.legend(prop={'size': 15})
plt.show()
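Finally, as a cross-check against a trusted implementation (my own addition), sklearn's PCA should produce the same projection up to sign flips, since the sign of each eigenvector is arbitrary:

# cross-check against scikit-learn (component signs may be flipped)
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
sklearn_coords = pca.fit_transform(sample_data)
ours = dataframe[["1st_principal", "2nd_principal"]].to_numpy()
print(np.allclose(np.abs(sklearn_coords), np.abs(ours), atol=1e-6))   # expected: True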
All the code is available in my GitHub account:
https://github.com/VpkPrasanna/PCA_scratch/blob/master/PCA.ipynb