PCA from scratch for MNIST dataset

What is PCA? Why do we use it? When do we use it?
            Everyone who works in machine learning has heard the term PCA. Let's see what it actually does, one step at a time.


What is PCA?
    PCA stands for Principal Component Analysis, and it is a linear dimensionality reduction technique. Many non-linear dimensionality reduction techniques exist as well, but the linear methods are more mature.


Why and when do we use it?
   If a dataset contains many features, say 50, 60, or even 100, we can use PCA to understand the data, for example to find which directions carry the most information for model building, without losing the main information in the data.
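Just to get a feel for what "without losing the main information" means, here is a minimal sketch using scikit-learn (the rest of this post builds PCA from scratch instead); the matrix X below is random placeholder data standing in for a real 100-feature dataset:

# minimal sketch, assumes scikit-learn is installed;
# X is a random placeholder for a real 100-feature dataset
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 100)
pca = PCA(n_components=10).fit(X)
# fraction of the total variance kept by just 10 components
print(pca.explained_variance_ratio_.sum())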

How to implement PCA from scratch for the MNIST dataset
    Steps to implement PCA (a small worked example follows the list):

Step 1: Standardize the dataset.

Step 2: Calculate the covariance matrix for the features in the dataset.

Step 3: Calculate the eigenvalues and eigenvectors for the covariance matrix.

Step 4: Sort eigenvalues and their corresponding eigenvectors.

Step 5: Pick k eigenvalues and form a matrix of eigenvectors.

Step 6: Transform the original matrix.
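Before touching MNIST, here is a minimal sketch of all six steps on a tiny made-up 2-feature dataset (the numbers are arbitrary, just for illustration):

import numpy as np

# a tiny made-up dataset: 5 samples, 2 features
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

# Step 1: standardize
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the features
cov = np.cov(X_std.T)

# Step 3: eigenvalues and eigenvectors
vals, vecs = np.linalg.eigh(cov)

# Step 4: sort by decreasing eigenvalue
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Step 5: pick the top k eigenvectors
k = 1
W = vecs[:, :k]

# Step 6: project the standardized data
X_pca = X_std @ W
print(X_pca.shape)   # (5, 1)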



Let's see how we can implement PCA from scratch using Python.

# Necessary dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
from numpy.linalg import eig
from scipy.linalg import eigh


# reading the training data
data = pd.read_csv('train.csv')
# creating the target (y) column
target = data['label']
# drop the target column; the remaining columns are the features (x)
x = data.drop(['label'], axis=1)

print(x.shape)
print(target.shape)

# display a random digit from the training data
plt.figure(figsize=(7,7))
idx = 100
grid_data = x.iloc[idx].to_numpy().reshape(28,28)  # reshape from a 1-d pixel row to a 28x28 image
plt.imshow(grid_data, interpolation="none", cmap="gray")
plt.show()
print(target[idx])


IMPORTANT STEP
Before applying PCA the data has to be standardized. We can use the sklearn library for this, or we can write our own small class:
# pre-processing class, just like the library version
class StandardScaler(object):
    def __init__(self):
        pass

    def fit(self, X):
        # per-feature mean and standard deviation
        self.mean_ = np.mean(X, axis=0)
        self.scale_ = np.std(X, axis=0)
        return self

    def transform(self, X):
        # zero-mean, unit-variance scaling
        # (features with zero variance produce NaN here; handled below)
        return (X - self.mean_) / self.scale_

    def fit_transform(self, X):
        return self.fit(X).transform(X)

# calling the class on the feature matrix (not on `data`, which still contains the label column)
standardized_data = StandardScaler().fit_transform(x)
# replacing the NaN values (produced by zero-variance pixels) with 0
standardized_data = np.nan_to_num(standardized_data)
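Where do the NaNs come from? Many border pixels in MNIST are 0 in every image, so their standard deviation is 0 and the division above gives 0/0 = NaN. A quick check (using the feature DataFrame `x` loaded earlier):

# count the pixels that never vary across the training images
print((x.std(axis=0) == 0).sum())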

sample_data = standardized_data

print(sample_data.T.shape)
# finding the covariance matrix of the features
covar_matrix = np.cov(standardized_data.T)
print(covar_matrix.shape)   # (784, 784)
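For intuition, np.cov on the transposed matrix is just the familiar formula S = Xcᵀ Xc / (n − 1) applied to the centred columns. A minimal sanity-check sketch:

# sanity check: np.cov agrees with the explicit formula
n = standardized_data.shape[0]
Xc = standardized_data - standardized_data.mean(axis=0)  # columns are already ~centred
manual_cov = (Xc.T @ Xc) / (n - 1)
print(np.allclose(manual_cov, covar_matrix))   # True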

# finding the eigenvalues and eigenvectors of the covariance matrix.
# we ask only for the top two (indices 782 and 783 out of 0..783).
# note: older SciPy spelled this eigvals=(782, 783); newer releases use subset_by_index.
values, vectors = eigh(covar_matrix, subset_by_index=[782, 783])

print(vectors.shape)   # (784, 2)
print(values.shape)    # (2,)

vectors = vectors.T
print("Updated shape of eigen vectors = ", vectors.shape)   # (2, 784)
# eigh returns eigenvalues in ascending order, so after the transpose
# vectors[1] is the eigenvector of the 1st (largest) principal component
# and vectors[0] is the eigenvector of the 2nd principal component
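A quick way to convince yourself that the subset call really returned the top-2 directions is to compute the full spectrum and compare (fine to run once on the 784×784 matrix):

# sanity check: the two subset eigenvalues are the two largest overall
all_values = np.linalg.eigvalsh(covar_matrix)   # full spectrum, ascending
print(np.allclose(all_values[-2:], np.sort(values)))   # True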

# projecting the original data sample onto the plane
# formed by the two principal eigenvectors, via matrix multiplication
new_coordinates = np.matmul(vectors, sample_data.T)

print (" resultanat new data points' shape ", vectors.shape, "X", sample_data.T.shape," = ", new_coordinates.shape)
# appending label to the 2d projected data(vertical stack)
new_coordinates = np.vstack((new_coordinates, labels)).T
new_coordinates
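It is also worth knowing how much of the total variance these two components capture. Since each eigenvalue is the variance along its component, it is just the two retained eigenvalues over the sum of all of them (recomputing the full spectrum here so the snippet stands alone):

# fraction of the total variance explained by the top two components
all_values = np.linalg.eigvalsh(covar_matrix)
print(values.sum() / all_values.sum())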


# creating a new data frame for plotting the labelled points
dataframe = pd.DataFrame(data=new_coordinates, columns=("1st_principal", "2nd_principal", "label"))

Visualization of the PCA components using Matplotlib
plt.figure(figsize=(10,10))
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Principal Component - 1', fontsize=20)
plt.ylabel('Principal Component - 2', fontsize=20)
plt.title("Principal Component Analysis of the MNIST Dataset", fontsize=20)
digits = [0,1,2,3,4,5,6,7,8,9]
colors = ['r', 'g', 'b', 'y', 'm', 'c', 'k', 'tab:orange', 'tab:purple', 'tab:olive']
for digit, color in zip(digits, colors):
    indicesToKeep = dataframe['label'] == digit
    plt.scatter(dataframe.loc[indicesToKeep, '1st_principal'],
                dataframe.loc[indicesToKeep, '2nd_principal'],
                c=color, s=50)

plt.legend(digits, prop={'size': 15})
plt.show()
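As a final check, scikit-learn's PCA should produce an essentially identical two-component projection (the signs of components can come out flipped, which only mirrors the plot). A minimal comparison sketch:

# cross-check against scikit-learn (component signs may be flipped)
from sklearn import decomposition
pca = decomposition.PCA(n_components=2)
pca_data = pca.fit_transform(sample_data)
print(pca_data.shape)                  # (n_samples, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component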


All the code is available in my GitHub account:
https://github.com/VpkPrasanna/PCA_scratch/blob/master/PCA.ipynb


