What is PCA? Why and when should we use it?
Everyone who works in machine learning will have heard of this term, PCA. What exactly does it do? Let's see, one step at a time.
What is PCA?
PCA stands for Principal Component Analysis, and it is a linear dimensionality reduction technique. Many non-linear dimensionality reduction techniques exist, but linear methods are more mature.
Why and when should we use it?
If a dataset contains many features, say 50, 60, or even 100, we can use PCA to understand the data, for example by finding which features are most important for model building, without losing the main information about the data.
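For example, to see how much information a handful of components retains, we can look at the explained variance ratio. Here is a minimal sketch using scikit-learn; the random matrix X is just a stand-in for a real 100-feature dataset:

# a quick look at how much variance a few components capture
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(500, 100)                 # stand-in for a dataset with 100 features
pca = PCA(n_components=10)
pca.fit(X)
print(pca.explained_variance_ratio_)         # per-component share of the total variance
print(pca.explained_variance_ratio_.sum())   # total share kept by 10 components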
How to implement PCA from scratch for the MNIST dataset
Steps to implement PCA
Step 1: Standardize the dataset.
Step 2: Calculate the covariance matrix for the features in the dataset.
Step 3: Calculate the eigenvalues and eigenvectors for the covariance matrix.
Step 4: Sort eigenvalues and their corresponding eigenvectors.
Step 5: Pick k eigenvalues and form a matrix of their eigenvectors.
Step 6: Transform the original matrix.
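Before the MNIST walkthrough, here is a minimal generic sketch of these six steps in NumPy. The function name pca_from_scratch and the choice of np.linalg.eigh are my own, not part of the walkthrough below:

import numpy as np

def pca_from_scratch(X, k):
    # Step 1: standardize each feature (assumes no zero-variance columns)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the features
    cov = np.cov(X_std.T)
    # Step 3: eigenvalues and eigenvectors (eigh, since cov is symmetric)
    values, vectors = np.linalg.eigh(cov)
    # Step 4: sort eigenvalues (and their eigenvectors) in descending order
    order = np.argsort(values)[::-1]
    # Step 5: keep the eigenvectors of the top k eigenvalues
    top_vectors = vectors[:, order[:k]]
    # Step 6: project the standardized data onto the new axes
    return X_std @ top_vectors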
Let's see how we can implement PCA from scratch using Python.
# Necessary dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from numpy.linalg import eig
from scipy.linalg import eigh
import seaborn as sn
# reading the training data
data = pd.read_csv('train.csv')
# creating the target column (y)
target = data['label']
x = data.drop(['label'],axis=1)
print(x.shape)
print(target.shape)
# display or plot a random sample as an image
plt.figure(figsize=(7, 7))
idx = 100
grid_data = x.iloc[idx].to_numpy().reshape(28, 28)  # reshape from 1d row to 2d pixel array
plt.imshow(grid_data, interpolation="none", cmap="gray")
plt.show()
print(target[idx])
IMPORTANT STEP
Before applying PCA, the data has to be standardized. We can use the sklearn library to do this, or we can create our own class.
# pre-processing class, just like the library version
class StandardScaler(object):
    def __init__(self):
        pass

    def fit(self, X):
        self.mean_ = np.mean(X, axis=0)
        self.scale_ = np.std(X - self.mean_, axis=0)
        return self

    def transform(self, X):
        return (X - self.mean_) / self.scale_

    def fit_transform(self, X):
        return self.fit(X).transform(X)
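For comparison, the library route is a one-liner. sklearn's built-in scaler does the same job, and it also guards against zero-variance columns, which our hand-rolled version does not. The alias below is just to avoid shadowing the class we defined above:

# equivalent call using scikit-learn's built-in scaler
from sklearn.preprocessing import StandardScaler as SklearnScaler
standardized = SklearnScaler().fit_transform(x)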
# Calling the class on the feature matrix (pixels only, without the label column)
standardized_data = StandardScaler().fit_transform(x)
# Removing the NaN values (zero-variance pixels produce 0/0 during scaling)
standardized_data = np.nan_to_num(standardized_data)
sample_data = standardized_data
print(sample_data.T.shape)
# finding the covariance matrix
covar_matrix = np.cov(standardized_data.T)
covar_matrix.shape
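As a quick sanity check (my own addition, not part of the original walkthrough), np.cov on centered data should match the textbook formula S = X.T @ X / (n - 1), since the mean term vanishes when the data is already centered:

# sanity check: covariance matrix matches the textbook formula
n = sample_data.shape[0]
manual_cov = (sample_data.T @ sample_data) / (n - 1)
print(np.allclose(covar_matrix, manual_cov))   # expected: True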
# finding eigenvalues and eigenvectors of the covariance matrix
# subset_by_index=[782, 783] selects the two largest of the 784 eigenvalues
# (it replaces the older, deprecated eigvals keyword in SciPy)
values, vectors = eigh(covar_matrix, subset_by_index=[782, 783])
vectors.shape
values.shape
len(vectors)
vectors = vectors.T[::-1]   # transpose and flip so rows are eigenvectors in descending eigenvalue order
values = values[::-1]       # keep the eigenvalues in the matching (descending) order
print("Updated shape of eigen vectors = ", vectors.shape)
# now vectors[0] is the eigenvector of the 1st principal component
# and vectors[1] is the eigenvector of the 2nd principal component
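Another small check worth doing before projecting: each eigenpair returned by eigh should satisfy the defining equation C v = lambda v. A minimal sketch:

# sanity check: each eigenpair satisfies covar_matrix @ v = lambda * v
for lam, v in zip(values, vectors):
    print(np.allclose(covar_matrix @ v, lam * v))   # expected: True for both pairs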
# projecting the original data samples onto the plane
# formed by the two principal eigenvectors, via matrix multiplication
new_coordinates = np.matmul(vectors, sample_data.T)
print (" resultanat new data points' shape ", vectors.shape, "X", sample_data.T.shape," = ", new_coordinates.shape)
# appending the label to the 2d projected data (vertical stack)
new_coordinates = np.vstack((new_coordinates, target)).T
new_coordinates
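It is also worth asking how much of the total variance these two components actually capture. Since the eigenvalues of the covariance matrix sum to its trace, we can estimate this directly (this check is my own addition):

# fraction of the total variance captured by the top two components
total_variance = np.trace(covar_matrix)
print(values / total_variance)          # per-component share
print(values.sum() / total_variance)    # combined share of the top two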
# creating a new data frame for plotting the labeled points
dataframe = pd.DataFrame(data=new_coordinates, columns=("1st_principal", "2nd_principal", "label"))
Visualization of PCA components using Matplotlib
# plotting the 2d projection, one colour per digit class
plt.figure(figsize=(10, 10))
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Principal Component - 1', fontsize=20)
plt.ylabel('Principal Component - 2', fontsize=20)
plt.title("Principal Component Analysis MNIST Dataset", fontsize=20)
digits = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
colors = ['r', 'g', 'b', 'y', 'm', 'c', 'k', 'orange', 'tab:olive', 'tab:purple']
for digit, color in zip(digits, colors):
    indicesToKeep = dataframe['label'] == digit
    plt.scatter(dataframe.loc[indicesToKeep, '1st_principal'],
                dataframe.loc[indicesToKeep, '2nd_principal'],
                c=color, s=50, label=digit)
plt.legend(prop={'size': 15})
plt.show()
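Finally, as a cross-check against a trusted implementation (my own addition), sklearn's PCA should produce the same projection up to sign flips, since the sign of each eigenvector is arbitrary:

# cross-check against scikit-learn (component signs may be flipped)
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
sklearn_coords = pca.fit_transform(sample_data)
ours = dataframe[["1st_principal", "2nd_principal"]].to_numpy()
print(np.allclose(np.abs(sklearn_coords), np.abs(ours), atol=1e-6))   # expected: True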
All the code is available in my GitHub account:
https://github.com/VpkPrasanna/PCA_scratch/blob/master/PCA.ipynb