import pandas as pd

# Replace 'your_data_file.csv' with the actual path to your CSV file
file_path = 'https://storage.googleapis.com/rg-ai-bootcamp/machine-learning/StudentsPerformance.csv'

# Load the CSV data into a pandas DataFrame
data = pd.read_csv(file_path)

numerical_data = data[['math score', 'reading score', 'writing score']]
numerical_data
Principal Component Analysis
Principal Component Analysis (PCA) is a dimensionality reduction technique: it reduces the number of features in a dataset while retaining as much of the information in the dataset as possible. This concept might be daunting at first, but it is actually quite simple. Let's start with a very simple example.
Student scores
To understand PCA, it's a lot more fun to use real data. We'll use a dataset of student scores on various tests that's available from Kaggle. For now let's focus on only three features: math score, reading score, and writing score.
Let’s plot the data to see what it looks like.
import pandas as pd
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Create a subplot with axis labels
fig = make_subplots(rows=1, cols=3, subplot_titles=("Reading vs Writing", "Math vs Reading", "Math vs Writing"))

# Add scatter plots with axis labels
# Reading vs Writing
fig.add_trace(go.Scatter(x=data['reading score'], y=data['writing score'], mode='markers', name='Reading vs Writing',
                         hovertemplate='Reading Score: %{x:.2f}<br>Writing Score: %{y:.2f}<extra></extra>'), row=1, col=1)
fig.update_xaxes(title_text="Reading Score", row=1, col=1)
fig.update_yaxes(title_text="Writing Score", row=1, col=1)

# Math vs Reading
fig.add_trace(go.Scatter(x=data['math score'], y=data['reading score'], mode='markers', name='Math vs Reading',
                         hovertemplate='Math Score: %{x:.2f}<br>Reading Score: %{y:.2f}<extra></extra>'), row=1, col=2)
fig.update_xaxes(title_text="Math Score", row=1, col=2)
fig.update_yaxes(title_text="Reading Score", row=1, col=2)

# Math vs Writing
fig.add_trace(go.Scatter(x=data['math score'], y=data['writing score'], mode='markers', name='Math vs Writing',
                         hovertemplate='Math Score: %{x:.2f}<br>Writing Score: %{y:.2f}<extra></extra>'), row=1, col=3)
fig.update_xaxes(title_text="Math Score", row=1, col=3)
fig.update_yaxes(title_text="Writing Score", row=1, col=3)

# Update layout
fig.update_layout(height=500, width=1500, title_text="Students' Performance Comparisons")
fig.show()
Looking at the 1000 student scores plotted above, we can see some patterns:
- Students who perform well on reading tend to perform well on writing as well
- Students who perform well on reading or writing don't necessarily perform well on math, and vice versa
- Students who do poorly on any of the tests tend to do poorly on all of the tests, and vice versa
As you can see, the skills of reading and writing are more aligned with each other than either of them is with math. This "alignment" is something called covariance. Covariance is a measure of how two variables change together.
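For reference, the sample covariance that pandas computes with .cov() below is

\[ \text{cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]

where \(\bar{x}\) and \(\bar{y}\) are the means of the two variables: a large positive value means the two scores tend to sit above (or below) their means together. Let's check the calculation below.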
cov_reading_writing = data['reading score'].cov(data['writing score'])
print("Covariance between Reading Score and Writing Score:", cov_reading_writing)

cov_math_writing = data['math score'].cov(data['writing score'])
print("Covariance between Math Score and Writing Score:", cov_math_writing)

cov_math_reading = data['math score'].cov(data['reading score'])
print("Covariance between Math Score and Reading Score:", cov_math_reading)
Covariance between Reading Score and Writing Score: 211.78666066066071
Covariance between Math Score and Writing Score: 184.93913313313314
Covariance between Math Score and Reading Score: 180.99895795795805
As you can see, the math validates our observation: the covariance between reading and writing is higher than the covariance between reading and math or between writing and math. This means that if a student does well or poorly on reading, they tend to do the same on writing, so if we, say, run a class to improve students' reading skills, we can expect their writing skills to improve as well.
But because the covariance between math and either reading or writing is still positive, can we say that if we improve students' reading or writing skills, their math skills will improve as well? Maybe; intuitively, a student who gets better at reading or writing may become a better learner overall, but that conclusion is a long shot compared to the one about reading and writing, where it's much more obvious that improving one tends to improve the other.
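Covariance also depends on the scale of the data, so as an optional sanity check we can look at the Pearson correlation, which is just the covariance divided by the product of the two standard deviations; the reading and writing pair should come out highest here as well (a quick sketch using pandas' built-in .corr()):

corr_reading_writing = data['reading score'].corr(data['writing score'])
corr_math_writing = data['math score'].corr(data['writing score'])
corr_math_reading = data['math score'].corr(data['reading score'])

# Correlations are scale-free, so they are easier to compare than raw covariances
print("Correlation between Reading Score and Writing Score:", corr_reading_writing)
print("Correlation between Math Score and Writing Score:", corr_math_writing)
print("Correlation between Math Score and Reading Score:", corr_math_reading)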
Let’s do some PCA
So far we have introduced our dataset and a bit of covariance, which will come in handy for understanding PCA. Now let's apply PCA to our data. PCA is a technique that reduces the dimensionality of a dataset while trying to retain as much information as possible, so let's try to reduce our three-feature dataset to two features.
import plotly.express as px
from sklearn.decomposition import PCA

# Performing PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(numerical_data)

# Creating a DataFrame for the PCA results
pca_df = pd.DataFrame(data=principal_components, columns=['Principal Component 1', 'Principal Component 2'])

# Adding original scores as hover_data
pca_df['Math Score'] = data['math score']
pca_df['Reading Score'] = data['reading score']
pca_df['Writing Score'] = data['writing score']

# Create a scatter plot using Plotly
fig = px.scatter(
    pca_df,
    x='Principal Component 1',
    y='Principal Component 2',
    hover_data=['Math Score', 'Reading Score', 'Writing Score'],
    title='PCA of Student Performance',
    labels={'Principal Component 1': 'PC1', 'Principal Component 2': 'PC2'}
)

# Show the plot
fig.show()
Above is a two-dimensional plot of our data, reduced from three dimensions. The reduction works by preserving the relationships between features as much as possible. How so? Look at the plot below.
import plotly.express as px
import plotly.subplots as sp

# Adding original scores as hover_data
pca_df['Math Score'] = data['math score']
pca_df['Reading Score'] = data['reading score']
pca_df['Writing Score'] = data['writing score']

# Create scatter plots using Plotly
fig1 = px.scatter(
    pca_df,
    x='Principal Component 1',
    y='Principal Component 2',
    hover_data=['Math Score', 'Reading Score', 'Writing Score'],
    title='PCA of Student Performance (Math Score)',
    labels={'Principal Component 1': 'PC1', 'Principal Component 2': 'PC2'},
    color='Math Score',
    color_continuous_scale='rainbow'
)

fig2 = px.scatter(
    pca_df,
    x='Principal Component 1',
    y='Principal Component 2',
    hover_data=['Math Score', 'Reading Score', 'Writing Score'],
    title='PCA of Student Performance (Writing Score)',
    labels={'Principal Component 1': 'PC1', 'Principal Component 2': 'PC2'},
    color='Writing Score',
    color_continuous_scale='rainbow'
)

fig3 = px.scatter(
    pca_df,
    x='Principal Component 1',
    y='Principal Component 2',
    hover_data=['Math Score', 'Reading Score', 'Writing Score'],
    title='PCA of Student Performance (Reading Score)',
    labels={'Principal Component 1': 'PC1', 'Principal Component 2': 'PC2'},
    color='Reading Score',
    color_continuous_scale='rainbow'
)

# Create subplots horizontally
fig = sp.make_subplots(rows=1, cols=3, shared_xaxes=False, shared_yaxes=False, horizontal_spacing=0.1)

# Add traces to the subplots
fig.add_trace(fig1['data'][0], row=1, col=1)
fig.add_trace(fig2['data'][0], row=1, col=2)
fig.add_trace(fig3['data'][0], row=1, col=3)

# Add labels at the top of each plot using annotations
fig.add_annotation(
    text='Math Score',
    xref='paper', yref='paper',
    x=0.07, y=1.15,
    showarrow=False,
    font=dict(size=14)
)
fig.add_annotation(
    text='Writing Score',
    xref='paper', yref='paper',
    x=0.5, y=1.15,
    showarrow=False,
    font=dict(size=14)
)
fig.add_annotation(
    text='Reading Score',
    xref='paper', yref='paper',
    x=0.9, y=1.15,
    showarrow=False,
    font=dict(size=14)
)

# Update layout for the overall figure
fig.update_layout(
    title='PCA of Student Performance',
    xaxis=dict(title='Principal Component 1 (PC1)'),
    yaxis=dict(title='Principal Component 2 (PC2)'),
    showlegend=False,
)

# Show the horizontal subplot
fig.show()
Above are three identical plots, each color-coded by one of our three main features. As you can see, for the writing and reading scores the color gradient is nearly the same. Why? Because, as we've already learned, their covariance is high.
Math is different from the other two features: in general the math score does still get bigger as the reading or writing score gets bigger, but because the covariance is lower, the color gradient still runs from left to right, just rotated a little clockwise.
One of the neat things about PCA is that data points that are related to each other end up clustered together. You can check in the plot above that:
- Students that have all scores high are clustered together
- Students that have all scores low are clustered together
- Students that have high reading and writing scores but low math scores are clustered together
- Students that have high math scores but low reading and writing scores are clustered together
- And so on
You can try to check it yourself! But hang on, what if we had a machine learning model that could cluster unsupervised data for us automatically? Hmm, I wonder if there's any machine learning model that can do that? 🤔
Note: One term you need to know is that each feature produced by PCA dimensionality reduction is called a principal component. In our case we kept 2 features after reducing the dimensionality of our dataset, so we have 2 principal components.
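By the way, if you want to check how much of the original information our 2 principal components actually retain, scikit-learn stores this on the fitted PCA object as explained_variance_ratio_ (a quick check using the pca object we fit above):

# Fraction of the total variance captured by each principal component
print("Explained variance ratio per component:", pca.explained_variance_ratio_)
print("Total variance retained by 2 components:", pca.explained_variance_ratio_.sum())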
from sklearn.cluster import KMeans
import plotly.express as px

# Perform k-means clustering on the PCA-transformed data
# (n_init is set explicitly to keep the current default of 10 and avoid scikit-learn's FutureWarning)
kmeans = KMeans(n_clusters=22, n_init=10, random_state=0)
pca_df['Cluster'] = kmeans.fit_predict(principal_components)

# Create a scatter plot for the k-means clustering results
fig4 = px.scatter(
    pca_df,
    x='Principal Component 1',
    y='Principal Component 2',
    color='Cluster',
    title='K-Means Clustering on PCA Components',
    labels={'Principal Component 1': 'PC1', 'Principal Component 2': 'PC2'},
    color_continuous_scale='rainbow',
    hover_data=['Math Score', 'Reading Score', 'Writing Score'],
)

# Show the k-means clustering plot
fig4.show()
K-means, of course! The plot above formalizes what we learned about PCA: related data points end up clustered together. There are plenty of action items we might want to share with our stakeholders:
- Students who fall into clusters 5, 8, 17, or 19 might need major help to fix their scores (most data points in those clusters are students with low scores)
- Some students in clusters 6 and 15 got good grades in math but not in reading and writing; maybe we can help them improve their reading and writing skills
- Some students in clusters 4, 11, and 14 got good grades in reading and writing but not in math; maybe we can help them improve their math skills
- Students in cluster 20 are excellent students; maybe we can give them some special treatment
So as you can see, PCA is an amazing tool that helps us visualize data, especially unsupervised data, and we can even combine its power with other machine learning models to take things further.
So what's the point behind PCA?
- PCA can help us visualize hundreds of dimensions in 2 or 3 dimensions; the data above uses only 3 dimensions for simplicity
- Similar data will be clustered together. Imagine we're at a finance company: PCA might help us reduce hundreds of features from our customer data to 2 or 3, and we might then see patterns, because customers who are likely to pay their debt, customers who are likely not to, fraudulent customers, and so on, tend to cluster together
- We can use PCA as an intermediate step before feeding the data into other machine learning models. As above, we can use PCA to reduce the dimensionality of our data, feed the result into k-means to cluster it, and then say that some clusters are students who need help only with math, some are students who need help only with reading and writing, and so on
- Another use case is that we can use PCA to compress our data (see the sketch after this list)
- And so on
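On the compression point, here is a minimal sketch of what that looks like with our fitted pca object: we keep only the 2 principal components instead of the 3 original columns, and inverse_transform gives back an approximate reconstruction of the original scores; the small reconstruction error is the price we pay for the compression.

import numpy as np

# Reconstruct approximate scores from the 2 principal components
reconstructed = pca.inverse_transform(principal_components)

# Average absolute difference between the reconstruction and the original scores
reconstruction_error = np.abs(reconstructed - numerical_data.values).mean()
print("Mean absolute reconstruction error:", reconstruction_error)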
Inference
So, given a new student's scores, for example a math score of 45, a reading score of 80, and a writing score of 90, how can we manually predict the cluster number? Let's break it down.
Standardization
Scikit-learn's PCA will center the data for us by default: it subtracts the mean from each data point (some standardization methods also divide each data point by the standard deviation, but scikit-learn's PCA doesn't do that by default). For the reason why this standardization is used, you can check the supplementary material. So let's check the mean of each feature:
import numpy as np

mean_math = np.mean(numerical_data['math score'])
mean_reading = np.mean(numerical_data['reading score'])
mean_writing = np.mean(numerical_data['writing score'])

print("Mean Math Score:", mean_math)
print("Mean Reading Score:", mean_reading)
print("Mean Writing Score:", mean_writing)
Mean Math Score: 66.089
Mean Reading Score: 69.169
Mean Writing Score: 68.054
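These should match the means that scikit-learn stored internally when it fit the PCA, since subtracting them is exactly what it does before projecting the data (a quick sanity check on the pca object from earlier):

# The per-feature means scikit-learn subtracts before projecting (math, reading, writing)
print("Means stored by PCA:", pca.mean_)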
Now let’s subtract the mean from each data point:
math_score = 45
reading_score = 80
writing_score = 90

math_score_standardized = math_score - mean_math
reading_score_standardized = reading_score - mean_reading
writing_score_standardized = writing_score - mean_writing

print("Standardized Math Score:", math_score_standardized)
print("Standardized Reading Score:", reading_score_standardized)
print("Standardized Writing Score:", writing_score_standardized)
Standardized Math Score: -21.089
Standardized Reading Score: 10.831000000000003
Standardized Writing Score: 21.945999999999998
Plotting our data
Now let's plot our data point. To do that, we need to know how much each score contributes to each principal component. This contribution is called the loading score, or weight. Let's check the loading scores of our data:
# Access the PCA components (weights) for each column
pca_weights = pca.components_

# Create a DataFrame to display the PCA weights
pca_weights_df = pd.DataFrame(pca_weights, columns=numerical_data.columns, index=['PC1', 'PC2'])

# Display the PCA weights
print("PCA Weights for Each Column:")
print(pca_weights_df)
PCA Weights for Each Column:
math score reading score writing score
PC1 -0.562649 -0.573977 -0.594959
PC2 0.825612 -0.353292 -0.439943
The formula is simple: we multiply each loading score by the corresponding standardized data point and add the products together. For example, for \(PC1\) our standardized data points are:
\[ \text{Standardized Math Score}: -21.089 \\ \text{Standardized Reading Score}: 10.831000000000003 \\ \text{Standardized Writing Score}: 21.945999999999998 \\ \]
With given weights:
\[ \text{Math loading score}: -0.562649\\ \text{Reading loading score}: -0.573977\\ \text{Writing loading score}: -0.594959 \]
Then we multiply each pair and add them up:
\[ PC1 = -21.089 \times -0.562649 + 10.831000000000003 \times -0.573977 + 21.945999999999998 \times -0.594959 = -7.40801316088976 \]
This is the PC1 coordinate of our new data point on the plot; the same calculation with the PC2 weights gives the PC2 coordinate.
pc1_calculation = pca_weights_df.loc["PC1", "math score"] * math_score_standardized + pca_weights_df.loc["PC1", "reading score"] * reading_score_standardized + pca_weights_df.loc["PC1", "writing score"] * writing_score_standardized
print("PC1 Calculation:", pc1_calculation)

pc2_calculation = pca_weights_df.loc["PC2", "math score"] * math_score_standardized + pca_weights_df.loc["PC2", "reading score"] * reading_score_standardized + pca_weights_df.loc["PC2", "writing score"] * writing_score_standardized
print("PC2 Calculation:", pc2_calculation)
PC1 Calculation: -7.40801316088976
PC2 Calculation: -30.892823442881664
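Before plotting, we can double-check this arithmetic by letting scikit-learn do the projection itself: pca.transform subtracts the stored means and applies the same loading scores, so it should give (approximately) the same two coordinates as our manual calculation.

# Project the new student's raw scores with the fitted PCA and compare with the manual result
new_student = pd.DataFrame([[45, 80, 90]], columns=numerical_data.columns)
print(pca.transform(new_student))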
So after calculating both coordinates, we can plot the new point on our PCA plot at (-7.40801316088976, -30.892823442881664):
from sklearn.cluster import KMeans
import plotly.express as px

# Perform k-means clustering on the PCA-transformed data
kmeans = KMeans(n_clusters=22, n_init=10, random_state=0)
pca_df['Cluster'] = kmeans.fit_predict(principal_components)

# The new student's PCA coordinates and original scores
# (named new_point so we don't overwrite the original data DataFrame)
new_point = {'Principal Component 1': [-7.40801316088976],
             'Principal Component 2': [-30.892823442881664],
             'Math Score': [45],
             'Reading Score': [80],
             'Writing Score': [90]}

pca_df_new = pd.concat([pca_df, pd.DataFrame(new_point)], ignore_index=True)

# Create a scatter plot for the k-means clustering results
fig4 = px.scatter(
    pca_df_new,
    x='Principal Component 1',
    y='Principal Component 2',
    color='Cluster',
    title='K-Means Clustering on PCA Components',
    labels={'Principal Component 1': 'PC1', 'Principal Component 2': 'PC2'},
    color_continuous_scale='rainbow',
    hover_data=['Math Score', 'Reading Score', 'Writing Score'],
)

# Show the k-means clustering plot
fig4.show()
As you can see above, we now have a single new data point plotted that hasn't been assigned to a cluster yet. So let's cluster it!
Finding the cluster
To find the cluster we need to get all centroids first.
def centroids_to_dict(centroids):
    centroid_dict = {}
    for i, centroid in enumerate(centroids):
        centroid_key = f"Centroid {i+1}"
        centroid_dict[centroid_key] = centroid.tolist()
    return centroid_dict

centroid_dict = centroids_to_dict(kmeans.cluster_centers_)
print(centroid_dict)
{'Centroid 1': [-15.943224630791715, -0.9070053183920267], 'Centroid 2': [28.941650614567102, -5.73892623045976], 'Centroid 3': [14.548856375164332, 8.090328619793798], 'Centroid 4': [-36.90728358893218, 8.276377195361517], 'Centroid 5': [-8.037316954950125, -6.620486287888707], 'Centroid 6': [45.074468370806926, 5.384960771273996], 'Centroid 7': [0.5709137281666932, 10.003752461381284], 'Centroid 8': [6.858696711240797, 0.45803211514065306], 'Centroid 9': [82.89587081166918, -0.2214084346961526], 'Centroid 10': [-24.54073462940947, 5.511350838396982], 'Centroid 11': [-38.79923755663632, -3.6061203645457183], 'Centroid 12': [1.9393537216705061, -8.16380734697627], 'Centroid 13': [-4.925904319438298, 2.2747275647729213], 'Centroid 14': [32.650214032600545, 8.316239692597533], 'Centroid 15': [-17.45452344254033, -10.67078097671195], 'Centroid 16': [-15.058151807435763, 10.208996532914504], 'Centroid 17': [-28.630403914997597, -4.858285616417209], 'Centroid 18': [63.318695206200665, -4.273284559916566], 'Centroid 19': [23.348857451348792, 3.3812899585860605], 'Centroid 20': [42.81183801107755, -6.075533049246773], 'Centroid 21': [-51.87254205402511, -0.04876276936114911], 'Centroid 22': [15.940113122611777, -7.8049425291843875]}
Then we calculate the distance between our new data point and each centroid, and choose the centroid with the smallest Euclidean distance:
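In two dimensions, the Euclidean distance between our point \((x, y)\) and a centroid \((c_x, c_y)\) is simply

\[ d = \sqrt{(x - c_x)^2 + (y - c_y)^2} \]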
import numpy as np

def find_nearest_centroid(x, y, centroids):
    # Create a point as a NumPy array
    point = np.array([x, y])

    # Calculate the Euclidean distance between the point and all centroids
    distances = np.linalg.norm(np.array(list(centroids.values())) - point, axis=1)

    # Find the index of the nearest centroid
    nearest_centroid_index = np.argmin(distances)

    return f"cluster {nearest_centroid_index}"

nearest_centroid = find_nearest_centroid(pc1_calculation, pc2_calculation, centroid_dict)
print(f"The cluster for ({pc1_calculation}, {pc2_calculation}) is {nearest_centroid}")
The cluster for (-7.40801316088976, -30.892823442881664) is cluster 14
Now we know that our data point is closest to cluster 14.
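As with the projection, we can cross-check the manual distance search against scikit-learn itself: kmeans.predict assigns a point to its nearest centroid, so it should agree with the cluster we just found by hand.

import numpy as np

# Let the fitted k-means model assign the cluster for our new PCA coordinates
predicted_cluster = kmeans.predict(np.array([[pc1_calculation, pc2_calculation]]))
print("Predicted cluster:", predicted_cluster[0])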
from sklearn.cluster import KMeans
import plotly.express as px

# Perform k-means clustering on the PCA-transformed data
kmeans = KMeans(n_clusters=22, n_init=10, random_state=0)
pca_df['Cluster'] = kmeans.fit_predict(principal_components)

# The new student's PCA coordinates, original scores, and the cluster we just found
new_point = {'Principal Component 1': [pc1_calculation],
             'Principal Component 2': [pc2_calculation],
             'Math Score': [45],
             'Reading Score': [80],
             'Writing Score': [90],
             'Cluster': [14]}

pca_df_new = pd.concat([pca_df, pd.DataFrame(new_point)], ignore_index=True)

# Create a scatter plot for the k-means clustering results
fig4 = px.scatter(
    pca_df_new,
    x='Principal Component 1',
    y='Principal Component 2',
    color='Cluster',
    title='K-Means Clustering on PCA Components',
    labels={'Principal Component 1': 'PC1', 'Principal Component 2': 'PC2'},
    color_continuous_scale='rainbow',
    hover_data=['Math Score', 'Reading Score', 'Writing Score'],
)

# Show the k-means clustering plot
fig4.show()
Now we know that a student with a math score of 45, a reading score of 80, and a writing score of 90 falls into cluster 14. If we check our previous notes:
- Students who fall into clusters 5, 8, 17, or 19 might need major help to fix their scores (most data points in those clusters are students with low scores)
- Some students in clusters 6 and 15 got good grades in math but not in reading and writing; maybe we can help them improve their reading and writing skills
- Some students in clusters 4, 11, and 14 got good grades in reading and writing but not in math; maybe we can help them improve their math skills
- Students in cluster 20 are excellent students; maybe we can give them some special treatment
Because this student is in cluster 14, we can see that they got really good scores in reading and writing but not in math, so we can help them improve their math skills.