Covariance formula and variance

Earlier we learned about variance, which has the formula below:

\[ Variance(x) = \frac{\sum_{i=1}^{N}{(x_i - \text{mean of } x)^2}}{N} \]

where \(x\) is the variable and \(N\) is the number of data points.

Which basically means: for each data point, we take its difference from the mean, square it, sum up all the squared differences, and divide by the number of data points.

One point that you need to know about variance: it’s basically another name for covariance, but specifically the covariance of a variable with itself. Where does the “with itself” come from? The squared difference:

\[ (x - \text{mean of } x)^2 \]

How so? This is the formula for covariance: \[ Covariance(x,y) = \frac{\sum_{i=1}^{N}{(x_i - \text{mean of } x)(y_i - \text{mean of } y)}}{N} \]

So \(Covariance(x,x)\) is the same as saying \(Variance(x)\).
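To make this concrete, here’s a minimal numpy sketch (using the same \(x\) values as the worked example below) that checks \(Variance(x)\) against \(Covariance(x,x)\):

import numpy as np

# The same x values as the worked example below
x = np.array([3, 4, 5, 8, 10])

# Population variance: mean of squared differences from the mean (divides by N)
variance = np.mean((x - x.mean()) ** 2)

# Covariance of x with itself; bias=True also divides by N
cov_xx = np.cov(x, x, bias=True)[0, 1]

print(variance, cov_xx)  # both print 6.8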

Now let’s expand a bit more on covariance. If we have 5 data points of \(x\) and \(y\): (3, 9), (4, 7), (5, 10), (8, 12), (10, 7), we first need to find the means of \(x\) and \(y\), which are 6 and 9 respectively. Then we can calculate the covariance of \(x\) and \(y\):

\[ (3 - 6) \times (9 - 9) = -3 \times 0 = 0 \\ (4 - 6) \times (7 - 9) = -2 \times -2 = 4 \\ (5 - 6) \times (10 - 9) = -1 \times 1 = -1 \\ (8 - 6) \times (12 - 9) = 2 \times 3 = 6 \\ (10 - 6) \times (7 - 9) = 4 \times -2 = -8 \\ \] \[ 0 + 4 - 1 + 6 - 8 = 1 \] \[ \frac{1}{\text{number of data points}} = \frac{1}{5} = 0.2 \]

So basically we find the difference between each \(x\) and \(y\) data point and its respective mean to know how far each data point deviates from the mean, then multiply the two differences for each data point to know how they relate to each other, then sum all the products, and divide the sum by the number of data points.
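If you want to double-check the arithmetic above with numpy, here’s a quick sketch (bias=True makes np.cov divide by \(N\), matching our formula):

import numpy as np

# The five (x, y) data points from the example above
x = np.array([3, 4, 5, 8, 10])
y = np.array([9, 7, 10, 12, 7])

# bias=True divides by N (like our formula) instead of N - 1
print(np.cov(x, y, bias=True)[0, 1])  # 0.2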

Covariance basic intuition

So there are three basic scenarios for covariance:

- Positive covariance: if one variable increases, the other tends to increase as well
- Negative covariance: if one variable increases, the other tends to decrease
- Zero covariance: if one variable increases, the other doesn’t tend to increase or decrease

We’ll use this later; for now let’s just keep it in mind.
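To see these three scenarios in numbers, here’s a tiny sketch with made-up data points:

import numpy as np

x = np.array([1, 2, 3, 4])

# Positive: y increases together with x
print(np.cov(x, [10, 20, 30, 40], bias=True)[0, 1])  # 12.5

# Negative: y decreases as x increases
print(np.cov(x, [40, 30, 20, 10], bias=True)[0, 1])  # -12.5

# Zero: no consistent tendency either way
print(np.cov(x, [10, 40, 40, 10], bias=True)[0, 1])  # 0.0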

Covariance matrix

A covariance matrix is simply a matrix holding the covariance of each feature with every other feature. So if we have 3 features, we’ll have a 3x3 matrix. In our example we have 3 features: math, reading, and writing. So it will look like this:

\[ \begin{bmatrix} Covariance(\text{math, math}) & Covariance(\text{math, reading}) & Covariance(\text{math, writing}) \\ Covariance(\text{reading, math}) & Covariance(\text{reading, reading}) & Covariance(\text{reading, writing}) \\ Covariance(\text{writing, math}) & Covariance(\text{writing, reading}) & Covariance(\text{writing, writing}) \end{bmatrix} \]

Again, the covariance of a variable with itself is called variance. And \(Covariance(\text{math, reading})\) is the same as \(Covariance(\text{reading, math})\), because intuitively if math and reading are related, then reading and math are related as well.
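This symmetry is easy to verify with numpy: the covariance matrix equals its own transpose, and its diagonal holds the variances. A small sketch using the first few rows of the dataset we’ll load below:

import numpy as np

# First few rows of the student scores used below (math, reading, writing)
scores = np.array([
    [72, 72, 74],
    [69, 90, 88],
    [90, 95, 93],
    [47, 57, 44],
])

C = np.cov(scores, rowvar=False)

# Diagonal entries are each feature's variance ...
print(np.allclose(np.diag(C), np.var(scores, axis=0, ddof=1)))  # True

# ... and Covariance(math, reading) == Covariance(reading, math), so C is symmetric
print(np.allclose(C, C.T))  # True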

Quick run-through of PCA using basic numpy

We can easily create our own PCA using numpy because it’s mostly just statistics and some linear algebra. Let’s try it:

import pandas as pd

# Path to the CSV file (loaded here from a public URL)
file_path = 'https://storage.googleapis.com/rg-ai-bootcamp/machine-learning/StudentsPerformance.csv'

# Load the CSV data into a pandas DataFrame
data = pd.read_csv(file_path)

numerical_data = data[['math score', 'reading score', 'writing score']]
numerical_data
math score reading score writing score
0 72 72 74
1 69 90 88
2 90 95 93
3 47 57 44
4 76 78 75
... ... ... ...
995 88 99 95
996 62 55 55
997 59 71 65
998 68 78 77
999 77 86 86

1000 rows × 3 columns

First, we center the data by subtracting the mean from each data point (standardizing, or at least centering, the data prior to PCA helps capture the underlying structure of the data by focusing on the relative variances of the variables, making the analysis more meaningful). Then we find the covariance matrix:

import numpy as np
import pandas as pd

# Your data
X = data[['math score', 'reading score', 'writing score']].to_numpy()

# Calculate the mean
mean = np.mean(X, axis=0)

# Center the data
centered_X = X - mean

# Compute the covariance matrix
covariance_matrix = np.cov(centered_X, rowvar=False)

# Create a DataFrame with labels
column_labels = ['math score', 'reading score', 'writing score']
covariance_df = pd.DataFrame(covariance_matrix, columns=column_labels, index=column_labels)

# Print the DataFrame
print(covariance_df)
               math score  reading score  writing score
math score     229.918998     180.998958     184.939133
reading score  180.998958     213.165605     211.786661
writing score  184.939133     211.786661     230.907992

Then we create a transformation matrix from the covariance matrix. The math involves some intermediate steps with eigenvalues and eigenvectors that are outside our scope, but the main intuition is that we’re trying to find lines to project our data points onto such that the spread of the projected data points is as big as possible (so we don’t lose much information after the projection). These projection lines are determined by our covariance matrix.

This concept is a little bit hard to grasp, but hopefully the illustration below can help you understand it better:

Illustration from https://numxl.com/blogs/principal-component-analysis-pca-101/

As you can see above, we have two different lines, at two different angles, that we can project our data points onto so that the spread of the projected points is as big as possible. The process of finding the best-fit line can be seen below, where the best fit is reached when the projection line matches the purple lines shown in the left and right plots:

Illustration from https://medium.com/@ashwin8april/dimensionality-reduction-and-visualization-using-pca-principal-component-analysis-8489b46c2ae0

It’s a bit like finding a linear regression line, but after we’ve found it from one angle, we rotate and look for the best-fit line from another angle. So basically we’re trying to find the best-fit lines across all angles.

Now let’s find those best-fit lines. Below, \(k\) is the number of principal components we want to keep after the dimensionality reduction. So if we want to reduce our data from 3 dimensions to 2 dimensions, \(k\) will be 2.

k = 2
# Step 3: Eigen decomposition
eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)

# Step 4: Sorting eigenvalues and eigenvectors
sorted_indices = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[sorted_indices]
eigenvectors = eigenvectors[:, sorted_indices]

# Step 5: Selecting the top k eigenvectors
top_k_eigenvectors = eigenvectors[:, :k]

top_k_eigenvectors
array([[ 0.56264911,  0.82561176],
       [ 0.57397682, -0.35329218],
       [ 0.59495932, -0.43994302]])

Above, we carried out several steps of PCA:

  • Finding the best-fit line from every angle
  • Keeping only the top two of all the best-fit lines we found

And we’ve got a \(3 \times 2\) matrix where each column is one of the “best-fit lines” that we’ve found.
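As a quick aside, the sorted eigenvalues also tell us roughly how much of the total spread each of these lines captures, which can help decide how many components to keep. A minimal sketch reusing the eigenvalues we sorted above:

# Fraction of the total variance captured by each principal component,
# computed from the eigenvalues sorted above
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(explained_variance_ratio[:k])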

Now, to project our data onto these vectors, we just need to multiply our data by our transformation matrix:

# Step 6: Projecting the data
reduced_data = np.matmul(centered_X, top_k_eigenvectors)

reduced_data
array([[  8.48837536,   1.26411978],
       [ 25.46144129, -13.73117695],
       [ 43.12175323,  -0.35950596],
       ...,
       [ -4.75467372,  -5.15605377],
       [ 11.46651782,  -5.47790938],
       [ 26.47680822,  -4.83322812]])
import plotly.express as px

# Wrap the reduced data in a DataFrame for plotting
df = pd.DataFrame(reduced_data, columns=['X', 'Y'])

# Create a scatter plot using Plotly
fig = px.scatter(df, x='X', y='Y', title='Scatter Plot')
fig.show()

The scatter plot above may look mirrored compared to the one from scikit-learn’s PCA (the signs of the eigenvectors are arbitrary), but the main idea is basically the same. It’s fun, isn’t it? 😁
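If you’re curious, you can compare this directly with scikit-learn. Here’s a minimal sketch (assuming scikit-learn is installed); the signs of the components are arbitrary, which is why the plot can appear mirrored:

from sklearn.decomposition import PCA

# scikit-learn centers the data internally, so we can pass the raw scores
pca = PCA(n_components=2)
sklearn_reduced = pca.fit_transform(X)

# Should match our projection up to a sign flip per component
print(np.allclose(np.abs(sklearn_reduced), np.abs(reduced_data)))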
