Task: Given height (feature/attribute), predict the weight
Examples: linear regression, logistic regression, decision tree, random forest, support vector machine, neural network
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data
weight = np.random.randint(40, 100, 30)
height = np.sqrt(weight / 20) * 100 + np.random.randint(-20, 20, 30)

# plot
plt.scatter(weight, height)
plt.xlabel('weight')
plt.ylabel('height')
plt.title('weight vs height')
plt.show()
Problem: how do we predict the weight of a person whose height is 175 cm?
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data
weight = np.random.randint(40, 100, 30)
height = np.sqrt(weight / 20) * 100 + np.random.randint(-20, 20, 30)

# implement linear regression using scikit-learn to predict weight from height
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(height.reshape(-1, 1), weight)
weight_pred = model.predict(height.reshape(-1, 1))

# plot
plt.scatter(weight, height)
plt.xlabel('weight')
plt.ylabel('height')
plt.title('weight vs height')
# plot the regression line
plt.plot(weight_pred, height, color='red')
plt.show()
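To answer the question above, the fitted model can be applied to a new height value. A minimal sketch, reusing the model fitted in the cell above (the variable name below is illustrative):

# Predict the weight for a height of 175 cm
predicted_weight = model.predict(np.array([[175]]))
print(predicted_weight)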
Unsupervised
Labels are not given; the task is to find patterns in the data, e.g. anomaly detection or clustering
# Create a sample of clustering data
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate sample data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Plot the data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.show()
Without any label or target, predict which cluster each data point belongs to
# Create a sample of clustering data
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate sample data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Implement K-Means clustering
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, n_init=10)
kmeans.fit(X)
y_pred = kmeans.predict(X)

# Plot the clustering result
plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=50, cmap='viridis')
plt.show()
Semi-Supervised
Combination of unsupervised and supervised learning
Example: Google Photos
Unsupervised: group photos that contain the same face
Supervised: ask the user to label each face with a name
Overfitting
The model doesn’t generalize well; it memorizes the training data
# Create sample X and y data
import numpy as np
import matplotlib.pyplot as plt

# set seed
np.random.seed(0)

# Generate sample data
X = np.random.rand(100, 1)
y = 2 + 3 * X + np.random.rand(100, 1)

# Make an overfit regression with a high-degree polynomial
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
model = make_pipeline(PolynomialFeatures(20), LinearRegression())
model.fit(X, y)

# Plot the data
plt.scatter(X, y)

# draw the model
X_test = np.linspace(0, 1, 100)
y_pred = model.predict(X_test.reshape(-1, 1))
plt.plot(X_test, y_pred, color='red')
plt.show()
Training Set & Test Set
To detect (and guard against) overfitting, we need separate data to evaluate the model
One way to do it is to split the data into two sets:
Training set: used to train the model
Test set: used to evaluate the model
When overfitting happens, the model performs well on the training set but poorly on the test set (a sketch demonstrating this follows the split examples below)
To split the data, we can do it manually or use the train_test_split function from sklearn
# Split the data into training and test set manually
import numpy as np
import matplotlib.pyplot as plt

# set seed
np.random.seed(0)

# Generate sample data
X = np.random.rand(100, 1)
y = 2 + 3 * X + np.random.rand(100, 1)

# Split the data into training and test set manually
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]

# Plot the training and test set
plt.scatter(X_train, y_train, label='Training set')
plt.scatter(X_test, y_test, label='Test set')
plt.legend()
plt.show()
But be careful: the data should be shuffled before splitting
Otherwise, the training set and the test set may cover different parts of the data distribution, as the example below shows
# Split the data into training and test set manually
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# set seed
np.random.seed(0)

# Generate sample data
# X being a 100 x 1 matrix, ordered from 0 to 100
X = np.arange(100).reshape(100, 1)
y = 2 + 3 * X + np.random.rand(100, 1)
y[X >= 80] = 2 + 5 * X[X >= 80].flatten() + np.random.rand(len(X[X >= 80]))

# Split the data into training and test set manually (no shuffling!)
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Plot the training and test set
plt.scatter(X_train, y_train, label='Training set')
plt.scatter(X_test, y_test, label='Test set')

# Plot the regression line
X_test = np.linspace(0, 100, 100)
y_pred = model.predict(X_test.reshape(-1, 1))
plt.plot(X_test, y_pred, color='red', label='Regression line')
plt.legend()
plt.show()
# Split the data into training and test set using scikit-learn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# set seed
np.random.seed(0)

# Generate sample data
# X being a 100 x 1 matrix, ordered from 0 to 100
X = np.arange(100).reshape(100, 1)
y = 2 + 3 * X + np.random.rand(100, 1)
y[X >= 80] = 2 + 5 * X[X >= 80].flatten() + np.random.rand(len(X[X >= 80]))

# Split the data into training and test set using scikit-learn
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True)  # Make sure it's shuffled

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Plot the training and test set
plt.scatter(X_train, y_train, label='Training set')
plt.scatter(X_test, y_test, label='Test set')

# Plot the regression line
X_test = np.linspace(0, 100, 100)
y_pred = model.predict(X_test.reshape(-1, 1))
plt.plot(X_test, y_pred, color='red', label='Regression line')
plt.legend()
plt.show()
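Tying this back to overfitting: a minimal sketch (assumption: it reuses the random data and the degree-20 polynomial pipeline from the overfitting example above) that compares the training score with the test score; a large gap between the two indicates overfitting.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Same synthetic data as the overfitting example
np.random.seed(0)
X = np.random.rand(100, 1)
y = 2 + 3 * X + np.random.rand(100, 1)

# Shuffled split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# High-degree polynomial model that tends to overfit
overfit_model = make_pipeline(PolynomialFeatures(20), LinearRegression())
overfit_model.fit(X_train, y_train)

# R^2 on the training set is typically higher than on the test set
print('Train score:', overfit_model.score(X_train, y_train))
print('Test score: ', overfit_model.score(X_test, y_test))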
Cross Validation
Cross validation is a technique to evaluate the model by splitting the data into training set and test set multiple times.
For example: 5-fold cross validation
Split the data into 5 folds
Train the model using 4 folds, evaluate the model using the remaining fold
Repeat the process 5 times, each time using a different fold as the test set
Calculate the average score
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Generate sample data
np.random.seed(0)
X = np.random.rand(100, 1)
y = 2 + 3 * X + np.random.rand(100, 1)

# Initialize the linear regression model
model = LinearRegression()

# Initialize the 5-fold cross-validation object
kf = KFold(n_splits=5, shuffle=True)

# Train and evaluate the model on each fold
for train_index, test_index in kf.split(X):
    # Split the data into training and test sets for this fold
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit the model on the training data for this fold
    model.fit(X_train, y_train)

    # Evaluate the model on the test data for this fold
    score = model.score(X_test, y_test)

    # Draw the plot
    plt.scatter(X_train, y_train, label='Training set')
    plt.scatter(X_test, y_test, label='Test set')

    # Show the plot
    plt.legend()
    plt.show()

    # Print the score for this fold
    print(f"Fold score: {score:.2f}")

# Compute the overall score across all folds
scores = cross_val_score(model, X, y, cv=kf)
print(f"Overall score: {np.mean(scores):.2f}")
Fold score: 0.83
Fold score: 0.93
Fold score: 0.94
Fold score: 0.87
Fold score: 0.88
Overall score: 0.89
Regression vs Classification
Regression
Predict a continuous value, e.g. the weight of a person (as shown in the supervised learning example)
Classification
Classify the data into discrete classes, e.g. classify an email as spam or not spam
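The classification example below uses the MNIST dataset of handwritten digits. The loading cell is not shown above; a minimal sketch, assuming scikit-learn's fetch_openml and the 'mnist_784' dataset (loaded as a pandas DataFrame so that .iloc works in the plotting cell below):

from sklearn.datasets import fetch_openml

# Download MNIST: 70,000 images of 28x28 pixels, flattened into 784 features
mnist = fetch_openml('mnist_784', as_frame=True, parser='auto')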
X = mnist['data']
y = mnist['target']
print(X.shape, y.shape)
# Plot the first 5 images
import matplotlib.pyplot as plt
for i in range(5):
    some_digit = X.iloc[i]
    some_digit_image = some_digit.values.reshape(28, 28)
    plt.imshow(some_digit_image, cmap='binary')
    plt.axis('off')
    # Draw label y on the bottom
    plt.text(0, 28, y[i], fontsize=14, color='g')
    plt.show()
This is a classification problem.
Given the images, classify each one as the correct digit.
Confusion Matrix
Classification needs different metrics to evaluate the model, and the right metric can differ from problem to problem.
A confusion matrix is a table that visualizes the performance of a classification model
Example of Confusion Matrix
                    Predicted: Not Spam   Predicted: Spam
Actual: Not Spam    True Negative         False Positive
Actual: Spam        False Negative        True Positive
import matplotlib.pyplot as plt

# split the data into train and test
train_size = 60000
train_X, test_X = X[:train_size], X[train_size:]
train_y, test_y = y[:train_size], y[train_size:]

# binary target: is the digit a 7?
is_7_train = train_y == '7'
is_7_test = test_y == '7'

# Train the model
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(train_X, is_7_train)
SGDClassifier(random_state=42)
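The cells below reference y_pred, which is not computed in the cell above. A minimal sketch that produces it and prints the resulting confusion matrix (reusing test_X and is_7_test from the previous cell):

from sklearn.metrics import confusion_matrix

# Predict 'is this a 7?' on the test set
y_pred = sgd_clf.predict(test_X)

# Rows are the actual classes (not 7, 7); columns are the predictions (not 7, 7)
print(confusion_matrix(is_7_test, y_pred))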
False Positive
# Draw the false positives
import matplotlib.pyplot as plt
false_positive = (is_7_test == False) & (y_pred == True)
false_positive_images = test_X[false_positive]
for i in range(3):
    some_digit = false_positive_images.iloc[i]
    some_digit_image = some_digit.values.reshape(28, 28)
    plt.imshow(some_digit_image, cmap='binary')
    plt.axis('off')
    # Draw label y on the bottom
    plt.text(0, 28, 'False Positive (predicted 7, but it is NOT 7)', fontsize=14, color='r')
    plt.show()
False Negative
# Draw the false negatives
import matplotlib.pyplot as plt
false_negative = (is_7_test == True) & (y_pred == False)
false_negative_images = test_X[false_negative]
for i in range(3):
    some_digit = false_negative_images.iloc[i]
    some_digit_image = some_digit.values.reshape(28, 28)
    plt.imshow(some_digit_image, cmap='binary')
    plt.axis('off')
    # Draw label y on the bottom
    plt.text(0, 28, 'False Negative (predicted NOT 7, but it is 7)', fontsize=14, color='r')
    plt.show()
True Negative
# Draw the true negatives
import matplotlib.pyplot as plt
true_negative = (is_7_test == False) & (y_pred == False)
true_negative_images = test_X[true_negative]
for i in range(3):
    some_digit = true_negative_images.iloc[i]
    some_digit_image = some_digit.values.reshape(28, 28)
    plt.imshow(some_digit_image, cmap='binary')
    plt.axis('off')
    # Draw label y on the bottom
    plt.text(0, 28, 'True Negative (predicted NOT 7, and it is NOT 7)', fontsize=14, color='g')
    plt.show()
True Positive
# Draw the true positives
import matplotlib.pyplot as plt
true_positive = (is_7_test == True) & (y_pred == True)
true_positive_images = test_X[true_positive]
for i in range(3):
    some_digit = true_positive_images.iloc[i]
    some_digit_image = some_digit.values.reshape(28, 28)
    plt.imshow(some_digit_image, cmap='binary')
    plt.axis('off')
    # Draw label y on the bottom
    plt.text(0, 28, 'True Positive (predicted 7, and it is 7)', fontsize=14, color='g')
    plt.show()
Why not use accuracy?
Accuracy may not be a good metric for classification problems, because the data can be imbalanced.
For example:
99% of the emails are not spam
1% of the emails are spam
If the model always predicts an email as not spam, the accuracy is 99%, but the model is not useful at all.
Confusion Matrix:
                    Predicted: Not Spam   Predicted: Spam
Actual: Not Spam    99                    0
Actual: Spam        1                     0
Or in our MNIST example:
When we want to classify whether an image is the number 7 or not, a model that always predicts the image as not 7 will still have about 90% accuracy (only about 10% of the digits are 7s).
Source: Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition
Example:
                 Predicted: Not 7   Predicted: 7
Actual: Not 7    90                 0
Actual: 7        10                 0
Recall = TP / (TP + FN) = 0 / (0 + 10) = 0
Precision = TP / (TP + FP) = 0 / (0 + 0), which is undefined because the model never predicts 7; by convention it is reported as 0
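Precision and recall can also be computed with scikit-learn. A minimal sketch on the MNIST is-7 task, assuming is_7_test and y_pred from the cells above:

from sklearn.metrics import precision_score, recall_score

print('Precision:', precision_score(is_7_test, y_pred))
print('Recall:   ', recall_score(is_7_test, y_pred))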
Perfect Recall
Confusion Matrix:
                 Predicted: Not 7   Predicted: 7
Actual: Not 7    0                  90
Actual: 7        0                  10
If our model always predicts ALL images as 7, the recall is 100% (10 / (10 + 0) = 1).
Perfect recall means the model never misses a 7, but it also predicts many non-7s as 7.
Q: Is "7" seven?
A: Yes, it is seven
Q: Is "7" seven?
A: Yes, it is seven
Q: Is "8" seven?
A: Yes, it is seven
Q: Is "9" seven?
A: Yes, it is seven
Perfect Precision
Confusion Matrix:
                 Predicted: Not 7   Predicted: 7
Actual: Not 7    90                 0
Actual: 7        9                  1
If our model is very careful and only predicts an image as 7 when it is very sure, we get perfect precision (1 / (1 + 0) = 1).
Perfect precision means that whenever the model predicts an image as 7, it actually is a 7. But the model may miss a lot of 7s.
Q: Is "7" seven?
A: Yes, it is seven
Q: Is "8" seven?
A: No, it is not seven
Q: Is "9" seven?
A: No, it is not seven
Q: Is "7" seven?
A: No, it is not seven
Q: Is "7" seven?
A: No, it is not seven
When to use precision and when to use recall?
It depends on the problem.
High Recall
High Recall is prioritized when the cost of a false negative is high, e.g. in cancer screening.
A false negative (a person who has cancer but is predicted as not having it) could lead to lack of treatment and dire health implications.
A false positive (a person who doesn’t have cancer but is predicted as having it) would lead to further tests, which might be stressful and costly but isn’t immediately harmful.
High Precision
High Precision is prioritized when the cost of a false positive is high, e.g. in fraud detection that blocks transactions.
A false positive (a legitimate transaction is incorrectly flagged as fraudulent), can lead to customer frustration.
A false negative (missing a fraudulent transaction) may be deemed more acceptable than annoying or alienating a large number of genuine customers.
F1 Score
The F1 score is a metric that combines precision and recall: it is their harmonic mean, F1 = 2 * precision * recall / (precision + recall).
The F1 score favors classifiers that have similar precision and recall. This is not always what you want: in some contexts you mostly care about precision, and in other contexts you really care about recall.
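A minimal sketch computing the F1 score, assuming is_7_test and y_pred from the earlier MNIST cells:

from sklearn.metrics import f1_score, precision_score, recall_score

precision = precision_score(is_7_test, y_pred)
recall = recall_score(is_7_test, y_pred)

# F1 is the harmonic mean of precision and recall
print('F1 (sklearn):', f1_score(is_7_test, y_pred))
print('F1 (manual): ', 2 * precision * recall / (precision + recall))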