import matplotlib.pyplot as plt
import numpy as np

weight = np.random.randint(40, 100, 30)
height = np.sqrt(weight / 20) * 100 + np.random.randint(-20, 20, 30)

# plot
plt.scatter(weight, height)
plt.xlabel('weight')
plt.ylabel('height')
plt.title('weight vs height')
plt.show()
Introduction to Machine Learning
Supervised vs Unsupervised vs Semi-supervised
Supervised
Target/label is given, e.g.
Height (cm) | Weight (kg, Label) |
---|---|
180 | 80 |
160 | 60 |
170 | 70 |
190 | 90 |
Task: Given height (feature/attribute), predict the weight
Examples: linear regression, logistic regression, decision tree, random forest, support vector machine, neural network
Problem: How to predict the weight of a person with height 175cm?
import matplotlib.pyplot as plt
import numpy as np

weight = np.random.randint(40, 100, 30)
height = np.sqrt(weight / 20) * 100 + np.random.randint(-20, 20, 30)

# implement linear regression using scikit to predict weight from height
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(height.reshape(-1, 1), weight)
weight_pred = model.predict(height.reshape(-1, 1))

# plot
plt.scatter(weight, height)
plt.xlabel('weight')
plt.ylabel('height')
plt.title('weight vs height')

# plot the regression line
plt.plot(weight_pred, height, color='red')
plt.show()
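To answer the problem above, we can feed the new height into the fitted model. A minimal sketch, continuing from the `model` fitted in the previous cell:

# Predict the weight for a person with height 175 cm,
# reusing the `model` fitted above
predicted = model.predict([[175]])
print(f"Predicted weight for height 175 cm: {predicted[0]:.1f}")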
Unsupervised
No label is given; the task is to find patterns in the data, e.g. anomaly detection, clustering
# Create a sample of clustering data
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate sample data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Plot the data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.show()
Without any label or target, the task is to predict which data point belongs to which cluster
# Create a sample of clustering data
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate sample data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Implement K-Means clustering
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, n_init=10)  # set n_init explicitly to avoid the FutureWarning
kmeans.fit(X)
y_pred = kmeans.predict(X)

# Plot the clustering result
plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=50, cmap='viridis')
plt.show()
Semi-supervised
A combination of unsupervised and supervised learning
Example: Google Photos
- Unsupervised: Grouping photos based on the face
- Supervised: Ask the user to label the face with a name
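As a concrete sketch of this idea, scikit-learn's SelfTrainingClassifier implements a simple version of the loop: train on the few labeled points, pseudo-label the confident ones, and retrain. A minimal example on synthetic blobs (the base estimator and the 90% hidden-label fraction are illustrative assumptions, not part of the Google Photos example):

# Semi-supervised sketch: hide most labels, keep a few
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.metrics import accuracy_score

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Mark 90% of the samples as unlabeled (-1 is scikit-learn's "no label" marker)
rng = np.random.RandomState(0)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.9] = -1

# Self-training: fit on the labeled points, pseudo-label the confident ones, refit
self_training = SelfTrainingClassifier(LogisticRegression())
self_training.fit(X, y_partial)

print(f"Accuracy against the hidden labels: {accuracy_score(y, self_training.predict(X)):.2f}")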
Overfitting
The model doesn't generalize well; it memorizes the training data
# Create sample X and y data
import numpy as np
import matplotlib.pyplot as plt

# set seed
np.random.seed(0)

# Generate sample data
X = np.random.rand(100, 1)
y = 2 + 3 * X + np.random.rand(100, 1)

# Make an overfit regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

model = make_pipeline(PolynomialFeatures(20), LinearRegression())
model.fit(X, y)

# Plot the data
plt.scatter(X, y)

# draw the model
X_test = np.linspace(0, 1, 100)
y_pred = model.predict(X_test.reshape(-1, 1))
plt.plot(X_test, y_pred, color='red')
plt.show()
Training Set & Test Set
To prevent overfitting, we need separate data to evaluate the model
One way to do it is to split the data into two sets:
- Training set: used to train the model
- Test set: used to evaluate the model
When overfitting happens, the model performs well on the training set but poorly on the test set
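To see this numerically, we can hold out part of the data and compare the model's score on each split. A minimal sketch reusing the degree-20 polynomial pipeline from the overfitting example (the 80/20 split is an illustrative choice):

# Score the overfit model on held-out data
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

np.random.seed(0)
X = np.random.rand(100, 1)
y = 2 + 3 * X + np.random.rand(100, 1)

# Hold out the last 20 points (the data here is unordered, so a simple slice is fine)
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]

model = make_pipeline(PolynomialFeatures(20), LinearRegression())
model.fit(X_train, y_train)

# R^2 scores: typically high on the training set, noticeably lower on the test set
print(f"Train score: {model.score(X_train, y_train):.2f}")
print(f"Test score:  {model.score(X_test, y_test):.2f}")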
To split the data, we can do it manually or use the train_test_split function from sklearn
# Split the data into training and test set manually
import numpy as np
import matplotlib.pyplot as plt

# set seed
np.random.seed(0)

# Generate sample data
X = np.random.rand(100, 1)
y = 2 + 3 * X + np.random.rand(100, 1)

# Split the data into training and test set manually
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]

# Plot the training and test set
plt.scatter(X_train, y_train, label='Training set')
plt.scatter(X_test, y_test, label='Test set')
plt.legend()
plt.show()
But be careful: the data should be shuffled before splitting
Otherwise, the training set and the test set may come from different parts of the data distribution, as the next example shows
# Split the data into training and test set manually, without shuffling
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# set seed
np.random.seed(0)

# Generate sample data
# X being a 100 x 1 matrix, ordered from 0 to 100
X = np.arange(100).reshape(100, 1)
y = 2 + 3 * X + np.random.rand(100, 1)
# the last 20 points follow a different slope
y[X >= 80] = 2 + 5 * X[X >= 80].flatten() + np.random.rand(len(X[X >= 80]))

# Split the data into training and test set manually
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Plot the training and test set
plt.scatter(X_train, y_train, label='Training set')
plt.scatter(X_test, y_test, label='Test set')

# Plot the regression line
X_test = np.linspace(0, 100, 100)
y_pred = model.predict(X_test.reshape(-1, 1))
plt.plot(X_test, y_pred, color='red', label='Regression line')
plt.legend()
plt.show()
# Split the data into training and test set using scikit-learn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# set seed
np.random.seed(0)

# Generate sample data
# X being a 100 x 1 matrix, ordered from 0 to 100
X = np.arange(100).reshape(100, 1)
y = 2 + 3 * X + np.random.rand(100, 1)
y[X >= 80] = 2 + 5 * X[X >= 80].flatten() + np.random.rand(len(X[X >= 80]))

# Split the data into training and test using scikit
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True)  # Make sure it's shuffled

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Plot the training and test set
plt.scatter(X_train, y_train, label='Training set')
plt.scatter(X_test, y_test, label='Test set')

# Plot the regression line
X_test = np.linspace(0, 100, 100)
y_pred = model.predict(X_test.reshape(-1, 1))
plt.plot(X_test, y_pred, color='red', label='Regression line')
plt.legend()
plt.show()
Cross Validation
Cross validation is a technique to evaluate the model by splitting the data into training set and test set multiple times.
For example: 5-fold cross validation
- Split the data into 5 folds
- Train the model using 4 folds, evaluate the model using the remaining fold
- Repeat the process 5 times, each time using a different fold as the test set
- Calculate the average score
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Generate sample data
np.random.seed(0)
X = np.random.rand(100, 1)
y = 2 + 3 * X + np.random.rand(100, 1)

# Initialize the linear regression model
model = LinearRegression()

# Initialize the 5-fold cross-validation object
kf = KFold(n_splits=5, shuffle=True)

# Train and evaluate the model on each fold
for train_index, test_index in kf.split(X):
    # Split the data into training and test sets for this fold
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit the model on the training data for this fold
    model.fit(X_train, y_train)

    # Evaluate the model on the test data for this fold
    score = model.score(X_test, y_test)

    # Draw the plot
    plt.scatter(X_train, y_train, label='Training set')
    plt.scatter(X_test, y_test, label='Test set')

    # Show the plot
    plt.legend()
    plt.show()

    # Print the score for this fold
    print(f"Fold score: {score:.2f}")

# Compute the overall score across all folds
scores = cross_val_score(model, X, y, cv=kf)
print(f"Overall score: {np.mean(scores):.2f}")
Fold score: 0.83
Fold score: 0.93
Fold score: 0.94
Fold score: 0.87
Fold score: 0.88
Overall score: 0.89
Regression vs Classification
Regression
Predict a continuous value, e.g. predict the weight of a person (as shown in the supervised learning example)
Classification
Classify the data into different classes, e.g. classify an email as spam or not spam
# Download MNIST dataset
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, as_frame=True)
X = mnist['data']
y = mnist['target']
print(X.shape, y.shape)
(70000, 784) (70000,)
X.head()
 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | pixel9 | pixel10 | ... | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 | pixel784 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 784 columns
y.head()
0 5
1 0
2 4
3 1
4 9
Name: class, dtype: category
Categories (10, object): ['0', '1', '2', '3', ..., '6', '7', '8', '9']
# Plot the first 5 images
import matplotlib.pyplot as plt

for i in range(5):
    some_digit = X.iloc[i]
    some_digit_image = some_digit.values.reshape(28, 28)

    plt.imshow(some_digit_image, cmap='binary')
    plt.axis('off')

    # Draw label y on the bottom
    plt.text(0, 28, y[i], fontsize=14, color='g')
    plt.show()
This is a classification problem: given images, classify them into the correct digits.
Confusion Matrix
Classification needs different metrics to evaluate the model, and the right metric can differ from problem to problem.
A confusion matrix is a table that visualizes the performance of a classification model.
Example of Confusion Matrix
 | Predicted: Not Spam | Predicted: Spam |
---|---|---|
Actual: Not Spam | True Negative | False Positive |
Actual: Spam | False Negative | True Positive |
import matplotlib.pyplot as plt

# split the data into train and test
train_size = 60000
train_X, test_X = X[:train_size], X[train_size:]
train_y, test_y = y[:train_size], y[train_size:]

# binary target: is the digit a 7?
is_7_train = train_y == '7'
is_7_test = test_y == '7'

# Train the model
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(train_X, is_7_train)
SGDClassifier(random_state=42)
# Draw confusion matrix
from sklearn.metrics import confusion_matrix

y_pred = sgd_clf.predict(test_X)
cm = confusion_matrix(is_7_test, y_pred)
print(cm)
[[8902 70]
[ 107 921]]
False Positive
# Draw the false positive
import matplotlib.pyplot as plt

false_positive = (is_7_test == False) & (y_pred == True)
false_positive_images = test_X[false_positive]

for i in range(3):
    some_digit = false_positive_images.iloc[i]
    some_digit_image = some_digit.values.reshape(28, 28)

    plt.imshow(some_digit_image, cmap='binary')
    plt.axis('off')

    # Draw label y on the bottom
    plt.text(0, 28, 'False Positive (predicted 7, but it is NOT 7)', fontsize=14, color='r')
    plt.show()
False Negative
# Draw the false negative
import matplotlib.pyplot as plt

false_negative = (is_7_test == True) & (y_pred == False)
false_negative_images = test_X[false_negative]

for i in range(3):
    some_digit = false_negative_images.iloc[i]
    some_digit_image = some_digit.values.reshape(28, 28)

    plt.imshow(some_digit_image, cmap='binary')
    plt.axis('off')

    # Draw label y on the bottom
    plt.text(0, 28, 'False Negative (predicted NOT 7, but it is 7)', fontsize=14, color='r')
    plt.show()
True Negative
# Draw the true negative
import matplotlib.pyplot as plt

true_negative = (is_7_test == False) & (y_pred == False)
true_negative_images = test_X[true_negative]

for i in range(3):
    some_digit = true_negative_images.iloc[i]
    some_digit_image = some_digit.values.reshape(28, 28)

    plt.imshow(some_digit_image, cmap='binary')
    plt.axis('off')

    # Draw label y on the bottom
    plt.text(0, 28, 'True Negative (predicted NOT 7, and it is NOT 7)', fontsize=14, color='g')
    plt.show()
True Positive
# Draw the true positive
import matplotlib.pyplot as plt

true_positive = (is_7_test == True) & (y_pred == True)
true_positive_images = test_X[true_positive]

for i in range(3):
    some_digit = true_positive_images.iloc[i]
    some_digit_image = some_digit.values.reshape(28, 28)

    plt.imshow(some_digit_image, cmap='binary')
    plt.axis('off')

    # Draw label y on the bottom
    plt.text(0, 28, 'True Positive (predicted 7, and it is 7)', fontsize=14, color='g')
    plt.show()
Why not use accuracy?
Accuracy may not be a good metric for a classification problem, because the data can be imbalanced.
For example:
- 99% of the email is not spam
- 1% of the email is spam
If the model always predicts the email as not spam, the accuracy is 99%, but the model is not useful at all.
Confusion Matrix:
 | Predicted: Not Spam | Predicted: Spam |
---|---|---|
Actual: Not Spam | 99 | 0 |
Actual: Spam | 1 | 0 |
Or in our MNIST example:
When we want to classify whether an image is the number 7 or not, a model that always predicts "not 7" will still have about 90% accuracy.
Confusion Matrix:
 | Predicted: Not 7 | Predicted: 7 |
---|---|---|
Actual: Not 7 | 90 | 0 |
Actual: 7 | 10 | 0 |
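We can reproduce this baseline with scikit-learn's DummyClassifier, which is a sanity check rather than a real model. A minimal sketch reusing train_X, is_7_train, test_X, and is_7_test from the cells above:

# An "always predict the majority class" baseline: high accuracy, useless model
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

dummy = DummyClassifier(strategy='most_frequent')  # always predicts "not 7"
dummy.fit(train_X, is_7_train)
baseline_pred = dummy.predict(test_X)

print(f"Accuracy: {accuracy_score(is_7_test, baseline_pred):.2f}")  # ~0.90
print(confusion_matrix(is_7_test, baseline_pred))  # every prediction lands in the "not 7" column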
Precision & Recall
We need other metrics!
\[\begin{align*} \text{Recall} &= \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \\ \text{Precision} &= \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \end{align*}\]
Source: Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition
Example:
 | Predicted: Not 7 | Predicted: 7 |
---|---|---|
Actual: Not 7 | 90 | 0 |
Actual: 7 | 10 | 0 |
- Recall = 0 / (0 + 10) = 0
- Precision = 0 / (0 + 0) = undefined (no positive predictions at all; conventionally treated as 0)
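scikit-learn computes both directly. A minimal sketch, reusing is_7_test and y_pred from the SGD classifier cells above (the printed values depend on that run):

# Precision and recall for the "is it a 7?" classifier above
from sklearn.metrics import precision_score, recall_score

precision = precision_score(is_7_test, y_pred)  # TP / (TP + FP)
recall = recall_score(is_7_test, y_pred)        # TP / (TP + FN)
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")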
Perfect Recall
Confusion Matrix:
 | Predicted: Not 7 | Predicted: 7 |
---|---|---|
Actual: Not 7 | 0 | 90 |
Actual: 7 | 0 | 10 |
If our model always predicts ALL images as 7, the recall will be 100% (10 / (10 + 0) = 1).
Perfect recall means the model never misses a 7, but it also predicts many non-7s as 7.
Q: Is "7" seven?
A: Yes, it is seven
Q: Is "7" seven?
A: Yes, it is seven
Q: Is "8" seven?
A: Yes, it is seven
Q: Is "9" seven?
A: Yes, it is seven
Perfect Precision
Confusion Matrix:
 | Predicted: Not 7 | Predicted: 7 |
---|---|---|
Actual: Not 7 | 90 | 0 |
Actual: 7 | 9 | 1 |
If our model is very careful and only predicts an image as 7 when it is very sure, we will have perfect precision.
Perfect precision means that when the model predicts an image as 7, it actually is a 7. But the model may miss a lot of 7s.
Q: Is "7" seven?
A: Yes, it is seven
Q: Is "8" seven?
A: No, it is not seven
Q: Is "9" seven?
A: No, it is not seven
Q: Is "7" seven?
A: No, it is not seven
Q: Is "7" seven?
A: No, it is not seven
When to use precision and when to use recall?
It depends on the problem.
High Recall
- High Recall is prioritized when the cost of false negatives is high.
- A false negative (a person who has cancer but is predicted as not having it) could lead to lack of treatment and dire health implications.
- A false positive (a person who doesn’t have cancer but is predicted as having it) would lead to further tests, which might be stressful and costly but isn’t immediately harmful.
High Precision
- High Precision is prioritized when the cost of false positives is high.
- A false positive (a legitimate transaction incorrectly flagged as fraudulent) can lead to customer frustration.
- A false negative (missing a fraudulent transaction) may be deemed more acceptable than annoying or alienating a large number of genuine customers.
F1 Score
F1 score is a metric that combines precision and recall.
The formula for F1 score is:
\[F1 = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}}\]
The F1 score favors classifiers that have similar precision and recall. This is not always what you want: in some contexts you mostly care about precision, and in other contexts you really care about recall.
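A minimal sketch computing it for the same "is it a 7?" classifier, again reusing is_7_test and y_pred from the cells above:

# F1: the harmonic mean of precision and recall
from sklearn.metrics import f1_score, precision_score, recall_score

precision = precision_score(is_7_test, y_pred)
recall = recall_score(is_7_test, y_pred)

print(f"From the formula:   {2 / (1 / precision + 1 / recall):.2f}")
print(f"sklearn's f1_score: {f1_score(is_7_test, y_pred):.2f}")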