Supervised Learning
What is Supervised Learning? Maybe that’s one of your big questions when you get into this topic. But, before digging deeper into Supervised Learning, it’s a good idea to first do a quick recap about Machine Learning and what it has to do with Supervised Learning.
Introduction to Machine Learning
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that allows software applications to become more accurate in predicting outcomes without being explicitly programmed. In simple terms, it provides machines the ability to automatically learn and improve from experience.
It’s like teaching a small child how to walk. First, the kid observes how others do it, then they try to do it themselves and keep on trying until they walk perfectly. This process involves a lot of falling and getting up. Similarly, in machine learning, models are given data, with which they make predictions. Based on the outcome of those predictions—right or wrong, the models adjust their behaviours, learning and improving over time to make accurate predictions.
According to Arthur Samuel (American pioneer in AI and gaming), Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed.
Definition and types of Machine Learning Algorithms
There are various machine learning algorithms:
Supervised Learning - In supervised learning, we provide the machine with labelled input data and a correct output. In other words, we guide the machine towards the right output. It’s kind of like a student learning under the guidance of a teacher. The ‘teacher’, in this case, is the label that tells the model about the data so it can learn from it. Once the model is trained with this labelled data, it can start to make predictions or decisions when new, unlabeled data is fed to it. A common example of supervised learning is classifying emails into “spam” or “not spam”.
Unsupervised Learning - Unlike supervised learning, we do not have the comfort of a ‘teacher’ or labelled data here. In unsupervised learning, we only provide input to the machine learning model but no corresponding output. That is, the model is not told the right answer. The idea is for the model to explore the data and find some structure within. A classic example of unsupervised learning is customer segmentation, where you want to group your customers into different segments based on purchasing behaviour, but you don’t have any pre-existing labels.
Recommender Systems - Imagine walking into a bookstore and having an expert assistant who knows your reading habits guide you to books that would perfectly suit your taste – wouldn’t it be a delight? This is what recommender systems strive to do in virtual environments. They are intelligent algorithms that create a personalized list of suggestions based on data about each user’s past behaviour or user-item interactions.
Reinforcement Learning - This is a bit like learning to ride a bike. You don’t know how to do it at first, but with trial and error, you learn that pedalling keeps you balanced and turning the handlebars allows you to change direction. That’s essentially how reinforcement learning works - the model learns by interacting with its environment and receiving rewards or penalties.
What is Supervised Learning?
Now, let’s dig deeper into supervised learning, as it’s the most commonly used type of machine learning.
In supervised learning, we train the model using ‘labeled’ data. That means our data set includes both the input (\(x\)) data and its corresponding correct output (\(y\)). This pair of input-output is also known as a ‘feature’ and a ‘label’. An easy way to think about this is with a recipe: the ‘feature’ is a list of ingredients (our input), and the ‘label’ is the dish that results from those ingredients (our correct output).
The algorithm analyses these input-output pairs, maps the function that transforms input to output, and tries to gain an understanding of this relationship so that it can apply it to new, unlabeled data.
There are two main types of problems that Supervised Learning tackles: Regression and Classification. 1. Regression: predicts a number from infinitely many possible numbers (like predicting the price of a house based on its features such as size, location, etc.). 2. Classification: predicts one of a small number of possible outputs or categories (like determining whether an email is spam or not spam, whether a tumor is malignant or benign, or whether a picture is of a cat or a dog).
Overall, supervised learning offers an efficient way to use known data to make meaningful predictions about new, future data. By analyzing the relationships within labeled training data, supervised learning models can then apply what they’ve learned to unlabeled, real-world data and make informed decisions or predictions.
Linear Regression
Let’s dive into one of the most basic and fundamental algorithms in the realm of Supervised Learning: Linear Regression.
Introduction to Linear Regression
Imagine you’re a real estate agent, and you’re helping customers sell their houses at a reasonable market price. To do this, you need a way to predict a house’s market price based on its characteristics, such as the number of rooms, the year it was built, the neighborhood it’s in, etc.
This is where Linear Regression comes in!
Linear Regression is like drawing a straight line through a cloud of data points. The line represents the relationship between the independent variables (the house characteristics, or ‘features’) and the dependent variable (the house price, or ‘target variable’).
But you might be wondering, “Why is it called ‘Linear’ Regression?” Well, it’s because this approach assumes that the relationship between the independent and dependent variables is linear. This simply means that if you were to plot this relationship on a graph, you could draw a straight line, or ‘linear’ line, to represent this relationship.
In simple terms, we’re trying to draw a line of best fit through our data points that minimizes the distance between the line and all the data points. Let’s take a look below, this is how the linear regression process works:
A supervised learning training set includes both the input features and the output targets; the output targets are the right answers that the model learns from. To train the model, you feed the training set, both the input features and the output targets, to your learning algorithm. Your supervised learning algorithm then produces some function (\(f\)). The job of \(f\) is to take a new input \(x\) and output an estimate or prediction (\(\hat{y}\)). In machine learning, the convention is that \(\hat{y}\) is the estimate or the prediction for \(y\).
The model for Linear Regression is a linear function of the input features. It’s represented as:
\[ f(x^{(i)}) = wx^{(i)} + b\]
General Notation | Description |
---|---|
\(x\) | Training example feature values |
\(y\) | Training example targets |
\(x^{(i)}\), \(y^{(i)}\) | \(i^{th}\) training example |
\(\hat{y}\) | Prediction (estimate) of \(y\) |
\(m\) | Number of training examples |
\(w\) | parameter: weight |
\(b\) | parameter: bias |
\(f\) | The model for Linear Regression |
If you pay attention, in a linear function there are \(w\) and \(b\). The \(w\) and \(b\) are called the parameters of the model. In machine learning parameters of the model are the variables you can adjust during training in order to improve the model. Sometimes you also hear the parameters \(w\) and \(b\) referred to as weights and bias. Now let’s take a look at what these parameters \(w\) and \(b\) do.
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 5)
y = np.arange(0, 5)
Case 1
If \(w\) = 0 and \(b\) = 1.5, then \(f\) looks like a horizontal line.
In this case the function is \(f(x) = 0 \cdot x + 1.5\), so \(f\) is always constant. It always predicts 1.5 for the estimated \(y\) value.
\(\hat{y}\) is always equal to \(b\) and here \(b\) is also called y-intercept because it intersects the vertical axis or y-axis on this graph.
w = 0
b = 1.5

def compute(x, w, b):
    return w * x + b

tmp_fx = compute(x, w, b)

tmp_x = 2.5
y = w * tmp_x + b

plt.plot(x, tmp_fx, c='r')
plt.scatter(tmp_x, y, marker='o', c='b')
plt.xlim(0, 3)
plt.ylim(0, 3)
plt.title("w=0, b=1.5 -> y = 0x + 1.5")
plt.show()
Case 2
If \(w\) = 0.5 and \(b\) = 0, then \(f(x) = 0.5 \cdot x\).
If \(x\) = 0 then the prediction is also 0, and if \(x\) = 2 then the prediction is 0.5 * 2, which is 1.
You get a line like this; notice that the slope is 0.5 / 1. The value of \(w\) gives you the slope of the line, which is 0.5.
w = 0.5
b = 0

def compute(x, w, b):
    return w * x + b

tmp_fx = compute(x, w, b)

tmp_x = 2
y = w * tmp_x + b

plt.plot(x, tmp_fx, c='r')
plt.scatter(tmp_x, y, marker='o', c='b')
plt.xlim(0, 3)
plt.ylim(0, 3)
plt.title("w=0.5, b=0 -> y = 0.5x + 0")
plt.show()
Case 3
If \(w\) = 0.5 and \(b\) = 1, then \(f(x) = 0.5 \cdot x + 1\). If \(x\) = 0, then \(f(x) = b\), which is 1, so the line intersects the vertical axis at \(b\), the y-intercept.
Also, if \(x\) = 2, then \(f(x) = 2\), so the line looks like this.
Again, the slope is 0.5 / 1, so the value of \(w\) gives a slope of 0.5.
w = 0.5
b = 1

def compute(x, w, b):
    return w * x + b

tmp_fx = compute(x, w, b)

tmp_x = 2
y = w * tmp_x + b

plt.plot(x, tmp_fx, c='r')
plt.scatter(tmp_x, y, marker='o', c='b')
plt.xlim(0, 3)
plt.ylim(0, 3)
plt.title("w=0.5, b=1 -> y = 0.5x + 1")
plt.show()
Let’s consider a simple case study: predicting a house’s price based on its size. The training set can be represented as follows:
House Size (100 sq.ft) (\(x\)) | Price (10.000s USD) (\(y\)) |
---|---|
6.5 | 77.2 |
7.6 | 99.8 |
12.0 | 120.5 |
7.20 | 89.5 |
9.75 | 113.5 |
10.75 | 124.2 |
9.0 | 106.0 |
11.5 | 133.0 |
8.6 | 101.5 |
13.25 | 148.2 |
Explanation:
In this example, we have house sizes as our independent variable (input) and corresponding house price as the dependent variable (output we want to predict). Assuming a linear relationship between size and price, we can use a linear regression model to predict the price based on the size.
import numpy as np
import matplotlib.pyplot as plt
x_train = np.array([6.50, 7.60, 12.00, 7.20, 9.75, 10.75, 9.00, 11.50, 8.60, 13.25])
y_train = np.array([77.2, 99.8, 120.5, 89.5, 113.5, 124.2, 106.0, 133.0, 101.5, 148.2])

# Plot the data points
plt.scatter(x_train, y_train, marker='x', c='b')
# Set the title
plt.title("Housing Prices")
# Set the y-axis label
plt.ylabel('Price (in 10.000s of dollars)')
# Set the x-axis label
plt.xlabel('Size (in 100s sqft)')
plt.show()
Now, the question is: how do we predict the price of a house of size 9.5 (that is, 950 sq. ft)? We can use the Linear Regression model to do this.
With the amount of data we have, it would be tedious to do the calculations one by one by hand. Instead, you can compute the model output in a for loop, as shown in the compute_output function below.
def compute_output(x, w, b):
    """
    Computes the prediction of a linear model
    Args:
      x : Data, m examples
      w,b : model parameters
    Returns
      f_wb : model prediction
    """
    m = len(x)
    f_wb = np.zeros(m)
    # Compute the prediction for each example
    for i in range(m):
        f_wb[i] = w * x[i] + b

    return f_wb
# Please replace the values of w and b with numbers that you think are appropriate
w = 9.8
b = 15

tmp_f_wb = compute_output(x_train, w, b)

# Predict the price of a 950 sq.ft house (x = 9.5)
tmp_x = 9.5
y_hat = w * tmp_x + b

# Plot the price of the 950 sq.ft house
plt.scatter(tmp_x, y_hat, marker='o', c='black', label='950 sqft house')
plt.text(11.5, 80, f"Price = {y_hat:.2f}", fontsize=12)

# Plot our model prediction
plt.plot(x_train, tmp_f_wb, c='r', label='Our Prediction')

# Plot the data points
plt.scatter(x_train, y_train, marker='x', c='b', label='Actual Values')

# Set the title
plt.title("Housing Prices")
# Set the y-axis label
plt.ylabel('Price (in 10.000s of dollars)')
# Set the x-axis label
plt.xlabel('Size (in 100s sqft)')

plt.legend()
plt.show()
Loss function
Now, how do we know if the line is a good fit?
We can calculate the error between the predicted value and the actual value.
For example, if the actual value is \(y\), and the predicted value is \(\hat{y}\), then the error is \(y - \hat{y}\).
We can calculate the error for each data point, and then sum them up.
The error function is:
\[ E = \sum_{i=1}^{n} (y_i - \hat{y_i})^2 \]
where \(y\) is the actual value, and \(\hat{y}\) is the predicted value.
This is called the sum of squared errors (SSE).
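As a quick illustration, here is a minimal NumPy sketch of this calculation on a few made-up points (the arrays are hypothetical, just to show the formula in code):

import numpy as np

y_actual = np.array([3.0, 5.0, 7.0])      # hypothetical actual values
y_predicted = np.array([2.5, 5.5, 6.0])   # hypothetical predictions

# Error for each data point, then square and sum them (SSE)
errors = y_actual - y_predicted
sse = np.sum(errors ** 2)
print(sse)  # 0.25 + 0.25 + 1.0 = 1.5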
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display

# Create scattered dots around y = 3x + 8
x_data = np.random.rand(100) * 10
noise = np.random.normal(0, 2, x_data.shape)
y_data = 3*x_data + 8 + noise

# Define the update function
def update(a, b):
    # Line predicted by the current slider values
    y_pred = a*x_data + b

    plt.plot(x_data, y_pred, color='red')
    plt.scatter(x_data, y_data, s=1)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title(f"Equation y = {a}x + {b}")

    # Calculate the loss (sum of squared errors)
    loss = np.sum((y_data - y_pred)**2)

    # Draw the loss
    plt.text(0, 20, f"Loss = {loss:.2f}", fontsize=12)

    plt.show()

# Define the slider widgets
a_slider = widgets.FloatSlider(min=0, max=10, step=0.1, value=0, description='a:')
b_slider = widgets.FloatSlider(min=0, max=10, step=0.1, value=0, description='b:')

# Display the widgets and plot
widgets.interactive(update, a=a_slider, b=b_slider)
So, the goal of linear regression is to find the best a and b that minimize the loss function.
In other words, we want to find the best a and b that make the red line as close as possible to the scattered dots.
Other Loss Functions
There are other loss functions, for example:
Mean Square Error (MSE)
\[ E = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 \]
Mean Absolute Error (MAE)
\[ E = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y_i}| \]
Root Mean Squared Error (RMSE)
\[ E = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y_i})^2} \]
RMSE is in the same units as the target variable, which can make it more interpretable in some cases.
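To make these metrics concrete, here is a small NumPy sketch that computes all three on made-up arrays (hypothetical values, purely for illustration):

import numpy as np

y_actual = np.array([3.0, 5.0, 7.0])      # hypothetical actual values
y_predicted = np.array([2.5, 5.5, 6.0])   # hypothetical predictions
n = len(y_actual)

mse = np.sum((y_actual - y_predicted) ** 2) / n     # Mean Squared Error: 0.5
mae = np.sum(np.abs(y_actual - y_predicted)) / n    # Mean Absolute Error: ~0.667
rmse = np.sqrt(mse)                                 # Root Mean Squared Error: ~0.707

print(f"MSE: {mse:.4f}, MAE: {mae:.4f}, RMSE: {rmse:.4f}")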
The loss function, also called the cost function, tells us how well the model performs so we can try to make it better. The cost function itself measures the difference (error/loss rate) between the model prediction and the actual value for \(y\).
Let’s break down this cost formula to help you understand:
The cost function takes the prediction \(\hat{y}\) and compares it to the target \(y\) by taking the difference and squaring it:
\[ (f(x^{(i)}) - y^{(i)})^2 \]
Then we sum these squared differences over all the examples:
\[J(w,b) = \sum\limits_{i = 0}^{m-1} (f(x^{(i)}) - y^{(i)})^2\]
But that equation is not fair because it doesn’t take into account the number of examples we have. So we divide it by the number of examples, \(m\).
\[J(w,b) = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f(x^{(i)}) - y^{(i)})^2\]
where \[f(x^{(i)}) = wx^{(i)} + b\]
- \(f(x^{(i)})\) is our prediction for example \(i\) using parameters \(w,b\).
- \((f(x^{(i)}) -y^{(i)})^2\) is the squared difference between the target value and the prediction.
- These differences are summed over all the \(m\) examples and divided by \(m\) to produce the cost, \(J(w,b)\).
Look at the code below, which calculates the cost by looping over each example. In each iteration:
- \(fx\), a prediction is calculated
- the difference between the target and the prediction is calculated and squared.
- this is added to the total cost.
def compute_cost(x, y, w, b):
    """
    Computes the cost function for linear regression.
    Args:
      x : Data, m examples
      y : target values
      w,b : model parameters
    Returns
      total_cost (float): The cost of using w,b as the parameters for linear regression
             to fit the data points in x and y
    """
    # number of training examples
    m = len(x)

    cost_sum = 0
    for i in range(m):
        # prediction for example i
        fx = w * x[i] + b
        # squared difference between prediction and target
        cost = (fx - y[i]) ** 2
        cost_sum = cost_sum + cost
    total_cost = (1 / m) * cost_sum

    return total_cost
x_train = np.array([6.50, 7.60, 12.00, 7.20, 9.75, 10.75, 9.00, 11.50, 8.60, 13.25])
y_train = np.array([77.2, 99.8, 120.5, 89.5, 113.5, 124.2, 106.0, 133.0, 101.5, 148.2])
w = 10
b = 15

cost = compute_cost(x_train, y_train, w, b)
print(f"Cost: {cost}")
Based on our data and these values of \(w\) and \(b\), the compute_cost function gives a cost of 31.641.
Is that the best we can do? Remember that the goal of linear regression is to minimize this cost.
How do we do that? In other words, we want to find the \(w\) and \(b\) values that bring the line as close as possible to the scattered points.
One straightforward (but slow) way is to simply try many combinations of \(w\) and \(b\) and keep the pair with the lowest cost:
import numpy as np

answer = ()
minimum_cost = 100000

# Brute-force search over a grid of candidate w and b values
for w in np.arange(8.0, 12.0, 0.001):
    for b in np.arange(13.0, 15.0, 0.001):
        cost = compute_cost(x_train, y_train, w, b)
        if cost < minimum_cost:
            minimum_cost = cost
            answer = (w, b)

print(f"Minimum cost: {minimum_cost}")
print(f"w: {answer[0]} b: {answer[1]}")
Scikit-learn for Linear Regression
The sklearn
(or Scikit-learn) library in Python is a powerful, open-source library that provides simple and efficient tools for data analysis and modeling, including machine learning. It has numerous features, ranging from simple regression models to complex machine learning models.
For our Linear Regression model, sklearn
provides the LinearRegression
class within the sklearn.linear_model
module.
Here’s how you can do it:
from sklearn.linear_model import LinearRegression

# Reshape the input array into a 2D column vector
x_train = np.array([6.50, 7.60, 12.00, 7.20, 9.75, 10.75, 9.00, 11.50, 8.60, 13.25]).reshape((-1, 1))
y_train = np.array([77.2, 99.8, 120.5, 89.5, 113.5, 124.2, 106.0, 133.0, 101.5, 148.2])

model = LinearRegression()

# Fit the model to the training data
model.fit(x_train, y_train)

# Display the coefficient and the intercept
coefficient = model.coef_  # w
intercept = model.intercept_  # b
print(f"Coefficients: {coefficient}")
print(f"Intercept: {intercept}")
Coefficients: [9.09553728]
Intercept: 23.886409100809274
From the results of the code above, the \(w\) (coefficient) value is 9.09553728 and \(b\) (intercept) is 23.886409100809274. These values give the smallest possible error/loss for our data. Now let's look at our data again with the new \(w\) and \(b\) values and see what error/loss we get.
x_train = np.array([6.50, 7.60, 12.00, 7.20, 9.75, 10.75, 9.00, 11.50, 8.60, 13.25])
y_train = np.array([77.2, 99.8, 120.5, 89.5, 113.5, 124.2, 106.0, 133.0, 101.5, 148.2])
w = coefficient[0]
b = intercept

cost = compute_cost(x_train, y_train, w, b)
tmp_f_wb = compute_output(x_train, w, b)

# Predict the price of a 950 sq.ft house (x = 9.5)
tmp_x = 9.5
y_hat = w * tmp_x + b

# Plot the price of the 950 sq.ft house
plt.scatter(tmp_x, y_hat, marker='o', c='black', label='950 sqft house')
plt.text(12, 75, f"Price = {y_hat:.2f}", fontsize=10)

# Plot our model prediction
plt.plot(x_train, tmp_f_wb, c='r', label='Our Prediction')

# Plot the data points
plt.scatter(x_train, y_train, marker='x', c='b', label='Actual Values')
plt.text(12, 80, f"Cost = {cost:.2f}", fontsize=10)

# Set the title
plt.title("Housing Prices")
# Set the y-axis label
plt.ylabel('Price (in 10.000s of dollars)')
# Set the x-axis label
plt.xlabel('Size (in 100s sqft)')

plt.legend()
plt.show()
Based on the graph above, we can see a decrease in the error/loss, from the previous value of 31.641 down to 27.94. The model's predictions of house prices are therefore more accurate.
You can also make predictions with this trained model. For example, if you want to predict the \(y\) value for x = 9.5, you can do it using the predict() method like this:
x_new = np.array([9.5]).reshape((-1, 1))
y_pred = model.predict(x_new)
print(f"Predicted value for x = 9.5: {y_pred}")
Predicted value for x = 9.5: [110.29401321]
But how can we find the best \(w\) and \(b\) values ourselves, without relying on a library? A widely used approach is the Gradient Descent algorithm, which will be discussed in the next section.
Linear Regression with Multiple Input Variables
Multiple Features
In real-world scenarios, an output variable or a result often depends on more than one factor or input variable. In machine learning, these factors are referred to as “features”. In the previous discussion, you had one feature \(x\), the size of the house, and you could predict \(y\), the price of the house. But what if you don't use just the size of the house? You could also use other features to predict the price, such as the number of bedrooms and bathrooms and the age of the house. Let's look at the example below:
General Notation | Description |
---|---|
\(x_1\) | House Size |
\(x_2\) | Number of Bedrooms |
\(x_3\) | Number of Bathrooms |
\(x_4\) | House Age |
\(y\) | Price |
\(x_{j}\) | The \(j^{th}\) feature |
\(n\) | Number of features |
\(\vec{x}^{(i)}\) | Features of the \(i^{th}\) training example (a row vector) |
\(x^{(i)}_{j}\) | Value of feature \(j\) in the \(i^{th}\) training example |
If you just used the size of a house to predict its price, it would be a simple linear regression problem. In the previous discussion, this is how we defined the model, with \(x\) as a single feature:
\[ f(x) = wx + b\]
But now with many features, we will define them differently. We still multiply \(w\) with \(x\), but since now we have multiples of \(w.x\), we sum them up together. The model then look like this:
\[ f(\vec{x}) = w_{1}x_{1} + w_{2}x_{2} + \dots + w_{n}x_{n} + b\]
or in vector notation can be written like this:
\[ f(\vec{x}) = \vec{w} \cdot \vec{x} + b\]
where \(\cdot\) is a vector dot product
General Notation | Description |
---|---|
\(\vec{w}\) | [ \(w_{1}, w_{2}, \dots , w_{n}\) ] |
\(b\) | Bias parameter (a single number) |
\(\vec{x}\) | [ \(x_{1}, x_{2}, \dots , x_{n}\) ] |
Here, \(\vec{w}\) are parameters of the model that it learns during the training phase. The \(\vec{x}\) are the features. Using multiple features usually gives a better prediction than using just one feature, as real-world scenarios are often influenced by more than just one factor. It’s worth noting, however, that adding more and more features can also lead to overfitting where the model learns the training data too well and performs poorly on unseen data.
Note: We’ll learn more about the dot product in the next material on Matrix and Vector.
Let’s go back to our case, where we now have multiple features to predict house prices:
House Size (100 sq.ft) (\(x_1\)) | Bedrooms (\(x_2\)) | Bathrooms (\(x_3\)) | Age (years) (\(x_4\)) | Price (10.000s USD) (\(y\)) |
---|---|---|---|---|
6.50 | 2 | 1 | 15 | 77.2 |
7.60 | 3 | 2 | 5 | 99.8 |
12.00 | 4 | 3 | 10 | 120.5 |
7.20 | 2 | 1 | 7 | 89.5 |
9.75 | 3 | 2 | 12 | 113.5 |
10.75 | 3 | 2 | 3 | 124.2 |
9.00 | 3 | 2 | 8 | 106.0 |
11.50 | 4 | 3 | 10 | 133.0 |
8.60 | 2 | 2 | 15 | 101.5 |
13.25 | 4 | 3 | 1 | 148.2 |
In this example, we not only have the size of the house as an independent variable, but also the number of bedrooms and bathrooms and the age of the house; the price of the house remains the dependent variable.
Let’s take what we learn and apply it to code:
X_train = np.array([
    [6.50, 2, 1, 15],
    [7.60, 3, 2, 5],
    [12.00, 4, 3, 10],
    [7.20, 2, 1, 7],
    [9.75, 3, 2, 12],
    [10.75, 3, 2, 3],
    [9.00, 3, 2, 8],
    [11.50, 4, 3, 10],
    [8.60, 2, 2, 15],
    [13.25, 4, 3, 1]
])

y_train = np.array([77.2, 99.8, 120.5, 89.5, 113.5, 124.2, 106.0, 133.0, 101.5, 148.2])

print(f"X Shape: {X_train.shape}, X Type:{type(X_train)})")
print(X_train)
print(f"y Shape: {y_train.shape}, y Type:{type(y_train)})")
print(y_train)
X Shape: (10, 4), X Type:<class 'numpy.ndarray'>)
[[ 6.5 2. 1. 15. ]
[ 7.6 3. 2. 5. ]
[12. 4. 3. 10. ]
[ 7.2 2. 1. 7. ]
[ 9.75 3. 2. 12. ]
[10.75 3. 2. 3. ]
[ 9. 3. 2. 8. ]
[11.5 4. 3. 10. ]
[ 8.6 2. 2. 15. ]
[13.25 4. 3. 1. ]]
y Shape: (10,), y Type:<class 'numpy.ndarray'>)
[ 77.2 99.8 120.5 89.5 113.5 124.2 106. 133. 101.5 148.2]
The parameter vector \(\vec{w}\) and the scalar \(b\) can be described like this:
- \(w\) is a vector with \(n\) elements.
- Each element contains the parameter associated with one feature.
- in our dataset, n is 4.
- notionally, we draw this as a column vector
- \(b\) is a scalar parameter.
b_init = 46.68178303822603
w_init = np.array([ 7.05979876, -4.1870478, 8.30627675, -0.94230069])
print(f"w_init shape: {w_init.shape}, b_init type: {type(b_init)}")
w_init shape: (4,), b_init type: <class 'float'>
Now, let’s look at a basic implementation using a for loop for computing the model’s prediction.
\[f(\vec{x}) = \sum\limits_{j = 1}^{n} w_{j}x_{j} + b\]
def predict_single_loop(x, w, b):
    """
    single predict using linear regression
    Args:
      x (ndarray): Shape (n,) example with multiple features
      w (ndarray): Shape (n,) model parameters
      b (scalar): model parameter
    Returns:
      f (scalar): prediction
    """
    n = len(x)
    f = 0
    for i in range(n):
        # contribution of feature i
        f_i = x[i] * w[i]
        f = f + f_i
    f = f + b
    return f

# get a row from our training data
x_vec = X_train[0,:]
print(f"x_vec shape {x_vec.shape}, x_vec value: {x_vec}")

# make a prediction
f_x = predict_single_loop(x_vec, w_init, b_init)
print(f"f_wb shape {f_x.shape}, prediction: {f_x}")
x_vec shape (4,), x_vec value: [ 6.5 2. 1. 15. ]
f_wb shape (), prediction: 78.36814577822602
def compute_cost_for_multiple_variables(X, y, w, b):
    """
    Computes the cost function for linear regression.
    Args:
      X : Data, m examples
      y : target values
      w,b : model parameters
    Returns
      total_cost (float): The cost of using w,b as the parameters for linear regression
             to fit the data points in X and y
    """
    # number of training examples
    m = len(y)

    # hypothesis calculation - output prediction
    fx = np.dot(X, w) + b

    # error calculation
    errors = fx - y

    # calculate total cost
    total_cost = (1 / m) * np.dot(errors.T, errors)

    return total_cost
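As a quick sanity check, here is a minimal sketch of calling this vectorized cost function with the training data and parameter values defined earlier (it assumes X_train, y_train, w_init, and b_init from the previous cells are still in scope):

# Cost of the multi-feature model using the parameters defined above
cost = compute_cost_for_multiple_variables(X_train, y_train, w_init, b_init)
print(f"Cost at w_init, b_init: {cost}")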
Now, there is a very interesting trick called vectorization, which makes it easier to implement this algorithm and many other learning algorithms. It also runs much faster! Let's go into the next discussion to see what vectorization is.
Vectorization
Vectorization is a powerful ability within NumPy (and other similar libraries) to express operations as occurring on entire arrays, rather than their individual elements.
When we loop over elements of an array and perform operations, this is called scalar computation. In Python, scalar computation is relatively slow compared to vectorized computation. This is where vectorization can be of great use in machine learning tasks which often require performing operations on large amounts of data.
Now, let’s look at how we can do this using vectorization.
\[ f(\vec{x}) = \vec{w} \cdot \vec{x} + b\]
And now you can implement it with one line of code like below using dot()
method from NumPy:
def predict(x, w, b):
    """
    single predict using linear regression
    Args:
      x (ndarray): Shape (n,) example with multiple features
      w (ndarray): Shape (n,) model parameters
      b (scalar): model parameter
    Returns:
      f (scalar): prediction
    """
    f = np.dot(x, w) + b
    return f

# get a row from our training data
x_vec = X_train[0,:]
print(f"x_vec shape {x_vec.shape}, x_vec value: {x_vec}")

# make a prediction
f_wb = predict(x_vec, w_init, b_init)
print(f"f_wb shape {f_wb.shape}, prediction: {f_wb}")
x_vec shape (4,), x_vec value: [ 6.5 2. 1. 15. ]
f_wb shape (), prediction: 78.36814577822602
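To get a feel for why this matters, you can time the looped (scalar) computation against np.dot on a large random vector. This is a rough sketch; the array size is an arbitrary choice and the exact timings depend on your machine:

import time
import numpy as np

n = 1_000_000
a = np.random.rand(n)
w_big = np.random.rand(n)

# Looped (scalar) computation of the dot product
start = time.time()
total = 0.0
for i in range(n):
    total = total + a[i] * w_big[i]
loop_time = time.time() - start

# Vectorized computation with np.dot
start = time.time()
total_vec = np.dot(a, w_big)
vec_time = time.time() - start

print(f"loop result: {total:.4f}, time: {loop_time:.4f} s")
print(f"np.dot result: {total_vec:.4f}, time: {vec_time:.6f} s")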
How many features do we need?
In the previous discussion, we have discussed how to use multiple features to predict house prices. But, how many features do we need?
Should we add the following features to our model?
- The number of floors
- The size of the backyard
- Is it near a school?
- Is it near a mall?
- Does it have a swimming pool?
More features are not always better. Adding more features can lead to overfitting, where the model learns the training data too well and performs poorly on unseen data.
Feature Engineering
Feature choices can have a big impact on the performance of your learning algorithm. In fact, for many practical applications, selecting or including the right features is a critical step for an algorithm to function properly. In this topic, let's see how we can select or engineer the most suitable features for our learning algorithm.
Feature Engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. In other words, you’re creating new features from the existing ones, or even removing some that you deem unnecessary.
Feature engineering can be considered as adding more useful information for your model to learn from. In the context of linear regression with multiple input variables, feature engineering can be crucial because the relationship between the response variable and predictors may not always be linear or may depend on multiple interacting variables.
For a dataset describing different plots of land, let's say we have two original features – Width and Depth – along with the target variable, Price:
LandID | Width (m) | Depth (m) | Price ($) |
---|---|---|---|
1 | 10 | 20 | 2000 |
2 | 20 | 15 | 3000 |
3 | 15 | 20 | 2500 |
4 | 10 | 12 | 1200 |
5 | 12 | 15 | 2250 |
The simple Machine learning model using Width
and Depth
to predict price
could look like this:
\[ f(\vec{x}) = w_{1}x_{1} + w_{2}x_{2} + b\]
Now let’s see how feature engineering can enhance this dataset.
Creating new features
One simple feature engineering task on this dataset could be creating a new feature representing the total area
. While the width
and depth
by themselves might not be the best indicators of price
, land area
(calculated as depth
x width
) could be a stronger feature.
LandID | Width (m) | Depth (m) | Area (m²) | Price ($) |
---|---|---|---|---|
1 | 10 | 20 | 200 | 2000 |
2 | 20 | 15 | 300 | 3000 |
3 | 15 | 20 | 300 | 2500 |
4 | 10 | 12 | 120 | 1200 |
5 | 12 | 15 | 180 | 2250 |
In certain situations, raw features as given in the dataset might not have a clear and direct relationship with the target variable. In the example of predicting land price
, it’s possible that neither the depth
nor the width
of the land independently has a strong predictable effect on the price
.
However, when you combine them into a single feature (the total area, which is depth
multiplied by width
), you can create a new representation of your data that might have a better correlation or association with the target variable (price
, in this case). It’s common that a property, such as area here, can have more correlation to price
than either width
or depth
individually.
In other words, even though the width
and depth
independently may not be good predictors, their product (which represents the total area) might prove to be a very good predictor. This can be understood intuitively too - larger lands (greater area) usually cost more.
This process of creating new and more effective features is a key part of feature engineering. Good feature engineering can often make the difference between a weak model and a highly accurate one. It's an essential step in any machine learning project and often requires knowledge about the domain from which the data originates.
Now with the new feature, the Machine Learning model will look like this:
\[ f(\vec{x}) = w_{1}x_{1} + w_{2}x_{2} + w_{3}x_{3} + b\]
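In code, creating a feature like this is usually a one-line transformation. Below is a minimal sketch using the five rows from the land table above (the arrays are typed in by hand here purely for illustration):

import numpy as np

# Width, Depth and Price from the land table above
width = np.array([10, 20, 15, 10, 12], dtype=float)
depth = np.array([20, 15, 20, 12, 15], dtype=float)
price = np.array([2000, 3000, 2500, 1200, 2250], dtype=float)

# Feature engineering: create the new Area feature as Width x Depth
area = width * depth   # [200, 300, 300, 120, 180]

# Stack the original and engineered features into one input matrix
X = np.column_stack([width, depth, area])
print(X)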
Why Feature Engineering is Important in Linear Regression
Linear regression aims to establish a linear relationship between the input variables (features) and the single output variable (target). Feature engineering aids in improving this relationship, leading to more accurate models. Good features capture important aspects of the data and provide information that helps the model make correct predictions.
When it comes to linear regression, the importance of feature engineering arises from the fact that the model is quite simple. The model assumes a simple linear relationship between input and outputs. If the features can capture complex dependencies in simple forms, the model has a much easier time learning from them.
It is also important to note that feature engineering often requires domain knowledge. In the land-price example, knowing that land price is often a function of total area is key to creating the “area” feature from the original “width” and “depth” features. As such, while Machine Learning provides many powerful tools, domain expertise is often key to applying these tools effectively.
Here are some common methods used in feature engineering:
Feature scaling: This includes methods such as standardization and normalization. Feature scaling is used to standardize the range of features of data. This is important because features might have different units and may vary in range. Some algorithms, like gradient descent, converge faster when features are on a similar scale.
Polynomial features: This is useful when the relationship between the independent and dependent variable is not linear. Polynomial features create interactions between features and also create features to the power of the exponent. For example, if we have a feature \(x\), we can create a new feature that is \(x^2\) or \(x^3\).
Feature Scaling
Feature scaling is a technique that enables gradient descent (the algorithm behind model.fit) to run much faster. Feature scaling is a step in data preprocessing that aims to standardize the range of features in the data. When the range of values in different features varies greatly, the ML algorithm might take longer to converge, or it may not effectively learn the correct weights for each feature.
Two common types of feature scaling are:
- Mean Normalization
- Standardization (Z-score Scaling)
Note: gradient descent itself will be discussed in the next material.
Mean Normalization
This method centers the data around 0. It scales the data to roughly between -1 and 1 by subtracting the mean of the observations from each observation and then dividing by the difference between the maximum and minimum value.
\[x_i := \dfrac{x_i - \mu_i}{max - min} \]
So, when do we use it? Mean normalization is used when we want to center our variables around zero and have them vary between roughly -1 and 1. It can be useful when we want to eliminate the effect of different units on an algorithm. Keep in mind, though, that because it divides by the range (max - min), it can be sensitive to outliers.
Below is a Python example to illustrate this:
import numpy as np
# Here is an array with some numbers
= np.array([
x 650, 2, 1, 15],
[760, 3, 2, 5],
[1200, 4, 3, 10],
[720, 2, 1, 7],
[975, 3, 2, 12],
[1075, 3, 2, 3],
[900, 3, 2, 8],
[1150, 4, 3, 10],
[860, 2, 2, 15],
[1325, 4, 3, 1]
[=float)
], dtype
# Mean normalization
= (x - np.mean(x)) / (np.max(x) - np.min(x))
x_mean_norm print('Mean Normalized data:')
print(x_mean_norm)
Standardization (Z-score Scaling)
This method transforms the feature to have a mean of 0 and a standard deviation of 1 by subtracting the mean from each value and then dividing by the standard deviation.
Why do we use standardization? If a feature in our dataset has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly, leading to sub-optimal performance. Standardization mitigates this issue.
To implement z-score normalization, adjust our input values as shown in this formula: \[x^{(i)}_j = \dfrac{x^{(i)}_j - \mu_j}{\sigma_j}\] where \(j\) selects a feature or a column in the \(\mathbf{X}\) matrix. \(µ_j\) is the mean of all the values for feature (j) and \(\sigma_j\) is the standard deviation of feature (j). \[ \begin{align} \mu_j &= \frac{1}{m} \sum_{i=0}^{m-1} x^{(i)}_j\\ \sigma^2_j &= \frac{1}{m} \sum_{i=0}^{m-1} (x^{(i)}_j - \mu_j)^2 \end{align} \]
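Before reaching for a library, the formula above can also be implemented directly with a few lines of NumPy. This is a minimal sketch that assumes x is a 2D array of shape (examples, features); applied to the array used below, it should match StandardScaler's output up to floating-point rounding:

import numpy as np

def zscore_normalize(x):
    # Per-feature (column-wise) mean and standard deviation
    mu = np.mean(x, axis=0)
    sigma = np.std(x, axis=0)
    # Subtract the mean and divide by the standard deviation
    return (x - mu) / sigma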
Here’s an illustrative Python example using sklearn
library’s StandardScaler
:
from sklearn.preprocessing import StandardScaler
import numpy as np

# Here is an array with some random numbers
x = np.array([
    [650, 2, 1, 15],
    [760, 3, 2, 5],
    [1200, 4, 3, 10],
    [720, 2, 1, 7],
    [975, 3, 2, 12],
    [1075, 3, 2, 3],
    [900, 3, 2, 8],
    [1150, 4, 3, 10],
    [860, 2, 2, 15],
    [1325, 4, 3, 1]
], dtype=float)

# Standardization (per feature)
std_scaler = StandardScaler()
x_std = std_scaler.fit_transform(x)

print('Standardized data:')
print(x_std)
Standardized values are not bound to a particular range; they can be negative, zero, or positive. This kind of scaling is less affected by outliers or extreme values because it is based on the mean and standard deviation.
Rule of Thumb for Feature Scaling
Feature Range | Status |
---|---|
\(0 \le x_1 \le 3\) | OK, no rescaling needed |
\(-2 \le x_2 \le 0.5\) | OK, no rescaling needed |
\(-100 \le x_3 \le 100\) | Too large, rescale |
\(-0.001 \le x_4 \le 0.001\) | Too small, rescale |
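One way to apply this rule of thumb in practice is to print each feature's range and flag anything that looks too large or too small. The sketch below uses arbitrary example thresholds (the function name and cut-off values are our own choices, not a standard API):

import numpy as np

def check_feature_ranges(X, lower=-3.0, upper=3.0, tiny=0.01):
    """Print each feature's min/max and a rough rescaling suggestion."""
    for j in range(X.shape[1]):
        col_min, col_max = X[:, j].min(), X[:, j].max()
        if col_min < lower or col_max > upper:
            status = "range too large, consider rescaling"
        elif (col_max - col_min) < tiny:
            status = "range too small, consider rescaling"
        else:
            status = "ok, no rescaling needed"
        print(f"feature {j}: [{col_min}, {col_max}] -> {status}")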
Polynomial Regression
Now, we have learned that the linear regression algorithm can find a straight line that best fits the data. But what if the data is not linear? What if it is curved?
Polynomial Regression is a form of regression analysis in which the relationship between the independent variable \(x\) and the dependent variable \(y\) is modeled as an nth degree polynomial. Polynomial regression fits a nonlinear relationship between the value of \(x\) and the corresponding conditional mean of \(y\).
For example, a simple linear regression equation can be represented like this: \[ f(x) = wx + b\]
However, in polynomial regression, you might have an equation that looks like this for a second degree (quadratic) polynomial:
\[f(x) = w_{1}x + w_{2}x^{2} + b\]
This equation is still linear in the coefficients. The linearity in polynomial regression applies to the coefficients, and not the degree of the variables.
Polynomial regression can be very useful. When a linear regression model is not sufficient to capture the data relationships, it can be extended with polynomial features.
Let’s consider a simple case study. Imagine we are running a small business selling ice cream. We notice that the sales of our ice cream seems to be heavily influenced by the temperature outside.
Let’s say we’ve collected the following data regarding outside temperature (in Celsius) and ice cream sales for a few days:
Temperature (In Celsius) (\(x\)) | Ice Cream Sales (\(y\)) |
---|---|
14 | 200 |
16 | 225 |
18 | 255 |
20 | 290 |
22 | 330 |
24 | 370 |
26 | 410 |
28 | 440 |
30 | 460 |
32 | 470 |
34 | 480 |
36 | 485 |
37 | 490 |
import numpy as np
x_train = np.array([14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 37])
y_train = np.array([200, 225, 255, 290, 330, 370, 410, 440, 460, 470, 480, 485, 490])
We’re going to visualize our data first to get a better understanding
import matplotlib.pyplot as plt
plt.scatter(x_train, y_train)
plt.xlabel('Temperature in Celsius')
plt.ylabel('Ice Cream Sales')
plt.title('Ice Cream Sales vs Temperature')
plt.show()
Our data appears to show a non-linear trend: sales increase with temperature, but the growth levels off at higher temperatures. So, we decided to use quadratic Polynomial Regression to model this data. In this case we will use the sklearn
library. sklearn
itself provides the PolynomialFeatures
class in the sklearn.preprocessing
module.
The PolynomialFeatures
class in the sklearn.preprocessing
module is a feature transformation tool in Scikit-learn. It generates a new matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.
If we have one feature (x
), PolynomialFeatures
would transform this based on the degree parameter. For a degree of 2, it would transform it to [1, x
, x
^2]. For a degree of 3, it would transform it to [1, x
, x
^2, x
^3] and so on. The 1 is included as it acts as the intercept term.
Here’s an example of how to use PolynomialFeatures
:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
# Given a simple 2D numpy array
X = np.arange(6).reshape(3, 2)
print("Original Data:")
print(X)

# Create a PolynomialFeatures object and transform the data
poly = PolynomialFeatures(2)  # let's choose degree 2
X_poly = poly.fit_transform(X)
print("Transformed Data:")
print(X_poly)
Now, we return to our case example.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Reshape our x input into a 2D column array for fitting
x_train_fit = x_train.reshape(-1, 1)

# Create Polynomial Features (degree 2 for a quadratic fit)
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x_train_fit)

# Fit the Polynomial Features to the Linear Regression
poly_model = LinearRegression()
poly_model.fit(x_poly, y_train)

# Generate a sequence of temperatures for prediction
num_samples = 100
X_seq = np.linspace(x_train_fit.min(), x_train_fit.max(), num_samples).reshape(-1, 1)

# Use model to predict ice cream sales
y_poly_pred = poly_model.predict(poly.transform(X_seq))

# Visualize the original data and the polynomial fit
plt.scatter(x_train, y_train)
plt.plot(X_seq, y_poly_pred, color='r')
plt.xlabel('Temperature in Celsius')
plt.ylabel('Ice Cream Sales')
plt.title('Ice Cream Sales vs Temperature')
plt.show()
Now we have a model that not only fits our training data but can predict the ice cream sales at any given temperature, which will help in managing our ice cream supply based on predicted sales.