Linear Regression Explained for Beginners in Machine Learning (2024)


Data science, with the power it gives you to analyze every bit of data at your disposal and make smart, intelligent business decisions, is becoming a must-have capability to understand and implement in your organization. It is very important that your business decisions are based on data analysis rather than intuition.

As a data science learner & practitioner, you will very often hear:

“The data you have in your repository is a gold mine, which needs to be harnessed with the intent to serve humanity at large, as people are the key source of that very data.”

Data has a story to tell. Being a data engineer and a business leader, it’s your primary responsibility to treat that data well, process it with an appropriate ML model, and build a solution that is relevant for both current and future user needs. With this intent, let’s begin our journey of understanding supervised ML using the Linear Regression model.

  1. What Is Supervised Machine Learning?
  2. Types Of Supervised Machine Learning
  3. What Is Regression & Its Types?
  4. Understanding Linear Regression With An Example
  5. Hands-On Lab Exercise On Linear Regression Using Python & Jupyter

In supervised learning, we are given a labeled data set (labeled training data) where the desired outcome is already known and every pair of training data has some kind of relationship.

Supervised learning is where you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output.

Y = f(X)

The intent is to train the function to such an extent that whenever we get any new input data (x), we can easily predict the output variable (Y) for that given set of input data.

So here the training happens under the supervision of a teacher/assistant who already knows the correct answers: the algorithm iteratively makes predictions on the training data and is corrected by the supervisor. When our learning algorithm achieves an acceptable level of training performance, we put an end to the learning process.
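To make this concrete, here is a minimal sketch of the “learn the mapping from labeled pairs, then predict for new inputs” loop using scikit-learn; the hours-studied/exam-score numbers are purely hypothetical:

import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[1], [2], [3], [4], [5]])   # input variable (x), labeled training data
y_train = np.array([52, 57, 61, 68, 74])        # known correct outputs (Y)

model = LinearRegression()
model.fit(X_train, y_train)      # training under supervision: learn f in Y = f(X)

X_new = np.array([[6]])          # new, unseen input data
print(model.predict(X_new))      # predicted Y for the new input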

The most fundamental way to categorize any supervised learning methodology is by the type of problem statement it is trying to solve; at a high level, this means the kind of business problem one is trying to solve using supervised machine learning algorithms.

So, within supervised machine learning we further categorize problems into the following categories:

  1. Regression
  2. Classification

Regression problems are the problems where we try to make a prediction on a continuous scale. Examples could be predicting the stock price of a company, predicting tomorrow’s temperature from historical data, or predicting the change in sales based on given input variables such as man-hours used. Here stock price, temperature, and sales are all continuous variables.

Regression is a method of modeling a target value based on independent predictors. This method is mostly used for forecasting and finding out the cause and effect relationship between variables. Regression techniques, mostly differ based on the number of independent variables and the type of relationship between the independent and dependent variables.

Regression Types :

  • Linear Regression
  • Multiple Linear Regression
  • Polynomial Regression
  • Decision Tree Regression
  • Random Forest Regression

We will cover only Linear regression today and the rest we will cover later.

Linear regression is made up of two words, linear & regression. Let’s understand both before we get into the definition of linear regression.

Linear: The word linear comes from the Latin word linearis, which means pertaining to or resembling a line

Regression: a statistical technique for estimating the relationships among dependent & independent variables.

Let’s combine them and define:

It is a statistical approach to model the relationship between a dependent variable and one or more explanatory variables (or independent variables) and come up with a best-fit linear line (a linear equation, using the least squares approach), represented in its most simplified form as:

Simple linear regression:

y = β0 + β1X

where:

X = explanatory variable,

β0 = y-intercept (constant term),

β1 = slope coefficient for the explanatory variable.

We use linear regression to find the relationship between dependent & independent variables and to find the best attribute (input variable) to use for model building when solving regression-type problems.

Linear Regression is further classified as

  • Simple linear regression: It has only one explanatory variable
  • Multiple linear regression: It has more than one explanatory variable. Here a single scalar dependent variable is predicted from several (possibly correlated) independent variables.

Multiple linear regression fits a line between multiple inputs and one output, typically:

yi = β0 + β1·xi1 + β2·xi2 + … + βp·xip + ϵ

where, for i = 1, …, n observations:

yi = dependent variable,

xi1 … xip = explanatory variables,

β0 = y-intercept (constant term),

β1 … βp = slope coefficients for each explanatory variable,

ϵ = the model’s error term (also known as the residuals)

In essence, multiple regression is the extension of ordinary least-squares (OLS) regression that involves more than one explanatory variable.

In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data.

Linear predictor functions:

In statistics and in machine learning, a linear predictor function is a linear function (linear combination) of a set of coefficients and explanatory variables (independent variables), whose value is used to predict the outcome of a dependent variable.
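For illustration, a linear predictor function is simply a dot product between a coefficient vector and a feature vector; the numbers below are made up:

import numpy as np

beta = np.array([1.5, 0.8, -0.3])   # [intercept, coefficient for x1, coefficient for x2]
x_obs = np.array([1.0, 2.0, 4.0])   # [1 for the intercept, x1, x2] for one observation
y_hat = np.dot(beta, x_obs)         # linear combination = predicted dependent variable
print(y_hat)                        # 1.5 + 0.8*2.0 - 0.3*4.0 = 1.9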

You will very often come across terms like dependent & independent variables in our discussion, so let’s try to understand them so that the rest of the discussion makes more sense going forward.

In mathematical/statistical modeling the values of dependent variables depend on the values of independent variables.

The dependent variable represents the output or outcome whose variation is being studied. In simplified terms: whenever you try to predict a change in an output variable based on some given input variables, that output variable is known as the dependent variable. We also call it the target variable when we analyze it using a linear regression model.

Independent variables, also known in a statistical context as regressors, represent inputs or causes, that is, potential reasons for variation. These input variables, which are mapped in the linear line equation to predict the possible outcome, are known as independent variables.

Example: Simple linear regression line (works for two variables)

y = mX + C

Here y = dependent variable (target variable),

X = independent (input) variable.

It is best practice to represent the independent variable with a capital letter (X) and the dependent variable with a small letter (y), though this is not enforced.

One needs to understand a few important concepts to make better sense of the linear regression model, which we will study going forward with an example. So let’s cover them quickly.

Prerequisites:

To start with Linear Regression, you must be aware of a few basic concepts of statistics (a quick numerical check of each is shown in the snippet after this list), i.e.

  • Correlation (r): Explains the relationship between two variables, possible values -1 to +1
  • Variance (σ²): Measure of spread in your data
  • Standard Deviation (σ): Measure of spread in your data (square root of variance)
  • Normal distribution: Normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. In graph form, a normal distribution appears as a bell curve.
  • Residual (error term): Actual value (which we have) minus predicted value (which came from the linear regression)
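Here is the quick numerical check mentioned above, using NumPy on the same x and y arrays we will use in the hands-on section:

import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

print(np.corrcoef(x, y)[0, 1])   # correlation r, always between -1 and +1
print(np.var(y))                 # variance: measure of spread in the data
print(np.std(y))                 # standard deviation: square root of the variance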

To understand variance, standard deviation, and normal distribution, please refer to my article below: https://www.mlanalytics.in/descriptive-statistics-fundamentals-for-data-science-aspirants/

Key Assumptions In the Linear Regression Model:

If we are building a linear regression model we need to take care of the following assumptions, in order to build an effective model that works well.

  • The dependent variable is continuous.
  • There is a linear relationship between the dependent variable and the independent variables.
  • There is no multicollinearity (no relationship between the independent variables).
  • Residuals should follow a normal distribution.
  • Residuals should have constant variance: homoscedasticity.
  • Residuals should be independently distributed (no autocorrelation).

To check the relationship between the dependent and independent variables you can,

1. Perform bivariate analysis
2. Calculate the Variance Inflation Factor (VIF), which checks for multicollinearity among the independent variables: a value close to 1 is ideal, and values up to a maximum of about 4 are generally acceptable (see the short code sketch after the next list)

To find whether the residuals are normally distributed or not you can,

  • Plot a histogram / boxplot of the residuals
  • Perform the Kolmogorov–Smirnov test
  • Plot residuals vs. predicted values: there should be no pattern between them when you visualize them using data visualization tools
  • Perform the Non-Constant Variance test
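Here is the short sketch referred to above: two of these checks, the Variance Inflation Factor and the Kolmogorov–Smirnov test, on purely hypothetical data (the column names and residual values are made up for illustration); it assumes statsmodels and scipy are installed:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.outliers_influence import variance_inflation_factor

# hypothetical independent variables
X = pd.DataFrame({
    "man_hours": [10.0, 12.0, 15.0, 18.0, 20.0, 22.0, 25.0, 28.0],
    "ad_spend":  [1.0, 1.4, 1.2, 1.9, 1.6, 2.4, 2.1, 2.9],
})

# Variance Inflation Factor for each independent variable (values near 1 mean little multicollinearity)
X_const = sm.add_constant(X)
for i, col in enumerate(X_const.columns):
    if col != "const":
        print(col, variance_inflation_factor(X_const.values, i))

# Kolmogorov-Smirnov test: are the (standardized) residuals plausibly normal?
residuals = np.array([0.2, -0.5, 0.1, -0.3, 0.4, -0.1, 0.3, -0.2])   # hypothetical residuals
z = (residuals - residuals.mean()) / residuals.std()
print(stats.kstest(z, "norm"))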
Simple Linear Regression:

In simple linear regression, when we have a single input, we can use statistics to estimate the coefficients. This requires that you calculate statistical properties from the data, such as means, standard deviations, correlations, and covariance.

When we have more than one input we can use Ordinary Least Squares to estimate the values of the coefficients. The Ordinary Least Squares procedure seeks to minimize the sum of the squared residuals.

The coefficients can also be estimated iteratively using gradient descent. This works by starting with random values for each coefficient. The sum of the squared errors is calculated for each pair of input and output values. A learning rate is used as a scale factor and the coefficients are updated in the direction that minimizes the error.

The process is repeated until a minimum sum of squared error is achieved or no further improvement is possible. Here we select a learning rate (alpha) parameter that determines the size of the improvement step to take on each iteration of the procedure. We will look into this in detail later, as it is out of scope for today’s article.
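Purely as a preview of this iterative idea, here is a minimal gradient-descent sketch for a y = b0 + b1*x model on a few made-up points; the learning rate and number of iterations are arbitrary illustrative choices:

import numpy as np

x = np.array([0, 1, 2, 3, 4], dtype=float)
y = np.array([1, 3, 2, 5, 7], dtype=float)

b0, b1 = 0.0, 0.0        # start from arbitrary initial coefficients
alpha = 0.01             # learning rate: size of each improvement step

for _ in range(5000):
    y_pred = b0 + b1 * x
    error = y_pred - y
    # move each coefficient in the direction that reduces the mean squared error
    b0 -= alpha * 2 * error.mean()
    b1 -= alpha * 2 * (error * x).mean()

print(b0, b1)            # approaches the least-squares solution (about 0.8 and 1.4 here)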

Regularization is an extension of the linear model where we seek both to minimize the sum of the squared error of the model on the training data (using ordinary least squares) and to reduce the complexity of the model (such as the number or absolute size of the sum of all coefficients in the model).

Two popular examples of regularization procedures for linear regression are:

Lasso Regression: where Ordinary Least Squares is modified to also minimize the absolute sum of the coefficients (called L1 regularization).

Ridge Regression: where Ordinary Least Squares is modified to also minimize the squared sum of the coefficients (called L2 regularization).
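As a rough sketch of how these two variants can be tried in practice, scikit-learn provides Lasso and Ridge estimators; the alpha value below is an arbitrary illustrative choice (it controls the strength of the penalty), and the data reuses the small x and y arrays from the hands-on section:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

X = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]).reshape(-1, 1)
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: absolute sum of coefficients
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty: squared sum of coefficients

print(lasso.coef_, lasso.intercept_)
print(ridge.coef_, ridge.intercept_)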

We will use a Jupyter notebook & do all the mathematical calculations to plot a simple regression line below, understanding everything along the way.

We will plot a scatter plot to visualize the given arrays x & y and then we will look into plotting a regression line,

Execute the code given below in your Jupyter notebook (I am assuming that you have already installed Anaconda, which comes pre-loaded with the required Python support & the Jupyter IDE).

# Suppose we have the given values x & y and there is a linear
# relationship between both of them.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# number of observations/points
n = np.size(x)

# let's plot a scatter plot for the given values
colors = np.random.rand(n)
area = 100  # marker size
plt.scatter(x, y, area, colors, alpha=0.5)
plt.show()

When you run the above code you will see the output as shown below:

[Figure: scatter plot of the given x and y values]

Now we need to find the line which fits best in the above scatter plot so that we can predict the response for any new feature value (i.e., a value of x not present in the dataset). This line is called the regression line.

The equation of the regression line is represented as:

y = b0 + b1*x + ∈

where:

  • y represents the predicted response value for the ith observation.
  • b0 and b1 are regression coefficients and represent the y-intercept and slope of the regression line respectively.
  • ∈ is the residual error.

To build our simple linear regression model, we need to learn or estimate the values of regression coefficients b0 and b1. These coefficients will be used to build the model to predict responses.

We will make use of the Least Squares technique to find the best fit line.

Least squares is a statistical method used to determine the best fit line or the regression line by minimizing the sum of squares created by a mathematical function. The “square” here refers to squaring the distance between a data point and the regression line. The line with the minimum value of the sum of squares is the best-fit regression line.

Let’s do the required calculations in the Python notebook to find b0 and b1. But before that, it is important to understand a few more formulas which we will implement in Python:

b0 (intercept) = [ (Σy)(Σx²) − (Σx)(Σxy) ] / [ n(Σx²) − (Σx)² ]

b1 (slope) = [ n(Σxy) − (Σx)(Σy) ] / [ n(Σx²) − (Σx)² ]

Execute the below-given code in your Jupyter notebook to continue,

# Step 2: calculating slope & intercept
# b0 = [ (Σy)(Σx²) - (Σx)(Σxy) ] / [ n(Σx²) - (Σx)² ]
# b1 (slope) = [ n(Σxy) - (Σx)(Σy) ] / [ n(Σx²) - (Σx)² ]

# mean of x and y vector
m_x, m_y = np.mean(x), np.mean(y)

# calculating cross-deviation and deviation about x
SS_xy = np.sum(y*x) - n*m_y*m_x
SS_xx = np.sum(x*x) - n*m_x*m_x

# calculating regression coefficients
b1 = SS_xy / SS_xx
b0 = m_y - b1*m_x

print("Coefficient b1 is: ", b1)
print("Coefficient b0 is: ", b0)

Run this code and you will see the output as given below:

[Output: printed values of the coefficients b1 and b0]

So we have the required coefficients: b0 ≈ 1.24, b1 ≈ 1.17.

# Step 3: let's plot the scatter plot along with the predicted y values
# based on our slope & intercept

# plotting the actual points as a scatter plot
plt.scatter(x, y, color = "m", marker = "o", s = 100)

# predicted response vector
y_pred = b0 + b1*x

# plotting the regression line
plt.plot(x, y_pred, color = "g")

# putting labels
plt.xlabel('x')
plt.ylabel('y')

# show plot
plt.show()

Execute the above code and you will see the output shown below:

[Figure: scatter plot of x and y with the fitted regression line y_pred = b0 + b1*x]

Once we have the simple linear regression line (model), we need to evaluate it to measure its fitness. We will evaluate the overall fit of a linear model using the R-squared value:

  • R-squared is the proportion of variance explained
  • It is the proportion of variance in the observed data that is explained by the model or the reduction in error over the null model
  • The null model just predicts the mean of the observed response, and thus it has an intercept and no slope
  • R-squared is between 0 and 1
  • Higher values are better because it means that more variance is explained by the model.
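In formula form (and this is exactly what the calculation below computes):

R-squared = (SS_tot − SS_res) / SS_tot = 1 − SS_res / SS_tot

where SS_res is the sum of squared residuals around the regression line and SS_tot is the sum of squared deviations of y around its mean (the null model). In the code below, SS_res and SS_tot appear as line1sum and line2sum respectively.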

Next, we will place another line on our data. This is a key step in calculating our r-squared, as you will see in a minute. Write the code given below and run it:

# plot a horizontal line along the mean of y
line2 = np.full(n, m_y)
plt.scatter(x, y)
plt.plot(x, line2, c = 'r')
plt.show()

The output is shown below:

[Figure: scatter plot of x and y with a horizontal red line at the mean of y]

Write the code given below and run it to get the r-squared value:

# sum of squared residuals around the regression line (line 1)
differences_line1 = y_pred - y
line1sum = 0
for i in differences_line1:
    line1sum = line1sum + (i*i)
print(line1sum)

# sum of squared deviations around the mean line (line 2)
differences_line2 = line2 - y
line2sum = 0
for i in differences_line2:
    line2sum = line2sum + (i*i)
print(line2sum)

# Variance of our linear model: 5.624
# Total variance of the target variable: 118.5
diff = line2sum - line1sum
print(diff)

rsquared = diff / line2sum
print("R-Squared is : ", rsquared)

# Let's verify the r-squared we calculated by using sklearn's "r2_score" function:
from sklearn.metrics import r2_score
r2Score = r2_score(y, y_pred)
print("R-squared using sklearn: ", r2Score)

# Observation
print("\nAs the r-squared value is very close to 1, we can easily say that our linear regression model, y_pred = b0 + b1*x, is a good fit linear regression line.")

R-Squared value comes out to be: 0.95


In our case, the r-squared value comes out quite high, at about 0.95, very close to 1. So we can say that our model is a good fit, as it explains a large share of the variance in the data.

You cannot use R-squared to determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.

Caution: R-squared does not indicate if a regression model provides an adequate fit to your data. A good model can have a low R2 value. On the other hand, a biased model can have a high R2 value!

So there are two other flavors of R²: adjusted R-squared and predicted R-squared. These two statistics address particular problems with R-squared and provide extra information by which you can assess your regression model’s goodness-of-fit. We will cover them later.

Now let’s work through the same example using sklearn, where we can perform the same calculation to find the simple linear regression model in just a few lines.

Here we go,

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]).reshape(-1,1)
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# invoke the LinearRegression function and find the best fit model on our given data
regression_model = LinearRegression()
regression_model.fit(x, y)  # this will fit the best fit line

# let us explore the coefficients for each of the independent attributes
b1 = regression_model.coef_
b0 = regression_model.intercept_
print("b1 is: {} and b0 is: {}".format(b1, b0))

plt.scatter(x, y, color = "m", marker = "o", s = 100)
plt.plot(x, b1*x+b0)
plt.show()

When you write & compile the above-given code snippet you will get the scatter plot with a line of regression, as shown below:

[Figure: scatter plot with the regression line produced by sklearn]
# sklearn has a function to calculate the R-squared value, as seen below
from sklearn.metrics import r2_score

# y_pred is the predicted value which our linear regression model predicted
# when we plotted the best fit line
y_pred = regression_model.predict(x)
r2Score = r2_score(y, y_pred)  # here y is our original value
print(r2Score)

Output:

When you run the above code you will get an R-squared value of 0.95, which matches what we calculated mathematically earlier.

We covered the basics of simple linear regression and understood how we can find the linear regression model with one predictor value X. But there is more to linear regression: we will not often be dealing with only one predictor value; instead we will have large data sets with multiple independent variables, where you need to deal with multiple linear regression & polynomial regression.

We will cover the following topics in the next part of Linear Regression:

  • Multiple linear regression model with one case study
  • Polynomial regression model
  • Concepts of underfitting & overfitting
  • Various techniques of error minimization in linear regression with examples
  • Linear regression learning models like gradient descent, OLS, and regularization

Thanks for being with me all along, will be back soon, keep loving keep sharing.
