Interpreting Ordinary Least Squares (OLS) Regression

Ordinary Least Squares (OLS) regression is the simplest linear regression model and the base model for linear regression. While it is a simple model and is not given much weight in machine learning, OLS tells you much more than just the overall accuracy of the model. It also tells you how each variable has fared, whether there are unwanted variables, whether there is autocorrelation in the data, and so on.

It is also one of the easier and more intuitive techniques to understand, and it provides a good basis for learning more advanced concepts and techniques. This post explains how to perform linear regression using the statsmodels Python package.

Note: statsmodels also provides Logit regression, which is similar to sklearn's Logistic Regression and works for classification problems.

OLS reflects the relationship between the X and y variables following these simple formulas:

y = b0 + b1X  # Simple linear

y = b0 + b1X1 + b2X2 + … + ε  # Multiple linear

Where

· b0 — the y-intercept

· b1, b2 — the slopes (coefficients)

· X, X1, X2 — the predictors

· y — the target variable

· ε — the error term

OLS is an estimator in which the values of b1 and b0 (from the above equation) are chosen in such a way as to minimize the sum of the squares of the differences between the observed dependent variable and predicted dependent variable. That’s why it’s named ordinary least squares.

In other words, by reducing the error between the predicted and actual values, the model is cutting down on its losses and predicting better. The fitted coefficients then tell you the impact of each predictor on the result.
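To make the idea concrete, here is a minimal sketch with a tiny made-up dataset (the numbers are invented purely for illustration): the intercept and slope come out as the values that minimize the sum of squared residuals.

```python
import numpy as np

# Tiny invented dataset, purely for illustration.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # predictor
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])  # target

# Column of 1s for the intercept b0, then solve the least-squares problem.
A = np.column_stack([np.ones_like(X), X])
b0, b1 = np.linalg.lstsq(A, y, rcond=None)[0]

residuals = y - (b0 + b1 * X)
print(b0, b1)                   # fitted intercept and slope
print((residuals ** 2).sum())   # the minimized sum of squared errors
```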

Note: Ideally, the linear regression assumptions should be checked before building an OLS model. The aim of this article is to interpret all the elements of an OLS summary.

Let's understand this better with an example. I have taken a simple dataset, the Advertising data:

[Image: preview of the Advertising dataset (TV, Radio, Newspaper, Sales)]
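If you want to follow along, here is a minimal sketch for loading the data with pandas; the file name Advertising.csv is an assumption, so point it at wherever your copy of the dataset lives.

```python
import pandas as pd

# Assumed file name/path for the Advertising data; adjust as needed.
df = pd.read_csv("Advertising.csv")
print(df.shape)   # 200 rows with TV, Radio, Newspaper and Sales columns
print(df.head())
```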

In linear models, the coefficient of one variable depends on the other independent variables in the model. Hence, removing or adding a variable affects the whole model. For example, suppose in the future we add another advertising medium, say social media: we will have to re-fit the model and re-calculate the coefficients and the constant, since they depend on the set of columns in the dataset.

In case you want to check out the formula for multi linear regression:

Multiple Linear Regression: "Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses…" (medium.com)

So practically, it's not feasible to keep adding variables and checking their linear relationship. The idea is to pick the best variables using the following two steps:

1. Domain Knowledge

2. Statistical tests — not only parametric and non-parametric tests, but also checks for multicollinearity between the independent variables and for correlation with the target variable.

[Image: correlation matrix of the Advertising data]
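A quick sketch of that correlation check, continuing with the df DataFrame loaded above (the seaborn heatmap is optional):

```python
# Pairwise correlations; Sales vs. TV should stand out as the strongest.
print(df.corr())

# Optional heatmap, assuming seaborn and matplotlib are installed.
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()
```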

Here we quickly check the correlations in the data, and it's evident that Sales and TV advertising have a strong correlation. For example, an increase in advertising generally leads to an increase in sales; however, if a medium like Newspaper has low readership, it may even show a negative correlation.

Thus, getting our variables right is the most important step in any model building. It also helps reduce processing costs while building the right machine learning models.

Step 1: Import the libraries and add a constant column. We do so because we expect the dependent variable to take a non-zero value when all the regressors are set to zero. This shows up as a constant column of 1s; later, when we fit the model, the coefficient of this constant column will be b0 in our multiple linear formula.

[Image: Step 1 code, importing the libraries and adding the constant]
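A minimal sketch of what Step 1 looks like in code, assuming the df DataFrame and the column names used above:

```python
import statsmodels.api as sm

# Predictors and target.
X = df[["TV", "Radio", "Newspaper"]]
y = df["Sales"]

# Add the constant column of 1s; its coefficient will be b0 in the formula.
X = sm.add_constant(X)
print(X.head())
```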

Step 2: Fit the model on X and y and check the summary. Now let's run it and have a look at the results; we will interpret each and every section of this summary table.

[Image: Step 2 code and the resulting OLS regression summary]
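And a sketch of Step 2, continuing from the X and y built in Step 1:

```python
# Fit ordinary least squares and print the summary interpreted below.
model = sm.OLS(y, X).fit()
print(model.summary())
```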

For ease of explanation, we will divide the summary report into 4 sections.

SECTION 1:

Overall, our model is performing well, explaining roughly 89% of the variance in Sales. Let's quickly jump in and start with the top-left section first:

[Image: Section 1 of the summary (model details, Df Residuals, Df Model)]

This section gives us the basic details of the model, such as which variable is our dependent variable, when the model was built, and so on. Let's look at the two key elements here, Df Residuals and Df Model:

Df Residuals: Before interpreting this term, let's understand what Df and Residuals are:

Df here is Degrees of Freedom (DF) which indicates the number of independent values that can vary in an analysis without breaking any constraints.

Residuals in regression are simply the errors not explained by the model: the distance between a data point and the regression line.

Residual = (Observed value) - (Fitted/Expected value)

The df (Residual) is the sample size minus the number of parameters being estimated: df(Residual) = n - (k + 1) = n - k - 1.

Hence the calculation in our case is:

200 (total records) - 3 (number of X variables) - 1 (for the intercept) = 196

Df Model: It's simply the number of X variables in the data, excluding the constant, which is 3.
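Both numbers can also be read straight off the fitted results object; a small sketch assuming the model variable from Step 2:

```python
print(model.df_model)   # 3.0   -> number of X variables, excluding the constant
print(model.df_resid)   # 196.0 -> 200 - 3 - 1
```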

SECTION 2:

[Image: Section 2 of the summary (R-squared, Adj. R-squared, F-statistic, Log-Likelihood, AIC, BIC)]

R-squared: It’s the degree of the variation in the dependent variable y that is explained by the dependent variables in X. Like in our case we can say that with the given X variables and a multi linear model, 89.7% variance is explained by the model. In regression, it also means that our predicted values are 89.7% closer to the actual value i.e y. R2 and attain values between 0 to 1.

The drawback with the R2 score is that, as the number of variables in X grows, R2 tends to stay constant or increase, even if only by a minuscule amount, whether or not the newly added variable is significant.

R2 = Variance Explained by the model / Total Variance

OLS Model: Overall model R2 is 89.7%

Adjusted R-squared: This resolves the drawback of the R2 score and hence is considered more reliable. Adjusted R2 penalizes variables that are not significant for the model. In simple linear regression, R2 and Adjusted R2 will be practically the same. As more insignificant variables are added to the model, the gap between R2 and Adjusted R2 keeps increasing.

Adjusted R2 = 1 - [(1 - R2) * (n - 1) / (n - k - 1)]

Where n is the number of records and k is the number of predictors, excluding the constant.

OLS Model: Adjusted R2 for the model is 89.6% which is 0.1% less than R2.
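Both values are also exposed directly on the fitted results object (same model as above):

```python
print(model.rsquared)       # ~0.897 -> R-squared
print(model.rsquared_adj)   # ~0.896 -> Adjusted R-squared
```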

F-statistic and Prob(F-statistic): Here ANOVA is applied to the model with the following hypotheses:

H0: b1, b2, b3 (the regression coefficients) are all 0, i.e. the intercept-only model fits the data just as well.

H1: At least one of the coefficients (b1, b2, b3) is not 0, i.e. the model with independent variables fits the data better than the intercept-only model.

Practically speaking, it is unlikely that all of the independent variables have coefficients of 0, so we usually end up rejecting the null hypothesis. However, it's possible that no individual variable is predictive enough on its own to be statistically significant. In other words, your sample may provide sufficient evidence to conclude that the model is significant, but not enough to conclude that any individual variable is significant.

F-statistic = Explained variance / unexplained variance

OLS Model: Prob(F-statistic) is 1.58e-96, which is much lower than 0.05, our alpha value. It means that, if all the coefficients were actually 0, the probability of seeing an F-statistic at least this large would be only 1.58e-96, so we reject the null hypothesis.
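The same two numbers can be pulled programmatically (same fitted model as above):

```python
print(model.fvalue)     # the F-statistic
print(model.f_pvalue)   # Prob(F-statistic), ~1.58e-96 here
```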

Log-Likelihood: The log-likelihood is a measure of goodness of fit and is the quantity behind the maximum likelihood estimator. The higher the value, the better the model. Remember that the log-likelihood can lie anywhere between -Inf and +Inf, so its absolute value in isolation tells us little; it is mainly useful for comparing models. The maximum likelihood estimator is obtained by finding the parameters that maximize the log-likelihood of the observed sample.

AIC and BIC: The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are two methods of scoring and selecting a model.

AIC = -2 * LL + 2 * k

BIC = -2 * LL + log(N) * k

Where N is the number of examples in the training dataset, LL is the log-likelihood of the model on the training dataset, and k is the number of parameters in the model.

Both scores are minimized, i.e. the model with the lowest AIC or BIC is selected.

BIC is calculated differently from AIC, although the two are closely related. Unlike AIC, BIC penalizes the model more heavily for its complexity, meaning that more complex models will have a worse (larger) score and will, in turn, be less likely to be selected.
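A quick sketch for reading these goodness-of-fit numbers off the fitted model:

```python
print(model.llf)   # log-likelihood
print(model.aic)   # Akaike Information Criterion
print(model.bic)   # Bayesian Information Criterion
```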

SECTION 3:

Phew, that was a lot of information, and we have two more sections to go. Let's jump into the central part, which is the main part of the summary:

[Image: Section 3 of the summary (coef, std err, t, P>|t|, confidence intervals)]

Now we know that the coef column holds the values of b0, b1, b2 and b3. So the equation of the line is:

y = 2.94 + 0.046 * (TV) + 0.188* (Radio) + (-0.001)*(Newspaper)
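The same coefficients live on the results object and can be used for a quick prediction; the advertising budget below is invented purely for illustration:

```python
print(model.params)
# const      ~  2.94
# TV         ~  0.046
# Radio      ~  0.188
# Newspaper  ~ -0.001

# Illustrative prediction for one made-up budget (column order matches X).
new = pd.DataFrame({"const": [1.0], "TV": [100.0], "Radio": [20.0], "Newspaper": [10.0]})
print(model.predict(new))   # ~11.3 = 2.94 + 0.046*100 + 0.188*20 - 0.001*10
```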

std err is the standard error of each coefficient estimate: it measures how precisely the coefficient has been estimated, with a smaller standard error indicating a more reliable estimate.

t and P>|t|: t is simply the t-statistic for each variable, testing the following hypotheses:

H0: Slope / Coefficient = 0

H1: Slope / Coefficient is not = 0

Based on this, the table gives us the t-statistics, and P>|t| gives us the p-values. With alpha at 5%, we check whether each variable is significant.

[0.025, 0.975] — these two columns give the 95% confidence interval (i.e. the default 5% alpha) for each coefficient. If the interval contains 0, we cannot reject H0 for that variable; if it does not, the coefficient is significantly different from 0.

Looking at the p-values, we know that 'Newspaper' is not a significant variable and has to be removed from our list. Before we come to that, let's quickly interpret the last section of the summary.
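A small sketch for pulling the p-values and the interval columns programmatically:

```python
print(model.pvalues)               # Newspaper ~0.86, well above alpha = 0.05
print(model.conf_int(alpha=0.05))  # the [0.025, 0.975] columns of the summary
```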

SECTION 4:

[Image: Section 4 of the summary (Omnibus, Skew, Kurtosis, Durbin-Watson, Jarque-Bera, Cond. No.)]

Omnibus: This is a combined test of the skewness and kurtosis of the residuals, checking whether they are consistent with a normal distribution. We hope for the Omnibus score to be close to 0 and its probability to be close to 1, which would mean the residuals are normally distributed.

In our case the Omnibus score is very high, well over 60, and its probability is 0. This means our residuals do not follow a normal distribution.

Skew — It's a measure of the symmetry of the residuals. We want to see something close to zero, indicating that the residual distribution is normal. Note that this value also feeds into the Omnibus test.

We can see that our residuals are negatively skewed at -1.37.

Kurtosis — It's a measure of the peakedness and tail weight of the residual distribution. A normal distribution has a kurtosis of about 3; values well above that indicate heavier tails, i.e. more extreme residuals.

Looking at the results, our kurtosis is 6.33, which means the residuals have heavier tails (more outliers) than a normal distribution, consistent with the Omnibus and skew results above.

Durbin-Watson — The Durbin Watson (DW) statistic is a test for autocorrelation in the residuals from a statistical regression analysis. The Durbin-Watson statistic will always have a value between 0 and 4. A value of 2.0 means that there is no autocorrelation detected in the sample.

Our Durbin-Watson value is 2.084, which is very close to 2, so we conclude that the residuals show no autocorrelation.

Note: Autocorrelation, also known as serial correlation, is the similarity between observations as a function of the time lag between them.

Jarque-Bera (JB)/Prob(JB) — JB score simply tests the normality of the residuals with the following hypothesis:

H0: Residuals follow a normal distribution

H1: Residuals don’t follow a normal distribution

Prob(JB) is very low, close to 0, hence we reject the null hypothesis: the residuals do not follow a normal distribution.

Cond. No. : The condition number is used to help diagnose collinearity. Collinearity is when one independent variable is close to being a linear combination of a set of other variables.

The condition number is 454 in our case; let's see how it comes down once we reduce our variables.
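Most of these Section 4 diagnostics can be reproduced straight from the residuals; a sketch using statsmodels' stattools helpers:

```python
from statsmodels.stats.stattools import durbin_watson, jarque_bera

print(durbin_watson(model.resid))   # ~2.08 -> no autocorrelation

jb, jb_pvalue, skew, kurtosis = jarque_bera(model.resid)
print(jb_pvalue, skew, kurtosis)    # Prob(JB) ~0, skew ~-1.37, kurtosis ~6.33
```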

Okay, we are almost at the end of our article; we have seen the interpretation of each and every element in this OLS summary. Just one last section, where we update our OLS model and compare the results:

If we look at our model, only Newspaper, with a p-value of 0.86, is above 0.05. Hence we will rebuild the model after removing Newspaper:

[Image: OLS summary after removing Newspaper]
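A minimal sketch of that re-fit, dropping Newspaper and keeping everything else as before:

```python
# Re-fit without Newspaper (drop columns one at a time, as noted below).
X2 = sm.add_constant(df[["TV", "Radio"]])
model2 = sm.OLS(y, X2).fit()
print(model2.summary())
print(model2.aic, model2.bic)   # both should come down versus the full model
```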

Please note, as already mentioned, that the coefficient value of each variable depends on the other variables in the model. Hence we should always remove columns one by one so that we can gauge the difference.

When we remove Newspaper, the R-squared barely changes, but the coefficients are updated. The AIC, BIC and Cond. No. have all come down, which shows we have improved the efficiency of the model.

Wow! We are finally at the end of this article. There are lots of elements in an OLS summary, and we have interpreted them all above. I have tried to simplify and throw light on each and every section of the OLS summary.

If you have any queries, do ask in the comments, and if you find something amiss, do let me know.

Drop in a clap if you learnt something new in this article.

