One of the oldest and most basic forms of prediction, linear regression is still widely used in many different fields to extrapolate and interpolate data.
In this article, I’ll explain the basics of how and when to use them, with the help of Python’s Scikit-learn.
Linear regression was first published by the French mathematician Legendre in 1805, with its first applications in astronomy.
This modelling technique tries to predict the value of one variable based on another variable.
For example, a crop whose yield depends on the amount of rain, or a product whose sales depend on its price.
As a side note — maybe you were fooled by the joyful caricature of our friend here, but let me tell you, this is the guy who figured out how to hit stuff with cannonballs. He did so by using air resistance, the projectile’s initial velocity, and the angle of the cannon to calculate the trajectory of the shot.
For this example, I'll be using Jupyter Notebooks to run my code, Pandas for data wrangling, NumPy for math operations, Matplotlib for the visualizations, and Scikit-learn for the statistics.
Let’s start with small steps and build a straightforward dataset to test our linear model.
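The dataset can be as simple as ten points where y is always equal to x; the exact values don't matter, so something roughly like this will do:

import numpy as np

# A perfectly symmetrical toy dataset: y is always equal to x
x = np.arange(10)   # 0, 1, ..., 9
y = np.arange(10)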
Our data is symmetrical, x = y, so it should be easy for our regression to predict that the value of y for x = 10 is 10.
It’s important to remember that the “prediction” is the application of our conclusion in the broader population, it’s an extrapolation of the results. The conclusion itself, or the product of regressions, is a formula describing the relationship between the variables.
y = coefficient * x
Let's try it!
First, we define our model, with linear_model.LinearRegression(), then we fit our model with .fit(x, y).
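In code, that could look roughly like this (the fit_intercept=False argument is there because, as we'll see a bit further down, this first version of the model is fitted without an intercept):

from sklearn import linear_model

# Scikit-learn expects X to be two-dimensional, hence the reshape
X = x.reshape(-1, 1)

model = linear_model.LinearRegression(fit_intercept=False)
model.fit(X, y)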
Now, let’s check our coefficient.
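Reusing the model fitted above:

print(model.coef_)   # [1.]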
It’s 1. This means our formula is: y = 1 * x
Let’s predict the next values.
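For example, asking the model for the value of y when x is 10 (the input has to be 2D, just like when fitting):

print(model.predict([[10]]))   # [10.]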
It works!
*Scikit-learn requires X to be a matrix (a 2D array) because you can have more than one explanatory variable in your model.
Ok, besides having a perfectly linear relationship, our data also passes through the origin, but what happens when it doesn’t?
That’s not because linear regressions can’t handle it, it’s because we’re missing one crucial part in our formula. The intercept!
The intercept is a constant value added to our formula regardless of the values of the variables. So let’s remove that “fit_intercept=False” and try again.
y = intercept + coefficient * x
y = (-10) + (1* x)
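A quick sketch of that second fit, assuming the same x values with y shifted down by 10:

# Same x values, but y now starts 10 units lower: y = x - 10
y2 = x - 10

model = linear_model.LinearRegression()   # fit_intercept=True is the default
model.fit(X, y2)

print(model.intercept_)   # -10.0
print(model.coef_)        # [1.]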
Awesome, we know how to apply a basic linear regression, and we know how to interpret its results.
But, what is the model doing?
So, first things first, the type of regression we’re using is OLS — Ordinary Least Squares.
Let’s see how Scikit describes this model.
LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation
Alright, pretty condensed statement over there. Let’s try to distil what it is trying to say.
*The way our model evaluates the results is often referred to as its metrics.
This model minimizes the residual sum of squares. The residual is simply the difference between the value predicted by the model and the actual value.
Take a look at this example:
So the blue line represents the linear regression’s prediction, but we can see that the data points are not exactly on the line. The difference between each point and the line is the so-called residual.
Ok, so the residual sum of squares is the sum of the squared differences between our predictions and our actual data; squaring keeps negative and positive differences from cancelling out in the sum. An OLS linear regression tries to minimize that quantity.
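Written out, with yᵢ as an observed value and ŷᵢ as the corresponding prediction:

RSS = Σ (yᵢ − ŷᵢ)²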
Metrics are very significant when we’re comparing different models.
If we’re looking at different models, trying to predict the same data we can say that the model with the lowest residual sum of squares is the most accurate.
It’s convenient to get those metrics with Scikit-learn. We just need the metrics module.
The most common metrics for linear regressions are the R2 score and the mean squared error, but I highly suggest that you check the other available metrics in the documentation and think about the best metric for your purpose.
Maybe you don’t mind having a higher average residual, but you can’t afford to have extremely high errors, so a max_error(y_true, y_pred) could be more suitable for your needs.
We don’t have residuals in this regression, so its Mean Squared Error will be zero.
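Reusing the model and the shifted data from above, computing both metrics looks like this:

from sklearn import metrics

y_pred = model.predict(X)

print(metrics.mean_squared_error(y2, y_pred))   # 0.0
print(metrics.r2_score(y2, y_pred))             # 1.0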
R2 is the coefficient of determination, where a perfect score would be 1, decreasing as the predicted points move farther from the actual values.
*Even though the name involves a square, R2 is not bounded below by zero: it can be negative when the model fits the data worse than simply predicting the mean.
Cool, we have seen how to build a simple OLS model, how to fit our data to it, extrapolate on our conclusions, and the metrics we can use to compare the models.
Not quite yet, Neo.
Now, let’s get to one of the most important and most forgotten parts of this methodology. The ASSUMPTIONS!
Assumptions are like prerequisites; we have to check them to know whether this method will work on our data.
Breaking assumptions, or running your model regardless of them, can have consequences. Some assumptions have a higher impact on your prediction than others, and I suggest some further research about them if you’ll be relying heavily on this model, just as I would recommend for any statistical model.
For the sake of simplicity, let’s take a look at the most meaningful of them.
Linearity — This means the relationship between the two variables must be linear.
Independence — The observations should be independent of each other. The amount of daily rain doesn’t depend on how much it rained the next or the previous day, but the amount of water in a reservoir is highly dependent on its prior values.
Homoscedasticity — This means our residuals should have a consistent variance. Time-series usually have problems with this. Initially, we have consistent predictions, but the error gets higher as we move further in time.
Normality — The errors should be normally distributed. Here, we’re mostly looking for significant deviations from a normal distribution, like a very right or left-skewed error distribution.
Sample Size — The sample should have at least 20 observations. We’ll break this assumption here for simplification.
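We won’t check each assumption in detail here, but two quick visual checks for the homoscedasticity and normality assumptions could look roughly like this with Matplotlib (on our toy data the residuals are all zero, so the plots only illustrate the mechanics):

import matplotlib.pyplot as plt

residuals = y2 - model.predict(X)

# Residuals vs. fitted values: the spread should look roughly constant (homoscedasticity)
plt.scatter(model.predict(X), residuals)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()

# Histogram of the residuals: we want something roughly bell-shaped (normality)
plt.hist(residuals)
plt.show()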
That’s a lot, now let’s try an OLS model with multiple explanatory variables!
It’s important to remember that the explanatory variables should be independent of each other.
First, let’s create some dummy data for our milkshake store.
I’ve created three variables to help us predict our milkshake sales: the temperature, the probability of rain on that day, and the daily closing value of Microsoft’s stock price.
Maybe, the last one is not the best option to help us predict milkshake sales.
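The exact numbers aren’t important here; some made-up data along these lines is enough for the example (the coefficients used to generate the sales are only illustrative):

import pandas as pd

rng = np.random.default_rng(42)
n = 10   # deliberately small, which is why we break the sample-size assumption

milkshakes = pd.DataFrame({
    'temperature': rng.uniform(15, 35, n),    # daily temperature
    'rain_prob': rng.uniform(0, 1, n),        # probability of rain that day
    'msft_price': rng.uniform(200, 300, n),   # MSFT daily closing price
})
milkshakes['sales'] = (-34 + 7.7 * milkshakes['temperature']
                       - 20 * milkshakes['rain_prob']
                       + rng.normal(0, 5, n))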
There are many ways to check if a variable helps predict the outcome of another. Those methods are called Feature Selection.
One of the most used ways is checking the p-value.
The p-value tells us the probability of finding values at least as extreme as the ones observed in our test, assuming there is no real relationship. So, the coefficients are a measure of the relationship between two variables, and the p-values of the coefficients are a measure of the statistical significance of that relationship. Phew… sounds complicated, let’s see how to apply it.
We’ll use Scikit-learn’s f_regression function from the feature_selection module. It returns the F-values and the p-values; for now, we’ll just go over the p-values.
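Roughly like this (the exact p-values depend on the dummy data, so the numbers in the comment are only indicative):

from sklearn.feature_selection import f_regression

X_milk = milkshakes[['temperature', 'rain_prob', 'msft_price']]
y_milk = milkshakes['sales']

f_values, p_values = f_regression(X_milk, y_milk)
print(p_values)   # e.g. something like [0.0001, 0.04, 0.8]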
We’re looking for relationships with a statistical significance equal to or higher than 95%, in other words, a p-value lower than 5%.
As we can see from our results, Microsoft’s stock price is not a useful variable for predicting our milkshake sales.
Now, let’s finish this!
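Keeping only the two useful variables, we fit the final model and ask it about tomorrow (with randomly generated data the exact coefficients will vary a little):

final_model = linear_model.LinearRegression()
final_model.fit(milkshakes[['rain_prob', 'temperature']], milkshakes['sales'])

print(final_model.intercept_, final_model.coef_)   # e.g. -33.91, [-20.3, 7.66]

# Tomorrow's forecast: a 10% chance of rain and 27 degrees
tomorrow = pd.DataFrame({'rain_prob': [0.1], 'temperature': [27]})
print(final_model.predict(tomorrow))               # ~[171.]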
Cool!
Now we know that if the forecast says the temperature tomorrow will be 27 with only a 10% chance of rain, we can expect to sell something around 171 milkshakes.
Sales = -33.91 + (-20.3 * 0.1) + (7.66 * 27)
Sales = -33.91 -2.03 +206.82
Sales = ~171
And that’s it! There are lots of other things surrounding this topic: different methods of feature selection, metrics, assumptions to help you check your data, cross-validation methods, and so much more. I hope you got an understanding of the basics and that you’ll be able to build more knowledge from here.
Thanks for taking the time to read my article!
FAQs
Does sklearn linear regression use OLS?
The LinearRegression method of sklearn uses the Ordinary Least Squares method. So, if X is the feature matrix (with, say, n columns representing n features based on which we will make the predictions) and w is the weight vector (with n values), Xw will be the prediction of the LinearRegression model.
How do you solve OLS regression?
- Step 1: For each (x, y) point calculate x² and xy.
- Step 2: Sum all x, y, x² and xy, which gives us Σx, Σy, Σx² and Σxy (Σ means "sum up").
- Step 3: Calculate the slope m: m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²)
- Step 4: Calculate the intercept b: b = (Σy − m Σx) / N
- Step 5: Assemble the equation of a line.
Ordinary least squares (OLS) regression is a method that allows us to find a line that best describes the relationship between one or more predictor variables and a response variable. The fitted line has the form ŷ = b0 + b1x, where ŷ is the estimated response value and b0 is the intercept of the regression line.
How to do linear regression in scikit-learn?
- Step 1: Importing All the Required Libraries. import numpy as np. ...
- Step 2: Reading the Dataset. ...
- Step 3: Exploring the Data Scatter. ...
- Step 4: Data Cleaning. ...
- Step 5: Training Our Model. ...
- Step 6: Exploring Our Results. ...
- Step 7: Working With a Smaller Dataset.
Is OLS the same as linear regression?
Yes, although ‘linear regression’ refers to any approach to model a linear relationship between one or more variables, OLS is the method used to find the simple linear regression of a set of data.
When should we use OLS regression?
The OLS method can be used to estimate the unknown parameters (m and b) by minimizing the sum of squared residuals. In other words, the OLS method finds the best-fit line for the data by minimizing the sum of squared errors or residuals between the actual and predicted values.
What is the formula for the OLS estimator?
In all cases the formula for the OLS estimator remains the same: β̂ = (X′X)⁻¹X′y; the only difference is in how we interpret this result. OLS estimation can be viewed as a projection onto the linear space spanned by the regressors.
What are the key conditions for OLS regression?
- The regression model has linearity in its error term and coefficients. ...
- The error term's population mean is zero. ...
- There are no correlations between the independent variables and the error term. ...
- Each observation of the error term is independent of others. ...
- The error term's variance is constant.
An example of multiple OLS regression
Ice cream consumption = 0.197 – 1.044 price + 0.033 income + 0.003 temperature. The parameter for α (0.197) indicates the predicted consumption when all explanatory variables are equal to zero.
How do you explain OLS regression?
Ordinary least squares (OLS) regression is a statistical method of analysis that estimates the relationship between one or more independent variables and a dependent variable; the method estimates the relationship by minimizing the sum of the squares in the difference between the observed and predicted values of the dependent variable.
What is OLS regression summary?
OLS is a common technique used in analyzing linear regression. In brief, it compares the difference between individual points in your data set and the predicted best-fit line to measure the amount of error produced.
How hard is linear regression?
Regression analysis is not difficult. If you repeat that enough times you will believe it, and believing it will make it much less daunting.
Which Python library is best for linear regression?
scikit-learn is one of the best Python libraries for statistical/machine learning and it is adapted for fitting and making predictions. It gives the user different options for numerical calculations and statistical modelling. Its most important class for linear regression is LinearRegression.
Why should you use OLS regression?
In data analysis, we use OLS for estimating the unknown parameters in a linear regression model. The goal is minimizing the differences between the collected observations in some arbitrary dataset and the responses predicted by the linear approximation of the data.
Why is OLS regression good?
OLS is the most efficient linear regression estimator when the assumptions hold true. Another benefit of satisfying these assumptions is that as the sample size increases to infinity, the coefficient estimates converge on the actual population parameters.
What are the advantages of OLS?
- Advantages: The statistical method reveals information about cost structures and distinguishes between different variables' roles in affecting output. ...
- Disadvantages: Large data set is necessary in order to obtain reliable results.
OLS Estimator is Efficient
An estimator that is unbiased and has the minimum variance is the best (efficient). The OLS estimator is the best (efficient) estimator because OLS estimators have the least variance among all linear and unbiased estimators.
The effect of omitted variables in ordinary least squares
The presence of omitted-variable bias violates this particular assumption. The violation causes the OLS estimator to be biased and inconsistent. The direction of the bias depends on the estimators as well as the covariance between the regressors and the omitted variables.
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. Multiple regression is an extension of simple linear (OLS) regression, which uses just one explanatory variable.
What is a real life example of regression?
For example, it can be used to predict the relationship between reckless driving and the total number of road accidents caused by a driver, or, to use a business example, the effect on sales and spending a certain amount of money on advertising. Regression is one of the most common models of machine learning.
Is OLS only for linear?
In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model (with fixed level-one effects of a linear function of a set of explanatory variables) by the principle of least squares: minimizing the sum of the squares of the ...
How do you implement linear regression in Python from scratch?
A from-scratch version computes the slope and intercept directly from the data (x and y are assumed to be NumPy arrays; the original snippet also sketches a corr_coef(x, y) helper for the correlation coefficient):

def linear_regression(x, y):
    N = len(x)                     # number of observations (not needed for the formulas below)
    x_mean = x.mean()
    y_mean = y.mean()
    B1_num = ((x - x_mean) * (y - y_mean)).sum()
    B1_den = ((x - x_mean)**2).sum()
    B1 = B1_num / B1_den           # slope
    B0 = y_mean - (B1 * x_mean)    # intercept
    return B0, B1

def predict(B0, B1, new_x):
    return B0 + B1 * new_x
Ordinary Least Squares or OLS is one of the simplest (if you can call it so) methods of linear regression. The goal of OLS is to closely "fit" a function with the data. It does so by minimizing the sum of squared errors from the data.
What kind of math is linear regression?
Linear regression is a statistical practice of calculating a straight line that specifies a mathematical relationship between two variables. Linear regression is defined as an algorithm that provides a linear relationship between an independent variable and a dependent variable to predict the outcome of future events.
What is the main problem with linear regression?
Since linear regression assumes a linear relationship between the input and output variables, it fails to fit complex datasets properly. In most real-life scenarios the relationship between the variables of the dataset isn't linear and hence a straight line doesn't fit the data properly.
What are the weaknesses of linear regression?
Weaknesses: Linear regression performs poorly when there are non-linear relationships. They are not naturally flexible enough to capture more complex patterns, and adding the right interaction terms or polynomials can be tricky and time-consuming.
How do you answer linear regression?
The formula for simple linear regression is Y = mX + b, where Y is the response (dependent) variable, X is the predictor (independent) variable, m is the estimated slope, and b is the estimated intercept.
How do you solve a linear equation in two variables in Python?
To solve the two equations for the two variables x and y, we'll use SymPy's solve() function. The solve() function takes two arguments: a tuple of the equations (eq1, eq2) and a tuple of the variables to solve for (x, y). The SymPy solution object is a Python dictionary.
How to find the slope and intercept in linear regression in Python?
Find the Slope and Intercept Using Python
The np.polyfit() function returns the slope and intercept.
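For instance (with made-up x and y values):

import numpy as np

x = np.array([0, 1, 2, 3, 4])
y = np.array([-10, -9, -8, -7, -6])

slope, intercept = np.polyfit(x, y, 1)   # a degree-1 fit returns [slope, intercept]
print(slope, intercept)                  # ≈ 1.0, -10.0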
Ordinary Least Squares regression (OLS) is a common technique for estimating coefficients of linear regression equations which describe the relationship between one or more independent quantitative variables and a dependent variable (simple or multiple linear regression).
What algorithm does sklearn linear regression use?
The one that sklearn uses is Ordinary Least Squares, derived from the Gauss-Markov theorem.
What is the difference between statsmodels OLS and sklearn LinearRegression?
A key difference between the two libraries is how they handle constants: scikit-learn allows the user to specify whether or not to add a constant through a parameter, while statsmodels' OLS class has a function that adds a constant to a given array.