
Understanding Linear Regression

  • Writer: Saarthak Sangamnerkar
  • Sep 4, 2018
  • 6 min read

Updated: Jan 2, 2019

In this article, we will explore Linear Regression in depth and see how to implement it in scikit-learn. At the end, I will post some resources from which you can also familiarize yourself with coding Linear Regression from scratch.



Let's talk about one of the most popular Supervised Learning algorithms, Linear Regression.


In the blog post where I discussed some fundamentals of machine learning, we saw that Supervised Learning algorithms are divided into two categories - Regression and Classification. While the former predicts continuous value outputs, the latter predicts discrete outputs. For example, predicting the score of a student is a regression problem, while judging whether a tumor is malignant or benign falls under the latter category.


Regression


Let us quickly revise what Regression means. Regression is a method of modelling a target value based on independent predictors. This method is used for prediction and for finding cause-and-effect relationships between variables.

Linear Regression


The term "linearity" in algebra refers to a linear relationship between two or more variables. If we draw this relationship in a two dimensional space (between two variables, in this case), we get a straight line.

Simple linear regression is a type of regression where the number of independent variables is one and there is a linear relationship between the independent (x) and dependent (y) variables.



The red line in the above graph is referred to as the best fit straight line. Based on the given data points, we try to plot a line that models the points best. The line can be modeled by the linear equation below:

ŷ = β0 + β1x

β0 is the y-intercept, i.e. the value of ŷ when x = 0.

β1 is the slope of our line.


To find the best model parameters:

1. Define a cost function, or loss function, that measures how inaccurate our model's predictions are.

2. Find the parameters that minimize loss, i.e. make our model as accurate as possible.


β1 is called the scale factor or coefficient, and β0 is called the bias coefficient. The bias coefficient gives an extra degree of freedom to the model. This equation is the familiar line equation y = mx + b, with m = β1 (slope) and b = β0 (intercept). So in this Simple Linear Regression model we want to draw a line between X and Y which estimates the relationship between them.


But how do we find these coefficients? That's the learning procedure. We can find them using different approaches: one is the Ordinary Least Squares method and the other is the Gradient Descent approach.


Ordinary Least Squares Method



To understand the Ordinary Least Squares method, let us take the same scatter plot we saw above. Notice the red line in the graph. The red line is the final goal: we need to find the line (henceforth referred to as the model) with the least error. Any good model will have the least error, and we find this line by reducing the error. In other words, we need to hypothesize a line which is as close as possible to all the points.



The error of each point is the distance between the line and that point, as shown in the figure above. The total error D of this model is the sum of the squared errors of each point, i.e.

D = Σᵢ (yᵢ - ŷᵢ)²

We square each distance because some points will be above the line and some points will be below the line. We can minimize the error in the model by minimizing D. Minimizing D, we get

β1 = Σᵢ (xᵢ - x̄)(yᵢ - ȳ) / Σᵢ (xᵢ - x̄)²

β0 = ȳ - β1x̄


In these equations, x̄ is the mean value of the input variable X and ȳ is the mean value of the output variable Y.

This method is known as the Ordinary Least Squares method.
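The closed-form estimates β1 = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)² and β0 = ȳ - β1x̄ can be computed in a few lines of NumPy. A minimal sketch, using a small made-up data set for illustration:

```python
import numpy as np

# Hypothetical data: hours studied (x) vs. exam score (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

x_mean, y_mean = x.mean(), y.mean()

# Closed-form Ordinary Least Squares estimates
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

print(beta0, beta1)  # intercept and slope of the best fit line
```

For this toy data the fitted line is ŷ = 1.8 + 0.8x.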



Cost Function


The cost function is used to compute the best possible values of the coefficients β0 and β1, i.e. the values that provide the closest fit line for the data points.


The cost function for a linear model is:

J(β0, β1) = 1/(2n) Σᵢ (ŷᵢ - yᵢ)²,  where ŷᵢ = β0 + β1xᵢ

So while the cost function may seem a little complex, it isn't rocket science by any means. If you look at it closely, the term β0 + β1xᵢ is nothing but the model's prediction ŷᵢ. The cost function tells us to find the difference between each real data point (yᵢ) and the model's prediction (ŷᵢ), and square it in order to have only positive values and to penalize larger differences. Finally, we add them up and take the average. Dividing by n ensures that the cost function doesn't grow with the number of elements in the training set, which allows a better comparison across models; the extra factor of 2 simply makes the derivative cleaner.
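The cost function above can be sketched directly in code. Here is a minimal version, with made-up data for illustration:

```python
import numpy as np

def cost(beta0, beta1, x, y):
    """Half mean squared error: J = 1/(2n) * sum((y_hat - y)^2)."""
    y_hat = beta0 + beta1 * x          # model prediction for each point
    return np.sum((y_hat - y) ** 2) / (2 * len(x))

# Hypothetical data lying exactly on y = 2x
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print(cost(0.0, 2.0, x, y))  # perfect fit -> cost is 0.0
print(cost(0.0, 1.0, x, y))  # worse slope -> larger cost
```

A perfect fit drives the cost to zero; any deviation of the line from the points increases it.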


It is fairly easy to derive the beta parameters for a 2-dimensional model. However, as the dimensions increase, computing the beta parameters for each variable becomes very difficult. As I mentioned in an earlier blog article, real-world data can have millions of dimensions, so this method becomes infeasible. A method which we discussed briefly in the first blog post about Machine Learning will come in handy now.


Gradient Descent


Imagine you’re standing somewhere on a mountain (point A). You want to get as low as possible as fast as possible, so you decide to take the following steps:

- You check your current altitude, your altitude a step north, a step south, a step east, and a step west. Using this, you figure out which direction you should step to reduce your altitude as much as possible.
- Repeat until stepping in any direction would cause you to go up again (point B).

This is Gradient Descent.


Gradient Descent aims to find the minimum of our predictor's cost function by iteratively computing improved approximations. It achieves this by repeatedly moving in the direction in which the ground slopes down most steeply, until stepping in any direction would cause you to go up again.


Assume you have a U-shaped pit. Your goal is to go to the bottom of the pit, and you can only take discrete steps. If you take smaller steps, you will reach the bottom eventually, but it will take a long time. Whereas if you take larger steps, you will reach it quickly, but there's a chance that you may overshoot the bottom. In the gradient descent algorithm, the size of the steps you take is the learning rate. This decides how fast the algorithm converges to the minimum.


Let the cost function be z = J(β0, β1). To begin gradient descent, we start by making an initial guess for the parameters β0 and β1.

Next, we take the partial derivative of the loss function with respect to each beta parameter: [∂z/∂β0, ∂z/∂β1]. The partial derivative indicates how much the total loss increases or decreases if you increase β0 or β1 by a very small amount. If ∂z/∂β1 is a negative number, then increasing β1 is good, as it will reduce the total loss. If it is a positive number, you want to decrease β1. If it is zero, we don't change β1, as it means we have reached an optimum.

We repeat this until we reach the bottom, i.e., the algorithm converges and the loss has been minimized.
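The loop described above can be sketched for simple linear regression. The partial derivatives of J = 1/(2n) Σ(ŷᵢ - yᵢ)² are ∂J/∂β0 = (1/n) Σ(ŷᵢ - yᵢ) and ∂J/∂β1 = (1/n) Σ(ŷᵢ - yᵢ)xᵢ. The data, learning rate, and iteration count here are illustrative choices, not prescriptions:

```python
import numpy as np

# Hypothetical noiseless data generated from y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

beta0, beta1 = 0.0, 0.0   # initial guess
lr = 0.05                 # learning rate (step size)
n = len(x)

for _ in range(5000):
    y_hat = beta0 + beta1 * x
    # Partial derivatives of J = 1/(2n) * sum((y_hat - y)^2)
    d_beta0 = np.sum(y_hat - y) / n
    d_beta1 = np.sum((y_hat - y) * x) / n
    # Step downhill: move opposite to the gradient
    beta0 -= lr * d_beta0
    beta1 -= lr * d_beta1

print(round(beta0, 3), round(beta1, 3))  # converges toward 1.0 and 2.0
```

On this toy data, the loop recovers the true intercept and slope; too large a learning rate would make the updates overshoot and diverge instead.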


Overfitting


An overfit model predicts near-perfect results on the training data, but once new data is provided it generates inaccurate results compared to the actual values. An overfit model won't generalize to data it has not seen before, which produces inaccurate predictions. Overfitting occurs when a model over-learns from the training data to the point where it starts picking up idiosyncrasies that aren't representative of patterns in the real world. This becomes especially problematic as you make your model increasingly complex. Underfitting is the related issue where your model is not complex enough to capture the underlying trends in the data.


Bias-Variance Tradeoff

Bias is the amount of error introduced by approximating real-world phenomena with a simplified model. Variance is how much your model's test error changes based on variation in the training data; it reflects the model's sensitivity to the idiosyncrasies of the data set it was trained on. As a model increases in complexity and becomes more flexible, its bias decreases (it does a good job of explaining the training data), but its variance increases (it doesn't generalize as well).

In order to have a good model, you need one with low bias and low variance.

How to tackle overfitting?


Use regularization: add a penalty to the loss function for building a model that assigns too much explanatory power to any one feature, or that allows too many features to be taken into account.



With regularization, the cost function becomes

J(β0, β1) = 1/(2n) Σᵢ (ŷᵢ - yᵢ)² + λ Σⱼ βⱼ²

The first piece of the sum is our normal cost function. The second piece is a regularization term that adds a penalty for large beta coefficients that give too much explanatory power to any specific feature. The lambda (λ) coefficient of the regularization term is a hyperparameter: a general setting of your model that can be increased or decreased (i.e. tuned) in order to improve performance. A higher lambda value harshly penalizes large beta coefficients that could lead to potential overfitting. To decide on the best value of lambda, you'd use a method known as cross-validation, which involves holding out a portion of the training data during training and then seeing how well your model explains the held-out portion.
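To see the effect of the penalty concretely, here is a sketch for simple linear regression with centered data (so the intercept drops out and only the slope β1 is penalized, which is the usual convention). Minimizing 1/(2n) Σ(β1x̃ᵢ - ỹᵢ)² + λβ1² gives the closed form below; the data is made up for illustration:

```python
import numpy as np

# Hypothetical data (same toy set as before)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])
n = len(x)
xc, yc = x - x.mean(), y - y.mean()   # center so the intercept drops out

def ridge_slope(lam):
    # Closed-form minimizer of 1/(2n)*sum((b1*xc - yc)^2) + lam*b1^2
    return np.sum(xc * yc) / (np.sum(xc ** 2) + 2 * n * lam)

for lam in [0.0, 0.1, 1.0, 10.0]:
    print(lam, round(ridge_slope(lam), 4))
```

With λ = 0 this reduces to the ordinary least squares slope; as λ grows, the slope shrinks toward zero, which is exactly the penalty on large coefficients at work.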


Implementing a Simple Linear Regression Model using scikit-learn
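A minimal sketch of such an implementation, assuming scikit-learn is installed; the data here is made up for illustration. Note that scikit-learn expects the features as a 2-D array of shape (n_samples, n_features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: hours studied vs. exam score
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # shape (n_samples, 1)
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

model = LinearRegression()
model.fit(X, y)

print(model.intercept_)              # beta0, the bias coefficient
print(model.coef_[0])                # beta1, the slope
print(model.predict(np.array([[6.0]])))  # prediction for a new point
```

Under the hood this solves the same least squares problem we derived by hand, so the fitted intercept and slope match the closed-form values.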



Additional Readings


