Linear Regression — All about it
This all started in the 1800s with a man named Francis Galton. Galton was studying the relationship between the heights of fathers and their sons. He discovered that a man’s son tended to be roughly as tall as his father. However, his breakthrough was that the son’s height tended to be closer to the overall average height of all people. Let’s take basketball player Shaquille O’Neal as an example. He is really tall (7 ft). If he has a son, chances are he’ll be pretty tall too. However, Shaq is such an anomaly that there is a very good chance that his son will not be as tall as he is. Shaq’s son is pretty tall (6 ft 7 in), but not nearly as tall as his dad. Galton called this phenomenon regression, as in “A father’s son’s height tends to regress (or drift towards) the mean (average) height.”
What is Linear Regression?
Linear regression is a supervised machine learning model that takes a linear approach to modeling the relationship between a dependent variable (the target) and one or more independent variables (also called features or explanatory variables). In simple linear regression, we establish a relationship between the target variable and a single input variable by fitting a line, known as the regression line.
Example
Let’s start with an example: suppose we have a dataset with information about the area of a house (in square feet) and its price (in thousands of dollars), and our task is to build a machine learning model that can predict the price given the area. Here is what our dataset looks like:
If we plot our data, we might get something similar to the following:
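For concreteness, here is a minimal sketch of what building and plotting such a dataset might look like in Python. The area and price values below are made up for illustration and are not the exact dataset from the figure.

```python
import matplotlib.pyplot as plt

# Hypothetical dataset: house area (sq ft) vs. price (thousands of dollars)
areas = [1200, 1800, 2400, 3000, 3456, 4000]
prices = [250, 340, 430, 520, 600, 690]

plt.scatter(areas, prices)
plt.xlabel("Area (sq ft)")
plt.ylabel("Price (thousands of dollars)")
plt.title("House area vs. price")
plt.show()
```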
In general, a line can be represented by the linear equation y = m * x + b, where y is the dependent variable, x is the independent variable, m is the slope, and b is the intercept.
In machine learning, we rewrite our equation as y(x) = w0 + w1 * x, where the w’s are the parameters of the model, x is the input, and y is the target variable. This is the standard notation in machine learning and makes it easier to add more dimensions: we can simply add parameters w2, w3, … and inputs x2, x3, … as we add more dimensions.
Different values of w0 and w1 will give us different lines, as shown below
Each of the values of the parameters determines what predictions the model will make. For example, let’s consider (w0, w1) = (0.0, 0.2), and the first data point, where x = 3456 and ytrue = 600. The prediction made by the model, y(x) = 0.0 + 0.2*3456 = 691.2. If instead the weights were (w0, w1) = (80.0, 0.15), then the prediction would be y(x) = 80.0 + 0.15*3456 = 598.4, which is much closer to the ytrue = 600.
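A quick sketch that reproduces the arithmetic above; the `predict` helper is just an illustrative name.

```python
def predict(x, w0, w1):
    """Simple linear regression prediction: y(x) = w0 + w1 * x."""
    return w0 + w1 * x

# First data point from the example: x = 3456 sq ft, true price = 600
x, y_true = 3456, 600

print(predict(x, 0.0, 0.2))    # 691.2 -- fairly far from 600
print(predict(x, 80.0, 0.15))  # 598.4 -- much closer to 600
```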
Cost functions
Different values of the weights (w0, w1, w2, … wn) give us different lines (or hyperplanes), and our task is to find the weights for which we get the best fit.
One question you may have is: how can we determine how well a particular line fits our data? Or, given two lines, how do we determine which one is better? For this, we introduce a cost function, which measures, for a particular value of the w’s, how close the predicted y’s are to the corresponding true values. That is, how well does a particular set of weights predict the target value?
For linear regression, the most commonly used cost function is the mean squared error cost function. It is the average over the various data points (xi, yi) of the squared error between the predicted value y(x) and the target value ytrue.
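A minimal sketch of the mean squared error computation, reusing the hypothetical areas and prices from the earlier sketch:

```python
# Hypothetical data from the earlier sketch
areas = [1200, 1800, 2400, 3000, 3456, 4000]
prices = [250, 340, 430, 520, 600, 690]

def mean_squared_error(y_pred, y_true):
    """Average of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

w0, w1 = 80.0, 0.15
predictions = [w0 + w1 * x for x in areas]
print(mean_squared_error(predictions, prices))
```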
Residuals
The cost function defines a cost based on the distance between the true target and the predicted target (shown in the graph as vertical lines between the sample points and the regression line), also known as the residual. The residuals are visualized below:
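A small sketch of how the residuals could be computed and drawn, again using the hypothetical data and the weights (80.0, 0.15) from the sketches above:

```python
import matplotlib.pyplot as plt

# Hypothetical data and weights from the earlier sketches
areas = [1200, 1800, 2400, 3000, 3456, 4000]
prices = [250, 340, 430, 520, 600, 690]
w0, w1 = 80.0, 0.15
predictions = [w0 + w1 * x for x in areas]

# Residual = true target - predicted target (vertical distance to the line)
residuals = [t - p for t, p in zip(prices, predictions)]
print(residuals)

plt.scatter(areas, prices, label="data")
plt.plot(areas, predictions, label="regression line")
# Draw each residual as a dashed vertical segment from the point to the line
for x, t, p in zip(areas, prices, predictions):
    plt.plot([x, x], [t, p], linestyle="--", color="gray")
plt.xlabel("Area (sq ft)")
plt.ylabel("Price (thousands of dollars)")
plt.legend()
plt.show()
```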
Why mean squared error?
One question you might have is: why do we not use the sum of the residuals as our error function? Why squared? Why mean?
1. Squaring makes any “large” residuals impact the cost function more than they would if a linear (unsquared) penalty were used. The result is a regression with more uniform residuals and less drastic outliers.
2. Taking the mean makes the result independent of the number of data points used. A sum would be proportional to the number of data points, while a mean is not. This makes comparisons between data sets easier and the results more meaningful when performing regressions in different problem spaces. A small numeric sketch of both points follows.
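Here is that sketch; the residual values below are made up purely to illustrate the two effects.

```python
def mse(residuals):
    """Mean of squared residuals."""
    return sum(e ** 2 for e in residuals) / len(residuals)

# Squaring: one large residual dominates the cost,
# even though both sets have the same total absolute error
uniform_residuals = [5, 5, 5, 5]
outlier_residuals = [0, 0, 0, 20]
print(mse(uniform_residuals))  # 25.0
print(mse(outlier_residuals))  # 100.0

# Mean vs. sum: the sum grows with the number of points,
# while the mean stays comparable across dataset sizes
small = [5, 5]
large = [5] * 100
print(sum(e ** 2 for e in small), sum(e ** 2 for e in large))  # 50 vs 2500
print(mse(small), mse(large))                                  # 25.0 vs 25.0
```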
Optimization using Gradient Descent
Each value of the weight vector w gives us a corresponding cost J(w). We want to find the values of the weights for which the cost is at a minimum. We can visualize this as follows:
Note: Above we have used the word “global” because the shape of the cost function for linear regression is convex (i.e. like a bowl). It has a single minimum, and it smoothly increases in all directions around it.
Given the linear regression model and the cost function, we can use Gradient Descent (covered in the next article) to find a good set of values for the weight vector. The process of finding the best model out of the many possible models is called optimization.
Example:
Consider the following example, where we need to fit the green line shown below to the data. How do we determine its parameters?
We try to find the parameters of the above equation by minimizing the cost function.
Above is the optimal fit of the line to the data, hence the cost is zero.
Finally, plotting the cost for all possible values of the weights (w0/theta0 and w1/theta1) forms the cost function curve shown below:
As the hypothesis generates predictions that are far from the actual values, the cost also moves further away from the minimum of the cost function.
Here is the cost function surface we get for the above:
In 2-D, here is the cost function surface we get:
We will use contour plots to show the above figure in 2-D:
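A minimal sketch of how such a contour plot of the cost function could be produced; the data values and the grid ranges below are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical 1-D data (same spirit as the earlier sketches)
x = np.array([1200, 1800, 2400, 3000, 3456, 4000], dtype=float)
y = np.array([250, 340, 430, 520, 600, 690], dtype=float)

# Evaluate the MSE cost J(w0, w1) on a grid of weight values
w0_grid, w1_grid = np.meshgrid(np.linspace(-200, 200, 100),
                               np.linspace(0.0, 0.35, 100))
cost = np.zeros_like(w0_grid)
for i in range(w0_grid.shape[0]):
    for j in range(w0_grid.shape[1]):
        pred = w0_grid[i, j] + w1_grid[i, j] * x
        cost[i, j] = np.mean((pred - y) ** 2)

plt.contour(w0_grid, w1_grid, cost, levels=30)
plt.xlabel("w0 (theta0)")
plt.ylabel("w1 (theta1)")
plt.title("Contours of the cost function J(w0, w1)")
plt.show()
```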
How do we determine the weights?
Solution: Gradient descent
We use the help of derivatives to determine the direction in which we need to move to find the optimal solution.
The derivative
The derivative measures the steepness of the graph of a function at some particular point of the graph. Thus, the derivative is a slope. (That means that it is a ratio of change in the value of the function to change in the independent variable.)
We can find an average slope between two points.
But how do we find the slope at a point?
Therefore, with derivatives, we use a small difference and then have it shrink towards zero.
To do this, follow these steps (a small numeric sketch follows the list):
- Fill in the slope formula: Δy/Δx = (f(x + Δx) − f(x)) / Δx
- Simplify it as best we can
- Then make Δx shrink towards zero.
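As a small numeric sketch of these steps, here is the slope formula evaluated for f(x) = x² at x = 3 with a shrinking Δx; the function is chosen purely for illustration.

```python
def f(x):
    return x ** 2

x = 3.0
for dx in [1.0, 0.1, 0.01, 0.001, 0.0001]:
    slope = (f(x + dx) - f(x)) / dx   # the slope formula from step 1
    print(dx, slope)

# As dx shrinks toward zero, the slope approaches 6.0, which is the
# derivative of x**2 at x = 3 (since d/dx x**2 = 2x).
```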
By using the derivative, we can determine the direction in which we need to move to converge to the minimum.
Back to our gradient descent algorithm:
Gradient descent for linear regression (Linear regression with one variable)
The cost function for linear regression is a bowl-shaped function, i.e. a “convex function”:
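A minimal sketch of gradient descent for linear regression with one variable; the data, learning rate, and number of steps are illustrative assumptions rather than tuned values.

```python
import numpy as np

# Hypothetical data (areas expressed in thousands of sq ft so that the
# gradient updates stay stable with a simple fixed learning rate)
x = np.array([1.2, 1.8, 2.4, 3.0, 3.456, 4.0])
y = np.array([250, 340, 430, 520, 600, 690], dtype=float)

w0, w1 = 0.0, 0.0
learning_rate = 0.05
n = len(x)

for step in range(5000):
    y_pred = w0 + w1 * x
    error = y_pred - y
    # Partial derivatives of the MSE cost with respect to w0 and w1
    grad_w0 = (2.0 / n) * np.sum(error)
    grad_w1 = (2.0 / n) * np.sum(error * x)
    # Move the weights a small step against the gradient
    w0 -= learning_rate * grad_w0
    w1 -= learning_rate * grad_w1

print(w0, w1)                            # fitted intercept and slope
print(np.mean((w0 + w1 * x - y) ** 2))   # final cost
```

Because the cost is convex, these repeated downhill steps end up near the single global minimum regardless of the starting weights.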
Problem 1.
So far we have used only one input (explanatory) variable, X, so it was not a big deal to visualize the input data. But real datasets usually have many features, which makes them much harder to visualize and manage.
Solution:
The best notation for representing a larger set of features and understanding what is going on with the data (even when we cannot visualize it) is linear algebra.
Linear Algebra
Here X = features, Y = target variable.
Using linear algebra, we can operate on our data in a manageable way.
Now we use the following notation:
We make similar changes to the cost function and gradient descent as well.
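A minimal sketch of this vectorized notation in code; the feature values and weights are made up, and the leading column of ones is one common convention for folding the intercept w0 into the same matrix product.

```python
import numpy as np

# Hypothetical design matrix X: one row per example, with a leading
# column of ones so the intercept w0 is handled by the same dot product
X = np.array([[1.0, 1200, 3],
              [1.0, 1800, 4],
              [1.0, 2400, 3],
              [1.0, 3000, 5]])
y = np.array([250, 340, 430, 520], dtype=float)
w = np.array([80.0, 0.15, 2.0])

# Predictions for all examples at once: y_hat = X @ w
y_hat = X @ w

# Vectorized MSE cost and its gradient with respect to w
n = len(y)
cost = np.mean((y_hat - y) ** 2)
gradient = (2.0 / n) * X.T @ (y_hat - y)

print(y_hat, cost, gradient)
```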
Problem 2.
Okay, now we are able to easily maintain and represent our data in linear algebra form, but what if our features are on very different scales, as follows:
Consider the following example:
Here the features are not on the same scale: x1 is in the range of thousands, while x2 is in the range of 1–10. The contours would then be plotted as follows (very skewed), and gradient descent would take a very long time to find the weights for which the cost function is at its minimum (the global minimum).
Solution:
Scale the features so that gradient descent runs much faster. We have covered feature scaling in our following article.
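A minimal sketch of one common form of feature scaling (standardization); the feature values are made up to mirror the x1/x2 example above.

```python
import numpy as np

# Hypothetical features on very different scales:
# x1 in the thousands, x2 roughly between 1 and 10
X = np.array([[1200.0, 3],
              [1800.0, 4],
              [2400.0, 3],
              [3000.0, 5],
              [4000.0, 8]])

# Standardization: subtract the mean and divide by the standard deviation
# of each feature, so every column ends up on a comparable scale
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print(X_scaled)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # ~0 and ~1 per column
```

After scaling, the cost contours become much more circular, so gradient descent can take more direct steps toward the global minimum.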