- 📘 Math
- 📘 Polynomial Regression
- 💻 Code
Linear regression is a linear model, e.g. a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, that output variable (y) can be calculated from a linear combination of the input variables (x).
On the image above there is an example of dependency between input variable x and output variable y. The red line in the above graph is referred to as the best fit straight line. Based on the given data points (training examples), we try to plot a line that models the points the best. In the real world scenario we normally have more than one input variable.
We can measure the accuracy of our hypothesis function by using a cost function. This takes an average difference (actually a fancier version of an average) of all the results of the hypothesis with inputs from x's and the actual output y's.
Function that shows how accurate the predictions of the hypothesis are with current set of parameters.
x i - input (features) of i th training example
y i - output of i th training example
m - number of training examples
h𝜽(x i) - (y i) - difference between the predicted value and the actual value
This function is otherwise called the "Squared error function", or "Mean squared error". The mean is halved 1/2 as a convenience for the computation of the gradient descent, as the derivative term of the square function will cancel out the 1/2 term.
Gradient descent is an iterative optimization algorithm for finding the minimum of a cost function described above. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.
"Batch": Each step of gradient descent uses all the traning examples.
It has been proven that if learning rate α is sufficiently small, then J(θ) will decrease on every iteration.
We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum. The red arrows show the minimum points in the graph.
The way we do this is by taking the derivative (the tangential line to a function) of our cost function. The slope of the tangent is the derivative at that point and it will give us a direction to move towards. We make steps down the cost function in the direction with the steepest descent. The size of each step is determined by the parameter α, which is called the learning rate.
For example, the distance between each 'star' in the graph above represents a step determined by our parameter α. A smaller α would result in a smaller step and a larger α results in a larger step. The direction in which the step is taken is determined by the partial derivative of J(𝜽0,𝜽1). Depending on where one starts on the graph, one could end up at different points. The image above shows us two different starting points that end up in two different places.
The gradient descent algorithm is:
repeat until convergence:
where j=0,1 represents the feature index number.
At each iteration j, one should simultaneously update the parameters θ. Updating a specific parameter prior to calculating another one on the j(th) iteration would yield to a wrong implementation.
- The gradient descent can converge to a local minimum, even with the learning rate α fixed.
- As we approach a local minimum, gradient descent will automatically take smaller steps. So no need to decrease α over time.
- If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
- If α is too small, gradient descent can be slow.
When specifically applied to the case of linear regression, a new form of the gradient descent equation can be derived. We can substitute our actual cost function and our actual hypothesis function and modify the equation to:
where m is the size of the training set, θ0 a constant that will be changing simultaneously with θ1 and xi, yi are values of the given training set (data).
- Note that we have separated out the two cases for θj into separate equations for θ0 and θ1.
The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis will become more and more accurate.
The gradient descent equation itself is generally the same form; we just have to repeat it for our 'n' features:
In other words:
Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero. To implement both of these techniques, adjust your input values as shown in this formula:
Where µi is the average of all the values for feature (i) and is the range of values (max - min), or si is the standard deviation.
Note that dividing by the range, or dividing by the standard deviation, give different results. The quizzes in this course use range - the programming exercises use standard deviation.
Make a plot with number of iterations on the x-axis. Now plot the cost function, J(θ) over the number of iterations of gradient descent. If J(θ) ever increases, then you probably need to decrease α.
Automatic convergence test. Declare convergence if J(θ) decreases by less than ℇ in one iteration, where **ℇ** is some small value such as 10-3. However in practice it's difficult to choose this threshold value.
It has been proven that if learning rate α is sufficiently small, then J(θ) will decrease on every iteration.
- If α is too small: slow convergence.
- If α is too large: may not decrease on every iteration and thus may not converge.
We can combine multiple features into one. For example, we can combine x1 and x2 into a new feature x3 by taking x1∙x2.
We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth degree polynomial in x.
Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the hypothesis function is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.
For example, if our hypothesis function is h𝜽(x) = 𝜽0 + 𝜽1x1 then we can create additional features based on x1, to get the quadratic function h𝜽(x) = 𝜽0 + 𝜽1x1 + 𝜽2x22 or the cubic function h𝜽(x) = 𝜽0 + 𝜽1x1 + 𝜽2x22 + 𝜽3x33.
In the cubic version, we have created new features x2 and x3, where x2 = x12 and x3 = x12.
For example, if the price of the apartment is in non-linear dependency of its size then you might add several new size-related features:
h𝜽(x) = 𝜽0 + 𝜽1x1 + 𝜽2x2 + 𝜽3x3 = 𝜽0 + 𝜽1(size) + 𝜽2(size)2 + 𝜽3(size)3.
- One important thing to keep in mind is, if you choose your features this way then feature scaling becomes very important.
For example, if x1 has range 1-1'000 then range of x12 becomes 1-1'000'000 and that of x13 becomes 1-1'000'000'000.
In the "Normal Equation" method, we will minimize J by explicitly taking its derivatives with respect to the 𝜽j ’s, and setting them to zero. This allows us to find the optimum theta without iteration.
- The normal equation formula is given below:
There is no need to do feature scaling with the normal equation.
The following is a comparison of gradient descent and the normal equation:
Gradient Descent | Normal Equation |
---|---|
Need to choose alpha | No need to choose alpha |
Needs many iterations | No need to iterate |
𝛰(k n2) | 𝛰(n3), need to calculate inverse of 𝘟𝘛𝘟 |
Works well when n is large | Slow if n is very large |
With the normal equation, computing the inversion has complexity 𝛰(n3). So if we have a very large number of features, the normal equation will be slow. In practice, when n exceeds 10,000 it might be a good time to go from a normal solution to an iterative process.
When implementing the normal equation in Octave use the pinv
function rather than inv
.
The pinv
function will give you a value of 𝜽 even if 𝘟𝘛𝘟 is not invertible.
If 𝘟𝘛𝘟 is noninvertible, the common causes might be having:
-
Redundant features, where two features are very closely related (i.e. they are linearly dependent).
-
Too many features (e.g. m ≤ n). In this case, delete some features or use "regularization" (to be explained in a later lesson).
-
Solutions to the above problems include deleting a feature that is linearly dependent with another or deleting one or more features when there are too many features.