Linear Regression

Definition

Linear regression is a linear model, i.e. a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, the output variable (y) can be calculated from a linear combination of the input variables (x).

(Figure: training data points with the best-fit straight line)

The image above shows an example of the dependency between the input variable x and the output variable y. The red line in the graph is referred to as the best-fit straight line. Based on the given data points (training examples), we try to plot a line that models the points the best. In a real-world scenario we normally have more than one input variable.
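As a minimal sketch of that linear combination (NumPy, with made-up numbers; the names are illustrative, not code from this repository), the hypothesis is just a dot product of the parameters with the inputs:

```python
import numpy as np

def hypothesis(X, theta):
    """h(x) = theta0 + theta1*x1 + ... + thetaN*xN for every row of X.

    X is an (m, n+1) matrix whose first column is all ones (the intercept
    term); theta is an (n+1,) parameter vector.
    """
    return X @ theta

# Three training examples with a single input variable x.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
theta = np.array([2.0, 3.0])      # h(x) = 2 + 3x
print(hypothesis(X, theta))       # [ 5.  8. 11.]
```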

Cost Function

We can measure the accuracy of our hypothesis function by using a cost function. This takes an average difference (actually a fancier version of an average) of all the results of the hypothesis with inputs from the x's and the actual outputs y's.

In other words, it is a function that shows how accurate the predictions of the hypothesis are with the current set of parameters.

J(θ) = 1/(2m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

x⁽ⁱ⁾ - input (features) of the iᵗʰ training example

y⁽ⁱ⁾ - output of the iᵗʰ training example

m - number of training examples

hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ - difference between the predicted value and the actual value

This function is otherwise called the "Squared error function" or "Mean squared error". The mean is halved (1/2) as a convenience for the computation of gradient descent, as the derivative term of the square function will cancel out the 1/2 term.
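A direct translation of this formula into NumPy might look as follows (a sketch with toy data; the names are illustrative):

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = 1/(2m) * sum((h(x_i) - y_i)^2), the halved mean squared error."""
    m = len(y)
    errors = X @ theta - y             # h(x_i) - y_i for every example
    return (errors @ errors) / (2 * m)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # intercept column + x
y = np.array([2.0, 4.0, 6.0])                        # points on y = 2x
print(compute_cost(X, y, np.array([0.0, 2.0])))      # 0.0    (perfect fit)
print(compute_cost(X, y, np.array([0.0, 1.0])))      # 2.33...(poor fit)
```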

"Batch" Gradient Descent

Gradient descent is an iterative optimization algorithm for finding the minimum of the cost function described above. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.

"Batch": Each step of gradient descent uses all the traning examples.

(Figure: the surface of J(θ0, θ1), with red arrows marking the minima and two gradient descent paths starting from different points)

We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum. The red arrows show the minimum points in the graph.

The way we do this is by taking the derivative (the tangential line to a function) of our cost function. The slope of the tangent is the derivative at that point and it will give us a direction to move towards. We make steps down the cost function in the direction with the steepest descent. The size of each step is determined by the parameter α, which is called the learning rate.

For example, the distance between each 'star' in the graph above represents a step determined by our parameter α. A smaller α would result in a smaller step and a larger α results in a larger step. The direction in which the step is taken is determined by the partial derivative of J(𝜽0,𝜽1). Depending on where one starts on the graph, one could end up at different points. The image above shows us two different starting points that end up in two different places.

The gradient descent algorithm is:

repeat until convergence:

θj := θj − α · ∂/∂θj J(θ0, θ1)

where j=0,1 represents the feature index number.

At each iteration of gradient descent, one should simultaneously update all the parameters θ. Updating a specific parameter prior to calculating another one would yield a wrong implementation; the sketch below illustrates the difference.
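To make the simultaneous-update requirement concrete, here is a sketch (NumPy, toy data; grad_j is a hypothetical helper, not a function from this repository):

```python
import numpy as np

def grad_j(X, y, theta, j):
    """Partial derivative of J(theta) with respect to theta_j."""
    m = len(y)
    return ((X @ theta - y) @ X[:, j]) / m

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.array([1.0, 1.0])
alpha = 0.1

# Correct: evaluate every derivative at the CURRENT theta, then assign.
temp0 = theta[0] - alpha * grad_j(X, y, theta, 0)
temp1 = theta[1] - alpha * grad_j(X, y, theta, 1)
theta[0], theta[1] = temp0, temp1

# Wrong: assigning theta[0] first would make the second grad_j call see a
# partially-updated parameter vector, which is no longer gradient descent.
```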

  • Gradient descent can converge to a local minimum, even with the learning rate α fixed.
  • As we approach a local minimum, gradient descent will automatically take smaller steps, so there is no need to decrease α over time.


  • If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
  • If α is too small, gradient descent can be slow.


Gradient Descent For Linear Regression

When specifically applied to the case of linear regression, a new form of the gradient descent equation can be derived. We can substitute our actual cost function and our actual hypothesis function and modify the equation to:

repeat until convergence: {

θ0 := θ0 − α · (1/m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)

θ1 := θ1 − α · (1/m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x⁽ⁱ⁾

}

where m is the size of the training set, θ0 is a constant that will be changing simultaneously with θ1, and x⁽ⁱ⁾, y⁽ⁱ⁾ are values of the given training set (data).

  • Note that we have separated out the two cases for θj into separate equations for θ0 and θ1.


The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis will become more and more accurate.
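Putting the two update equations together, single-variable batch gradient descent might be sketched like this (NumPy; the data and hyperparameters are made up for illustration):

```python
import numpy as np

def gradient_descent(x, y, alpha, iterations):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = (theta0 + theta1 * x) - y
        # The two separated update equations, applied simultaneously.
        new_theta0 = theta0 - alpha * errors.sum() / m
        new_theta1 = theta1 - alpha * (errors * x).sum() / m
        theta0, theta1 = new_theta0, new_theta1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x                                            # noiseless line
print(gradient_descent(x, y, alpha=0.05, iterations=5000))   # ~(2.0, 3.0)
```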

Gradient Descent For Multiple Variables

The gradient descent equation itself is generally the same form; we just have to repeat it for our 'n' features:

repeat until convergence: {

θj := θj − α · (1/m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xj⁽ⁱ⁾    for j := 0…n

}

In other words:

repeat until convergence: {

θ0 := θ0 − α · (1/m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x0⁽ⁱ⁾

θ1 := θ1 − α · (1/m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x1⁽ⁱ⁾

θ2 := θ2 − α · (1/m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x2⁽ⁱ⁾

…

}
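Vectorized, the n-feature update collapses to one line per iteration. A sketch, assuming X already contains a leading column of ones and using synthetic data:

```python
import numpy as np

def gradient_descent_multi(X, y, alpha, iterations):
    """Batch gradient descent for any number of features."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        errors = X @ theta - y
        # Updates every theta_j at once: alpha * (1/m) * sum(errors * x_j).
        theta = theta - alpha * (X.T @ errors) / m
    return theta

X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 3.0]])
y = X @ np.array([1.0, 2.0, 0.5])   # synthetic targets with known theta
print(gradient_descent_multi(X, y, alpha=0.05, iterations=10000))  # ~[1. 2. 0.5]
```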

Gradient Descent - Feature Scaling

We can speed up gradient descent by having each of our input values in roughly the same range; θ descends quickly on small ranges and slowly on large ranges, so it oscillates inefficiently down to the optimum when the variables are very uneven. Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable, resulting in a new average value for the input variable of just zero. To implement both of these techniques, adjust your input values as shown in this formula:

xi := (xi − µi) / si

where µi is the average of all the values for feature i, and si is either the range of values (max − min) or the standard deviation.

Note that dividing by the range and dividing by the standard deviation give different results. The quizzes in this course use the range; the programming exercises use the standard deviation.
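A sketch of both techniques combined (NumPy; the house sizes and bedroom counts are made up):

```python
import numpy as np

def scale_features(X):
    """Mean normalization plus scaling: x_i := (x_i - mu_i) / s_i.

    Here s_i is the standard deviation; replace it with
    X.max(axis=0) - X.min(axis=0) to scale by the range instead.
    """
    mu = X.mean(axis=0)
    s = X.std(axis=0)
    return (X - mu) / s, mu, s

X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])
X_scaled, mu, s = scale_features(X)
print(X_scaled.mean(axis=0))  # ~[0. 0.]  (mean normalization)
print(X_scaled.std(axis=0))   # ~[1. 1.]  (scaling)
```

Keep mu and s around: any new example must be scaled with the same values before making a prediction.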

Debugging Gradient Descent

Make a plot with the number of iterations on the x-axis and the cost function J(θ) on the y-axis, evaluated over the iterations of gradient descent. If J(θ) ever increases, then you probably need to decrease α.

Automatic convergence test: declare convergence if J(θ) decreases by less than ε in one iteration, where ε is some small value such as 10⁻³. In practice, however, it is difficult to choose this threshold value (both checks are sketched after the notes below).

It has been proven that if the learning rate α is sufficiently small, then J(θ) will decrease on every iteration.

  • If α is too small: slow convergence.
  • If α is too large: may not decrease on every iteration and thus may not converge.

Polynomial Regression

We can combine multiple features into one. For example, we can combine x1 and x2 into a new feature x3 by taking x1∙x2.

We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).

Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth degree polynomial in x.

Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the hypothesis function is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.

Polynomial Regression

For example, if our hypothesis function is hθ(x) = θ0 + θ1x1, then we can create additional features based on x1 to get the quadratic function hθ(x) = θ0 + θ1x1 + θ2x1² or the cubic function hθ(x) = θ0 + θ1x1 + θ2x1² + θ3x1³.

In the cubic version, we have created new features x2 and x3, where x2 = x1² and x3 = x1³.

For example, if the price of an apartment depends non-linearly on its size, you might add several new size-related features:

hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 = θ0 + θ1(size) + θ2(size)² + θ3(size)³.

  • One important thing to keep in mind: if you choose your features this way, then feature scaling becomes very important.

For example, if x1 has range 1-1'000, then the range of x1² becomes 1-1'000'000 and that of x1³ becomes 1-1'000'000'000.
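A sketch of building such features (NumPy; the apartment sizes are made up):

```python
import numpy as np

size = np.array([50.0, 80.0, 120.0, 200.0])   # hypothetical apartment sizes

# New features: x1 = size, x2 = size^2, x3 = size^3.
X = np.column_stack([size, size**2, size**3])
print(X.max(axis=0) - X.min(axis=0))  # ranges differ by orders of magnitude

# Hence feature scaling is essential before running gradient descent here.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```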

Normal Equation

In the "Normal Equation" method, we will minimize J by explicitly taking its derivatives with respect to the 𝜽j ’s, and setting them to zero. This allows us to find the optimum theta without iteration.

  • The normal equation formula is given below:

θ = (XᵀX)⁻¹ Xᵀy

There is no need to do feature scaling with the normal equation.

The following is a comparison of gradient descent and the normal equation:

| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose α | No need to choose α |
| Needs many iterations | No need to iterate |
| O(kn²) | O(n³), need to calculate the inverse of XᵀX |
| Works well when n is large | Slow if n is very large |

With the normal equation, computing the inversion has complexity O(n³). So if we have a very large number of features, the normal equation will be slow. In practice, when n exceeds 10,000 it might be a good time to switch from the normal equation to an iterative process.
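A direct NumPy translation of the formula (a sketch with toy data; np.linalg.inv assumes XᵀX is invertible, see the next subsection):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^-1 X^T y -- closed form, no alpha, no iterations."""
    return np.linalg.inv(X.T @ X) @ X.T @ y

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])          # intercept column + one feature
y = np.array([5.0, 8.0, 11.0])      # generated from y = 2 + 3x
print(normal_equation(X, y))        # ~[2. 3.]
```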

Normal Equation Noninvertibility

When implementing the normal equation in Octave, use the pinv function rather than inv.

The pinv function will give you a value of θ even if XᵀX is not invertible (the sketch after the list below demonstrates the same behavior with NumPy).

If XᵀX is noninvertible, the common causes might be having:

  • Redundant features, where two features are very closely related (i.e. they are linearly dependent).

  • Too many features (e.g. m ≤ n). In this case, delete some features or use "regularization" (to be explained in a later lesson).

  • Solutions to the above problems include deleting a feature that is linearly dependent with another or deleting one or more features when there are too many features.
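To see why pinv is the safer choice, here is a sketch with a deliberately redundant feature (np.linalg.pinv playing the role of Octave's pinv):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones(4), x1, 2 * x1])  # third column duplicates x1
y = 5.0 + 3.0 * x1

# The redundant feature makes X^T X singular, so inv() raises an error:
# np.linalg.inv(X.T @ X)  # -> numpy.linalg.LinAlgError: Singular matrix

# The pseudoinverse still produces a usable theta:
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(X @ theta)  # reproduces y even though X^T X is noninvertible
```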

📚 References