# Personal/Andrew Ng Notes/VII Regularization

From Deep_Learning_Machine_Learning_and_Artificial_Intelligence


## The problem of overfitting

https://class.coursera.org/ml-003/lecture/39

The problem of underfitting and overfitting, illustrated with house-price prediction. Define the hypothesis in terms of the size of the house, x:

- w0 + w1x: underfitting (high bias)
- w0 + w1x + w2x^2: fits reasonably well
- w0 + w1x + w2x^2 + w3x^3 + w4x^4 + ...: overfitting (high variance)
- Overfitting: if we have too many features, the learned hypothesis may fit the training set very well but fail to generalize to new examples, i.e., to predict the price of a house it hasn't seen.

How to address overfitting?

- Try to reduce the number of features: manually select which features to keep. But throwing away features may also throw away useful information.
- Regularization: keep all the features, but reduce the magnitude of the parameters theta_j. This works well when we have a lot of features, each of which contributes a little bit to predicting y.
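A quick numpy sketch of the point above (the data here is made up for illustration): a higher-degree polynomial always fits the training set at least as well as a lower-degree one, which is exactly why low training error alone can't detect overfitting.

```python
import numpy as np

# Hypothetical "house size vs price" data, roughly concave like the lecture's sketch.
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 8)
y = 3 * np.sqrt(x) + rng.normal(0, 0.1, size=x.size)

def train_error(degree):
    """Fit a polynomial of the given degree and return the mean squared training error."""
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# The degree-4 model contains the degree-1 model as a special case,
# so its least-squares training error can only be lower or equal.
print(train_error(1), train_error(4))
```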

## Cost function

https://class.coursera.org/ml-003/lecture/40

Penalize the weights (theta) on the high-order terms (x^3, x^4). For example:

- When defining the cost function, add 1000 * theta_3^2 + 1000 * theta_4^2.
- Minimizing the cost then drives theta_3 and theta_4 close to 0, so the high-order terms contribute almost nothing.
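A minimal sketch of that modified cost in numpy (the feature matrix layout and the 1000 factors follow the lecture's illustrative example, not tuned values):

```python
import numpy as np

def penalized_cost(theta, X, y):
    """Ordinary least-squares cost plus large penalties on theta_3 and theta_4 only.

    Assumes X's columns are [1, x, x^2, x^3, x^4], so theta[3] and theta[4]
    are the weights on the high-order terms the lecture wants to suppress.
    """
    m = y.size
    base = np.sum((X @ theta - y) ** 2) / (2 * m)
    return base + 1000 * theta[3] ** 2 + 1000 * theta[4] ** 2
```

Any nonzero theta_3 or theta_4 now makes the cost huge, so the minimizer keeps them near zero.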

Regularization

- Small values for the parameters
- give a "simpler" hypothesis (e.g., closer to a quadratic function)
- which is less prone to overfitting.

- For the housing problem, suppose we have 100 features and 100 parameters.
- It is hard to pick out by hand which features are less relevant.
- Instead, add to the cost function: (lambda / (2m)) * sum(theta .^ 2)
- Larger parameter values are penalized more.
- We end up with small parameter values, and each feature contributes a little to the hypothesis.
- Don't penalize theta_0 (the sum runs over j = 1 to n).
- lambda is the regularization parameter: it trades off "fitting the training set well" against "keeping the parameters small".
- If lambda is too big, we underfit: most parameters are driven close to 0 and the hypothesis becomes nearly the constant theta_0.
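The regularized cost above, as a small numpy sketch (the example data in the usage note is hypothetical):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear-regression cost.

    J = (1 / 2m) * sum((X @ theta - y)^2) + (lam / 2m) * sum(theta_j^2 for j >= 1).
    theta[0] (the intercept) is deliberately left out of the penalty.
    """
    m = y.size
    residual = X @ theta - y
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)
    return np.sum(residual ** 2) / (2 * m) + penalty
```

With lam = 0 this reduces to the ordinary least-squares cost; increasing lam adds cost for any nonzero theta_j (j >= 1), pushing the minimizer toward smaller parameters.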

## Regularized linear regression

https://class.coursera.org/ml-003/lecture/41

Updating theta (the weights):

- The effect of introducing lambda is that, at every iteration, we shrink each theta_j by a fixed ratio before applying the usual gradient step:
- theta_j := theta_j * (1 - alpha * lambda / m) - (alpha / m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i)
- (1 - alpha * lambda / m) < 1, and alpha * lambda / m should be a small number.
- If alpha * lambda / m is 0.01, the update shrinks theta_j by 1% at every iteration.
- The rest of the update is the same as ordinary gradient descent; theta_0 is updated without the shrinkage factor, since it isn't penalized.
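A sketch of one such update step in numpy (vectorized over all parameters at once):

```python
import numpy as np

def gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent step for linear regression.

    Each theta_j (j >= 1) is first shrunk by the factor (1 - alpha * lam / m),
    then moved by the ordinary gradient term; theta[0] skips the shrinkage.
    """
    m = y.size
    grad = X.T @ (X @ theta - y) / m        # ordinary least-squares gradient
    shrink = np.full_like(theta, 1 - alpha * lam / m)
    shrink[0] = 1.0                         # don't penalize theta_0
    return theta * shrink - alpha * grad
```

With lam = 0 the shrink factor is 1 everywhere and this is plain gradient descent.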

What about updating theta with normal equation?

- Without regularization: theta = (X' X)^-1 X' y
- With regularization: theta = (X' X + lambda * L)^-1 X' y
- Here L is the identity matrix except that the entry at [1,1] (the one corresponding to theta_0) is 0.

For non-invertibility

- If X' X is singular (e.g., when m <= n or features are linearly dependent), regularization with lambda > 0 makes X' X + lambda * L invertible, so it also fixes the non-invertibility problem.
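The regularized normal equation as a numpy sketch (using `np.linalg.solve` rather than an explicit inverse, which is the usual numerically preferred form):

```python
import numpy as np

def normal_equation_regularized(X, y, lam):
    """Solve theta = (X'X + lambda * L)^-1 X'y.

    L is the identity matrix with its top-left entry zeroed, so the
    intercept theta_0 is not penalized.
    """
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```

With a duplicated feature column, X'X is singular and the unregularized solve fails, but any lam > 0 makes the system solvable.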

## Regularized logistic regression

https://class.coursera.org/ml-003/lecture/42

With logistic regression

- the cost function gains the term (lambda / (2m)) * sum_{j=1}^{n} theta_j^2

With advanced optimization

- we define a cost function that returns both the cost and the gradient
- the cost now includes the lambda term
- gradient(1), for theta_0, does not change
- gradient(2) gains the extra term + (lambda / m) * theta_1
- gradient(3) gains + (lambda / m) * theta_2, and so on (Octave indexing is 1-based, so gradient(j+1) corresponds to theta_j)
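The lecture's Octave routine, sketched in Python/numpy (0-based indexing here, so `grad[0]` plays the role of gradient(1)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient(theta, X, y, lam):
    """Regularized logistic-regression cost and gradient.

    Cost: cross-entropy plus (lam / 2m) * sum(theta_j^2 for j >= 1).
    Gradient: ordinary gradient, with + (lam / m) * theta_j added for j >= 1
    (the intercept entry grad[0] is unchanged, as in the lecture).
    """
    m = y.size
    h = sigmoid(X @ theta)
    reg = lam / (2 * m) * np.sum(theta[1:] ** 2)
    cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h)) + reg
    grad = X.T @ (h - y) / m
    grad[1:] += lam / m * theta[1:]
    return cost, grad
```

A pair like this is what an advanced optimizer (fminunc in the course's Octave setting, or `scipy.optimize.minimize` with `jac=True` in Python) expects to receive.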