Personal/Andrew Ng Notes/VII Regularization
The problem of overfitting
The problem of underfitting and overfitting, using house prices as an example. Define the hypothesis in terms of the size of the house, x:
- theta_0 + theta_1 x: underfitting (high bias)
- theta_0 + theta_1 x + theta_2 x^2: fits just right
- theta_0 + theta_1 x + theta_2 x^2 + theta_3 x^3 + theta_4 x^4 + ...: overfitting (high variance)
- overfitting: If we have too many features, the learned hypothesis may fit the training set very well, but fail to generalize to new examples (e.g., predicting the price of a house not in the training set).
How to address overfitting?
- Reduce the number of features: manually select which features to keep. But throwing away a feature may also throw away some information.
- Regularization: keep all the features, but reduce the magnitude/values of the parameters theta_j. This works well when we have a lot of features, each of which contributes a little bit.
Penalize the weights (theta) for the high-order terms (x^3, x^4) of the features. For example:
- When defining the cost function, add 1000 * theta_3^2 + 1000 * theta_4^2 to it.
- Then minimizing the cost drives theta_3 and theta_4 close to zero, so the high-order terms contribute almost nothing.
- Small values for parameters
- Simpler hypothesis - quadratic function
- Less prone to overfitting.
- For the housing problem, suppose we have 100 features and 100 parameters.
- It is hard to pick in advance which features are less relevant, so penalize all the parameters.
- To the cost function, add the penalty term (lambda / (2*m)) * sum(theta .^ 2), summing over theta_1 ... theta_n (see the Octave sketch after this list).
- Larger parameter values get penalized more.
- We end up with small parameter values, and each feature still contributes a little to the hypothesis.
- Don't penalize the theta_0.
- lambda: regularization parameter. This trades off between "fitting well" and "keeping the parameters small".
- If lambda is too big, we underfit: most parameters are driven close to 0, and the hypothesis ends up nearly flat, close to h(x) = theta_0.
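As a concrete sketch of the regularized cost above, here is a minimal Octave snippet (the variable names X, y, theta, and lambda are assumed; X includes a leading column of ones):

```octave
% Regularized linear regression cost (sketch, assumed variable names)
m = length(y);
h = X * theta;                                       % predictions
J = sum((h - y) .^ 2) / (2 * m) ...                  % usual squared-error term
    + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);   % penalty, skipping theta_0
```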
Regularized linear regression
Updating theta (weight):
- The actual effect of introducing lambda is that, at every iteration, we first shrink theta by a fixed ratio:
- theta_j := theta_j * (1 - alpha * lambda / m) - (the usual gradient descent term)
- (1 - alpha * lambda / m) < 1, and alpha * lambda / m should be a small number.
- If alpha * lambda / m is 0.01, the update shrinks theta_j by 1% at every iteration.
- The rest of the update is the same as in plain gradient descent, and theta_0 is updated without the shrinking factor (see the Octave sketch below).
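A minimal Octave sketch of one regularized gradient descent step (variable names X, y, theta, alpha, and lambda are assumed; X includes a leading column of ones):

```octave
% One regularized gradient descent step (sketch, assumed variable names)
m = length(y);
h = X * theta;                        % linear regression predictions
grad = (X' * (h - y)) / m;            % unregularized gradient
reg = (lambda / m) * theta;           % regularization term
reg(1) = 0;                           % theta_0 (theta(1) in Octave) is not shrunk
theta = theta - alpha * (grad + reg);
```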
What about computing theta with the normal equation?
- Without regularization: theta = (X' X)^-1 X' y
- With regularization: theta = (X' X + lambda * L)^-1 X' y
- Here, L is like the identity matrix except that its (1, 1) entry is 0, so theta_0 is not penalized.
- With regularization (lambda > 0), X' X + lambda * L is invertible even if X' X itself is singular (e.g., when there are more features than examples). See the sketch below.
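A minimal Octave sketch of the regularized normal equation (variable names are assumed; X has a leading column of ones):

```octave
% Regularized normal equation (sketch, assumed variable names)
n = size(X, 2) - 1;       % number of features
L = eye(n + 1);           % identity matrix ...
L(1, 1) = 0;              % ... except the (1,1) entry, so theta_0 is not penalized
theta = pinv(X' * X + lambda * L) * (X' * y);
```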
Regularized logistic regression
With logistic regression
- the cost function adds (lambda / (2*m)) * sum(theta_j ^ 2), summing over j = 1..n (theta_0 is excluded)
With advanced optimization
- we define a cost function that returns both the cost and the gradient (see the sketch below)
- the cost changes to include the lambda penalty term
- gradient(1), for theta_0, does not change
- gradient(2), for theta_1: add lambda/m * theta_1
- gradient(3), for theta_2: add lambda/m * theta_2, and so on for the remaining entries
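A minimal Octave sketch of such a cost function for regularized logistic regression (the name costFunctionReg and the variable names are assumptions, not fixed by these notes):

```octave
% Regularized logistic regression cost and gradient (sketch, assumed names)
function [J, grad] = costFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));            % sigmoid hypothesis
  reg_theta = theta;
  reg_theta(1) = 0;                            % exclude theta_0 from the penalty
  J = (-y' * log(h) - (1 - y)' * log(1 - h)) / m ...
      + (lambda / (2 * m)) * sum(reg_theta .^ 2);
  grad = (X' * (h - y)) / m + (lambda / m) * reg_theta;
end
```

It could then be handed to an advanced optimizer, e.g. fminunc(@(t) costFunctionReg(t, X, y, lambda), initial_theta, options).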