Personal/Andrew Ng Notes/VII Regularization

The problem of overfitting

The problem of underfitting and overfitting, illustrated with house price prediction. Define the hypothesis in terms of the size of the house, x.

  • theta_0 + theta_1 x: underfitting: high bias
  • theta_0 + theta_1 x + theta_2 x^2: just right
  • theta_0 + theta_1 x + theta_2 x^2 + theta_3 x^3 + theta_4 x^4 + ...: overfitting: high variance (see the sketch below)
  • Overfitting: if we have too many features, the learned hypothesis may fit the training set very well but fail to generalize to new examples, e.g. predicting the price of a house it has not seen before.
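A minimal Octave sketch of this idea, using made-up (size, price) numbers and the built-in polyfit/polyval instead of the course's gradient descent: higher-degree polynomials drive the training error toward zero while typically generalizing worse.

  % Toy data (assumed): house size in 1000s of sq ft, price in $1000s.
  sizes  = [1.0 1.5 2.0 2.5 3.0];
  prices = [200 280 330 355 370];

  for degree = [1 2 4]
    p   = polyfit(sizes, prices, degree);              % least-squares polynomial fit
    err = sum((polyval(p, sizes) - prices) .^ 2);      % training squared error
    printf("degree %d: training squared error = %.4f\n", degree, err);
  end
  % Degree 4 can pass through all 5 points (training error ~0), but such a
  % wiggly curve usually predicts poorly for house sizes it has not seen.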

How to address overfitting?

  • Reduce the number of features: manually select the important features to keep. But throwing away features may also throw away useful information.
  • Regularization: keep all the features, but reduce the magnitude/values of the parameters theta_j. This works well when we have a lot of features, each of which contributes a little bit.

Cost function

Penalize the weights (theta) on the high-order features (x^3, x^4). For example:

  • When defining the cost function, add 1000 * theta_3 ^ 2 + 1000 * theta_4 ^ 2.
  • Minimizing the cost then forces theta_3 and theta_4 to be close to 0, so the high-order terms contribute almost nothing (see the sketch below).
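A sketch of this modified cost in Octave, assuming a degree-4 polynomial model and made-up data; because Octave is 1-indexed, theta_0 sits in theta(1), so theta_3 and theta_4 live in theta(4) and theta(5).

  % Toy data (assumed): design matrix for theta_0 + theta_1 x + ... + theta_4 x^4.
  x = [1.0 1.5 2.0 2.5 3.0]';
  y = [200 280 330 355 370]';
  m = length(y);
  X = [ones(m, 1), x, x .^ 2, x .^ 3, x .^ 4];
  theta = [100; 90; 10; 5; 5];           % some candidate parameter values

  % Usual squared-error cost plus the two large penalty terms from the notes.
  J = (1 / (2 * m)) * sum((X * theta - y) .^ 2) ...
      + 1000 * theta(4) ^ 2 + 1000 * theta(5) ^ 2;
  printf("penalized cost J = %.2f\n", J);
  % Any minimizer of this J has to keep theta(4) and theta(5) very close to 0.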


  • Small values for the parameters
    • Give a "simpler" hypothesis (in the example above, effectively a quadratic function)
    • Less prone to overfitting.
  • For the housing problem, suppose we have 100 features and parameters theta_0 ... theta_100
    • We don't know in advance which features are the less relevant ones to penalize.
    • So penalize all of them: to the cost function, add lambda / (2m) * sum(theta_j ^ 2), summing over j = 1 ... n (see the sketch below).
    • Large parameter values get penalized.
    • We get small parameter values, and each feature still contributes a little to the hypothesis.
    • Don't penalize theta_0.
    • lambda: the regularization parameter. It trades off between "fitting the training set well" and "keeping the parameters small".
    • If lambda is too big, we underfit: most parameters get close to 0 and the hypothesis is roughly the constant theta_0.
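A minimal Octave sketch of this regularized linear regression cost with made-up numbers; X is assumed to already contain the bias column of ones, so theta(1) is theta_0 and is left out of the penalty.

  % Toy data (assumed).
  X = [ones(4, 1), (1:4)'];              % bias column plus one feature
  y = [2; 4; 6; 8];
  theta  = [0.5; 1.8];
  lambda = 1;
  m = length(y);

  err = X * theta - y;
  J = (1 / (2 * m)) * (sum(err .^ 2) + lambda * sum(theta(2:end) .^ 2));
  printf("regularized cost J = %.4f\n", J);
  % theta(2:end) skips theta_0, which is not penalized.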

Regularized linear regression

Updating theta (the weights) with gradient descent:

  • The actual effect of introducing lambda is that, at every iteration, we shrink each theta_j (j >= 1) by a fixed ratio before applying the usual gradient step.
  • theta_j := theta_j * (1 - alpha * lambda / m) - (the usual gradient descent term)
  • (1 - alpha * lambda / m) < 1.
  • alpha * lambda / m should be a small number.
  • If alpha * lambda / m is 0.01, that means the shrinkage reduces theta_j by 1% at every iteration.
  • The other part of the update is the same as plain gradient descent, and theta_0 gets no shrinkage (see the sketch below).
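A sketch of the full update loop in Octave with toy data; the values of alpha, lambda and the data are arbitrary, and the shrinkage factor (1 - alpha * lambda / m) is applied to theta(2:end) only.

  % Toy data (assumed): y = 1 + 2x.
  X = [ones(5, 1), (1:5)'];              % design matrix with bias column
  y = [3; 5; 7; 9; 11];
  theta  = zeros(2, 1);
  alpha  = 0.05;
  lambda = 0.1;
  m = length(y);

  for iter = 1:1500
    grad = (1 / m) * (X' * (X * theta - y));    % gradient of the unregularized cost
    shrink = ones(size(theta));
    shrink(2:end) = 1 - alpha * lambda / m;     % shrink every theta_j except theta_0
    theta = theta .* shrink - alpha * grad;
  end
  printf("theta after gradient descent: [%.3f, %.3f]\n", theta(1), theta(2));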

What about updating theta with normal equation?

  • With the normal equation: theta = (X' X)^-1 X' y
  • With regularization: theta = (X' X + lambda * L)^-1 X' y
  • Here the special matrix L is like the identity matrix except that its [1,1] entry is 0, so theta_0 is not regularized.

For non-invertible X' X

  • With the regularization term (lambda > 0), X' X + lambda * L is invertible even when X' X itself is singular, e.g. when there are fewer examples than features or when features are redundant. See the sketch below.
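A sketch of the regularized normal equation in Octave; the "special matrix" is written here as L, and the repeated feature column is made up purely to make X' X singular.

  % Toy data (assumed): the duplicated feature column makes X' X singular.
  x = (1:6)';
  X = [ones(6, 1), x, x];
  y = 2 + 3 * x;
  lambda = 1;

  n = size(X, 2);
  L = eye(n);
  L(1, 1) = 0;                           % [1,1] entry is 0: theta_0 is not regularized

  theta = (X' * X + lambda * L) \ (X' * y);      % solve instead of forming the inverse
  printf("theta = [%s ]\n", sprintf(" %.3f", theta));
  % rank(X' * X) is 2 < 3 here, yet X' * X + lambda * L is invertible.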

Regularized logistic regression

With logistic regression

  • The cost function adds lambda / (2m) * sum(theta_j ^ 2), summing over j = 1 ... n (theta_0 excluded), on top of the usual logistic regression cost. See the sketch below.
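A minimal Octave sketch of this regularized logistic cost on made-up binary data; X again carries a bias column and theta(1) is excluded from the penalty.

  % Toy data (assumed).
  X = [ones(4, 1), [1; 2; 3; 4]];
  y = [0; 0; 1; 1];
  theta  = [-3; 1.2];
  lambda = 1;
  m = length(y);

  h = 1 ./ (1 + exp(-X * theta));        % sigmoid hypothesis
  J = (1 / m) * sum(-y .* log(h) - (1 - y) .* log(1 - h)) ...
      + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);
  printf("regularized logistic cost J = %.4f\n", J);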

With advanced optimization

  • We define a cost function that returns both the cost and the gradient.
  • The cost changes to include the lambda term.
  • gradient(1), for theta_0, does not change.
  • gradient(2) gets an extra lambda/m * theta_1 term.
  • gradient(3) gets an extra lambda/m * theta_2 term, and so on for the remaining parameters (see the sketch below).
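A sketch of such a cost function in Octave, with an assumed name costFunctionReg and toy data; it returns both the regularized cost and the gradient, adds lambda/m * theta_j only for j >= 1, and is passed to fminunc in the advanced-optimization style used in the course.

  1;    % mark this file as a script so the function below can live in it

  function [J, grad] = costFunctionReg(theta, X, y, lambda)
    % Regularized logistic regression cost and gradient.
    m = length(y);
    h = 1 ./ (1 + exp(-X * theta));                      % sigmoid hypothesis

    J = (1 / m) * sum(-y .* log(h) - (1 - y) .* log(1 - h)) ...
        + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);

    grad = (1 / m) * (X' * (h - y));                     % grad(1), for theta_0, stays as-is
    grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);
  end

  % Toy usage (assumed data).
  X = [ones(4, 1), [1; 2; 3; 4]];
  y = [0; 0; 1; 1];
  lambda  = 1;
  options = optimset('GradObj', 'on', 'MaxIter', 100);
  theta   = fminunc(@(t) costFunctionReg(t, X, y, lambda), zeros(2, 1), options);
  printf("theta found by fminunc: [%s ]\n", sprintf(" %.3f", theta));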