Personal/Studying Notes


5/23

XOR with neural network
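A minimal sketch of learning XOR with a tiny 2-4-1 network (my own NumPy example, not from the original note; sigmoid activations, squared-error loss, plain gradient descent):

  import numpy as np

  rng = np.random.default_rng(0)
  X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
  y = np.array([[0], [1], [1], [0]], dtype=float)

  W1 = rng.normal(size=(2, 4))      # input -> hidden weights
  b1 = np.zeros((1, 4))
  W2 = rng.normal(size=(4, 1))      # hidden -> output weights
  b2 = np.zeros((1, 1))

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  lr = 1.0
  for step in range(5000):
      # forward pass
      h = sigmoid(X @ W1 + b1)
      out = sigmoid(h @ W2 + b2)
      # backward pass for squared-error loss
      d_out = (out - y) * out * (1 - out)
      d_h = (d_out @ W2.T) * h * (1 - h)
      # gradient descent updates
      W2 -= lr * h.T @ d_out
      b2 -= lr * d_out.sum(axis=0, keepdims=True)
      W1 -= lr * X.T @ d_h
      b1 -= lr * d_h.sum(axis=0, keepdims=True)

  print(np.round(out, 2))           # should approach [[0], [1], [1], [0]]

The hidden layer is what makes XOR learnable: a single-layer perceptron cannot solve it because XOR is not linearly separable. If training gets stuck, a different seed or more hidden units usually helps.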


5/22

http://sebastianruder.com/optimizing-gradient-descent/

Gradient descent variations

  • batch: computes the gradient over the entire training set for each update. Slow, and doesn't work if the dataset does not fit in memory; can't incorporate new samples on the fly.
  • stochastic (SGD): computes the gradient for each training example. Updates fluctuate, which complicates convergence and may overshoot, but it can be used to learn online. Samples need to be shuffled each epoch.
  • mini-batch: computes the gradient over a mini-batch (around 50 samples or so). Samples need to be shuffled too. Commonly used for neural networks. (See the sketch after this list.)
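A toy sketch (my own example, using a least-squares loss) showing that the three variants differ only in how much data feeds each update; the batch and SGD loops are left as comments:

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 3))
  y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
  w = np.zeros(3)
  lr = 0.1

  def grad(w, Xb, yb):
      # gradient of mean squared error on the batch (Xb, yb)
      return 2.0 / len(yb) * Xb.T @ (Xb @ w - yb)

  for epoch in range(20):
      # batch: one update per epoch using the whole dataset
      # w -= lr * grad(w, X, y)

      # stochastic (SGD): shuffle, then one update per example
      # for i in rng.permutation(len(y)):
      #     w -= lr * grad(w, X[i:i+1], y[i:i+1])

      # mini-batch: shuffle, then one update per batch of ~50 examples
      idx = rng.permutation(len(y))
      for start in range(0, len(y), 50):
          b = idx[start:start + 50]
          w -= lr * grad(w, X[b], y[b])

  print(np.round(w, 2))             # close to the true weights [1.0, -2.0, 0.5]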

Optimization algorithms

  • momentum: accelerates when successive gradients point in the same direction; slows down when the direction reverses (see the sketch after this list).
  • Nesterov accelerated gradient (NAG): like momentum, but evaluates the gradient at the look-ahead position.
  • Adagrad: per-parameter learning rates; well suited for sparse data.
  • Adadelta: extension of Adagrad that restricts the accumulated gradient history to a window.
  • RMSprop: divides the learning rate by an exponentially decaying average of squared gradients; similar idea to Adadelta.
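A small sketch of the classical momentum update (my own example; the learning rate and gamma values are arbitrary):

  def momentum_step(w, v, grad, lr=0.01, gamma=0.9):
      # velocity accumulates when gradients keep the same sign,
      # and damps oscillation when they flip sign
      v = gamma * v + lr * grad
      return w - v, v

  # toy use: minimize f(w) = w^2 (gradient 2w) starting from w = 5.0
  w, v = 5.0, 0.0
  for _ in range(300):
      w, v = momentum_step(w, v, grad=2 * w)
  print(w)                          # w is now very close to 0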

5/22

Google TPU
