# Personal/Studying Notes

From Deep_Learning_Machine_Learning_and_Artificial_Intelligence

## 5/23

XOR with a neural network

- python : http://www.bogotobogo.com/python/python_Neural_Networks_Backpropagation_for_XOR_using_one_hidden_layer.php
- https://aimatters.wordpress.com/2016/01/11/solving-xor-with-a-neural-network-in-python/
- neuralpy: http://pythonhosted.org/neuralpy/gettingstarted.html
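
A minimal NumPy sketch of the idea (my own, not taken from the linked tutorials): a 2-4-1 sigmoid network trained on the four XOR patterns with plain backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

# the four XOR patterns and their targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 2 inputs -> 4 hidden units -> 1 output, all sigmoid
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))

lr = 1.0
for step in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)      # hidden activations, shape (4, 4)
    out = sigmoid(h @ W2 + b2)    # predictions, shape (4, 1)

    # backward pass: squared-error loss, sigmoid derivative s * (1 - s)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # full-batch gradient descent updates
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2))  # should approach [[0], [1], [1], [0]]
```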

## 5/22

http://sebastianruder.com/optimizing-gradient-descent/

Gradient descent variations

- batch: computes the gradient over the entire training set for each update. Slow, and infeasible if the dataset doesn't fit in memory; can't incorporate new samples on the fly.
- stochastic (SGD): computes the gradient for one training example at a time. Updates fluctuate, which complicates convergence and may overshoot, but it can be used to learn online. Shuffle the samples before each epoch.
- mini-batch: computes the gradient over a small batch (e.g. 50 samples). Also needs shuffling each epoch. The common choice for training neural networks (see the sketch after this list).
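
A minimal comparison sketch (mine, not from the article) of the three loop structures on a toy linear model; `evaluate_gradient` is a hypothetical helper for the mean-squared-error gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))   # toy inputs
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)
lr = 0.1

def evaluate_gradient(w, Xb, yb):
    """Gradient of mean squared error for the linear model yb ~ Xb @ w."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

# batch: one update per epoch, over the whole dataset
w = np.zeros(3)
for epoch in range(100):
    w -= lr * evaluate_gradient(w, X, y)

# stochastic (SGD): one update per example, shuffled each epoch
w = np.zeros(3)
for epoch in range(100):
    for i in rng.permutation(len(X)):
        w -= lr * evaluate_gradient(w, X[i:i+1], y[i:i+1])

# mini-batch: one update per batch of ~50, shuffled each epoch
w = np.zeros(3)
batch_size = 50
for epoch in range(100):
    order = rng.permutation(len(X))
    for s in range(0, len(X), batch_size):
        idx = order[s:s + batch_size]
        w -= lr * evaluate_gradient(w, X[idx], y[idx])

print(w)  # each variant should land near [2.0, -1.0, 0.5]
```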

Optimization algorithms

- momentum: accumulates a velocity term, so steps accelerate along directions where gradients consistently agree and slow down when the gradient reverses (sketch after this list).
- Nesterov accelerated gradient (NAG): like momentum, but evaluates the gradient at the look-ahead position the velocity is about to reach.
- Adagrad: adapts the learning rate per parameter; well-suited to sparse data.
- Adadelta: extension of Adagrad that restricts the accumulated squared gradients to a decaying window.
- RMSprop: divides the learning rate by an exponentially decaying average of squared gradients; developed independently around the same time as Adadelta.
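
A minimal sketch of the momentum update on a toy quadratic bowl; `grad` is a stand-in for any gradient function and the constants are typical defaults, not prescribed values.

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, gamma=0.9):
    """One momentum update: v <- gamma * v + lr * grad(w); w <- w - v."""
    v = gamma * v + lr * grad(w)
    return w - v, v

# toy quadratic f(w) = 0.5 * w @ A @ w with an ill-conditioned A;
# the velocity damps oscillation along the steep axis while building
# speed along the shallow one
A = np.diag([1.0, 50.0])
grad = lambda w: A @ w

w, v = np.array([5.0, 1.0]), np.zeros(2)
for _ in range(200):
    w, v = momentum_step(w, v, grad)
print(w)  # should be close to the minimum at [0, 0]
```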

## 5/22

Google TPU