01-24
- deep feedforward networks
- no feedback connections
- the final approximation f is a composition of the per-layer functions, e.g. f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x))) (sketch below)
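A minimal sketch of this composition, assuming numpy; the layer sizes, the ReLU hidden nonlinearity, and the `layer` helper are illustrative choices, not from the notes:

```python
import numpy as np

# Sketch: a feedforward net as function composition f(x) = f3(f2(f1(x))).
# Layer sizes and the ReLU nonlinearity are illustrative assumptions.
rng = np.random.default_rng(0)

def layer(x, W, b, activation=None):
    """One layer: affine transform followed by an optional nonlinearity."""
    z = W @ x + b
    return activation(z) if activation is not None else z

def relu(z):
    return np.maximum(0.0, z)

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # f1: R^3 -> R^4
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)   # f2: R^4 -> R^4
W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)   # f3: R^4 -> R (linear output)

x = rng.normal(size=3)
y_hat = layer(layer(layer(x, W1, b1, relu), W2, b2, relu), W3, b3)
```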
- neuron: perceptron
- weights (includes bias)
- activation function
- assuming the activation is linear, stacking layers collapses into a single linear layer (see the sketch after this list):
- a = w'^{T} z + b'
- a = w'^{T}(w^{T} x + b) + b'
- a = w'^{T} w^{T} x + w'^{T} b + b'
- can’t learn non-linearly separable functions (e.g. XOR)
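A small numeric check of this collapse, assuming numpy; the weight shapes and names (W, w2) are illustrative:

```python
import numpy as np

# Sketch: two stacked linear layers reduce to one equivalent linear map.
rng = np.random.default_rng(1)
W, b = rng.normal(size=(3, 2)), rng.normal(size=3)   # first (hidden) layer
w2, b2 = rng.normal(size=3), rng.normal()            # second (output) layer, playing the role of w'

x = rng.normal(size=2)
z = W @ x + b                        # hidden pre-activation, identity activation
a = w2 @ z + b2                      # a = w2^T W x + w2^T b + b2

W_eff, b_eff = w2 @ W, w2 @ b + b2   # the single equivalent linear layer
assert np.allclose(a, W_eff @ x + b_eff)
```

However many such layers are stacked, the output stays linear in x, which is why a nonlinear activation is needed for problems like XOR.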
- relu: rectified linear unit
- g(z) = max(0, z)
- the default choice for the vast majority of hidden units
- active (nonzero derivative) over half its domain, so derivatives can pass through
- pros:
- large and consistent gradients when active
- efficient to optimize
- cons:
- units can “die”: when inactive (z < 0) the gradient is 0, so they stop learning
- generalizations (e.g. leaky ReLU, maxout) keep a small nonzero gradient when inactive, so units can’t fully die (sketch below)
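A sketch of ReLU, its gradient, and a leaky variant, assuming numpy; the 0.01 slope and the function names are illustrative:

```python
import numpy as np

def relu(z):
    # g(z) = max(0, z)
    return np.maximum(0.0, z)

def relu_grad(z):
    # Gradient is 1 where the unit is active and 0 where it is "dead".
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    # Keeps a small nonzero slope for z < 0, so the unit can still learn.
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))   # [0. 0. 0. 1. 1.]
print(leaky_relu(z))  # [-0.02  -0.005  0.     0.5    2.   ]
```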
- sigmoid
- saturates across most of its domain, so gradient-based learning is hard
- prefer tanh over sigmoid in hidden layers
- sigmoid is mostly used in output units
- softmax: multi-class generalization of the sigmoid, used for categorical outputs
- tanh: zero-centered but also saturates (sketch below)
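A sketch of these saturating units, assuming numpy; subtracting the max in softmax is the usual numerical-stability trick, not something from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))    # saturates near 0 or 1 for large |z|
print(np.tanh(z))    # zero-centered, but saturates near -1 or 1
print(softmax(np.array([1.0, 2.0, 3.0])))  # ~[0.09, 0.24, 0.67], sums to 1
```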
- cost functions
- typically maximum likelihood, i.e. minimize the negative log-likelihood (cross-entropy between the training data and the model distribution)
- tightly coupled with the output activation function: the log in the cost undoes the exp in sigmoid/softmax, so gradients stay large even when the unit saturates (sketch below)
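A sketch of that coupling for a softmax output, assuming numpy; the `nll_from_logits` name and the example logits are illustrative:

```python
import numpy as np

# Negative log-likelihood of the target class, computed directly from logits.
# Using log-sum-exp lets the log in the cost undo the exp in softmax.
def nll_from_logits(logits, target):
    m = np.max(logits)
    log_normalizer = m + np.log(np.sum(np.exp(logits - m)))
    return log_normalizer - logits[target]   # = -log softmax(logits)[target]

logits = np.array([2.0, -1.0, 0.5])
print(nll_from_logits(logits, target=0))   # small loss: class 0 already has the largest logit
```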