01-24
- deep feedforward networks
- no feedback connections
- the final approximation f is a composition of the per-layer functions, e.g. f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x))) (sketch below)
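A minimal sketch of this composition, assuming numpy; the layer sizes, the ReLU hidden nonlinearity, and the `layer` helper are illustrative choices, not from the notes:

```python
import numpy as np

# Sketch: a feedforward net as function composition f(x) = f3(f2(f1(x))).
# Layer sizes and the ReLU nonlinearity are illustrative assumptions.
rng = np.random.default_rng(0)

def layer(x, W, b, activation=None):
    """One layer: affine transform followed by an optional nonlinearity."""
    z = W @ x + b
    return activation(z) if activation is not None else z

def relu(z):
    return np.maximum(0.0, z)

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # f1: R^3 -> R^4
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)   # f2: R^4 -> R^4
W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)   # f3: R^4 -> R (linear output)

x = rng.normal(size=3)
y_hat = layer(layer(layer(x, W1, b1, relu), W2, b2, relu), W3, b3)
```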
- neuron: perceptron
- weights (includes bias)
- activation function
- assuming the activation is linear, stacking layers collapses into a single linear layer (see the sketch after this list):
- a = w'^{T} z + b'
- a = w'^{T}(w^{T} x + b) + b'
- a = w'^{T} w^{T} x + w'^{T} b + b'
- can’t learn non-linearly separable functions (e.g. XOR)
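A small numeric check of this collapse, assuming numpy; the weight shapes and names (W, w2) are illustrative:

```python
import numpy as np

# Sketch: two stacked linear layers reduce to one equivalent linear map.
rng = np.random.default_rng(1)
W, b = rng.normal(size=(3, 2)), rng.normal(size=3)   # first (hidden) layer
w2, b2 = rng.normal(size=3), rng.normal()            # second (output) layer, playing the role of w'

x = rng.normal(size=2)
z = W @ x + b                        # hidden pre-activation, identity activation
a = w2 @ z + b2                      # a = w2^T W x + w2^T b + b2

W_eff, b_eff = w2 @ W, w2 @ b + b2   # the single equivalent linear layer
assert np.allclose(a, W_eff @ x + b_eff)
```

However many such layers are stacked, the output stays linear in x, which is why a nonlinear activation is needed for problems like XOR.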
- relu: rectified linear unit
- g(z) = max(0, z)
- the default choice for the vast majority of hidden units
- active (nonzero derivative) over half its domain, so derivatives can pass through
- pros:
- large and consistent gradients when active
- efficient to optimize
- cons:
- units can “die”: when inactive (z < 0) the gradient is 0, so they stop learning
- generalizations (e.g. leaky ReLU, maxout) keep a small nonzero gradient when inactive, so units can’t fully die (sketch below)
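A sketch of ReLU, its gradient, and a leaky variant, assuming numpy; the 0.01 slope and the function names are illustrative:

```python
import numpy as np

def relu(z):
    # g(z) = max(0, z)
    return np.maximum(0.0, z)

def relu_grad(z):
    # Gradient is 1 where the unit is active and 0 where it is "dead".
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    # Keeps a small nonzero slope for z < 0, so the unit can still learn.
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))   # [0. 0. 0. 1. 1.]
print(leaky_relu(z))  # [-0.02  -0.005  0.     0.5    2.   ]
```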
- sigmoid
- saturates across most of its domain, so gradient-based learning is hard
- prefer tanh over sigmoid in hidden layers
- sigmoid is mostly used in output units
- softmax: multi-class generalization of the sigmoid, used for categorical outputs
- tanh: zero-centered but also saturates (sketch below)
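A sketch of these saturating units, assuming numpy; subtracting the max in softmax is the usual numerical-stability trick, not something from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))    # saturates near 0 or 1 for large |z|
print(np.tanh(z))    # zero-centered, but saturates near -1 or 1
print(softmax(np.array([1.0, 2.0, 3.0])))  # ~[0.09, 0.24, 0.67], sums to 1
```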
- cost functions
- typically maximum likelihood, i.e. minimize the negative log-likelihood (cross-entropy between the training data and the model distribution)
- tightly coupled with the output activation function: the log in the cost undoes the exp in sigmoid/softmax, so gradients stay large even when the unit saturates (sketch below)
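A sketch of that coupling for a softmax output, assuming numpy; the `nll_from_logits` name and the example logits are illustrative:

```python
import numpy as np

# Negative log-likelihood of the target class, computed directly from logits.
# Using log-sum-exp lets the log in the cost undo the exp in softmax.
def nll_from_logits(logits, target):
    m = np.max(logits)
    log_normalizer = m + np.log(np.sum(np.exp(logits - m)))
    return log_normalizer - logits[target]   # = -log softmax(logits)[target]

logits = np.array([2.0, -1.0, 0.5])
print(nll_from_logits(logits, target=0))   # small loss: class 0 already has the largest logit
```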