More on activation functions, output layers
In most cases you can use the ReLU activation function in the hidden
layers (or one of its variants).
It is a bit faster to compute than other activation functions, and
gradient descent generally does not get stuck, because ReLU does not
saturate for large positive inputs (unlike the sigmoid or tanh).
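To illustrate the saturation point, here is a minimal NumPy sketch (the
function names are my own, not from the text) comparing the gradients of
ReLU and the sigmoid: the sigmoid gradient shrinks toward zero for large
inputs, while ReLU keeps a constant gradient of 1 on the positive side.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # 1 for positive inputs, 0 otherwise: never saturates on the
    # positive side, so backpropagated gradients do not shrink there.
    return (z > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-5.0, -1.0, 0.5, 5.0])
print(relu_grad(z))     # [0. 0. 1. 1.]
print(sigmoid_grad(z))  # ~[0.0066 0.1966 0.2350 0.0066] -> saturates for large |z|
```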
For the output layer:
- For classification tasks where the classes are mutually exclusive, the softmax activation function is generally a good choice.
- For regression tasks, you can simply use no activation function at all (a linear output), so the network can predict any real value. Both cases are sketched below.
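As a rough sketch of how these choices look in code, here are two small
Keras models (the layer sizes, input shape, and class count are
placeholders, not values from the text): one classifier with ReLU hidden
layers and a softmax output, and one regression model whose output layer
has no activation.

```python
from tensorflow import keras

# Classification: ReLU hidden layers, softmax output (mutually exclusive classes).
clf = keras.Sequential([
    keras.Input(shape=(20,)),                      # 20 input features (placeholder)
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),  # one probability per class
])
clf.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# Regression: same hidden layers, but no activation on the output layer,
# so the network can output any real value.
reg = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),                         # no activation = linear output
])
reg.compile(loss="mse", optimizer="adam")
```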