Define model and architecture

Our simple feed-forward neural network will consist of an input layer, a single hidden layer and an output layer. The activation \( y \) of each neuron is computed as a weighted sum of its inputs, passed through an activation function. In the case of the simple perceptron model we have

$$ z = \sum_{i=1}^n w_i a_i ,$$

$$ y = f(z) ,$$

where \( f \) is the activation function, \( a_i \) is the input from neuron \( i \) in the preceding layer and \( w_i \) is the weight attached to input \( i \). The activations of the neurons in the input layer are just the features themselves (e.g. pixel values).
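
As a small sketch of this forward computation (NumPy is assumed and the names `neuron_output`, `weights`, `inputs` and `activation` are illustrative, not from any particular library), the activation of a single neuron could be computed like this:

```python
import numpy as np

def neuron_output(weights, inputs, activation):
    """Compute y = f(z) for a single neuron, where z is the
    weighted sum of the inputs from the preceding layer."""
    z = np.dot(weights, inputs)   # z = sum_i w_i * a_i
    return activation(z)

# Illustrative example: three inputs (e.g. pixel values) and their weights
a = np.array([0.2, 0.8, 0.5])
w = np.array([0.4, -0.1, 0.3])
# Identity activation used as a placeholder; concrete choices follow below
y = neuron_output(w, a, activation=lambda z: z)
```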

The simplest activation function for a neuron is the Heaviside function:

$$ f(z) = \begin{cases} 1, & z > 0\\ 0, & \text{otherwise} \end{cases} $$

A feed-forward neural network with this activation is known as a perceptron. For a binary classifier (i.e. two classes, such as 0 or 1, dog or not-dog) we can also use this activation in the output layer. The activation can be generalized to \( k \) classes (using e.g. the one-against-all strategy), and these architectures are called multiclass perceptrons.
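
A minimal sketch of the Heaviside activation used as the output of a binary classifier follows (the function names are illustrative, assuming NumPy):

```python
import numpy as np

def heaviside(z):
    """Heaviside step activation: 1 if z > 0, else 0."""
    return np.where(z > 0, 1, 0)

# A single output neuron as a binary classifier: the predicted class
# is 1 when the weighted sum of its inputs is positive, otherwise 0.
w = np.array([0.5, -0.2, 0.1])
a = np.array([1.0, 0.3, 0.7])
label = heaviside(np.dot(w, a))   # 0 or 1 (e.g. not-dog or dog)
```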

However, it is now common to use the terms Single Layer Perceptron (SLP, one hidden layer) and Multilayer Perceptron (MLP, two or more hidden layers) to refer to feed-forward neural networks with any activation function.

Typical choices for activation functions include the sigmoid function, hyperbolic tangent, and Rectified Linear Unit (ReLU). We will be using the sigmoid function \( \sigma(x) \):

$$ f(x) = \sigma(x) = \frac{1}{1 + e^{-x}} ,$$

which is inspired by probability theory (see logistic regression) and was most commonly used until about 2011. See the discussion below concerning other activation functions.
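
A short sketch of the sigmoid activation (assuming NumPy; the direct formula suffices for illustration, though numerically safer implementations such as scipy.special.expit exist):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid (logistic) activation: sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

# The sigmoid squashes any real-valued input into the interval (0, 1),
# which is why its output can be read as a probability (cf. logistic regression).
z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))   # approximately [0.119, 0.5, 0.881]
```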