The training

The parameters are trained through various gradient descent approximations, with the weights updated as

$$ w_{i}\leftarrow w_{i}- \eta \delta_i a_{i-1}, $$

and

$$ b_i \leftarrow b_i-\eta \delta_i, $$

where \( \eta \) is the learning rate.
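
As a minimal sketch of these two update rules in Python (assuming scalar weights and biases per layer, consistent with the simple model discussed here; the names `gd_update`, `delta_i`, and `a_prev` are illustrative, not from the source):

```python
def gd_update(w_i, b_i, delta_i, a_prev, eta):
    """One plain gradient-descent update of a single layer's parameters.

    delta_i is the layer's error from the back-propagation step and
    a_prev the activation a_{i-1} from the feed-forward step.
    """
    w_i = w_i - eta * delta_i * a_prev  # w_i <- w_i - eta * delta_i * a_{i-1}
    b_i = b_i - eta * delta_i           # b_i <- b_i - eta * delta_i
    return w_i, b_i
```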

One iteration consists of one feed-forward step followed by one back-propagation step, and each back-propagation step performs one update of the parameters \( \boldsymbol{\Theta} \).

For the first hidden layer of this simple model, \( a_{i-1}=a_0=x \), that is, the input itself; see the sketch below.
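
To make one full iteration concrete, here is a hedged sketch for a hypothetical network with a single scalar hidden node and a scalar output node, using sigmoid activations and a squared-error cost (the function names, the activation, and the cost choice are assumptions for illustration, not taken from the source):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical model for illustration: one input x, one scalar hidden
# node, one scalar output node, squared-error cost C = (a2 - y)^2 / 2.
def train_iteration(x, y, w1, b1, w2, b2, eta=0.1):
    # Feed-forward step: a_0 = x for the first hidden layer
    a0 = x
    a1 = sigmoid(w1 * a0 + b1)
    a2 = sigmoid(w2 * a1 + b2)

    # Back-propagation step: output error, then propagate backwards
    delta2 = (a2 - y) * a2 * (1.0 - a2)    # dC/dz2 for the squared-error cost
    delta1 = delta2 * w2 * a1 * (1.0 - a1)  # chain rule back to the hidden layer

    # One update of all the parameters Theta, following the rules above
    w2 -= eta * delta2 * a1  # w_i <- w_i - eta * delta_i * a_{i-1}
    b2 -= eta * delta2       # b_i <- b_i - eta * delta_i
    w1 -= eta * delta1 * a0
    b1 -= eta * delta1
    return w1, b1, w2, b2
```

Repeating `train_iteration` over the training data then gives the gradient descent training loop described above, with each call performing one feed-forward step, one back-propagation step, and one update of all the parameters.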