Basic layout

We need to specify the initial activity state of all the hidden and output units

  1. We could just fix these initial states to have some default value like 0.5.
  2. But it is better to treat the initial states as learned parameters.
  3. We learn them in the same way as we learn the weights.
  4. Start off with an initial random guess for the initial states.
    1. At the end of each training sequence, backpropagate through time all the way to the initial states to get the gradient of the error function with respect to each initial state.
    2. Adjust the initial states by following the negative gradient (see the sketch after this list).
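
The sketch below is a minimal numpy illustration of this idea: the initial hidden state h0 is treated as just another learned parameter, the error is backpropagated through time all the way back to h0, and h0 is then nudged along the negative gradient. The network size, learning rate, and the absence of external inputs are arbitrary choices made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

n_h, T, lr = 3, 5, 0.1
W_hh = rng.normal(scale=0.5, size=(n_h, n_h))   # recurrent weights
w_out = rng.normal(size=n_h)                    # hidden-to-output weights
h0 = rng.normal(scale=0.1, size=n_h)            # initial state, treated as a parameter
target = 1.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass (no external input, to keep the sketch short).
hs = [h0]
for t in range(T):
    hs.append(sigmoid(W_hh @ hs[-1]))
y = w_out @ hs[-1]                              # output read off the final state

# Backward pass: carry dE/dh all the way back to the initial state.
dh = (y - target) * w_out                       # dE/dh_T for E = 0.5 * (y - target)**2
for t in range(T, 0, -1):
    dpre = dh * hs[t] * (1 - hs[t])             # through the logistic nonlinearity
    dh = W_hh.T @ dpre                          # dE/dh_{t-1}

# dh is now dE/dh0: adjust the initial state by following the negative gradient.
h0 = h0 - lr * dh
```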

We can specify inputs in several ways

  1. Specify the initial states of all the units.
  2. Specify the initial states of a subset of the units.
  3. Specify the states of the same subset of the units at every time step.

The third option, specifying the states of the same subset of the units at every time step, is the natural way to model most sequential data; a sketch follows.
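
A minimal sketch of that third option, assuming a simple logistic recurrent layer: the same subset of units receives external input at every time step. The names (rnn_forward, W_xh, and so on) are illustrative, not part of any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(x_seq, h0, W_hh, W_xh):
    """Run a logistic RNN forward, feeding input at every time step.

    x_seq : (T, n_in) array of external inputs, one vector per step
    h0    : (n_h,) initial hidden state (options 1 and 2 would set this instead)
    """
    h, states = h0, []
    for x_t in x_seq:
        # W_xh routes each input only to the units it connects to; units with
        # all-zero rows in W_xh receive no input and act as ordinary hidden units.
        h = sigmoid(W_hh @ h + W_xh @ x_t)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
states = rnn_forward(rng.normal(size=(10, 2)),   # 10 steps of 2-d input
                     np.zeros(4),                # a fixed initial state
                     rng.normal(size=(4, 4)),
                     rng.normal(size=(4, 2)))
```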

We can specify targets in several ways

  1. Specify the desired final activities of all the units.
  2. Specify the desired activities of all the units for the last few steps.
    1. This is good for learning attractors.
    2. It is easy to add in extra error derivatives as we backpropagate.
  3. Specify the desired activities of a subset of the units; the other units act as input or hidden units (a sketch follows this list).
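
As a rough illustration, the helper below places targets only on a chosen subset of units and only over the last few time steps; the function name and arguments are made up for this example. Specifying the desired final activities of all the units is just the special case where last_k is 1 and every unit is in target_units.

```python
import numpy as np

def sequence_error(states, targets, target_units, last_k):
    """Squared error placed only on `target_units` and only on the final
    `last_k` time steps; the remaining units act as input or hidden units.

    states  : (T, n_h) unit activities from the forward pass
    targets : (last_k, len(target_units)) desired activities
    """
    tail = states[-last_k:, target_units]   # the activities that have targets
    return 0.5 * np.sum((tail - targets) ** 2)
```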


Backpropagation through time

We can think of the recurrent net as a layered, feed-forward net with shared weights and then train the feed-forward net with weight constraints.
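
A toy numerical sketch of what the weight constraint means: the same scalar weight is used in two consecutive layers of the unrolled net, and keeping the two copies equal amounts to computing the error derivative through each copy and applying their sum as one shared update. The numbers are arbitrary.

```python
# Tiny unrolled net that uses the same weight twice: h = w * x, then y = w * h.
x, target, w, lr = 0.5, 1.0, 0.8, 0.1

h = w * x                        # first layer (time step 1)
y = w * h                        # second layer (time step 2) reuses the weight
dE_dy = y - target               # derivative of E = 0.5 * (y - target)**2

dE_dw_copy2 = dE_dy * h          # derivative through the second copy of w
dE_dw_copy1 = dE_dy * w * x      # derivative through the first copy of w
w = w - lr * (dE_dw_copy1 + dE_dw_copy2)   # one shared update keeps the copies equal
```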

We can also think of this training algorithm in the time domain:

  1. The forward pass builds up a stack of the activities of all the units at each time step.
  2. The backward pass peels activities off the stack to compute the error derivatives at each time step.
  3. After the backward pass, we add together the derivatives at all the different time steps for each weight (see the sketch after this list).
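
The sketch below puts the three steps together for a small logistic RNN with a squared-error target on the final step (all names and shapes are illustrative): the forward pass pushes each activity vector onto a stack, the backward pass pops them off to get the per-step derivatives, and the derivatives for each shared weight are added up over all time steps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bptt(x_seq, h0, W_hh, W_xh, w_out, target):
    """Backpropagation through time for a tiny logistic RNN."""
    # Forward pass: push the activity vector for every time step onto a stack.
    stack = [h0]
    for x_t in x_seq:
        stack.append(sigmoid(W_hh @ stack[-1] + W_xh @ x_t))
    y = w_out @ stack[-1]

    # Backward pass: pop activities off the stack to get the derivatives at
    # each time step, and accumulate dE/dW over all steps for each weight.
    dW_hh = np.zeros_like(W_hh)
    dW_xh = np.zeros_like(W_xh)
    dh = (y - target) * w_out                  # dE/dh_T
    for t in range(len(x_seq), 0, -1):
        h_t, h_prev = stack[t], stack[t - 1]
        dpre = dh * h_t * (1 - h_t)            # through the logistic unit
        dW_hh += np.outer(dpre, h_prev)        # same weight at many time steps:
        dW_xh += np.outer(dpre, x_seq[t - 1])  # add the derivatives together
        dh = W_hh.T @ dpre                     # pass the derivative back in time
    return dW_hh, dW_xh
```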

The backward pass is linear

  1. There is a big difference between the forward and backward passes.
  2. In the forward pass we use squashing functions (like the logistic) to prevent the activity vectors from exploding.
  3. The backward pass, in contrast, is completely linear: if you double the error derivatives at the final layer, all the error derivatives will double.

The forward pass determines the slope of the linear function used for backpropagating through each neuron.
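
A tiny numerical check of both points, using made-up weights and activities: once the forward pass has fixed each logistic unit's slope y(1 - y), one step of backpropagation is a linear map of the incoming derivatives, so doubling them doubles the result.

```python
import numpy as np

# Once the forward pass is done, each logistic unit contributes a fixed
# slope y * (1 - y); backpropagation then just multiplies by these slopes
# and by the (fixed) weights, so it is linear in the incoming derivatives.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 4))
h = 1.0 / (1.0 + np.exp(-rng.normal(size=4)))   # activities from a forward pass

def backprop_one_step(d_top):
    slope = h * (1 - h)                          # determined by the forward pass
    return W.T @ (d_top * slope)

d = rng.normal(size=4)
print(np.allclose(backprop_one_step(2 * d), 2 * backprop_one_step(d)))  # True
```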