Backpropagation through time

We can think of the recurrent net as a layered, feed-forward net with shared weights (one layer per time step), and then train that feed-forward net with the constraint that the weight copies stay equal.
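To make the weight constraint concrete, here is a minimal sketch of how tied weights are trained in general: compute a separate gradient for each copy, add them together, and apply the same update to every copy. If the copies start equal, they stay equal. All names and numbers below are illustrative, not from the lecture.

```python
# Toy example of training two tied weights w1 == w2.
w1 = w2 = 0.5                  # the tied copies start out equal
dE_dw1, dE_dw2 = 0.3, -0.1     # per-copy gradients (made-up numbers)
g = dE_dw1 + dE_dw2            # combined gradient for the shared weight
lr = 0.01                      # learning rate (assumed)
w1 -= lr * g                   # identical updates keep the copies equal
w2 -= lr * g
assert w1 == w2
```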

We can also think of this training algorithm in the time domain:

  1. The forward pass builds up a stack of the activities of all the units at each time step.
  2. The backward pass peels activities off the stack to compute the error derivatives at each time step.
  3. After the backward pass, we add together the derivatives from all the different time steps for each weight (see the sketch after this list).
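The sketch below shows all three steps for a simple vanilla RNN: the forward pass pushes each time step's activities onto a stack, the backward pass pops them off, and the derivatives for each shared weight are summed across time. The specific names (W_xh, W_hh, W_hy), the tanh nonlinearity, and the squared-error loss are assumptions for illustration, not details from the lecture.

```python
import numpy as np

def bptt(xs, targets, W_xh, W_hh, W_hy, h0):
    """Backpropagation through time for a vanilla RNN.

    xs, targets: lists of input and target vectors, one per time step.
    Returns the total loss and the summed weight derivatives.
    """
    # 1. Forward pass: build up a stack of the activities at each time step.
    h = h0
    stack = []
    loss = 0.0
    for x, t in zip(xs, targets):
        h_prev = h
        h = np.tanh(W_xh @ x + W_hh @ h_prev)   # new hidden state
        y = W_hy @ h                            # output at this step
        loss += 0.5 * np.sum((y - t) ** 2)      # squared-error loss (assumed)
        stack.append((x, h_prev, h, y, t))

    # 2. Backward pass: peel activities off the stack to get the error
    #    derivatives at each time step.
    dW_xh = np.zeros_like(W_xh)
    dW_hh = np.zeros_like(W_hh)
    dW_hy = np.zeros_like(W_hy)
    dh_next = np.zeros_like(h0)                 # gradient arriving from step t+1
    while stack:
        x, h_prev, h, y, t = stack.pop()
        dy = y - t
        dW_hy += np.outer(dy, h)
        dh = W_hy.T @ dy + dh_next
        dz = (1.0 - h ** 2) * dh                # back through tanh
        # 3. Add together the derivatives at all times for each shared weight.
        dW_xh += np.outer(dz, x)
        dW_hh += np.outer(dz, h_prev)
        dh_next = W_hh.T @ dz                   # pass gradient to step t-1
    return loss, dW_xh, dW_hh, dW_hy
```

Because the same W_xh, W_hh, and W_hy are used at every time step, the `+=` accumulation in the backward loop is exactly the "add together the derivatives" step, and applying one update per weight keeps all the time-step copies tied.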