Backpropagation Through Time (BPTT) and Gradients

  1. Training an RNN involves computing gradients through time by unfolding the network: the unrolled RNN is treated as a very deep feedforward net whose weights are shared across all time steps.
  2. We compute the loss \( L = \frac{1}{T}\sum_{t=1}^T \ell(y_t,\hat y_t) \) and backpropagate from \( t=T \) down to \( t=1 \).
  3. The computational graphs in the figures below show how each hidden state depends on inputs and parameters across time.
  4. BPTT applies the chain rule along this graph, accumulating the gradient from each time step into the shared parameters (see the sketch after this list).
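
To make the four steps concrete, here is a minimal sketch of BPTT for a vanilla tanh RNN in NumPy, assuming a squared-error per-step loss for \( \ell \); the parameter names (`Wxh`, `Whh`, `Why`, `bh`, `by`) and the toy dimensions are hypothetical, not taken from the text.

```python
import numpy as np

def bptt(xs, ys, Wxh, Whh, Why, bh, by):
    """Unroll the RNN over T steps, then backpropagate from t=T-1 down to t=0,
    accumulating gradients for the shared parameters at every step."""
    T = len(xs)
    hs = {-1: np.zeros((Whh.shape[0], 1))}  # initial hidden state h_{-1}
    loss = 0.0
    # Forward pass: unfold the recurrence h_t = tanh(Wxh x_t + Whh h_{t-1} + bh)
    # and accumulate the average squared-error loss L = (1/T) sum_t l(y_t, y_hat_t).
    for t in range(T):
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1] + bh)
        y_hat = Why @ hs[t] + by
        loss += 0.5 * np.sum((y_hat - ys[t]) ** 2) / T
    # Backward pass: apply the chain rule along the unrolled graph, summing
    # into the same dWxh/dWhh/dWhy because the parameters are tied across time.
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dh_next = np.zeros_like(hs[0])  # gradient flowing in from step t+1
    for t in reversed(range(T)):
        dy = (Why @ hs[t] + by - ys[t]) / T   # dL/dy_hat at step t
        dWhy += dy @ hs[t].T
        dby += dy
        dh = Why.T @ dy + dh_next             # local output path + recurrent path
        draw = (1.0 - hs[t] ** 2) * dh        # backprop through tanh
        dWxh += draw @ xs[t].T
        dWhh += draw @ hs[t - 1].T
        dbh += draw
        dh_next = Whh.T @ draw                # pass gradient back to step t-1
    return loss, (dWxh, dWhh, dWhy, dbh, dby)

# Hypothetical toy usage: D inputs, H hidden units, O outputs, T steps.
rng = np.random.default_rng(0)
D, H, O, T = 3, 5, 2, 4
params = [rng.standard_normal(s) * 0.1
          for s in [(H, D), (H, H), (O, H), (H, 1), (O, 1)]]
xs = [rng.standard_normal((D, 1)) for _ in range(T)]
ys = [rng.standard_normal((O, 1)) for _ in range(T)]
loss, grads = bptt(xs, ys, *params)
```

Note how `dh_next` carries the recurrent part of the chain rule from step \( t+1 \) back to step \( t \); truncating this backward loop to a fixed window yields truncated BPTT.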