Truncated BPTT and Gradient Clipping

  1. Truncated BPTT: Instead of backpropagating through all \( T \) steps, we may backpropagate through a fixed window of length \( \tau \). This approximates the full gradient and reduces computation.
  2. Concretely, one computes gradients only over the most recent \( \tau \) steps and treats gradient contributions from earlier steps as zero; in PyTorch this is typically done by detaching the carried hidden state at each window boundary. This still allows learning short-term patterns efficiently (see the training-loop sketch after this list).
  3. Gradient Clipping: Cap the gradient norm at a maximum value to prevent explosion. In PyTorch, torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) rescales the entire gradient vector whenever its global norm exceeds the cap, so that \( \|\nabla\| \le 1 \) (also shown in the sketch below).
  4. These techniques help stabilize training, but neither addresses the fundamental vanishing-gradient problem, which motivates using gated RNN cells (LSTM/GRU) in practice (see below).
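
The following is a minimal PyTorch sketch of a training loop that combines both techniques: the hidden state is detached every \( \tau \) steps (truncated BPTT), and gradients are clipped before each optimizer step. The model architecture, synthetic data, and hyperparameters (\( \tau = 20 \), max_norm=1.0, layer sizes) are illustrative assumptions, not values fixed by the text above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes for illustration only.
vocab_size, hidden_size, batch_size, seq_len, tau = 50, 32, 8, 200, 20

embed = nn.Embedding(vocab_size, hidden_size)
rnn = nn.RNN(hidden_size, hidden_size, batch_first=True)
head = nn.Linear(hidden_size, vocab_size)
params = list(embed.parameters()) + list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
criterion = nn.CrossEntropyLoss()

# Synthetic next-token prediction data (purely for illustration).
tokens = torch.randint(vocab_size, (batch_size, seq_len + 1))
inputs, targets = tokens[:, :-1], tokens[:, 1:]

h = torch.zeros(1, batch_size, hidden_size)  # initial hidden state

for start in range(0, seq_len, tau):
    x = inputs[:, start:start + tau]
    y = targets[:, start:start + tau]

    # Truncated BPTT: detach the carried hidden state so gradients
    # cannot flow back past the start of this tau-step window.
    h = h.detach()

    optimizer.zero_grad()
    out, h = rnn(embed(x), h)
    loss = criterion(head(out).reshape(-1, vocab_size), y.reshape(-1))
    loss.backward()

    # Gradient clipping: rescale all gradients so their global norm is <= 1.
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
    optimizer.step()
```

The detach call is what makes the truncation concrete: the hidden state's value still carries information across windows, but the computation graph is cut, so backpropagation stops at the window boundary.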