Truncated BPTT and Gradient Clipping
- Truncated BPTT: Instead of backpropagating through all \( T \) time steps, we backpropagate through a fixed window of length \( \tau \). This approximates the full gradient and reduces computation and memory.
- Concretely, gradients are propagated back at most \( \tau \) steps and contributions from earlier steps are treated as zero; in practice this is done by detaching the hidden state at window boundaries. The resulting gradient is biased toward short-range dependencies, but short-term patterns are still learned efficiently (a sketch follows this list).
- Gradient Clipping: Cap the gradient norm at a maximum value to prevent explosion. For example, in PyTorch `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` rescales the gradients so that their global L2 norm satisfies \( \|\nabla\|\le 1 \) (see the second sketch below).
- These techniques help stabilize training, but the fundamental vanishing-gradient problem motivates using alternative RNN cells (LSTM/GRU) in practice (see below).
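A minimal PyTorch sketch of truncated BPTT, assuming a toy `nn.RNN` model and synthetic data; the names `rnn`, `head`, and `tau` are illustrative, not part of any fixed API. The key step is detaching the hidden state at each window boundary so that backpropagation stops after \( \tau \) steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, tau, batch, d_in, d_h = 200, 20, 8, 4, 16    # full length T, window length tau
x = torch.randn(T, batch, d_in)                 # synthetic inputs, shape (T, B, D)
y = torch.randn(T, batch, 1)                    # synthetic targets

rnn = nn.RNN(d_in, d_h)                         # vanilla RNN
head = nn.Linear(d_h, 1)                        # readout layer
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.1)

h = torch.zeros(1, batch, d_h)                  # initial hidden state
for start in range(0, T, tau):
    x_win = x[start:start + tau]                # window of at most tau steps
    y_win = y[start:start + tau]

    h = h.detach()                              # cut the graph: no gradient flows
                                                # past the window boundary
    out, h = rnn(x_win, h)                      # forward through this window only
    loss = nn.functional.mse_loss(head(out), y_win)

    opt.zero_grad()
    loss.backward()                             # BPTT restricted to tau steps
    opt.step()
```

The `detach()` call is what makes the truncation happen: the hidden state carries information forward across windows, but no gradient flows back through it.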
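A minimal sketch of gradient clipping inside a training step, assuming a stand-in model, optimizer, and loss; only `clip_grad_norm_` is the actual PyTorch call from the example above, and `max_norm=1.0` mirrors it.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                        # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)  # synthetic batch

opt.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()                                 # compute gradients first
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# clip_grad_norm_ rescales all gradients in place so their joint L2 norm is at
# most max_norm; it returns the norm measured before clipping.
opt.step()                                      # update uses the clipped gradients
```

Clipping is applied after `backward()` and before `step()`, so the optimizer only ever sees gradients whose global norm is bounded.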