LSTM details

The first stage is called the forget gate: we combine the input at (say) time \( t \), \( \mathbf{x}^{(t)} \), with the hidden state from the previous step, \( \mathbf{h}^{(t-1)} \), pass the result through the sigmoid activation function, and then multiply the gate output element-wise, denoted by \( \odot \), with the previous cell state.

Mathematically, we have (see also the figure below)

$$ \mathbf{f}^{(t)} = \sigma(W_{fx}\mathbf{x}^{(t)} + W_{fh}\mathbf{h}^{(t-1)} + \mathbf{b}_f) $$

where the $W$ matrices and the bias vector \( \mathbf{b}_f \) are the parameters to be trained, and \( \sigma \) denotes the sigmoid function.
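
As a concrete illustration, here is a minimal NumPy sketch of the forget-gate computation. The variable names (`W_fx`, `W_fh`, `b_f`) and the dimensions are assumptions chosen to mirror the equation above, not part of the original text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: input dimension n_x, hidden dimension n_h
n_x, n_h = 4, 3
rng = np.random.default_rng(0)

W_fx = rng.standard_normal((n_h, n_x))   # input-to-forget-gate weights
W_fh = rng.standard_normal((n_h, n_h))   # hidden-to-forget-gate weights
b_f  = np.zeros(n_h)                     # forget-gate bias

x_t    = rng.standard_normal(n_x)        # input x^{(t)}
h_prev = rng.standard_normal(n_h)        # hidden state h^{(t-1)}
c_prev = rng.standard_normal(n_h)        # cell state c^{(t-1)}

# f^{(t)} = sigmoid(W_fx x^{(t)} + W_fh h^{(t-1)} + b_f)
f_t = sigmoid(W_fx @ x_t + W_fh @ h_prev + b_f)

# The gate is applied to the previous cell state element-wise (the ⊙ operation)
c_forgotten = f_t * c_prev
print(f_t, c_forgotten)
```

Because \( \mathbf{f}^{(t)} \) lies in \( (0, 1) \) element-wise, the multiplication scales each component of the previous cell state, so values near 0 "forget" that component and values near 1 keep it.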