Let assume we have an input matrix \( I \) of dimensionality \( 3\times 3 \) and a \( 2\times 2 \) filter \( W \) given by the following matrices
$$ \boldsymbol{I}=\begin{bmatrix}i_{00} & i_{01} & i_{02} \\ i_{10} & i_{11} & i_{12} \\ i_{20} & i_{21} & i_{22} \end{bmatrix}, $$and
$$ \boldsymbol{W}=\begin{bmatrix}w_{00} & w_{01} \\ w_{10} & w_{11}\end{bmatrix}. $$We introduce now the hyperparameter \( S \) stride. Stride represents how the filter \( W \) moves the convolution process on the matrix \( I \). We strongly recommend the repository on Arithmetic of deep learning by Dumoulin and Visin
Here we set the stride equal to \( S=1 \), which means that, starting with the element \( i_{00} \), the filter will act on \( 2\times 2 \) submatrices each time, starting with the upper corner and moving according to the stride value column by column.
Here we perform the operation
$$ S_(i,j)=(I * W)(i,j) = \sum_m\sum_n I(i-m,j-n)W(m,n), $$and obtain
$$ \boldsymbol{S}=\begin{bmatrix}i_{00}w_{00}+i_{01}w_{01}+i_{10}w_{10}+i_{11}w_{11} & i_{01}w_{00}+i_{02}w_{01}+i_{11}w_{10}+i_{12}w_{11} \\ i_{10}w_{00}+i_{11}w_{01}+i_{20}w_{10}+i_{21}w_{11} & i_{11}w_{00}+i_{12}w_{01}+i_{21}w_{10}+i_{22}w_{11}\end{bmatrix}. $$We can rewrite this operation in terms of a matrix-vector multiplication by defining a new vector where we flatten out the inputs as a vector \( \boldsymbol{I}' \) of length \( 9 \) and a matrix \( \boldsymbol{W}' \) with dimension \( 4\times 9 \) as
$$ \boldsymbol{I}'=\begin{bmatrix}i_{00} \\ i_{01} \\ i_{02} \\ i_{10} \\ i_{11} \\ i_{12} \\ i_{20} \\ i_{21} \\ i_{22} \end{bmatrix}, $$and the new matrix
$$ \boldsymbol{W}'=\begin{bmatrix} w_{00} & w_{01} & 0 & w_{10} & w_{11} & 0 & 0 & 0 & 0 \\ 0 & w_{00} & w_{01} & 0 & w_{10} & w_{11} & 0 & 0 & 0 \\ 0 & 0 & 0 & w_{00} & w_{01} & 0 & w_{10} & w_{11} & 0 \\ 0 & 0 & 0 & 0 & w_{00} & w_{01} & 0 & w_{10} & w_{11}\end{bmatrix}. $$We see easily that performing the matrix-vector multiplication \( \boldsymbol{W}'\boldsymbol{I}' \) is the same as the above convolution with stride \( S=1 \), that is
$$ S=(\boldsymbol{W}*\boldsymbol{I}), $$is now given by \( \boldsymbol{W}'\boldsymbol{I}' \) which is a vector of length \( 4 \) instead of the originally resulting \( 2\times 2 \) output matrix.
The collection of kernels/filters \( W \) defining a discrete convolution has a shape corresponding to some permutation of \( (n, m, k_1, \ldots, k_N) \), where
$$ \begin{equation*} \begin{split} n &\equiv \text{number of output feature maps},\\ m &\equiv \text{number of input feature maps},\\ k_j &\equiv \text{kernel size along axis $j$}. \end{split} \end{equation*} $$The following properties affect the output size \( o_j \) of a convolutional layer along axis \( j \):
For instance, the above examples shows a \( 2\times 2 \) kernel/filter \( \boldsymbol{W} \) applied to a \( 3 \times 3 \) input padded with a \( 0 \times 0 \) border of zeros using \( 1 \times 1 \) strides.
Note that strides constitute a form of subsampling. As an alternative to being interpreted as a measure of how much the kernel/filter is translated, strides can also be viewed as how much of the output is retained. For instance, moving the kernel by hops of two is equivalent to moving the kernel by hops of one but retaining only odd output elements.