CNNs in more detail, simple example

Let assume we have an input matrix \( X \) of dimensionality \( 3\times 3 \) and a \( 2\times 2 \) filter \( W \) given by the following matrices

$$ \boldsymbol{X}=\begin{bmatrix}x_{00} & x_{01} & x_{02} \\ x_{10} & x_{11} & x_{12} \\ x_{20} & x_{21} & x_{22} \end{bmatrix}, $$

and

$$ \boldsymbol{W}=\begin{bmatrix}w_{00} & w_{01} \\ w_{10} & w_{11}\end{bmatrix}. $$

We introduce now the hyperparameter \( S \) stride. Stride represents how the filter \( W \) moves the convolution process on the matrix \( X \). We strongly recommend the repository on Arithmetic of deep learning by Dumoulin and Visin

Here we set the stride equal to \( S=1 \), which means that, starting with the element \( x_{00} \), the filter will act on \( 2\times 2 \) submatrices each time, starting with the upper corner and moving according to the stride value column by column.

Here we perform the operation

$$ Y_(i,j)=(X * W)(i,j) = \sum_m\sum_n X(i-m,j-n)W(m,n), $$

and obtain

$$ \boldsymbol{Y}=\begin{bmatrix}x_{00}w_{00}+x_{01}w_{01}+x_{10}w_{10}+x_{11}w_{11} & x_{01}w_{00}+x_{02}w_{01}+x_{11}w_{10}+x_{12}w_{11} \\ x_{10}w_{00}+x_{11}w_{01}+x_{20}w_{10}+x_{21}w_{11} & x_{11}w_{00}+x_{12}w_{01}+x_{21}w_{10}+x_{22}w_{11}\end{bmatrix}. $$

We can rewrite this operation in terms of a matrix-vector multiplication by defining a new vector where we flatten out the inputs as a vector \( \boldsymbol{X}' \) of length \( 9 \) and a matrix \( \boldsymbol{W}' \) with dimension \( 4\times 9 \) as

$$ \boldsymbol{X}'=\begin{bmatrix}x_{00} \\ x_{01} \\ x_{02} \\ x_{10} \\ x_{11} \\ x_{12} \\ x_{20} \\ x_{21} \\ x_{22} \end{bmatrix}, $$

and the new matrix

$$ \boldsymbol{W}'=\begin{bmatrix} w_{00} & w_{01} & 0 & w_{10} & w_{11} & 0 & 0 & 0 & 0 \\ 0 & w_{00} & w_{01} & 0 & w_{10} & w_{11} & 0 & 0 & 0 \\ 0 & 0 & 0 & w_{00} & w_{01} & 0 & w_{10} & w_{11} & 0 \\ 0 & 0 & 0 & 0 & w_{00} & w_{01} & 0 & w_{10} & w_{11}\end{bmatrix}. $$

We see easily that performing the matrix-vector multiplication \( \boldsymbol{W}'\boldsymbol{X}' \) is the same as the above convolution with stride \( S=1 \), that is

$$ Y=(\boldsymbol{W}*\boldsymbol{X}), $$

is now given by \( \boldsymbol{W}'\boldsymbol{X}' \) which is a vector of length \( 4 \) instead of the originally resulting \( 2\times 2 \) output matrix.