CNNs in more detail

Let assume we have an input matrix \( I \) of dimensionality \( 3\times 3 \) and a \( 2\times 2 \) filter \( W \) given by the following matrices

$$ \boldsymbol{I}=\begin{bmatrix}i_{00} & i_{01} & i_{02} \\ i_{10} & i_{11} & i_{12} \\ i_{20} & i_{21} & i_{22} \end{bmatrix}, $$

and

$$ \boldsymbol{W}=\begin{bmatrix}w_{00} & w_{01} \\ w_{10} & w_{11}\end{bmatrix}. $$

We introduce now the hyperparameter \( S \) stride. Stride represents how the filter \( W \) moves the convolution process on the matrix \( I \). We strongly recommend the repository on Arithmetic of deep learning by Dumoulin and Visin

Here we set the stride equal to \( S=1 \), which means that, starting with the element \( i_{00} \), the filter will act on \( 2\times 2 \) submatrices each time, starting with the upper corner and moving according to the stride value column by column.

Here we perform the operation

$$ S_(i,j)=(I * W)(i,j) = \sum_m\sum_n I(i-m,j-n)W(m,n), $$

and obtain

$$ \boldsymbol{S}=\begin{bmatrix}i_{00}w_{00}+i_{01}w_{01}+i_{10}w_{10}+i_{11}w_{11} & i_{01}w_{00}+i_{02}w_{01}+i_{11}w_{10}+i_{12}w_{11} \\ i_{10}w_{00}+i_{11}w_{01}+i_{20}w_{10}+i_{21}w_{11} & i_{11}w_{00}+i_{12}w_{01}+i_{21}w_{10}+i_{22}w_{11}\end{bmatrix}. $$

We can rewrite this operation in terms of a matrix-vector multiplication by defining a new vector where we flatten out the inputs as a vector \( \boldsymbol{I}' \) of length \( 9 \) and a matrix \( \boldsymbol{W}' \) with dimension \( 4\times 9 \) as

$$ \boldsymbol{I}'=\begin{bmatrix}i_{00} \\ i_{01} \\ i_{02} \\ i_{10} \\ i_{11} \\ i_{12} \\ i_{20} \\ i_{21} \\ i_{22} \end{bmatrix}, $$

and the new matrix

$$ \boldsymbol{W}'=\begin{bmatrix} w_{00} & w_{01} & 0 & w_{10} & w_{11} & 0 & 0 & 0 & 0 \\ 0 & w_{00} & w_{01} & 0 & w_{10} & w_{11} & 0 & 0 & 0 \\ 0 & 0 & 0 & w_{00} & w_{01} & 0 & w_{10} & w_{11} & 0 \\ 0 & 0 & 0 & 0 & w_{00} & w_{01} & 0 & w_{10} & w_{11}\end{bmatrix}. $$

We see easily that performing the matrix-vector multiplication \( \boldsymbol{W}'\boldsymbol{I}' \) is the same as the above convolution with stride \( S=1 \), that is

$$ S=(\boldsymbol{W}*\boldsymbol{I}), $$

is now given by \( \boldsymbol{W}'\boldsymbol{I}' \) which is a vector of length \( 4 \) instead of the originally resulting \( 2\times 2 \) output matrix.

The collection of kernels/filters \( W \) defining a discrete convolution has a shape corresponding to some permutation of \( (n, m, k_1, \ldots, k_N) \), where

$$ \begin{equation*} \begin{split} n &\equiv \text{number of output feature maps},\\ m &\equiv \text{number of input feature maps},\\ k_j &\equiv \text{kernel size along axis $j$}. \end{split} \end{equation*} $$

The following properties affect the output size \( o_j \) of a convolutional layer along axis \( j \):

  1. \( i_j \): input size along axis \( j \),
  2. \( k_j \): kernel/filter size along axis \( j \),
  3. stride (distance between two consecutive positions of the kernel/filter) along axis \( j \),
  4. zero padding (number of zeros concatenated at the beginning and at the end of an axis) along axis \( j \).

For instance, the above examples shows a \( 2\times 2 \) kernel/filter \( \boldsymbol{W} \) applied to a \( 3 \times 3 \) input padded with a \( 0 \times 0 \) border of zeros using \( 1 \times 1 \) strides.

Note that strides constitute a form of subsampling. As an alternative to being interpreted as a measure of how much the kernel/filter is translated, strides can also be viewed as how much of the output is retained. For instance, moving the kernel by hops of two is equivalent to moving the kernel by hops of one but retaining only odd output elements.