Pooling

In addition to discrete convolutions themselves, {\em pooling\/} operations make up another important building block in CNNs. Pooling operations reduce the size of feature maps by using some function to summarize subregions, such as taking the average or the maximum value.

Pooling works by sliding a window across the input and feeding the content of the window to a {\em pooling function}. In some sense, pooling works very much like a discrete convolution, but replaces the linear combination described by the kernel with some other function. Average pooling and max pooling are illustrated in the accompanying figures.
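
To make the sliding-window picture concrete, here is a minimal NumPy sketch (not part of the original exposition; the name `pool2d` and its arguments are purely illustrative). The pooling function is treated as a parameter, so that average pooling and max pooling differ only in the function applied to each window:

```python
import numpy as np

def pool2d(x, k, s, pool_fn=np.max):
    """Slide a k x k window over the 2-D input x with stride s and apply
    pool_fn (e.g. np.max or np.mean) to the content of each window.
    Assumes no zero padding: only windows that fit entirely are kept."""
    i1, i2 = x.shape
    o1 = (i1 - k) // s + 1  # output size along axis 1
    o2 = (i2 - k) // s + 1  # output size along axis 2
    out = np.empty((o1, o2))
    for m in range(o1):
        for n in range(o2):
            window = x[m * s : m * s + k, n * s : n * s + k]
            out[m, n] = pool_fn(window)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, k=2, s=2, pool_fn=np.max))   # max pooling
print(pool2d(x, k=2, s=2, pool_fn=np.mean))  # average pooling
```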

The following properties affect the output size \( o_j \) of a pooling layer along axis \( j \) (a short numerical sketch follows the list):

  1. \( i_j \): input size along axis \( j \),
  2. \( k_j \): pooling window size along axis \( j \),
  3. \( s_j \): stride (distance between two consecutive positions of the pooling window) along axis \( j \).
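
As a quick numerical check, the sketch below assumes no zero padding and keeps only window positions that fit entirely inside the input, in which case the output size along an axis is \( o_j = \lfloor (i_j - k_j) / s_j \rfloor + 1 \). The helper name is hypothetical:

```python
def pooling_output_size(i, k, s):
    # Assumption: no zero padding, and only window positions that fit
    # entirely inside the input are kept.
    return (i - k) // s + 1

# A 5-unit axis, a 3-unit window and a stride of 2 give an output size of 2.
print(pooling_output_size(i=5, k=3, s=2))  # 2
```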

The analysis of the relationship between convolutional layer properties is eased by the fact that they don't interact across axes, i.e., the choice of kernel size, stride and zero padding along axis \( j \) only affects the output size along axis \( j \). Because of that, we will focus on the following simplified setting:

  1. 2-D discrete convolutions (\( N = 2 \)),
  2. square inputs (\( i_1 = i_2 = i \)),
  3. square kernel size (\( k_1 = k_2 = k \)),
  4. same strides along both axes (\( s_1 = s_2 = s \)),
  5. same zero padding along both axes (\( p_1 = p_2 = p \)).

This facilitates the analysis and the visualization, but keep in mind that the results outlined here also generalize to the N-D and non-square cases.
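
To illustrate this independence across axes, the following sketch (again assuming no zero padding, with hypothetical names) performs max pooling with different window sizes and strides per axis; the output size along each axis depends only on that axis's parameters:

```python
import numpy as np

def max_pool_axiswise(x, k, s):
    """Max pooling with per-axis window sizes k = (k1, k2) and strides
    s = (s1, s2); each axis's output size is computed independently."""
    (i1, i2), (k1, k2), (s1, s2) = x.shape, k, s
    o1 = (i1 - k1) // s1 + 1
    o2 = (i2 - k2) // s2 + 1
    out = np.empty((o1, o2))
    for m in range(o1):
        for n in range(o2):
            out[m, n] = x[m * s1 : m * s1 + k1, n * s2 : n * s2 + k2].max()
    return out

x = np.random.rand(6, 8)
# Axis 1: (6 - 2) // 2 + 1 = 3; axis 2: (8 - 3) // 1 + 1 = 6.
print(max_pool_axiswise(x, k=(2, 3), s=(2, 1)).shape)  # (3, 6)
```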