It means that to represent the entire dataset of images, we require a 4D matrix or tensor. This tensor has the dimensions: (n_{inputs},\, n_{pixels, width},\, n_{pixels, height},\, depth) .