Setting it up

It means that to represent the entire dataset of images, we require a 4D matrix or tensor. This tensor has the dimensions:

$$ (n_{inputs},\, n_{pixels, width},\, n_{pixels, height},\, depth) . $$