The MNIST dataset consists of grayscale images with a pixel size of 28\times 28 , meaning we require 28 \times 28 = 724 weights to each neuron in the first hidden layer.
If we were to analyze images of size 128\times 128 we would require 128 \times 128 = 16384 weights to each neuron. Even worse if we were dealing with color images, as most images are, we have an image matrix of size 128\times 128 for each color dimension (Red, Green, Blue), meaning 3 times the number of weights = 49152 are required for every single neuron in the first hidden layer.