The following code starts with a Gaussian distribution with mean value \( \mu = 100 \) and standard deviation \( \sigma = 15 \). We use this to generate the data used in the bootstrap analysis. The bootstrap analysis returns a data set after a given number of bootstrap operations (here as many as we have data points). This data set consists of the estimated mean value for each bootstrap sample. The histogram generated by the bootstrap method shows that the distribution of these mean values is also Gaussian, centered around the mean value \( \mu = 100 \) but with standard deviation \( \sigma/\sqrt{n} \), where \( n \) is the number of data points in each bootstrap sample (here equal to the number of original data points). This value of the standard deviation is what we expect from the central limit theorem.
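For the numbers used below (\( \sigma = 15 \) and \( n = 10000 \) data points) the central limit theorem therefore predicts a spread of the bootstrapped means of approximately
\[
\frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{10000}} = 0.15,
\]
which sets the scale of the histogram produced below.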
%matplotlib inline
from numpy import *
from numpy.random import randint, randn
from time import time
from scipy.stats import norm
import matplotlib.pyplot as plt
# Returns mean of bootstrap samples
def stat(data):
    return mean(data)
# Bootstrap algorithm
def bootstrap(data, statistic, R):
    t = zeros(R); n = len(data); t0 = time()
    # non-parametric bootstrap: resample the data with replacement R times
    for i in range(R):
        t[i] = statistic(data[randint(0, n, n)])
    # analysis
    print("Runtime: %g sec" % (time() - t0))
    print("Bootstrap Statistics :")
    print("original        std(data)       bootstrap mean  bootstrap std. error")
    print("%8g %16g %16g %20g" % (statistic(data), std(data), mean(t), std(t)))
    return t
mu, sigma = 100, 15
datapoints = 10000
x = mu + sigma*random.randn(datapoints)
# bootstrap returns the sample of estimated means t
t = bootstrap(x, stat, datapoints)
# the histogram of the bootstrapped data
n, binsboot, patches = plt.hist(t, bins=50, density=True, histtype='bar', color='red', alpha=0.75)
# add a 'best fit' line
y = norm.pdf( binsboot, mean(t), std(t))
lt = plt.plot(binsboot, y, 'r--', linewidth=1)
plt.xlabel('Bootstrap mean')
plt.ylabel('Probability density')
plt.axis([99.5, 100.6, 0, 3.0])
plt.grid(True)
plt.show()
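As a quick check (a minimal sketch, assuming the arrays x and t produced by the code above are still in scope), one can compare the bootstrap estimate of the standard error directly with the central limit theorem prediction \( \sigma/\sqrt{n} \):
# Compare the spread of the bootstrapped means with the
# central limit theorem prediction sigma/sqrt(n).
# Assumes x (original data) and t (bootstrapped means) from the code above.
n_data = len(x)
clt_prediction = std(x) / sqrt(n_data)   # sigma/sqrt(n) estimated from the data
print("Bootstrap std. error: %g" % std(t))
print("CLT prediction      : %g" % clt_prediction)
Both numbers should be close to the value 0.15 estimated above.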