As an example, let e = 0, 1, 2, 3, ⋯ denote the current epoch and let t0, t1 > 0 be two fixed numbers. Furthermore, let t = e⋅m + i, where m is the number of minibatches and i = 0, ⋯, m−1. Then the function γj(t; t0, t1) = t0/(t + t1) goes to zero as the number of epochs gets large. That is, we start with a step length γj(0; t0, t1) = t0/t1 which decays in time t.
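For instance, with t0 = 1 and t1 = 10 the schedule starts at γj(0) = 1/10 = 0.1; after t = 990 minibatch updates it has fallen to 1/(990+10) = 0.001, a factor of 100 smaller, and it keeps shrinking like 1/t thereafter.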
In this way we can fix the number of epochs, compute β, and evaluate the cost function at the end. Since the scheme is random by design, repeating the computation gives a different result each time. We then pick the final β that gives the lowest value of the cost function; a runnable sketch of this repeat-and-pick-best strategy follows after the skeleton below.
import numpy as np

def step_length(t, t0, t1):
    # Learning rate schedule gamma(t) = t0/(t + t1)
    return t0/(t + t1)

n = 100          # number of datapoints
M = 5            # size of each minibatch
m = int(n/M)     # number of minibatches
n_epochs = 500   # number of epochs
t0 = 1.0
t1 = 10

gamma_j = t0/t1  # initial step length gamma(0) = t0/t1
j = 0            # counts the total number of parameter updates
for epoch in range(n_epochs):
    for i in range(m):
        k = np.random.randint(m)  # pick the k-th minibatch at random
        # Compute the gradient using the data in minibatch B_k
        # Compute new suggestion for beta
        t = epoch*m + i           # t = e*m + i with e = 0, 1, 2, ...
        gamma_j = step_length(t, t0, t1)
        j += 1

print("gamma_j after %d epochs: %g" % (n_epochs, gamma_j))