Time decay rate

As an example, let

$e = 0,1,2,3,\cdots$ denote the current epoch and let

$t_0, t_1 > 0$ be two fixed numbers. Furthermore, let

$t = e \cdot m + i$ where

$m$ is the number of minibatches and

$i=0,\cdots,m-1$ . Then the function

$\gamma_j(t; t_0, t_1) = \frac{t_0}{t+t_1}$ goes to zero as the number of epochs gets large. I.e. we start with a step length

$\gamma_j (0; t_0, t_1) = t_0/t_1$ which decays in time

$t$ .

In this way we can fix the number of epochs, compute

$\beta$ and evaluate the cost function at the end. Repeating the computation will give a different result since the scheme is random by design. Then we pick the final

$\beta$ that gives the lowest value of the cost function.

import numpy as np 

def step_length(t,t0,t1):
    return t0/(t+t1)

n = 100 #100 datapoints 
M = 5   #size of each minibatch
m = int(n/M) #number of minibatches
n_epochs = 500 #number of epochs
t0 = 1.0
t1 = 10

gamma_j = t0/t1
j = 0
for epoch in range(1,n_epochs+1):
    for i in range(m):
        k = np.random.randint(m) #Pick the k-th minibatch at random
        #Compute the gradient using the data in minibatch Bk
        #Compute new suggestion for beta
        t = epoch*m+i
        gamma_j = step_length(t,t0,t1)
        j += 1

print("gamma_j after %d epochs: %g" % (n_epochs,gamma_j))