6. Gradient Methods
6.1. Top-down start
We will start with a top-down view, using a simple harmonic oscillator problem in one dimension as our case.
Thereafter we continue with implementing the simplest possible steepest descent approach to our two-electron problem with an electrostatic (Coulomb) interaction. Our code also includes importance sampling. The simple Python code here illustrates the basic elements which need to be included in our own code.
Then we move on to the mathematical description of various gradient methods.
6.2. Motivation
Our aim with this part of the project is to be able to
find an optimal value for the variational parameters using only some few Monte Carlo cycles
use these optimal values for the variational parameters to perform a large-scale Monte Carlo calculation
To achieve this we will look at methods like Steepest descent and the conjugate gradient method. Both these methods allow us to find the minima of a multivariable function like our energy (a function of several variational parameters). Alternatively, you can always use Newton's method. In particular, since we will normally have only one variational parameter, Newton's method can easily be used to find the minimum of the local energy.
6.3. Simple example and demonstration
Let us illustrate what is needed in our calculations using a simple example, the harmonic oscillator in one dimension. For the harmonic oscillator in one dimension we have a trial wave function and probability
$$\psi_T(x;\alpha) = \exp{\left(-\frac{1}{2}\alpha^2x^2\right)}, \qquad P_T(x)dx = \frac{|\psi_T(x;\alpha)|^2}{\int |\psi_T(x;\alpha)|^2dx}dx,$$
which results in a local energy
$$E_L(x) = \frac{1}{2}\left(\alpha^2+x^2(1-\alpha^4)\right).$$
We can compare our numerically calculated energies with the exact energy as a function of $\alpha$,
$$\overline{E}[\alpha] = \frac{1}{4}\left(\alpha^2+\frac{1}{\alpha^2}\right).$$
6.4. Simple example and demonstration
The derivative of the energy with respect to $\alpha$ gives
$$\frac{d\overline{E}[\alpha]}{d\alpha} = \frac{1}{2}\alpha-\frac{1}{2\alpha^3},$$
and a second derivative which is always positive (meaning that we find a minimum)
$$\frac{d^2\overline{E}[\alpha]}{d\alpha^2} = \frac{1}{2}+\frac{3}{2\alpha^4}.$$
The condition
$$\frac{d\overline{E}[\alpha]}{d\alpha} = 0,$$
gives the optimal value $\alpha = 1$, as expected.
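As a quick check of the expressions above, the following minimal Python sketch (the function names here are our own) evaluates the exact energy and its derivatives and applies Newton's method to the single variational parameter, confirming that the iteration converges to alpha = 1 with energy 0.5:

def Ebar(alpha):
    # exact energy <E_L> for the trial wave function above
    return 0.25*(alpha**2 + 1.0/alpha**2)

def dEbar(alpha):
    return 0.5*alpha - 0.5/alpha**3

def d2Ebar(alpha):
    return 0.5 + 1.5/alpha**4

alpha = 0.5
for i in range(10):
    # Newton's method applied to the first derivative of the energy
    alpha -= dEbar(alpha)/d2Ebar(alpha)
print(alpha, Ebar(alpha))  # approaches alpha = 1 with energy 0.5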
6.5. Exercise 1: Find the local energy for the harmonic oscillator
a) Derive the local energy for the harmonic oscillator in one dimension and find its expectation value.
b) Show also that the optimal value is $\alpha = 1$.
c) Repeat the above steps in two dimensions, that is for the two-dimensional harmonic oscillator, and find the optimal value of $\alpha$.
6.6. Variance in the simple model
We can also minimize the variance. In our simple model the variance is
$$\sigma^2[\alpha] = \frac{1}{4}\left(1+(1-\alpha^4)^2\frac{3}{4\alpha^4}\right)-\overline{E}[\alpha]^2,$$
which simplifies to $\sigma^2[\alpha] = (1-\alpha^4)^2/(8\alpha^4)$. Setting the derivative with respect to $\alpha$ to zero again gives $\alpha = 1$, and the variance yields a second derivative which is always positive.
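We can verify both the simplified form of the variance and the location of its minimum with a few lines of sympy (a small check we add here; it is not part of the original programs):

from sympy import symbols, simplify, diff, Rational

alpha = symbols('alpha', positive=True)
Ebar = (alpha**2 + 1/alpha**2)/4
# second moment of the local energy for the probability p(x) ~ exp(-alpha^2 x^2)
EL2 = Rational(1,4)*(1 + 3*(1-alpha**4)**2/(4*alpha**4))
sigma2 = simplify(EL2 - Ebar**2)           # equivalent to (1-alpha**4)**2/(8*alpha**4)
print(sigma2)
print(diff(sigma2, alpha).subs(alpha, 1))  # vanishes at alpha = 1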
6.7. Computing the derivatives
In general we end up computing the expectation value of the energy in terms of some parameters $\alpha_0, \alpha_1, \dots, \alpha_n$,
$$\overline{E}[\boldsymbol{\alpha}] = \frac{\int d\boldsymbol{R}\,\Psi_T^*(\boldsymbol{R};\boldsymbol{\alpha})H\Psi_T(\boldsymbol{R};\boldsymbol{\alpha})}{\int d\boldsymbol{R}\,\Psi_T^*(\boldsymbol{R};\boldsymbol{\alpha})\Psi_T(\boldsymbol{R};\boldsymbol{\alpha})}.$$
This leads to an energy minimization problem where we need the derivative of the energy as a function of the variational parameters.
In the above example this was easy and we were able to find the expression for the derivative by straightforward differentiation. However, in our actual calculations the energy is represented by a multi-dimensional integral with several variational parameters. How can we then obtain the derivatives of the energy with respect to the variational parameters without having to resort to expensive numerical derivatives?
6.8. Expressions for finding the derivatives of the local energy
To find the derivatives of the local energy expectation value as a function of the variational parameters, we can use the chain rule and the hermiticity of the Hamiltonian.
Let us define (limiting ourselves to a single parameter $\alpha$ for notational ease)
$$\bar{E}_{\alpha} = \frac{d\langle E_L[\alpha]\rangle}{d\alpha}$$
as the derivative of the energy with respect to the variational parameter $\alpha$, and
$$\bar{\Psi}_{\alpha} = \frac{d\Psi[\alpha]}{d\alpha}$$
as the derivative of the trial wave function.
6.9. Derivatives of the local energy
The elements of the gradient of the local energy are then (using the chain rule and the hermiticity of the Hamiltonian)
$$\bar{E}_{\alpha} = 2\left(\left\langle \frac{\bar{\Psi}_{\alpha}}{\Psi[\alpha]}E_L[\alpha]\right\rangle - \left\langle \frac{\bar{\Psi}_{\alpha}}{\Psi[\alpha]}\right\rangle\left\langle E_L[\alpha]\right\rangle\right).$$
From a computational point of view it means that you need to compute the expectation values of
$$\left\langle \frac{\bar{\Psi}_{\alpha}}{\Psi[\alpha]}E_L[\alpha]\right\rangle$$
and
$$\left\langle \frac{\bar{\Psi}_{\alpha}}{\Psi[\alpha]}\right\rangle\left\langle E_L[\alpha]\right\rangle.$$
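This is exactly what the Monte Carlo programs below do: they accumulate samples of $E_L$ and of $\bar{\Psi}_{\alpha}/\Psi$ along the walk and combine the averages at the end. A minimal sketch of the estimator for a single variational parameter (the array names are our own):

import numpy as np

def energy_gradient(EL_samples, logpsi_der_samples):
    # EL_samples: local energy E_L at each Monte Carlo sample
    # logpsi_der_samples: \bar{Psi}_alpha / Psi at each sample
    mean_EL = np.mean(EL_samples)
    mean_der = np.mean(logpsi_der_samples)
    mean_der_EL = np.mean(logpsi_der_samples*EL_samples)
    return 2*(mean_der_EL - mean_der*mean_EL)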
6.10. Exercise 2: General expression for the derivative of the energy
a) Show that
$$\bar{E}_{\alpha} = 2\left(\left\langle \frac{\bar{\Psi}_{\alpha}}{\Psi[\alpha]}E_L[\alpha]\right\rangle - \left\langle \frac{\bar{\Psi}_{\alpha}}{\Psi[\alpha]}\right\rangle\left\langle E_L[\alpha]\right\rangle\right).$$
b) Find the corresponding expression for the variance.
6.11. Python program for 2-electrons in 2 dimensions
%matplotlib inline
# 2-electron VMC code for 2dim quantum dot with importance sampling
# Using gaussian rng for new positions and Metropolis-Hastings
# Added energy minimization with gradient descent using fixed step size
# To do: replace with optimization codes from scipy and/or use stochastic gradient descent
from math import exp, sqrt
from random import random, seed, normalvariate
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
import sys
# Trial wave function for the 2-electron quantum dot in two dims
def WaveFunction(r,alpha,beta):
    r1 = r[0,0]**2 + r[0,1]**2
    r2 = r[1,0]**2 + r[1,1]**2
    r12 = sqrt((r[0,0]-r[1,0])**2 + (r[0,1]-r[1,1])**2)
    deno = r12/(1+beta*r12)
    return exp(-0.5*alpha*(r1+r2)+deno)

# Local energy for the 2-electron quantum dot in two dims, using analytical local energy
def LocalEnergy(r,alpha,beta):
    r1 = (r[0,0]**2 + r[0,1]**2)
    r2 = (r[1,0]**2 + r[1,1]**2)
    r12 = sqrt((r[0,0]-r[1,0])**2 + (r[0,1]-r[1,1])**2)
    deno = 1.0/(1+beta*r12)
    deno2 = deno*deno
    return 0.5*(1-alpha*alpha)*(r1 + r2) + 2.0*alpha + 1.0/r12 + deno2*(alpha*r12 - deno2 + 2*beta*deno - 1.0/r12)

# Derivative of wave function ansatz with respect to the variational parameters
def DerivativeWFansatz(r,alpha,beta):
    WfDer = np.zeros((2), np.double)
    r1 = (r[0,0]**2 + r[0,1]**2)
    r2 = (r[1,0]**2 + r[1,1]**2)
    r12 = sqrt((r[0,0]-r[1,0])**2 + (r[0,1]-r[1,1])**2)
    deno = 1.0/(1+beta*r12)
    deno2 = deno*deno
    WfDer[0] = -0.5*(r1+r2)
    WfDer[1] = -r12*r12*deno2
    return WfDer

# Setting up the quantum force for the two-electron quantum dot, recall that it is a vector
def QuantumForce(r,alpha,beta):
    qforce = np.zeros((NumberParticles,Dimension), np.double)
    r12 = sqrt((r[0,0]-r[1,0])**2 + (r[0,1]-r[1,1])**2)
    deno = 1.0/(1+beta*r12)
    # F = 2 grad(Psi)/Psi: Gaussian (drift) term plus Jastrow term
    qforce[0,:] = -2*r[0,:]*alpha + 2*(r[0,:]-r[1,:])*deno*deno/r12
    qforce[1,:] = -2*r[1,:]*alpha + 2*(r[1,:]-r[0,:])*deno*deno/r12
    return qforce
# Computing the derivative of the energy and the energy
def EnergyMinimization(alpha, beta):
    NumberMCcycles= 10000
    # Parameters in the Fokker-Planck simulation of the quantum force
    D = 0.5
    TimeStep = 0.05
    # positions
    PositionOld = np.zeros((NumberParticles,Dimension), np.double)
    PositionNew = np.zeros((NumberParticles,Dimension), np.double)
    # Quantum force
    QuantumForceOld = np.zeros((NumberParticles,Dimension), np.double)
    QuantumForceNew = np.zeros((NumberParticles,Dimension), np.double)
    # seed for rng generator
    seed()
    energy = 0.0
    DeltaE = 0.0
    EnergyDer = np.zeros((2), np.double)
    DeltaPsi = np.zeros((2), np.double)
    DerivativePsiE = np.zeros((2), np.double)
    # Initial position
    for i in range(NumberParticles):
        for j in range(Dimension):
            PositionOld[i,j] = normalvariate(0.0,1.0)*sqrt(TimeStep)
    wfold = WaveFunction(PositionOld,alpha,beta)
    QuantumForceOld = QuantumForce(PositionOld,alpha, beta)
    # Loop over Monte Carlo cycles
    for MCcycle in range(NumberMCcycles):
        # Trial position, moving one particle at the time
        for i in range(NumberParticles):
            # keep the particles that are not being moved at their current positions
            PositionNew = PositionOld.copy()
            for j in range(Dimension):
                PositionNew[i,j] = PositionOld[i,j]+normalvariate(0.0,1.0)*sqrt(TimeStep)+\
                                   QuantumForceOld[i,j]*TimeStep*D
            wfnew = WaveFunction(PositionNew,alpha,beta)
            QuantumForceNew = QuantumForce(PositionNew,alpha, beta)
            GreensFunction = 0.0
            for j in range(Dimension):
                GreensFunction += 0.5*(QuantumForceOld[i,j]+QuantumForceNew[i,j])*\
                                  (D*TimeStep*0.5*(QuantumForceOld[i,j]-QuantumForceNew[i,j])-\
                                  PositionNew[i,j]+PositionOld[i,j])
            GreensFunction = exp(GreensFunction)
            ProbabilityRatio = GreensFunction*wfnew**2/wfold**2
            # Metropolis-Hastings test to see whether we accept the move
            if random() <= ProbabilityRatio:
                for j in range(Dimension):
                    PositionOld[i,j] = PositionNew[i,j]
                    QuantumForceOld[i,j] = QuantumForceNew[i,j]
                wfold = wfnew
        DeltaE = LocalEnergy(PositionOld,alpha,beta)
        DerPsi = DerivativeWFansatz(PositionOld,alpha,beta)
        DeltaPsi += DerPsi
        energy += DeltaE
        DerivativePsiE += DerPsi*DeltaE
    # We calculate mean values
    energy /= NumberMCcycles
    DerivativePsiE /= NumberMCcycles
    DeltaPsi /= NumberMCcycles
    EnergyDer = 2*(DerivativePsiE-DeltaPsi*energy)
    return energy, EnergyDer
# Here starts the main program with variable declarations
NumberParticles = 2
Dimension = 2
# guess for variational parameters
alpha = 0.9
beta = 0.2
# Set up iteration using gradient descent method
Energy = 0
EDerivative = np.zeros((2), np.double)
eta = 0.01
Niterations = 50
#
for iter in range(Niterations):
    Energy, EDerivative = EnergyMinimization(alpha,beta)
    alphagradient = EDerivative[0]
    betagradient = EDerivative[1]
    alpha -= eta*alphagradient
    beta -= eta*betagradient

print(alpha, beta)
print(Energy, EDerivative[0], EDerivative[1])
1.0088427878653348 0.3178149975170346
3.0061947633799724 -0.04475836384278331 -0.07290099872381006
6.12. Using Broyden's algorithm in scipy
The following code uses the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, discussed in more detail further below. Here we have defined a function which calculates the energy and a function which computes the first derivative, and we hand both to scipy's minimize function.
# 2-electron VMC code for 2dim quantum dot with importance sampling
# Using gaussian rng for new positions and Metropolis-Hastings
# Added energy minimization using the BFGS algorithm, see p. 136 of https://www.springer.com/it/book/9780387303031
from math import exp, sqrt
from random import random, seed, normalvariate
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
from scipy.optimize import minimize
import sys
# Trial wave function for the 2-electron quantum dot in two dims
def WaveFunction(r,alpha,beta):
    r1 = r[0,0]**2 + r[0,1]**2
    r2 = r[1,0]**2 + r[1,1]**2
    r12 = sqrt((r[0,0]-r[1,0])**2 + (r[0,1]-r[1,1])**2)
    deno = r12/(1+beta*r12)
    return exp(-0.5*alpha*(r1+r2)+deno)

# Local energy for the 2-electron quantum dot in two dims, using analytical local energy
def LocalEnergy(r,alpha,beta):
    r1 = (r[0,0]**2 + r[0,1]**2)
    r2 = (r[1,0]**2 + r[1,1]**2)
    r12 = sqrt((r[0,0]-r[1,0])**2 + (r[0,1]-r[1,1])**2)
    deno = 1.0/(1+beta*r12)
    deno2 = deno*deno
    return 0.5*(1-alpha*alpha)*(r1 + r2) + 2.0*alpha + 1.0/r12 + deno2*(alpha*r12 - deno2 + 2*beta*deno - 1.0/r12)

# Derivative of wave function ansatz with respect to the variational parameters
def DerivativeWFansatz(r,alpha,beta):
    WfDer = np.zeros((2), np.double)
    r1 = (r[0,0]**2 + r[0,1]**2)
    r2 = (r[1,0]**2 + r[1,1]**2)
    r12 = sqrt((r[0,0]-r[1,0])**2 + (r[0,1]-r[1,1])**2)
    deno = 1.0/(1+beta*r12)
    deno2 = deno*deno
    WfDer[0] = -0.5*(r1+r2)
    WfDer[1] = -r12*r12*deno2
    return WfDer

# Setting up the quantum force for the two-electron quantum dot, recall that it is a vector
def QuantumForce(r,alpha,beta):
    qforce = np.zeros((NumberParticles,Dimension), np.double)
    r12 = sqrt((r[0,0]-r[1,0])**2 + (r[0,1]-r[1,1])**2)
    deno = 1.0/(1+beta*r12)
    # F = 2 grad(Psi)/Psi: Gaussian (drift) term plus Jastrow term
    qforce[0,:] = -2*r[0,:]*alpha + 2*(r[0,:]-r[1,:])*deno*deno/r12
    qforce[1,:] = -2*r[1,:]*alpha + 2*(r[1,:]-r[0,:])*deno*deno/r12
    return qforce
# Computing the derivative of the energy and the energy
def EnergyDerivative(x0):
    # Parameters in the Fokker-Planck simulation of the quantum force
    D = 0.5
    TimeStep = 0.05
    NumberMCcycles= 10000
    # positions
    PositionOld = np.zeros((NumberParticles,Dimension), np.double)
    PositionNew = np.zeros((NumberParticles,Dimension), np.double)
    # Quantum force
    QuantumForceOld = np.zeros((NumberParticles,Dimension), np.double)
    QuantumForceNew = np.zeros((NumberParticles,Dimension), np.double)
    energy = 0.0
    DeltaE = 0.0
    alpha = x0[0]
    beta = x0[1]
    EnergyDer = 0.0
    DeltaPsi = 0.0
    DerivativePsiE = 0.0
    # Initial position
    for i in range(NumberParticles):
        for j in range(Dimension):
            PositionOld[i,j] = normalvariate(0.0,1.0)*sqrt(TimeStep)
    wfold = WaveFunction(PositionOld,alpha,beta)
    QuantumForceOld = QuantumForce(PositionOld,alpha, beta)
    # Loop over Monte Carlo cycles
    for MCcycle in range(NumberMCcycles):
        # Trial position, moving one particle at the time
        for i in range(NumberParticles):
            # keep the particles that are not being moved at their current positions
            PositionNew = PositionOld.copy()
            for j in range(Dimension):
                PositionNew[i,j] = PositionOld[i,j]+normalvariate(0.0,1.0)*sqrt(TimeStep)+\
                                   QuantumForceOld[i,j]*TimeStep*D
            wfnew = WaveFunction(PositionNew,alpha,beta)
            QuantumForceNew = QuantumForce(PositionNew,alpha, beta)
            GreensFunction = 0.0
            for j in range(Dimension):
                GreensFunction += 0.5*(QuantumForceOld[i,j]+QuantumForceNew[i,j])*\
                                  (D*TimeStep*0.5*(QuantumForceOld[i,j]-QuantumForceNew[i,j])-\
                                  PositionNew[i,j]+PositionOld[i,j])
            GreensFunction = exp(GreensFunction)
            ProbabilityRatio = GreensFunction*wfnew**2/wfold**2
            # Metropolis-Hastings test to see whether we accept the move
            if random() <= ProbabilityRatio:
                for j in range(Dimension):
                    PositionOld[i,j] = PositionNew[i,j]
                    QuantumForceOld[i,j] = QuantumForceNew[i,j]
                wfold = wfnew
        DeltaE = LocalEnergy(PositionOld,alpha,beta)
        DerPsi = DerivativeWFansatz(PositionOld,alpha,beta)
        DeltaPsi += DerPsi
        energy += DeltaE
        DerivativePsiE += DerPsi*DeltaE
    # We calculate mean values
    energy /= NumberMCcycles
    DerivativePsiE /= NumberMCcycles
    DeltaPsi /= NumberMCcycles
    EnergyDer = 2*(DerivativePsiE-DeltaPsi*energy)
    return EnergyDer
# Computing the expectation value of the local energy
def Energy(x0):
    # Parameters in the Fokker-Planck simulation of the quantum force
    D = 0.5
    TimeStep = 0.05
    # positions
    PositionOld = np.zeros((NumberParticles,Dimension), np.double)
    PositionNew = np.zeros((NumberParticles,Dimension), np.double)
    # Quantum force
    QuantumForceOld = np.zeros((NumberParticles,Dimension), np.double)
    QuantumForceNew = np.zeros((NumberParticles,Dimension), np.double)
    energy = 0.0
    DeltaE = 0.0
    alpha = x0[0]
    beta = x0[1]
    NumberMCcycles= 10000
    # Initial position
    for i in range(NumberParticles):
        for j in range(Dimension):
            PositionOld[i,j] = normalvariate(0.0,1.0)*sqrt(TimeStep)
    wfold = WaveFunction(PositionOld,alpha,beta)
    QuantumForceOld = QuantumForce(PositionOld,alpha, beta)
    # Loop over Monte Carlo cycles
    for MCcycle in range(NumberMCcycles):
        # Trial position, moving one particle at the time
        for i in range(NumberParticles):
            # keep the particles that are not being moved at their current positions
            PositionNew = PositionOld.copy()
            for j in range(Dimension):
                PositionNew[i,j] = PositionOld[i,j]+normalvariate(0.0,1.0)*sqrt(TimeStep)+\
                                   QuantumForceOld[i,j]*TimeStep*D
            wfnew = WaveFunction(PositionNew,alpha,beta)
            QuantumForceNew = QuantumForce(PositionNew,alpha, beta)
            GreensFunction = 0.0
            for j in range(Dimension):
                GreensFunction += 0.5*(QuantumForceOld[i,j]+QuantumForceNew[i,j])*\
                                  (D*TimeStep*0.5*(QuantumForceOld[i,j]-QuantumForceNew[i,j])-\
                                  PositionNew[i,j]+PositionOld[i,j])
            GreensFunction = exp(GreensFunction)
            ProbabilityRatio = GreensFunction*wfnew**2/wfold**2
            # Metropolis-Hastings test to see whether we accept the move
            if random() <= ProbabilityRatio:
                for j in range(Dimension):
                    PositionOld[i,j] = PositionNew[i,j]
                    QuantumForceOld[i,j] = QuantumForceNew[i,j]
                wfold = wfnew
        DeltaE = LocalEnergy(PositionOld,alpha,beta)
        energy += DeltaE
    # We calculate mean values
    energy /= NumberMCcycles
    return energy
#Here starts the main program with variable declarations
NumberParticles = 2
Dimension = 2
# seed for rng generator
seed()
# guess for variational parameters
x0 = np.array([0.9,0.2])
# Using Broydens method
res = minimize(Energy, x0, method='BFGS', jac=EnergyDerivative, options={'gtol': 1e-4,'disp': True})
print(res.x)
Warning: Desired error not necessarily achieved due to precision loss.
Current function value: 2.996029
Iterations: 4
Function evaluations: 25
Gradient evaluations: 13
[0.96645852 0.44826449]
Note that the minimize function returns the final values of the variational parameters alpha and beta in the array res.x.
6.13. Brief reminder on Newton-Raphson's method
Let us quickly remind ourselves how we derive the above method.
Perhaps the most celebrated of all one-dimensional root-finding routines is Newton's method, also called the Newton-Raphson method. This method requires the evaluation of both the function $f$ and its derivative $f'$ at arbitrary points.
6.14. The equations
The Newton-Raphson formula consists geometrically of extending the tangent line at a current point until it crosses zero, then setting the next guess to the abscissa of that zero-crossing. The mathematics behind this method is rather simple. Employing a Taylor expansion for $x$ sufficiently close to the solution $s$, we have
$$f(s) = 0 = f(x)+(s-x)f'(x)+\frac{(s-x)^2}{2}f''(x)+\dots$$
For small enough values of the function and for well-behaved functions, the terms beyond linear are unimportant, hence we obtain
$$f(x)+(s-x)f'(x)\approx 0,$$
yielding
$$s\approx x-\frac{f(x)}{f'(x)}.$$
Having in mind an iterative procedure, it is natural to start iterating with
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.$$
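A minimal sketch of the iteration (the example function is our own; in our variational setting $f$ would be the derivative of the energy with respect to the parameter):

def newton_raphson(f, fprime, x, niter=20, tol=1e-12):
    # iterate x -> x - f(x)/f'(x) until f(x) is small
    for n in range(niter):
        x = x - f(x)/fprime(x)
        if abs(f(x)) < tol:
            break
    return x

# root of f(x) = x^2 - 2, i.e. sqrt(2)
print(newton_raphson(lambda x: x*x-2, lambda x: 2*x, 1.0))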
6.15. Simple geometric interpretation
The above is Newton-Raphson's method. It has a simple geometric interpretation, namely $x_{n+1}$ is the point where the tangent of $f$ at $x_n$ crosses the $x$-axis.
6.16. Extending to more than one variable
Newton's method can be generalized to systems of several non-linear equations and variables. Consider the case with two equations
$$f(x_1,x_2) = 0, \qquad g(x_1,x_2) = 0,$$
which we Taylor expand to obtain
$$0 = f(x_1+h_1,x_2+h_2) = f(x_1,x_2)+h_1\frac{\partial f}{\partial x_1}+h_2\frac{\partial f}{\partial x_2}+\dots,$$
$$0 = g(x_1+h_1,x_2+h_2) = g(x_1,x_2)+h_1\frac{\partial g}{\partial x_1}+h_2\frac{\partial g}{\partial x_2}+\dots$$
Defining the Jacobian matrix
$$\boldsymbol{J} = \begin{bmatrix}\frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2}\\ \frac{\partial g}{\partial x_1} & \frac{\partial g}{\partial x_2}\end{bmatrix},$$
we can rephrase Newton's method as
$$\begin{bmatrix}x_1^{n+1}\\ x_2^{n+1}\end{bmatrix} = \begin{bmatrix}x_1^{n}\\ x_2^{n}\end{bmatrix}+\begin{bmatrix}h_1^{n}\\ h_2^{n}\end{bmatrix},$$
where we have defined
$$\begin{bmatrix}h_1^{n}\\ h_2^{n}\end{bmatrix} = -\boldsymbol{J}^{-1}\begin{bmatrix}f(x_1^{n},x_2^{n})\\ g(x_1^{n},x_2^{n})\end{bmatrix}.$$
We need thus to compute the inverse of the Jacobian matrix, and it is easy to understand that difficulties may arise in case $\boldsymbol{J}$ is singular or nearly singular.
It is rather straightforward to extend the above scheme to systems of more than two non-linear equations. In our case, the Jacobian matrix is given by the Hessian, the matrix of second derivatives of the cost function.
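A minimal multivariate sketch (the example system is chosen by us), solving the Newton step with numpy.linalg.solve instead of forming the inverse explicitly:

import numpy as np

def newton_system(F, J, x, niter=20, tol=1e-12):
    # x_{n+1} = x_n - J(x_n)^{-1} F(x_n), computed via a linear solve
    for n in range(niter):
        x = x - np.linalg.solve(J(x), F(x))
        if np.linalg.norm(F(x)) < tol:
            break
    return x

# example: intersection of the circle x^2+y^2 = 4 with the line y = x
F = lambda x: np.array([x[0]**2 + x[1]**2 - 4.0, x[1] - x[0]])
J = lambda x: np.array([[2*x[0], 2*x[1]], [-1.0, 1.0]])
print(newton_system(F, J, np.array([1.0, 2.0])))  # approaches [sqrt(2), sqrt(2)]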
6.17. Steepest descent
The basic idea of gradient descent is that a function $F(\boldsymbol{x})$, $\boldsymbol{x}\equiv (x_1,\cdots,x_n)$, decreases fastest if one goes from $\boldsymbol{x}$ in the direction of the negative gradient $-\nabla F(\boldsymbol{x})$.
It can be shown that if
$$\boldsymbol{x}_{k+1} = \boldsymbol{x}_k - \gamma_k \nabla F(\boldsymbol{x}_k),$$
with $\gamma_k > 0$.
For $\gamma_k$ small enough, then $F(\boldsymbol{x}_{k+1}) \leq F(\boldsymbol{x}_k)$. This means that for a sufficiently small $\gamma_k$ we are always moving towards smaller function values, i.e. towards a minimum.
6.18. More on Steepest descent
The previous observation is the basis of the method of steepest descent, which is also referred to as just gradient descent (GD). One starts with an initial guess $\boldsymbol{x}_0$ for a minimum of $F$ and computes new approximations according to
$$\boldsymbol{x}_{k+1} = \boldsymbol{x}_k - \gamma_k \nabla F(\boldsymbol{x}_k), \qquad k \geq 0.$$
The parameter $\gamma_k$ is often referred to as the step length or the learning rate within the context of Machine Learning.
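A minimal sketch of the update rule with a fixed learning rate (the function and names are our own):

import numpy as np

def gradient_descent(grad, x0, gamma=0.1, niter=100):
    # plain gradient descent with a fixed step length gamma
    x = x0
    for k in range(niter):
        x = x - gamma*grad(x)
    return x

# minimize f(x) = 0.5*x_0^2 + 2.5*x_1^2, the same quadratic used in the example further below
grad = lambda x: np.array([x[0], 5*x[1]])
print(gradient_descent(grad, np.array([2.0, 0.4])))  # approaches [0, 0]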
6.19. The ideal
Ideally the sequence $\{\boldsymbol{x}_k\}_{k=0}$ converges to a global minimum of the function $F$. In general we do not know if we are in a global or local minimum.
In machine learning we are often faced with non-convex high-dimensional cost functions with many local minima. Since GD is deterministic we will get stuck in a local minimum, if the method converges, unless we have a very good initial guess. This also implies that the scheme is sensitive to the chosen initial condition.
Note that the gradient is a function of $\boldsymbol{x} = (x_1,\cdots,x_n)$, which makes it expensive to compute numerically.
6.20. The sensitiveness of the gradient descent
The gradient descent method is sensitive to the choice of learning rate $\gamma_k$. This is due to the fact that we are only guaranteed that $F(\boldsymbol{x}_{k+1}) \leq F(\boldsymbol{x}_k)$ for sufficiently small $\gamma_k$. A too small learning rate gives very slow convergence, while a too large learning rate can lead to erratic, even divergent, behavior.
Many of these shortcomings can be alleviated by introducing randomness. One such method is that of Stochastic Gradient Descent (SGD), see below.
6.21. Convex functions
Ideally we want our cost/loss function to be convex (or concave).
First we give the definition of a convex set: A set $C$ in $\mathbb{R}^n$ is said to be convex if, for all $x$ and $y$ in $C$ and all $t \in (0,1)$, the point $(1-t)x + ty$ also belongs to $C$. Geometrically this means that every point on the line segment connecting $x$ and $y$ is in $C$.
The convex subsets of $\mathbb{R}$ are the intervals of $\mathbb{R}$.
6.22. Convex function
Convex function: Let $X \subset \mathbb{R}^n$ be a convex set. Assume that the function $f: X \rightarrow \mathbb{R}$ is continuous, then $f$ is said to be convex if
$$f(tx_1 + (1-t)x_2) \leq t f(x_1) + (1-t)f(x_2)$$
for all $x_1, x_2 \in X$ and for all $t \in [0,1]$. If $\leq$ is replaced with a strict inequality in the definition, we demand $x_1 \neq x_2$ and $t \in (0,1)$, and then $f$ is said to be strictly convex.
6.23. Conditions on convex functions
In the following we state first and second-order conditions which ensure convexity of a function $f$.
First order condition.
Suppose $f$ is differentiable (i.e. the gradient $\nabla f(x)$ is well defined for all $x$ in the domain of $f$). Then $f$ is convex if and only if
$$f(y) \geq f(x) + \nabla f(x)^T (y-x)$$
holds for all $x, y$ in the domain of $f$.
Second order condition.
Assume that $f$ is twice differentiable, i.e. the Hessian matrix exists at each point in the domain of $f$. Then $f$ is convex if and only if the Hessian matrix is positive semi-definite for all $x$ in the domain of $f$.
This condition is particularly useful since it gives us a procedure for determining if the function under consideration is convex, apart from using the definition.
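The second-order condition is easy to check numerically. A small sketch (the function and names are our own) that tests positive semi-definiteness of the Hessian via its eigenvalues:

import numpy as np

def is_convex_at(hessian, x, tol=1e-12):
    # positive semi-definite Hessian <=> all eigenvalues >= 0
    eigvals = np.linalg.eigvalsh(hessian(x))
    return np.all(eigvals >= -tol)

# f(x,y) = x^2 + 3xy + 10y^2 has the constant Hessian [[2,3],[3,20]]
hessian = lambda x: np.array([[2.0, 3.0], [3.0, 20.0]])
print(is_convex_at(hessian, np.zeros(2)))  # True: both eigenvalues are positive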
6.24. More on convex functions
The next result is of great importance to us and the reason why we are going on about convex functions. In machine learning we frequently have to minimize a loss/cost function in order to find the best parameters for the model we are considering.
Ideally we want the global minimum (for high-dimensional models it is hard to know if we have a local or global minimum). However, if the cost/loss function is convex the following result provides invaluable information:
Any minimum is global for convex functions.
Consider the problem of finding $\min_{x \in \mathbb{R}^n} f(x)$, where $f$ is convex and differentiable. Then, any point $x^*$ that satisfies $\nabla f(x^*) = 0$ is a global minimum.
This result means that if we know that the cost/loss function is convex and we are able to find a minimum, we are guaranteed that it is a global minimum.
6.25. Some simple problems
a) Show that $f(x) = x^2$ is convex for $x \in \mathbb{R}$ using the definition of convexity. Hint: if you re-write the definition, $f$ is convex if the following holds for all $x, y$ and any $\lambda \in [0,1]$:
$$\lambda f(x) + (1-\lambda) f(y) - f(\lambda x + (1-\lambda) y) \geq 0.$$
b) Using the second order condition, show that the following functions are convex on the specified domain: $f(x) = e^x$ is convex for $x \in \mathbb{R}$, and $g(x) = -\ln(x)$ is convex for $x \in (0,\infty)$.
c) Let $f(x) = x^2$ and $g(x) = e^x$. Show that $f(g(x))$ and $g(f(x))$ are convex for $x \in \mathbb{R}$. Also show that if $f(x)$ is any convex function, then $h(x) = e^{f(x)}$ is convex.
d) A norm is any function that satisfies the following properties: $f(\alpha x) = |\alpha| f(x)$ for all $\alpha \in \mathbb{R}$; $f(x+y) \leq f(x) + f(y)$ (the triangle inequality); $f(x) \geq 0$ for all $x$, with equality if and only if $x = 0$.
Using the definition of convexity, try to show that a function satisfying the properties above is convex (the third condition is not needed to show this).
6.26. Standard steepest descent
Before we proceed, we would like to discuss the approach called the standard steepest descent, which again leads to us having to be able to compute a matrix. It belongs to the class of Conjugate Gradient methods (CG).
The success of the CG method for finding solutions of non-linear problems is based on the theory of conjugate gradients for linear systems of equations. It belongs to the class of iterative methods for solving problems from linear algebra of the type
$$\boldsymbol{A}\boldsymbol{x} = \boldsymbol{b}.$$
In the iterative process we end up with a problem like
$$\boldsymbol{r} = \boldsymbol{b} - \boldsymbol{A}\boldsymbol{x},$$
where $\boldsymbol{r}$ is the so-called residual or error in the iterative process.
When we have found the exact solution, $\boldsymbol{r} = 0$.
6.27. Gradient method
The residual is zero when we reach the minimum of the quadratic equation
$$P(\boldsymbol{x}) = \frac{1}{2}\boldsymbol{x}^T\boldsymbol{A}\boldsymbol{x} - \boldsymbol{x}^T\boldsymbol{b},$$
with the constraint that the matrix $\boldsymbol{A}$ is positive definite and symmetric.
6.28. Steepest descent method
We denote the initial guess for $\boldsymbol{x}$ as $\boldsymbol{x}_0$. We can assume without loss of generality that
$$\boldsymbol{x}_0 = 0,$$
or consider the system
$$\boldsymbol{A}\boldsymbol{z} = \boldsymbol{b} - \boldsymbol{A}\boldsymbol{x}_0,$$
instead.
6.29. Steepest descent method
One can show that the solution $\boldsymbol{x}$ is also the unique minimizer of the quadratic form
$$f(\boldsymbol{x}) = \frac{1}{2}\boldsymbol{x}^T\boldsymbol{A}\boldsymbol{x} - \boldsymbol{x}^T\boldsymbol{b}, \qquad \boldsymbol{x} \in \mathbb{R}^n.$$
This suggests taking the first basis vector $\boldsymbol{r}_1$ to be the gradient of $f$ at $\boldsymbol{x} = \boldsymbol{x}_0$, which equals
$$\boldsymbol{A}\boldsymbol{x}_0 - \boldsymbol{b},$$
and with $\boldsymbol{x}_0 = 0$ it is equal to $-\boldsymbol{b}$.
6.30. Final expressions
We can compute the residual iteratively as
$$\boldsymbol{r}_{k+1} = \boldsymbol{A}\boldsymbol{x}_{k+1} - \boldsymbol{b},$$
which equals
$$\boldsymbol{A}(\boldsymbol{x}_k - \alpha_k\boldsymbol{r}_k) - \boldsymbol{b},$$
or
$$(\boldsymbol{A}\boldsymbol{x}_k - \boldsymbol{b}) - \alpha_k\boldsymbol{A}\boldsymbol{r}_k,$$
which gives
$$\boldsymbol{r}_{k+1} = \boldsymbol{r}_k - \alpha_k\boldsymbol{A}\boldsymbol{r}_k.$$
Minimizing $f(\boldsymbol{x}_k - \alpha\boldsymbol{r}_k)$ with respect to $\alpha$ gives $\alpha_k = \frac{\boldsymbol{r}_k^T\boldsymbol{r}_k}{\boldsymbol{r}_k^T\boldsymbol{A}\boldsymbol{r}_k}$, leading to the iterative scheme
$$\boldsymbol{x}_{k+1} = \boldsymbol{x}_k - \alpha_k\boldsymbol{r}_k.$$
6.31. Code examples for steepest descent
6.32. Simple codes for steepest descent and conjugate gradient using a matrix, in C++, with Python sketches below
#include <cmath>
#include <iostream>
#include <fstream>
#include <iomanip>
#include "vectormatrixclass.h"
using namespace std;
//   Main function begins here
int main(int argc, char * argv[]){
  int dim = 2;
  Vector x(dim), xsd(dim), b(dim), x0(dim);
  Matrix A(dim,dim);
  // Set our initial guess
  x0(0) = x0(1) = 0;
  // Set the matrix
  A(0,0) = 3; A(1,0) = 2; A(0,1) = 2; A(1,1) = 6;
  b(0) = 2; b(1) = -8;
  cout << "The Matrix A that we are using: " << endl;
  A.Print();
  cout << endl;
  xsd = SteepestDescent(A,b,x0);
  cout << "The approximate solution using Steepest Descent is: " << endl;
  xsd.Print();
  cout << endl;
}
6.33. The routine for the steepest descent method
Vector SteepestDescent(Matrix A, Vector b, Vector x0){
  int IterMax, i;
  int dim = x0.Dimension();
  const double tolerance = 1.0e-14;
  Vector x(dim), r(dim), z(dim);
  double c, alpha;
  IterMax = 30;
  x = x0;
  r = A*x-b;
  i = 0;
  while (i <= IterMax){
    z = A*r;
    c = dot(r,r);
    alpha = c/dot(r,z);
    x = x - alpha*r;
    r = A*x-b;
    if(sqrt(dot(r,r)) < tolerance) break;
    i++;
  }
  return x;
}
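The heading above promised Python code to come; here is a minimal NumPy sketch of the same steepest descent routine, applied to the matrix and right-hand side from the C++ main program (the exact solution of this small system is [2, -2]):

import numpy as np

def steepest_descent(A, b, x0, tolerance=1.0e-14, iter_max=30):
    x = x0.copy()
    r = A @ x - b                # residual (gradient of the quadratic form)
    for i in range(iter_max):
        z = A @ r
        alpha = r @ r / (r @ z)  # exact line minimization along -r
        x = x - alpha*r
        r = A @ x - b
        if np.sqrt(r @ r) < tolerance:
            break
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(steepest_descent(A, b, np.zeros(2)))  # approaches [2, -2]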
6.34. Steepest descent example
import numpy as np
import numpy.linalg as la
import scipy.optimize as sopt
import matplotlib.pyplot as pt
from mpl_toolkits.mplot3d import axes3d

def f(x):
    return 0.5*x[0]**2 + 2.5*x[1]**2

def df(x):
    return np.array([x[0], 5*x[1]])

fig = pt.figure()
ax = fig.add_subplot(projection="3d")
xmesh, ymesh = np.mgrid[-2:2:50j,-2:2:50j]
fmesh = f(np.array([xmesh, ymesh]))
ax.plot_surface(xmesh, ymesh, fmesh)
And then as a contour plot
pt.axis("equal")
pt.contour(xmesh, ymesh, fmesh)
guesses = [np.array([2, 2./5])]
Find guesses
x = guesses[-1]
s = -df(x)
Run it!
def f1d(alpha):
    return f(x + alpha*s)

alpha_opt = sopt.golden(f1d)
next_guess = x + alpha_opt * s
guesses.append(next_guess)
print(next_guess)
[ 1.33333333 -0.26666667]
What happened?
pt.axis("equal")
pt.contour(xmesh, ymesh, fmesh, 50)
it_array = np.array(guesses)
pt.plot(it_array.T[0], it_array.T[1], "x-")
6.35. Conjugate gradient method
In the CG method we define so-called conjugate directions, and two vectors $\boldsymbol{s}$ and $\boldsymbol{t}$ are said to be conjugate if
$$\boldsymbol{s}^T\boldsymbol{A}\boldsymbol{t} = 0.$$
The philosophy of the CG method is to perform searches in various conjugate directions of our vectors $\boldsymbol{x}_i$ obeying the above criterion, namely
$$\boldsymbol{x}_i^T\boldsymbol{A}\boldsymbol{x}_j = 0.$$
Two vectors are conjugate if they are orthogonal with respect to this inner product. Being conjugate is a symmetric relation: if $\boldsymbol{s}$ is conjugate to $\boldsymbol{t}$, then $\boldsymbol{t}$ is conjugate to $\boldsymbol{s}$.
6.36. Conjugate gradient method
An example is given by the eigenvectors of the matrix,
$$\boldsymbol{v}_i^T\boldsymbol{A}\boldsymbol{v}_j = \lambda_j\boldsymbol{v}_i^T\boldsymbol{v}_j,$$
which is zero unless $i = j$.
6.37. Conjugate gradient method
Assume now that we have a symmetric positive-definite matrix $\boldsymbol{A}$ of size $n \times n$. At each iteration $i+1$ we obtain the conjugate direction of a vector.
We assume that $\boldsymbol{p}_1, \dots, \boldsymbol{p}_n$ is a sequence of $n$ mutually conjugate directions. Then the $\boldsymbol{p}_i$ form a basis of $\mathbb{R}^n$, and we can expand the solution of $\boldsymbol{A}\boldsymbol{x} = \boldsymbol{b}$ in this basis, namely
$$\boldsymbol{x} = \sum_{i=1}^{n} \alpha_i \boldsymbol{p}_i.$$
6.38. Conjugate gradient method
The coefficients are given by
$$\boldsymbol{A}\boldsymbol{x} = \sum_{i=1}^{n} \alpha_i \boldsymbol{A}\boldsymbol{p}_i = \boldsymbol{b}.$$
Multiplying with $\boldsymbol{p}_k^T$ from the left gives
$$\boldsymbol{p}_k^T\boldsymbol{A}\boldsymbol{x} = \sum_{i=1}^{n} \alpha_i \boldsymbol{p}_k^T\boldsymbol{A}\boldsymbol{p}_i = \boldsymbol{p}_k^T\boldsymbol{b},$$
and we can define the coefficients as
$$\alpha_k = \frac{\boldsymbol{p}_k^T\boldsymbol{b}}{\boldsymbol{p}_k^T\boldsymbol{A}\boldsymbol{p}_k}.$$
6.39. Conjugate gradient method and iterations
If we choose the conjugate vectors $\boldsymbol{p}_k$ carefully, then we may not need all of them to obtain a good approximation to the solution $\boldsymbol{x}$. We want to regard the conjugate gradient method as an iterative method. This allows us to solve systems where $n$ is so large that a direct method would take too much time.
We denote the initial guess for $\boldsymbol{x}$ as $\boldsymbol{x}_0$. We can assume without loss of generality that
$$\boldsymbol{x}_0 = 0,$$
or consider the system
$$\boldsymbol{A}\boldsymbol{z} = \boldsymbol{b} - \boldsymbol{A}\boldsymbol{x}_0,$$
instead.
6.40. Conjugate gradient method
One can show that the solution $\boldsymbol{x}$ is also the unique minimizer of the quadratic form
$$f(\boldsymbol{x}) = \frac{1}{2}\boldsymbol{x}^T\boldsymbol{A}\boldsymbol{x} - \boldsymbol{x}^T\boldsymbol{b}, \qquad \boldsymbol{x} \in \mathbb{R}^n.$$
This suggests taking the first basis vector $\boldsymbol{p}_1$ to be the gradient of $f$ at $\boldsymbol{x} = \boldsymbol{x}_0$, which equals $\boldsymbol{A}\boldsymbol{x}_0 - \boldsymbol{b}$,
and with $\boldsymbol{x}_0 = 0$ it is equal to $-\boldsymbol{b}$. The other vectors in the basis will be conjugate to the gradient, hence the name conjugate gradient method.
6.41. Conjugate gradient method
Let $\boldsymbol{r}_k$ be the residual at the $k$-th step,
$$\boldsymbol{r}_k = \boldsymbol{b} - \boldsymbol{A}\boldsymbol{x}_k.$$
Note that $\boldsymbol{r}_k$ is the negative gradient of $f$ at $\boldsymbol{x} = \boldsymbol{x}_k$, so the gradient descent method would move in the direction $\boldsymbol{r}_k$. Here, we insist instead that the directions $\boldsymbol{p}_k$ are conjugate, and take the direction closest to the current residual $\boldsymbol{r}_k$
under the conjugacy constraint.
This gives the following expression
$$\boldsymbol{p}_{k+1} = \boldsymbol{r}_k - \frac{\boldsymbol{p}_k^T\boldsymbol{A}\boldsymbol{r}_k}{\boldsymbol{p}_k^T\boldsymbol{A}\boldsymbol{p}_k}\boldsymbol{p}_k.$$
6.42. Conjugate gradient method
We can also compute the residual iteratively as
$$\boldsymbol{r}_{k+1} = \boldsymbol{b} - \boldsymbol{A}\boldsymbol{x}_{k+1},$$
which equals
$$\boldsymbol{b} - \boldsymbol{A}(\boldsymbol{x}_k + \alpha_k\boldsymbol{p}_k),$$
or
$$(\boldsymbol{b} - \boldsymbol{A}\boldsymbol{x}_k) - \alpha_k\boldsymbol{A}\boldsymbol{p}_k,$$
which gives
$$\boldsymbol{r}_{k+1} = \boldsymbol{r}_k - \alpha_k\boldsymbol{A}\boldsymbol{p}_k.$$
6.43. Simple implementation of the Conjugate gradient algorithm
Vector ConjugateGradient(Matrix A, Vector b, Vector x0){
  int dim = x0.Dimension();
  const double tolerance = 1.0e-14;
  Vector x(dim), r(dim), v(dim), z(dim);
  double c, t, d;
  x = x0;
  r = b - A*x;
  v = r;
  c = dot(r,r);
  int i = 0;
  int IterMax = dim;
  while(i <= IterMax){
    z = A*v;
    t = c/dot(v,z);
    x = x + t*v;
    r = r - t*z;
    d = dot(r,r);
    if(sqrt(d) < tolerance)
      break;
    v = r + (d/c)*v;
    c = d;
    i++;
  }
  return x;
}
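And the corresponding NumPy sketch, again tested on the small system from the steepest descent example (for a 2x2 system CG converges in at most two steps):

import numpy as np

def conjugate_gradient(A, b, x0, tolerance=1.0e-14):
    x = x0.copy()
    r = b - A @ x        # residual = negative gradient
    v = r.copy()         # first search direction
    c = r @ r
    for i in range(len(b)):
        z = A @ v
        t = c/(v @ z)    # step length along the conjugate direction v
        x = x + t*v
        r = r - t*z
        d = r @ r
        if np.sqrt(d) < tolerance:
            break
        v = r + (d/c)*v  # new conjugate direction
        c = d
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(conjugate_gradient(A, b, np.zeros(2)))  # converges to [2, -2]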
6.44. Broyden-Fletcher-Goldfarb-Shanno algorithm
The optimization problem is to minimize $f(\boldsymbol{x})$, where $\boldsymbol{x}$ is a vector in $\mathbb{R}^n$ and $f$ is a differentiable scalar function. There are no constraints on the values that $\boldsymbol{x}$ can take.
The algorithm begins at an initial estimate for the optimal value $\boldsymbol{x}_0$ and proceeds iteratively to get a better estimate at each stage.
The search direction $p_k$ at stage $k$ is given by the solution of the analogue of the Newton equation
$$B_k p_k = -\nabla f(\boldsymbol{x}_k),$$
where $B_k$ is an approximation to the Hessian matrix which is updated iteratively at each stage, and $\nabla f(\boldsymbol{x}_k)$ is the gradient of the function evaluated at $x_k$. A line search in the direction $p_k$ is then used to find the next point $x_{k+1}$ by minimizing
$$f(\boldsymbol{x}_k + \alpha p_k)$$
over the scalar $\alpha > 0$.
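To make the update concrete, here is a minimal sketch (entirely our own) of BFGS with the standard inverse-Hessian update and a crude backtracking line search. It is for illustration only; in practice one uses scipy.optimize.minimize with method='BFGS', as in the program above.

import numpy as np

def bfgs(f, grad, x, niter=100, tol=1e-8):
    n = len(x)
    H = np.eye(n)                      # approximate inverse Hessian
    g = grad(x)
    for k in range(niter):
        p = -H @ g                     # quasi-Newton search direction
        # simple backtracking (Armijo) line search
        alpha = 1.0
        while f(x + alpha*p) > f(x) + 1e-4*alpha*(g @ p) and alpha > 1e-10:
            alpha *= 0.5
        s = alpha*p
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g
        if y @ s > 1e-12:              # curvature condition; skip the update otherwise
            rho = 1.0/(y @ s)
            I = np.eye(n)
            H = (I - rho*np.outer(s, y)) @ H @ (I - rho*np.outer(y, s)) + rho*np.outer(s, s)
        x, g = x_new, g_new
        if np.linalg.norm(g) < tol:
            break
    return x

# test on the quadratic f(x) = 0.5 x^T A x - x^T b from the CG example
A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
f = lambda x: 0.5*x @ A @ x - x @ b
grad = lambda x: A @ x - b
print(bfgs(f, grad, np.zeros(2)))      # converges to [2, -2]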
6.45. Stochastic Gradient Descent
Stochastic gradient descent (SGD) and variants thereof address some of the shortcomings of the Gradient descent method discussed above.
The underlying idea of SGD comes from the observation that a given function, which we want to minimize, can almost always be written as a sum over $n$ data points $\{\boldsymbol{x}_i\}_{i=1}^n$,
$$C(\boldsymbol{\beta}) = \sum_{i=1}^n c_i(\boldsymbol{x}_i, \boldsymbol{\beta}).$$
6.46. Computation of gradients
This in turn means that the gradient can be computed as a sum over $i$-gradients,
$$\nabla_{\beta} C(\boldsymbol{\beta}) = \sum_{i=1}^n \nabla_{\beta} c_i(\boldsymbol{x}_i, \boldsymbol{\beta}).$$
Stochasticity/randomness is introduced by only taking the gradient on a subset of the data called minibatches. If there are $n$ data points and the size of each minibatch is $M$, there will be $n/M$ minibatches. We denote these minibatches by $B_k$, where $k = 1, \cdots, n/M$.
6.47. SGD example
As an example, suppose we have $n = 10$ data points and we choose minibatches of size $M = 5$. We then have $n/M = 2$ minibatches, for example $B_1 = (\boldsymbol{x}_1, \dots, \boldsymbol{x}_5)$ and $B_2 = (\boldsymbol{x}_6, \dots, \boldsymbol{x}_{10})$.
The idea is now to approximate the gradient by replacing the sum over all data points with a sum over the data points in one of the minibatches, picked at random in each gradient descent step,
$$\nabla_{\beta} C(\boldsymbol{\beta}) = \sum_{i=1}^n \nabla_{\beta} c_i(\boldsymbol{x}_i, \boldsymbol{\beta}) \rightarrow \sum_{i \in B_k} \nabla_{\beta} c_i(\boldsymbol{x}_i, \boldsymbol{\beta}).$$
6.48. The gradient step
Thus a gradient descent step now looks like
$$\boldsymbol{\beta}_{j+1} = \boldsymbol{\beta}_j - \gamma_j \sum_{i \in B_k} \nabla_{\beta} c_i(\boldsymbol{x}_i, \boldsymbol{\beta}),$$
where $k$ is picked at random with equal probability from $[1, n/M]$.
6.49. Simple example code
import numpy as np

n = 100            # number of datapoints
M = 5              # size of each minibatch
m = int(n/M)       # number of minibatches
n_epochs = 10      # number of epochs

j = 0
for epoch in range(1,n_epochs+1):
    for i in range(m):
        k = np.random.randint(m)  # Pick the k-th minibatch at random
        # Compute the gradient using the data in minibatch Bk
        # Compute new suggestion for beta
        j += 1

Taking the gradient only on a subset of the data has two important benefits. First, it introduces randomness which decreases the chance that our optimization scheme gets stuck in a local minimum. Second, if the size of the minibatches is small relative to the number of datapoints ($M < n$), the computation of the gradient is much cheaper, since we sum over the data points in a given minibatch and not over all $n$ data points.
6.50. When do we stop?
A natural question is when do we stop the search for a new minimum? One possibility is to compute the full gradient after a given number of epochs and check if the norm of the gradient is smaller than some threshold and stop if true. However, the condition that the gradient is zero is valid also for local minima, so this would only tell us that we are close to a local/global minimum. However, we could also evaluate the cost function at this point, store the result and continue the search. If the test kicks in at a later stage we can compare the values of the cost function and keep the $\boldsymbol{\beta}$ that gave the lowest value.
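A sketch of this bookkeeping (all names and the toy stand-ins are our own; the plain gradient steps inside sgd_epoch stand in for the problem-specific minibatch updates):

import numpy as np

# toy stand-ins: C(beta) = |beta|^2 with full gradient 2*beta
cost = lambda beta: beta @ beta
grad_full = lambda beta: 2*beta

def sgd_epoch(beta, eta=0.05):
    for i in range(10):
        beta = beta - eta*grad_full(beta)  # replace with minibatch updates
    return beta

beta = np.array([1.0, 2.0])
best_beta, best_cost = beta, cost(beta)
for epoch in range(100):
    beta = sgd_epoch(beta)
    # every 10 epochs: check the norm of the full gradient
    if (epoch+1) % 10 == 0 and np.linalg.norm(grad_full(beta)) < 1e-3:
        # close to a (possibly local) minimum: keep the best beta seen so far
        if cost(beta) < best_cost:
            best_beta, best_cost = beta, cost(beta)
print(best_beta, best_cost)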
6.51. Slightly different approach
Another approach is to let the step length $\gamma_j$ depend on the number of epochs, in such a way that it becomes very small after a reasonable time, so that we essentially stop moving.
As an example, let $e = 0,1,2,3,\cdots$ denote the current epoch and let $t_0, t_1 > 0$ be two fixed numbers. Furthermore, let $t = e \cdot m + i$, where $m$ is the number of minibatches and $i = 0, \cdots, m-1$. Then the function
$$\gamma_j(t) = \frac{t_0}{t + t_1}$$
goes to zero as the number of epochs gets large.
In this way we can fix the number of epochs, compute $\boldsymbol{\beta}$ and evaluate the cost function at the end. Repeating the computation will give a different result since the scheme is random by design; then we pick the final $\boldsymbol{\beta}$ that gives the lowest value of the cost function.
import numpy as np

def step_length(t,t0,t1):
    return t0/(t+t1)

n = 100            # number of datapoints
M = 5              # size of each minibatch
m = int(n/M)       # number of minibatches
n_epochs = 500     # number of epochs
t0 = 1.0
t1 = 10

gamma_j = t0/t1
j = 0
for epoch in range(1,n_epochs+1):
    for i in range(m):
        k = np.random.randint(m)  # Pick the k-th minibatch at random
        # Compute the gradient using the data in minibatch Bk
        # Compute new suggestion for beta
        t = epoch*m+i
        gamma_j = step_length(t,t0,t1)
        j += 1

print("gamma_j after %d epochs: %g" % (n_epochs,gamma_j))
gamma_j after 500 epochs: 9.97108e-05
6.52. Program for stochastic gradient
# Importing various packages
from math import exp, sqrt
from random import random, seed
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
x = 2*np.random.rand(100,1)
y = 4+3*x+np.random.randn(100,1)

xb = np.c_[np.ones((100,1)), x]
theta_linreg = np.linalg.inv(xb.T.dot(xb)).dot(xb.T).dot(y)
print("Own inversion")
print(theta_linreg)
# max_iter is the modern name of the old n_iter keyword in scikit-learn
sgdreg = SGDRegressor(max_iter = 50, penalty=None, eta0=0.1)
sgdreg.fit(x,y.ravel())
print("sgdreg from scikit")
print(sgdreg.intercept_, sgdreg.coef_)

theta = np.random.randn(2,1)
eta = 0.1
Niterations = 1000
m = 100

for iter in range(Niterations):
    gradients = 2.0/m*xb.T.dot(xb.dot(theta)-y)
    theta -= eta*gradients
print("theta from own gd")
print(theta)

xnew = np.array([[0],[2]])
xbnew = np.c_[np.ones((2,1)), xnew]
ypredict = xbnew.dot(theta)
ypredict2 = xbnew.dot(theta_linreg)

n_epochs = 50
t0, t1 = 5, 50
m = 100
def learning_schedule(t):
    return t0/(t+t1)

theta = np.random.randn(2,1)

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)
        xi = xb[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradients = 2 * xi.T.dot(xi.dot(theta)-yi)
        eta = learning_schedule(epoch*m+i)
        theta = theta - eta*gradients
print("theta from own sgd")
print(theta)

plt.plot(xnew, ypredict, "r-")
plt.plot(xnew, ypredict2, "b-")
plt.plot(x, y ,'ro')
plt.axis([0,2.0,0, 15.0])
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.title(r'Random numbers ')
plt.show()
Own inversion
[[4.29601551]
[2.72779233]]
6.53. Using gradient descent methods, limitations
Gradient descent (GD) finds local minima of our function. Since the GD algorithm is deterministic, if it converges, it will converge to a local minimum of our energy function. Because in ML we are often dealing with extremely rugged landscapes with many local minima, this can lead to poor performance.
GD is sensitive to initial conditions. One consequence of the local nature of GD is that initial conditions matter. Depending on where one starts, one will end up at a different local minimum. Therefore, it is very important to think about how one initializes the training process. This is true for GD as well as more complicated variants of GD.
Gradients are computationally expensive to calculate for large datasets. In many cases in statistics and ML, the energy function is a sum of terms, with one term for each data point. For example, in linear regression the cost function is proportional to the sum of squared errors,
$$E \propto \sum_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2;$$
for logistic regression, the square error is replaced by the cross entropy. To calculate the gradient we have to sum over all data points. Doing this at every GD step becomes extremely computationally expensive. An ingenious solution to this is to calculate the gradients using small subsets of the data called "mini batches". This has the added benefit of introducing stochasticity into our algorithm.
GD is very sensitive to choices of learning rates. If the learning rate is very small, the training process takes an extremely long time. For larger learning rates, GD can diverge and give poor results. Furthermore, depending on what the local landscape looks like, we have to modify the learning rates to ensure convergence. Ideally, we would adaptively choose the learning rates to match the landscape.
GD treats all directions in parameter space uniformly. Another major drawback of GD is that, unlike Newton's method, the learning rate for GD is the same in all directions in parameter space. For this reason, the maximum learning rate is set by the behavior of the steepest direction, and this can significantly slow down training. Ideally, we would like to take large steps in flat directions and small steps in steep directions. Since we are exploring rugged landscapes where curvatures change, this requires us to keep track of not only the gradient but also second derivatives. The ideal scenario would be to calculate the Hessian, but this proves to be too computationally expensive.
GD can take exponential time to escape saddle points, even with random initialization. As we mentioned, GD is extremely sensitive to initial conditions, since they determine the particular local minimum GD would eventually reach. However, even with a good initialization scheme, GD can still take exponential time to escape saddle points.
6.54. Codes from numerical recipes
You can however use codes we have adapted from the text Numerical Recipes in C++, see chapter 10.7.
Here we present a program, which you can also find at the webpage of the course, where we use the functions dfpmin and lnsrch. This is a variant of the Broyden et al. algorithm discussed in the previous slide.
The program uses the harmonic oscillator in one dimension as an example.
The program does not use armadillo to handle vectors and matrices, but rather employs my own vector-matrix class. These auxiliary functions, and the main program model.cpp, can all be found under the program link here.
Below we show only excerpts from the main program. For the full program, see the above link.
6.55. Finding the minimum of the harmonic oscillator model in one dimension
//   Main function begins here
int main()
{
  int n, iter;
  double gtol, fret;
  double alpha;
  n = 1;
  //   reserve space in memory for vectors containing the variational
  //   parameters
  Vector g(n), p(n);
  cout << "Read in guess for alpha" << endl;
  cin >> alpha;
  gtol = 1.0e-5;
  //   now call dfpmin and compute the minimum
  p(0) = alpha;
  dfpmin(p, n, gtol, &iter, &fret, Efunction, dEfunction);
  cout << "Value of energy minimum = " << fret << endl;
  cout << "Number of iterations = " << iter << endl;
  cout << "Value of alpha at minimum = " << p(0) << endl;
  return 0;
}  // end of main program
6.56. Functions to observe
The functions Efunction and dEfunction compute the expectation value of the energy and its derivative. They use the quasi-Newton method of Broyden, Fletcher, Goldfarb, and Shanno (BFGS). It uses the first derivatives only. The BFGS algorithm has proven good performance even for non-smooth optimizations. These functions need to be changed when you want to use your own derivatives.
// this function defines the expectation value of the local energy
double Efunction(Vector &x)
{
  double value = x(0)*x(0)*0.5+1.0/(8*x(0)*x(0));
  return value;
} // end of function to evaluate

// this function defines the derivative of the energy
void dEfunction(Vector &x, Vector &g)
{
  g(0) = x(0)-1.0/(4*x(0)*x(0)*x(0));
} // end of function to evaluate
You need to change these functions in order to compute the local energy for your system. I used 1000 cycles per call to get a new value of $\langle E_L[\alpha]\rangle$.
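For a pure Python alternative, the same minimization can be sketched with scipy; the closed-form energy and its derivative below are taken directly from the two C++ functions above (the rest is our own sketch):

import numpy as np
from scipy.optimize import minimize

def Efunction(x):
    # expectation value of the local energy, as in the C++ code above
    return x[0]*x[0]*0.5 + 1.0/(8*x[0]*x[0])

def dEfunction(x):
    return np.array([x[0] - 1.0/(4*x[0]**3)])

res = minimize(Efunction, np.array([1.0]), method='BFGS', jac=dEfunction, options={'gtol': 1e-5})
print(res.x, res.fun)  # minimum at alpha = 1/sqrt(2) with energy 0.5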