Week 37: Gradient descent methods
Contents
Plans for week 37, lecture Monday
Readings and Videos:
Material for lecture Monday September 8
Gradient descent and revisiting Ordinary Least Squares from last week
Gradient descent example
The derivative of the cost/loss function
The Hessian matrix
Simple program
Gradient Descent Example
Gradient descent and Ridge
The Hessian matrix for Ridge Regression
Program example for gradient descent with Ridge Regression
Using gradient descent methods, limitations
Momentum based GD
Improving gradient descent with momentum
Same code but now with momentum gradient descent
Overview video on Stochastic Gradient Descent (SGD)
Batches and mini-batches
Pros and cons
Convergence rates
Accuracy
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent
Computation of gradients
SGD example
The gradient step
Simple example code
When do we stop?
Slightly different approach
Time decay rate
Code with a Number of Minibatches which varies
Replace or not
SGD vs Full-Batch GD: Convergence Speed and Memory Comparison
Theoretical Convergence Speed and convex optimization
Strongly Convex Case
Non-Convex Problems
Memory Usage and Scalability
Empirical Evidence: Convergence Time and Memory in Practice
Deep Neural Networks
Memory constraints
Second moment of the gradient
Challenge: Choosing a Fixed Learning Rate
Motivation for Adaptive Step Sizes
AdaGrad algorithm, taken from "Goodfellow et al":"https://www.deeplearningbook.org/contents/optimization.html"
Derivation of the AdaGrad Algorithm
AdaGrad Update Rule Derivation
AdaGrad Properties
RMSProp: Adaptive Learning Rates
RMSProp algorithm, taken from "Goodfellow et al":"https://www.deeplearningbook.org/contents/optimization.html"
Adam Optimizer
"ADAM optimizer":"https://arxiv.org/abs/1412.6980"
Why Combine Momentum and RMSProp?
Adam: Exponential Moving Averages (Moments)
Adam: Bias Correction
Adam: Update Rule Derivation
Adam vs. AdaGrad and RMSProp
Adaptivity Across Dimensions
ADAM algorithm, taken from "Goodfellow et al":"https://www.deeplearningbook.org/contents/optimization.html"
Algorithms and codes for Adagrad, RMSprop and Adam
Practical tips
Sneaking in automatic differentiation using Autograd
Same code but now with momentum gradient descent
Including Stochastic Gradient Descent with Autograd
Same code but now with momentum gradient descent
But none of these can compete with Newton's method
Similar (second order function now) problem but now with AdaGrad
RMSprop for adaptive learning rate with Stochastic Gradient Descent
And finally "ADAM":"https://arxiv.org/pdf/1412.6980.pdf"
Material for the lab sessions
Reminder on different scaling methods
Functionality in Scikit-Learn
More preprocessing
Frequently used scaling functions
Week 37: Gradient descent methods
Morten Hjorth-Jensen
Department of Physics, University of Oslo, Norway
September 8-12, 2025