Week 38: Logistic Regression and Optimization#
Morten Hjorth-Jensen, Department of Physics and Center for Computing in Science Education, University of Oslo and Department of Physics and Astronomy and Facility for Rare Isotope Beams, Michigan State University
Date: September 16-20, 2024
Plans for week 38, lecture Monday September 16#
Material for the lecture on Monday September 16.
Logistic regression as our first encounter of classification methods. From binary cases to several categories.
Start gradient and optimization methods
Whiteboard notes at CompPhysics/MachineLearning
Suggested reading and videos#
Readings and Videos:
Hastie et al 4.1, 4.2 and 4.3 on logistic regression
Raschka et al, pages 53-76 on Logistic regression and pages 37-52 on gradient optimization
For a good discussion on gradient methods, see Goodfellow et al section 4.3-4.5 and chapter 8. We will come back to the latter chapter in our discussion of Neural networks as well.
Plans for the lab sessions#
Material for the active learning sessions on Tuesday and Wednesday.
Repetition from last week on the bias-variance tradeoff
Resampling techniques, cross-validation examples included here, see also the lectures from last week on the bootstrap method
Exercise for week 38 on the bias-variance tradeoff, see also the video from the lab session from week 37 at https://youtu.be/omLmp_kkie0
Work on project 1, in particular resampling methods like cross-validation and bootstrap.
Material for lecture Monday September 16#
Logistic Regression#
In linear regression our main interest was centered on learning the
coefficients of a functional fit (say a polynomial) in order to be
able to predict the response of a continuous variable on some unseen
data. The fit to the continuous variable
Classification problems#
Classification problems, however, are concerned with outcomes taking the form of discrete variables (i.e. categories). We may for example, on the basis of DNA sequencing for a number of patients, like to find out which mutations are important for a certain disease; or based on scans of various patients’ brains, figure out if there is a tumor or not; or given a specific physical system, we’d like to identify its state, say whether it is an ordered or disordered system (typical situation in solid state physics); or classify the status of a patient, whether she/he has a stroke or not and many other similar situations.
The most common situation we encounter when we apply logistic regression is that of two possible outcomes, normally denoted as a binary outcome, true or false, positive or negative, success or failure etc.
Optimization and Deep learning#
Logistic regression will also serve as our stepping stone towards
neural network algorithms and supervised deep learning. For logistic
regression, the minimization of the cost function leads to a non-linear
equation in the parameters
We note also that many of the topics discussed here on logistic regression are also commonly used in modern supervised Deep Learning models, as we will see later.
Basics#
We consider the case where the dependent variables, also called the
responses or the outcomes,
The goal is to predict the
output classes from the design matrix
Let us specialize to the case of two classes only, with outputs
Linear classifier#
Before moving to the logistic model, let us try to use our linear
regression model to classify these two outcomes. We could for example
fit a linear model to the default case if
We would then have our weighted linear combination, namely
where
Some selected properties#
The main problem with our function is that it takes values on the
entire real axis. In the case of logistic regression, however, the
labels
One simple way to get a discrete output is to have sign
functions that map the output of a linear regressor to values
Historically it is called the perceptron model in the machine learning literature. This model is extremely simple. However, in many cases it is more favorable to use a "soft" classifier that outputs the probability of a given category. This leads us to the logistic function.
Simple example#
The following example on data for coronary heart disease (CHD) as function of age may serve as an illustration. In the code here we read and plot whether a person has had CHD (output = 1) or not (output = 0). This output is plotted against the person's age. Clearly, the figure shows that attempting to make a standard linear regression fit may not be very meaningful.
%matplotlib inline
# Common imports
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.metrics import mean_squared_error
from IPython.display import display
from pylab import plt, mpl
# The old 'seaborn' style name was removed in newer matplotlib releases
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'default')
mpl.rcParams['font.family'] = 'serif'
# Where to save the figures and data files
PROJECT_ROOT_DIR = "Results"
FIGURE_ID = "Results/FigureFiles"
DATA_ID = "DataFiles/"
if not os.path.exists(PROJECT_ROOT_DIR):
os.mkdir(PROJECT_ROOT_DIR)
if not os.path.exists(FIGURE_ID):
os.makedirs(FIGURE_ID)
if not os.path.exists(DATA_ID):
os.makedirs(DATA_ID)
def image_path(fig_id):
return os.path.join(FIGURE_ID, fig_id)
def data_path(dat_id):
return os.path.join(DATA_ID, dat_id)
def save_fig(fig_id):
plt.savefig(image_path(fig_id) + ".png", format='png')
infile = open(data_path("chddata.csv"),'r')
# Read the chd data as csv file and organize the data into arrays with age group, age, and chd
chd = pd.read_csv(infile, names=('ID', 'Age', 'Agegroup', 'CHD'))
chd.columns = ['ID', 'Age', 'Agegroup', 'CHD']
output = chd['CHD']
age = chd['Age']
agegroup = chd['Agegroup']
numberID = chd['ID']
display(chd)
plt.scatter(age, output, marker='o')
plt.axis([18,70.0,-0.1, 1.2])
plt.xlabel(r'Age')
plt.ylabel(r'CHD')
plt.title(r'Age distribution and Coronary heart disease')
plt.show()
Plotting the mean value for each group#
What we could attempt however is to plot the mean value for each group.
agegroupmean = np.array([0.1, 0.133, 0.250, 0.333, 0.462, 0.625, 0.765, 0.800])
group = np.array([1, 2, 3, 4, 5, 6, 7, 8])
plt.plot(group, agegroupmean, "r-")
plt.axis([0,9,0, 1.0])
plt.xlabel(r'Age group')
plt.ylabel(r'CHD mean values')
plt.title(r'Mean values for each age group')
plt.show()
We are now trying to find a function
This expression implies however that
The logistic function#
Another widely studied model is the so-called
perceptron model, which is an example of a “hard classification” model. We
will encounter this model when we discuss neural networks as
well. Each datapoint is deterministically assigned to a category (i.e
Note that
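For reference, the logistic (or sigmoid) function that gives the model its name reads

$$ p(t) = \frac{1}{1+e^{-t}} = \frac{e^{t}}{1+e^{t}}, $$

which maps any real argument $t$ to the interval $(0,1)$ and is the smooth ("soft") counterpart of the hard sign/step classifier discussed above.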
Examples of likelihood functions used in logistic regression and neural networks#
The following code plots the logistic function, the step function and other functions we will encounter from here and on.
"""The sigmoid function (or the logistic curve) is a
function that takes any real number, z, and outputs a number in the interval (0,1).
It is useful in neural networks for assigning weights on a relative scale.
The value z is the weighted sum of parameters involved in the learning algorithm."""
import numpy
import matplotlib.pyplot as plt
import math as mt
z = numpy.arange(-5, 5, .1)
sigma_fn = numpy.vectorize(lambda z: 1/(1+numpy.exp(-z)))
sigma = sigma_fn(z)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(z, sigma)
ax.set_ylim([-0.1, 1.1])
ax.set_xlim([-5,5])
ax.grid(True)
ax.set_xlabel('z')
ax.set_title('sigmoid function')
plt.show()
"""Step Function"""
z = numpy.arange(-5, 5, .02)
step_fn = numpy.vectorize(lambda z: 1.0 if z >= 0.0 else 0.0)
step = step_fn(z)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(z, step)
ax.set_ylim([-0.5, 1.5])
ax.set_xlim([-5,5])
ax.grid(True)
ax.set_xlabel('z')
ax.set_title('step function')
plt.show()
"""tanh Function"""
z = numpy.arange(-2*mt.pi, 2*mt.pi, 0.1)
t = numpy.tanh(z)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(z, t)
ax.set_ylim([-1.0, 1.0])
ax.set_xlim([-2*mt.pi,2*mt.pi])
ax.grid(True)
ax.set_xlabel('z')
ax.set_title('tanh function')
plt.show()
Two parameters#
We assume now that we have two classes with
where
Note that we used
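In the standard two-parameter formulation, the probabilities for the two outcomes can be written as

$$ p(y_i=1\mid x_i,\boldsymbol{\beta}) = \frac{\exp{(\beta_0+\beta_1 x_i)}}{1+\exp{(\beta_0+\beta_1 x_i)}}, \qquad p(y_i=0\mid x_i,\boldsymbol{\beta}) = 1-p(y_i=1\mid x_i,\boldsymbol{\beta}). $$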
Maximum likelihood#
In order to define the total likelihood for all possible outcomes from a
dataset
from which we obtain the log-likelihood and our cost/loss function
The cost function rewritten#
Reordering the logarithms, we can rewrite the cost/loss function as
The maximum likelihood estimator is defined as the set of parameters that maximize the log-likelihood where we maximize with respect to
This equation is known in statistics as the cross entropy. Finally, we note that, just as in linear regression,
in practice we often supplement the cross entropy with additional regularization terms, usually of the $L_1$ or $L_2$ type as in Lasso and Ridge regression.
Minimizing the cross entropy#
The cross entropy is a convex function of the weights
Minimizing this
cost function with respect to the two parameters
and
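As a minimal sketch (assuming synthetic one-dimensional data and an ad hoc learning rate, both introduced only for illustration), the code below evaluates the cross entropy for the two-parameter model and minimizes it with plain gradient descent.
import numpy as np
# Synthetic one-dimensional binary data, used only to make the sketch runnable
np.random.seed(2024)
n = 200
x = np.random.randn(n)
y = (x + 0.5*np.random.randn(n) > 0).astype(float)
def prob(beta0, beta1, x):
    # Logistic (sigmoid) probability for the outcome y=1
    return 1.0/(1.0 + np.exp(-(beta0 + beta1*x)))
def cross_entropy(beta0, beta1, x, y):
    p = prob(beta0, beta1, x)
    return -np.mean(y*np.log(p) + (1 - y)*np.log(1 - p))
# Plain gradient descent on the two parameters
beta0, beta1 = 0.0, 0.0
eta = 0.1  # learning rate, an ad hoc choice for this illustration
for _ in range(2000):
    p = prob(beta0, beta1, x)
    beta0 -= eta*np.mean(p - y)        # derivative of the mean cross entropy wrt beta0
    beta1 -= eta*np.mean((p - y)*x)    # derivative wrt beta1
print(beta0, beta1, cross_entropy(beta0, beta1, x, y))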
A more compact expression#
Let us now define a vector
If we in addition define a diagonal matrix
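For reference, with $\boldsymbol{p}$ denoting the vector of fitted probabilities $p(y_i=1\mid x_i,\boldsymbol{\beta})$ and $\boldsymbol{W}$ the diagonal matrix with elements $p_i(1-p_i)$, the standard compact expressions for the first and second derivatives of the cross entropy read

$$ \frac{\partial \mathcal{C}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = -\boldsymbol{X}^T\left(\boldsymbol{y}-\boldsymbol{p}\right), \qquad \frac{\partial^2 \mathcal{C}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^T} = \boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X}. $$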
Extending to more predictors#
Within a binary classification problem, we can easily expand our model to include multiple predictors. Our ratio between likelihoods is then with
Here we defined
Including more classes#
Till now we have mainly focused on two classes, the so-called binary
system. Suppose we wish to extend to
and
and so on till the class
and the model is specified in term of
More classes#
In our discussion of neural networks we will encounter the above again in terms of a slightly modified function, the so-called Softmax function.
The softmax function is used in various multiclass classification
methods, such as multinomial logistic regression (also known as
softmax regression), multiclass linear discriminant analysis, naive
Bayes classifiers, and artificial neural networks. Specifically, in
multinomial logistic regression and linear discriminant analysis, the
input to the function is the result of
It is easy to extend to more predictors. The final class is
and they sum to one. Our earlier discussions were all specialized to the case with two classes only. It is easy to see from the above that what we derived earlier is compatible with these equations.
To find the optimal parameters we would typically use a gradient descent method. Newton’s method and gradient descent methods are discussed in the material on optimization methods.
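A minimal sketch of the softmax function, with a small three-class example whose numbers are arbitrary, could look as follows. For two classes it reduces to the logistic function discussed earlier.
import numpy as np
def softmax(z):
    # Subtract the maximum for numerical stability; the probabilities are unchanged
    expz = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return expz/np.sum(expz, axis=-1, keepdims=True)
# Three classes: the outputs are positive and sum to one
print(softmax(np.array([1.0, 2.0, 3.0])))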
Searching for Optimal Regularization Parameters #
In project 1, when using Ridge and Lasso regression, we end up
searching for the optimal regularization parameter $\lambda$.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import linear_model
def MSE(y_data,y_model):
n = np.size(y_model)
return np.sum((y_data-y_model)**2)/n
# A seed just to ensure that the random numbers are the same for every run.
# Useful for eventual debugging.
np.random.seed(2021)
n = 100
x = np.random.rand(n)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.randn(n)
Maxpolydegree = 5
X = np.zeros((n,Maxpolydegree-1))
for degree in range(1,Maxpolydegree): #No intercept column
X[:,degree-1] = x**(degree)
# We split the data in test and training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Decide which values of lambda to use
nlambdas = 500
MSERidgePredict = np.zeros(nlambdas)
lambdas = np.logspace(-4, 2, nlambdas)
for i in range(nlambdas):
lmb = lambdas[i]
RegRidge = linear_model.Ridge(lmb)
RegRidge.fit(X_train,y_train)
ypredictRidge = RegRidge.predict(X_test)
MSERidgePredict[i] = MSE(y_test,ypredictRidge)
# Now plot the results
plt.figure()
plt.plot(np.log10(lambdas), MSERidgePredict, 'g--', label = 'MSE SL Ridge Test')
plt.xlabel('log10(lambda)')
plt.ylabel('MSE')
plt.legend()
plt.show()
Here we have performed a rather data-greedy calculation as a function of the regularization parameter $\lambda$.
Grid Search#
An alternative is to use the so-called grid search functionality included with the library Scikit-Learn, as demonstrated for the same example here.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
def R2(y_data, y_model):
return 1 - np.sum((y_data - y_model) ** 2) / np.sum((y_data - np.mean(y_data)) ** 2)
def MSE(y_data,y_model):
n = np.size(y_model)
return np.sum((y_data-y_model)**2)/n
# A seed just to ensure that the random numbers are the same for every run.
# Useful for eventual debugging.
np.random.seed(2021)
n = 100
x = np.random.rand(n)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.randn(n)
Maxpolydegree = 5
X = np.zeros((n,Maxpolydegree-1))
for degree in range(1,Maxpolydegree): #No intercept column
X[:,degree-1] = x**(degree)
# We split the data in test and training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Decide which values of lambda to use
nlambdas = 10
lambdas = np.logspace(-4, 2, nlambdas)
# create and fit a ridge regression model, testing each alpha
model = Ridge()
gridsearch = GridSearchCV(estimator=model, param_grid=dict(alpha=lambdas))
gridsearch.fit(X_train, y_train)
print(gridsearch)
ypredictRidge = gridsearch.predict(X_test)
# summarize the results of the grid search
print(f"Best estimated lambda-value: {gridsearch.best_estimator_.alpha}")
print(f"MSE score: {MSE(y_test,ypredictRidge)}")
print(f"R2 score: {R2(y_test,ypredictRidge)}")
By default the grid search function includes cross validation with five folds. The Scikit-Learn documentation contains more information on how to set the different parameters.
If we take out the random noise, running the above codes results in
Randomized Grid Search#
An alternative to the above manual grid setup is to use a random
search where the parameters are drawn from a random distribution
(uniform below) for a fixed number of iterations. A model is
constructed and evaluated for each combination of chosen parameters.
We repeat the previous example but now with a random search. Note
that values of
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from scipy.stats import uniform as randuniform
from sklearn.model_selection import RandomizedSearchCV
def R2(y_data, y_model):
return 1 - np.sum((y_data - y_model) ** 2) / np.sum((y_data - np.mean(y_data)) ** 2)
def MSE(y_data,y_model):
n = np.size(y_model)
return np.sum((y_data-y_model)**2)/n
# A seed just to ensure that the random numbers are the same for every run.
# Useful for eventual debugging.
np.random.seed(2021)
n = 100
x = np.random.rand(n)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.randn(n)
Maxpolydegree = 5
X = np.zeros((n,Maxpolydegree-1))
for degree in range(1,Maxpolydegree): #No intercept column
X[:,degree-1] = x**(degree)
# We split the data in test and training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
param_grid = {'alpha': randuniform()}
# create and fit a ridge regression model, testing each alpha
model = Ridge()
gridsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
gridsearch.fit(X_train, y_train)
print(gridsearch)
ypredictRidge = gridsearch.predict(X_test)
# summarize the results of the grid search
print(f"Best estimated lambda-value: {gridsearch.best_estimator_.alpha}")
print(f"MSE score: {MSE(y_test,ypredictRidge)}")
print(f"R2 score: {R2(y_test,ypredictRidge)}")
Wisconsin Cancer Data#
We show here how we can apply logistic regression as our classification algorithm to the Wisconsin breast cancer data.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
# Load the data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data,cancer.target,random_state=0)
print(X_train.shape)
print(X_test.shape)
# Logistic Regression
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X_train, y_train)
print("Test set accuracy with Logistic Regression: {:.2f}".format(logreg.score(X_test,y_test)))
Using the correlation matrix#
In addition to the above scores, we could also study the covariance (and the correlation matrix). We use Pandas to compute the correlation matrix.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
cancer = load_breast_cancer()
import pandas as pd
# Making a data frame
cancerpd = pd.DataFrame(cancer.data, columns=cancer.feature_names)
fig, axes = plt.subplots(15,2,figsize=(10,20))
malignant = cancer.data[cancer.target == 0]
benign = cancer.data[cancer.target == 1]
ax = axes.ravel()
for i in range(30):
_, bins = np.histogram(cancer.data[:,i], bins =50)
ax[i].hist(malignant[:,i], bins = bins, alpha = 0.5)
ax[i].hist(benign[:,i], bins = bins, alpha = 0.5)
ax[i].set_title(cancer.feature_names[i])
ax[i].set_yticks(())
ax[0].set_xlabel("Feature magnitude")
ax[0].set_ylabel("Frequency")
ax[0].legend(["Malignant", "Benign"], loc ="best")
fig.tight_layout()
plt.show()
import seaborn as sns
correlation_matrix = cancerpd.corr().round(1)
# use the heatmap function from seaborn to plot the correlation matrix
# annot = True to print the values inside the square
plt.figure(figsize=(15,8))
sns.heatmap(data=correlation_matrix, annot=True)
plt.show()
Discussing the correlation data#
In the above example we note two things. In the first plot we display the overlap of benign and malignant tumors as functions of the various features in the Wisconsin breast cancer data set. We see that for some of the features we can clearly distinguish the benign and malignant cases, while for other features we cannot. This can indicate which features may be of greater interest when we wish to classify a tumor as benign or malignant.
In the second figure we have computed the so-called correlation
matrix, which in our case with thirty features becomes a $30\times 30$ matrix.
We constructed this matrix using pandas via the statements
cancerpd = pd.DataFrame(cancer.data, columns=cancer.feature_names)
and then
correlation_matrix = cancerpd.corr().round(1)
Diagonalizing this matrix we can in turn say something about which features are of relevance and which are not. This leads us to the classical Principal Component Analysis (PCA) theorem with applications. This will be discussed later this semester (week 43).
Other measures in classification studies: Cancer Data again#
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
# Load the data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data,cancer.target,random_state=0)
print(X_train.shape)
print(X_test.shape)
# Logistic Regression
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X_train, y_train)
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_validate
#Cross validation
accuracy = cross_validate(logreg,X_test,y_test,cv=10)['test_score']
print(accuracy)
print("Test set accuracy with Logistic Regression: {:.2f}".format(logreg.score(X_test,y_test)))
import scikitplot as skplt
y_pred = logreg.predict(X_test)
skplt.metrics.plot_confusion_matrix(y_test, y_pred, normalize=True)
plt.show()
y_probas = logreg.predict_proba(X_test)
skplt.metrics.plot_roc(y_test, y_probas)
plt.show()
skplt.metrics.plot_cumulative_gain(y_test, y_probas)
plt.show()
Optimization, the central part of any Machine Learning algorithm#
Overview Video, why do we care about gradient methods?
Almost every problem in machine learning and data science starts with
a dataset
Revisiting our Logistic Regression case#
In our discussion on Logistic Regression we studied the
case of
two classes, with
where
The equations to solve#
Our compact equations used a definition of a vector
If we in addition define a diagonal matrix
This defines what is called the Hessian matrix.
Solving using Newton-Raphson’s method#
If we can set up these equations, Newton-Raphson’s iterative method is normally the method of choice. It requires however that we can compute in an efficient way the matrices that define the first and second derivatives.
Our iterative scheme is then given by
or in matrix form as
The right-hand side is computed with the old values of
If we can compute these matrices, in particular the Hessian, the above is often the easiest method to implement.
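As a minimal sketch (with synthetic data introduced only for illustration), the Newton-Raphson iteration for logistic regression can be set up with the gradient $-\boldsymbol{X}^T(\boldsymbol{y}-\boldsymbol{p})$ and the Hessian $\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X}$ as follows.
import numpy as np
# Synthetic data, included only to make the sketch self-contained
np.random.seed(2024)
n = 200
x = np.random.randn(n)
y = (x + 0.5*np.random.randn(n) > 0).astype(float)
X = np.c_[np.ones(n), x]    # design matrix with an intercept column
beta = np.zeros(2)
for iteration in range(10):
    p = 1.0/(1.0 + np.exp(-X @ beta))       # fitted probabilities
    gradient = -X.T @ (y - p)               # first derivative of the cross entropy
    W = np.diag(p*(1.0 - p))                # diagonal weight matrix
    hessian = X.T @ W @ X                   # second derivative (the Hessian)
    beta = beta - np.linalg.solve(hessian, gradient)
print(beta)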
Brief reminder on Newton-Raphson’s method#
Let us quickly remind ourselves how we derive the above method.
Perhaps the most celebrated of all one-dimensional root-finding
routines is Newton’s method, also called the Newton-Raphson
method. This method requires the evaluation of both the
function
The equations#
The Newton-Raphson formula consists geometrically of extending the
tangent line at a current point until it crosses zero, then setting
the next guess to the abscissa of that zero-crossing. The mathematics
behind this method is rather simple. Employing a Taylor expansion for
For small enough values of the function and for well-behaved functions, the terms beyond linear are unimportant, hence we obtain
yielding
Having in mind an iterative procedure, it is natural to start iterating with
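A minimal one-dimensional sketch, assuming the test function $f(x)=x^2-2$ purely for illustration, could read as follows.
def newton_root(f, fprime, x0, tol=1e-12, maxiter=100):
    # Iterate x_{n+1} = x_n - f(x_n)/f'(x_n) until the step is small
    x = x0
    for _ in range(maxiter):
        step = f(x)/fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x
# Example: the positive root of f(x) = x**2 - 2, i.e. the square root of 2
print(newton_root(lambda x: x**2 - 2, lambda x: 2*x, x0=1.0))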
Simple geometric interpretation#
The above is Newton-Raphson’s method. It has a simple geometric
interpretation, namely
Extending to more than one variable#
Newton’s method can be generalized to systems of several non-linear equations and variables. Consider the case with two equations
which we Taylor expand to obtain
Defining the Jacobian matrix
we can rephrase Newton’s method as
where we have defined
We thus need to compute the inverse of the Jacobian matrix, and it
is easy to understand that difficulties may
arise in case the Jacobian is singular or nearly singular.
It is rather straightforward to extend the above scheme to systems of more than two non-linear equations. In our case, the Jacobian matrix is given by the Hessian that represents the second derivative of the cost function.
Steepest descent#
The basic idea of gradient descent is
that a function
It can be shown that if
with
For
More on Steepest descent#
The previous observation is the basis of the method of steepest
descent, which is also referred to as just gradient descent (GD). One
starts with an initial guess
The parameter
The ideal#
Ideally the sequence
In machine learning we are often faced with non-convex high-dimensional cost functions with many local minima. Since GD is deterministic, we will get stuck in a local minimum, if the method converges, unless we have a very good initial guess. This also implies that the scheme is sensitive to the chosen initial condition.
Note that the gradient is a function of
The sensitivity of gradient descent#
The gradient descent method
is sensitive to the choice of learning rate
Many of these shortcomings can be alleviated by introducing randomness. One such method is that of Stochastic Gradient Descent (SGD), to be discussed next week.
Convex functions#
Ideally we want our cost/loss function to be convex (or concave).
First we give the definition of a convex set: A set
The convex subsets of
Convex function#
Convex function: Let
Conditions on convex functions#
In the following we state first and second-order conditions which
ensure convexity of a function
First order condition.
Suppose
Second order condition.
Assume that
This condition is particularly useful since it gives us a procedure for determining whether the function under consideration is convex, apart from using the definition directly.
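For reference, the two conditions can be stated as follows for a function $f$ defined on a convex domain $D$: the first-order condition (for differentiable $f$) is

$$ f(y) \geq f(x) + \nabla f(x)^T (y-x) \quad \text{for all } x,y \in D, $$

and the second-order condition (for twice differentiable $f$) is that the Hessian is positive semi-definite,

$$ \nabla^2 f(x) \succeq 0 \quad \text{for all } x \in D. $$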
More on convex functions#
The next result is of great importance to us and the reason why we are going on about convex functions. In machine learning we frequently have to minimize a loss/cost function in order to find the best parameters for the model we are considering.
Ideally we want the global minimum (for high-dimensional models it is hard to know if we have local or global minimum). However, if the cost/loss function is convex the following result provides invaluable information:
Any minimum is global for convex functions.
Consider the problem of finding
This result means that if we know that the cost/loss function is convex and we are able to find a minimum, we are guaranteed that it is a global minimum.
Some simple problems#
Show that
is convex for using the definition of convexity. Hint: if you re-write the definition, is convex if the following holds for all and any. Using the second-order condition, show that the following functions are convex on the specified domain.
is convex for . is convex for .
Let
and . Show that and is convex for . Also show that if is any convex function then is convex. A norm is any function that satisfies the following properties
for all . for all with equality if and only if
Using the definition of convexity, try to show that a function satisfying the properties above is convex (the third condition is not needed to show this).
Revisiting our first homework#
We will use linear regression as a case study for the gradient descent methods. Linear regression is a great test case for the gradient descent methods discussed in the lectures since it has several desirable properties such as:
An analytical solution (recall homework set 1).
The gradient can be computed analytically.
The cost function is convex which guarantees that gradient descent converges for small enough learning rates
We revisit an example similar to what we had in the first homework set. We had a function of the type
m = 100  # number of data points (value chosen for illustration)
x = 2*np.random.rand(m,1)
y = 4+3*x+np.random.randn(m,1)
with
such that
Gradient descent example#
Let
It is convenient to write
The cost/loss/risk function is given by (
and we want to find
The derivative of the cost/loss function#
Computing
where
The Hessian matrix#
The Hessian matrix of
This result implies that
Simple program#
We can now write a program that minimizes
We can use the expression we computed for the gradient and let use a
And finally we can compare our solution for
Gradient Descent Example#
Here our simple example
# Importing various packages
from random import random, seed
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
import sys
# the number of datapoints
n = 100
x = 2*np.random.rand(n,1)
y = 4+3*x+np.random.randn(n,1)
X = np.c_[np.ones((n,1)), x]
# Hessian matrix
H = (2.0/n)* X.T @ X
# Get the eigenvalues
EigValues, EigVectors = np.linalg.eig(H)
print(f"Eigenvalues of Hessian Matrix:{EigValues}")
beta_linreg = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_linreg)
beta = np.random.randn(2,1)
eta = 1.0/np.max(EigValues)
Niterations = 1000
for iter in range(Niterations):
gradient = (2.0/n)*X.T @ (X @ beta-y)
beta -= eta*gradient
print(beta)
xnew = np.array([[0],[2]])
xbnew = np.c_[np.ones((2,1)), xnew]
ypredict = xbnew.dot(beta)
ypredict2 = xbnew.dot(beta_linreg)
plt.plot(xnew, ypredict, "r-")
plt.plot(xnew, ypredict2, "b-")
plt.plot(x, y ,'ro')
plt.axis([0,2.0,0, 15.0])
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.title(r'Gradient descent example')
plt.show()
And a corresponding example using scikit-learn#
# Importing various packages
from random import random, seed
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
n = 100
x = 2*np.random.rand(n,1)
y = 4+3*x+np.random.randn(n,1)
X = np.c_[np.ones((n,1)), x]
beta_linreg = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(beta_linreg)
sgdreg = SGDRegressor(max_iter = 50, penalty=None, eta0=0.1)
sgdreg.fit(x,y.ravel())
print(sgdreg.intercept_, sgdreg.coef_)
Gradient descent and Ridge#
We have also discussed Ridge regression, where the loss function contains a regularization term given by the $L_2$ norm of the parameter vector.
In order to minimize
We can easily extend our program to minimize
The Hessian matrix for Ridge Regression#
The Hessian matrix of Ridge Regression for our simple example is given by
This implies that the Hessian matrix is positive definite, hence the stationary point is a minimum. Note that the Ridge cost function is convex being a sum of two convex functions. Therefore, the stationary point is a global minimum of this function.
Program example for gradient descent with Ridge Regression#
from random import random, seed
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
import sys
# the number of datapoints
n = 100
x = 2*np.random.rand(n,1)
y = 4+3*x+np.random.randn(n,1)
X = np.c_[np.ones((n,1)), x]
XT_X = X.T @ X
#Ridge parameter lambda
lmbda = 0.001
Id = n*lmbda* np.eye(XT_X.shape[0])
# Hessian matrix
H = (2.0/n)* XT_X+2*lmbda* np.eye(XT_X.shape[0])
# Get the eigenvalues
EigValues, EigVectors = np.linalg.eig(H)
print(f"Eigenvalues of Hessian Matrix:{EigValues}")
beta_linreg = np.linalg.inv(XT_X+Id) @ X.T @ y
print(beta_linreg)
# Start plain gradient descent
beta = np.random.randn(2,1)
eta = 1.0/np.max(EigValues)
Niterations = 100
for iter in range(Niterations):
gradients = 2.0/n*X.T @ (X @ (beta)-y)+2*lmbda*beta
beta -= eta*gradients
print(beta)
ypredict = X @ beta
ypredict2 = X @ beta_linreg
plt.plot(x, ypredict, "r-")
plt.plot(x, ypredict2, "b-")
plt.plot(x, y ,'ro')
plt.axis([0,2.0,0, 15.0])
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.title(r'Gradient descent example for Ridge')
plt.show()
Using gradient descent methods, limitations#
Gradient descent (GD) finds local minima of our function. Since the GD algorithm is deterministic, if it converges, it will converge to a local minimum of our cost/loss/risk function. Because in ML we are often dealing with extremely rugged landscapes with many local minima, this can lead to poor performance.
GD is sensitive to initial conditions. One consequence of the local nature of GD is that initial conditions matter. Depending on where one starts, one will end up at a different local minimum. Therefore, it is very important to think about how one initializes the training process. This is true for GD as well as more complicated variants of GD.
Gradients are computationally expensive to calculate for large datasets. In many cases in statistics and ML, the cost/loss/risk function is a sum of terms, with one term for each data point. For example, in linear regression the cost is a sum of squared errors over the data points; for logistic regression, the squared error is replaced by the cross entropy. To calculate the gradient we have to sum over all data points. Doing this at every GD step becomes extremely computationally expensive. An ingenious solution to this is to calculate the gradients using small subsets of the data called "mini batches". This has the added benefit of introducing stochasticity into our algorithm.
GD is very sensitive to choices of learning rates. If the learning rate is very small, the training process takes an extremely long time. For larger learning rates, GD can diverge and give poor results. Furthermore, depending on what the local landscape looks like, we have to modify the learning rates to ensure convergence. Ideally, we would adaptively choose the learning rates to match the landscape.
GD treats all directions in parameter space uniformly. Another major drawback of GD is that unlike Newton’s method, the learning rate for GD is the same in all directions in parameter space. For this reason, the maximum learning rate is set by the behavior of the steepest direction and this can significantly slow down training. Ideally, we would like to take large steps in flat directions and small steps in steep directions. Since we are exploring rugged landscapes where curvatures change, this requires us to keep track of not only the gradient but second derivatives. The ideal scenario would be to calculate the Hessian but this proves to be too computationally expensive.
GD can take exponential time to escape saddle points, even with random initialization. As we mentioned, GD is extremely sensitive to initial condition since it determines the particular local minimum GD would eventually reach. However, even with a good initialization scheme, through the introduction of randomness, GD can still take exponential time to escape saddle points.
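As a minimal sketch of the mini-batch idea mentioned above (the batch size and learning rate below are ad hoc choices), the gradient can be evaluated on random subsets of the data for the same linear-regression example used earlier.
import numpy as np
np.random.seed(2024)
n = 100
x = 2*np.random.rand(n,1)
y = 4 + 3*x + np.random.randn(n,1)
X = np.c_[np.ones((n,1)), x]
beta = np.random.randn(2,1)
eta = 0.1          # learning rate, an ad hoc choice
batch_size = 10    # assumed mini-batch size
n_epochs = 100
for epoch in range(n_epochs):
    indices = np.random.permutation(n)
    for start in range(0, n, batch_size):
        batch = indices[start:start+batch_size]
        Xb, yb = X[batch], y[batch]
        # gradient of the OLS cost evaluated on the mini batch only
        gradient = (2.0/batch_size)*Xb.T @ (Xb @ beta - yb)
        beta -= eta*gradient
print(beta)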
Challenge yourself the coming weekend#
Write a code which implements gradient descent for a logistic regression example.
Lab session: Material from last week and relevant for the first project#
Various steps in cross-validation#
When the repetitive splitting of the data set is done randomly,
samples may accidentally end up in a vast majority of the splits in
either training or test set. Such samples may have an unbalanced
influence on either model building or prediction evaluation. To avoid
this
How to set up the cross-validation for Ridge and/or Lasso#
Define a range of interest for the penalty parameter.
Divide the data set into training and test set comprising samples and , respectively.
Fit the linear regression model by means of for example Ridge or Lasso regression for each in the grid using the training set, and the corresponding estimate of the error variance , as
Evaluate the prediction performance of these models on the test set by . Or, by the prediction error, the relative error, the error squared or the R2 score function.
Repeat the first three steps such that each sample plays the role of the test set once.
Average the prediction performances of the test sets at each grid point of the penalty parameter. This is an estimate of the prediction performance of the model corresponding to this value of the penalty parameter on novel data.
Cross-validation in brief#
For the various values of
shuffle the dataset randomly.
Split the dataset into $k$ groups.
For each unique group:
a. Decide which group to use as set for test data
b. Take the remaining groups as a training data set
c. Fit a model on the training set and evaluate it on the test set
d. Retain the evaluation score and discard the model
Summarize the model using the sample of model evaluation scores
Code Example for Cross-validation and k-fold Cross-validation#
The code here uses Ridge regression with cross-validation (CV) resampling and k-fold CV.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
# A seed just to ensure that the random numbers are the same for every run.
# Useful for eventual debugging.
np.random.seed(3155)
# Generate the data.
nsamples = 100
x = np.random.randn(nsamples)
y = 3*x**2 + np.random.randn(nsamples)
## Cross-validation on Ridge regression using KFold only
# Decide degree on polynomial to fit
poly = PolynomialFeatures(degree = 6)
# Decide which values of lambda to use
nlambdas = 500
lambdas = np.logspace(-3, 5, nlambdas)
# Initialize a KFold instance
k = 5
kfold = KFold(n_splits = k)
# Perform the cross-validation to estimate MSE
scores_KFold = np.zeros((nlambdas, k))
i = 0
for lmb in lambdas:
ridge = Ridge(alpha = lmb)
j = 0
for train_inds, test_inds in kfold.split(x):
xtrain = x[train_inds]
ytrain = y[train_inds]
xtest = x[test_inds]
ytest = y[test_inds]
Xtrain = poly.fit_transform(xtrain[:, np.newaxis])
ridge.fit(Xtrain, ytrain[:, np.newaxis])
Xtest = poly.fit_transform(xtest[:, np.newaxis])
ypred = ridge.predict(Xtest)
scores_KFold[i,j] = np.sum((ypred - ytest[:, np.newaxis])**2)/np.size(ypred)
j += 1
i += 1
estimated_mse_KFold = np.mean(scores_KFold, axis = 1)
## Cross-validation using cross_val_score from sklearn along with KFold
# kfold is an instance initialized above as:
# kfold = KFold(n_splits = k)
estimated_mse_sklearn = np.zeros(nlambdas)
i = 0
for lmb in lambdas:
ridge = Ridge(alpha = lmb)
X = poly.fit_transform(x[:, np.newaxis])
estimated_mse_folds = cross_val_score(ridge, X, y[:, np.newaxis], scoring='neg_mean_squared_error', cv=kfold)
# cross_val_score returns an array containing the estimated negative mse for every fold.
# we have to take the mean of every array in order to get an estimate of the mse of the model
estimated_mse_sklearn[i] = np.mean(-estimated_mse_folds)
i += 1
## Plot and compare the slightly different ways to perform cross-validation
plt.figure()
plt.plot(np.log10(lambdas), estimated_mse_sklearn, label = 'cross_val_score')
plt.plot(np.log10(lambdas), estimated_mse_KFold, 'r--', label = 'KFold')
plt.xlabel('log10(lambda)')
plt.ylabel('mse')
plt.legend()
plt.show()