6. Logistic Regression#
6.1. Logistic Regression#
In linear regression our main interest was centered on learning the
coefficients of a functional fit (say a polynomial) in order to be
able to predict the response of a continuous variable on some unseen
data. The fit to the continuous variable
Classification problems, however, are concerned with outcomes taking the form of discrete variables (i.e. categories). We may for example, on the basis of DNA sequencing for a number of patients, like to find out which mutations are important for a certain disease; or based on scans of various patients’ brains, figure out if there is a tumor or not; or given a specific physical system, we’d like to identify its state, say whether it is an ordered or disordered system (typical situation in solid state physics); or classify the status of a patient, whether she/he has a stroke or not and many other similar situations.
The most common situation we encounter when we apply logistic regression is that of two possible outcomes, normally denoted as a binary outcome, true or false, positive or negative, success or failure etc.
Logistic regression will also serve as our stepping stone towards
neural network algorithms and supervised deep learning. For logistic
learning, the minimization of the cost function leads to a non-linear
equation in the parameters
We note also that many of the topics discussed here on logistic regression are also commonly used in modern supervised Deep Learning models, as we will see later.
6.2. Basics#
We consider the case where the dependent variables, also called the
responses or the outcomes,
The goal is to predict the
output classes from the design matrix
Let us specialize to the case of two classes only, with outputs
Before moving to the logistic model, let us try to use our linear
regression model to classify these two outcomes. We could for example
fit a linear model to the default case if
We would then have our weighted linear combination, namely
where
The main problem with our function is that it takes values on the
entire real axis. In the case of logistic regression, however, the
labels
One simple way to get a discrete output is to have sign
functions that map the output of a linear regressor to values perceptron" model in the machine learning literature. This model is extremely simple. However, in many cases it is more favorable to use a
soft” classifier that outputs
the probability of a given category. This leads us to the logistic function.
The following example on data for coronary heart disease (CHD) as function of age may serve as an illustration. In the code here we read and plot whether a person has had CHD (output = 1) or not (output = 0). This ouput is plotted the person’s against age. Clearly, the figure shows that attempting to make a standard linear regression fit may not be very meaningful.
%matplotlib inline
# Common imports
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.metrics import mean_squared_error
from IPython.display import display
from pylab import plt, mpl
plt.style.use('seaborn')
mpl.rcParams['font.family'] = 'serif'
# Where to save the figures and data files
PROJECT_ROOT_DIR = "Results"
FIGURE_ID = "Results/FigureFiles"
DATA_ID = "DataFiles/"
if not os.path.exists(PROJECT_ROOT_DIR):
os.mkdir(PROJECT_ROOT_DIR)
if not os.path.exists(FIGURE_ID):
os.makedirs(FIGURE_ID)
if not os.path.exists(DATA_ID):
os.makedirs(DATA_ID)
def image_path(fig_id):
return os.path.join(FIGURE_ID, fig_id)
def data_path(dat_id):
return os.path.join(DATA_ID, dat_id)
def save_fig(fig_id):
plt.savefig(image_path(fig_id) + ".png", format='png')
infile = open(data_path("chddata.csv"),'r')
# Read the chd data as csv file and organize the data into arrays with age group, age, and chd
chd = pd.read_csv(infile, names=('ID', 'Age', 'Agegroup', 'CHD'))
chd.columns = ['ID', 'Age', 'Agegroup', 'CHD']
output = chd['CHD']
age = chd['Age']
agegroup = chd['Agegroup']
numberID = chd['ID']
display(chd)
plt.scatter(age, output, marker='o')
plt.axis([18,70.0,-0.1, 1.2])
plt.xlabel(r'Age')
plt.ylabel(r'CHD')
plt.title(r'Age distribution and Coronary heart disease')
plt.show()
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
File ~/miniforge3/envs/myenv/lib/python3.9/site-packages/matplotlib/style/core.py:137, in use(style)
136 try:
--> 137 style = _rc_params_in_file(style)
138 except OSError as err:
File ~/miniforge3/envs/myenv/lib/python3.9/site-packages/matplotlib/__init__.py:870, in _rc_params_in_file(fname, transform, fail_on_error)
869 rc_temp = {}
--> 870 with _open_file_or_url(fname) as fd:
871 try:
File ~/miniforge3/envs/myenv/lib/python3.9/contextlib.py:119, in _GeneratorContextManager.__enter__(self)
118 try:
--> 119 return next(self.gen)
120 except StopIteration:
File ~/miniforge3/envs/myenv/lib/python3.9/site-packages/matplotlib/__init__.py:847, in _open_file_or_url(fname)
846 fname = os.path.expanduser(fname)
--> 847 with open(fname, encoding='utf-8') as f:
848 yield f
FileNotFoundError: [Errno 2] No such file or directory: 'seaborn'
The above exception was the direct cause of the following exception:
OSError Traceback (most recent call last)
Cell In[1], line 14
12 from IPython.display import display
13 from pylab import plt, mpl
---> 14 plt.style.use('seaborn')
15 mpl.rcParams['font.family'] = 'serif'
17 # Where to save the figures and data files
File ~/miniforge3/envs/myenv/lib/python3.9/site-packages/matplotlib/style/core.py:139, in use(style)
137 style = _rc_params_in_file(style)
138 except OSError as err:
--> 139 raise OSError(
140 f"{style!r} is not a valid package style, path of style "
141 f"file, URL of style file, or library style name (library "
142 f"styles are listed in `style.available`)") from err
143 filtered = {}
144 for k in style: # don't trigger RcParams.__getitem__('backend')
OSError: 'seaborn' is not a valid package style, path of style file, URL of style file, or library style name (library styles are listed in `style.available`)
What we could attempt however is to plot the mean value for each group.
agegroupmean = np.array([0.1, 0.133, 0.250, 0.333, 0.462, 0.625, 0.765, 0.800])
group = np.array([1, 2, 3, 4, 5, 6, 7, 8])
plt.plot(group, agegroupmean, "r-")
plt.axis([0,9,0, 1.0])
plt.xlabel(r'Age group')
plt.ylabel(r'CHD mean values')
plt.title(r'Mean values for each age group')
plt.show()
We are now trying to find a function
This expression implies however that
6.3. The logistic function#
Another widely studied model, is the so-called
perceptron model, which is an example of a “hard classification” model. We
will encounter this model when we discuss neural networks as
well. Each datapoint is deterministically assigned to a category (i.e
Note that
6.4. Examples of likelihood functions used in logistic regression and neural networks#
The following code plots the logistic function, the step function and other functions we will encounter from here and on.
"""The sigmoid function (or the logistic curve) is a
function that takes any real number, z, and outputs a number (0,1).
It is useful in neural networks for assigning weights on a relative scale.
The value z is the weighted sum of parameters involved in the learning algorithm."""
import numpy
import matplotlib.pyplot as plt
import math as mt
z = numpy.arange(-5, 5, .1)
sigma_fn = numpy.vectorize(lambda z: 1/(1+numpy.exp(-z)))
sigma = sigma_fn(z)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(z, sigma)
ax.set_ylim([-0.1, 1.1])
ax.set_xlim([-5,5])
ax.grid(True)
ax.set_xlabel('z')
ax.set_title('sigmoid function')
plt.show()
"""Step Function"""
z = numpy.arange(-5, 5, .02)
step_fn = numpy.vectorize(lambda z: 1.0 if z >= 0.0 else 0.0)
step = step_fn(z)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(z, step)
ax.set_ylim([-0.5, 1.5])
ax.set_xlim([-5,5])
ax.grid(True)
ax.set_xlabel('z')
ax.set_title('step function')
plt.show()
"""tanh Function"""
z = numpy.arange(-2*mt.pi, 2*mt.pi, 0.1)
t = numpy.tanh(z)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(z, t)
ax.set_ylim([-1.0, 1.0])
ax.set_xlim([-2*mt.pi,2*mt.pi])
ax.grid(True)
ax.set_xlabel('z')
ax.set_title('tanh function')
plt.show()
We assume now that we have two classes with
where
Note that we used
In order to define the total likelihood for all possible outcomes from a
dataset
from which we obtain the log-likelihood and our cost/loss function
Reordering the logarithms, we can rewrite the cost/loss function as
The maximum likelihood estimator is defined as the set of parameters that maximize the log-likelihood where we maximize with respect to
This equation is known in statistics as the cross entropy. Finally, we note that just as in linear regression,
in practice we often supplement the cross-entropy with additional regularization terms, usually
The cross entropy is a convex function of the weights
Minimizing this
cost function with respect to the two parameters
and
Let us now define a vector
If we in addition define a diagonal matrix
Within a binary classification problem, we can easily expand our model to include multiple predictors. Our ratio between likelihoods is then with
Here we defined
Till now we have mainly focused on two classes, the so-called binary
system. Suppose we wish to extend to
and
and so on till the class
and the model is specified in term of
In our discussion of neural networks we will encounter the above again in terms of a slightly modified function, the so-called Softmax function.
The softmax function is used in various multiclass classification
methods, such as multinomial logistic regression (also known as
softmax regression), multiclass linear discriminant analysis, naive
Bayes classifiers, and artificial neural networks. Specifically, in
multinomial logistic regression and linear discriminant analysis, the
input to the function is the result of
It is easy to extend to more predictors. The final class is
and they sum to one. Our earlier discussions were all specialized to the case with two classes only. It is easy to see from the above that what we derived earlier is compatible with these equations.
To find the optimal parameters we would typically use a gradient descent method. Newton’s method and gradient descent methods are discussed in the material on optimization methods.
6.5. Wisconsin Cancer Data#
We show here how we can use a simple regression case on the breast cancer data using Logistic regression as our algorithm for classification.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
# Load the data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data,cancer.target,random_state=0)
print(X_train.shape)
print(X_test.shape)
# Logistic Regression
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X_train, y_train)
print("Test set accuracy with Logistic Regression: {:.2f}".format(logreg.score(X_test,y_test)))
#now scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Logistic Regression
logreg.fit(X_train_scaled, y_train)
print("Test set accuracy Logistic Regression with scaled data: {:.2f}".format(logreg.score(X_test_scaled,y_test)))
In addition to the above scores, we could also study the covariance (and the correlation matrix). We use Pandas to compute the correlation matrix.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
cancer = load_breast_cancer()
import pandas as pd
# Making a data frame
cancerpd = pd.DataFrame(cancer.data, columns=cancer.feature_names)
fig, axes = plt.subplots(15,2,figsize=(10,20))
malignant = cancer.data[cancer.target == 0]
benign = cancer.data[cancer.target == 1]
ax = axes.ravel()
for i in range(30):
_, bins = np.histogram(cancer.data[:,i], bins =50)
ax[i].hist(malignant[:,i], bins = bins, alpha = 0.5)
ax[i].hist(benign[:,i], bins = bins, alpha = 0.5)
ax[i].set_title(cancer.feature_names[i])
ax[i].set_yticks(())
ax[0].set_xlabel("Feature magnitude")
ax[0].set_ylabel("Frequency")
ax[0].legend(["Malignant", "Benign"], loc ="best")
fig.tight_layout()
plt.show()
import seaborn as sns
correlation_matrix = cancerpd.corr().round(1)
# use the heatmap function from seaborn to plot the correlation matrix
# annot = True to print the values inside the square
plt.figure(figsize=(15,8))
sns.heatmap(data=correlation_matrix, annot=True)
plt.show()
In the above example we note two things. In the first plot we display the overlap of benign and malignant tumors as functions of the various features in the Wisconsing breast cancer data set. We see that for some of the features we can distinguish clearly the benign and malignant cases while for other features we cannot. This can point to us which features may be of greater interest when we wish to classify a benign or not benign tumour.
In the second figure we have computed the so-called correlation
matrix, which in our case with thirty features becomes a
We constructed this matrix using pandas via the statements
cancerpd = pd.DataFrame(cancer.data, columns=cancer.feature_names)
and then
correlation_matrix = cancerpd.corr().round(1)
Diagonalizing this matrix we can in turn say something about which features are of relevance and which are not. This leads us to the classical Principal Component Analysis (PCA) theorem with applications. This will be discussed later this semester (week 43).
Here we present a further way to present our results in terms of a so-called confusion matrix, the cumulative gain and the ROC curve. This way of displaying our data are based upon different ways to classify our possible outcomes. Before we proceed we need some definitions.
TP: true positive or in other words, something equivalent with a proper classification
TN: true negative, which is equivalent with a correct rejection
FP: false positive, or in simpler words something that is equivalent with a false alarm
FN: false negative, which is mean to be equivalent with a miss.
The total data set is then the sum of the true positive and true negative targets or outputs, labeled by
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
# Load the data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data,cancer.target,random_state=0)
print(X_train.shape)
print(X_test.shape)
# Logistic Regression
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X_train, y_train)
print("Test set accuracy with Logistic Regression: {:.2f}".format(logreg.score(X_test,y_test)))
#now scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Logistic Regression
logreg.fit(X_train_scaled, y_train)
print("Test set accuracy Logistic Regression with scaled data: {:.2f}".format(logreg.score(X_test_scaled,y_test)))
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_validate
#Cross validation
accuracy = cross_validate(logreg,X_test_scaled,y_test,cv=10)['test_score']
print(accuracy)
print("Test set accuracy with Logistic Regression and scaled data: {:.2f}".format(logreg.score(X_test_scaled,y_test)))
import scikitplot as skplt
y_pred = logreg.predict(X_test_scaled)
skplt.metrics.plot_confusion_matrix(y_test, y_pred, normalize=True)
plt.show()
y_probas = logreg.predict_proba(X_test_scaled)
skplt.metrics.plot_roc(y_test, y_probas)
plt.show()
skplt.metrics.plot_cumulative_gain(y_test, y_probas)
plt.show()