Week 48: Autoencoders and summary of course#
Morten Hjorth-Jensen, Department of Physics and Center for Computing in Science Education, University of Oslo, Norway
Date: Nov 24, 2025
Copyright 1999-2025, Morten Hjorth-Jensen. Released under CC Attribution-NonCommercial 4.0 license
Overview of week 48#
Lecture Monday, November 24#
Plans for the lecture Monday 24 November.
Discussion of Autoencoders and principal component analysis
Summary of course
Readings and Videos#
These lecture notes at CompPhysics/MachineLearning
Video of lecture at https://youtu.be/h2Qa49KlY50
Whiteboard notes at CompPhysics/MachineLearning
Video on Autoencoders at https://www.youtube.com/watch?v=hZ4a4NgM3u0
Goodfellow et al chapter 14.
Raschka et al., chapter 17 (a brief introduction only).
Deep Learning Tutorial on AEs from Stanford University at http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/
Building AEs in Keras at https://blog.keras.io/building-autoencoders-in-keras.html
Lab sessions#
Lab sessions on Tuesday and Wednesday.
Work on and discussion of project 3.
See the updated note on the usage of LLMs at the beginning of the project description
Last weekly exercise, see exercises week 47 and 48, deadline November 28
Autoencoders: Overarching view#
Autoencoders are artificial neural networks capable of learning efficient representations of the input data (these representations are called codings) without any supervision (i.e., the training set is unlabeled). These codings typically have a much lower dimensionality than the input data, making autoencoders useful for dimensionality reduction.
Autoencoders learn to encode the input data into a lower-dimensional representation, and then decode it back to the original data. The goal of autoencoders is to minimize the reconstruction error, which measures how well the output matches the input. Autoencoders can be seen as a way of learning the latent features or hidden structure of the data, and they can be used for data compression, denoising, anomaly detection, and generative modeling.
Powerful detectors#
More importantly, autoencoders act as powerful feature detectors, and they can be used for unsupervised pretraining of deep neural networks.
Lastly, they are capable of randomly generating new data that looks very similar to the training data; this is called a generative model. For example, you could train an autoencoder on pictures of faces, and it would then be able to generate new faces. Surprisingly, autoencoders work by simply learning to copy their inputs to their outputs. This may sound like a trivial task, but we will see that constraining the network in various ways can make it rather difficult. For example, you can limit the size of the internal representation, or you can add noise to the inputs and train the network to recover the original inputs. These constraints prevent the autoencoder from trivially copying the inputs directly to the outputs, which forces it to learn efficient ways of representing the data. In short, the codings are byproducts of the autoencoder’s attempt to learn the identity function under some constraints.
First introduction of AEs#
Autoencoders were first introduced by Rumelhart, Hinton, and Williams in 1986 with the goal of learning to reconstruct the input observations with the lowest error possible.
Why would one want to learn to reconstruct the input observations? If you have problems imagining what that means, think of having a dataset made of images. An autoencoder would be an algorithm that can give as output an image that is as similar as possible to the input one. You may be confused, as there is no apparent reason for doing so. To better understand why autoencoders are useful, we need a more informative (although not yet unambiguous) definition.
An autoencoder is a type of algorithm with the primary purpose of learning an “informative” representation of the data that can be used for different applications (see Bank, D., Koenigstein, N., and Giryes, R., Autoencoders) by learning to reconstruct a set of input observations well enough.
Autoencoder structure#
Autoencoders are neural networks whose outputs are their own inputs. They are split into an encoder part, which maps the input \(\boldsymbol{x}\) via a function \(f(\boldsymbol{x},\boldsymbol{W})\) (this is the encoder part) to a so-called code part (or intermediate part) with the result \(\boldsymbol{h}\),

$$
\boldsymbol{h} = f(\boldsymbol{x},\boldsymbol{W}),
$$

where \(\boldsymbol{W}\) are the weights to be determined. The decoder part maps, via its own parameters (weights given by the matrix \(\boldsymbol{V}\) and its own biases), to the final output

$$
\tilde{\boldsymbol{x}} = g(\boldsymbol{h},\boldsymbol{V}).
$$

The goal is to minimize the reconstruction error.
Schematic image of an Autoencoder#
Figure 1: Schematic view of an autoencoder.
More on the structure#
In most typical architectures, the encoder and the decoder are neural networks since they can be easily trained with existing software libraries such as TensorFlow or PyTorch with back propagation.
In general, the encoder can be written as a function \(g\) that will depend on some parameters,

$$
\mathbf{h}_{i} = g(\mathbf{x}_{i}),
$$

where \(\mathbf{h}_{i}\in\mathbb{R}^{q}\) (the latent feature representation) is the output of the encoder block evaluated on the input \(\mathbf{x}_{i}\).
Decoder part#
Note that we have \(g:\mathbb{R}^{n}\rightarrow\mathbb{R}^{q}\). The decoder, and thereby the output of the network \(\tilde{\mathbf{x}}_{i}\), can then be written as a second generic function \(f\) of the latent features,

$$
\tilde{\mathbf{x}}_{i} = f(\mathbf{h}_{i}) = f(g(\mathbf{x}_{i})),
$$

where \(\tilde{\mathbf{x}}_{i}\in\mathbb{R}^{n}\).

Training an autoencoder simply means finding the functions \(g(\cdot)\) and \(f(\cdot)\) that minimize the reconstruction error,

$$
\mathrm{arg}\min_{f,g} \langle \Delta(\mathbf{x}_{i}, f(g(\mathbf{x}_{i})))\rangle,
$$

where \(\Delta\) is a measure of the distance between the input and its reconstruction, averaged over the observations.
Typical AEs#
The standard setup is done via a standard feed forward neural network (FFNN), or what is called a Feed Forward Autoencoder.
A typical FFNN architecture has an odd number of layers and is symmetrical with respect to the middle layer.
Typically, the first layer has a number of neurons \(n_{1} = n\) which equals the size of the input observation \(\mathbf{x}_{i}\).

As we move toward the center of the network, the number of neurons in each layer decreases. The middle layer usually has the smallest number of neurons; because this number is smaller than the size of the input, this layer is often called the bottleneck.
Feed Forward Autoencoder#
Figure 1: A feed forward autoencoder.
Mirroring#
In almost all practical applications, the layers after the middle one are a mirrored version of the layers before the middle one. For example, an autoencoder with three layers could have the following numbers of neurons:
\(n_{1} = 10\), \(n_{2} = 5\) and then \(n_{3} = n_{1} = 10\) where the input dimension is equal to ten.
All the layers up to and including the middle one make up what is called the encoder, and all the layers from and including the middle one (up to the output) make up what is called the decoder.
If the FFNN training is successful, the result will be a good approximation of the input \(\tilde{\mathbf{x}}_{i}\approx\mathbf{x}_{i}\).
What is essential to notice is that the decoder can reconstruct the input by using only a much smaller number of features than the input observations initially have.
Output of middle layer#
The output of the middle layer \(\mathbf{h}_{i}\) is also called a learned representation of the input observation \(\mathbf{x}_{i}\).
The encoder can reduce the number of dimensions of the input observation and create a learned representation of the input that has a smaller dimension \(q<n\).
This learned representation is enough for the decoder to reconstruct the input accurately (if the autoencoder training was successful as intended).
Activation Function of the Output Layer#
In autoencoders based on neural networks, the output layer’s activation function plays a particularly important role. The most commonly used functions are the ReLU and the sigmoid.
ReLU#
The ReLU activation function can assume all values in the range \([0,\infty)\). As a reminder, its formula is

$$
\mathrm{ReLU}(x) = \max(0, x).
$$

This choice is good when the input observations \(\mathbf{x}_{i}\) assume a wide range of positive values. If the input \(\mathbf{x}_{i}\) can assume negative values, the ReLU is, of course, a terrible choice, and the identity function is a much better choice. It is then common to replace the ReLU with the so-called Leaky ReLU or other modified ReLUs.

The ReLU activation function for the output layer is thus well suited for cases where the input observations \(\mathbf{x}_{i}\) assume a wide range of positive real values.
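As a small illustration, here is a minimal NumPy sketch (not part of the lecture code) of the ReLU and the Leaky ReLU, the latter with an assumed slope parameter \(\alpha=0.01\):

```python
import numpy as np

def relu(x):
    # elementwise max(0, x)
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # identity for positive x, small slope alpha for negative x
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.  0.  0.  1.5]
print(leaky_relu(x))  # [-0.02  -0.005  0.  1.5]
```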
Sigmoid#
The sigmoid function \(\sigma\) can assume all values in the range \([0,1]\),

$$
\sigma(x) = \frac{1}{1+e^{-x}}.
$$

This activation function can only be used if the input observations \(\mathbf{x}_{i}\) are all in the range \([0,1]\) or if you have normalized them to be in that range. Consider as an example the MNIST dataset. Each value of the input observation \(\mathbf{x}_{i}\) (one image) is a gray value of a pixel that can assume any value from 0 to 255. Normalizing the data by dividing the pixel values by 255 makes each observation (each image) have only pixel values between 0 and 1. In this case, the sigmoid is a good choice for the output layer’s activation function.
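A minimal sketch of this normalization and of the sigmoid itself (illustrative values only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# grayscale pixels in [0, 255] mapped to [0, 1], matching the sigmoid output range
pixels = np.array([0.0, 64.0, 128.0, 255.0])
print(pixels / 255.0)
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # approximately [0.0067, 0.5, 0.9933]
```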
Cost/Loss Function#
If an autoencoder is trying to solve a regression problem, the most common choice of loss function is the mean squared error,

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \|\mathbf{x}_{i} - \tilde{\mathbf{x}}_{i}\|_2^2.
$$
Binary Cross-Entropy#
If the activation function of the output layer of the AE is a sigmoid function, thus limiting neuron outputs to be between 0 and 1, and the input features are normalized to be between 0 and 1, we can use the binary cross-entropy as loss function. This cost/loss function is typically used in classification problems, but it works well for autoencoders. The formula for it is

$$
\mathrm{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{p}\left[x_{ij}\log \tilde{x}_{ij} + (1-x_{ij})\log\left(1-\tilde{x}_{ij}\right)\right],
$$

where \(x_{ij}\) is the \(j\)-th component of the input \(\mathbf{x}_{i}\) and \(\tilde{x}_{ij}\) the corresponding component of the reconstruction.
Reconstruction Error#
The reconstruction error (RE) is a metric that indicates how well (or how poorly) the autoencoder is able to reconstruct the input observation \(\mathbf{x}_{i}\). The most typical RE used is the MSE,

$$
\mathrm{RE} = \frac{1}{n}\sum_{i=1}^{n} \|\mathbf{x}_{i} - \tilde{\mathbf{x}}_{i}\|_2^2.
$$
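As a small check of this definition, here is a NumPy sketch with made-up inputs and reconstructions:

```python
import numpy as np

# hypothetical inputs x and reconstructions x_tilde, shape (n_samples, n_features)
x = np.array([[0.0, 1.0], [1.0, 0.0]])
x_tilde = np.array([[0.1, 0.9], [0.8, 0.1]])

# squared error per sample, averaged over the samples
re = np.mean(np.sum((x - x_tilde)**2, axis=1))
print(re)
```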
Temporary summary on Autoencoders#
Understand the basic autoencoder architecture (encoder, latent space, decoder).
Understand the linear autoencoder and its connection to PCA. The linear case is good for building understanding; nonlinear autoencoders (with activation functions) can learn more complex manifolds.
What is an Autoencoder? An autoencoder (AE) is a neural network trained to reconstruct its input: \(\hat{x}=\mathrm{Dec}(\mathrm{Enc}(x))\).
Components:
**Encoder** \(f_\theta:\mathbb{R}^d\to\mathbb{R}^m\) compresses the input to the latent code \(h=f_\theta(x)\)

**Decoder** \(g_\phi:\mathbb{R}^m\to\mathbb{R}^d\) reconstructs \(\hat{x}=g_\phi(h)\)

Training objective: minimize a reconstruction loss, e.g. the MSE \(\mathcal{L}(\theta,\phi)=\frac{1}{N}\sum_{i=1}^N \|x^{(i)}-g_\phi(f_\theta(x^{(i)}))\|^2_2\).
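To make the linear-autoencoder/PCA connection concrete, here is a minimal PyTorch sketch (not part of the lecture code; data and hyperparameters are made up) that trains a linear autoencoder with a one-dimensional bottleneck and compares the learned decoder direction with the leading PCA eigenvector:

```python
import numpy as np
import torch
import torch.nn as nn

# synthetic 2D data with one dominant direction
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[4, 2], [2, 2]], size=1000).astype(np.float32)
X -= X.mean(axis=0)
X_t = torch.from_numpy(X)

# linear autoencoder: no activation functions, one latent dimension
model = nn.Sequential(nn.Linear(2, 1, bias=False), nn.Linear(1, 2, bias=False))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for _ in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(X_t), X_t)
    loss.backward()
    optimizer.step()

# the decoder column should align (up to sign and scale) with the top eigenvector
w = model[1].weight.detach().numpy().ravel()
eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
v0 = eigvecs[:, np.argmax(eigvals)]
print("decoder direction :", w / np.linalg.norm(w))
print("first eigenvector :", v0)
```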
Linear autoencoders and the PCA theorem#
The covariance matrix of a mean-centered design matrix \(\boldsymbol{X}\in{\mathbb{R}}^{n\times p}\) is given as

$$
\boldsymbol{C}[\boldsymbol{x}] = \frac{1}{n}\boldsymbol{X}^{T}\boldsymbol{X}.
$$
Let us now assume that we can perform a series of orthogonal transformations where we employ some orthogonal matrices \(\boldsymbol{S}\). These matrices are defined as \(\boldsymbol{S}\in {\mathbb{R}}^{p\times p}\) and obey the orthogonality requirements \(\boldsymbol{S}\boldsymbol{S}^T=\boldsymbol{S}^T\boldsymbol{S}=\boldsymbol{I}\). The matrix can be written out in terms of the column vectors \(\boldsymbol{s}_i\) as \(\boldsymbol{S}=[\boldsymbol{s}_0,\boldsymbol{s}_1,\dots,\boldsymbol{s}_{p-1}]\) and \(\boldsymbol{s}_i \in {\mathbb{R}}^{p}\).
More details#
Assume also that there is a transformation \(\boldsymbol{S}^T\boldsymbol{C}[\boldsymbol{x}]\boldsymbol{S}=\boldsymbol{C}[\boldsymbol{y}]\) such that the new matrix \(\boldsymbol{C}[\boldsymbol{y}]\) is diagonal with elements \([\lambda_0,\lambda_1,\lambda_2,\dots,\lambda_{p-1}]\).
That is we have

$$
\boldsymbol{C}[\boldsymbol{y}] = \boldsymbol{S}^{T}\boldsymbol{C}[\boldsymbol{x}]\boldsymbol{S} = \mathrm{diag}(\lambda_0,\lambda_1,\dots,\lambda_{p-1}),
$$

since the matrix \(\boldsymbol{S}\) is not a data dependent matrix. Multiplying with \(\boldsymbol{S}\) from the left we have

$$
\boldsymbol{C}[\boldsymbol{x}]\boldsymbol{S} = \boldsymbol{S}\boldsymbol{C}[\boldsymbol{y}],
$$

and since \(\boldsymbol{C}[\boldsymbol{y}]\) is diagonal we have for a given eigenvalue \(i\) of the covariance matrix that

$$
\boldsymbol{C}[\boldsymbol{x}]\boldsymbol{s}_{i} = \lambda_{i}\boldsymbol{s}_{i}.
$$
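We can verify these relations numerically; the following small NumPy sketch (illustrative matrix values) diagonalizes a symmetric, covariance-like matrix and checks the eigenvalue equation:

```python
import numpy as np

# a symmetric, covariance-like matrix (values are illustrative)
C = np.array([[4.0, 2.0],
              [2.0, 2.0]])

eigvals, S = np.linalg.eigh(C)             # S is orthogonal since C is symmetric
print(S.T @ S)                             # identity, up to rounding
print(S.T @ C @ S)                         # diagonal matrix of the eigenvalues
print(C @ S[:, 0] - eigvals[0] * S[:, 0])  # eigenvalue equation, approx. zero
```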
The PCA Theorem#
In the derivation of the PCA theorem we will assume that the eigenvalues are ordered in descending order, that is \(\lambda_0 > \lambda_1 > \dots > \lambda_{p-1}\).
The eigenvalues tell us then how much we need to stretch the corresponding eigenvectors. Dimensions with large eigenvalues have thus large variations (large variance) and define therefore useful dimensions. The data points are more spread out in the direction of these eigenvectors. Smaller eigenvalues mean on the other hand that the corresponding eigenvectors are shrunk accordingly; the data points are tightly bunched together and there is not much variation in these specific directions. Hopefully we could then leave out dimensions where the eigenvalues are very small. If \(p\) is very large, we could aim at reducing \(p\) to \(l \ll p\) and handle only \(l\) features/predictors.
The Algorithm before theorem#
Here is how we would proceed in setting up the algorithm for the PCA; see also the discussion below. Set up the datapoints for the design/feature matrix \(\boldsymbol{X}\), with \(\boldsymbol{X}\in {\mathbb{R}}^{n\times p}\), with the predictors/features \(p\) referring to the column numbers and the entries \(n\) being the row elements.
Further steps#
Center the data by subtracting the mean value for each column. This leads to a new matrix \(\boldsymbol{X}\rightarrow \overline{\boldsymbol{X}}\).
Compute then the covariance/correlation matrix \(\boldsymbol{C}=\mathbb{E}[\overline{\boldsymbol{X}}^T\overline{\boldsymbol{X}}]\).
Find the eigenpairs of \(\boldsymbol{C}\) with eigenvalues \([\lambda_0,\lambda_1,\dots,\lambda_{p-1}]\) and eigenvectors \([\boldsymbol{s}_0,\boldsymbol{s}_1,\dots,\boldsymbol{s}_{p-1}]\).
Order the eigenvalues (and the eigenvectors accordingly) in decreasing order.
Keep only those \(l\) eigenvalues larger than a selected threshold value, discarding thus \(p-l\) features since we expect small variations in the data here.
Writing our own PCA code#
We will use a simple example first, with two-dimensional data drawn from a multivariate normal distribution with the following mean and covariance matrix (we have fixed these quantities but will play around with them below):

$$
\boldsymbol{\mu} = \begin{bmatrix} -1 \\ 2 \end{bmatrix}, \qquad \boldsymbol{\Sigma} = \begin{bmatrix} 4 & 2 \\ 2 & 2 \end{bmatrix}.
$$

Note that the mean refers to each column of data. We will generate \(n = 10000\) points \(X = \{ x_1, \ldots, x_n \}\) from this distribution, and store them in the \(10000 \times 2\) matrix \(\boldsymbol{X}\). This is our design matrix, where we have forced the covariance and mean values to take specific values.
Implementing it#
The following Python code aids in setting up the data and writing out the design matrix. Note that we draw the samples with the function multivariate_normal using the covariance matrix discussed above, and that the sample covariance computed below is defined by dividing by \(n-1\) instead of \(n\).
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
n = 10000
mean = (-1, 2)
cov = [[4, 2], [2, 2]]
X = np.random.multivariate_normal(mean, cov, n)
Now we are going to implement the PCA algorithm. We will break it down into various substeps.
First Step#
The first step of PCA is to compute the sample mean of the data and use it to center the data. Recall that the sample mean is

$$
\boldsymbol{\mu}_n = \frac{1}{n}\sum_{i=1}^{n} \boldsymbol{x}_i,
$$

and the mean-centered data \(\bar{X} = \{ \bar{\boldsymbol{x}}_1, \ldots, \bar{\boldsymbol{x}}_n \}\) takes the form

$$
\bar{\boldsymbol{x}}_i = \boldsymbol{x}_i - \boldsymbol{\mu}_n.
$$
When you are done with these steps, print out \(\boldsymbol{\mu}_n\) to verify that it is close to \(\boldsymbol{\mu}\), and plot your mean-centered data to verify that it is centered at the origin! The following code performs these operations using either pandas or our own functionality. The latter, using numpy, is rather simple through the mean() function.
df = pd.DataFrame(X)
# Pandas does the centering for us
df = df -df.mean()
# we center it ourselves
X_centered = X - X.mean(axis=0)
Scaling#
Alternatively, we could use the functions we discussed earlier for scaling the data set. That is, we could have used the StandardScaler function in Scikit-Learn, a function which ensures that for each feature/predictor we study the mean value is zero and the variance is one (every column in the design/feature matrix). You would then not get the same results, since we divide by the variance. The diagonal covariance matrix elements will then be one, while the non-diagonal ones need to be divided by \(2\sqrt{2}\) for our specific case.
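As a quick check of this statement, here is a small sketch (reusing the data-generation parameters from above) that scales the data with StandardScaler and prints the resulting covariance matrix; the off-diagonal elements come out close to \(2/(2\sqrt{2})\approx 0.71\):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

n = 10000
mean = (-1, 2)
cov = [[4, 2], [2, 2]]
X = np.random.multivariate_normal(mean, cov, n)

X_scaled = StandardScaler().fit_transform(X)
# diagonal elements close to one, off-diagonal close to 1/sqrt(2)
print(np.cov(X_scaled.T))
```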
Centered Data#
Now we are going to use the mean-centered data to compute the sample covariance of the data by using the following equation

$$
\boldsymbol{\Sigma} = \frac{1}{n-1}\sum_{i=1}^{n} \bar{\boldsymbol{x}}_i \bar{\boldsymbol{x}}_i^{T},
$$

where the data points \(\boldsymbol{x}_i \in \mathbb{R}^p\) (here in this example \(p = 2\)) are column vectors and \(\boldsymbol{x}^T\) is the transpose of \(\boldsymbol{x}\). We can write our own code or simply use the functionality of either numpy or pandas, as follows
print(df.cov())
print(np.cov(X_centered.T))
Note that the way we define the covariance matrix here has a factor \(n-1\) instead of \(n\). This convention is also used by the cov() functions of numpy and pandas. Our own code here is not very elegant and invites obvious improvements; it is tailored to this specific \(2\times 2\) covariance matrix.
# extract the relevant columns from the centered design matrix of dim n x 2
x = X_centered[:,0]
y = X_centered[:,1]
Cov = np.zeros((2,2))
Cov[0,1] = np.sum(x.T@y)/(n-1.0)
Cov[0,0] = np.sum(x.T@x)/(n-1.0)
Cov[1,1] = np.sum(y.T@y)/(n-1.0)
Cov[1,0]= Cov[0,1]
print("Centered covariance using own code")
print(Cov)
plt.plot(x, y, 'x')
plt.axis('equal')
plt.show()
Exploring#
Depending on the number of points \(n\), we will get results that are close to the covariance values defined above. The plot shows how the data are clustered around a line with slope close to one. Is this expected? Try to change the covariance and the mean values. For example, try to make the variance of the first element much larger than that of the second diagonal element. Try also to shrink the covariance (the non-diagonal elements) and see how the data points are distributed.
Diagonalize the sample covariance matrix to obtain the principal components#
Now we are ready to solve for the principal components! To do so we diagonalize the sample covariance matrix \(\Sigma\). We can use the function np.linalg.eig to do so. It will return the eigenvalues and eigenvectors of \(\Sigma\). Once we have these we can perform the following tasks:
We compute the percentage of the total variance captured by the first principal component
We plot the mean centered data and lines along the first and second principal components
Then we project the mean centered data onto the first and second principal components, and plot the projected data.
Finally, we approximate the data as

$$
\boldsymbol{x}_i \approx \boldsymbol{\mu}_n + \left(\bar{\boldsymbol{x}}_i^{T}\boldsymbol{v}_0\right)\boldsymbol{v}_0,
$$

where \(\boldsymbol{v}_0\) is the first principal component.
Collecting all Steps#
Collecting all these steps we can write our own PCA function and compare this with the functionality included in Scikit-Learn.
The code here outlines some of the elements we could include in the analysis. Feel free to extend upon this in order to address the above questions.
# diagonalize and obtain eigenvalues, not necessarily sorted
EigValues, EigVectors = np.linalg.eig(Cov)
# sort eigenvalues and eigenvectors in descending order (uncomment to activate)
#permute = EigValues.argsort()[::-1]
#EigValues = EigValues[permute]
#EigVectors = EigVectors[:,permute]
print("Eigenvalues of Covariance matrix")
for i in range(2):
print(EigValues[i])
FirstEigvector = EigVectors[:,0]
SecondEigvector = EigVectors[:,1]
print("First eigenvector")
print(FirstEigvector)
print("Second eigenvector")
print(SecondEigvector)
#thereafter we do a PCA with Scikit-learn
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X2Dsl = pca.fit_transform(X)
print("Eigenvector of largest eigenvalue")
print(pca.components_.T[:, 0])
This code does not contain all the above elements, but it shows how we can use Scikit-Learn to extract the eigenvector which corresponds to the largest eigenvalue. Try to address the questions we pose before the above code. Try also to change the values of the covariance matrix by making one of the diagonal elements much larger than the other. What do you observe then?
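As a starting point for these questions, here is a small sketch that continues from the variables Cov, EigValues, EigVectors and X_centered defined above: it sorts the eigenpairs, computes the fraction of the total variance captured by the first principal component, and forms the rank-one reconstruction of the centered data:

```python
import numpy as np

# sort the eigenpairs in descending order of the eigenvalues
order = EigValues.argsort()[::-1]
EigValues = EigValues[order]
EigVectors = EigVectors[:, order]
v0 = EigVectors[:, 0]

# percentage of the total variance captured by the first principal component
print("Variance captured by PC1: %.1f%%" % (100 * EigValues[0] / EigValues.sum()))

# project the centered data onto the first principal component
z0 = X_centered @ v0
# rank-one approximation of the centered data
X_approx = np.outer(z0, v0)
print("Mean reconstruction error:", np.mean(np.sum((X_centered - X_approx)**2, axis=1)))
```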
Classical PCA Theorem#
We assume now that we have a design matrix \(\boldsymbol{X}\) which has been centered as discussed above. For the sake of simplicity we skip the overline symbol. The matrix is defined in terms of the various column vectors \([\boldsymbol{x}_0,\boldsymbol{x}_1,\dots, \boldsymbol{x}_{p-1}]\), each with dimension \(\boldsymbol{x}\in {\mathbb{R}}^{n}\). We seek an orthogonal matrix \(\boldsymbol{W}\) and low-dimensional codes \(\boldsymbol{z}_i\) that minimize the reconstruction error

$$
J(\boldsymbol{W},\boldsymbol{z}) = \frac{1}{n}\sum_{i} \|\boldsymbol{x}_i - \boldsymbol{W}\boldsymbol{z}_i\|_2^2.
$$

The PCA theorem states that minimizing this reconstruction error corresponds to setting \(\boldsymbol{W}=\boldsymbol{S}\), the orthogonal matrix which diagonalizes the empirical covariance (correlation) matrix. The optimal low-dimensional encoding of the data is then given by a set of vectors \(\boldsymbol{z}_i\), with at most \(l\) components and \(l \ll p\), defined by the orthogonal projection of the data onto the columns spanned by the eigenvectors of the covariance (correlation) matrix.
The PCA Theorem#
To show the PCA theorem, let us start with the assumption that there is one vector \(\boldsymbol{w}_0\) (the first column of the orthogonal matrix \(\boldsymbol{W}\)) which corresponds to a solution that minimizes the reconstruction error \(J\). We then approximate the reconstruction error in terms of \(\boldsymbol{w}_0\) and \(\boldsymbol{z}_0\) as

$$
J(\boldsymbol{w}_0,\boldsymbol{z}_0) = \frac{1}{n}\sum_{i} \|\boldsymbol{x}_i - z_{i0}\boldsymbol{w}_0\|_2^2.
$$

Optimizing with respect to \(z_{i0}\) gives \(z_{i0} = \boldsymbol{w}_0^T\boldsymbol{x}_i\), and inserting this back shows that, up to a constant, minimizing \(J\) amounts to maximizing \(\boldsymbol{w}_0^T\boldsymbol{C}[\boldsymbol{x}]\boldsymbol{w}_0\).

We are almost there: we have obtained a relation between minimizing the reconstruction error and the variance, and thereby the covariance matrix, of the data. Minimizing the error is equivalent to maximizing the variance of the projected data.
We could trivially maximize the variance of the projection (and thereby minimize the reconstruction error) by letting the 2-norm of \(\boldsymbol{w}_0\) go to infinity. However, since we want the matrix \(\boldsymbol{W}\) to be an orthogonal matrix, this norm is constrained by \(\vert\vert \boldsymbol{w}_0 \vert\vert_2^2=1\). Imposing this condition via a Lagrange multiplier \(\lambda_0\), we can then maximize

$$
\tilde{J}(\boldsymbol{w}_0) = \boldsymbol{w}_0^{T}\boldsymbol{C}[\boldsymbol{x}]\boldsymbol{w}_0 + \lambda_0\left(1 - \boldsymbol{w}_0^{T}\boldsymbol{w}_0\right).
$$
Taking the derivative with respect to \(\boldsymbol{w}_0\) we obtain

$$
2\boldsymbol{C}[\boldsymbol{x}]\boldsymbol{w}_0 - 2\lambda_0\boldsymbol{w}_0 = 0,
$$

meaning that

$$
\boldsymbol{C}[\boldsymbol{x}]\boldsymbol{w}_0 = \lambda_0\boldsymbol{w}_0.
$$

The direction that maximizes the variance (or minimizes the reconstruction error) is an eigenvector of the covariance matrix! If we left multiply with \(\boldsymbol{w}_0^T\), we find that the variance of the projected data is

$$
\boldsymbol{w}_0^{T}\boldsymbol{C}[\boldsymbol{x}]\boldsymbol{w}_0 = \lambda_0.
$$
If we want to maximize the variance (minimize the reconstruction error) we simply pick the eigenvector of the covariance matrix with the largest eigenvalue. This establishes the link between the minimization of the reconstruction function \(J\) in terms of an orthogonal matrix and the maximization of the variance, and thereby the covariance, of our observations encoded in the design/feature matrix \(\boldsymbol{X}\).
The proof for the other eigenvectors \(\boldsymbol{w}_1,\boldsymbol{w}_2,\dots\) can be established by applying the above arguments and using the fact that our basis of eigenvectors is orthogonal, see Murphy chapter 12.2. The discussion in chapter 12.2 of Murphy’s text has also a nice link with the Singular Value Decomposition theorem. For categorical data, see chapter 12.4 and discussion therein.
For more details, see for example Vidal, Ma and Sastry, chapter 2.
Geometric Interpretation and link with Singular Value Decomposition#
For a detailed demonstration of the geometric interpretation, see Vidal, Ma and Sastry, section 2.1.2.
Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. First it identifies the hyperplane that lies closest to the data, and then it projects the data onto it.
The following Python code uses NumPy’s svd() function to obtain all the principal components of the training set, then extracts the first two principal components. First we center the data using either pandas or our own code
import numpy as np
import pandas as pd
from IPython.display import display
np.random.seed(100)
# setting up a 10 x 5 vanilla matrix
rows = 10
cols = 5
X = np.random.randn(rows,cols)
df = pd.DataFrame(X)
# Pandas does the centering for us
df = df -df.mean()
display(df)
# we center it ourselves
X_centered = X - X.mean(axis=0)
# Then check the difference between pandas and our own set up
print(X_centered-df)
#Now we do an SVD (note that numpy's svd returns the transpose of the right singular matrix as its third output)
U, s, V = np.linalg.svd(X_centered)
c1 = V.T[:, 0]
c2 = V.T[:, 1]
W2 = V.T[:, :2]
X2D = X_centered.dot(W2)
print(X2D)
PCA assumes that the dataset is centered around the origin. Scikit-Learn’s PCA classes take care of centering the data for you. However, if you implement PCA yourself (as in the preceding example), or if you use other libraries, don’t forget to center the data first.
Once you have identified all the principal components, you can reduce the dimensionality of the dataset down to \(d\) dimensions by projecting it onto the hyperplane defined by the first \(d\) principal components. Selecting this hyperplane ensures that the projection will preserve as much variance as possible.
W2 = V.T[:, :2]
X2D = X_centered.dot(W2)
PCA and scikit-learn#
Scikit-Learn’s PCA class implements PCA using SVD decomposition just like we did before. The following code applies PCA to reduce the dimensionality of the dataset down to two dimensions (note that it automatically takes care of centering the data):
#thereafter we do a PCA with Scikit-learn
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X2D = pca.fit_transform(X)
print(X2D)
After fitting the PCA transformer to the dataset, you can access the principal components using the components_ attribute (note that it contains the PCs as horizontal vectors, so, for example, the first principal component is equal to
pca.components_.T[:, 0]
Another very useful piece of information is the explained variance ratio of each principal component, available via the explained_variance_ratio_ attribute. It indicates the proportion of the dataset’s variance that lies along the axis of each principal component.
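For the small example above this is a one-liner (the exact numbers depend on the data generated earlier):

```python
# proportion of the variance along each of the two principal components
print(pca.explained_variance_ratio_)
```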
Example of Cancer Data#
We can now repeat the above but applied to real data, in this case the Wisconsin breast cancer data. Here we compute performance scores on the training data using logistic regression.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data,cancer.target,random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print("Train set accuracy from Logistic Regression: {:.2f}".format(logreg.score(X_train,y_train)))
# We scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Then perform again a log reg fit
logreg.fit(X_train_scaled, y_train)
print("Train set accuracy scaled data: {:.2f}".format(logreg.score(X_train_scaled,y_train)))
#thereafter we do a PCA with Scikit-learn
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X2D_train = pca.fit_transform(X_train_scaled)
# and finally compute the log reg fit and the score on the training data
logreg.fit(X2D_train,y_train)
print("Train set accuracy scaled and PCA data: {:.2f}".format(logreg.score(X2D_train,y_train)))
We see that our training data after the PCA decomposition has a performance similar to the non-scaled data.
Instead of arbitrarily choosing the number of dimensions to reduce down to, it is generally preferable to choose the number of dimensions that add up to a sufficiently large portion of the variance (e.g., 95%). Unless, of course, you are reducing dimensionality for data visualization — in that case you will generally want to reduce the dimensionality down to 2 or 3. The following code computes PCA without reducing dimensionality, then computes the minimum number of dimensions required to preserve 95% of the training set’s variance:
pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
You could then set n_components=d and run PCA again. However, there is a much better option: instead of specifying the number of principal components you want to preserve, you can set n_components to be a float between 0.0 and 1.0, indicating the ratio of variance you wish to preserve:
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)
Incremental PCA#
One problem with the preceding implementation of PCA is that it requires the whole training set to fit in memory in order for the SVD algorithm to run. Fortunately, Incremental PCA (IPCA) algorithms have been developed: you can split the training set into mini-batches and feed an IPCA algorithm one minibatch at a time. This is useful for large training sets, and also to apply PCA online (i.e., on the fly, as new instances arrive).
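Here is a minimal sketch, assuming the scaled cancer training data X_train_scaled from above, that feeds Scikit-Learn's IncrementalPCA one mini-batch at a time:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

n_batches = 5
inc_pca = IncrementalPCA(n_components=2)
# feed the training set in mini-batches via partial_fit
for X_batch in np.array_split(X_train_scaled, n_batches):
    inc_pca.partial_fit(X_batch)
X_reduced = inc_pca.transform(X_train_scaled)
```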
Randomized PCA#
Scikit-Learn offers yet another option to perform PCA, called Randomized PCA. This is a stochastic algorithm that quickly finds an approximation of the first \(d\) principal components. Its computational complexity is \(O(m \times d^2)+O(d^3)\), instead of \(O(m \times n^2) + O(n^3)\), so it is dramatically faster than the previous algorithms when \(d\) is much smaller than \(n\).
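In Scikit-Learn this is selected through the svd_solver argument of the PCA class; a minimal sketch, again assuming the scaled training data from above:

```python
from sklearn.decomposition import PCA

# svd_solver="randomized" selects the stochastic approximation explicitly
rnd_pca = PCA(n_components=2, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X_train_scaled)
```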
Kernel PCA#
The kernel trick is a mathematical technique that implicitly maps instances into a very high-dimensional space (called the feature space), enabling nonlinear classification and regression with Support Vector Machines. Recall that a linear decision boundary in the high-dimensional feature space corresponds to a complex nonlinear decision boundary in the original space. It turns out that the same trick can be applied to PCA, making it possible to perform complex nonlinear projections for dimensionality reduction. This is called Kernel PCA (kPCA). It is often good at preserving clusters of instances after projection, or sometimes even unrolling datasets that lie close to a twisted manifold. For example, the following code uses Scikit-Learn’s KernelPCA class to perform kPCA with an RBF kernel:
from sklearn.decomposition import KernelPCA
rbf_pca = KernelPCA(n_components = 2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)
Other techniques#
There are many other dimensionality reduction techniques, several of which are available in Scikit-Learn.
Here are some of the most popular (a small usage sketch follows the list):
Multidimensional Scaling (MDS) reduces dimensionality while trying to preserve the distances between the instances.
Isomap creates a graph by connecting each instance to its nearest neighbors, then reduces dimensionality while trying to preserve the geodesic distances between the instances.
t-Distributed Stochastic Neighbor Embedding (t-SNE) reduces dimensionality while trying to keep similar instances close and dissimilar instances apart. It is mostly used for visualization, in particular to visualize clusters of instances in high-dimensional space (e.g., to visualize the MNIST images in 2D).
Linear Discriminant Analysis (LDA) is actually a classification algorithm, but during training it learns the most discriminative axes between the classes, and these axes can then be used to define a hyperplane onto which to project the data. The benefit is that the projection will keep classes as far apart as possible, so LDA is a good technique to reduce dimensionality before running another classification algorithm such as a Support Vector Machine (SVM) classifier discussed in the SVM lectures.
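A minimal sketch of the first three methods, assuming the scaled cancer training data X_train_scaled from above (hyperparameters are illustrative):

```python
from sklearn.manifold import MDS, Isomap, TSNE

mds = MDS(n_components=2, random_state=42)
X_mds = mds.fit_transform(X_train_scaled)

isomap = Isomap(n_components=2)
X_isomap = isomap.fit_transform(X_train_scaled)

tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_train_scaled)
```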
Back to Autoencoders: Linear Autoencoders#
PyTorch example#
We will, again, use the MNIST database, which has \(60000\) training examples and a test set of \(10000\) handwritten numbers. The images have a single color channel and a size of \(28\times 28\) pixels. We start by loading the data set.
# import the Torch packages
# transforms are used to preprocess the images, e.g. crop, rotate, normalize, etc
import torch
from torchvision import datasets,transforms
# specify the data path in which you would like to store the downloaded files
# ToTensor() here is used to convert data type to tensor
train_dataset = datasets.MNIST(root='./mnist_data/', train=True, transform=transforms.ToTensor(), download=True)
test_dataset = datasets.MNIST(root='./mnist_data/', train=False, transform=transforms.ToTensor(), download=True)
print(train_dataset)
batchSize=128
#only after packed in DataLoader, can we feed the data into the neural network iteratively
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batchSize, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batchSize, shuffle=False)
We visualize the images here using the imshow function defined below and the make_grid function from PyTorch to arrange and display them.
# package we used to manipulate matrix
import numpy as np
# package we used for image processing
from matplotlib import pyplot as plt
from torchvision.utils import make_grid
def imshow(img):
    npimg = img.numpy()
    # transpose: change the array axes to correspond to the plt.imshow() convention
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()
# load a batch of training samples
examples, _ = next(iter(train_loader))
# take the first 16 samples along the batch dimension and arrange them in a grid with 4 images per row
example_show = make_grid(examples[:16,:,:,:], 4)
# then display them
imshow(example_show)
Our autoencoder consists of two parts. The encoder and decoder parts are represented by two fully connected feed forward neural networks where we use the standard sigmoid activation function. In the encoder part we reduce the dimensionality of the image from \(28\times 28=784\) pixels, first to \(16\times 16=256\) nodes and then to \(128\) nodes. This 128-dimensional representation of the input serves as input to the decoder part, which attempts to reconstruct the images.
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
# Network Parameters
num_hidden_1 = 256 # 1st layer num features
num_hidden_2 = 128 # 2nd layer num features (the latent dim)
num_input = 784 # MNIST data input (img shape: 28*28)
# Building the autoencoder (encoder and decoder parts)
class Autoencoder(nn.Module):
    def __init__(self, x_dim, h_dim1, h_dim2):
        super(Autoencoder, self).__init__()
        # encoder part
        self.fc1 = nn.Linear(x_dim, h_dim1)
        self.fc2 = nn.Linear(h_dim1, h_dim2)
        # decoder part
        self.fc3 = nn.Linear(h_dim2, h_dim1)
        self.fc4 = nn.Linear(h_dim1, x_dim)

    def encoder(self, x):
        x = torch.sigmoid(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        return x

    def decoder(self, x):
        x = torch.sigmoid(self.fc3(x))
        x = torch.sigmoid(self.fc4(x))
        return x

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x
# When initializing, it will run __init__() function as above
model = Autoencoder(num_input, num_hidden_1, num_hidden_2)
We define here the cost/loss function and the optimizer we employ (Adam here).
# define loss function and parameters
optimizer = optim.Adam(model.parameters())
epoch = 100
# MSE loss will calculate Mean Squared Error between the inputs
loss_function = nn.MSELoss()
print('====Training start====')
for i in range(epoch):
    train_loss = 0
    for batch_idx, (data, _) in enumerate(train_loader):
        # flatten the images to vectors of length 784 (-1 infers the batch dimension)
        inputs = torch.reshape(data, (-1, 784))
        # set gradients to zero
        optimizer.zero_grad()
        # feed inputs into the model
        recon_x = model(inputs)
        # calculate the loss
        loss = loss_function(recon_x, inputs)
        # calculate the gradient of each parameter
        loss.backward()
        train_loss += loss.item()
        # update the weights based on the calculated gradients
        optimizer.step()
    if i % 10 == 0:
        print('====> Epoch: {} Average loss: {:.9f}'.format(i, train_loss / len(train_loader)))
print('====Training finish====')
Now that we have trained the network, we will reconstruct various test samples to see whether the model generalizes to data not included in the training set.
# load 16 images from testset
inputs, _ = next(iter(test_loader))
inputs_example = make_grid(inputs[:16,:,:,:],4)
imshow(inputs_example)
#convert from image to tensor
#inputs=inputs.cuda()
inputs=torch.reshape(inputs,(-1,784))
# get the outputs from the trained model
outputs=model(inputs)
#convert from tensor to image
outputs=torch.reshape(outputs,(-1,1,28,28))
outputs=outputs.detach().cpu()
#show the output images
outputs_example = make_grid(outputs[:16,:,:,:],4)
imshow(outputs_example)
After training the autoencoder, we can now use the model to reconstruct images. In order to reconstruct the different training images, the model has learned what the images look like and how to describe them in the 128-dimensional latent space. In other words, the visual information of the images is compressed and encoded in the 128-dimensional representations. As we assume that samples from the same category should be more visually similar than those from different classes, the representations can then be used for image recognition, i.e., handwritten digit recognition in our case.
One simple way to recognize images is to randomly select ten training samples from each class and annotate them with the corresponding label. Then given the test data, we can predict which classes they belong to by finding the most similar labelled training samples to them.
# get 100 image-label pairs from training set
x_train, y_train = next(iter(train_loader))
# select 100 samples at random from the batch (not necessarily 10 per class)
candidates = np.random.choice(batchSize, 10*10)
x_train = x_train[candidates]
y_train = y_train[candidates]
# display the selected samples and print their labels
imshow(make_grid(x_train[:100,:,:,:],10))
print(y_train.reshape(10, 10))
# get 100 image-label pairs from the test set
x_test, y_test = next(iter(test_loader))
candidates_test = np.random.choice(batchSize, 10*10)
x_test = x_test[candidates_test]
y_test = y_test[candidates_test]
# display the selected samples and print their labels
imshow(make_grid(x_test[:100,:,:,:],10))
print(y_test.reshape(10, 10))
# compute the representations of training and test samples
#h_train=model.encoder(torch.reshape(x_train.cuda(),(-1,784)))
#h_test=model.encoder(torch.reshape(x_test.cuda(),(-1,784)))
h_train=model.encoder(torch.reshape(x_train,(-1,784)))
h_test=model.encoder(torch.reshape(x_test,(-1,784)))
# find the nearest training samples to each test instance, in terms of MSE
MSEs = np.mean(np.power(np.expand_dims(h_test.detach().cpu(), axis=1) - np.expand_dims(h_train.detach().cpu(), axis=0), 2), axis=2)
neighbours = MSEs.argmin(axis=1)
predicts = y_train[neighbours]
# print(np.stack([y_test, predicts], axis=1))
print('Recognition accuracy according to the learned representation is %.1f%%' % (100 * (y_test == predicts).numpy().astype(np.float32).mean()))
Summary of course#
What? Me worry? No final exam in this course!#
Topics we have covered this year#
The course has two central parts
Statistical analysis and optimization of data
Machine learning
Statistical analysis and optimization of data#
The following topics have been discussed:
Basic concepts, expectation values, variance, covariance, correlation functions and errors;
Simpler models, binomial distribution, the Poisson distribution, simple and multivariate normal distributions;
Central elements from linear algebra, matrix inversion and SVD
Gradient methods for data optimization
Estimation of errors using cross-validation, bootstrapping and jackknife methods;
Practical optimization using Singular-value decomposition and least squares for parameterizing data.
Principal Component Analysis to reduce the number of features (covered only briefly this year, in connection with autoencoders).
Machine learning#
Linear methods for regression and classification:
a. Ordinary Least Squares
b. Ridge regression
c. Lasso regression
d. Logistic regression
Neural networks and deep learning:
a. Feed Forward Neural Networks
b. Convolutional Neural Networks
c. Recurrent Neural Networks
d. Autoencoders and PCA
Not discussed this year: Decision trees and ensemble methods:
a. Decision trees
b. Bagging and voting
c. Random forests
d. Boosting and gradient boosting
Not discussed this year: Support vector machines
a. Binary classification and multiclass classification
b. Kernel methods
c. Regression
Learning outcomes and overarching aims of this course#
The course introduces a variety of central algorithms and methods essential for studies of data analysis and machine learning. The course is project based, and through the various projects, normally three, you will be exposed to fundamental research problems in these fields, with the aim of reproducing state-of-the-art scientific results. You will learn to develop and structure large codes for studying these systems, become acquainted with computing facilities, and learn to handle large scientific projects. Good scientific and ethical conduct is emphasized throughout the course.
Understand linear methods for regression and classification;
Learn about neural networks and deep learning methods;
Learn about basic data analysis;
Be capable of extending the acquired knowledge to other systems and cases;
Have an understanding of central algorithms used in data analysis and machine learning;
Work on numerical projects to illustrate the theory. The projects play a central role.
Perspective on Machine Learning#
Rapidly emerging application area
Experiment AND theory are evolving in many many fields.
Requires education/retraining for more widespread adoption
A lot of “word-of-mouth” development methods
And LLMs are playing a big role here
Huge amounts of data require automation, and classical analysis tools are often inadequate. High energy physics hit this wall in the 1990s. In 2009, single top quark production was established using boosted decision trees, Bayesian neural networks, and related methods.
Machine Learning Research#
Where to find recent results:
Conference proceedings, arXiv and blog posts!
ICML: International Conference on Machine Learning
Starting your Machine Learning Project#
1. Identify the problem type: classification or regression

2. Consider your data carefully

3. Choose a simple model that fits points 1 and 2

4. Consider your data carefully again! Think about the data representation more carefully.

5. Based on your results, feed back to the earliest possible point
Choose a Model and Algorithm#
Supervised?
Start with the simplest model that fits your problem
Start with minimal processing of data
Preparing Your Data#
Shuffle your data
Mean center your data
Why?
Normalize the variance
Why?
Whitening
Decorrelates data
Can be hit or miss
When to do the train/test split? (A small preprocessing sketch follows below.)
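A minimal sketch of these steps, with made-up data and using Scikit-Learn (the split comes first so that all preprocessing statistics are fitted on the training set only):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# hypothetical data matrix and binary targets
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

# shuffle and split FIRST, then fit the preprocessing on the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, random_state=0)

scaler = StandardScaler()                 # mean centering and unit variance
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training-set statistics

# whitening decorrelates the features (can be hit or miss)
whitener = PCA(whiten=True)
X_train_white = whitener.fit_transform(X_train_scaled)
```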
Which activation and weights to choose in neural networks#
ReLU? ELU? GELU? etc.

Sigmoid or Tanh?

Set all weights to 0? A terrible idea.

Set all weights to random values? Yes, but use small random values.
Optimization Methods and Hyperparameters#
Stochastic gradient descent
Stochastic gradient descent + momentum
State-of-the-art approaches:
a. RMSProp
b. Adam
c. and more
Which regularization and hyperparameters? \(L_1\) or \(L_2\), soft classifiers, depths of trees, and many others. One needs to explore a large set of hyperparameters and regularization methods.
Resampling#
When do we resample?
Jackknife and many others
Other courses on Data science and Machine Learning at UiO#
FYS5429 – Advanced machine learning and data analysis for the physical sciences
IN3050/IN4050 Introduction to Artificial Intelligence and Machine Learning. Introductory course in machine learning and AI
STK-INF3000/4000 Selected Topics in Data Science. The course provides insight into selected contemporary relevant topics within Data Science.
IN4080 Natural Language Processing. Probabilistic and machine learning techniques applied to natural language processing.
STK-IN4300 – Statistical learning methods in Data Science. An advanced introduction to statistical and machine learning methods. For students with a good mathematics and statistics background.
IN-STK5000 Responsible Data Science. Methods for adaptive collection and processing of data based on machine learning techniques.
IN4310 – Machine Learning for Image Analysis. An introduction to deep learning with particular emphasis on applications within Image analysis, but useful for other application areas too.
IN5490 – Advanced Topics in Artificial Intelligence for Intelligent Systems
TEK5040 – Deep learning for autonomous systems. The course addresses advanced algorithms and architectures for deep learning with neural networks. The course provides an introduction to how deep-learning techniques can be used in the construction of key parts of advanced autonomous systems that exist in physical environments and cyber environments.
Additional courses of interest#
What’s the future like?#
Based on multi-layer nonlinear neural networks, deep learning can learn directly from raw data, automatically extract and abstract features layer by layer, and then achieve the goal of regression, classification, or ranking. Deep learning has made breakthroughs in computer vision, speech processing and natural language, and has reached or even surpassed human-level performance. The success of deep learning is mainly due to three factors: big data, big models, and big computing.
In the past few decades, many different architectures of deep neural networks have been proposed, such as
Convolutional neural networks, which are mostly used in image and video data processing, and have also been applied to sequential data such as text processing;
Recurrent neural networks, which can process sequential data of variable length and have been widely used in natural language understanding and speech processing;
Encoder-decoder framework, which is mostly used for image or sequence generation, such as machine translation, text summarization, and image captioning.
Types of Machine Learning, a repetition#
The approaches to machine learning are many, but are often split into two main categories. In supervised learning we know the answer to a problem, and let the computer deduce the logic behind it. On the other hand, unsupervised learning is a method for finding patterns and relationships in data sets without any prior knowledge of the system. Some authors also operate with a third category, namely reinforcement learning. This is a paradigm of learning inspired by behavioural psychology, where learning is achieved by trial and error, solely from rewards and punishment.
Another way to categorize machine learning tasks is to consider the desired output of a system. Some of the most common tasks are:
Classification: Outputs are divided into two or more classes. The goal is to produce a model that assigns inputs into one of these classes. An example is to identify digits based on pictures of hand-written ones. Classification is typically supervised learning.
Regression: Finding a functional relationship between an input data set and a reference data set. The goal is to construct a function that maps input data to continuous output values.
Clustering: Data are divided into groups with certain common traits, without knowing the different groups beforehand. It is thus a form of unsupervised learning.
Other unsupervised learning algorithms, like Boltzmann machines
Why Boltzmann machines?#
What are known as restricted Boltzmann machines (RBMs) have received a lot of attention lately. One of the major reasons is that they can be stacked layer-wise to build deep neural networks that capture complicated statistics.
The original RBMs had just one visible layer and a hidden layer, but recently so-called Gaussian-binary RBMs have gained quite some popularity in imaging since they are capable of modeling continuous data that are common to natural images.
Furthermore, they have been used to solve complicated quantum mechanical many-particle problems or classical statistical physics problems like the Ising and Potts classes of models.
Boltzmann Machines#
Why use a generative model rather than the more well known discriminative deep neural networks (DNN)?
Discriminative methods have several limitations: they are mainly supervised learning methods, thus requiring labeled data, and there are tasks they cannot accomplish, like drawing new examples from an unknown probability distribution.
A generative model can learn to represent and sample from a probability distribution. The core idea is to learn a parametric model of the probability distribution from which the training data was drawn. As an example
a. A model for images could learn to draw new examples of cats and dogs, given a training dataset of images of cats and dogs.
b. Generate a sample of an ordered or disordered phase, having been given samples of such phases.
c. Model the trial function for Monte Carlo calculations.
Some similarities and differences from DNNs#
Both use gradient-descent based learning procedures for minimizing cost functions
Energy based models don’t use backpropagation and automatic differentiation for computing gradients, instead turning to Markov Chain Monte Carlo methods.
DNNs often have several hidden layers. A restricted Boltzmann machine has only one hidden layer, however several RBMs can be stacked to make up Deep Belief Networks, of which they constitute the building blocks.
History: The RBM was developed by, amongst others, Geoffrey Hinton, called by some the “Godfather of Deep Learning”, who worked at the University of Toronto and Google.
Boltzmann machines (BM)#
A BM is what we would call an undirected probabilistic graphical model with stochastic continuous or discrete units.
It is interpreted as a stochastic recurrent neural network where the state of each unit (neuron/node) depends on the units it is connected to. The weights in the network thus represent the strength of the interaction between the various units/nodes.
It turns into a Hopfield network if we choose deterministic rather than stochastic units. In contrast to a Hopfield network, a BM is a so-called generative model. It allows us to generate new samples from the learned distribution.
A standard BM setup#
A standard BM network is divided into a set of observable and visible units \(\hat{x}\) and a set of unknown hidden units/nodes \(\hat{h}\).
Additionally there can be bias nodes for the hidden and visible layers. These biases are normally set to \(1\).
BMs are stackable, meaning we can train a BM whose output serves as input to another BM. In this way we can construct deep networks for learning complex probability distributions. The layers can be trained one after another, a feature which makes them popular in deep learning.

However, they are often hard to train. This leads to the introduction of so-called restricted BMs, or RBMs. Here we remove all lateral connections between nodes in the visible layer as well as connections between nodes in the hidden layer. The network is illustrated in the figure below.
The structure of the RBM network#
Figure 1: The structure of the RBM network.
The network#
The network layers:
The vector \(\mathbf{x}\) represents the visible layer, a vector of \(M\) elements (nodes). This layer represents both what the RBM might be given as training input, and what we want it to be able to reconstruct. This might for example be the pixels of an image, coefficients representing speech, or the coordinates of a quantum mechanical state function.

The vector \(\mathbf{h}\) represents the hidden, or latent, layer: a vector of \(N\) elements (nodes), also called “feature detectors”.
Goals#
The goal of the hidden layer is to increase the model’s expressive power. We encode complex interactions between visible variables by introducing additional, hidden variables that interact with visible degrees of freedom in a simple manner, yet still reproduce the complex correlations between visible degrees in the data once marginalized over (integrated out).
The network parameters, to be optimized/learned:
\(\mathbf{a}\) represents the visible bias, a vector of same length as \(\mathbf{x}\).
\(\mathbf{b}\) represents the hidden bias, a vector of same length as \(\mathbf{h}\).
\(W\) represents the interaction weights, a matrix of size \(M\times N\).
Joint distribution#
The restricted Boltzmann machine is described by a Boltzmann distribution

$$
P_{\mathrm{rbm}}(\mathbf{x},\mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{x},\mathbf{h})/T_0},
$$

where \(Z\) is the normalization constant or partition function, defined as

$$
Z = \int\!\!\int e^{-E(\mathbf{x},\mathbf{h})/T_0}\, d\mathbf{x}\, d\mathbf{h}.
$$

It is common to ignore \(T_0\) by setting it to one.
Network Elements, the energy function#
The function \(E(\mathbf{x},\mathbf{h})\) gives the energy of a configuration (pair of vectors) \((\mathbf{x}, \mathbf{h})\). The lower the energy of a configuration, the higher the probability of it. This function also depends on the parameters \(\mathbf{a}\), \(\mathbf{b}\) and \(W\). Thus, when we adjust them during the learning procedure, we are adjusting the energy function to best fit our problem.
An expression for the energy function is

$$
E(\mathbf{x},\mathbf{h}) = -\sum_{i,a} b_i^a \alpha_i^a(x_i) - \sum_{j,d} c_j^d \beta_j^d(h_j) - \sum_{i,j,a,d} w_{ij}^{ad}\,\alpha_i^a(x_i)\,\beta_j^d(h_j).
$$

Here \(\beta_j^d(h_j)\) and \(\alpha_i^a(x_i)\) are so-called transfer functions that map a given input value to a desired feature value. The labels \(a\) and \(d\) denote that there can be multiple transfer functions per variable. The first sum depends only on the visible units, the second only on the hidden ones. Note that there is no connection between nodes in the same layer.
The quantities \(b\) and \(c\) can be interpreted as the visible and hidden biases, respectively.
The connection between the nodes in the two layers is given by the weights \(w_{ij}\).
Defining different types of RBMs#
There are different variants of RBMs, and the differences lie in the types of visible and hidden units we choose as well as in the implementation of the energy function \(E(\mathbf{x},\mathbf{h})\).
Binary-Binary RBM:
RBMs were first developed using binary units in both the visible and hidden layer. The corresponding energy function is defined as follows:

$$
E(\mathbf{x},\mathbf{h}) = -\sum_{i} x_i a_i - \sum_{j} b_j h_j - \sum_{i,j} x_i w_{ij} h_j,
$$

where the binary values taken on by the nodes are most commonly 0 and 1.
Gaussian-Binary RBM:
Another variant is the RBM where the visible units are Gaussian while the hidden units remain binary:

$$
E(\mathbf{x},\mathbf{h}) = \sum_{i} \frac{(x_i - a_i)^2}{2\sigma_i^2} - \sum_{j} b_j h_j - \sum_{i,j} \frac{x_i w_{ij} h_j}{\sigma_i^2}.
$$
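To make the binary-binary energy concrete, here is a tiny NumPy sketch with made-up biases and weights:

```python
import numpy as np

# hypothetical small binary-binary RBM: 3 visible and 2 hidden nodes
rng = np.random.default_rng(0)
a = rng.normal(size=3)       # visible biases
b = rng.normal(size=2)       # hidden biases
W = rng.normal(size=(3, 2))  # interaction weights

def energy(x, h):
    # binary-binary RBM energy: -a.x - b.h - x.W.h
    return -(a @ x) - (b @ h) - x @ W @ h

x = np.array([1, 0, 1])
h = np.array([0, 1])
print(energy(x, h))
```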
More about RBMs#
Useful when we model continuous data (i.e., we wish \(\mathbf{x}\) to be continuous)
Requires a smaller learning rate, since there’s no upper bound to the value a component might take in the reconstruction
Other types of units include:
Softmax and multinomial units
Gaussian visible and hidden units
Binomial units
Rectified linear units
To read more, see Lectures on Boltzmann machines in Physics.
Autoencoders: Overarching view#
Autoencoders are artificial neural networks capable of learning efficient representations of the input data (these representations are called codings) without any supervision (i.e., the training set is unlabeled). These codings typically have a much lower dimensionality than the input data, making autoencoders useful for dimensionality reduction.
See also A. Geron’s textbook, chapter 15.
Bayesian Machine Learning#
This is an important topic if we aim at extracting a probability distribution, which also gives us confidence intervals and error estimates.
Bayesian machine learning allows us to encode our prior beliefs about what those models should look like, independent of what the data tells us. This is especially useful when we don’t have a ton of data to confidently learn our model.
Video on Bayesian deep learning
See also the slides here.
Reinforcement Learning#
Reinforcement Learning (RL) is one of the most exciting fields of Machine Learning today, and also one of the oldest. It has been around since the 1950s, producing many interesting applications over the years.
It studies how agents take actions based on trial and error, so as to maximize some notion of cumulative reward in a dynamic system or environment. Due to its generality, the problem has also been studied in many other disciplines, such as game theory, control theory, operations research, information theory, multi-agent systems, swarm intelligence, statistics, and genetic algorithms.
In March 2016, AlphaGo, a computer program that plays the board game Go, beat Lee Sedol in a five-game match. This was the first time a computer Go program had beaten a 9-dan (highest rank) professional without handicaps. AlphaGo is based on deep convolutional neural networks and reinforcement learning. AlphaGo’s victory was a major milestone in artificial intelligence and it has also made reinforcement learning a hot research area in the field of machine learning.
Lecture on Reinforcement Learning.
See also A. Geron’s textbook, chapter 16.
Transfer learning#
The goal of transfer learning is to transfer the model or knowledge obtained from a source task to a target task, in order to overcome insufficient training data in the target task. The rationale is that the source and target tasks are usually correlated, so the features, samples, or models from the source task may provide useful information for solving the target task. Transfer learning has been a very active research topic in recent years, with many problems still waiting to be studied.
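A minimal sketch of this idea in Keras, assuming we reuse ImageNet features from MobileNetV2 for a new classification task, could look as follows:

```python
import tensorflow as tf

# Source task: convolutional features pretrained on ImageNet (downloaded by Keras).
base = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                         include_top=False, weights="imagenet")
base.trainable = False   # freeze the transferred features

# Target task: train only a small classification head on our scarce data.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),   # e.g., 10 target classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=5)   # only the head's weights are updated
```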
Adversarial learning#
The conventional deep generative model has a potential problem: the model tends to generate extreme instances to maximize the probabilistic likelihood, which hurts its performance. Adversarial learning uses adversarial behavior (e.g., generating adversarial instances or training an adversarial model) to enhance the robustness of a model and improve the quality of the generated data. In recent years, one of the most promising unsupervised learning technologies, the generative adversarial network (GAN), has been successfully applied to images, speech, and text.
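As a minimal illustration of the adversarial game behind GANs, the sketch below trains a generator to mimic a one-dimensional Gaussian; the toy data, network sizes, and hyperparameters are all invented for illustration:

```python
import numpy as np
import tensorflow as tf

# Toy "real" data: samples from a 1D Gaussian the generator must learn to mimic.
real_data = np.random.normal(loc=4.0, scale=0.5, size=(1024, 1)).astype("float32")

generator = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),                      # 8-dimensional noise input
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
discriminator = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability the input is real
])

bce = tf.keras.losses.BinaryCrossentropy()
g_opt = tf.keras.optimizers.Adam(1e-3)
d_opt = tf.keras.optimizers.Adam(1e-3)

for step in range(2000):
    real = real_data[np.random.randint(0, len(real_data), 64)]
    noise = tf.random.normal((64, 8))
    with tf.GradientTape() as d_tape, tf.GradientTape() as g_tape:
        fake = generator(noise, training=True)
        d_real = discriminator(real, training=True)
        d_fake = discriminator(fake, training=True)
        # Discriminator: label real samples 1 and generated samples 0.
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
        # Generator: try to make the discriminator output 1 on its fakes.
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))

print(float(tf.reduce_mean(generator(tf.random.normal((1000, 8))))))  # should approach 4.0
```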
Dual learning#
Dual learning is a new learning paradigm, the basic idea of which is to use the primal-dual structure between machine learning tasks to obtain effective feedback/regularization, and guide and strengthen the learning process, thus reducing the requirement of large-scale labeled data for deep learning. The idea of dual learning has been applied to many problems in machine learning, including machine translation, image style conversion, question answering and generation, image classification and generation, text classification and generation, image-to-text, and text-to-image.
Distributed machine learning#
Distributed computation can speed up machine learning algorithms, significantly improve their efficiency, and thus enlarge the range of their applications. When distributed computing meets machine learning, however, more is required than simply implementing the existing algorithms in parallel.
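As a minimal sketch of the simplest case, data parallelism, TensorFlow's MirroredStrategy replicates a model across local devices and averages the gradients (the tiny model below is purely illustrative):

```python
import tensorflow as tf

# Data parallelism: replicate the model on all local GPUs and average gradients.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(8,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")
# model.fit(...) now splits each batch across the replicas
```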
Meta learning#
Meta learning is an emerging research direction in machine learning. Roughly speaking, meta learning is learning how to learn: it focuses on understanding and adapting the learning process itself, rather than just completing a specific learning task. That is, a meta learner needs to be able to evaluate and adjust its own learning method according to the specific learning task.
The Challenges Facing Machine Learning#
While there has been much progress in machine learning, there are also challenges.
For example, mainstream machine learning technologies are black-box approaches, making us concerned about their potential risks. To tackle this challenge, we may want to make machine learning more explainable and controllable. As another example, the computational complexity of machine learning algorithms is usually very high, and we may want to invent lightweight algorithms or implementations. Furthermore, in many domains such as physics, chemistry, biology, and the social sciences, people usually seek elegantly simple equations (e.g., the Schrödinger equation) to uncover the underlying laws behind various phenomena. In the field of machine learning, can we reveal simple laws instead of designing ever more complex models for data fitting?
Although there are many challenges, we are still very optimistic about the future of machine learning. Looking forward, here is what we think the research hotspots of the next ten years will be.
See the article on Discovery of Physics From Data: Universal Laws and Discrepancies
Explainable machine learning#
Machine learning, especially deep learning, evolves rapidly. The ability gap between machines and humans on many complex cognitive tasks is becoming narrower and narrower. However, we are still at a very early stage when it comes to explaining why these effective models work and how they work.
What is missing is the gap between correlation and causation: standard machine learning is based on what we have called a frequentist approach.
Most machine learning techniques, especially the statistical ones, depend heavily on correlations in data sets to make predictions and analyses. In contrast, rational humans tend to rely on clear and trustworthy causal relations obtained via logical reasoning on real and clear facts. It is one of the core goals of explainable machine learning to transition from solving problems by data correlation to solving problems by logical reasoning.
Bayesian Machine Learning is one of the exciting research directions in this field.
Quantum machine learning#
Quantum machine learning is an emerging interdisciplinary research area at the intersection of quantum computing and machine learning.
Quantum computers use effects such as quantum coherence and quantum entanglement to process information, which is fundamentally different from classical computers. Quantum algorithms have surpassed the best classical algorithms in several problems (e.g., searching an unsorted database, inverting a sparse matrix), which we call quantum acceleration.
When quantum computing meets machine learning, it can be a mutually beneficial and reinforcing process: we can take advantage of quantum computing to improve the performance of classical machine learning algorithms, and we can also use machine learning algorithms (on classical computers) to analyze and improve quantum computing systems.
Read interview with Maria Schuld on her work on Quantum Machine Learning. See also her recent textbook.
Quantum machine learning algorithms based on linear algebra#
Many quantum machine learning algorithms are based on variants of quantum algorithms for solving linear systems of equations, which under certain conditions can solve an \(N\)-variable linear system with complexity \(O(\log_2 N)\). The quantum matrix-inversion algorithm can accelerate many machine learning methods, such as least-squares linear regression, the least-squares support vector machine, Gaussian processes, and more; the training of these algorithms can be reduced to solving linear systems of equations. The key bottleneck of this type of quantum machine learning algorithm is data input, that is, how to initialize the quantum system with the entire data set. Although efficient data-input algorithms exist for certain situations, how to efficiently input data into a quantum system is as yet unknown for most cases.
Quantum reinforcement learning#
In quantum reinforcement learning, a quantum agent interacts with the classical environment to obtain rewards from the environment, so as to adjust and improve its behavioral strategies. In some cases, it achieves quantum acceleration through the quantum processing capabilities of the agent or through the possibility of exploring the environment in quantum superposition. Such algorithms have been proposed for superconducting circuits and trapped-ion systems.
Quantum deep learning#
Dedicated quantum information processors, such as quantum annealers and programmable photonic circuits, are well suited for building deep quantum networks. The simplest deep quantum network is the Boltzmann machine. The classical Boltzmann machine consists of bits with tunable interactions and is trained by adjusting these interactions so that the distribution it expresses matches the statistics of the data. To quantize the Boltzmann machine, one can simply represent the neural network as a set of interacting quantum spins, corresponding to an adjustable Ising model. Then, by initializing the input neurons of the Boltzmann machine in a fixed state and allowing the system to thermalize, we can read out the output qubits to obtain the result.
Social machine learning#
Machine learning aims to imitate how humans learn. While we have developed successful machine learning algorithms, we have so far ignored one important fact: humans are social. Each of us is part of a society, and it is difficult for us to live, learn, and improve alone and in isolation. Therefore, we should design machines with social properties. Can we let machines evolve by imitating human society, and thereby achieve a more effective, intelligent, and interpretable “social machine learning”?
And much more.
The last words?#
The computer scientist Alan Kay said, “The best way to predict the future is to create it.” Therefore, all machine learning practitioners, whether scholars or engineers, professors or students, need to work together to advance these important research topics. Together, we will not just predict the future, but create it.
Best wishes to you all and thanks so much for your heroic efforts this semester#