Machine Learning and Boltzmann machines with applications

The approaches to machine learning are many, but they are often split into two main categories. In supervised learning we know the answer to a problem, and let the computer deduce the logic behind it. On the other hand, unsupervised learning is a method for finding patterns and relationships in data sets without any prior knowledge of the system. Some authors also use a third category, namely reinforcement learning. This is a paradigm of learning inspired by behavioural psychology, where learning is achieved by trial and error, solely from rewards and punishment.

Another way to categorize machine learning tasks is to consider the desired output of a system. Some of the most common tasks are:

Classification: Outputs are divided into two or more classes. The goal is to produce a model that assigns inputs into one of these classes. An example is to identify digits based on pictures of hand-written ones. Classification is typically supervised learning.
Regression: Finding a functional relationship between an input data set and a reference data set. The goal is to construct a function that maps input data to continuous output values.
Clustering: Data are divided into groups with certain common traits, without knowing the different groups beforehand. It is thus a form of unsupervised learning.
Other unsupervised learning algorithms, here Boltzmann machines

Why Boltzmann machines?

Restricted Boltzmann machines (RBMs) have received a lot of attention lately. One of the major reasons is that they can be stacked layer-wise to build deep neural networks that capture complicated statistics.

The original RBMs had just one visible layer and a hidden layer, but recently so-called Gaussian-binary RBMs have gained quite some popularity in imaging since they are capable of modeling continuous data that are common to natural images.

Furthermore, they have been used to solve complicated quantum mechanical many-particle problems or classical statistical physics problems like the Ising and Potts classes of models.

An intermediate step, the Hopfield network and links to the Ising and Potts models

A brief review on Markov Chains, Metropolis and Gibbs sampling

Brownian motion and Markov processes

The Markov process is used repeatedly in Monte Carlo simulations in order to generate new random states.

The reason for choosing a Markov process is that when it is run for a long enough time starting with a random state, we will eventually reach the most likely state of the system.

In thermodynamics, this means that after a certain number of Markov steps we reach an equilibrium distribution.

This mimics the way a real system reaches its most likely state at a given temperature of the surroundings.

Brownian motion and Markov processes, Ergodicity and Detailed balance

To reach this distribution, the Markov process needs to obey two important conditions, that of ergodicity and that of detailed balance. These conditions in turn impose constraints on our algorithms for accepting or rejecting new random states.

The Metropolis algorithm is widely used in Monte Carlo simulations, and understanding it rests on the interpretation of random walks and Markov processes.
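As a minimal sketch of how detailed balance constrains the acceptance rule, the code below builds the transition matrix of a Metropolis chain for a small discrete system and checks the condition exactly. The text here does not spell out the Metropolis rule, so the standard acceptance probability $ \min(1, p_i/p_j) $ is assumed. The three-state target distribution `p`, the uniform symmetric proposal and the helper name `metropolis_matrix` are hypothetical choices made only for illustration. The convention is that `W[i][j]` is the probability of moving from state $ j $ to state $ i $.

```python
from fractions import Fraction as F

def metropolis_matrix(p):
    """Transition matrix W[i][j] = W(j -> i) of a Metropolis chain with target p."""
    n = len(p)
    q = F(1, n - 1)  # assumed proposal: pick any other state with equal probability
    W = [[F(0)] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            if i != j:
                # Propose j -> i, then accept with probability min(1, p_i / p_j).
                W[i][j] = q * min(F(1), p[i] / p[j])
        # Rejected proposals leave the walker where it is.
        W[j][j] = 1 - sum(W[i][j] for i in range(n))
    return W

# Hypothetical target distribution, used only for illustration.
p = [F(1, 2), F(1, 3), F(1, 6)]
W = metropolis_matrix(p)

# Detailed balance: W(j -> i) p_j == W(i -> j) p_i for every pair of states.
balanced = all(W[i][j] * p[j] == W[j][i] * p[i] for i in range(3) for j in range(3))
print("detailed balance holds:", balanced)
```

Detailed balance implies that `p` is left unchanged by one step of the chain, which is why it can serve as the equilibrium distribution.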

Brownian motion and Markov processes, jargon

In a random walk one defines a mathematical entity called a walker, whose attributes completely define the state of the system in question.

The state of the system can refer to any physical quantities, from the vibrational state of a molecule specified by a set of quantum numbers, to the brands of coffee in your favourite supermarket.

The walker moves in an appropriate state space by a combination of deterministic and random displacements from its previous position.

Brownian motion and Markov processes, sequence of ingredients

Applications: almost every field in science

Markov processes

A Markov process allows in principle for a microscopic description of Brownian motion. As with the random walk studied in the previous section, we consider a particle which moves along the $ x $-axis in the form of a series of jumps with step length $ \Delta x = l $. Time and space are discretized and the subsequent moves are statistically independent, i.e., the new move depends only on the previous step and not on the results from earlier trials. We start at a position $ x=jl=j\Delta x $ and move to a new position $ x =i\Delta x $ during a step $ \Delta t=\epsilon $, where $ i\ge 0 $ and $ j\ge 0 $ are integers. The original probability distribution function (PDF) of the particles is given by $ w_i(t=0) $ where $ i $ refers to a specific position on the grid in

Markov processes

For the Markov process we have a transition probability from a position $ x=jl $ to a position $ x=il $ given by $$ \begin{equation*} W_{ij}(\epsilon)=W(il-jl,\epsilon)=\left\{\begin{array}{cc}\frac{1}{2} & |i-j| = 1\\ 0 & \mathrm{else} \end{array} \right. , \end{equation*} $$ where $ W_{ij} $ is normally called the transition probability and we can represent it, see below, as a matrix. Here we have specialized to a case where the transition probability is known.

Our new PDF $ w_i(t=\epsilon) $ is now related to the PDF at $ t=0 $ through the relation $$ \begin{equation*} w_i(t=\epsilon) =\sum_{j} W(j\rightarrow i)w_j(t=0). \end{equation*} $$

This equation represents the discretized time-development of an original PDF with equal probability of jumping left or right.
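A minimal sketch of this time development is given below. The update rule is the one above with $ W(j\rightarrow i)=1/2 $ for $ |i-j|=1 $. The periodic boundary conditions, the seven-site grid and the function name `random_walk_step` are assumptions made here to keep the example finite; they are not part of the text.

```python
from fractions import Fraction

def random_walk_step(w):
    """Apply w_i(t + eps) = sum_j W(j -> i) w_j(t) with W = 1/2 for |i - j| = 1."""
    n = len(w)
    half = Fraction(1, 2)
    # Periodic boundaries (an assumption) keep the total probability equal to one.
    return [half * (w[(i - 1) % n] + w[(i + 1) % n]) for i in range(n)]

# All probability starts on the middle site of a seven-site ring.
w = [Fraction(0)] * 7
w[3] = Fraction(1)
for _ in range(2):
    w = random_walk_step(w)
print([str(p) for p in w])
```

After two steps the probability has spread to the sites two steps away and partly returned to the starting site, while the sum over all sites stays equal to one.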

Markov processes, the probabilities

Since both $ W $ and $ w $ represent probabilities, they have to be normalized, i.e., we require that at each time step we have $$ \begin{equation*} \sum_i w_i(t) = 1, \end{equation*} $$ and $$ \begin{equation*} \sum_i W(j\rightarrow i) = 1, \end{equation*} $$ which applies for all $ j $-values, that is, every column of the matrix $ W_{ij}=W(j\rightarrow i) $ sums to one. The further constraints are $ 0 \le W_{ij} \le 1 $ and $ 0 \le w_{j} \le 1 $. Note that the probability of remaining at the same place is not necessarily zero.
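These constraints are easy to check numerically. The sketch below assumes the convention $ W_{ij}=W(j\rightarrow i) $, so that the probabilities of leaving state $ j $ sit in column $ j $. The helper name `is_column_stochastic` and the two-state matrix `W_demo` are hypothetical and serve only as an illustration.

```python
from fractions import Fraction as F

def is_column_stochastic(W):
    """Return True if all entries lie in [0, 1] and every column sums to one."""
    n = len(W)
    entries_ok = all(0 <= W[i][j] <= 1 for i in range(n) for j in range(n))
    columns_ok = all(sum(W[i][j] for i in range(n)) == 1 for j in range(n))
    return entries_ok and columns_ok

# Hypothetical two-state chain; the diagonal entries are the probabilities of staying put.
W_demo = [
    [F(1, 2), F(1, 3)],
    [F(1, 2), F(2, 3)],
]
# Transposing it breaks the column normalization.
W_transposed = [[W_demo[j][i] for j in range(2)] for i in range(2)]

print(is_column_stochastic(W_demo))
print(is_column_stochastic(W_transposed))
```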

Markov processes

The time development of our initial PDF can now be represented through the action of the transition probability matrix applied $ n $ times. At a time $ t_n=n\epsilon $ our initial distribution has developed into $$ \begin{equation*} w_i(t_n) = \sum_jW_{ij}(t_n)w_j(0), \end{equation*} $$ and defining $$ \begin{equation*} W(il-jl,n\epsilon)=(W^n(\epsilon))_{ij} \end{equation*} $$ we obtain $$ \begin{equation*} w_i(n\epsilon) = \sum_j(W^n(\epsilon))_{ij}w_j(0), \end{equation*} $$ or in matrix form $$ \begin{equation} \label{eq:wfinal} \hat{w}(n\epsilon) = \hat{W}^n(\epsilon)\hat{w}(0). \end{equation} $$
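The equivalence between $ n $ single steps and one application of $ \hat{W}^n $ can be sketched as follows. The two-state matrix and the helper names `mat_vec`, `mat_mul` and `mat_pow` are hypothetical; exact fractions are used so that the two results can be compared without rounding.

```python
from fractions import Fraction as F

def mat_vec(A, v):
    """Matrix-vector product A v."""
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]

def mat_mul(A, B):
    """Matrix-matrix product A B for square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_pow(A, n):
    """Return A^n by repeated multiplication, starting from the identity."""
    size = len(A)
    result = [[F(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, A)
    return result

# Hypothetical two-state chain with columns that sum to one.
W = [[F(1, 2), F(1, 3)],
     [F(1, 2), F(2, 3)]]
w0 = [F(1), F(0)]
n = 3

# Route 1: apply W one step at a time.
w_steps = w0
for _ in range(n):
    w_steps = mat_vec(W, w_steps)

# Route 2: build W^n first, then apply it once.
w_power = mat_vec(mat_pow(W, n), w0)

print([str(x) for x in w_steps], [str(x) for x in w_power])
```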

An Illustrative Example

The following simple example may help in understanding the meaning of the transition matrix $ \hat{W} $ and the vector $ \hat{w} $. Consider the $ 4\times 4 $ matrix $ \hat{W} $ $$ \begin{equation*} \hat{W} = \left(\begin{array}{cccc} 1/4 & 1/9 & 3/8 & 1/3 \\ 2/4 & 2/9 & 0 & 1/3\\ 0 & 1/9 & 3/8 & 0\\ 1/4 & 5/9& 2/8 & 1/3 \end{array} \right), \end{equation*} $$ and we choose our initial state as $$ \begin{equation*} \hat{w}(t=0)= \left(\begin{array}{c} 1\\ 0\\ 0 \\ 0 \end{array} \right). \end{equation*} $$

An Illustrative Example

We note that both the vector and the matrix are properly normalized. Summing the vector elements gives one and summing over columns for the matrix results also in one. Furthermore, the largest eigenvalue is one. We act then on $ \hat{w} $ with $ \hat{W} $. The first iteration is $$ \begin{equation*} \hat{w}(t=\epsilon) = \hat{W}\hat{w}(t=0), \end{equation*} $$

resulting in $$ \begin{equation*} \hat{w}(t=\epsilon)= \left(\begin{array}{c} 1/4\\ 1/2 \\ 0 \\ 1/4 \end{array} \right). \end{equation*} $$

An Illustrative Example, next step

The next iteration results in $$ \begin{equation*} \hat{w}(t=2\epsilon) = \hat{W}\hat{w}(t=\epsilon), \end{equation*} $$

resulting in $$ \begin{equation*} \hat{w}(t=2\epsilon)= \left(\begin{array}{c} 0.201389\\ 0.319444 \\ 0.055556 \\ 0.423611 \end{array} \right). \end{equation*} $$ Note that the vector $ \hat{w} $ is always normalized to $ 1 $.
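The two iterations can be reproduced with a few lines of code. This is a sketch using exact fractions from the Python standard library; the function name `step` is a choice made here.

```python
from fractions import Fraction as F

# Transition matrix from the example; column j holds the probabilities W(j -> i).
W = [
    [F(1, 4), F(1, 9), F(3, 8), F(1, 3)],
    [F(2, 4), F(2, 9), F(0),    F(1, 3)],
    [F(0),    F(1, 9), F(3, 8), F(0)],
    [F(1, 4), F(5, 9), F(2, 8), F(1, 3)],
]

def step(W, w):
    """Return W w, one time step of the Markov chain."""
    return [sum(W[i][j] * w[j] for j in range(len(w))) for i in range(len(W))]

w0 = [F(1), F(0), F(0), F(0)]
w1 = step(W, w0)
w2 = step(W, w1)

print([str(x) for x in w1])
print([str(x) for x in w2])
print([round(float(x), 6) for x in w2])
```

Working with fractions makes it explicit that the normalization is preserved exactly at every step, not only to rounding accuracy.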

An Illustrative Example, the steady state

We find the steady state of the system by solving the set of equations $$ \begin{equation*} w(t=\infty) = Ww(t=\infty), \end{equation*} $$ which is an eigenvalue problem with eigenvalue equal to one! This set of equations reads $$ \begin{align} W_{11}w_1(t=\infty) +W_{12}w_2(t=\infty) +W_{13}w_3(t=\infty)+ W_{14}w_4(t=\infty)=&w_1(t=\infty) \nonumber \\ W_{21}w_1(t=\infty) + W_{22}w_2(t=\infty) + W_{23}w_3(t=\infty)+ W_{24}w_4(t=\infty)=&w_2(t=\infty) \nonumber \\ W_{31}w_1(t=\infty) + W_{32}w_2(t=\infty) + W_{33}w_3(t=\infty)+ W_{34}w_4(t=\infty)=&w_3(t=\infty) \nonumber \\ W_{41}w_1(t=\infty) + W_{42}w_2(t=\infty) + W_{43}w_3(t=\infty)+ W_{44}w_4(t=\infty)=&w_4(t=\infty) \nonumber \\ \label{_auto1} \end{align} $$ with the constraint that $$ \begin{equation*} \sum_i w_i(t=\infty) = 1, \end{equation*} $$ yielding as solution $$ \begin{equation*} \hat{w}(t=\infty)= \left(\begin{array}{c}0.244318 \\ 0.319602 \\ 0.056818 \\ 0.379261 \end{array} \right). \end{equation*} $$
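One way to solve this set of equations is sketched below. Since the columns of $ \hat{W} $ sum to one, the four equations are linearly dependent, so the last one is replaced by the normalization constraint before solving. The use of Gauss-Jordan elimination with exact fractions and the function name `steady_state` are choices made here, not something prescribed by the text.

```python
from fractions import Fraction as F

W = [
    [F(1, 4), F(1, 9), F(3, 8), F(1, 3)],
    [F(2, 4), F(2, 9), F(0),    F(1, 3)],
    [F(0),    F(1, 9), F(3, 8), F(0)],
    [F(1, 4), F(5, 9), F(2, 8), F(1, 3)],
]

def steady_state(W):
    """Solve (W - 1) w = 0 together with sum_i w_i = 1."""
    n = len(W)
    A = [[W[i][j] - (1 if i == j else 0) for j in range(n)] for i in range(n)]
    b = [F(0)] * n
    # Replace the last (redundant) equation by the normalization constraint.
    A[n - 1] = [F(1)] * n
    b[n - 1] = F(1)
    # Gauss-Jordan elimination; assumes the system has a unique solution.
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
                b[r] -= factor * b[col]
    return [b[i] / A[i][i] for i in range(n)]

w_inf = steady_state(W)
print([str(x) for x in w_inf])
print([round(float(x), 6) for x in w_inf])
```

The rounded values should agree with the solution quoted above.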

An Illustrative Example, iterative steps

The table here demonstrates the convergence as a function of the number of iterations or time steps. After twelve iterations we have reached the exact value with six leading digits.

$$ \begin{array}{ccccc}
\mathrm{Iteration} & w_1 & w_2 & w_3 & w_4 \\ \hline
0 & 1.000000 & 0.000000 & 0.000000 & 0.000000 \\
1 & 0.250000 & 0.500000 & 0.000000 & 0.250000 \\
2 & 0.201389 & 0.319444 & 0.055556 & 0.423611 \\
3 & 0.247878 & 0.312886 & 0.056327 & 0.382909 \\
4 & 0.245494 & 0.321106 & 0.055888 & 0.377513 \\
5 & 0.243847 & 0.319941 & 0.056636 & 0.379575 \\
6 & 0.244274 & 0.319547 & 0.056788 & 0.379391 \\
7 & 0.244333 & 0.319611 & 0.056801 & 0.379255 \\
8 & 0.244314 & 0.319610 & 0.056813 & 0.379264 \\
9 & 0.244317 & 0.319603 & 0.056817 & 0.379264 \\
10 & 0.244318 & 0.319602 & 0.056818 & 0.379262 \\
11 & 0.244318 & 0.319602 & 0.056818 & 0.379261 \\
12 & 0.244318 & 0.319602 & 0.056818 & 0.379261 \\ \hline
\hat{w}(t=\infty) & 0.244318 & 0.319602 & 0.056818 & 0.379261
\end{array} $$
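The iterations in the table can be generated with a short loop. This is a sketch in floating-point arithmetic; the function name `iterate` and the output format are choices made here.

```python
# Transition matrix from the example, stored as floats.
W = [
    [1 / 4, 1 / 9, 3 / 8, 1 / 3],
    [2 / 4, 2 / 9, 0.0,   1 / 3],
    [0.0,   1 / 9, 3 / 8, 0.0],
    [1 / 4, 5 / 9, 2 / 8, 1 / 3],
]

def iterate(W, w, steps):
    """Return the list [w(0), w(eps), ..., w(steps * eps)]."""
    history = [w]
    for _ in range(steps):
        w = [sum(W[i][j] * w[j] for j in range(len(w))) for i in range(len(W))]
        history.append(w)
    return history

history = iterate(W, [1.0, 0.0, 0.0, 0.0], 12)
for t, w in enumerate(history):
    # One row per iteration, six decimals per component.
    print(t, " ".join(f"{x:.6f}" for x in w))
```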

An Illustrative Example, what does it mean?

We have after $ t $-steps $$ \begin{equation*} \hat{w}(t) = \hat{W}^t\hat{w}(0), \end{equation*} $$ with $ \hat{w}(0) $ the distribution at $ t=0 $ and $ \hat{W} $ representing the transition probability matrix.

An Illustrative Example, understanding the basics

We can always expand $ \hat{w}(0) $ in terms of the right eigenvectors $ \hat{v} $ of $ \hat{W} $ as $$ \begin{equation*} \hat{w}(0) = \sum_i\alpha_i\hat{v}_i, \end{equation*} $$ resulting in $$ \begin{equation*} \hat{w}(t) = \hat{W}^t\hat{w}(0)=\hat{W}^t\sum_i\alpha_i\hat{v}_i= \sum_i\lambda_i^t\alpha_i\hat{v}_i, \end{equation*} $$ with $ \lambda_i $ the $ i^{\mathrm{th}} $ eigenvalue corresponding to the eigenvector $ \hat{v}_i $.

If we assume that $ \lambda_0 $ is the largest eigenvalue, which for our transition matrix equals one, we see that in the limit $ t\rightarrow \infty $, $ \hat{w}(t) $ becomes proportional to the corresponding eigenvector $ \hat{v}_0 $. This is our steady state or final distribution.
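To make the limit explicit, we can split off the term belonging to $ \lambda_0 $. The following worked step assumes that $ \lambda_0=1 $ is non-degenerate and that all other eigenvalues satisfy $ |\lambda_i| < 1 $, which holds for an ergodic chain; this assumption is added here and is not stated in the expansion above. $$ \begin{equation*} \hat{w}(t) = \alpha_0\hat{v}_0+\sum_{i\ne 0}\lambda_i^t\alpha_i\hat{v}_i \rightarrow \alpha_0\hat{v}_0 \quad \mathrm{as} \quad t\rightarrow \infty, \end{equation*} $$ since $ \lambda_i^t\rightarrow 0 $ for every $ i\ne 0 $. The deviation from the steady state then decays at a rate set by the largest $ |\lambda_i| $ with $ i\ne 0 $, which explains why only a handful of iterations were needed in the example.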