{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!-- dom:TITLE: Decision trees, Random Forests, Bagging and Boosting -->\n",
    "# Decision trees, Random Forests, Bagging and Boosting\n",
    "<!-- dom:AUTHOR: Morten Hjorth-Jensen at Department of Physics, University of Oslo & Department of Physics and Astronomy and National Superconducting Cyclotron Laboratory, Michigan State University -->\n",
    "<!-- Author: -->  \n",
    "**Morten Hjorth-Jensen**, Department of Physics, University of Oslo and Department of Physics and Astronomy and National Superconducting Cyclotron Laboratory, Michigan State University\n",
    "\n",
    "Date: **Feb 14, 2021**\n",
    "\n",
    "Copyright 1999-2021, Morten Hjorth-Jensen. Released under CC Attribution-NonCommercial 4.0 license\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "## Decision trees, overarching aims\n",
    "\n",
    "\n",
    "We start here with the most basic algorithm, the so-called decision\n",
    "tree. With this basic algorithm we can in turn build more complex\n",
    "networks, spanning from homogeneous and heterogenous forests (bagging,\n",
    "random forests and more) to one of the most popular supervised\n",
    "algorithms nowadays, the extreme gradient boosting, or just\n",
    "XGBoost. But let us start with the simplest possible ingredient.\n",
    "\n",
    "Decision trees are supervised learning algorithms used for both,\n",
    "classification and regression tasks.\n",
    "\n",
    "\n",
    "The main idea of decision trees\n",
    "is to find those descriptive features which contain the most\n",
    "**information** regarding the target feature and then split the dataset\n",
    "along the values of these features such that the target feature values\n",
    "for the resulting underlying datasets are as pure as possible.\n",
    "\n",
    "The descriptive features which reproduce best the target/output features are normally  said\n",
    "to be the most informative ones. The process of finding the **most\n",
    "informative** feature is done until we accomplish a stopping criteria\n",
    "where we then finally end up in so called **leaf nodes**. \n",
    "\n",
    "## Basics of a tree\n",
    "\n",
    "A decision tree is typically divided into a **root node**, the **interior nodes**,\n",
    "and the final **leaf nodes** or just **leaves**. These entities are then connected by so-called **branches**.\n",
    "\n",
    "The leaf nodes\n",
    "contain the predictions we will make for new query instances presented\n",
    "to our trained model. This is possible since the model has \n",
    "learned the underlying structure of the training data and hence can,\n",
    "given some assumptions, make predictions about the target feature value\n",
    "(class) of unseen query instances.\n",
    "\n",
    "## A Sketch of a Tree, Regression problem\n",
    "\n",
    "\n",
    "\n",
    "<!-- FIGURE: [DataFiles/Regsimpletree.png, width=600 frac=0.8] -->\n",
    "\n",
    "## A Sketch of a Tree, Classification  problem\n",
    "\n",
    "<!-- FIGURE: [DataFiles/Classimpletree.png, width=600 frac=0.8] -->\n",
    "\n",
    "\n",
    "\n",
    "## A typical Decision Tree with its pertinent Jargon, Classification Problem\n",
    "\n",
    "<!-- dom:FIGURE: [DataFiles/cancer.png, width=600 frac=0.8] -->\n",
    "<!-- begin figure -->\n",
    "<img src=\"DataFiles/cancer.png\" width=600><p style=\"font-size: 0.9em\"><i>Figure 1: </i></p><!-- end figure -->\n",
    "\n",
    "\n",
    "This tree was produced using the Wisconsin cancer data (discussed here as well, see code examples below) using **Scikit-Learn**'s decision tree classifier. Here we have used the so-called **gini** index (see below) to split the various branches.\n",
    "\n",
    "\n",
    "\n",
    "## General Features\n",
    "\n",
    "The overarching approach to decision trees is a top-down approach.\n",
    "\n",
    "* A leaf provides the classification of a given instance.\n",
    "\n",
    "* A node specifies a test of some attribute of the instance.\n",
    "\n",
    "* A branch corresponds to a possible values of an attribute.\n",
    "\n",
    "* An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute in the given example.\n",
    "\n",
    "This process is then repeated for the subtree rooted at the new\n",
    "node.\n",
    "\n",
    "\n",
    "## How do we set it up?\n",
    "\n",
    "\n",
    "In simplified terms, the process of training a decision tree and\n",
    "predicting the target features of query instances is as follows:\n",
    "\n",
    "1. Present a dataset containing of a number of training instances characterized by a number of descriptive features and a target feature\n",
    "\n",
    "2. Train the decision tree model by continuously splitting the target feature along the values of the descriptive features using a measure of information gain during the training process\n",
    "\n",
    "3. Grow the tree until we accomplish a stopping criteria create leaf nodes which represent the *predictions* we want to make for new query instances\n",
    "\n",
    "4. Show query instances to the tree and run down the tree until we arrive at leaf nodes\n",
    "\n",
    "Then we are essentially done!\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "## Decision trees and Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "steps=250\n",
    "\n",
    "distance=0\n",
    "x=0\n",
    "distance_list=[]\n",
    "steps_list=[]\n",
    "while x<steps:\n",
    "    distance+=np.random.randint(-1,2)\n",
    "    distance_list.append(distance)\n",
    "    x+=1\n",
    "    steps_list.append(x)\n",
    "plt.plot(steps_list,distance_list, color='green', label=\"Random Walk Data\")\n",
    "\n",
    "steps_list=np.asarray(steps_list)\n",
    "distance_list=np.asarray(distance_list)\n",
    "\n",
    "X=steps_list[:,np.newaxis]\n",
    "\n",
    "#Polynomial fits\n",
    "\n",
    "#Degree 2\n",
    "poly_features=PolynomialFeatures(degree=2, include_bias=False)\n",
    "X_poly=poly_features.fit_transform(X)\n",
    "\n",
    "lin_reg=LinearRegression()\n",
    "poly_fit=lin_reg.fit(X_poly,distance_list)\n",
    "b=lin_reg.coef_\n",
    "c=lin_reg.intercept_\n",
    "print (\"2nd degree coefficients:\")\n",
    "print (\"zero power: \",c)\n",
    "print (\"first power: \", b[0])\n",
    "print (\"second power: \",b[1])\n",
    "\n",
    "z = np.arange(0, steps, .01)\n",
    "z_mod=b[1]*z**2+b[0]*z+c\n",
    "\n",
    "fit_mod=b[1]*X**2+b[0]*X+c\n",
    "plt.plot(z, z_mod, color='r', label=\"2nd Degree Fit\")\n",
    "plt.title(\"Polynomial Regression\")\n",
    "\n",
    "plt.xlabel(\"Steps\")\n",
    "plt.ylabel(\"Distance\")\n",
    "\n",
    "#Degree 10\n",
    "poly_features10=PolynomialFeatures(degree=10, include_bias=False)\n",
    "X_poly10=poly_features10.fit_transform(X)\n",
    "\n",
    "poly_fit10=lin_reg.fit(X_poly10,distance_list)\n",
    "\n",
    "y_plot=poly_fit10.predict(X_poly10)\n",
    "plt.plot(X, y_plot, color='black', label=\"10th Degree Fit\")\n",
    "\n",
    "plt.legend()\n",
    "plt.show()\n",
    "\n",
    "\n",
    "#Decision Tree Regression\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "regr_1=DecisionTreeRegressor(max_depth=2)\n",
    "regr_2=DecisionTreeRegressor(max_depth=5)\n",
    "regr_3=DecisionTreeRegressor(max_depth=7)\n",
    "regr_1.fit(X, distance_list)\n",
    "regr_2.fit(X, distance_list)\n",
    "regr_3.fit(X, distance_list)\n",
    "\n",
    "X_test = np.arange(0.0, steps, 0.01)[:, np.newaxis]\n",
    "y_1 = regr_1.predict(X_test)\n",
    "y_2 = regr_2.predict(X_test)\n",
    "y_3=regr_3.predict(X_test)\n",
    "\n",
    "# Plot the results\n",
    "plt.figure()\n",
    "plt.scatter(X, distance_list, s=2.5, c=\"black\", label=\"data\")\n",
    "plt.plot(X_test, y_1, color=\"red\",\n",
    "         label=\"max_depth=2\", linewidth=2)\n",
    "plt.plot(X_test, y_2, color=\"green\", label=\"max_depth=5\", linewidth=2)\n",
    "plt.plot(X_test, y_3, color=\"m\", label=\"max_depth=7\", linewidth=2)\n",
    "\n",
    "plt.xlabel(\"Data\")\n",
    "plt.ylabel(\"Darget\")\n",
    "plt.title(\"Decision Tree Regression\")\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building a tree, regression\n",
    "\n",
    "There are mainly two steps\n",
    "1. We split the predictor space (the set of possible values $x_1,x_2,\\dots, x_p$) into $J$ distinct and non-non-overlapping regions, $R_1,R_2,\\dots,R_J$.  \n",
    "\n",
    "2. For every observation that falls into the region $R_j$ , we make the same prediction, which is simply the mean of the response values for the training observations in $R_j$.\n",
    "\n",
    "How do we construct the regions $R_1,\\dots,R_J$?  In theory, the\n",
    "regions could have any shape. However, we choose to divide the\n",
    "predictor space into high-dimensional rectangles, or boxes, for\n",
    "simplicity and for ease of interpretation of the resulting predictive\n",
    "model. The goal is to find boxes $R_1,\\dots,R_J$ that minimize the\n",
    "MSE, given by"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\sum_{j=1}^J\\sum_{i\\in R_j}(y_i-\\overline{y}_{R_j})^2,\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "where $\\overline{y}_{R_j}$  is the mean response for the training observations \n",
    "within box $j$. \n",
    "\n",
    "## A top-down approach, recursive binary splitting\n",
    "\n",
    "Unfortunately, it is computationally infeasible to consider every\n",
    "possible partition of the feature space into $J$ boxes.  The common\n",
    "strategy is to take a top-down approach\n",
    "\n",
    "The approach is top-down because it begins at the top of the tree (all\n",
    "observations belong to a single region) and then successively splits\n",
    "the predictor space; each split is indicated via two new branches\n",
    "further down on the tree. It is greedy because at each step of the\n",
    "tree-building process, the best split is made at that particular step,\n",
    "rather than looking ahead and picking a split that will lead to a\n",
    "better tree in some future step.\n",
    "\n",
    "## Making a tree\n",
    "\n",
    "In order to implement the recursive binary splitting we start by selecting\n",
    "the predictor $x_j$ and a cutpoint $s$ that splits the predictor space into two regions $R_1$ and $R_2$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\left\\{X\\vert x_j < s\\right\\},\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\left\\{X\\vert x_j \\geq s\\right\\},\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "so that we obtain the lowest MSE, that is"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\sum_{i:x_i\\in R_j}(y_i-\\overline{y}_{R_1})^2+\\sum_{i:x_i\\in R_2}(y_i-\\overline{y}_{R_2})^2,\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "which we want to minimize by considering all predictors\n",
    "$x_1,x_2,\\dots,x_p$.  We consider also all possible values of $s$ for\n",
    "each predictor. These values could be determined by randomly assigned\n",
    "numbers or by starting at the midpoint and then proceed till we find\n",
    "an optimal value.\n",
    "\n",
    "For any $j$ and $s$, we define the pair of half-planes where\n",
    "$\\overline{y}_{R_1}$ is the mean response for the training\n",
    "observations in $R_1(j,s)$, and $\\overline{y}_{R_2}$ is the mean\n",
    "response for the training observations in $R_2(j,s)$.\n",
    "\n",
    "Finding the values of $j$ and $s$ that minimize the above equation can be\n",
    "done quite quickly, especially when the number of features $p$ is not\n",
    "too large.\n",
    "\n",
    "Next, we repeat the process, looking\n",
    "for the best predictor and best cutpoint in order to split the data\n",
    "further so as to minimize the MSE within each of the resulting\n",
    "regions. However, this time, instead of splitting the entire predictor\n",
    "space, we split one of the two previously identified regions. We now\n",
    "have three regions. Again, we look to split one of these three regions\n",
    "further, so as to minimize the MSE. The process continues until a\n",
    "stopping criterion is reached; for instance, we may continue until no\n",
    "region contains more than five observations.\n",
    "\n",
    "<!-- !split  -->\n",
    "## Pruning the tree\n",
    "\n",
    "The above procedure is rather straightforward, but leads often to\n",
    "overfitting and unnecessarily large and complicated trees. The basic\n",
    "idea is to grow a large tree $T_0$ and then prune it back in order to\n",
    "obtain a subtree. A smaller tree with fewer splits (fewer regions) can\n",
    "lead to smaller variance and better interpretation at the cost of a\n",
    "little more bias.\n",
    "\n",
    "The so-called Cost complexity pruning algorithm gives us a\n",
    "way to do just this. Rather than considering every possible subtree,\n",
    "we consider a sequence of trees indexed by a nonnegative tuning\n",
    "parameter $\\alpha$.\n",
    "\n",
    "Read more at the following [Scikit-Learn link on pruning](https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html#sphx-glr-auto-examples-tree-plot-cost-complexity-pruning-py).\n",
    "\n",
    "## Cost complexity pruning\n",
    "\n",
    "For each value of $\\alpha$  there corresponds a subtree $T \\in T_0$ such that"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\sum_{m=1}^{\\overline{T}}\\sum_{i:x_i\\in R_m}(y_i-\\overline{y}_{R_m})^2+\\alpha\\overline{T},\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "is as small as possible. Here $\\overline{T}$ is \n",
    "the number of terminal nodes of the tree $T$ , $R_m$ is the\n",
    "rectangle (i.e. the subset of predictor space)  corresponding to the $m$-th terminal node.\n",
    "\n",
    "The tuning parameter $\\alpha$ controls a trade-off between the subtree’s\n",
    "complexity and its fit to the training data. When $\\alpha = 0$, then the\n",
    "subtree $T$ will simply equal $T_0$, \n",
    "because then the above equation just measures the\n",
    "training error. \n",
    "However, as $\\alpha$ increases, there is a price to pay for\n",
    "having a tree with many terminal nodes. The above equation will\n",
    "tend to be minimized for a smaller subtree. \n",
    "\n",
    "\n",
    "It turns out that as we increase $\\alpha$ from zero\n",
    "branches get pruned from the tree in a nested and predictable fashion,\n",
    "so obtaining the whole sequence of subtrees as a function of $\\alpha$ is\n",
    "easy. We can select a value of $\\alpha$ using a validation set or using\n",
    "cross-validation. We then return to the full data set and obtain the\n",
    "subtree corresponding to $\\alpha$. \n",
    "\n",
    "\n",
    "## Schematic Regression Procedure\n",
    "\n",
    "**Building a Regression Tree.**\n",
    "\n",
    "\n",
    "1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.\n",
    "\n",
    "2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of $\\alpha$.\n",
    "\n",
    "3. Use for example $K$-fold cross-validation to choose $\\alpha$. Divide the training observations into $K$ folds. For each $k=1,2,\\dots,K$ we: \n",
    "\n",
    "  * repeat steps 1 and 2 on all but the $k$-th fold of the training data. \n",
    "\n",
    "  * Then we valuate the mean squared prediction error on the data in the left-out $k$-th fold, as a function of $\\alpha$.\n",
    "\n",
    "  * Finally  we average the results for each value of $\\alpha$, and pick $\\alpha$ to minimize the average error.\n",
    "\n",
    "\n",
    "4. Return the subtree from Step 2 that corresponds to the chosen value of $\\alpha$.\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "## A Classification Tree\n",
    "\n",
    "A classification tree is very similar to a regression tree, except\n",
    "that it is used to predict a qualitative response rather than a\n",
    "quantitative one. Recall that for a regression tree, the predicted\n",
    "response for an observation is given by the mean response of the\n",
    "training observations that belong to the same terminal node. In\n",
    "contrast, for a classification tree, we predict that each observation\n",
    "belongs to the most commonly occurring class of training observations\n",
    "in the region to which it belongs. In interpreting the results of a\n",
    "classification tree, we are often interested not only in the class\n",
    "prediction corresponding to a particular terminal node region, but\n",
    "also in the class proportions among the training observations that\n",
    "fall into that region.  \n",
    "\n",
    "## Growing a classification tree\n",
    "\n",
    "The task of growing a\n",
    "classification tree is quite similar to the task of growing a\n",
    "regression tree. Just as in the regression setting, we use recursive\n",
    "binary splitting to grow a classification tree. However, in the\n",
    "classification setting, the MSE cannot be used as a criterion for making\n",
    "the binary splits.  A natural alternative to MSE is the **classification\n",
    "error rate**. Since we plan to assign an observation in a given region\n",
    "to the most commonly occurring error rate class of training\n",
    "observations in that region, the classification error rate is simply\n",
    "the fraction of the training observations in that region that do not\n",
    "belong to the most common class. \n",
    "\n",
    "When building a classification tree, either the Gini index or the\n",
    "entropy are typically used to evaluate the quality of a particular\n",
    "split, since these two approaches are more sensitive to node purity\n",
    "than is the classification error rate. \n",
    "\n",
    "\n",
    "## Classification tree, how to split nodes\n",
    "\n",
    "If our targets are the outcome of a classification process that takes\n",
    "for example $k=1,2,\\dots,K$ values, the only thing we need to think of\n",
    "is to set up the splitting criteria for each node.\n",
    "\n",
    "We define a PDF $p_{mk}$ that represents the number of observations of\n",
    "a class $k$ in a region $R_m$ with $N_m$ observations. We represent\n",
    "this likelihood function in terms of the proportion $I(y_i=k)$ of\n",
    "observations of this class in the region $R_m$ as"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "p_{mk} = \\frac{1}{N_m}\\sum_{x_i\\in R_m}I(y_i=k).\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We let $p_{mk}$ represent the majority class of observations in region\n",
    "$m$. The three most common ways of splitting a node are given by\n",
    "\n",
    "* Misclassification error"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "p_{mk} = \\frac{1}{N_m}\\sum_{x_i\\in R_m}I(y_i\\ne k) = 1-p_{mk}.\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Gini index $g$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "g = \\sum_{k=1}^K p_{mk}(1-p_{mk}).\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Information entropy or just entropy $s$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "s = -\\sum_{k=1}^K p_{mk}\\log{p_{mk}}.\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualizing the Tree, Classification"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "from sklearn.datasets import load_breast_cancer\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import confusion_matrix\n",
    "from sklearn.tree import export_graphviz\n",
    "\n",
    "from IPython.display import Image \n",
    "from pydot import graph_from_dot_data\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "cancer = load_breast_cancer()\n",
    "X = pd.DataFrame(cancer.data, columns=cancer.feature_names)\n",
    "print(X)\n",
    "y = pd.Categorical.from_codes(cancer.target, cancer.target_names)\n",
    "y = pd.get_dummies(y)\n",
    "print(y)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n",
    "tree_clf = DecisionTreeClassifier(max_depth=5)\n",
    "tree_clf.fit(X_train, y_train)\n",
    "\n",
    "export_graphviz(\n",
    "    tree_clf,\n",
    "    out_file=\"DataFiles/cancer.dot\",\n",
    "    feature_names=cancer.feature_names,\n",
    "    class_names=cancer.target_names,\n",
    "    rounded=True,\n",
    "    filled=True\n",
    ")\n",
    "cmd = 'dot -Tpng DataFiles/cancer.dot -o DataFiles/cancer.png'\n",
    "os.system(cmd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualizing the Tree, The Moons"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Common imports\n",
    "import numpy as np\n",
    "from sklearn.model_selection import  train_test_split \n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.datasets import make_moons\n",
    "from sklearn.tree import export_graphviz\n",
    "from pydot import graph_from_dot_data\n",
    "import pandas as pd\n",
    "import os\n",
    "\n",
    "np.random.seed(42)\n",
    "X, y = make_moons(n_samples=100, noise=0.25, random_state=53)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)\n",
    "tree_clf = DecisionTreeClassifier(max_depth=5)\n",
    "tree_clf.fit(X_train, y_train)\n",
    "\n",
    "export_graphviz(\n",
    "    tree_clf,\n",
    "    out_file=\"DataFiles/moons.dot\",\n",
    "    rounded=True,\n",
    "    filled=True\n",
    ")\n",
    "cmd = 'dot -Tpng DataFiles/moons.dot -o DataFiles/moons.png'\n",
    "os.system(cmd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Other ways of visualizing the trees\n",
    "\n",
    "**Scikit-Learn** has also another way to visualize the trees which is very useful, here with the Iris data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "from sklearn import tree\n",
    "X, y = load_iris(return_X_y=True)\n",
    "tree_clf = tree.DecisionTreeClassifier()\n",
    "tree_clf = tree_clf.fit(X, y)\n",
    "# and then plot the tree\n",
    "tree.plot_tree(tree_clf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Printing out as text\n",
    "\n",
    "Alternatively, the tree can also be exported in textual format with the function exporttext.\n",
    "This method doesn’t require the installation of external libraries and is more compact:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.tree import export_text\n",
    "iris = load_iris()\n",
    "decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)\n",
    "decision_tree = decision_tree.fit(iris.data, iris.target)\n",
    "r = export_text(decision_tree, feature_names=iris['feature_names'])\n",
    "print(r)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Algorithms for Setting up Decision Trees\n",
    "\n",
    "Two algorithms stand out in the set up of decision trees:\n",
    "1. The CART (Classification And Regression Tree) algorithm for both classification and regression\n",
    "\n",
    "2. The ID3 algorithm based on the computation of the information gain for classification\n",
    "\n",
    "We discuss both algorithms with applications here. The popular library\n",
    "**Scikit-Learn** uses the CART algorithm. For classification problems\n",
    "you can use either the **gini** index or the **entropy** to split a tree\n",
    "in two branches.\n",
    "\n",
    "## The CART algorithm for Classification\n",
    "\n",
    "For classification, the CART algorithm splits the data set in two subsets using a single feature $k$ and a threshold $t_k$.\n",
    "This could be for example a threshold set by a number below a certain circumference of a malign tumor.\n",
    "\n",
    "How do we find these two quantities?\n",
    "We search for the pair $(k,t_k)$ that produces the purest subset using for example the **gini** factor $G$.\n",
    "The cost function it tries to minimize is then"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "C(k,t_k) = \\frac{m_{\\mathrm{left}}}{m}G_{\\mathrm{left}}+ \\frac{m_{\\mathrm{right}}}{m}G_{\\mathrm{right}},\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "where $G_{\\mathrm{left/right}}$ measures the impurity of the left/right subset  and $m_{\\mathrm{left/right}}$\n",
    " is the number of instances in the left/right subset\n",
    "\n",
    "Once it has successfully split the training set in two, it splits the subsets using the same logic, then the subsubsets\n",
    "and so on, recursively. It stops recursing once it reaches the maximum depth (defined by the\n",
    "$max\\_depth$ hyperparameter), or if it cannot find a split that will reduce impurity. A few other\n",
    "hyperparameters control additional stopping conditions such as the $min\\_samples\\_split$,\n",
    "$min\\_samples\\_leaf$, $min\\_weight\\_fraction\\_leaf$, and $max\\_leaf\\_nodes$.\n",
    "\n",
    "## The CART algorithm for Regression\n",
    "\n",
    "The CART algorithm for regression works is similar to the one for classification except that instead of trying to split the\n",
    "training set in a way that minimizes say the **gini** or **entropy** impurity, it now tries to split the training set in a way that minimizes our well-known mean-squared error (MSE). The cost function is now"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "C(k,t_k) = \\frac{m_{\\mathrm{left}}}{m}\\mathrm{MSE}_{\\mathrm{left}}+ \\frac{m_{\\mathrm{right}}}{m}\\mathrm{MSE}_{\\mathrm{right}}.\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here the MSE for a specific node is defined as"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\mathrm{MSE}_{\\mathrm{node}}=\\frac{1}{m_\\mathrm{node}}\\sum_{i\\in \\mathrm{node}}(\\overline{y}_{\\mathrm{node}}-y_i)^2,\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "with"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\overline{y}_{\\mathrm{node}}=\\frac{1}{m_\\mathrm{node}}\\sum_{i\\in \\mathrm{node}}y_i,\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "the mean value of all observations in a specific node.\n",
    "\n",
    "Without any regularization, the regression task for decision trees, \n",
    "just like for classification tasks, is  prone to overfitting.\n",
    "\n",
    "\n",
    "\n",
    "## Cancer Data again now with Decision Trees and other Methods"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from sklearn.model_selection import  train_test_split \n",
    "from sklearn.datasets import load_breast_cancer\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "# Load the data\n",
    "cancer = load_breast_cancer()\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(cancer.data,cancer.target,random_state=0)\n",
    "print(X_train.shape)\n",
    "print(X_test.shape)\n",
    "# Logistic Regression\n",
    "logreg = LogisticRegression(solver='lbfgs')\n",
    "logreg.fit(X_train, y_train)\n",
    "print(\"Test set accuracy with Logistic Regression: {:.2f}\".format(logreg.score(X_test,y_test)))\n",
    "# Support vector machine\n",
    "svm = SVC(gamma='auto', C=100)\n",
    "svm.fit(X_train, y_train)\n",
    "print(\"Test set accuracy with SVM: {:.2f}\".format(svm.score(X_test,y_test)))\n",
    "# Decision Trees\n",
    "deep_tree_clf = DecisionTreeClassifier(max_depth=None)\n",
    "deep_tree_clf.fit(X_train, y_train)\n",
    "print(\"Test set accuracy with Decision Trees: {:.2f}\".format(deep_tree_clf.score(X_test,y_test)))\n",
    "#now scale the data\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "scaler = StandardScaler()\n",
    "scaler.fit(X_train)\n",
    "X_train_scaled = scaler.transform(X_train)\n",
    "X_test_scaled = scaler.transform(X_test)\n",
    "# Logistic Regression\n",
    "logreg.fit(X_train_scaled, y_train)\n",
    "print(\"Test set accuracy Logistic Regression with scaled data: {:.2f}\".format(logreg.score(X_test_scaled,y_test)))\n",
    "# Support Vector Machine\n",
    "svm.fit(X_train_scaled, y_train)\n",
    "print(\"Test set accuracy SVM with scaled data: {:.2f}\".format(logreg.score(X_test_scaled,y_test)))\n",
    "# Decision Trees\n",
    "deep_tree_clf.fit(X_train_scaled, y_train)\n",
    "print(\"Test set accuracy with Decision Trees and scaled data: {:.2f}\".format(deep_tree_clf.score(X_test_scaled,y_test)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Another example, the moons again"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from __future__ import division, print_function, unicode_literals\n",
    "\n",
    "# Common imports\n",
    "import numpy as np\n",
    "import os\n",
    "\n",
    "# to make this notebook's output stable across runs\n",
    "np.random.seed(42)\n",
    "\n",
    "# To plot pretty figures\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "from matplotlib.colors import ListedColormap\n",
    "plt.rcParams['axes.labelsize'] = 14\n",
    "plt.rcParams['xtick.labelsize'] = 12\n",
    "plt.rcParams['ytick.labelsize'] = 12\n",
    "\n",
    "\n",
    "from sklearn.svm import SVC\n",
    "from sklearn import datasets\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.datasets import make_moons\n",
    "from sklearn.tree import export_graphviz\n",
    "\n",
    "Xm, ym = make_moons(n_samples=100, noise=0.25, random_state=53)\n",
    "\n",
    "deep_tree_clf1 = DecisionTreeClassifier(random_state=42)\n",
    "deep_tree_clf2 = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)\n",
    "deep_tree_clf1.fit(Xm, ym)\n",
    "deep_tree_clf2.fit(Xm, ym)\n",
    "\n",
    "\n",
    "def plot_decision_boundary(clf, X, y, axes=[0, 7.5, 0, 3], iris=True, legend=False, plot_training=True):\n",
    "    x1s = np.linspace(axes[0], axes[1], 100)\n",
    "    x2s = np.linspace(axes[2], axes[3], 100)\n",
    "    x1, x2 = np.meshgrid(x1s, x2s)\n",
    "    X_new = np.c_[x1.ravel(), x2.ravel()]\n",
    "    y_pred = clf.predict(X_new).reshape(x1.shape)\n",
    "    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])\n",
    "    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)\n",
    "    if not iris:\n",
    "        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])\n",
    "        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)\n",
    "    if plot_training:\n",
    "        plt.plot(X[:, 0][y==0], X[:, 1][y==0], \"yo\", label=\"Iris-Setosa\")\n",
    "        plt.plot(X[:, 0][y==1], X[:, 1][y==1], \"bs\", label=\"Iris-Versicolor\")\n",
    "        plt.plot(X[:, 0][y==2], X[:, 1][y==2], \"g^\", label=\"Iris-Virginica\")\n",
    "        plt.axis(axes)\n",
    "    if iris:\n",
    "        plt.xlabel(\"Petal length\", fontsize=14)\n",
    "        plt.ylabel(\"Petal width\", fontsize=14)\n",
    "    else:\n",
    "        plt.xlabel(r\"$x_1$\", fontsize=18)\n",
    "        plt.ylabel(r\"$x_2$\", fontsize=18, rotation=0)\n",
    "    if legend:\n",
    "        plt.legend(loc=\"lower right\", fontsize=14)\n",
    "plt.figure(figsize=(11, 4))\n",
    "plt.subplot(121)\n",
    "plot_decision_boundary(deep_tree_clf1, Xm, ym, axes=[-1.5, 2.5, -1, 1.5], iris=False)\n",
    "plt.title(\"No restrictions\", fontsize=16)\n",
    "plt.subplot(122)\n",
    "plot_decision_boundary(deep_tree_clf2, Xm, ym, axes=[-1.5, 2.5, -1, 1.5], iris=False)\n",
    "plt.title(\"min_samples_leaf = {}\".format(deep_tree_clf2.min_samples_leaf), fontsize=14)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Playing around with regions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "np.random.seed(6)\n",
    "Xs = np.random.rand(100, 2) - 0.5\n",
    "ys = (Xs[:, 0] > 0).astype(np.float32) * 2\n",
    "\n",
    "angle = np.pi/4\n",
    "rotation_matrix = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])\n",
    "Xsr = Xs.dot(rotation_matrix)\n",
    "\n",
    "tree_clf_s = DecisionTreeClassifier(random_state=42)\n",
    "tree_clf_s.fit(Xs, ys)\n",
    "tree_clf_sr = DecisionTreeClassifier(random_state=42)\n",
    "tree_clf_sr.fit(Xsr, ys)\n",
    "\n",
    "plt.figure(figsize=(11, 4))\n",
    "plt.subplot(121)\n",
    "plot_decision_boundary(tree_clf_s, Xs, ys, axes=[-0.7, 0.7, -0.7, 0.7], iris=False)\n",
    "plt.subplot(122)\n",
    "plot_decision_boundary(tree_clf_sr, Xsr, ys, axes=[-0.7, 0.7, -0.7, 0.7], iris=False)\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regression trees"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Quadratic training set + noise\n",
    "np.random.seed(42)\n",
    "m = 200\n",
    "X = np.random.rand(m, 1)\n",
    "y = 4 * (X - 0.5) ** 2\n",
    "y = y + np.random.randn(m, 1) / 10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)\n",
    "tree_reg.fit(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Final regressor code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "tree_reg1 = DecisionTreeRegressor(random_state=42, max_depth=2)\n",
    "tree_reg2 = DecisionTreeRegressor(random_state=42, max_depth=3)\n",
    "tree_reg1.fit(X, y)\n",
    "tree_reg2.fit(X, y)\n",
    "\n",
    "def plot_regression_predictions(tree_reg, X, y, axes=[0, 1, -0.2, 1], ylabel=\"$y$\"):\n",
    "    x1 = np.linspace(axes[0], axes[1], 500).reshape(-1, 1)\n",
    "    y_pred = tree_reg.predict(x1)\n",
    "    plt.axis(axes)\n",
    "    plt.xlabel(\"$x_1$\", fontsize=18)\n",
    "    if ylabel:\n",
    "        plt.ylabel(ylabel, fontsize=18, rotation=0)\n",
    "    plt.plot(X, y, \"b.\")\n",
    "    plt.plot(x1, y_pred, \"r.-\", linewidth=2, label=r\"$\\hat{y}$\")\n",
    "\n",
    "plt.figure(figsize=(11, 4))\n",
    "plt.subplot(121)\n",
    "plot_regression_predictions(tree_reg1, X, y)\n",
    "for split, style in ((0.1973, \"k-\"), (0.0917, \"k--\"), (0.7718, \"k--\")):\n",
    "    plt.plot([split, split], [-0.2, 1], style, linewidth=2)\n",
    "plt.text(0.21, 0.65, \"Depth=0\", fontsize=15)\n",
    "plt.text(0.01, 0.2, \"Depth=1\", fontsize=13)\n",
    "plt.text(0.65, 0.8, \"Depth=1\", fontsize=13)\n",
    "plt.legend(loc=\"upper center\", fontsize=18)\n",
    "plt.title(\"max_depth=2\", fontsize=14)\n",
    "\n",
    "plt.subplot(122)\n",
    "plot_regression_predictions(tree_reg2, X, y, ylabel=None)\n",
    "for split, style in ((0.1973, \"k-\"), (0.0917, \"k--\"), (0.7718, \"k--\")):\n",
    "    plt.plot([split, split], [-0.2, 1], style, linewidth=2)\n",
    "for split in (0.0458, 0.1298, 0.2873, 0.9040):\n",
    "    plt.plot([split, split], [-0.2, 1], \"k:\", linewidth=1)\n",
    "plt.text(0.3, 0.5, \"Depth=2\", fontsize=13)\n",
    "plt.title(\"max_depth=3\", fontsize=14)\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "tree_reg1 = DecisionTreeRegressor(random_state=42)\n",
    "tree_reg2 = DecisionTreeRegressor(random_state=42, min_samples_leaf=10)\n",
    "tree_reg1.fit(X, y)\n",
    "tree_reg2.fit(X, y)\n",
    "\n",
    "x1 = np.linspace(0, 1, 500).reshape(-1, 1)\n",
    "y_pred1 = tree_reg1.predict(x1)\n",
    "y_pred2 = tree_reg2.predict(x1)\n",
    "\n",
    "plt.figure(figsize=(11, 4))\n",
    "\n",
    "plt.subplot(121)\n",
    "plt.plot(X, y, \"b.\")\n",
    "plt.plot(x1, y_pred1, \"r.-\", linewidth=2, label=r\"$\\hat{y}$\")\n",
    "plt.axis([0, 1, -0.2, 1.1])\n",
    "plt.xlabel(\"$x_1$\", fontsize=18)\n",
    "plt.ylabel(\"$y$\", fontsize=18, rotation=0)\n",
    "plt.legend(loc=\"upper center\", fontsize=18)\n",
    "plt.title(\"No restrictions\", fontsize=14)\n",
    "\n",
    "plt.subplot(122)\n",
    "plt.plot(X, y, \"b.\")\n",
    "plt.plot(x1, y_pred2, \"r.-\", linewidth=2, label=r\"$\\hat{y}$\")\n",
    "plt.axis([0, 1, -0.2, 1.1])\n",
    "plt.xlabel(\"$x_1$\", fontsize=18)\n",
    "plt.title(\"min_samples_leaf={}\".format(tree_reg2.min_samples_leaf), fontsize=14)\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pros and cons of trees, pros\n",
    "\n",
    "* White box, easy to interpret model. Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches discussed earlier (think of support vector machines)\n",
    "\n",
    "* Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!\n",
    "\n",
    "* No feature normalization needed\n",
    "\n",
    "* Tree models can handle both continuous and categorical data (Classification and Regression Trees)\n",
    "\n",
    "* Can model nonlinear relationships\n",
    "\n",
    "* Can model interactions between the different descriptive features\n",
    "\n",
    "* Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small)\n",
    "\n",
    "## Disadvantages\n",
    "\n",
    "* Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches\n",
    "\n",
    "* If continuous features are used the tree may become quite large and hence less interpretable\n",
    "\n",
    "* Decision trees are prone to overfit the training data and hence do not well generalize the data if no stopping criteria or improvements like pruning, boosting or bagging are implemented\n",
    "\n",
    "* Small changes in the data may lead to a completely different tree. This issue can be addressed by using ensemble methods like bagging, boosting or random forests\n",
    "\n",
    "* Unbalanced datasets where some target feature values occur much more frequently than others may lead to biased trees since the frequently occurring feature values are preferred over the less frequently occurring ones. \n",
    "\n",
    "* If the number of features is relatively large (high dimensional) and the number of instances is relatively low, the tree might overfit the data\n",
    "\n",
    "* Features with many levels may be preferred over features with less levels since for them it is *more easy* to split the dataset such that the sub datasets only contain pure target feature values. This issue can be addressed by preferring for instance the information gain ratio as splitting criteria over information gain\n",
    "\n",
    "However, by aggregating many decision trees, using methods like\n",
    "bagging, random forests, and boosting, the predictive performance of\n",
    "trees can be substantially improved.\n",
    "\n",
    "\n",
    "\n",
    "## Ensemble Methods: From a Single Tree to Many Trees and Extreme Boosting, Meet the Jungle of Methods\n",
    "\n",
    "As stated above and seen in many of the examples discussed here about\n",
    "a single decision tree, we often end up overfitting our training\n",
    "data. This normally means that we have a high variance. Can we reduce\n",
    "the variance of a statistical learning method?\n",
    "\n",
    "This leads us to a set of different methods that can combine different\n",
    "machine learning algorithms or just use one of them to construct\n",
    "forests and jungles of trees, homogeneous ones or heterogenous\n",
    "ones. These methods are recognized by different names which we will\n",
    "try to explain here. These are\n",
    "\n",
    "1. Voting classifiers\n",
    "\n",
    "2. Bagging and Pasting\n",
    "\n",
    "3. Random forests\n",
    "\n",
    "4. Boosting methods, from adaptive to Extreme Gradient Boosting (XGBoost)\n",
    "\n",
    "We discuss these methods here.\n",
    "\n",
    "\n",
    "## An Overview of Ensemble Methods\n",
    "\n",
    "<!-- dom:FIGURE: [DataFiles/ensembleoverview.png, width=600 frac=0.8] -->\n",
    "<!-- begin figure -->\n",
    "<img src=\"DataFiles/ensembleoverview.png\" width=600><p style=\"font-size: 0.9em\"><i>Figure 1: </i></p><!-- end figure -->\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "## Bagging\n",
    "\n",
    "The **plain** decision trees suffer from high\n",
    "variance. This means that if we split the training data into two parts\n",
    "at random, and fit a decision tree to both halves, the results that we\n",
    "get could be quite different. In contrast, a procedure with low\n",
    "variance will yield similar results if applied repeatedly to distinct\n",
    "data sets; linear regression tends to have low variance, if the ratio\n",
    "of $n$ to $p$ is moderately large. \n",
    "\n",
    "**Bootstrap aggregation**, or just **bagging**, is a\n",
    "general-purpose procedure for reducing the variance of a statistical\n",
    "learning method. \n",
    "\n",
    "\n",
    "## More bagging\n",
    "\n",
    "Bagging typically results in improved accuracy\n",
    "over prediction using a single tree. Unfortunately, however, it can be\n",
    "difficult to interpret the resulting model. Recall that one of the\n",
    "advantages of decision trees is the attractive and easily interpreted\n",
    "diagram that results.\n",
    "\n",
    "However, when we bag a large number of trees, it is no longer\n",
    "possible to represent the resulting statistical learning procedure\n",
    "using a single tree, and it is no longer clear which variables are\n",
    "most important to the procedure. Thus, bagging improves prediction\n",
    "accuracy at the expense of interpretability.  Although the collection\n",
    "of bagged trees is much more difficult to interpret than a single\n",
    "tree, one can obtain an overall summary of the importance of each\n",
    "predictor using the MSE (for bagging regression trees) or the Gini\n",
    "index (for bagging classification trees). In the case of bagging\n",
    "regression trees, we can record the total amount that the MSE is\n",
    "decreased due to splits over a given predictor, averaged over all $B$ possible\n",
    "trees. A large value indicates an important predictor. Similarly, in\n",
    "the context of bagging classification trees, we can add up the total\n",
    "amount that the Gini index  is decreased by splits over a given\n",
    "predictor, averaged over all $B$ trees.\n",
    "\n",
    "## Simple Voting Example, head or tail"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "heads_proba = 0.51\n",
    "coin_tosses = (np.random.rand(10000, 10) < heads_proba).astype(np.int32)\n",
    "cumulative_heads_ratio = np.cumsum(coin_tosses, axis=0) / np.arange(1, 10001).reshape(-1, 1)\n",
    "plt.figure(figsize=(8,3.5))\n",
    "plt.plot(cumulative_heads_ratio)\n",
    "plt.plot([0, 10000], [0.51, 0.51], \"k--\", linewidth=2, label=\"51%\")\n",
    "plt.plot([0, 10000], [0.5, 0.5], \"k-\", label=\"50%\")\n",
    "plt.xlabel(\"Number of coin tosses\")\n",
    "plt.ylabel(\"Heads ratio\")\n",
    "plt.legend(loc=\"lower right\")\n",
    "plt.axis([0, 10000, 0.42, 0.58])\n",
    "save_fig(\"votingsimple\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using the Voting Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.datasets import make_moons\n",
    "\n",
    "X, y = make_moons(n_samples=500, noise=0.30, random_state=42)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n",
    "\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.ensemble import VotingClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "log_clf = LogisticRegression(solver=\"liblinear\", random_state=42)\n",
    "rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)\n",
    "svm_clf = SVC(gamma=\"auto\", random_state=42)\n",
    "\n",
    "voting_clf = VotingClassifier(\n",
    "    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],\n",
    "    voting='hard')\n",
    "\n",
    "voting_clf.fit(X_train, y_train)\n",
    "\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "for clf in (log_clf, rnd_clf, svm_clf, voting_clf):\n",
    "    clf.fit(X_train, y_train)\n",
    "    y_pred = clf.predict(X_test)\n",
    "    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))\n",
    "\n",
    "log_clf = LogisticRegression(solver=\"liblinear\", random_state=42)\n",
    "rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)\n",
    "svm_clf = SVC(gamma=\"auto\", probability=True, random_state=42)\n",
    "\n",
    "voting_clf = VotingClassifier(\n",
    "    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],\n",
    "    voting='soft')\n",
    "voting_clf.fit(X_train, y_train)\n",
    "\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "for clf in (log_clf, rnd_clf, svm_clf, voting_clf):\n",
    "    clf.fit(X_train, y_train)\n",
    "    y_pred = clf.predict(X_test)\n",
    "    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Please, not the moons again! Voting and Bagging"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.datasets import make_moons\n",
    "\n",
    "X, y = make_moons(n_samples=500, noise=0.30, random_state=42)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.ensemble import VotingClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "log_clf = LogisticRegression(random_state=42)\n",
    "rnd_clf = RandomForestClassifier(random_state=42)\n",
    "svm_clf = SVC(random_state=42)\n",
    "\n",
    "voting_clf = VotingClassifier(\n",
    "    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],\n",
    "    voting='hard')\n",
    "voting_clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "for clf in (log_clf, rnd_clf, svm_clf, voting_clf):\n",
    "    clf.fit(X_train, y_train)\n",
    "    y_pred = clf.predict(X_test)\n",
    "    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "log_clf = LogisticRegression(random_state=42)\n",
    "rnd_clf = RandomForestClassifier(random_state=42)\n",
    "svm_clf = SVC(probability=True, random_state=42)\n",
    "\n",
    "voting_clf = VotingClassifier(\n",
    "    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],\n",
    "    voting='soft')\n",
    "voting_clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "for clf in (log_clf, rnd_clf, svm_clf, voting_clf):\n",
    "    clf.fit(X_train, y_train)\n",
    "    y_pred = clf.predict(X_test)\n",
    "    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bagging Examples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from sklearn.ensemble import BaggingClassifier\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "bag_clf = BaggingClassifier(\n",
    "    DecisionTreeClassifier(random_state=42), n_estimators=500,\n",
    "    max_samples=100, bootstrap=True, n_jobs=-1, random_state=42)\n",
    "bag_clf.fit(X_train, y_train)\n",
    "y_pred = bag_clf.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "print(accuracy_score(y_test, y_pred))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "tree_clf = DecisionTreeClassifier(random_state=42)\n",
    "tree_clf.fit(X_train, y_train)\n",
    "y_pred_tree = tree_clf.predict(X_test)\n",
    "print(accuracy_score(y_test, y_pred_tree))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from matplotlib.colors import ListedColormap\n",
    "\n",
    "def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.5, -1, 1.5], alpha=0.5, contour=True):\n",
    "    x1s = np.linspace(axes[0], axes[1], 100)\n",
    "    x2s = np.linspace(axes[2], axes[3], 100)\n",
    "    x1, x2 = np.meshgrid(x1s, x2s)\n",
    "    X_new = np.c_[x1.ravel(), x2.ravel()]\n",
    "    y_pred = clf.predict(X_new).reshape(x1.shape)\n",
    "    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])\n",
    "    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)\n",
    "    if contour:\n",
    "        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])\n",
    "        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)\n",
    "    plt.plot(X[:, 0][y==0], X[:, 1][y==0], \"yo\", alpha=alpha)\n",
    "    plt.plot(X[:, 0][y==1], X[:, 1][y==1], \"bs\", alpha=alpha)\n",
    "    plt.axis(axes)\n",
    "    plt.xlabel(r\"$x_1$\", fontsize=18)\n",
    "    plt.ylabel(r\"$x_2$\", fontsize=18, rotation=0)\n",
    "plt.figure(figsize=(11,4))\n",
    "plt.subplot(121)\n",
    "plot_decision_boundary(tree_clf, X, y)\n",
    "plt.title(\"Decision Tree\", fontsize=14)\n",
    "plt.subplot(122)\n",
    "plot_decision_boundary(bag_clf, X, y)\n",
    "plt.title(\"Decision Trees with Bagging\", fontsize=14)\n",
    "save_fig(\"baggingtree\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Making your own Bootstrap: Changing the Level of the Decision Tree\n",
    "\n",
    "Let us bring up our good old boostrap example from the linear regression lectures. We change the linerar regression algorithm with\n",
    "a decision tree wth different depths and perform a bootstrap aggregate (in this case we perform as many bootstraps as data points $n$)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.utils import resample\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "n = 100\n",
    "n_boostraps = 100\n",
    "maxdepth = 8\n",
    "\n",
    "# Make data set.\n",
    "x = np.linspace(-3, 3, n).reshape(-1, 1)\n",
    "y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.normal(0, 0.1, x.shape)\n",
    "error = np.zeros(maxdepth)\n",
    "bias = np.zeros(maxdepth)\n",
    "variance = np.zeros(maxdepth)\n",
    "polydegree = np.zeros(maxdepth)\n",
    "X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)\n",
    "\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "scaler = StandardScaler()\n",
    "scaler.fit(X_train)\n",
    "X_train_scaled = scaler.transform(X_train)\n",
    "X_test_scaled = scaler.transform(X_test)\n",
    "\n",
    "# we produce a simple tree first as benchmark\n",
    "simpletree = DecisionTreeRegressor(max_depth=3) \n",
    "simpletree.fit(X_train_scaled, y_train)\n",
    "simpleprediction = simpletree.predict(X_test_scaled)\n",
    "for degree in range(1,maxdepth):\n",
    "    model = DecisionTreeRegressor(max_depth=degree) \n",
    "    y_pred = np.empty((y_test.shape[0], n_boostraps))\n",
    "    for i in range(n_boostraps):\n",
    "        x_, y_ = resample(X_train_scaled, y_train)\n",
    "        model.fit(x_, y_)\n",
    "        y_pred[:, i] = model.predict(X_test_scaled)#.ravel()\n",
    "\n",
    "    polydegree[degree] = degree\n",
    "    error[degree] = np.mean( np.mean((y_test - y_pred)**2, axis=1, keepdims=True) )\n",
    "    bias[degree] = np.mean( (y_test - np.mean(y_pred, axis=1, keepdims=True))**2 )\n",
    "    variance[degree] = np.mean( np.var(y_pred, axis=1, keepdims=True) )\n",
    "    print('Polynomial degree:', degree)\n",
    "    print('Error:', error[degree])\n",
    "    print('Bias^2:', bias[degree])\n",
    "    print('Var:', variance[degree])\n",
    "    print('{} >= {} + {} = {}'.format(error[degree], bias[degree], variance[degree], bias[degree]+variance[degree]))\n",
    " \n",
    "mse_simpletree= np.mean( np.mean((y_test - simpleprediction)**2)\n",
    "print(mse_simpletree)\n",
    "plt.xlim(1,maxdepth)\n",
    "plt.plot(polydegree, error, label='MSE')\n",
    "plt.plot(polydegree, bias, label='bias')\n",
    "plt.plot(polydegree, variance, label='Variance')\n",
    "plt.legend()\n",
    "save_fig(\"baggingboot\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Why Voting?\n",
    "\n",
    "The idea behind boosting, and voting as well can be phrased as follows:\n",
    "**Can a group of people somehow arrive at highly\n",
    "reasoned decisions, despite the weak judgement of the individual\n",
    "members?**\n",
    "\n",
    "The aim is to create a good classifier by combining several weak classifiers.\n",
    "**A weak classifier is a classifier which is able to produce results that are only slightly better than guessing at random.**\n",
    "\n",
    "The basic approach is to apply repeatedly (in boosting this is done in an iterative way) a weak classifier to modifications of the data.\n",
    "In voting we simply apply the law of large numbers while in boosting we give more weight to misclassified data in\n",
    "each iteration.  \n",
    "\n",
    "Decision trees play an important role as our weak classifier. They serve as the basic method. \n",
    "\n",
    "## Tossing coins\n",
    "\n",
    "The simplest case is a so-called voting ensemble. To illustrate this,\n",
    "think of yourself tossing coins with a biased outcome of 51 per cent\n",
    "for heads and 49% for tails. With only few tosses,\n",
    "you may not clearly see this distribution for heads and tails. However, after some\n",
    "thousands of tosses, there will be a clear majority of heads.  With 2000 tosses\n",
    "you should see approximately 1020 heads and 980 tails.\n",
    "\n",
    "We can then state that the outcome is a clear majority of heads. If\n",
    "you do this ten thousand times, it is easy to see that there is a 97%\n",
    "likelihood of a majority of heads.\n",
    "\n",
    "Another example would be to collect all polls before an\n",
    "election. Different polls may show different likelihoods for a\n",
    "candidate winning with say a majority  of the popular vote. The majority vote\n",
    "would then consist in many polls indicating that this candidate will\n",
    "actually win.\n",
    "\n",
    "The example here shows how we can implement the coin tossing case,\n",
    "clealry demostrating that after some tosses we see the [law of large](https://en.wikipedia.org/wiki/Law_of_large_numbers)\n",
    "numbers kicking in.\n",
    "\n",
    "## Standard imports first"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Common imports\n",
    "from IPython.display import Image \n",
    "from pydot import graph_from_dot_data\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.tree import export_graphviz\n",
    "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from IPython.display import Image \n",
    "from pydot import graph_from_dot_data\n",
    "import os\n",
    "\n",
    "# Where to save the figures and data files\n",
    "PROJECT_ROOT_DIR = \"Results\"\n",
    "FIGURE_ID = \"Results/FigureFiles\"\n",
    "DATA_ID = \"DataFiles/\"\n",
    "\n",
    "if not os.path.exists(PROJECT_ROOT_DIR):\n",
    "    os.mkdir(PROJECT_ROOT_DIR)\n",
    "\n",
    "if not os.path.exists(FIGURE_ID):\n",
    "    os.makedirs(FIGURE_ID)\n",
    "\n",
    "if not os.path.exists(DATA_ID):\n",
    "    os.makedirs(DATA_ID)\n",
    "\n",
    "def image_path(fig_id):\n",
    "    return os.path.join(FIGURE_ID, fig_id)\n",
    "\n",
    "def data_path(dat_id):\n",
    "    return os.path.join(DATA_ID, dat_id)\n",
    "\n",
    "def save_fig(fig_id):\n",
    "    plt.savefig(image_path(fig_id) + \".png\", format='png')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Simple Voting Example, head or tail"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "\n",
    "# Common imports\n",
    "import numpy as np\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "from matplotlib.colors import ListedColormap\n",
    "plt.rcParams['axes.labelsize'] = 14\n",
    "plt.rcParams['xtick.labelsize'] = 12\n",
    "plt.rcParams['ytick.labelsize'] = 12\n",
    "\n",
    "heads_proba = 0.51\n",
    "coin_tosses = (np.random.rand(10000, 10) < heads_proba).astype(np.int32)\n",
    "cumulative_heads_ratio = np.cumsum(coin_tosses, axis=0) / np.arange(1, 10001).reshape(-1, 1)\n",
    "plt.figure(figsize=(8,3.5))\n",
    "plt.plot(cumulative_heads_ratio)\n",
    "plt.plot([0, 10000], [0.51, 0.51], \"k--\", linewidth=2, label=\"51%\")\n",
    "plt.plot([0, 10000], [0.5, 0.5], \"k-\", label=\"50%\")\n",
    "plt.xlabel(\"Number of coin tosses\")\n",
    "plt.ylabel(\"Heads ratio\")\n",
    "plt.legend(loc=\"lower right\")\n",
    "plt.axis([0, 10000, 0.42, 0.58])\n",
    "save_fig(\"votingsimple\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using the Voting Classifier\n",
    "\n",
    "We can use the voting classifier on other data sets, here the exciting binary case of two distinct objects using the make moons functionality of **Scikit-Learn**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.datasets import make_moons\n",
    "\n",
    "X, y = make_moons(n_samples=500, noise=0.30, random_state=42)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n",
    "\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.ensemble import VotingClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "log_clf = LogisticRegression(solver=\"liblinear\", random_state=42)\n",
    "rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)\n",
    "svm_clf = SVC(gamma=\"auto\", random_state=42)\n",
    "\n",
    "voting_clf = VotingClassifier(\n",
    "    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],\n",
    "    voting='hard')\n",
    "\n",
    "voting_clf.fit(X_train, y_train)\n",
    "\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "for clf in (log_clf, rnd_clf, svm_clf, voting_clf):\n",
    "    clf.fit(X_train, y_train)\n",
    "    y_pred = clf.predict(X_test)\n",
    "    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))\n",
    "\n",
    "log_clf = LogisticRegression(solver=\"liblinear\", random_state=42)\n",
    "rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)\n",
    "svm_clf = SVC(gamma=\"auto\", probability=True, random_state=42)\n",
    "voting_clf = VotingClassifier(\n",
    "    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],\n",
    "    voting='soft')\n",
    "voting_clf.fit(X_train, y_train)\n",
    "\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "for clf in (log_clf, rnd_clf, svm_clf, voting_clf):\n",
    "    clf.fit(X_train, y_train)\n",
    "    y_pred = clf.predict(X_test)\n",
    "    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Voting and Bagging"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.datasets import make_moons\n",
    "\n",
    "X, y = make_moons(n_samples=500, noise=0.30, random_state=42)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.ensemble import VotingClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "log_clf = LogisticRegression(random_state=42)\n",
    "rnd_clf = RandomForestClassifier(random_state=42)\n",
    "svm_clf = SVC(random_state=42)\n",
    "\n",
    "voting_clf = VotingClassifier(\n",
    "    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],\n",
    "    voting='hard')\n",
    "voting_clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "for clf in (log_clf, rnd_clf, svm_clf, voting_clf):\n",
    "    clf.fit(X_train, y_train)\n",
    "    y_pred = clf.predict(X_test)\n",
    "    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "log_clf = LogisticRegression(random_state=42)\n",
    "rnd_clf = RandomForestClassifier(random_state=42)\n",
    "svm_clf = SVC(probability=True, random_state=42)\n",
    "\n",
    "voting_clf = VotingClassifier(\n",
    "    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],\n",
    "    voting='soft')\n",
    "voting_clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "for clf in (log_clf, rnd_clf, svm_clf, voting_clf):\n",
    "    clf.fit(X_train, y_train)\n",
    "    y_pred = clf.predict(X_test)\n",
    "    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Random forests\n",
    "\n",
    "Random forests provide an improvement over bagged trees by way of a\n",
    "small tweak that decorrelates the trees. \n",
    "\n",
    "As in bagging, we build a\n",
    "number of decision trees on bootstrapped training samples. But when\n",
    "building these decision trees, each time a split in a tree is\n",
    "considered, a random sample of $m$ predictors is chosen as split\n",
    "candidates from the full set of $p$ predictors. The split is allowed to\n",
    "use only one of those $m$ predictors. \n",
    "\n",
    "A fresh sample of $m$ predictors is\n",
    "taken at each split, and typically we choose"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "m\\approx \\sqrt{p}.\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In building a random forest, at\n",
    "each split in the tree, the algorithm is not even allowed to consider\n",
    "a majority of the available predictors. \n",
    "\n",
    "The reason for this is rather clever. Suppose that there is one very\n",
    "strong predictor in the data set, along with a number of other\n",
    "moderately strong predictors. Then in the collection of bagged\n",
    "variable importance random forest trees, most or all of the trees will\n",
    "use this strong predictor in the top split. Consequently, all of the\n",
    "bagged trees will look quite similar to each other. Hence the\n",
    "predictions from the bagged trees will be highly correlated.\n",
    "Unfortunately, averaging many highly correlated quantities does not\n",
    "lead to as large of a reduction in variance as averaging many\n",
    "uncorrelated quantities. In particular, this means that bagging will\n",
    "not lead to a substantial reduction in variance over a single tree in\n",
    "this setting.\n",
    "\n",
    "\n",
    "## Random Forest Algorithm\n",
    "The algorithm described here can be applied to both classification and regression problems.\n",
    "\n",
    "We will grow of forest of say $B$ trees.\n",
    "1. For $b=1:B$\n",
    "\n",
    "  * Draw a bootstrap sample from the training data organized in our $\\boldsymbol{X}$ matrix.\n",
    "\n",
    "  * We grow then a random forest tree $T_b$ based on the bootstrapped data by repeating the steps outlined till we reach the maximum node size is reached\n",
    "\n",
    "1. we select $m \\le p$ variables at random from the $p$ predictors/features\n",
    "\n",
    "2. pick the best split point among the $m$ features using for example the CART algorithm and create a new node\n",
    "\n",
    "3. split the node into daughter nodes\n",
    "\n",
    "\n",
    "\n",
    "4. Output then the ensemble of trees $\\{T_b\\}_1^{B}$ and make predictions for either a regression type of problem or a classification type of problem. \n",
    "\n",
    "## Random Forests Compared with other Methods on the Cancer Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from sklearn.model_selection import  train_test_split \n",
    "from sklearn.datasets import load_breast_cancer\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.ensemble import BaggingClassifier\n",
    "\n",
    "# Load the data\n",
    "cancer = load_breast_cancer()\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(cancer.data,cancer.target,random_state=0)\n",
    "print(X_train.shape)\n",
    "print(X_test.shape)\n",
    "# Logistic Regression\n",
    "logreg = LogisticRegression(solver='lbfgs')\n",
    "logreg.fit(X_train, y_train)\n",
    "print(\"Test set accuracy with Logistic Regression: {:.2f}\".format(logreg.score(X_test,y_test)))\n",
    "# Support vector machine\n",
    "svm = SVC(gamma='auto', C=100)\n",
    "svm.fit(X_train, y_train)\n",
    "print(\"Test set accuracy with SVM: {:.2f}\".format(svm.score(X_test,y_test)))\n",
    "# Decision Trees\n",
    "deep_tree_clf = DecisionTreeClassifier(max_depth=None)\n",
    "deep_tree_clf.fit(X_train, y_train)\n",
    "print(\"Test set accuracy with Decision Trees: {:.2f}\".format(deep_tree_clf.score(X_test,y_test)))\n",
    "#now scale the data\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "scaler = StandardScaler()\n",
    "scaler.fit(X_train)\n",
    "X_train_scaled = scaler.transform(X_train)\n",
    "X_test_scaled = scaler.transform(X_test)\n",
    "# Logistic Regression\n",
    "logreg.fit(X_train_scaled, y_train)\n",
    "print(\"Test set accuracy Logistic Regression with scaled data: {:.2f}\".format(logreg.score(X_test_scaled,y_test)))\n",
    "# Support Vector Machine\n",
    "svm.fit(X_train_scaled, y_train)\n",
    "print(\"Test set accuracy SVM with scaled data: {:.2f}\".format(logreg.score(X_test_scaled,y_test)))\n",
    "# Decision Trees\n",
    "deep_tree_clf.fit(X_train_scaled, y_train)\n",
    "print(\"Test set accuracy with Decision Trees and scaled data: {:.2f}\".format(deep_tree_clf.score(X_test_scaled,y_test)))\n",
    "\n",
    "\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "from sklearn.model_selection import cross_validate\n",
    "# Data set not specificied\n",
    "#Instantiate the model with 500 trees and entropy as splitting criteria\n",
    "Random_Forest_model = RandomForestClassifier(n_estimators=500,criterion=\"entropy\")\n",
    "Random_Forest_model.fit(X_train_scaled, y_train)\n",
    "#Cross validation\n",
    "accuracy = cross_validate(Random_Forest_model,X_test_scaled,y_test,cv=10)['test_score']\n",
    "print(accuracy)\n",
    "print(\"Test set accuracy with Random Forests and scaled data: {:.2f}\".format(Random_Forest_model.score(X_test_scaled,y_test)))\n",
    "\n",
    "\n",
    "import scikitplot as skplt\n",
    "y_pred = Random_Forest_model.predict(X_test_scaled)\n",
    "skplt.metrics.plot_confusion_matrix(y_test, y_pred, normalize=True)\n",
    "plt.show()\n",
    "y_probas = Random_Forest_model.predict_proba(X_test_scaled)\n",
    "skplt.metrics.plot_roc(y_test, y_probas)\n",
    "plt.show()\n",
    "skplt.metrics.plot_cumulative_gain(y_test, y_probas)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recall that the cumulative gains curve shows the percentage of the\n",
    "overall number of cases in a given category *gained* by targeting a\n",
    "percentage of the total number of cases.\n",
    "\n",
    "Similarly, the receiver operating characteristic curve, or ROC curve,\n",
    "displays the diagnostic ability of a binary classifier system as its\n",
    "discrimination threshold is varied. It plots the true positive rate against the false positive rate.\n",
    "\n",
    "\n",
    "## Compare  Bagging on Trees with Random Forests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "bag_clf = BaggingClassifier(\n",
    "    DecisionTreeClassifier(splitter=\"random\", max_leaf_nodes=16, random_state=42),\n",
    "    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1, random_state=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "bag_clf.fit(X_train, y_train)\n",
    "y_pred = bag_clf.predict(X_test)\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)\n",
    "rnd_clf.fit(X_train, y_train)\n",
    "y_pred_rf = rnd_clf.predict(X_test)\n",
    "np.sum(y_pred == y_pred_rf) / len(y_pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Boosting, a Bird's Eye View\n",
    "\n",
    "The basic idea is to combine weak classifiers in order to create a good\n",
    "classifier. With a weak classifier we often intend a classifier which\n",
    "produces results which are only slightly better than we would get by\n",
    "random guesses.\n",
    "\n",
    "This is done by applying in an iterative way a weak (or a standard\n",
    "classifier like decision trees) to modify the data. In each iteration\n",
    "we emphasize those observations which are misclassified by weighting\n",
    "them with a factor.\n",
    "\n",
    "\n",
    "## What is boosting? Additive Modelling/Iterative Fitting\n",
    "\n",
    "Boosting is a way of fitting an additive expansion in a set of\n",
    "elementary basis functions like for example some simple polynomials.\n",
    "Assume for example that we have a function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "f_M(x) = \\sum_{i=1}^M \\beta_m b(x;\\gamma_m),\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "where $\\beta_m$ are the expansion parameters to be determined in a\n",
    "minimization process and $b(x;\\gamma_m)$ are some simple functions of\n",
    "the multivariable parameter $x$ which is characterized by the\n",
    "parameters $\\gamma_m$.\n",
    "\n",
    "As an example, consider the Sigmoid function we used in logistic\n",
    "regression. In that case, we can translate the function\n",
    "$b(x;\\gamma_m)$ into the Sigmoid function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\sigma(t) = \\frac{1}{1+\\exp{(-t)}},\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "where $t=\\gamma_0+\\gamma_1 x$ and the parameters $\\gamma_0$ and\n",
    "$\\gamma_1$ were determined by the Logistic Regression fitting\n",
    "algorithm.\n",
    "\n",
    "As another example, consider the cost function we defined for linear regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "C(\\boldsymbol{y},\\boldsymbol{f}) = \\frac{1}{n} \\sum_{i=0}^{n-1}(y_i-f(x_i))^2.\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case the function $f(x)$ was replaced by the design matrix\n",
    "$\\boldsymbol{X}$ and the unknown linear regression parameters $\\boldsymbol{\\beta}$,\n",
    "that is $\\boldsymbol{f}=\\boldsymbol{X}\\boldsymbol{\\beta}$. In linear regression we can \n",
    "simply invert a matrix and obtain the parameters $\\beta$ by"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\boldsymbol{\\beta}=\\left(\\boldsymbol{X}^T\\boldsymbol{X}\\right)^{-1}\\boldsymbol{X}^T\\boldsymbol{y}.\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In iterative fitting or additive modeling, we minimize the cost function with respect to the parameters $\\beta_m$ and $\\gamma_m$.\n",
    "\n",
    "\n",
    "## Iterative Fitting, Regression and Squared-error Cost Function\n",
    "\n",
    "The way we proceed is as follows (here we specialize to the squared-error cost function)\n",
    "\n",
    "1. Establish a cost function, here ${\\cal C}(\\boldsymbol{y},\\boldsymbol{f}) = \\frac{1}{n} \\sum_{i=0}^{n-1}(y_i-f_M(x_i))^2$ with $f_M(x) = \\sum_{i=1}^M \\beta_m b(x;\\gamma_m)$.\n",
    "\n",
    "2. Initialize with a guess $f_0(x)$. It could be one or even zero or some random numbers.\n",
    "\n",
    "3. For $m=1:M$\n",
    "\n",
    "a. minimize $\\sum_{i=0}^{n-1}(y_i-f_{m-1}(x_i)-\\beta b(x;\\gamma))^2$ wrt $\\gamma$ and $\\beta$\n",
    "\n",
    "b. This gives the optimal values $\\beta_m$ and $\\gamma_m$\n",
    "\n",
    "c. Determine then the new values $f_m(x)=f_{m-1}(x) +\\beta_m b(x;\\gamma_m)$\n",
    "\n",
    "\n",
    "We could use any of the algorithms we have discussed till now. If we\n",
    "use trees, $\\gamma$ parameterizes the split variables and split points\n",
    "at the internal nodes, and the predictions at the terminal nodes.\n",
    "\n",
    "\n",
    "## Squared-Error Example and Iterative Fitting\n",
    "\n",
    "To better understand what happens, let us develop the steps for the iterative fitting using the above squared error function.\n",
    "\n",
    "For simplicity we assume also that our functions $b(x;\\gamma)=1+\\gamma x$. \n",
    "\n",
    "This means that for every iteration $m$, we need to optimize"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "(\\beta_m,\\gamma_m) = \\mathrm{argmin}_{\\beta,\\lambda}\\hspace{0.1cm} \\sum_{i=0}^{n-1}(y_i-f_{m-1}(x_i)-\\beta b(x;\\gamma))^2=\\sum_{i=0}^{n-1}(y_i-f_{m-1}(x_i)-\\beta(1+\\gamma x_i))^2.\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We start our iteration by simply setting $f_0(x)=0$. \n",
    "Taking the derivatives  with respect to $\\beta$ and $\\gamma$ we obtain"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\frac{\\partial {\\cal C}}{\\partial \\beta} = -2\\sum_{i}(1+\\gamma x_i)(y_i-\\beta(1+\\gamma x_i))=0,\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\frac{\\partial {\\cal C}}{\\partial \\gamma} =-2\\sum_{i}\\beta x_i(y_i-\\beta(1+\\gamma x_i))=0.\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can then rewrite these equations as (defining $\\boldsymbol{w}=\\boldsymbol{e}+\\gamma \\boldsymbol{x})$ with $\\boldsymbol{e}$ being the unit vector)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\gamma \\boldsymbol{w}^T(\\boldsymbol{y}-\\beta\\gamma \\boldsymbol{w})=0,\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "which gives us $\\beta = \\boldsymbol{w}^T\\boldsymbol{y}/(\\boldsymbol{w}^T\\boldsymbol{w})$. Similarly we have"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\beta\\gamma \\boldsymbol{x}^T(\\boldsymbol{y}-\\beta(1+\\gamma \\boldsymbol{x}))=0,\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "which leads to $\\gamma =(\\boldsymbol{x}^T\\boldsymbol{y}-\\beta\\boldsymbol{x}^T\\boldsymbol{e})/(\\beta\\boldsymbol{x}^T\\boldsymbol{x})$.  Inserting\n",
    "for $\\beta$ gives us an equation for $\\gamma$. This is a non-linear equation in the unknown $\\gamma$ and has to be solved numerically. \n",
    "\n",
    "The solution to these two equations gives us in turn $\\beta_1$ and $\\gamma_1$ leading to the new expression for $f_1(x)$ as\n",
    "$f_1(x) = \\beta_1(1+\\gamma_1x)$. Doing this $M$ times results in our final estimate for the function $f$. \n",
    "\n",
    "\n",
    "\n",
    "## Iterative Fitting, Classification and AdaBoost\n",
    "\n",
    "Let us consider a binary classification problem with two outcomes $y_i \\in \\{-1,1\\}$ and $i=0,1,2,\\dots,n-1$ as our set of\n",
    "observations. We define a classification function $G(x)$ which produces a prediction taking one or the other of the two values \n",
    "$\\{-1,1\\}$.\n",
    "\n",
    "The error rate of the training sample is then"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\mathrm{\\overline{err}}=\\frac{1}{n} \\sum_{i=0}^{n-1} I(y_i\\ne G(x_i)).\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The iterative procedure starts with defining a weak classifier whose\n",
    "error rate is barely better than random guessing.  The iterative\n",
    "procedure in boosting is to sequentially apply a  weak\n",
    "classification algorithm to repeatedly modified versions of the data\n",
    "producing a sequence of weak classifiers $G_m(x)$.\n",
    "\n",
    "Here we will express our  function $f(x)$ in terms of $G(x)$. That is"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "f_M(x) = \\sum_{i=1}^M \\beta_m b(x;\\gamma_m),\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "will be a function of"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "G_M(x) = \\mathrm{sign} \\sum_{i=1}^M \\alpha_m G_m(x).\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Adaptive Boosting, AdaBoost\n",
    "\n",
    "In our iterative procedure we define thus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "f_m(x) = f_{m-1}(x)+\\beta_mG_m(x).\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The simplest possible cost function which leads (also simple from a computational point of view) to the AdaBoost algorithm is the\n",
    "exponential cost/loss function defined as"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "C(\\boldsymbol{y},\\boldsymbol{f}) = \\sum_{i=0}^{n-1}\\exp{(-y_i(f_{m-1}(x_i)+\\beta G(x_i))}.\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We optimize $\\beta$ and $G$ for each value of $m=1:M$ as we did in the regression case.\n",
    "This is normally done in two steps. Let us however first rewrite the cost function as"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "C(\\boldsymbol{y},\\boldsymbol{f}) = \\sum_{i=0}^{n-1}w_i^{m}\\exp{(-y_i\\beta G(x_i))},\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "where we have defined $w_i^m= \\exp{(-y_if_{m-1}(x_i))}$.\n",
    "\n",
    "## Building up AdaBoost\n",
    "\n",
    "First, for any $\\beta > 0$, we optimize $G$ by setting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "G_m(x) = \\mathrm{sign} \\sum_{i=0}^{n-1} w_i^m I(y_i \\ne G_(x_i)),\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "which is the classifier that minimizes the weighted error rate in predicting $y$.\n",
    "\n",
    "We can do this by rewriting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\exp{-(\\beta)}\\sum_{y_i=G(x_i)}w_i^m+\\exp{(\\beta)}\\sum_{y_i\\ne G(x_i)}w_i^m,\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "which can be rewritten as"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "(\\exp{(\\beta)}-\\exp{-(\\beta)})\\sum_{i=0}^{n-1}w_i^mI(y_i\\ne G(x_i))+\\exp{(-\\beta)}\\sum_{i=0}^{n-1}w_i^m=0,\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "which leads to"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\beta_m = \\frac{1}{2}\\log{\\frac{1-\\mathrm{\\overline{err}}}{\\mathrm{\\overline{err}}}},\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "where we have redefined the error as"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\mathrm{\\overline{err}}_m=\\frac{1}{n}\\frac{\\sum_{i=0}^{n-1}w_i^mI(y_i\\ne G(x_i)}{\\sum_{i=0}^{n-1}w_i^m},\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "which leads to an update of"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "f_m(x) = f_{m-1}(x) +\\beta_m G_m(x).\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This leads to the new weights"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "w_i^{m+1} = w_i^m \\exp{(-y_i\\beta_m G_m(x_i))}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Adaptive boosting: AdaBoost, Basic Algorithm\n",
    "\n",
    "The algorithm here is rather straightforward. Assume that our weak\n",
    "classifier is a decision tree and we consider a binary set of outputs\n",
    "with $y_i \\in \\{-1,1\\}$ and $i=0,1,2,\\dots,n-1$ as our set of\n",
    "observations. Our design matrix is given in terms of the\n",
    "feature/predictor vectors\n",
    "$\\boldsymbol{X}=[\\boldsymbol{x}_0\\boldsymbol{x}_1\\dots\\boldsymbol{x}_{p-1}]$. Finally, we define also a\n",
    "classifier determined by our data via a function $G(x)$. This function tells us how well we are able to classify our outputs/targets $\\boldsymbol{y}$. \n",
    "\n",
    "We have already defined the misclassification error $\\mathrm{err}$ as"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\mathrm{err}=\\frac{1}{n}\\sum_{i=0}^{n-1}I(y_i\\ne G(x_i)),\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "where the function $I()$ is one if we misclassify and zero if we classify correctly. \n",
    "\n",
    "## Basic Steps of AdaBoost\n",
    "\n",
    "With the above definitions we are now ready to set up the algorithm for AdaBoost.\n",
    "The basic idea is to set up weights which will be used to scale the correctly classified and the misclassified cases.\n",
    "1. We start by initializing all weights to $w_i = 1/n$, with $i=0,1,2,\\dots n-1$. It is easy to see that we must have $\\sum_{i=0}^{n-1}w_i = 1$.\n",
    "\n",
    "2. We rewrite the misclassification error as"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "\\mathrm{\\overline{err}}_m=\\frac{\\sum_{i=0}^{n-1}w_i^m I(y_i\\ne G(x_i))}{\\sum_{i=0}^{n-1}w_i},\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Then we start looping over all attempts at classifying, namely we start an iterative process for $m=1:M$, where $M$ is the final number of classifications. Our given classifier could for example be a plain decision tree.\n",
    "\n",
    "a. Fit then a given classifier to the training set using the weights $w_i$.\n",
    "\n",
    "b. Compute then $\\mathrm{err}$ and figure out which events are classified properly and which are classified wrongly.\n",
    "\n",
    "c. Define a quantity $\\alpha_{m} = \\log{(1-\\mathrm{\\overline{err}}_m)/\\mathrm{\\overline{err}}_m}$\n",
    "\n",
    "d. Set the new weights to $w_i = w_i\\times \\exp{(\\alpha_m I(y_i\\ne G(x_i)}$.\n",
    "\n",
    "\n",
    "5. Compute the new classifier $G(x)= \\sum_{i=0}^{n-1}\\alpha_m I(y_i\\ne G(x_i)$.\n",
    "\n",
    "For the iterations with $m \\le 2$ the weights are modified\n",
    "individually at each steps. The observations which were misclassified\n",
    "at iteration $m-1$ have a weight which is larger than those which were\n",
    "classified properly. As this proceeds, the observations which were\n",
    "difficult to classifiy correctly are given a larger influence. Each\n",
    "new classification step $m$ is then forced to concentrate on those\n",
    "observations that are missed in the previous iterations.\n",
    "\n",
    "\n",
    "\n",
    "## AdaBoost Examples\n",
    "\n",
    "Using **Scikit-Learn** it is easy to apply the adaptive boosting algorithm, as done here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from sklearn.ensemble import AdaBoostClassifier\n",
    "\n",
    "ada_clf = AdaBoostClassifier(\n",
    "    DecisionTreeClassifier(max_depth=1), n_estimators=200,\n",
    "    algorithm=\"SAMME.R\", learning_rate=0.5, random_state=42)\n",
    "ada_clf.fit(X_train, y_train)\n",
    "\n",
    "from sklearn.ensemble import AdaBoostClassifier\n",
    "\n",
    "ada_clf = AdaBoostClassifier(\n",
    "    DecisionTreeClassifier(max_depth=1), n_estimators=200,\n",
    "    algorithm=\"SAMME.R\", learning_rate=0.5, random_state=42)\n",
    "ada_clf.fit(X_train_scaled, y_train)\n",
    "y_pred = ada_clf.predict(X_test_scaled)\n",
    "skplt.metrics.plot_confusion_matrix(y_test, y_pred, normalize=True)\n",
    "plt.show()\n",
    "y_probas = ada_clf.predict_proba(X_test_scaled)\n",
    "skplt.metrics.plot_roc(y_test, y_probas)\n",
    "plt.show()\n",
    "skplt.metrics.plot_cumulative_gain(y_test, y_probas)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Gradient boosting: Basics with Steepest Descent/Functional Gradient Descent\n",
    "\n",
    "Gradient boosting is again a similar technique to Adaptive boosting,\n",
    "it combines so-called weak classifiers or regressors into a strong\n",
    "method via a series of iterations.\n",
    "\n",
    "In order to understand the method, let us illustrate its basics by\n",
    "bringing back the essential steps in linear regression, where our cost\n",
    "function was the least squares function.\n",
    "\n",
    "## The Squared-Error again! Steepest Descent\n",
    "\n",
    "We start again with our cost function ${\\cal C}(\\boldsymbol{y}m\\boldsymbol{f})=\\sum_{i=0}^{n-1}{\\cal L}(y_i, f(x_i))$ where we want to minimize\n",
    "This means that for every iteration, we need to optimize"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "(\\hat{\\boldsymbol{f}}) = \\mathrm{argmin}_{\\boldsymbol{f}}\\hspace{0.1cm} \\sum_{i=0}^{n-1}(y_i-f(x_i))^2.\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We define a real function $h_m(x)$ that defines our final function $f_M(x)$ as"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "f_M(x) = \\sum_{m=0}^M h_m(x).\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the steepest decent approach we approximate $h_m(x) = -\\rho_m g_m(x)$, where $\\rho_m$ is a scalar and $g_m(x)$ the gradient defined as"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "g_m(x_i) = \\left[ \\frac{\\partial {\\cal L}(y_i, f(x_i))}{\\partial f(x_i)}\\right]_{f(x_i)=f_{m-1}(x_i)}.\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the new gradient we can update $f_m(x) = f_{m-1}(x) -\\rho_m g_m(x)$. Using the above squared-error function we see that\n",
    "the gradient is $g_m(x_i) = -2(y_i-f(x_i))$.\n",
    "\n",
    "Choosing $f_0(x)=0$ we obtain $g_m(x) = -2y_i$ and inserting this into the minimization problem for the cost function we have"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "(\\rho_1) = \\mathrm{argmin}_{\\rho}\\hspace{0.1cm} \\sum_{i=0}^{n-1}(y_i+2\\rho y_i)^2.\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Steepest Descent Example\n",
    "\n",
    "Optimizing with respect to $\\rho$ we obtain (taking the derivative) that $\\rho_1 = -1/2$. We have then that"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "f_1(x) = f_{0}(x) -\\rho_1 g_1(x)=-y_i.\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can then proceed and compute"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "g_2(x_i) = \\left[ \\frac{\\partial {\\cal L}(y_i, f(x_i))}{\\partial f(x_i)}\\right]_{f(x_i)=f_{1}(x_i)=y_i}=-4y_i,\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and find a new value for $\\rho_2=-1/2$ and continue till we have reached $m=M$. We can modify the steepest descent method, or steepest boosting, by introducing what is called **gradient boosting**. \n",
    "\n",
    "## Gradient Boosting, algorithm\n",
    "\n",
    "Steepest descent is however not much used, since it only optimizes $f$ at a fixed set of $n$ points,\n",
    "so we do not learn a function that can generalize. However, we can modify the algorithm by\n",
    "fitting a weak learner to approximate the negative gradient signal. \n",
    "\n",
    "Suppose we have a cost function $C(f)=\\sum_{i=0}^{n-1}L(y_i, f(x_i))$ where $y_i$ is our target and $f(x_i)$ the function which is meant to model $y_i$. The above cost function could be our standard  squared-error  function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\n",
    "C(\\boldsymbol{y},\\boldsymbol{f})=\\sum_{i=0}^{n-1}(y_i-f(x_i))^2.\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The way we proceed in an iterative fashion is to\n",
    "1. Initialize our estimate $f_0(x)$.\n",
    "\n",
    "2. For $m=1:M$, we\n",
    "\n",
    "a. compute the negative gradient vector $\\boldsymbol{u}_m = -\\partial C(\\boldsymbol{y},\\boldsymbol{f})/\\partial \\boldsymbol{f}(x)$ at $f(x) = f_{m-1}(x)$;\n",
    "\n",
    "b. fit the so-called base-learner to the negative gradient $h_m(u_m,x)$;\n",
    "\n",
    "c. update the estimate $f_m(x) = f_{m-1}(x)+h_m(u_m,x)$;\n",
    "\n",
    "\n",
    "4. The final estimate is then $f_M(x) = \\sum_{m=1}^M h_m(u_m,x)$.\n",
    "\n",
    "## Gradient Boosting, Examples of Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.ensemble import GradientBoostingRegressor\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "import scikitplot as skplt\n",
    "from sklearn.metrics import mean_squared_error\n",
    "\n",
    "n = 100\n",
    "maxdegree = 6\n",
    "\n",
    "# Make data set.\n",
    "x = np.linspace(-3, 3, n).reshape(-1, 1)\n",
    "y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.normal(0, 0.1, x.shape)\n",
    "\n",
    "error = np.zeros(maxdegree)\n",
    "bias = np.zeros(maxdegree)\n",
    "variance = np.zeros(maxdegree)\n",
    "polydegree = np.zeros(maxdegree)\n",
    "X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)\n",
    "scaler = StandardScaler()\n",
    "scaler.fit(X_train)\n",
    "X_train_scaled = scaler.transform(X_train)\n",
    "X_test_scaled = scaler.transform(X_test)\n",
    "\n",
    "for degree in range(1,maxdegree):\n",
    "    model = GradientBoostingRegressor(max_depth=degree, n_estimators=100, learning_rate=1.0)  \n",
    "    model.fit(X_train_scaled,y_train)\n",
    "    y_pred = model.predict(X_test_scaled)\n",
    "    polydegree[degree] = degree\n",
    "    error[degree] = np.mean( np.mean((y_test - y_pred)**2) )\n",
    "    bias[degree] = np.mean( (y_test - np.mean(y_pred))**2 )\n",
    "    variance[degree] = np.mean( np.var(y_pred) )\n",
    "    print('Max depth:', degree)\n",
    "    print('Error:', error[degree])\n",
    "    print('Bias^2:', bias[degree])\n",
    "    print('Var:', variance[degree])\n",
    "    print('{} >= {} + {} = {}'.format(error[degree], bias[degree], variance[degree], bias[degree]+variance[degree]))\n",
    "\n",
    "plt.xlim(1,maxdegree-1)\n",
    "plt.plot(polydegree, error, label='Error')\n",
    "plt.plot(polydegree, bias, label='bias')\n",
    "plt.plot(polydegree, variance, label='Variance')\n",
    "plt.legend()\n",
    "save_fig(\"gdregression\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Gradient Boosting, Classification Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from sklearn.model_selection import  train_test_split \n",
    "from sklearn.datasets import load_breast_cancer\n",
    "import scikitplot as skplt\n",
    "from sklearn.ensemble import GradientBoostingClassifier\n",
    "from sklearn.model_selection import cross_validate\n",
    "\n",
    "# Load the data\n",
    "cancer = load_breast_cancer()\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(cancer.data,cancer.target,random_state=0)\n",
    "print(X_train.shape)\n",
    "print(X_test.shape)\n",
    "#now scale the data\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "scaler = StandardScaler()\n",
    "scaler.fit(X_train)\n",
    "X_train_scaled = scaler.transform(X_train)\n",
    "X_test_scaled = scaler.transform(X_test)\n",
    "\n",
    "gd_clf = GradientBoostingClassifier(max_depth=3, n_estimators=100, learning_rate=1.0)  \n",
    "gd_clf.fit(X_train_scaled, y_train)\n",
    "#Cross validation\n",
    "accuracy = cross_validate(gd_clf,X_test_scaled,y_test,cv=10)['test_score']\n",
    "print(accuracy)\n",
    "print(\"Test set accuracy with Random Forests and scaled data: {:.2f}\".format(gd_clf.score(X_test_scaled,y_test)))\n",
    "\n",
    "import scikitplot as skplt\n",
    "y_pred = gd_clf.predict(X_test_scaled)\n",
    "skplt.metrics.plot_confusion_matrix(y_test, y_pred, normalize=True)\n",
    "save_fig(\"gdclassiffierconfusion\")\n",
    "plt.show()\n",
    "y_probas = gd_clf.predict_proba(X_test_scaled)\n",
    "skplt.metrics.plot_roc(y_test, y_probas)\n",
    "save_fig(\"gdclassiffierroc\")\n",
    "plt.show()\n",
    "skplt.metrics.plot_cumulative_gain(y_test, y_probas)\n",
    "save_fig(\"gdclassiffiercgain\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## XGBoost: Extreme Gradient Boosting\n",
    "\n",
    "\n",
    "[XGBoost](https://github.com/dmlc/xgboost) or Extreme Gradient\n",
    "Boosting, is an optimized distributed gradient boosting library\n",
    "designed to be highly efficient, flexible and portable. It implements\n",
    "machine learning algorithms under the Gradient Boosting\n",
    "framework. XGBoost provides a parallel tree boosting that solve many\n",
    "data science problems in a fast and accurate way. See the [article by Chen and Guestrin](https://arxiv.org/abs/1603.02754).\n",
    "\n",
    "The authors design and build a highly scalable end-to-end tree\n",
    "boosting system. It has  a theoretically justified weighted quantile\n",
    "sketch for efficient proposal calculation. It introduces a novel sparsity-aware algorithm for parallel tree learning and an effective cache-aware block structure for out-of-core tree learning.\n",
    "\n",
    "It is now the algorithm which wins essentially all ML competitions!!!\n",
    "\n",
    "## Regression Case"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "import xgboost as xgb\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "import scikitplot as skplt\n",
    "from sklearn.metrics import mean_squared_error\n",
    "\n",
    "n = 100\n",
    "maxdegree = 6\n",
    "\n",
    "# Make data set.\n",
    "x = np.linspace(-3, 3, n).reshape(-1, 1)\n",
    "y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.normal(0, 0.1, x.shape)\n",
    "\n",
    "error = np.zeros(maxdegree)\n",
    "bias = np.zeros(maxdegree)\n",
    "variance = np.zeros(maxdegree)\n",
    "polydegree = np.zeros(maxdegree)\n",
    "X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)\n",
    "scaler = StandardScaler()\n",
    "scaler.fit(X_train)\n",
    "X_train_scaled = scaler.transform(X_train)\n",
    "X_test_scaled = scaler.transform(X_test)\n",
    "\n",
    "for degree in range(maxdegree):\n",
    "    model =  xgb.XGBRegressor(objective ='reg:squarederror', colsaobjective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1,max_depth = degree, alpha = 10, n_estimators = 200)\n",
    "\n",
    "    model.fit(X_train_scaled,y_train)\n",
    "    y_pred = model.predict(X_test_scaled)\n",
    "    polydegree[degree] = degree\n",
    "    error[degree] = np.mean( np.mean((y_test - y_pred)**2) )\n",
    "    bias[degree] = np.mean( (y_test - np.mean(y_pred))**2 )\n",
    "    variance[degree] = np.mean( np.var(y_pred) )\n",
    "    print('Max depth:', degree)\n",
    "    print('Error:', error[degree])\n",
    "    print('Bias^2:', bias[degree])\n",
    "    print('Var:', variance[degree])\n",
    "    print('{} >= {} + {} = {}'.format(error[degree], bias[degree], variance[degree], bias[degree]+variance[degree]))\n",
    "\n",
    "plt.xlim(1,maxdegree-1)\n",
    "plt.plot(polydegree, error, label='Error')\n",
    "plt.plot(polydegree, bias, label='bias')\n",
    "plt.plot(polydegree, variance, label='Variance')\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Xgboost on the Cancer Data\n",
    "\n",
    "As you will see from the confusion matrix below, XGBoots does an excellent job on the Wisconsin cancer data and outperforms essentially all agorithms we have discussed till now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from sklearn.model_selection import  train_test_split \n",
    "from sklearn.datasets import load_breast_cancer\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "from sklearn.model_selection import cross_validate\n",
    "import scikitplot as skplt\n",
    "import xgboost as xgb\n",
    "# Load the data\n",
    "cancer = load_breast_cancer()\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(cancer.data,cancer.target,random_state=0)\n",
    "print(X_train.shape)\n",
    "print(X_test.shape)\n",
    "#now scale the data\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "scaler = StandardScaler()\n",
    "scaler.fit(X_train)\n",
    "X_train_scaled = scaler.transform(X_train)\n",
    "X_test_scaled = scaler.transform(X_test)\n",
    "\n",
    "xg_clf = xgb.XGBClassifier()\n",
    "xg_clf.fit(X_train_scaled,y_train)\n",
    "\n",
    "y_test = xg_clf.predict(X_test_scaled)\n",
    "\n",
    "print(\"Test set accuracy with Random Forests and scaled data: {:.2f}\".format(xg_clf.score(X_test_scaled,y_test)))\n",
    "\n",
    "import scikitplot as skplt\n",
    "y_pred = xg_clf.predict(X_test_scaled)\n",
    "skplt.metrics.plot_confusion_matrix(y_test, y_pred, normalize=True)\n",
    "save_fig(\"xdclassiffierconfusion\")\n",
    "plt.show()\n",
    "y_probas = xg_clf.predict_proba(X_test_scaled)\n",
    "skplt.metrics.plot_roc(y_test, y_probas)\n",
    "save_fig(\"xdclassiffierroc\")\n",
    "plt.show()\n",
    "skplt.metrics.plot_cumulative_gain(y_test, y_probas)\n",
    "save_fig(\"gdclassiffiercgain\")\n",
    "plt.show()\n",
    "\n",
    "\n",
    "xgb.plot_tree(xg_clf,num_trees=0)\n",
    "plt.rcParams['figure.figsize'] = [50, 10]\n",
    "save_fig(\"xgtree\")\n",
    "plt.show()\n",
    "\n",
    "xgb.plot_importance(xg_clf)\n",
    "plt.rcParams['figure.figsize'] = [5, 5]\n",
    "save_fig(\"xgparams\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Topics we have covered\n",
    "\n",
    "The course has two central parts\n",
    "\n",
    "1. Statistical analysis and optimization of data\n",
    "\n",
    "2. Machine learning\n",
    "\n",
    "## Statistical analysis and optimization of data\n",
    "\n",
    "The following topics have been discussed:\n",
    "1. Basic concepts, expectation values, variance, covariance, correlation functions and errors;\n",
    "\n",
    "2. Gradient methods for data optimization\n",
    "\n",
    "3. Estimation of errors using cross-validation, bootstrapping and jackknife methods;\n",
    "\n",
    "## Machine learning\n",
    "\n",
    "The following topics will be covered\n",
    "1. Linear methods for regression and classification:\n",
    "\n",
    "a. Ordinary Least Squares\n",
    "\n",
    "b. Ridge regression\n",
    "\n",
    "c. Lasso regression\n",
    "\n",
    "d. Logistic regression\n",
    "\n",
    "\n",
    "5. Neural networks and deep learning:\n",
    "\n",
    "a. Feed Forward Neural Networks\n",
    "\n",
    "b. Convolutional Neural Networks\n",
    "\n",
    "c. Recurrent Neural Networks\n",
    "\n",
    "\n",
    "4. Decisions trees and ensemble methods:\n",
    "\n",
    "a. Decision trees\n",
    "\n",
    "b. Bagging and voting\n",
    "\n",
    "c. Random forests\n",
    "\n",
    "d. Boosting and gradient boosting\n",
    "\n",
    "\n",
    "## Learning outcomes and overarching aims of this course\n",
    "\n",
    "The course introduces a variety of central algorithms and methods\n",
    "essential for studies of data analysis and machine learning. The\n",
    "course is project based and through the various projects, normally\n",
    "three, you will be exposed to fundamental research problems\n",
    "in these fields, with the aim to reproduce state of the art scientific\n",
    "results. The students will learn to develop and structure large codes\n",
    "for studying these systems, get acquainted with computing facilities\n",
    "and learn to handle large scientific projects. A good scientific and\n",
    "ethical conduct is emphasized throughout the course. \n",
    "\n",
    "* Understand linear methods for regression and classification;\n",
    "\n",
    "* Learn about neural network;\n",
    "\n",
    "* Learn about bagging, boosting and trees\n",
    "\n",
    "* Learn about basic data analysis;\n",
    "\n",
    "* Be capable of extending the acquired knowledge to other systems and cases;\n",
    "\n",
    "* Have an understanding of central algorithms used in data analysis and machine learning;\n",
    "\n",
    "* Work on numerical projects to illustrate the theory. The projects play a central role and you are expected to know modern programming languages like Python or C++.\n",
    "\n",
    "## Perspective on Machine Learning\n",
    "\n",
    "1. Rapidly emerging application area\n",
    "\n",
    "2. Experiment AND theory are evolving in many many fields. Still many low-hanging fruits.\n",
    "\n",
    "3. Requires education/retraining for more widespread adoption\n",
    "\n",
    "4. A lot of “word-of-mouth” development methods\n",
    "\n",
    "Huge amounts of data sets require automation, classical analysis tools often inadequate. \n",
    "High energy physics hit this wall in the 90’s.\n",
    "In 2009 single top quark production was determined via [Boosted decision trees, Bayesian\n",
    "Neural Networks, etc.](https://arxiv.org/pdf/0903.0850.pdf)\n",
    "\n",
    "\n",
    "## Machine Learning Research\n",
    "\n",
    "Where to find recent results:\n",
    "1. Conference proceedings, arXiv and blog posts!\n",
    "\n",
    "2. **NIPS**: [Neural Information Processing Systems](https://papers.nips.cc)\n",
    "\n",
    "3. **ICLR**: [International Conference on Learning Representations](https://openreview.net/group?id=ICLR.cc/2018/Conference#accepted-oral-papers)\n",
    "\n",
    "4. **ICML**: International Conference on Machine Learning\n",
    "\n",
    "5. [Journal of Machine Learning Research](http://www.jmlr.org/papers/v19/) \n",
    "\n",
    "6. [Follow ML on ArXiv](https://arxiv.org/list/cs.LG/recent)\n",
    "\n",
    "## Starting your Machine Learning Project\n",
    "\n",
    "1. Identify problem type: classification, regression\n",
    "\n",
    "2. Consider your data carefully\n",
    "\n",
    "3. Choose a simple model that fits 1. and 2.\n",
    "\n",
    "4. Consider your data carefully again! Think of data representation more carefully.\n",
    "\n",
    "5. Based on your results, feedback loop to earliest possible point\n",
    "\n",
    "## Choose a Model and Algorithm\n",
    "\n",
    "1. Supervised?\n",
    "\n",
    "2. Start with the simplest model that fits your problem\n",
    "\n",
    "3. Start with minimal processing of data\n",
    "\n",
    "## Preparing Your Data\n",
    "\n",
    "1. Shuffle your data\n",
    "\n",
    "2. Mean center your data\n",
    "\n",
    "  * Why?\n",
    "\n",
    "\n",
    "3. Normalize the variance\n",
    "\n",
    "  * Why?\n",
    "\n",
    "\n",
    "4. **Whitening**\n",
    "\n",
    "  * Decorrelates data\n",
    "\n",
    "  * Can be hit or miss\n",
    "\n",
    "\n",
    "5. When to do train/test split?\n",
    "\n",
    "## Which Activation and Weights to Choose in Neural Networks\n",
    "\n",
    "1. RELU? ELU?\n",
    "\n",
    "2. Sigmoid or Tanh?\n",
    "\n",
    "3. Set all weights to 0?\n",
    "\n",
    "  * Terrible idea\n",
    "\n",
    "\n",
    "4. Set all weights to random values?\n",
    "\n",
    "  * Small random values\n",
    "\n",
    "\n",
    "## Optimization Methods and Hyperparameters\n",
    "1. Stochastic gradient descent\n",
    "\n",
    "a. Stochastic gradient descent + momentum\n",
    "\n",
    "\n",
    "2. State-of-the-art approaches:\n",
    "\n",
    "  * RMSProp\n",
    "\n",
    "  * Adam\n",
    "\n",
    "  * and more\n",
    "\n",
    "\n",
    "Which regularization and hyperparameters? $L_1$ or $L_2$, soft\n",
    "classifiers, depths of trees and many other. Need to explore a large\n",
    "set of hyperparameters and regularization methods.\n",
    "\n",
    "\n",
    "## Resampling\n",
    "\n",
    "When do we resample?\n",
    "\n",
    "1. [Bootstrap](https://www.cambridge.org/core/books/bootstrap-methods-and-their-application/ED2FD043579F27952363566DC09CBD6A)\n",
    "\n",
    "2. [Cross-validation](https://www.youtube.com/watch?v=fSytzGwwBVw&ab_channel=StatQuestwithJoshStarmer)\n",
    "\n",
    "3. Jackknife and many other\n",
    "\n",
    "## What's the future like?\n",
    "\n",
    "Based on multi-layer nonlinear neural networks, deep learning can\n",
    "learn directly from raw data, automatically extract and abstract\n",
    "features from layer to layer, and then achieve the goal of regression,\n",
    "classification, or ranking. Deep learning has made breakthroughs in\n",
    "computer vision, speech processing and natural language, and reached\n",
    "or even surpassed human level. The success of deep learning is mainly\n",
    "due to the three factors: big data, big model, and big computing.\n",
    "\n",
    "In the past few decades, many different architectures of deep neural\n",
    "networks have been proposed, such as\n",
    "1. Convolutional neural networks, which are mostly used in image and video data processing, and have also been applied to sequential data such as text processing;\n",
    "\n",
    "2. Recurrent neural networks, which can process sequential data of variable length and have been widely used in natural language understanding and speech processing;\n",
    "\n",
    "3. Encoder-decoder framework, which is mostly used for image or sequence generation, such as machine translation, text summarization, and image captioning.\n",
    "\n",
    "## Types of Machine Learning, a repetition\n",
    "\n",
    "The approaches to machine learning are many, but are often split into two main categories. \n",
    "In *supervised learning* we know the answer to a problem,\n",
    "and let the computer deduce the logic behind it. On the other hand, *unsupervised learning*\n",
    "is a method for finding patterns and relationship in data sets without any prior knowledge of the system.\n",
    "Some authours also operate with a third category, namely *reinforcement learning*. This is a paradigm \n",
    "of learning inspired by behavioural psychology, where learning is achieved by trial-and-error, \n",
    "solely from rewards and punishment.\n",
    "\n",
    "Another way to categorize machine learning tasks is to consider the desired output of a system.\n",
    "Some of the most common tasks are:\n",
    "\n",
    "  * Classification: Outputs are divided into two or more classes. The goal is to   produce a model that assigns inputs into one of these classes. An example is to identify  digits based on pictures of hand-written ones. Classification is typically supervised learning.\n",
    "\n",
    "  * Regression: Finding a functional relationship between an input data set and a reference data set.   The goal is to construct a function that maps input data to continuous output values.\n",
    "\n",
    "  * Clustering: Data are divided into groups with certain common traits, without knowing the different groups beforehand.  It is thus a form of unsupervised learning.\n",
    "\n",
    "  * Other unsupervised learning algortihms like **Boltzmann machines**\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "## Autoencoders: Overarching view\n",
    "\n",
    "Autoencoders are artificial neural networks capable of learning\n",
    "efficient representations of the input data (these representations are called codings)  without\n",
    "any supervision (i.e., the training set is unlabeled). These codings\n",
    "typically have a much lower dimensionality than the input data, making\n",
    "autoencoders useful for dimensionality reduction. \n",
    "\n",
    "More importantly, autoencoders act as powerful feature detectors, and\n",
    "they can be used for unsupervised pretraining of deep neural networks.\n",
    "\n",
    "Lastly, they are capable of randomly generating new data that looks\n",
    "very similar to the training data; this is called a generative\n",
    "model. For example, you could train an autoencoder on pictures of\n",
    "faces, and it would then be able to generate new faces.  Surprisingly,\n",
    "autoencoders work by simply learning to copy their inputs to their\n",
    "outputs. This may sound like a trivial task, but we will see that\n",
    "constraining the network in various ways can make it rather\n",
    "difficult. For example, you can limit the size of the internal\n",
    "representation, or you can add noise to the inputs and train the\n",
    "network to recover the original inputs. These constraints prevent the\n",
    "autoencoder from trivially copying the inputs directly to the outputs,\n",
    "which forces it to learn efficient ways of representing the data. In\n",
    "short, the codings are byproducts of the autoencoder’s attempt to\n",
    "learn the identity function under some constraints.\n",
    "\n",
    "[Video on autoencoders](https://www.coursera.org/lecture/building-deep-learning-models-with-tensorflow/autoencoders-1U4L3)\n",
    "\n",
    "See also A. Geron's textbook, chapter 15.\n",
    "\n",
    "## Bayesian Machine Learning\n",
    "\n",
    "This is an important topic if we aim at extracting a probability\n",
    "distribution. This gives us also a confidence interval and error\n",
    "estimates.\n",
    "\n",
    "Bayesian machine learning allows us to encode our prior beliefs about\n",
    "what those models should look like, independent of what the data tells\n",
    "us. This is especially useful when we don’t have a ton of data to\n",
    "confidently learn our model.\n",
    "\n",
    "[Video on Bayesian deep learning](https://www.youtube.com/watch?v=E1qhGw8QxqY&ab_channel=AndrewGordonWilson)\n",
    "\n",
    "See also the [slides here](https://github.com/CompPhysics/MachineLearning/blob/master/doc/Articles/lec03.pdf).\n",
    "\n",
    "## Reinforcement Learning\n",
    "\n",
    "Reinforcement Learning (RL) is one of the most exciting fields of\n",
    "Machine Learning today, and also one of the oldest. It has been around\n",
    "since the 1950s, producing many interesting applications over the\n",
    "years.\n",
    "\n",
    "It studies\n",
    "how agents take actions based on trial and error, so as to maximize\n",
    "some notion of cumulative reward in a dynamic system or\n",
    "environment. Due to its generality, the problem has also been studied\n",
    "in many other disciplines, such as game theory, control theory,\n",
    "operations research, information theory, multi-agent systems, swarm\n",
    "intelligence, statistics, and genetic algorithms.\n",
    "\n",
    "In March 2016, AlphaGo, a computer program that plays the board game\n",
    "Go, beat Lee Sedol in a five-game match. This was the first time a\n",
    "computer Go program had beaten a 9-dan (highest rank) professional\n",
    "without handicaps. AlphaGo is based on deep convolutional neural\n",
    "networks and reinforcement learning. AlphaGo’s victory was a major\n",
    "milestone in artificial intelligence and it has also made\n",
    "reinforcement learning a hot research area in the field of machine\n",
    "learning.\n",
    "\n",
    "[Lecture on Reinforcement Learning](https://www.youtube.com/watch?v=FgzM3zpZ55o&ab_channel=stanfordonline).\n",
    "\n",
    "See also A. Geron's textbook, chapter 16.\n",
    "\n",
    "## Transfer learning\n",
    "\n",
    "The goal of transfer learning is to transfer the model or knowledge\n",
    "obtained from a source task to the target task, in order to resolve\n",
    "the issues of insufficient training data in the target task. The\n",
    "rationality of doing so lies in that usually the source and target\n",
    "tasks have inter-correlations, and therefore either the features,\n",
    "samples, or models in the source task might provide useful information\n",
    "for us to better solve the target task. Transfer learning is a hot\n",
    "research topic in recent years, with many problems still waiting to be studied.\n",
    "\n",
    "[Lecture on transfer learning](https://www.ias.edu/video/machinelearning/2020/0331-SamoryKpotufe).\n",
    "\n",
    "## Adversarial learning\n",
    "\n",
    "The conventional deep generative model has a potential problem: the\n",
    "model tends to generate extreme instances to maximize the\n",
    "probabilistic likelihood, which will hurt its performance. Adversarial\n",
    "learning utilizes the adversarial behaviors (e.g., generating\n",
    "adversarial instances or training an adversarial model) to enhance the\n",
    "robustness of the model and improve the quality of the generated\n",
    "data. In recent years, one of the most promising unsupervised learning\n",
    "technologies, generative adversarial networks (GAN), has already been\n",
    "successfully applied to image, speech, and text.\n",
    "\n",
    "[Lecture on adversial learning](https://www.youtube.com/watch?v=CIfsB_EYsVI&ab_channel=StanfordUniversitySchoolofEngineering).\n",
    "\n",
    "## Dual learning\n",
    "\n",
    "Dual learning is a new learning paradigm, the basic idea of which is\n",
    "to use the primal-dual structure between machine learning tasks to\n",
    "obtain effective feedback/regularization, and guide and strengthen the\n",
    "learning process, thus reducing the requirement of large-scale labeled\n",
    "data for deep learning. The idea of dual learning has been applied to\n",
    "many problems in machine learning, including machine translation,\n",
    "image style conversion, question answering and generation, image\n",
    "classification and generation, text classification and generation,\n",
    "image-to-text, and text-to-image.\n",
    "\n",
    "## Distributed machine learning\n",
    "\n",
    "Distributed computation will speed up machine learning algorithms,\n",
    "significantly improve their efficiency, and thus enlarge their\n",
    "application. When distributed meets machine learning, more than just\n",
    "implementing the machine learning algorithms in parallel is required.\n",
    "\n",
    "\n",
    "## Meta learning\n",
    "\n",
    "Meta learning is an emerging research direction in machine\n",
    "learning. Roughly speaking, meta learning concerns learning how to\n",
    "learn, and focuses on the understanding and adaptation of the learning\n",
    "itself, instead of just completing a specific learning task. That is,\n",
    "a meta learner needs to be able to evaluate its own learning methods\n",
    "and adjust its own learning methods according to specific learning\n",
    "tasks.\n",
    "\n",
    "## The Challenges Facing Machine Learning\n",
    "\n",
    "While there has been much progress in machine learning, there are also challenges.\n",
    "\n",
    "For example, the mainstream machine learning technologies are\n",
    "black-box approaches, making us concerned about their potential\n",
    "risks. To tackle this challenge, we may want to make machine learning\n",
    "more explainable and controllable. As another example, the\n",
    "computational complexity of machine learning algorithms is usually\n",
    "very high and we may want to invent lightweight algorithms or\n",
    "implementations. Furthermore, in many domains such as physics,\n",
    "chemistry, biology, and social sciences, people usually seek elegantly\n",
    "simple equations (e.g., the Schrödinger equation) to uncover the\n",
    "underlying laws behind various phenomena. In the field of machine\n",
    "learning, can we reveal simple laws instead of designing more complex\n",
    "models for data fitting? Although there are many challenges, we are\n",
    "still very optimistic about the future of machine learning. As we look\n",
    "forward to the future, here are what we think the research hotspots in\n",
    "the next ten years will be.\n",
    "\n",
    "See the article on [Discovery of Physics From Data: Universal Laws and Discrepancies](https://www.frontiersin.org/articles/10.3389/frai.2020.00025/full)\n",
    "\n",
    "## Explainable machine learning\n",
    "\n",
    "Machine learning, especially deep learning, evolves rapidly. The\n",
    "ability gap between machine and human on many complex cognitive tasks\n",
    "becomes narrower and narrower. However, we are still in the very early\n",
    "stage in terms of explaining why those effective models work and how\n",
    "they work.\n",
    "\n",
    "**What is missing: the gap between correlation and causation**. Standard Machine Learning is based on what e have called a frequentist approach. \n",
    "\n",
    "Most\n",
    "machine learning techniques, especially the statistical ones, depend\n",
    "highly on correlations in data sets to make predictions and analyses. In\n",
    "contrast, rational humans tend to reply on clear and trustworthy\n",
    "causality relations obtained via logical reasoning on real and clear\n",
    "facts. It is one of the core goals of explainable machine learning to\n",
    "transition from solving problems by data correlation to solving\n",
    "problems by logical reasoning.\n",
    "\n",
    "**Bayesian Machine Learning is one of the exciting research directions in this field**.\n",
    "\n",
    "## Quantum machine learning\n",
    "\n",
    "Quantum machine learning is an emerging interdisciplinary research\n",
    "area at the intersection of quantum computing and machine learning.\n",
    "\n",
    "Quantum computers use effects such as quantum coherence and quantum\n",
    "entanglement to process information, which is fundamentally different\n",
    "from classical computers. Quantum algorithms have surpassed the best\n",
    "classical algorithms in several problems (e.g., searching for an\n",
    "unsorted database, inverting a sparse matrix), which we call quantum\n",
    "acceleration.\n",
    "\n",
    "When quantum computing meets machine learning, it can be a mutually\n",
    "beneficial and reinforcing process, as it allows us to take advantage\n",
    "of quantum computing to improve the performance of classical machine\n",
    "learning algorithms. In addition, we can also use the machine learning\n",
    "algorithms (on classic computers) to analyze and improve quantum\n",
    "computing systems.\n",
    "\n",
    "[Lecture on Quantum ML](https://www.youtube.com/watch?v=Xh9pUu3-WxM&ab_channel=InstituteforPure%26AppliedMathematics%28IPAM%29).\n",
    "\n",
    "\n",
    "[Read interview with Maria Schuld on her work on Quantum Machine Learning](https://physics.aps.org/articles/v13/179?utm_campaign=weekly&utm_medium=email&utm_source=emailalert). See also [her recent textbook](https://www.springer.com/gp/book/9783319964232). \n",
    "\n",
    "\n",
    "## Quantum machine learning algorithms based on linear algebra\n",
    "\n",
    "Many quantum machine learning algorithms are based on variants of\n",
    "quantum algorithms for solving linear equations, which can efficiently\n",
    "solve N-variable linear equations with complexity of O(log2 N) under\n",
    "certain conditions. The quantum matrix inversion algorithm can\n",
    "accelerate many machine learning methods, such as least square linear\n",
    "regression, least square version of support vector machine, Gaussian\n",
    "process, and more. The training of these algorithms can be simplified\n",
    "to solve linear equations. The key bottleneck of this type of quantum\n",
    "machine learning algorithms is data input—that is, how to initialize\n",
    "the quantum system with the entire data set. Although efficient\n",
    "data-input algorithms exist for certain situations, how to efficiently\n",
    "input data into a quantum system is as yet unknown for most cases.\n",
    "\n",
    "## Quantum reinforcement learning\n",
    "\n",
    "In quantum reinforcement learning, a quantum agent interacts with the\n",
    "classical environment to obtain rewards from the environment, so as to\n",
    "adjust and improve its behavioral strategies. In some cases, it\n",
    "achieves quantum acceleration by the quantum processing capabilities\n",
    "of the agent or the possibility of exploring the environment through\n",
    "quantum superposition. Such algorithms have been proposed in\n",
    "superconducting circuits and systems of trapped ions.\n",
    "\n",
    "## Quantum deep learning\n",
    "\n",
    "Dedicated quantum information processors, such as quantum annealers\n",
    "and programmable photonic circuits, are well suited for building deep\n",
    "quantum networks. The simplest deep quantum network is the Boltzmann\n",
    "machine. The classical Boltzmann machine consists of bits with tunable\n",
    "interactions and is trained by adjusting the interaction of these bits\n",
    "so that the distribution of its expression conforms to the statistics\n",
    "of the data. To quantize the Boltzmann machine, the neural network can\n",
    "simply be represented as a set of interacting quantum spins that\n",
    "correspond to an adjustable Ising model. Then, by initializing the\n",
    "input neurons in the Boltzmann machine to a fixed state and allowing\n",
    "the system to heat up, we can read out the output qubits to get the\n",
    "result.\n",
    "\n",
    "\n",
    "## Social machine learning\n",
    "\n",
    "Machine learning aims to imitate how humans\n",
    "learn. While we have developed successful machine learning algorithms,\n",
    "until now we have ignored one important fact: humans are social. Each\n",
    "of us is one part of the total society and it is difficult for us to\n",
    "live, learn, and improve ourselves, alone and isolated. Therefore, we\n",
    "should design machines with social properties. Can we let machines\n",
    "evolve by imitating human society so as to achieve more effective,\n",
    "intelligent, interpretable “social machine learning”?\n",
    "\n",
    "And much more.\n",
    "\n",
    "\n",
    "\n",
    "## The last words?\n",
    "\n",
    "Early computer scientist Alan Kay said, **The best way to predict the\n",
    "future is to create it**. Therefore, all machine learning\n",
    "practitioners, whether scholars or engineers, professors or students,\n",
    "need to work together to advance these important research\n",
    "topics. Together, we will not just predict the future, but create it."
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 4
}