Maximum likelihood

To define the total likelihood of a dataset \( \mathcal{D}=\{(y_i,x_i)\} \), with binary labels \( y_i\in\{0,1\} \) and data points drawn independently, we use the Maximum Likelihood Estimation (MLE) principle: we seek the parameters \( \boldsymbol{\beta} \) that maximize the probability of observing the data. Because the data points are independent, the likelihood factorizes into a product of the individual probabilities of each observed outcome \( y_i \), that is

$$ \begin{align*} P(\mathcal{D}|\boldsymbol{\beta})& = \prod_{i=1}^n \left[p(y_i=1|x_i,\boldsymbol{\beta})\right]^{y_i}\left[1-p(y_i=1|x_i,\boldsymbol{\beta})\right]^{1-y_i} \end{align*} $$

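from which, by taking the logarithm, we obtain the log-likelihood that defines our cost/loss function (with this sign convention it is maximized with respect to \( \boldsymbol{\beta} \))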

$$ \mathcal{C}(\boldsymbol{\beta}) = \sum_{i=1}^n \left( y_i\log{p(y_i=1|x_i,\boldsymbol{\beta})} + (1-y_i)\log\left[1-p(y_i=1|x_i,\boldsymbol{\beta})\right]\right). $$
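
The relation between the product form and the log-likelihood can be checked numerically. The following is a minimal sketch, not a definitive implementation. It assumes a logistic (sigmoid) model \( p(y_i=1|x_i,\boldsymbol{\beta}) = 1/(1+e^{-(\beta_0+\beta_1 x_i)}) \), which this section does not itself specify. The function names, the toy data, and the parameter values are purely illustrative.

```python
import numpy as np


def sigmoid(t):
    """Logistic function; an assumed model for p(y=1|x, beta)."""
    return 1.0 / (1.0 + np.exp(-t))


def likelihood(beta, x, y):
    """Product over data points of p^y * (1 - p)^(1 - y)."""
    p = sigmoid(beta[0] + beta[1] * x)
    return np.prod(p**y * (1.0 - p) ** (1 - y))


def log_likelihood(beta, x, y):
    """Sum over data points of y*log(p) + (1 - y)*log(1 - p)."""
    p = sigmoid(beta[0] + beta[1] * x)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1.0 - p))


# Hypothetical toy data with binary labels (not taken from the text).
x = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])
y = np.array([0, 0, 1, 0, 1])
beta = np.array([0.1, 0.8])  # arbitrary illustrative parameters

# The logarithm of the product equals the sum of the logarithms.
print(np.log(likelihood(beta, x, y)))
print(log_likelihood(beta, x, y))
```

For real datasets the summed form is preferable in practice, since a product of many probabilities smaller than one quickly underflows in floating-point arithmetic.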