
The principle of Maximum Likelihood Estimation (MLE) is a general method for estimating the parameters of any statistical model. Given a dataset and a model, the best parameters are those that maximize the probability (likelihood) of observing the data under that model.

Assume the training examples $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$, are independent and identically distributed. For binary classification, assume the target $y \in \{0, 1\}$ follows a Bernoulli distribution with parameter $h_\theta(x)$:

$$P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

The probability mass function for a single example $(x, y)$ is

$$P(y \mid x; \theta) = h_\theta(x)^{y} \, \bigl(1 - h_\theta(x)\bigr)^{1 - y}$$

The likelihood of the entire dataset is the product of the individual probabilities:

$$L(\theta) = \prod_{i=1}^{n} P\bigl(y^{(i)} \mid x^{(i)}; \theta\bigr) = \prod_{i=1}^{n} h_\theta\bigl(x^{(i)}\bigr)^{y^{(i)}} \bigl(1 - h_\theta\bigl(x^{(i)}\bigr)\bigr)^{1 - y^{(i)}}$$
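The product over per-example Bernoulli probabilities can be sketched numerically as follows (the arrays `y` and `h` are illustrative, not from the text; `h` plays the role of the model's predicted probability $P(y = 1 \mid x)$):

```python
import numpy as np

# Illustrative binary targets and predicted probabilities P(y = 1 | x).
y = np.array([1, 0, 1, 1, 0])
h = np.array([0.9, 0.2, 0.8, 0.7, 0.1])

# Per-example Bernoulli probability h^y * (1 - h)^(1 - y),
# then the likelihood of the whole dataset as their product.
per_example = h**y * (1 - h)**(1 - y)
likelihood = np.prod(per_example)
print(likelihood)
```

Note that each factor rewards confident, correct predictions: an example with $y = 1$ contributes $h$, and one with $y = 0$ contributes $1 - h$.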

To maximize the likelihood $L(\theta)$, it is numerically more stable and computationally easier to maximize the log-likelihood $\ell(\theta)$, since the logarithm turns the product into a sum:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \Bigl[ y^{(i)} \log h_\theta\bigl(x^{(i)}\bigr) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta\bigl(x^{(i)}\bigr)\bigr) \Bigr]$$

Maximizing $\ell(\theta)$ is equivalent to minimizing $-\ell(\theta)$. In an optimization setting, we can even divide by $n$ to get an "average loss":

$$J(\theta) = -\frac{1}{n}\,\ell(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \Bigl[ y^{(i)} \log h_\theta\bigl(x^{(i)}\bigr) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta\bigl(x^{(i)}\bigr)\bigr) \Bigr]$$
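The average loss above can be sketched as a small function; the name `binary_cross_entropy` and the `eps` clipping guard are my own additions (the guard avoids `log(0)` when a predicted probability saturates at exactly 0 or 1):

```python
import numpy as np

def binary_cross_entropy(y, h, eps=1e-12):
    """Negative mean log-likelihood of Bernoulli targets y
    given predicted probabilities h (the 'average loss')."""
    h = np.clip(h, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Illustrative data: binary targets and predicted probabilities.
y = np.array([1, 0, 1, 1, 0])
h = np.array([0.9, 0.2, 0.8, 0.7, 0.1])
print(binary_cross_entropy(y, h))
```

Working with the sum of logs rather than the raw product also sidesteps floating-point underflow: a product of thousands of probabilities below 1 quickly rounds to zero, while the corresponding sum of logs stays well within range.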

Note

Observe that this MLE formulation for binary classification yields exactly the cost function of the logistic regression optimization problem: with $h_\theta(x) = \sigma(\theta^\top x)$, minimizing the negative average log-likelihood $J(\theta)$ is the same as minimizing the binary cross-entropy loss.
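The connection can be sketched end to end: fit logistic regression by gradient descent on the negative average log-likelihood. This is a minimal illustration on synthetic data, with all names (`true_w`, learning rate, iteration count) chosen for the example rather than taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: labels drawn from a Bernoulli with p = sigmoid(x . true_w).
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -3.0])
p = 1 / (1 + np.exp(-(X @ true_w)))
y = (p > rng.uniform(size=200)).astype(float)

# Gradient descent on J(w) = -(1/n) * log-likelihood.
w = np.zeros(2)
lr = 0.5
for _ in range(500):
    h = 1 / (1 + np.exp(-(X @ w)))   # sigmoid(w^T x)
    grad = X.T @ (h - y) / len(y)    # gradient of the average loss
    w -= lr * grad

print(w)  # should roughly recover the direction and signs of true_w
```

Because $J(\theta)$ is convex for logistic regression, plain gradient descent converges to the maximum-likelihood estimate; the recovered weights approximate `true_w` up to sampling noise.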