Logistic Regression is a linear model used for binary classification. The model predicts the probability that an input $x$ belongs to the positive class, $P(y = 1 \mid x)$. The output is obtained by applying the sigmoid function to a linear combination of the features and weights: $\hat{y} = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$, where $\theta$ represents the parameters (weights and bias, which are usually merged).
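A minimal sketch of this forward pass in NumPy; the feature values, weights, and 0.5 decision threshold below are illustrative assumptions, with the bias folded into $\theta$ via a constant feature of 1:

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, theta):
    """Probability that x belongs to the positive class: sigma(theta^T x)."""
    return sigmoid(np.dot(theta, x))

# Illustrative example: bias merged into theta via a constant feature of 1.
x = np.array([1.0, 2.5, -0.3])      # [bias term, feature 1, feature 2]
theta = np.array([0.1, 0.8, -1.2])  # [bias weight, weight 1, weight 2]
p = predict_proba(x, theta)         # predicted P(y = 1 | x)
label = int(p >= 0.5)               # hard prediction via a 0.5 threshold
```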
Goal of training: find the parameters $\theta$ that best fit the training data (typically done using maximum likelihood estimation), which is equivalent to minimizing a cost function that measures the difference between the predicted probability $\hat{y}^{(i)}$ and the actual label $y^{(i)}$ for each training example $i$.
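To make this equivalence concrete: the model treats each label as a Bernoulli draw with parameter $\hat{y}^{(i)}$, so the log-likelihood of the training set is

$$\ell(\theta) = \sum_{i=1}^{m} \log P\big(y^{(i)} \mid x^{(i)}; \theta\big) = \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]$$

Maximizing $\ell(\theta)$ is therefore the same as minimizing $-\frac{1}{m}\ell(\theta)$, which is exactly the average Log Loss defined below.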
Loss Function (Single Training Instance)
The commonly used loss function for binary classification, including LR, is binary cross-entropy loss, also known as Log Loss. For a single instance $(x^{(i)}, y^{(i)})$, where $y^{(i)} \in \{0, 1\}$ is the true label and $\hat{y}^{(i)} = \sigma(\theta^T x^{(i)})$ is the predicted probability $P(y^{(i)} = 1 \mid x^{(i)})$, the loss is:

$$L(\hat{y}^{(i)}, y^{(i)}) = -\left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]$$
- If $y^{(i)} = 1$: Loss $= -\log(\hat{y}^{(i)})$. The loss is small when $\hat{y}^{(i)}$ is close to 1 and large when $\hat{y}^{(i)}$ is close to 0. This penalizes the model for assigning low probability to the true positive class.
- If $y^{(i)} = 0$: Loss $= -\log(1 - \hat{y}^{(i)})$. The loss is small when $\hat{y}^{(i)}$ is close to 0 (meaning $1 - \hat{y}^{(i)}$ is close to 1) and large when $\hat{y}^{(i)}$ is close to 1. This penalizes the model for assigning high probability to the wrong (positive) class.
This loss function is directly derived from the cross entropy between the true distribution (a Bernoulli distribution concentrated at $y^{(i)}$) and the predicted distribution (a Bernoulli distribution with parameter $\hat{y}^{(i)}$).
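A small sketch of this per-example loss in NumPy; the clipping constant `eps` is an assumption added purely to avoid `log(0)`:

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Log Loss for a single example: -[y log(p) + (1 - y) log(1 - p)]."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # guard against log(0)
    return -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))

# The two cases discussed above:
print(binary_cross_entropy(1, 0.9))  # y = 1, confident and correct -> small loss (~0.105)
print(binary_cross_entropy(1, 0.1))  # y = 1, confident but wrong   -> large loss (~2.303)
print(binary_cross_entropy(0, 0.1))  # y = 0, confident and correct -> small loss (~0.105)
print(binary_cross_entropy(0, 0.9))  # y = 0, confident but wrong   -> large loss (~2.303)
```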
Loss Function (Over Entire Training Set)
The total cost function is typically the average of the loss over all $m$ training examples:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]$$
Minimizing $J(\theta)$ with respect to $\theta$ finds the parameters that yield the lowest average prediction error across the dataset.
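A vectorized sketch of this average cost, assuming a design matrix `X` of shape `(m, n)` whose first column is all ones for the bias:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, eps=1e-12):
    """Average binary cross-entropy J(theta) over all m training examples."""
    y_hat = sigmoid(X @ theta)              # predicted probabilities, shape (m,)
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
```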
In most cases, there is no analytical solution for the parameters that maximize the log-likelihood (or minimize the negative log-likelihood) of logistic regression. Instead, iterative numerical optimization procedures such as gradient descent are used.
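A minimal batch gradient descent sketch; the learning rate, iteration count, and the gradient expression $\frac{1}{m} X^T (\hat{y} - y)$ (which follows from differentiating $J(\theta)$) are standard choices but are assumptions here, not something prescribed by the text above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Minimize the average Log Loss by batch gradient descent.

    X : (m, n) design matrix (first column of ones for the bias)
    y : (m,) vector of 0/1 labels
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        y_hat = sigmoid(X @ theta)     # current predicted probabilities
        grad = X.T @ (y_hat - y) / m   # gradient of J(theta)
        theta -= lr * grad             # step along the negative gradient
    return theta

# Tiny illustrative usage: two well-separated 1-D clusters.
rng = np.random.default_rng(0)
X_raw = np.vstack([rng.normal(-2, 1, (50, 1)), rng.normal(2, 1, (50, 1))])
y = np.concatenate([np.zeros(50), np.ones(50)])
X = np.hstack([np.ones((100, 1)), X_raw])  # prepend a bias column
theta = fit_logistic_regression(X, y)
```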
Why Log Loss instead of Mean Squared Error (MSE)?
Consider using MSE: $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$. While MSE works for linear regression, it is problematic for LR when combined with the sigmoid activation.
- In LR, $\hat{y}^{(i)} = \sigma(\theta^T x^{(i)})$ is a non-linear function of $\theta$. Substituting this into the MSE formula results in a non-convex cost function with respect to $\theta$.
- Non-convex functions have multiple local minima. Gradient descent can get stuck in these local minima, failing to find the global minimum and thus the optimal $\theta$.
- The Log Loss function for Logistic Regression is convex in $\theta$. This means there are no spurious local minima, so gradient descent (with a suitable learning rate) converges to a global minimum and finds optimal parameters; a small numerical check of both claims is sketched after this list.
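As a rough illustration of the convexity claims (not a proof), one can sweep a single weight over a grid and inspect the sign of the second finite differences of each cost; the toy one-feature dataset and grid range below are assumptions chosen only to make the concave region of the MSE curve visible:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D dataset (no bias term), chosen purely for illustration.
x = np.array([1.0, 1.5, 2.0])
y = np.array([0.0, 0.0, 1.0])

thetas = np.linspace(-10, 10, 401)
log_loss, mse = [], []
for t in thetas:
    p = np.clip(sigmoid(t * x), 1e-12, 1 - 1e-12)
    log_loss.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
    mse.append(np.mean((p - y) ** 2))

# Second finite differences: negative values indicate a locally concave region.
d2_log = np.diff(log_loss, n=2)
d2_mse = np.diff(mse, n=2)
print("Log Loss ever concave along this slice?", bool((d2_log < -1e-12).any()))  # expected: False
print("MSE ever concave along this slice?     ", bool((d2_mse < -1e-12).any()))  # expected: True
```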