Logistic Regression is a linear model used for binary classification. The model predicts the probability that an input $x$ belongs to the positive class, $P(y = 1 \mid x)$. The output is obtained by applying the sigmoid function to a linear combination of the features and weights: $\hat{y} = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$, where $\theta$ represents the parameters (weights and bias, which are usually merged).
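A minimal sketch of this forward pass in NumPy; the feature values, weights, and 0.5 decision threshold below are illustrative assumptions, with the bias folded into $\theta$ via a constant feature of 1:

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, theta):
    """Probability that x belongs to the positive class: sigma(theta^T x)."""
    return sigmoid(np.dot(theta, x))

# Illustrative example: bias merged into theta via a constant feature of 1.
x = np.array([1.0, 2.5, -0.3])      # [bias term, feature 1, feature 2]
theta = np.array([0.1, 0.8, -1.2])  # [bias weight, weight 1, weight 2]
p = predict_proba(x, theta)         # predicted P(y = 1 | x)
label = int(p >= 0.5)               # hard prediction via a 0.5 threshold
```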
Goal of training: find the parameters $\theta$ that best fit the training data (typically done using maximum likelihood estimation), which is equivalent to minimizing a cost function that measures the difference between the predicted probability $\hat{y}^{(i)}$ and the actual label $y^{(i)}$ for each training example $i$.
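To make this equivalence concrete: the model treats each label as a Bernoulli draw with parameter $\hat{y}^{(i)}$, so the log-likelihood of the training set is

$$\ell(\theta) = \sum_{i=1}^{m} \log P\big(y^{(i)} \mid x^{(i)}; \theta\big) = \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]$$

Maximizing $\ell(\theta)$ is therefore the same as minimizing $-\frac{1}{m}\ell(\theta)$, which is exactly the average Log Loss defined below.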
Loss Function (Single Training Instance)
The commonly used loss function for binary classification, including LR, is binary cross-entropy loss, also known as Log Loss. For a single instance $(x^{(i)}, y^{(i)})$, where $y^{(i)} \in \{0, 1\}$ is the true label and $\hat{y}^{(i)} = \sigma(\theta^T x^{(i)})$ is the predicted probability $P(y^{(i)} = 1 \mid x^{(i)})$, the loss is:

$$L(\hat{y}^{(i)}, y^{(i)}) = -\left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]$$
- If $y^{(i)} = 1$: Loss $= -\log(\hat{y}^{(i)})$. The loss is small when $\hat{y}^{(i)}$ is close to 1 and large when $\hat{y}^{(i)}$ is close to 0. This penalizes the model for assigning low probability to the true positive class.
- If $y^{(i)} = 0$: Loss $= -\log(1 - \hat{y}^{(i)})$. The loss is small when $\hat{y}^{(i)}$ is close to 0 (meaning $1 - \hat{y}^{(i)}$ is close to 1) and large when $\hat{y}^{(i)}$ is close to 1. This penalizes the model for assigning high probability to the wrong (positive) class.
This loss function is directly derived from the cross entropy between the true distribution (a Bernoulli distribution concentrated at $y^{(i)}$) and the predicted distribution (a Bernoulli distribution with parameter $\hat{y}^{(i)}$).
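A small sketch of this per-example loss in NumPy; the clipping constant `eps` is an assumption added purely to avoid `log(0)`:

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Log Loss for a single example: -[y log(p) + (1 - y) log(1 - p)]."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # guard against log(0)
    return -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))

# The two cases discussed above:
print(binary_cross_entropy(1, 0.9))  # y = 1, confident and correct -> small loss (~0.105)
print(binary_cross_entropy(1, 0.1))  # y = 1, confident but wrong   -> large loss (~2.303)
print(binary_cross_entropy(0, 0.1))  # y = 0, confident and correct -> small loss (~0.105)
print(binary_cross_entropy(0, 0.9))  # y = 0, confident but wrong   -> large loss (~2.303)
```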
Loss Function (Over Entire Training Set)
The total cost function is typically the average of the loss over all $m$ training examples:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]$$
Minimizing $J(\theta)$ with respect to $\theta$ finds the parameters that yield the lowest average prediction error across the dataset.
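A vectorized sketch of this average cost, assuming a design matrix `X` of shape `(m, n)` whose first column is all ones for the bias:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, eps=1e-12):
    """Average binary cross-entropy J(theta) over all m training examples."""
    y_hat = sigmoid(X @ theta)              # predicted probabilities, shape (m,)
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
```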
In most cases, there is no analytical solution for the parameters that maximize the log-likelihood (or minimize the negative log-likelihood) of logistic regression. Instead, iterative numerical optimization procedures such as gradient descent are used.
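A minimal batch gradient descent sketch; the learning rate, iteration count, and the gradient expression $\frac{1}{m} X^T (\hat{y} - y)$ (which follows from differentiating $J(\theta)$) are standard choices but are assumptions here, not something prescribed by the text above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Minimize the average Log Loss by batch gradient descent.

    X : (m, n) design matrix (first column of ones for the bias)
    y : (m,) vector of 0/1 labels
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        y_hat = sigmoid(X @ theta)     # current predicted probabilities
        grad = X.T @ (y_hat - y) / m   # gradient of J(theta)
        theta -= lr * grad             # step along the negative gradient
    return theta

# Tiny illustrative usage: two well-separated 1-D clusters.
rng = np.random.default_rng(0)
X_raw = np.vstack([rng.normal(-2, 1, (50, 1)), rng.normal(2, 1, (50, 1))])
y = np.concatenate([np.zeros(50), np.ones(50)])
X = np.hstack([np.ones((100, 1)), X_raw])  # prepend a bias column
theta = fit_logistic_regression(X, y)
```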
Why Log Loss instead of Mean Squared Error (MSE)?
Consider using MSE: $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$. While MSE works for linear regression, it is problematic for LR when combined with the sigmoid activation.
- In LR, $\hat{y}^{(i)} = \sigma(\theta^T x^{(i)})$ is a non-linear function of $\theta$. Substituting this into the MSE formula results in a non-convex cost function with respect to $\theta$.
- Non-convex functions have multiple local minima. Gradient descent can get stuck in these local minima, failing to find the global minimum and thus the optimal $\theta$.
- The Log Loss function for Logistic Regression is convex in $\theta$. This means there are no spurious local minima, so gradient descent (with a suitable learning rate) converges to a global minimum and finds optimal parameters; a small numerical check of both claims is sketched after this list.
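As a rough illustration of the convexity claims (not a proof), one can sweep a single weight over a grid and inspect the sign of the second finite differences of each cost; the toy one-feature dataset and grid range below are assumptions chosen only to make the concave region of the MSE curve visible:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D dataset (no bias term), chosen purely for illustration.
x = np.array([1.0, 1.5, 2.0])
y = np.array([0.0, 0.0, 1.0])

thetas = np.linspace(-10, 10, 401)
log_loss, mse = [], []
for t in thetas:
    p = np.clip(sigmoid(t * x), 1e-12, 1 - 1e-12)
    log_loss.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
    mse.append(np.mean((p - y) ** 2))

# Second finite differences: negative values indicate a locally concave region.
d2_log = np.diff(log_loss, n=2)
d2_mse = np.diff(mse, n=2)
print("Log Loss ever concave along this slice?", bool((d2_log < -1e-12).any()))  # expected: False
print("MSE ever concave along this slice?     ", bool((d2_mse < -1e-12).any()))  # expected: True
```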