Gradient descent is an iterative optimization algorithm that finds the minimum of a function by repeatedly moving in the direction opposite to the gradient. Applied to the logistic regression optimization problem, the algorithm proceeds as follows:
- Initialize parameters: Start with an initial guess for the parameter vector $\theta^{(0)}$. This is often a vector of zeros or small random values.
- Iteratively update parameters: In each iteration $t$, update the parameters from $\theta^{(t)}$ to $\theta^{(t+1)}$ using the rule $\theta^{(t+1)} = \theta^{(t)} - \eta \, \nabla J(\theta^{(t)})$, where $\theta^{(t)}$ is the vector of parameters at iteration $t$, and $\eta$ (eta) is the learning rate (or step length), a small positive value that determines the size of the step taken in the direction of the negative gradient. $\nabla J(\theta^{(t)})$ is the gradient of the cost function $J(\theta)$ evaluated at $\theta^{(t)}$. The gradient is a vector of partial derivatives of $J(\theta)$ with respect to each parameter $\theta_j$.
- Repeat until convergence: Continue computing the gradient and updating the parameters for a predetermined number of iterations, or until the change in the parameters or the cost function between iterations is smaller than a specified tolerance (see the sketch below).
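A minimal sketch of this loop is shown below, assuming a helper `compute_gradient(theta)` that returns $\nabla J(\theta)$; the function name and the `max_iters` and `tol` parameters are hypothetical, used only for illustration.

```python
import numpy as np

def gradient_descent(theta, compute_gradient, eta=0.01, max_iters=10_000, tol=1e-6):
    # Repeat theta <- theta - eta * grad until the parameters stop changing
    # (within tol) or a maximum number of iterations is reached.
    for t in range(max_iters):
        grad = compute_gradient(theta)   # gradient of the cost at the current theta
        theta_new = theta - eta * grad   # step in the direction of the negative gradient
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new             # converged: parameters barely moved
        theta = theta_new
    return theta
```

Monitoring the change in the cost $J(\theta)$ instead of (or in addition to) the change in $\theta$ is an equally common stopping criterion.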
Calculate the Gradient
To perform the update, you need to calculate the partial derivative of the cost function with respect to each parameter $\theta_j$. The cost function is

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

For a single observation $(x, y)$ in logistic regression:
- Log-likelihood term: $\ell(\theta) = y \log h + (1 - y) \log(1 - h)$, where $h = h_\theta(x)$
- Sigmoid function: $h_\theta(x) = \sigma(z) = \dfrac{1}{1 + e^{-z}}$, with $z = \theta^T x$
- Derivative of sigmoid: $\dfrac{d\sigma}{dz} = \sigma(z)\left(1 - \sigma(z)\right) = h(1 - h)$

Using the chain rule:
- Let $z = \theta^T x$ and $h = \sigma(z)$, so that $\dfrac{\partial \ell}{\partial \theta_j} = \dfrac{\partial \ell}{\partial h} \cdot \dfrac{\partial h}{\partial z} \cdot \dfrac{\partial z}{\partial \theta_j}$

Breaking it down:

$$\frac{\partial \ell}{\partial h} = \frac{y}{h} - \frac{1 - y}{1 - h}, \qquad \frac{\partial h}{\partial z} = h(1 - h), \qquad \frac{\partial z}{\partial \theta_j} = x_j$$

Combining terms:

$$\frac{\partial \ell}{\partial \theta_j} = \left( \frac{y}{h} - \frac{1 - y}{1 - h} \right) h (1 - h)\, x_j = (y - h)\, x_j$$

so each example contributes $(h - y)\, x_j$ to the partial derivative of the (negative log-likelihood) cost.

For the full gradient (all $m$ examples):

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \qquad \nabla J(\theta) = \frac{1}{m} X^T (h - y)$$
This derivative represents the average error (predicted probability minus actual outcome) scaled by the corresponding feature value across all training examples.
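As a sanity check on this derivation, the analytic gradient $\frac{1}{m} X^T (h - y)$ can be compared against a numerical finite-difference estimate of $J(\theta)$. The sketch below does this on a small made-up dataset; the data and helper names are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    # Negative average log-likelihood J(theta)
    h = sigmoid(X.dot(theta))
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradient(theta, X, y):
    # (1/m) * X^T (h - y), the expression derived above
    h = sigmoid(X.dot(theta))
    return X.T.dot(h - y) / len(y)

# Tiny synthetic example (illustrative only)
rng = np.random.default_rng(0)
X = np.c_[np.ones(5), rng.normal(size=(5, 2))]   # 5 examples: bias column + 2 features
y = np.array([0, 1, 1, 0, 1], dtype=float)
theta = rng.normal(size=3)

# Central finite-difference estimate of each partial derivative
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(numeric, analytic_gradient(theta, X, y), atol=1e-6))  # True
```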
```python
import numpy as np

def sigmoid(z):
    # Logistic function from LR_003
    return 1 / (1 + np.exp(-z))

# Assume X (m, n+1), y (m,), theta (n+1,), learning rate eta.
# The n+1 columns come from folding the bias b into theta as an extra weight.
m = len(y)
z = X.dot(theta)                     # linear scores, shape (m,)
h = sigmoid(z)                       # predicted probabilities
gradient = (1 / m) * X.T.dot(h - y)  # average error scaled by each feature
theta = theta - eta * gradient       # one gradient descent update
```