Gradient descent is an iterative optimization algorithm that finds the minimum of a function by repeatedly moving in the direction opposite to the gradient. Applied to the logistic regression optimization problem, the algorithm proceeds as follows:
- Initialize parameters: Start with an initial guess for the parameter vector $\theta^{(0)}$. This is often a vector of zeros or small random values.
- Iteratively update parameters: In each iteration $t$, update the parameters from $\theta^{(t)}$ to $\theta^{(t+1)}$ using the rule $\theta^{(t+1)} = \theta^{(t)} - \eta \, \nabla J(\theta^{(t)})$, where $\theta^{(t)}$ is the vector of parameters at iteration $t$, and $\eta$ (eta) is the learning rate (or step length), a small positive value that determines the size of the step taken in the direction of the negative gradient. $\nabla J(\theta^{(t)})$ is the gradient of the cost function $J(\theta)$ evaluated at $\theta^{(t)}$. The gradient is a vector of partial derivatives of $J(\theta)$ with respect to each parameter $\theta_j$.
- Repeat until convergence: Continue computing the gradient and updating the parameters for a predetermined number of iterations, or until the change in the parameters or the cost function between iterations is smaller than a specified tolerance (see the sketch below).
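A minimal sketch of this loop is shown below, assuming a helper `compute_gradient(theta)` that returns $\nabla J(\theta)$; the function name and the `max_iters` and `tol` parameters are hypothetical, used only for illustration.

```python
import numpy as np

def gradient_descent(theta, compute_gradient, eta=0.01, max_iters=10_000, tol=1e-6):
    # Repeat theta <- theta - eta * grad until the parameters stop changing
    # (within tol) or a maximum number of iterations is reached.
    for t in range(max_iters):
        grad = compute_gradient(theta)   # gradient of the cost at the current theta
        theta_new = theta - eta * grad   # step in the direction of the negative gradient
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new             # converged: parameters barely moved
        theta = theta_new
    return theta
```

Monitoring the change in the cost $J(\theta)$ instead of (or in addition to) the change in $\theta$ is an equally common stopping criterion.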
Calculate the Gradient
To perform the update, you need to calculate the partial derivative of the cost function with respect to each parameter $\theta_j$. The cost function is

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

For a single observation $(x, y)$ in logistic regression:
- Log-likelihood term: $\ell(\theta) = y \log h + (1 - y) \log(1 - h)$, where $h = h_\theta(x)$
- Sigmoid function: $h_\theta(x) = \sigma(z) = \dfrac{1}{1 + e^{-z}}$, with $z = \theta^T x$
- Derivative of sigmoid: $\dfrac{d\sigma}{dz} = \sigma(z)\left(1 - \sigma(z)\right) = h(1 - h)$

Using the chain rule:
- Let $z = \theta^T x$ and $h = \sigma(z)$, so that $\dfrac{\partial \ell}{\partial \theta_j} = \dfrac{\partial \ell}{\partial h} \cdot \dfrac{\partial h}{\partial z} \cdot \dfrac{\partial z}{\partial \theta_j}$

Breaking it down:

$$\frac{\partial \ell}{\partial h} = \frac{y}{h} - \frac{1 - y}{1 - h}, \qquad \frac{\partial h}{\partial z} = h(1 - h), \qquad \frac{\partial z}{\partial \theta_j} = x_j$$

Combining terms:

$$\frac{\partial \ell}{\partial \theta_j} = \left( \frac{y}{h} - \frac{1 - y}{1 - h} \right) h (1 - h)\, x_j = (y - h)\, x_j$$

so each example contributes $(h - y)\, x_j$ to the partial derivative of the (negative log-likelihood) cost.

For the full gradient (all $m$ examples):

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \qquad \nabla J(\theta) = \frac{1}{m} X^T (h - y)$$
This derivative represents the average error (predicted probability minus actual outcome) scaled by the corresponding feature value across all training examples.
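As a sanity check on this derivation, the analytic gradient $\frac{1}{m} X^T (h - y)$ can be compared against a numerical finite-difference estimate of $J(\theta)$. The sketch below does this on a small made-up dataset; the data and helper names are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    # Negative average log-likelihood J(theta)
    h = sigmoid(X.dot(theta))
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradient(theta, X, y):
    # (1/m) * X^T (h - y), the expression derived above
    h = sigmoid(X.dot(theta))
    return X.T.dot(h - y) / len(y)

# Tiny synthetic example (illustrative only)
rng = np.random.default_rng(0)
X = np.c_[np.ones(5), rng.normal(size=(5, 2))]   # 5 examples: bias column + 2 features
y = np.array([0, 1, 1, 0, 1], dtype=float)
theta = rng.normal(size=3)

# Central finite-difference estimate of each partial derivative
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(numeric, analytic_gradient(theta, X, y), atol=1e-6))  # True
```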
```python
import numpy as np

def sigmoid(z):
    # Logistic function from LR_003
    return 1 / (1 + np.exp(-z))

# Assume X (m, n+1), y (m,), theta (n+1,), learning rate eta.
# The n+1 columns come from folding the bias b into theta as an extra weight.
m = len(y)
z = X.dot(theta)                     # linear scores, shape (m,)
h = sigmoid(z)                       # predicted probabilities
gradient = (1 / m) * X.T.dot(h - y)  # average error scaled by each feature
theta = theta - eta * gradient       # one gradient descent update
```