SVM aims to find a hyperplane that maximizes the margin between classes. The standard formulation is a convex quadratic programming (QP) problem: minimize $\frac{1}{2}\|w\|^2$ subject to $y_i(w^\top x_i + b) \ge 1$ for all data points $(x_i, y_i)$ (for hard margin SVM). Solving this QP can use various methods:
- Dual problem solvers (like SMO), common for SVMs that use kernels
- Generic QP solvers
- Gradient Descent (GD) or Stochastic Gradient Descent (SGD) on the primal formulation using hinge loss
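As an aside on the third option, scikit-learn's `SGDClassifier` with `loss="hinge"` trains a linear SVM by SGD on this primal hinge-loss objective. A minimal usage sketch (the toy dataset and hyperparameter values are illustrative, not from the text above):

```python
# Linear SVM trained by SGD on the primal hinge loss via scikit-learn.
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

# Toy 2-D classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# loss="hinge" gives the linear SVM objective; alpha is the regularization strength
clf = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, tol=1e-3, random_state=0)
clf.fit(X, y)
print("train accuracy:", clf.score(X, y))
```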
Using GD/SGD means minimizing a different, but equivalent, objective function for the soft margin SVM. The constrained primal problem:

Minimize: $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$

subject to $y_i(w^\top x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$ for $i = 1, \dots, n$

This is equivalent to minimizing the regularized hinge loss objective (unconstrained):

$$J(w, b) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max\big(0,\; 1 - y_i(w^\top x_i + b)\big)$$
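For reference, the unconstrained objective translates directly into a few lines of NumPy. A sketch (the function name `svm_objective` is my own, and labels are assumed to be in $\{-1, +1\}$):

```python
import numpy as np

def svm_objective(w, b, X, y, C=1.0):
    """Regularized hinge loss: 0.5*||w||^2 + C * sum(max(0, 1 - y*(Xw + b))).

    Assumes labels y are in {-1, +1}.
    """
    margins = y * (X @ w + b)                 # y_i * (w^T x_i + b) for each point
    hinge = np.maximum(0.0, 1.0 - margins)    # per-point hinge loss
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)
```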
This objective function is convex. Therefore, GD/SGD can find the global minimum:
- Initialize $w$ and $b$ (e.g., to zeros)
- Repeat until convergence:
  - Compute the gradient of the objective function w.r.t. $w$ and $b$
  - Update parameters: $w \leftarrow w - \eta\, \nabla_w J$, $b \leftarrow b - \eta\, \frac{\partial J}{\partial b}$, where $\eta$ is the learning rate
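As a concrete illustration of this loop, here is a minimal NumPy sketch of batch gradient descent on the objective above. The gradient expressions it uses are the ones derived in the next section; the function name `train_svm_gd` and the default hyperparameters are my own, and labels are assumed to be in $\{-1, +1\}$:

```python
import numpy as np

def train_svm_gd(X, y, C=1.0, lr=0.01, n_iters=1000):
    """Batch gradient descent on the regularized hinge loss.

    Assumes labels y are in {-1, +1}. Returns the learned (w, b).
    """
    n, d = X.shape
    w = np.zeros(d)   # initialize w to zeros
    b = 0.0           # initialize b to zero

    for _ in range(n_iters):
        margins = y * (X @ w + b)        # m_i = y_i (w^T x_i + b)
        violated = margins < 1           # points inside the margin or misclassified

        # Gradients of 0.5*||w||^2 + C * sum(hinge), derived in the next section
        grad_w = w - C * (y[violated] @ X[violated])
        grad_b = -C * np.sum(y[violated])

        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```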
Gradient Calculation
The objective is the sum of the regularization term $\frac{1}{2}\|w\|^2$ and the hinge losses across the dataset, $C \sum_i \max\big(0,\; 1 - y_i(w^\top x_i + b)\big)$. The gradient of a sum is the sum of the gradients:
- Gradient of $\frac{1}{2}\|w\|^2$ w.r.t. $w$ is $w$, and w.r.t. $b$ is $0$
- Gradient of the hinge loss: for each data point $(x_i, y_i)$, let $m_i = y_i(w^\top x_i + b)$:
  - If $m_i \ge 1$ (point correctly classified, outside the margin): gradient of $\max(0, 1 - m_i)$ is $0$
  - If $m_i < 1$ (point violates the margin or is misclassified): gradient w.r.t. $w$ is $-y_i x_i$ and w.r.t. $b$ is $-y_i$
Combined Gradients:

$$\nabla_w J = w - C \sum_{i:\, m_i < 1} y_i x_i, \qquad \frac{\partial J}{\partial b} = -C \sum_{i:\, m_i < 1} y_i$$
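One quick way to sanity-check these combined expressions is to compare them against central finite differences of the objective. A small self-contained sketch (the helper names and random test data are my own):

```python
import numpy as np

def svm_grads(w, b, X, y, C=1.0):
    """Analytic gradients of 0.5*||w||^2 + C*sum(max(0, 1 - y*(Xw + b)))."""
    margins = y * (X @ w + b)
    violated = margins < 1
    return w - C * (y[violated] @ X[violated]), -C * np.sum(y[violated])

def svm_obj(w, b, X, y, C=1.0):
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)

# Compare analytic gradients against central finite differences at a random point
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sign(rng.normal(size=20))          # labels in {-1, +1}
w, b, eps = rng.normal(size=3), 0.3, 1e-6

gw, gb = svm_grads(w, b, X, y)
num_gw = np.array([
    (svm_obj(w + eps * e, b, X, y) - svm_obj(w - eps * e, b, X, y)) / (2 * eps)
    for e in np.eye(3)
])
num_gb = (svm_obj(w, b + eps, X, y) - svm_obj(w, b - eps, X, y)) / (2 * eps)
print(np.allclose(gw, num_gw, atol=1e-4), np.isclose(gb, num_gb, atol=1e-4))
```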
Parameter Updates
For batch GD, compute the gradients as above (across the whole dataset) and update the parameters $w$ and $b$ at the end of each pass, whereas for SGD, update based on the gradient from a single sample at each step:
If $y_i(w^\top x_i + b) \ge 1$:
- $\nabla_w J_i = w$, $\frac{\partial J_i}{\partial b} = 0$ (only the regularization term contributes)
Else ($y_i(w^\top x_i + b) < 1$, margin violated):
- $\nabla_w J_i = w - C\, y_i x_i$, $\frac{\partial J_i}{\partial b} = -C\, y_i$
Update: $w \leftarrow w - \eta\, \nabla_w J_i$, $b \leftarrow b - \eta\, \frac{\partial J_i}{\partial b}$
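Putting the per-sample case analysis and update rule together, here is a minimal SGD sketch (the function name, epoch count, and learning rate are illustrative assumptions, with labels again assumed to be in $\{-1, +1\}$):

```python
import numpy as np

def train_svm_sgd(X, y, C=1.0, lr=0.01, n_epochs=20, seed=0):
    """Stochastic gradient descent: update (w, b) after every single sample."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0

    for _ in range(n_epochs):
        for i in rng.permutation(n):             # visit samples in random order
            if y[i] * (X[i] @ w + b) >= 1:       # margin satisfied
                grad_w, grad_b = w, 0.0          # only regularization contributes
            else:                                # margin violated
                grad_w, grad_b = w - C * y[i] * X[i], -C * y[i]
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b
```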