SVM aims to find a hyperplane that maximizes the margin between classes. The standard formulation is a convex quadratic programming (QP) problem: minimize $\frac{1}{2}\|w\|^2$ subject to $y_i(w^\top x_i + b) \ge 1$ for all data points $(x_i, y_i)$ (for hard margin SVM). Solving this QP can use various methods:
- Dual problem solvers (like SMO), common for SVMs that use kernels
- Generic QP solvers
- Gradient Descent (GD) or Stochastic Gradient Descent (SGD) on the primal formulation using hinge loss
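As an aside on the third option, scikit-learn's `SGDClassifier` with `loss="hinge"` trains a linear SVM by SGD on this primal hinge-loss objective. A minimal usage sketch (the toy dataset and hyperparameter values are illustrative, not from the text above):

```python
# Linear SVM trained by SGD on the primal hinge loss via scikit-learn.
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

# Toy 2-D classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# loss="hinge" gives the linear SVM objective; alpha is the regularization strength
clf = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, tol=1e-3, random_state=0)
clf.fit(X, y)
print("train accuracy:", clf.score(X, y))
```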
Using GD/SGD means minimizing a different, but equivalent, objective function for the soft margin SVM. The constrained primal problem:

Minimize: $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$

subject to $y_i(w^\top x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$ for $i = 1, \dots, n$

This is equivalent to minimizing the regularized hinge loss objective (unconstrained):

$$J(w, b) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max\big(0,\; 1 - y_i(w^\top x_i + b)\big)$$
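For reference, the unconstrained objective translates directly into a few lines of NumPy. A sketch (the function name `svm_objective` is my own, and labels are assumed to be in $\{-1, +1\}$):

```python
import numpy as np

def svm_objective(w, b, X, y, C=1.0):
    """Regularized hinge loss: 0.5*||w||^2 + C * sum(max(0, 1 - y*(Xw + b))).

    Assumes labels y are in {-1, +1}.
    """
    margins = y * (X @ w + b)                 # y_i * (w^T x_i + b) for each point
    hinge = np.maximum(0.0, 1.0 - margins)    # per-point hinge loss
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)
```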
This objective function is convex. Therefore, GD/SGD can find the global minimum:
- Initialize $w$ and $b$ (e.g., to zeros)
- Repeat until convergence:
  - Compute the gradient of the objective function w.r.t. $w$ and $b$
  - Update parameters: $w \leftarrow w - \eta\, \nabla_w J$, $b \leftarrow b - \eta\, \frac{\partial J}{\partial b}$, where $\eta$ is the learning rate
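As a concrete illustration of this loop, here is a minimal NumPy sketch of batch gradient descent on the objective above. The gradient expressions it uses are the ones derived in the next section; the function name `train_svm_gd` and the default hyperparameters are my own, and labels are assumed to be in $\{-1, +1\}$:

```python
import numpy as np

def train_svm_gd(X, y, C=1.0, lr=0.01, n_iters=1000):
    """Batch gradient descent on the regularized hinge loss.

    Assumes labels y are in {-1, +1}. Returns the learned (w, b).
    """
    n, d = X.shape
    w = np.zeros(d)   # initialize w to zeros
    b = 0.0           # initialize b to zero

    for _ in range(n_iters):
        margins = y * (X @ w + b)        # m_i = y_i (w^T x_i + b)
        violated = margins < 1           # points inside the margin or misclassified

        # Gradients of 0.5*||w||^2 + C * sum(hinge), derived in the next section
        grad_w = w - C * (y[violated] @ X[violated])
        grad_b = -C * np.sum(y[violated])

        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```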
Gradient Calculation
The objective is the sum of the regularization term $\frac{1}{2}\|w\|^2$ and the hinge losses across the dataset, $C \sum_i \max\big(0,\; 1 - y_i(w^\top x_i + b)\big)$. The gradient of a sum is the sum of the gradients:
- Gradient of $\frac{1}{2}\|w\|^2$ w.r.t. $w$ is $w$, and w.r.t. $b$ is $0$
- Gradient of the hinge loss: for each data point $(x_i, y_i)$, let $m_i = y_i(w^\top x_i + b)$:
  - If $m_i \ge 1$ (point correctly classified, outside the margin): gradient of $\max(0, 1 - m_i)$ is $0$
  - If $m_i < 1$ (point violates the margin or is misclassified): gradient w.r.t. $w$ is $-y_i x_i$ and w.r.t. $b$ is $-y_i$
Combined Gradients:

$$\nabla_w J = w - C \sum_{i:\, m_i < 1} y_i x_i, \qquad \frac{\partial J}{\partial b} = -C \sum_{i:\, m_i < 1} y_i$$
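One quick way to sanity-check these combined expressions is to compare them against central finite differences of the objective. A small self-contained sketch (the helper names and random test data are my own):

```python
import numpy as np

def svm_grads(w, b, X, y, C=1.0):
    """Analytic gradients of 0.5*||w||^2 + C*sum(max(0, 1 - y*(Xw + b)))."""
    margins = y * (X @ w + b)
    violated = margins < 1
    return w - C * (y[violated] @ X[violated]), -C * np.sum(y[violated])

def svm_obj(w, b, X, y, C=1.0):
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)

# Compare analytic gradients against central finite differences at a random point
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sign(rng.normal(size=20))          # labels in {-1, +1}
w, b, eps = rng.normal(size=3), 0.3, 1e-6

gw, gb = svm_grads(w, b, X, y)
num_gw = np.array([
    (svm_obj(w + eps * e, b, X, y) - svm_obj(w - eps * e, b, X, y)) / (2 * eps)
    for e in np.eye(3)
])
num_gb = (svm_obj(w, b + eps, X, y) - svm_obj(w, b - eps, X, y)) / (2 * eps)
print(np.allclose(gw, num_gw, atol=1e-4), np.isclose(gb, num_gb, atol=1e-4))
```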
Parameter Updates
For batch GD, compute the gradients as above (across the whole dataset) and update the parameters $w$ and $b$ at the end of each pass, whereas for SGD, update based on the gradient from a single sample at each step:
If $y_i(w^\top x_i + b) \ge 1$:
- $\nabla_w J_i = w$, $\frac{\partial J_i}{\partial b} = 0$ (only the regularization term contributes)
Else ($y_i(w^\top x_i + b) < 1$, margin violated):
- $\nabla_w J_i = w - C\, y_i x_i$, $\frac{\partial J_i}{\partial b} = -C\, y_i$
Update: $w \leftarrow w - \eta\, \nabla_w J_i$, $b \leftarrow b - \eta\, \frac{\partial J_i}{\partial b}$
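Putting the per-sample case analysis and update rule together, here is a minimal SGD sketch (the function name, epoch count, and learning rate are illustrative assumptions, with labels again assumed to be in $\{-1, +1\}$):

```python
import numpy as np

def train_svm_sgd(X, y, C=1.0, lr=0.01, n_epochs=20, seed=0):
    """Stochastic gradient descent: update (w, b) after every single sample."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0

    for _ in range(n_epochs):
        for i in rng.permutation(n):             # visit samples in random order
            if y[i] * (X[i] @ w + b) >= 1:       # margin satisfied
                grad_w, grad_b = w, 0.0          # only regularization contributes
            else:                                # margin violated
                grad_w, grad_b = w - C * y[i] * X[i], -C * y[i]
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b
```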