Support Vector Machines (SVMs) find the optimal hyperplane that separates data points into different classes. The key idea is to maximize the distance between the hyperplane and the closest data points; this distance is called the margin. A larger margin means better generalization to unseen data.
The separating hyperplane is defined by $w^\top x + b = 0$. Data points are classified based on the sign of $w^\top x + b$. Additionally, to ensure a margin, we introduce canonical hyperplanes (margin boundaries) at $w^\top x + b = +1$ and $w^\top x + b = -1$. The data points closest to the separating hyperplane, called support vectors, lie exactly on these boundaries.
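As a quick illustration of the decision rule, the sketch below classifies a point by the sign of $w^\top x + b$; the values of `w`, `b`, and `x` are made up for the example (in practice $w$ and $b$ come from training).

```python
import numpy as np

# Illustrative parameters; in practice they are learned from data.
w = np.array([2.0, -1.0])
b = 0.5

x = np.array([1.0, 3.0])   # a point to classify
score = w @ x + b          # value of w^T x + b
label = np.sign(score)     # +1 or -1, i.e. which side of the hyperplane
print(score, label)        # -0.5, -1.0
```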
Note
Choosing a different value, like $2$ or any $c > 0$, would result in the same optimal separating hyperplane but with scaled parameters ($w$ and $b$ would be scaled by $c$). Setting it to $1$ is simply a standardized normalization that simplifies the mathematical formulation of the optimization problem.
Objective
The geometric distance between the $w^\top x + b = +1$ and $w^\top x + b = -1$ hyperplanes is:

$$
\frac{2}{\|w\|}
$$
Maximizing this distance is the objective, since a wider margin tends to generalize better to unseen data.
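One way to see where this expression comes from: take a point $x_+$ on the $+1$ boundary and the closest point $x_-$ on the $-1$ boundary, so that $x_+ - x_- = \lambda w$ for some scalar $\lambda$. Then:

$$
\begin{aligned}
w^\top x_+ + b = +1, \quad w^\top x_- + b = -1
\;\;&\Rightarrow\;\; w^\top (x_+ - x_-) = 2 \\
w^\top (\lambda w) = 2
\;\;&\Rightarrow\;\; \lambda = \frac{2}{\|w\|^2} \\
\text{margin} = \|x_+ - x_-\| = |\lambda|\,\|w\|
\;\;&= \frac{2}{\|w\|}
\end{aligned}
$$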
Why minimize $\frac{1}{2}\|w\|^2$
Maximizing $\frac{2}{\|w\|}$ is equivalent to minimizing $\|w\|$, which is the same as minimizing $\|w\|^2$ (squaring a positive value doesn't change the location of the minimum). The factor $\frac{1}{2}$ is added for mathematical convenience: the gradient of $\frac{1}{2}\|w\|^2$ is simply $w$, which simplifies derivative calculations, especially when using Lagrange multipliers for optimization.
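Concretely, the quadratic form has a clean gradient, whereas the norm itself does not:

$$
\nabla_w \left( \tfrac{1}{2}\|w\|^2 \right)
= \nabla_w \left( \tfrac{1}{2} w^\top w \right)
= w,
\qquad
\nabla_w \|w\| = \frac{w}{\|w\|} \quad (\text{undefined at } w = 0).
$$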
This minimization is performed subject to constraints: for each training point $(x_i, y_i)$ with label $y_i \in \{-1, +1\}$, the constraint ensures it is on the correct side of the margin. Using the canonical representation where the functional margin is $1$, the constraint is $y_i(w^\top x_i + b) \ge 1$.
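Putting the objective and constraints together, the hard-margin primal problem over $n$ training points is:

$$
\min_{w,\,b} \; \tfrac{1}{2}\|w\|^2
\quad \text{subject to} \quad
y_i \left( w^\top x_i + b \right) \ge 1, \qquad i = 1, \dots, n.
$$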
For the soft-margin SVM, which allows some points to be misclassified or to violate the margin, slack variables $\xi_i \ge 0$ are introduced. The constraint becomes $y_i(w^\top x_i + b) \ge 1 - \xi_i$, and a penalty term $C \sum_i \xi_i$ is added to the objective function. The resulting objective is called the soft-margin objective.
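With the slack variables and penalty term, the soft-margin primal problem becomes:

$$
\min_{w,\,b,\,\xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \qquad i = 1, \dots, n.
$$

Here $C$ controls the trade-off: a large $C$ penalizes margin violations heavily, while a small $C$ tolerates more violations in exchange for a wider margin.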
So basically, maximize the margin → minimize the cost function defined above subject to some constraints. Observe that this is a convex quadratic programming (QP) problem, and there are a few ways to solve it:
- typically solved by formulating and solving its dual problem
- specialized algorithms like Sequential Minimal Optimization (SMO) are effective
- gradient descent or SGD on the primal formulation (see the sketch after this list)
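A minimal sketch of the last option, assuming NumPy, labels in $\{-1, +1\}$, and the soft-margin (hinge loss) objective; the function name, learning rate, and update schedule are illustrative rather than a reference implementation.

```python
import numpy as np

def linear_svm_sgd(X, y, C=1.0, lr=0.01, epochs=100, seed=0):
    """Illustrative SGD on the primal soft-margin objective
    (1/2)||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i + b)).
    Assumes y contains labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:
                # Hinge loss active: subgradient includes the data term.
                grad_w, grad_b = w - C * y[i] * X[i], -C * y[i]
            else:
                # Only the regularizer contributes.
                grad_w, grad_b = w, 0.0
            # Note: the regularizer gradient is applied at every per-sample
            # step here; other schedules (e.g. Pegasos) scale it differently.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy usage on two linearly separable clusters.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)
w, b = linear_svm_sgd(X, y)
print(np.mean(np.sign(X @ w + b) == y))  # training accuracy, should be near 1.0
```

In practice, library implementations (e.g. the SMO-style solver behind scikit-learn's `SVC`, or `SGDClassifier` with hinge loss for the primal route) are preferred over a hand-rolled loop like this.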