Support Vector Machines

Maximum margin classifiers

  • Also known as optimal separating hyperplane
  • Margin is the distance between hyperplane and closest training data point
  • We want to select a hyperplane for which this distance is maximum
  • Once we identify optimal separating hyper plane there can be many equidistance training points with the shortest distance from hyperplane
  • Such point are called support vectors
    • These points support the hyperplane in a sense that if they are moved slightly optimal hyperplane will also move

Training

svm1
  • Equation 9.10 ensures that left hand side of equation 9.11 gives perpendicular distance from the hyperplane
  • Equation assumes y has two values (+1) and (-1)

Support Vector classifier

  • Maximum margin classifier does not work when supporting hyperplane does not exist.
  • Support vector classifier relaxes optimization objective to get that work
  • Unlike maximum margin classifier this one is less prone to overfit as well
    • The formal one is very sensitive to change in single observation
  • Also know as soft margin classifier
svm2
  • ε variable allows training point to be on wrong side of margin
    • If ε > 0 it is on the wrong side of margin
    • If ε > 1 it is on the wrong side of hyperplane
  • Parameter C is the budget that constraints how many points are allowed on wrong side of hyperplane
  • C is selected with cross validation and controls bias variance trade off
  • Point that lies directly on the margin or on the wrong side of margin for their class are called support vectors
    • Because these points affects the choice of hyperplane
    • And this is the property which makes it robust to outliers
      • LDA calculates mean of all the observation
      • However LR is less sensitivity to outliers
  • Computation note – when we try to solve above optimization problem with lagrange multiplier we found that it depends on dot product of training samples
    • This will be very important when we discuss support vector machine in next section
    • Dot product is only with support vector, both while training and while solving [1]
svm3.PNG

Support vector machine

  • Above two classifier does not work when desired decision boundary is not linear
  • One solution is to create polynomial features (as we generally do for LR)
  • But fundamental problem with this approach is that how many and which terms you should create
  • Also creating large number of feature raises computational problem
  • For the case of SVM, that fact that it involved only dot product of observation allows us to perform kernel trick.svm4svm5svm6svm7
  • Kernel acts as similarity function
  • Above equation makes it clear that we are not calculating(and storing) higher order polynomial still taking the advantage of it
  • Second one is polynomial kernel and last one is radial kernel
  • This video shows visualization of kernel trick

Multiclass SVM

 Hinge Loss

  • Loss is 0 for w*x which are not support vectors
  • Look at equation 9.14 and 9.15 above
    • Looking at its property episilon, it relates to hinge loss
    • Yet to check it in SVM lagrange solution though
  • In hinge loss also we would predict +1 if w*x > 0 else -1
  • Why not keep boundary at 0
    • This can make w a zero to make positive training example right
    • Zero parameter vector would mean while prediction we would always get 0
      • Because prediction is w*x
      • This makes it difficult to assign labels during prediction
    • We keep boundary away from 0 to make zero we don’t get 0 parameter vector

References

[0] : An Introduction to Statistical Learning – http://www-bcf.usc.edu/~gareth/ISL/

[1] : https://towardsdatascience.com/understanding-support-vector-machine-part-2-kernel-trick-mercers-theorem-e1e6848c6c4d

2 thoughts on “Support Vector Machines

Leave a comment