Consider a two-class classifier
(for the same
problem described in the beginning of the chapter) with oriented
hyper-planes. The decision function
of the classifier
is
, where
.
If the training data is linearly separable, then a set of
pairs can be found such that the constraints in Eqn.
are
satisfied. Notice
that there is ambiguity in the magnitude of w and b. They can be
arbitrary scaled such that
, where
is the
training data nearest to the decision plane.
Intuitively, the classifier with the largest margin will give lower
expected risk (Eqn.
), i.e. better generalisation.
Comparing the two different decision planes in Figure
,
the classifier with smaller margin will have higher expected risk.
The margin for this linear classifier is just 2/||w|| (Figure
). Hence to maximise the margin, one needs to minimise the
||w|| with the constraints in Eqn.
. In short, the
training of this classifier is achieved by solving a linearly
constraint-ed optimisation problem.
Figure: Comparing liner classifier with different margin size
Figure: Decision margin for a oriented hyper-planes classifier
This optimisation problem is solved using the Lagrangian
formulation. The Lagrangian for minimising ||w|| is shown in
Eqn.
. This needs to be minimised with respect to w, b, and
it is simultaneously required that
and
Eqn.
be satisfied. Applying
the prima-dual formulation, this can be achieved by maximising
subject to the constraints that
,
and also satisfying Eqn.
.
From the constraints in the dual formulation, the condition in
Eqn.
and
is obtained. These can be substituted into
Eqn.
to give the new Lagrangian in Eqn.
.
The training problem is transformed to maximise Eqn.
with
respect to constraints in Eqn.
,
and
.
This is a convex QP problem, since ||w|| is convex (Figure
shows that the
norm of a vector is convex) and all
constraints are linear. For this particular problem, the
Karush-Kuhn-Tucker (KKT) conditions are Eqn.
,
and
with an additional condition as stated in
Eqn.
. From these KKT condition, the following conclusion can
be made.
At this point it would be beneficial to consider the significance
the
value for each set of training data.
Those training data with nonzero
values will fall on the +1 or
-1 plane (see Figure
), i.e. these are the data that
contribute to defining the decision boundary. If the other data are
removed and the classifier is retrained on the remaining data, the
training will result in the same decision boundary. These data points
are called the Support Vectors (SV). SV with larger
are more
important, since they have stronger influence on the decision boundary.
Solving Eqn.
will give the value for all
.
From Eqn.
, w is the a linear combination of all the
training data
(only those training points with nonzero
value contribute). b is found by using Eqn.
, by
selecting any training data with nonzero
values. It is
numerically wiser to average b over all such training data.