So far the SVM classifier can only have a linear hyper-plane as its decision surface. This formulation can be further extended to build a nonlinear SVM. The motivation for this extension is that a SVM with nonlinear decision surface can classify nonlinearly separable data.
Consider a mapping
, Eqn.
, which maps the training
data from
to some higher Euclidean space
, which
possibly has infinite dimensions. In this high dimension space, the
data can be linearly separable, hence the linear SVM formulation above
can be applied to these data. In the SVM formulation, the training
data only appear in the form of dot products,
i.e. in Eqn.
, the same is true
in the decision function
(
). These can be replace by dot
products in the Euclidean space
, i.e.
.
The dot product in the high dimension space can also be replace by
a kernel function, Eqn.
. By computing the dot product
directly using a kernel function, one avoid the mapping
. This is desirable because
has possibly infinite
dimensions and
can be tricky or impossible to compute.
Using a kernel function, one need not explicitly know what
is.
By using a kernel function, a SVM that operates in infinite dimensional
space can be constructed. This also means that w cannot be precomputed,
and that the decision function will be replaced by Eqn.
,
where for every new test data, the kernel function for each SV need
to be recomputed.
The question to ask next is ``what kernel function can be used
?'' i.e. is
there any constraint on the type of kernel function suitable for this
task. For any kernel function suitable for SVM, there must exist at
least one pair of
, such that Eqn.
is
true, i.e. the kernel function represents the dot product of the data
in
. The kernel that has these properties satisfies the
Mercer's condition [Vapnik, 1995], i.e. for any g(x) with finite
norm (Eqn.
), Eqn.
must hold. Any positive
definite kernel satisfies this condition.
It can be proved that the kernel functions in Table
satisfy this condition.
It is interesting to note that by choosing different kernel functions,
the SVM can emulate some well known classifiers [Osuna et al., 1997b], as shown
in table
.
Table: Some possible kernel functions and the types of decision
surface they define