Kernel Function and Kernel Trick

 Parametric Models

They store the parameter only and not the entire training dataset.

Memory Based Methods

Store entire training data or a subset of the training dataset. They are fast to train, but prediction takes a longer time. Eg. Knn

Need for Kernels

Not all problems are linearly separable. Also, when a feature vector is transformed to a higher dimensional space, linearly separable problems become non linearly separable.

For example, let us consider a point (x,y). In the 2D space it might be non-linearly separable. that is a straight line cannot separate the points. We can add a new dimension that is a combination of the existing dimensions. Let it be

z = x^2 + y^2

Now the transformed feature space is (x, y, z) and is of 3 dimensions. In this dimension, z might help to move the data points along x and y in such a way that there is a linear separator between them. That is we move from the original feature space to a transformed feature space.

We cannot say how many dimensions we need to increase to get this linearity. It depends on the complexity of the data distribution and the problem to be solved.

Linearly Separable Data:

There is no need to change the dimension. We retain the original dimension or feature space.

Dot Product

It is a similarity measure that measures the distance between 2 points in space. If the dot product is positive, the points are similar to each other and the angle between them will be less than 90 degrees. A zero dot product show the vectors are dissimilar and are at 90 degrees to each other. A negative dot product shows they are pointing in opposite directions and have significantly different properties and so they are pointing in different directions.

Kernel Trick

We know that non-linearly separable problems in lower dimensions can often become linearly separable when transformed into higher dimensions. However, increasing the dimensionality also increases computational complexity. Is there a way to obtain a value in the higher dimension that maintains the similarity measure between and in the lower dimension? Although we will not obtain the same value for the similarity measure in higher dimensions, this value will still reflect the relative distances between points, preserving their relationships.

Consider the following example, which demonstrates that the dot product in a higher dimension maintains the relative relationship observed in the original space.

Let x=(x1,x2,x3), y = (y1,y2,y3) in 3D feature space. Let is convert it to a 9D feature space as follows: f(x)=(x1.x1, x1.x2, x1.x3, x2.x1, x2.x2, x2.x3, x3.x1, x3.x2, x3.x3)

Let us understand the concept of kernel with a numercial example. x=(1,2,3), y=(4,5,6)

Then transforming the feature space to 9D space (x1.x1, x1.x2, x1.x3, x2.x1, x2.x2, x2.x3, x3.x1, x3.x2, x3.x3) results in:

f(x)=(1,2,3,2,4,6,3,6,9) f(y)=(16,20,24,20,25,30,24,30,36) The dot product in the transformed feature space is given by: f(x).f(y) = 1024

Let us use a function K(x,y) such that K(x,y) = (x.y)^2

x and y are points in 3D space, then

K(x,y) = ((1,2,3).(4,5,6) ) ^ 2

= 32^2 = 1024

Note that this is a special case in which the similarity value is preserved in the higher dimension and usually the numerical value of the similarity measure is not the same in the higher dimension.

Kernel function

The function K(x,y) is called the kernel function and it represents the similarity between the points in the transformed space. Kernel function is the tool used to perform kernel trick. These functions are applied on the samples in the training dataset. They help to create non-linear decision boundaries in non-parametric models.

Lets take the kernel function used in the example,

K(x,y) = (x.y)^2

This performs the dot product of vectors in 3D space and gives the effect of computing the dot product of f(x) and f(y) in the 9D space. Note it does not transform x and y to f(x) and f(y) to find the effect in 9D space. This transformation to higher dimension is usually the most computationally expensive and that has been taken care of.

Kernel Matrix
It is also called the gram matrix and it gives the pairwise similarity between the data points after applying the kernel function.





Comments

Popular posts from this blog

ANN Series - 10 - Backpropagation of Errors

Naive Bayesian Classifiers - Multinomial, Bernoulli and Gaussian with Solved Examples and Laplace Smoothing

Clustering - K means Clustering