Kernel Functions for Machine Learning Applications

In recent years, Kernel methods have received major attention, particularly due to the increased popularity of Support Vector Machines. Kernel functions can be used in many applications, as they provide a simple bridge from linearity to non-linearity for algorithms which can be expressed in terms of dot products. In this article, we will list a few kernel functions and some of their properties.

Many of these functions have been incorporated in Accord.NET, a framework for creating machine learning, statistics, and computer vision applications.

Contents

  1. Kernel Methods
    1. The Kernel Trick
    2. Kernel Properties
    3. Choosing the Right Kernel
  2. Kernel Functions
    1. Linear Kernel
    2. Polynomial Kernel
    3. Gaussian Kernel
    4. Exponential Kernel
    5. Laplacian Kernel
    6. ANOVA Kernel
    7. Hyperbolic Tangent (Sigmoid) Kernel
    8. Rational Quadratic Kernel
    9. Multiquadric Kernel
    10. Inverse Multiquadric Kernel
    11. Circular Kernel
    12. Spherical Kernel
    13. Wave Kernel
    14. Power Kernel
    15. Log Kernel
    16. Spline Kernel
    17. B-Spline Kernel
    18. Bessel Kernel
    19. Cauchy Kernel
    20. Chi-Square Kernel
    21. Histogram Intersection Kernel
    22. Generalized Histogram Intersection Kernel
    23. Generalized T-Student Kernel
    24. Bayesian Kernel
    25. Wavelet Kernel
  3. Source code
  4. See also
  5. References

Kernel Methods

Kernel methods are a class of algorithms for pattern analysis or recognition, whose best known element is the support vector machine (SVM). The general task of pattern analysis is to find and study general types of relations (such as clusters, rankings, principal components, correlations, classifications) in general types of data (such as sequences, text documents, sets of points, vectors, images, graphs, etc.) (Wikipedia, 2010a).

The main characteristic of Kernel Methods, however, is their distinct approach to this problem. Kernel methods map the data into higher-dimensional spaces in the hope that in this higher-dimensional space the data could become more easily separated or better structured. There are also no constraints on the form of this mapping, which could even lead to infinite-dimensional spaces. This mapping function, however, never needs to be computed explicitly, thanks to a tool called the kernel trick.

The Kernel Trick

The Kernel trick is a very interesting and powerful tool. It is powerful because it provides a bridge from linearity to non-linearity for any algorithm that can be expressed solely in terms of dot products between two vectors. It comes from the fact that, if we first map our input data into a higher-dimensional space, a linear algorithm operating in this space will behave non-linearly in the original input space.

Now, the Kernel trick is really interesting because that mapping never needs to be computed. If our algorithm can be expressed only in terms of an inner product between two vectors, all we need to do is replace this inner product with the inner product from some other suitable space. That is where the “trick” resides: wherever a dot product is used, it is replaced with a Kernel function. The kernel function denotes an inner product in feature space and is usually denoted as:

k(x, y) = \langle \varphi(x), \varphi(y) \rangle

Using the Kernel function, the algorithm can then be carried into a higher-dimensional space without explicitly mapping the input points into this space. This is highly desirable, as sometimes our higher-dimensional feature space could even be infinite-dimensional and thus infeasible to compute.
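
To make the trick more concrete, here is a tiny stand-alone C# sketch (an illustration of mine, not taken from Accord.NET). For two-dimensional inputs, the degree-2 homogeneous polynomial kernel k(x, y) = (x·y)^2 equals the ordinary dot product of the explicitly mapped vectors phi(x) = (x1^2, sqrt(2)·x1·x2, x2^2), so the kernel yields the feature-space inner product without ever building phi:

using System;

static double Dot(double[] a, double[] b)
{
    double s = 0;
    for (int i = 0; i < a.Length; i++) s += a[i] * b[i];
    return s;
}

// Degree-2 homogeneous polynomial kernel: k(x, y) = (x . y)^2
static double PolyKernel2(double[] x, double[] y)
{
    double d = Dot(x, y);
    return d * d;
}

// Explicit feature map for 2-D inputs: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
static double[] Phi(double[] x) =>
    new[] { x[0] * x[0], Math.Sqrt(2) * x[0] * x[1], x[1] * x[1] };

var u = new[] { 1.0, 2.0 };
var w = new[] { 3.0, 4.0 };

// Both lines print 121: the kernel evaluates the feature-space inner
// product without ever constructing phi explicitly.
Console.WriteLine(PolyKernel2(u, w));
Console.WriteLine(Dot(Phi(u), Phi(w)));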

Kernel Properties

Kernel functions must be continuous, symmetric, and most preferably should have a positive (semi-)definite Gram matrix. Kernels which are said to satisfy Mercer's theorem are positive semi-definite, meaning their kernel matrices have only non-negative eigenvalues. The use of a positive definite kernel ensures that the optimization problem will be convex and the solution will be unique.

However, many kernel functions which aren't strictly positive definite have also been shown to perform very well in practice. An example is the Sigmoid kernel, which, despite its wide use, is not positive semi-definite for certain values of its parameters. Boughorbel (2005) also experimentally demonstrated that kernels which are only conditionally positive definite can possibly outperform most classical kernels in some applications.
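
A rough way to probe this property numerically is to build the Gram matrix of a candidate kernel on some sample points and check that the quadratic form v^T K v never goes negative. The stand-alone C# sketch below is my own illustration using the Gaussian kernel and random vectors; it only gathers evidence of positive semi-definiteness and is in no way a proof of Mercer's condition:

using System;

// Gaussian kernel used as the example candidate.
static double Gaussian(double[] x, double[] y, double sigma)
{
    double d2 = 0;
    for (int k = 0; k < x.Length; k++) d2 += (x[k] - y[k]) * (x[k] - y[k]);
    return Math.Exp(-d2 / (2 * sigma * sigma));
}

var rng = new Random(0);
int n = 20, dim = 3;

// Random sample points.
var points = new double[n][];
for (int i = 0; i < n; i++)
{
    points[i] = new double[dim];
    for (int j = 0; j < dim; j++) points[i][j] = rng.NextDouble();
}

// Gram matrix K[i, j] = k(x_i, x_j).
var K = new double[n, n];
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        K[i, j] = Gaussian(points[i], points[j], sigma: 0.5);

// Necessary condition for positive semi-definiteness: v^T K v >= 0.
// Testing random v only gathers evidence; it is not a proof.
bool violation = false;
for (int trial = 0; trial < 1000 && !violation; trial++)
{
    var v = new double[n];
    for (int i = 0; i < n; i++) v[i] = rng.NextDouble() - 0.5;

    double quad = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            quad += v[i] * K[i, j] * v[j];

    violation = quad < -1e-9;
}
Console.WriteLine(violation ? "Not positive semi-definite" : "No violation found");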

Kernels can also be classified as anisotropic stationary, isotropic stationary, compactly supported, locally stationary, nonstationary or separable nonstationary. Moreover, kernels can be labeled scale-invariant or scale-dependent, which is an interesting property, as scale-invariant kernels make the training process invariant to a scaling of the data.

Choosing the Right Kernel

Choosing the most appropriate kernel highly depends on the problem at hand, and fine-tuning its parameters can easily become a tedious and cumbersome task. Automatic kernel selection is possible and is discussed in the works by Tom Howley and Michael Madden.

The choice of a Kernel depends on the problem at hand because it depends on what we are trying to model. A polynomial kernel, for example, allows us to model feature conjunctions up to the order of the polynomial. Radial basis functions allow us to pick out circles (or hyperspheres), in contrast with the Linear kernel, which allows only to pick out lines (or hyperplanes).

The motivation behind the choice of a particular kernel can be very intuitive and straightforward, depending on what kind of information we are expecting to extract about the data. Please see the final notes on this topic in Introduction to Information Retrieval, by Manning, Raghavan and Schütze, for a better explanation of the subject.

Kernel Functions

Below is a list of some kernel functions available in the existing literature. As was the case with previous articles, the LaTeX notation for each formula below is readily available from its alternate-text HTML tag. I cannot guarantee that all of them are perfectly correct, so use them at your own risk. Most of them have links to articles where they were originally used or proposed.

1. Linear Kernel

The Linear kernel is the simplest kernel function. It is given by the inner product <x,y> plus an optional constant c. Kernel algorithms using a linear kernel are often equivalent to their non-kernel counterparts, e.g. KPCA with a linear kernel is the same as standard PCA.

k(x, y) = x^T y + c

2. Polynomial Kernel

The Polynomial kernel is a non-stationary kernel. Polynomial kernels are well suited for problems where all the training data is normalized.

k(x, y) = (\alpha x^T y + c)^d
Adjustable parameters are the slope alpha, the constant term c and the polynomial degree d.
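
As a quick illustration, both kernels above can be written in a few lines of C#. This is just a stand-alone sketch of mine; the parameter defaults (alpha = 1, c = 1, d = 2) are arbitrary placeholders, not recommended values:

using System;

static double Dot(double[] x, double[] y)
{
    double s = 0;
    for (int i = 0; i < x.Length; i++) s += x[i] * y[i];
    return s;
}

// Linear kernel: k(x, y) = x^T y + c
static double LinearKernel(double[] x, double[] y, double c = 0) =>
    Dot(x, y) + c;

// Polynomial kernel: k(x, y) = (alpha x^T y + c)^d
static double PolynomialKernel(double[] x, double[] y,
    double alpha = 1, double c = 1, int d = 2) =>
    Math.Pow(alpha * Dot(x, y) + c, d);

var u = new[] { 1.0, 2.0, 3.0 };
var w = new[] { 0.5, -1.0, 2.0 };
Console.WriteLine(LinearKernel(u, w));      // 4.5
Console.WriteLine(PolynomialKernel(u, w));  // (4.5 + 1)^2 = 30.25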

3. Gaussian Kernel

The Gaussian kernel is an example of a radial basis function kernel.

k(x, y) = \exp\left(-\frac{\lVert x - y \rVert^2}{2\sigma^2}\right)

Alternatively, it could also be implemented using

k(x, y) = \exp\left(-\gamma \lVert x - y \rVert^2\right)

The adjustable parameter sigma plays a major role in the performance of the kernel and should be carefully tuned to the problem at hand. If overestimated, the exponential will behave almost linearly and the higher-dimensional projection will start to lose its non-linear power. On the other hand, if underestimated, the function will lack regularization and the decision boundary will be highly sensitive to noise in the training data.
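
A minimal C# sketch of both forms above follows. Note that the two parameterizations coincide when gamma = 1/(2 sigma^2); the sample vectors and the sigma value are arbitrary choices for illustration only:

using System;

static double SquaredDistance(double[] x, double[] y)
{
    double d2 = 0;
    for (int i = 0; i < x.Length; i++) d2 += (x[i] - y[i]) * (x[i] - y[i]);
    return d2;
}

// Gaussian kernel, sigma parameterization: exp(-||x - y||^2 / (2 sigma^2))
static double GaussianSigma(double[] x, double[] y, double sigma) =>
    Math.Exp(-SquaredDistance(x, y) / (2 * sigma * sigma));

// Gaussian kernel, gamma parameterization: exp(-gamma ||x - y||^2)
static double GaussianGamma(double[] x, double[] y, double gamma) =>
    Math.Exp(-gamma * SquaredDistance(x, y));

var u = new[] { 1.0, 0.0 };
var w = new[] { 0.0, 1.0 };
double s = 0.7;

// Identical values: the forms agree when gamma = 1 / (2 sigma^2).
Console.WriteLine(GaussianSigma(u, w, s));
Console.WriteLine(GaussianGamma(u, w, 1 / (2 * s * s)));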

4. Exponential Kernel

The exponential kernel is closely related to the Gaussian kernel, with only the square of the norm left out. It is also a radial basis function kernel.

k(x, y) = \exp\left(-\frac{\lVert x - y \rVert}{2\sigma^2}\right)

5. Laplacian Kernel

The Laplace kernel is completely equivalent to the exponential kernel, except for being less sensitive to changes in the sigma parameter. Being equivalent, it is also a radial basis function kernel.

k(x, y) = \exp\left(-\frac{\lVert x - y \rVert}{\sigma}\right)

It is important to note that the observations made about the sigma parameter for the Gaussian kernel also apply to the Exponential and Laplacian kernels.

6. ANOVA Kernel

The ANOVA kernel is also a radial basis function kernel, just as the Gaussian and Laplacian kernels. It is said to perform well in multidimensional regression problems (Hofmann, 2008).

k(x, y) = \sum_{k=1}^n \exp\left(-\sigma (x^k - y^k)^2\right)^d
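
Below is a small C# sketch that implements the sum literally as written, reading x^k as the k-th component of x. That reading, along with the sample values, is an assumption of this illustration; please consult the original reference for the exact convention:

using System;

// ANOVA kernel, taking x^k to mean the k-th component of x.
static double AnovaKernel(double[] x, double[] y, double sigma, int d)
{
    double sum = 0;
    for (int k = 0; k < x.Length; k++)
    {
        double diff = x[k] - y[k];
        sum += Math.Pow(Math.Exp(-sigma * diff * diff), d);
    }
    return sum;
}

Console.WriteLine(AnovaKernel(new[] { 1.0, 2.0 }, new[] { 1.5, 1.0 }, sigma: 1.0, d: 2));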

7. Hyperbolic Tangent (Sigmoid) Kernel

The Hyperbolic Tangent Kernel is also known as the Sigmoid Kernel and as the Multilayer Perceptron (MLP) kernel. The Sigmoid Kernel comes from the Neural Networks field, where the bipolar sigmoid function is often used as an activation function for artificial neurons.

k(x, y) = \tanh(\alpha x^T y + c)

It is interesting to note that an SVM model using a sigmoid kernel function is equivalent to a two-layer perceptron neural network. This kernel was quite popular for support vector machines due to its origin in neural network theory. Also, despite being only conditionally positive definite, it has been found to perform well in practice.

There are two adjustable parameters in the sigmoid kernel: the slope alpha and the intercept constant c. A common value for alpha is 1/N, where N is the data dimension. A more detailed study on sigmoid kernels can be found in the works by Hsuan-Tien Lin and Chih-Jen Lin.
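
A minimal C# sketch of the sigmoid kernel, defaulting alpha to 1/N as mentioned above (the default c = 0 and the sample vectors are placeholders of mine, not recommendations):

using System;

// Sigmoid (hyperbolic tangent) kernel: k(x, y) = tanh(alpha x^T y + c).
// When alpha is not given, it defaults to 1/N with N the input dimension.
static double SigmoidKernel(double[] x, double[] y, double? alpha = null, double c = 0)
{
    double a = alpha ?? 1.0 / x.Length;
    double dot = 0;
    for (int i = 0; i < x.Length; i++) dot += x[i] * y[i];
    return Math.Tanh(a * dot + c);
}

Console.WriteLine(SigmoidKernel(new[] { 1.0, 2.0, 3.0 }, new[] { 0.5, 1.0, -1.0 }));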

8. Rational Quadratic Kernel

The Rational Quadratic kernel is less computationally intensive than the Gaussian kernel and can be used as an alternative when using the Gaussian becomes too expensive.

k(x, y) = 1 - \frac{\lVert x - y \rVert^2}{\lVert x - y \rVert^2 + c}

9. Multiquadric Kernel

The Multiquadric kernel can be used in the same situations as the Rational Quadratic kernel. As is the case with the Sigmoid kernel, it is also an example of a non-positive definite kernel.

k(x, y) = \sqrt{\lVert x - y \rVert^2 + c^2}

10. Inverse Multiquadric Kernel

The Inverse Multiquadric kernel, as with the Gaussian kernel, results in a kernel matrix with full rank (Micchelli, 1986) and thus forms an infinite-dimensional feature space.

k(x, y) = \frac{1}{\sqrt{\lVert x - y \rVert^2 + \theta^2}}

11. Circular Kernel

The circular kernel is used in geostatistics applications. It is an example of an isotropic stationary kernel and is positive definite in R^2.

k(x, y) = \frac{2}{\pi} \arccos\left(-\frac{\lVert x - y \rVert}{\sigma}\right) - \frac{2}{\pi} \frac{\lVert x - y \rVert}{\sigma} \sqrt{1 - \left(\frac{\lVert x - y \rVert}{\sigma}\right)^2}
\mbox{if}~ \lVert x - y \rVert < \sigma \mbox{, zero otherwise}

12. Spherical Kernel

The spherical kernel is similar to the circular kernel, but is positive definite in R^3.

k(x, y) = 1 - \frac{3}{2} \frac{\lVert x - y \rVert}{\sigma} + \frac{1}{2} \left(\frac{\lVert x - y \rVert}{\sigma}\right)^3

\mbox{if}~ \lVert x - y \rVert < \sigma \mbox{, zero otherwise}

13. Wave Kernel

The Wave kernel is also symmetric positive semi-definite (Huang, 2008).

k(x, y) = \frac{\theta}{\lVert x - y \rVert} \sin\frac{\lVert x - y \rVert}{\theta}

14. Power Kernel

The Power kernel is also known as the (unrectified) triangular kernel. It is an example of a scale-invariant kernel (Sahbi and Fleuret, 2004) and is also only conditionally positive definite.

k(x, y) = -\lVert x - y \rVert^d

15. Log Kernel

The Log kernel seems to be particularly interesting for images, but is only conditionally positive definite.

k(x, y) = -\log\left(\lVert x - y \rVert^d + 1\right)

16. Spline Kernel

The Spline kernel is given as a piece-wise cubic polynomial, as derived in the works by Gunn (1998).

k(x, y) = 1 + xy + xy\min(x, y) - \frac{x + y}{2}\min(x, y)^2 + \frac{1}{3}\min(x, y)^3

However, what it actually means is:

k(x, y) = \prod_{i=1}^d \left(1 + x_i y_i + x_i y_i \min(x_i, y_i) - \frac{x_i + y_i}{2}\min(x_i, y_i)^2 + \frac{\min(x_i, y_i)^3}{3}\right)

with x, y \in \mathbb{R}^d
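
Implementing the product form directly is straightforward; the C# sketch below is only an illustration with arbitrary sample inputs:

using System;

// Spline kernel as a product of per-dimension cubic spline terms.
static double SplineKernel(double[] x, double[] y)
{
    double result = 1;
    for (int i = 0; i < x.Length; i++)
    {
        double xi = x[i], yi = y[i], m = Math.Min(xi, yi);
        result *= 1 + xi * yi + xi * yi * m
                    - (xi + yi) / 2 * m * m
                    + m * m * m / 3;
    }
    return result;
}

Console.WriteLine(SplineKernel(new[] { 0.2, 0.4 }, new[] { 0.3, 0.1 }));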

17. B-Spline (Radial Basis Function) Kernel

The B-Spline kernel is defined on the interval [−1, 1]. It is given by the recursive formula:

k(x,y) = B_{2p+1}(x-y)

\mbox{where~} p \in \mathbb{N} \mbox{~with~} B_{i+1} := B_i \otimes B_0.

In the work by Bart Hamers it is given by:

k(x, y) = \prod_{p=1}^d B_{2n+1}(x_p - y_p)

Alternatively, B_n can be computed using the explicit expression (Fomel, 2000):

B_n(x) = \frac{1}{n!} \sum_{k=0}^{n+1} \binom{n+1}{k} (-1)^k \left(x + \frac{n+1}{2} - k\right)^n_+

where x_+^d is defined as the truncated power function:

x^d_+ = \begin{cases} x^d, & \mbox{if } x > 0 \\ 0, & \mbox{otherwise} \end{cases}
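
Putting the pieces together, the following stand-alone C# sketch evaluates B_n from the explicit expression above and then forms the product kernel over the input dimensions. The sample call with n = 1 is arbitrary; it is meant only to show how the pieces fit:

using System;

// Truncated power function x_+^n.
static double TruncatedPower(double x, int n) => x > 0 ? Math.Pow(x, n) : 0;

static double Factorial(int n)
{
    double f = 1;
    for (int i = 2; i <= n; i++) f *= i;
    return f;
}

static double Binomial(int n, int k) =>
    Factorial(n) / (Factorial(k) * Factorial(n - k));

// B_n(x) from the explicit expression above.
static double BSpline(int n, double x)
{
    double sum = 0;
    for (int k = 0; k <= n + 1; k++)
        sum += Binomial(n + 1, k) * Math.Pow(-1, k)
             * TruncatedPower(x + (n + 1) / 2.0 - k, n);
    return sum / Factorial(n);
}

// B-Spline kernel: k(x, y) = prod_p B_{2n+1}(x_p - y_p).
static double BSplineKernel(double[] x, double[] y, int n)
{
    double prod = 1;
    for (int p = 0; p < x.Length; p++)
        prod *= BSpline(2 * n + 1, x[p] - y[p]);
    return prod;
}

Console.WriteLine(BSplineKernel(new[] { 0.1, 0.3 }, new[] { 0.2, 0.25 }, n: 1));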

18. Bessel Kernel

The Bessel kernel is well known in the theory of function spaces of fractional smoothness. It is given by:

k(x, y) = \frac{J_{v+1}(\sigma \lVert x - y \rVert)}{\lVert x - y \rVert^{-n(v+1)}}

where J is the Bessel function of the first kind. However, in the kernlab documentation for R, the Bessel kernel is said to be:

k(x, x') = -\mathrm{Bessel}_{(\nu+1)}^n\left(\sigma \lVert x - x' \rVert^2\right)

19. Cauchy Kernel

The Cauchy kernel comes from the Cauchy distribution (Basak, 2008). It is a long-tailed kernel and can be used to give long-range influence and sensitivity over the high-dimensional space.

k(x, y) = \frac{1}{1 + \frac{\lVert x - y \rVert^2}{\sigma^2}}

20. Chi-Square Kernel

The Chi-Square kernel comes from the Chi-Square distribution:

k(x, y) = 1 - \sum_{i=1}^n \frac{(x_i - y_i)^2}{\frac{1}{2}(x_i + y_i)}

However, as noted by commenter Alexis Mignon, this version of the kernel is only conditionally positive definite (CPD). A positive definite version of this kernel is given in (Vedaldi and Zisserman, 2011) as

k(x, y) = \sum_{i=1}^n \frac{2 x_i y_i}{x_i + y_i}

and is suitable for use with methods other than support vector machines.
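
The sketch below implements both variants side by side: the CPD form from the formula above and the positive definite additive form. It assumes histogram-like, non-negative inputs; the guard against zero bins and the sample histograms are my own additions for illustration:

using System;

// Chi-square kernel as given above (only conditionally positive definite).
// Assumes strictly positive, histogram-like inputs.
static double ChiSquareCpd(double[] x, double[] y)
{
    double sum = 0;
    for (int i = 0; i < x.Length; i++)
        sum += (x[i] - y[i]) * (x[i] - y[i]) / (0.5 * (x[i] + y[i]));
    return 1 - sum;
}

// Positive definite additive variant: k(x, y) = sum_i 2 x_i y_i / (x_i + y_i).
static double ChiSquarePd(double[] x, double[] y)
{
    double sum = 0;
    for (int i = 0; i < x.Length; i++)
        if (x[i] + y[i] > 0)
            sum += 2 * x[i] * y[i] / (x[i] + y[i]);
    return sum;
}

var h1 = new[] { 0.2, 0.5, 0.3 };
var h2 = new[] { 0.3, 0.3, 0.4 };
Console.WriteLine(ChiSquareCpd(h1, h2));
Console.WriteLine(ChiSquarePd(h1, h2));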

21. Histogram Intersection Kernel

The Histogram Intersection Kernel is also known as the Min Kernel and has been proven useful in image classification.

k(x, y) = \sum_{i=1}^n \min(x_i, y_i)

22. Generalized Histogram Intersection

The Generalized Histogram Intersection kernel is built based on the Histogram Intersection Kernel for image classification but applies in a much larger variety of contexts (Boughorbel, 2005). It is given by:

k(x, y) = \sum_{i=1}^m \min\left(|x_i|^\alpha, |y_i|^\beta\right)
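
Both intersection kernels reduce to a sum of element-wise minima, as the following C# sketch shows (the sample histograms and the alpha, beta values are arbitrary illustrative choices):

using System;

// Histogram intersection (min) kernel: k(x, y) = sum_i min(x_i, y_i).
static double HistogramIntersection(double[] x, double[] y)
{
    double sum = 0;
    for (int i = 0; i < x.Length; i++) sum += Math.Min(x[i], y[i]);
    return sum;
}

// Generalized version: k(x, y) = sum_i min(|x_i|^alpha, |y_i|^beta).
static double GeneralizedHistogramIntersection(double[] x, double[] y,
    double alpha, double beta)
{
    double sum = 0;
    for (int i = 0; i < x.Length; i++)
        sum += Math.Min(Math.Pow(Math.Abs(x[i]), alpha),
                        Math.Pow(Math.Abs(y[i]), beta));
    return sum;
}

var h1 = new[] { 0.1, 0.6, 0.3 };
var h2 = new[] { 0.2, 0.4, 0.4 };
Console.WriteLine(HistogramIntersection(h1, h2));  // 0.8
Console.WriteLine(GeneralizedHistogramIntersection(h1, h2, 0.5, 0.5));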

23. Generalized T-Student Kernel

The Generalized T-Student Kernel has been proven to be a Mercer kernel, thus having a positive semi-definite kernel matrix (Boughorbel, 2004). It is given by:

k(x, y) = \frac{1}{1 + \lVert x - y \rVert^d}

24. Bayesian Kernel

The Bayesian kernel could be given as:

k(x, y) = \prod_{l=1}^N \kappa_l(x_l, y_l)

where

\kappa_l(a, b) = \sum_{c \in \{0, 1\}} P(Y = c \mid X_l = a) \, P(Y = c \mid X_l = b)

However, it really depends on the problem being modeled. For more information, please see the work by Alashwal, Deris and Othman, in which they used an SVM with Bayesian kernels for the prediction of protein-protein interactions.

25. Wavelet Kernel

The Wavelet kernel (Zhang et al, 2004) comes from Wavelet theory and is given as:

k(x, y) = \prod_{i=1}^N h\left(\frac{x_i - c_i}{a}\right) \, h\left(\frac{y_i - c_i}{a}\right)

Where a and c are the wavelet dilation and translation coefficients, respectively (the form presented above is a simplification, please see the original paper for details). A translation-invariant version of this kernel can be given as:

k(x, y) = \prod_{i=1}^N h\left(\frac{x_i - y_i}{a}\right)

where in both cases h(x) denotes a mother wavelet function. In the paper by Li Zhang, Weida Zhou, and Licheng Jiao, the authors suggest a possible h(x) as:

h(x) = \cos(1.75x)\exp\left(-\frac{x^2}{2}\right)

which they also prove to be an admissible kernel function.
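
As an illustration, here is the translation-invariant form with the mother wavelet above written as a small stand-alone C# sketch (the dilation a = 1 and the sample vectors are arbitrary choices of mine):

using System;

// Mother wavelet suggested by Zhang et al.: h(t) = cos(1.75 t) exp(-t^2 / 2).
static double MotherWavelet(double t) =>
    Math.Cos(1.75 * t) * Math.Exp(-t * t / 2);

// Translation-invariant wavelet kernel: k(x, y) = prod_i h((x_i - y_i) / a).
static double WaveletKernel(double[] x, double[] y, double a)
{
    double prod = 1;
    for (int i = 0; i < x.Length; i++)
        prod *= MotherWavelet((x[i] - y[i]) / a);
    return prod;
}

Console.WriteLine(WaveletKernel(new[] { 1.0, 2.0 }, new[] { 1.5, 1.8 }, a: 1.0));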

 

Source Code

The latest version of the source code for almost all of the kernels listed above is available in the Accord.NET Framework. Some are also available in the sequel to this article, Kernel Support Vector Machines for Classification and Regression in C#. They are provided together with a comprehensive and simple implementation of SVMs (Support Vector Machines) in C#. However, for the latest sources, which may contain bug fixes and other enhancements, please download the most recent available version of Accord.NET.

See also

 

References

Citing this work

If you would like, please cite this work as: Souza, César R. “Kernel Functions for Machine Learning Applications.” 17 Mar. 2010. Web. <http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html>.

47 Comments

  1. Great article!

    I am just learning about perceptrons and kernels, and found this information very helpful.

    When will you be able to post some source code? I am trying to program a kernel perceptron right now, but am unable to figure out how to build the kernel function. I am having trouble connecting my example to the theory of the kernel function.

  2. César
    Thanks for nice article. I wanted to find the parameter range for each like sigma in Cauchy, Laplace etc do they vary in all the real number range? To optimize it.
    Also theta mentioned does it vary from 0 to 360?
    Thanks in advance
    Uday

  3. Hi Uday,

    In the paper Practical Selection of SVM Parameters and Noise Estimation for SVM Regression the authors have taken sigma values in the range (0.2~0.5)*range(x) for the Gaussian kernel, x being their input data. If the input data was normalized to be in the [0,1] range, then perhaps good choices for sigma would lie in the [0.2,0.5] range.

    But this is for the Gaussian kernel. I have linked the original sources for each kernel in the post; perhaps you may find additional information about those kernels in their original papers.

    The theta values in the Multiquadric kernels are just positive constants. Perhaps I should change the symbols to something else. For the Wave kernel, perhaps good values would be in the 0~2*pi range.

    In most cases, a grid search would be required to find the most suitable Kernel parameters. In this page you can find more directions for parameter tuning in SVMs.

    Regards,
    César

  4. Thanks César for the link and suggestions. For the Quadratic, Inverse Quadratic and Multiquadric kernels, the constant will be like shifting the axis? Do you think moving it in the real range 0 to RealMax would have an impact on learning?
    Uday

  5. I am not very sure of the exact interpretation of the constant parameter in those Kernels, but its value will probably affect learning. That is why a grid search or a pattern search is advisable in this case.

    Regards,
    César

  6. Just cleaning some old comments to avoid confusion. There was a problem with the ANOVA kernel and it has been corrected, thanks to Zhao. A preliminary version of the Wavelet kernel will also be available soon.

    Cheers!
    César

  7. Hi Anonymous,

    The RBF is not exactly a kernel but a family of kernels sharing the same common form: k(x,y) = exp(-gamma*||x-y||^2), with gamma being a positive constant. It is the same form depicted in the alternate form of the Gaussian kernel in this post. The most popular member of the RBF family of kernels is the Gaussian kernel, which can be derived from the general RBF form by taking gamma = 1/(sigma^2).

    Best regards,
    César

  8. u done great job. awesome. do you have source code for genetic kernel for support vector machine in java or Simple SMO in java

  9. Hi Abhinav,

    Thanks for noticing. I have updated the article with the correction.

    @Anonymous: No, unfortunately I do not have any code in Java. Perhaps you could try taking a look on Weka.

    Best Regards,
    César

  10. I have a question about the multiquadratic kernel. Usually a kernel value is larger, if the vectors are more similar to each other. But with the multiquadratic kernel, it is just the opposite. Can this work? Do you have to switch the classification value of the SVM?

  11. Hi Anonymous,

    Indeed, the Multiquadric kernel is not positive definite. Actually, it is a negative-definite kernel. I have included it for theoretical interest in presenting the Inverse Multiquadric kernel, but I have not applied it to any dataset to see how it performs.

    However, the interesting thing to note is that some non-positive (semi-)definite kernels have been used with relative success in practice, although it isn't so clear in which situations they work. In some cases, a special or modified version of the learning algorithm which can deal with non-PSD kernels is required.

    For further information, perhaps you can try taking a look on the work by Hsuan-Tien Lin and Chih-Jen Lin for some detailed discussion on the use of the (non-PSD) sigmoid kernel.

    All the best,
    César

  12. Hi César Souza,

    Thanks for this great article. Can you please upload the MATLAB codes for SVM training & classification with the Gaussian kernel for image datasets. I want to classify the images into different emotions.

  13. Thank you a lot for this contribution.
    I have two minor remarks, however.

    The expression for the Cauchy kernel should have the squared sigma at the denominator
    (It doesn’t change the shape of the kernel, just for correctness).

    In Basak (2008) the authors say:
    “Note that, it is not possible to use Cauchy and Gaussian+Exponential kernels in SVM since these kernels are not Mercer kernels.”

    So, apparently, Cauchy is not a Mercer kernel.

  14. This is really interesting, as even if it is not a Mercer’s kernel, the Cauchy often works well within a SVM. Thanks for sharing this information!

    Also, I’ve added the squared sigma in the formula. Thanks!

    Regards,
    Cesar

  15. Do you have any more information about the Log Kernel? I'm currently using it to segment color images and finding it to be very powerful, but I have only found one source (S. Boughorbel et al., 2005).

  16. Note that the chi2 kernel you propose is most probably not positive definite (PD) but only conditionally positive definite (CPD). While for a number of algorithms (including SVM) this has no consequence, since only the induced distance metric is important, for algorithms relying on the PD property (like KernelPCA) this may be problematic. A PD kernel inducing the chi2 distance metric is:

    k(x, y) = sum_i 2 x_i y_i / (x_i + y_i)

    For some explanation about the link between PD and CPD kernels see for instance:
    Hilbertian metrics and Positive definite kernels on probability measures. M. Hein and O. Bousquet. Technical Report, Max Planck Institute

    For an example of use of this chi2 PD kernel see:
    Efficient Additive Kernels via Explicit Feature Maps

  17. Hi César, wonderful article. I wonder (specifically for the Gaussian and related Kernels) what norm is being used? L2, L1 or is that denotation for vector length? Thanks!

  18. Nice post but I think you have a typo under Kernel Properties? you said “Kernels which are said to satisfy the Mercer’s theorem are positive semi-definite, meaning their kernel matrices have no non-negative Eigen values.”
    I think you mean …their kernel matrices have only non-negative eigenvalues..i.e. all eigenvalues >= 0, not “no non-negative eigenvalues” which would imply all negative eigenvalues.

  20. Are these kernels the same as the kernels used in image processing? In image processing, a Gaussian kernel is used for blurring an image.

    How to find kernel matrix from a kernel equation?
