Kernel methods in general have gained increased attention in recent years, partly due to the grown of popularity of the Support Vector Machines. Support Vector Machines are linear classifiers and regressors that, through the Kernel trick, operate in reproducing Kernel Hilbert spaces and are thus able to perform non-linear classification and regression in their input space.
- Download the classification sample application (with source code)
- Download the regression sample application (with source code)
- Browse the framework’s examples (includes linear and logistic machines)
Foreword
If you would like to use SVMs in your .NET applications, download the Accord.NET Framework through NuGet. Afterwards, creating support vector machines for binary and multi-class problems with a variety of kernels becomes very easy. The Accord.NET Framework is a LGPL framework which can be used freely in commercial, closed-source, open-source or free applications. This article explains a bit how the SVM algorithms and the overall SVM module was designed before being added as part of the framework.
Contents
Introduction
Support Vector Machines
Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. In simple words, given a set of training examples, each marked as belonging to one of two categories, a SVM training algorithm builds a model that predicts whether a new example falls into one category or the other. Intuitively, an SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
A linear support vector machine is composed of a set of given support vectors z and a set of weights w. The computation for the output of a given SVM with N support vectors z_{1}, z_{2}, … , z_{N} and weights w_{1}, w_{2}, … , w_{N} is then given by:
Kernel Support Vector Machines
The original optimal hyperplane algorithm proposed by Vladimir Vapnik in 1963 was a linear classifier. However, in 1992, Bernhard Boser, Isabelle Guyon and Vapnik suggested a way to create non-linear classifiers by applying the kernel trick (originally proposed by Aizerman et al.) to maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every dot product is replaced by a non-linear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be non-linear and the transformed space high dimensional; thus though the classifier is a hyperplane in the high-dimensional feature space, it may be non-linear in the original input space.
Using kernels, the original formulation for the SVM given SVM with support vectors z_{1}, z_{2}, … , z_{N} and weights w_{1}, _{w2}, … , w_{N} is now given by:
It is also very straightforward to see that, using a linear kernel of the form K(z,x) = <z,x> = z^{T}x, we recover the original formulation for the linear SVM.
The Kernel trick
The Kernel trick is a very interesting and powerful tool. It is powerful because it provides a bridge from linearity to non-linearity to any algorithm that solely depends on the dot product between two vectors. It comes from the fact that, if we first map our input data into a higher-dimensional space, a linear algorithm operating in this space will behave non-linearly in the original input space.
Now, the Kernel trick is really interesting because that mapping does not need to be ever computed. If our algorithm can be expressed only in terms of a inner product between two vectors, all we need is replace this inner product with the inner product from some other suitable space. That is where resides the “trick”: wherever a dot product is used, it is replaced with a Kernel function. The kernel function denotes an inner product in feature space and is usually denoted as:
K(x,y) = <φ(x),φ(y)>
Using the Kernel function, the algorithm can then be carried into a higher-dimension space without explicitly mapping the input points into this space. This is highly desirable, as sometimes our higher-dimensional feature space could even be infinite-dimensional and thus infeasible to compute.
Standard Kernels
Some common Kernel functions include the linear kernel, the polynomial kernel and the Gaussian kernel. Below is a simple list with their most interesting characteristics.
Linear Kernel | The Linear kernel is the simplest kernel function. It is given by the common inner product <x,y> plus an optional constant c. Kernel algorithms using a linear kernel are often equivalent to their non-kernel counterparts, i.e. KPCA with linear kernel is equivalent to standard PCA. | |
Polynomial Kernel | The Polynomial kernel is a non-stationary kernel. It is well suited for problems where all data is normalized. | |
Gaussian Kernel | The Gaussian kernel is by far one of the most versatile Kernels. It is a radial basis function kernel, and is the preferred Kernel when we don’t know much about the data we are trying to model. |
For more Kernel functions, check Kernel functions for Machine Learning Applications. The accompanying source code includes definitions for over 20 distinct Kernel functions, many of them detailed in the aforementioned post.
Learning Algorithms
Sequential Minimal Optimization
Previous SVM learning algorithms involved the use of quadratic programming solvers. Some of them used chunking to split the problem in smaller parts which could be solved more efficiently. Platt’s Sequential Minimal Optimization (SMO) algorithm puts chunking to the extreme by breaking the problem down into 2-dimensional sub-problems that can be solved analytically, eliminating the need for a numerical optimization algorithm.
The algorithm makes use of Lagrange multipliers to compute the optimization problem. Platt’s algorithm is composed of three main procedures or parts:
- run, which iterates over all points until convergence to a tolerance threshold;
- examineExample, which finds two points to jointly optimize;
- takeStep, which solves the 2-dimensional optimization problem analytically.
The algorithm is also governed by three extra parameters besides the Kernel function and the data points.
- The parameter Ccontrols the trade off between allowing some training errors and forcing rigid margins. Increasing the value of C increases the cost of misclassifications but may result in models that do not generalize well to points outside the training set.
- The parameter ε controls the width of the ε-insensitive zone, used to fit the training data. The value of ε can affect the number of support vectors used to construct the regression function. The bigger ε, the fewer support vectors are selected and the solution becomes more sparse. On the other hand, increasing the ε-value by too much will result in less accurate models.
- The parameter T is the convergence tolerance. It is the criterion for completing the training process.
After the algorithm ends, a new Support Vector Machine can be created using only the points whose Lagrange multipliers are higher than zero. The expected outputs y_{i} can be individually multiplied by their corresponding Lagrange multipliers a_{i} to form a single weight vector w.
Sequential Minimal Optimization for Regression
A version of SVM for regression was proposed in 1996 by Vladimir Vapnik, Harris Drucker, Chris Burges, Linda Kaufman and Alex Smola. The method was called support vector regression and, as is the case with the original Support Vector Machine formulation, depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction that is within a tolerance threshold ε.
Platt’s algorithm has also been modified for regression. Albeit still maintaining much of its original structure, the difference lies in the fact that the modified algorithm uses two Lagrange multipliers â_{i} and a_{i} for each input point i. After the algorithm ends, a new Support Vector Machine can be created using only points whose both Lagrange multipliers are higher than zero. The multipliers â_{i} and a_{i} are then subtracted to form a single weight vector w.
The algorithm is also governed by the same three parameters presented above. The parameter ε, however, receives a special meaning. It governs the size of the ε-insensitive tube over the regression line. The algorithm has been further developed and adapted by Alex J. Smola, Bernhard Schoelkopf and further optimizations were introduced by Shevade et al and Flake et al.
Source Code
Here is the class diagram for the Support Vector Machine module. We can see it is very simple in terms of standard class organization.
Class diagram for the (Kernel) Support Vector Machines module.
Support Vector Machine
Below is the class definition for the Linear Support Vector Machine. It is pretty much self explanatory.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 |
/// <summary> /// Sparse Linear Support Vector Machine (SVM) /// </summary> public class SupportVectorMachine { private int inputCount; private double[][] supportVectors; private double[] weights; private double threshold; /// <summary> /// Creates a new Support Vector Machine /// </summary> public SupportVectorMachine(int inputs) { this.inputCount = inputs; } /// <summary> /// Gets the number of inputs accepted by this SVM. /// </summary> public int Inputs { get { return inputCount; } } /// <summary> /// Gets or sets the collection of support vectors used by this machine. /// </summary> public double[][] SupportVectors { get { return supportVectors; } set { supportVectors = value; } } /// <summary> /// Gets or sets the collection of weights used by this machine. /// </summary> public double[] Weights { get { return weights; } set { weights = value; } } /// <summary> /// Gets or sets the threshold (bias) term for this machine. /// </summary> public double Threshold { get { return threshold; } set { threshold = value; } } /// <summary> /// Computes the given input to produce the corresponding output. /// </summary> /// <param name="input">An input vector.</param> /// <returns>The ouput for the given input.</returns> public virtual double Compute(double[] input) { double s = threshold; for (int i = 0; i < supportVectors.Length; i++) { double p = 0; for (int j = 0; j < input.Length; j++) p += supportVectors[i][j] * input[j]; s += weights[i] * p; } return s; } /// <summary> /// Computes the given inputs to produce the corresponding outputs. /// </summary> public double[] Compute(double[][] inputs) { double[] outputs = new double[inputs.Length]; for (int i = 0; i < inputs.Length; i++) outputs[i] = Compute(inputs[i]); return outputs; } } |
Kernel Support Vector Machine
Here is the class definition for the Kernel Support Vector Machine. It inherits from Support Vector Machine and extends it with a Kernel property. The Compute method is also overridden to include the chosen Kernel in the model computation.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 |
/// <summary> /// Sparse Kernel Support Vector Machine (kSVM) /// </summary> /// <remarks> /// <para> /// The original optimal hyperplane algorithm (SVM) proposed by Vladimir Vapnik in 1963 was a /// linear classifier. However, in 1992, Bernhard Boser, Isabelle Guyon and Vapnik suggested /// a way to create non-linear classifiers by applying the kernel trick (originally proposed /// by Aizerman et al.) to maximum-margin hyperplanes. The resulting algorithm is formally /// similar, except that every dot product is replaced by a non-linear kernel function.</para> /// <para> /// This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. /// The transformation may be non-linear and the transformed space high dimensional; thus though /// the classifier is a hyperplane in the high-dimensional feature space, it may be non-linear in /// the original input space.</para> /// <para> /// References: /// <list type="bullet"> /// <item><description><a href="http://en.wikipedia.org/wiki/Support_vector_machine"> /// http://en.wikipedia.org/wiki/Support_vector_machine</a></description></item> /// <item><description><a href="http://www.kernel-machines.org/"> /// http://www.kernel-machines.org/</a></description></item> /// </list></para> /// </remarks> /// /// <example> /// <code> /// // Example XOR problem /// double[][] inputs = /// { /// new double[] { 0, 0 }, // 0 xor 0: 1 (label +1) /// new double[] { 0, 1 }, // 0 xor 1: 0 (label -1) /// new double[] { 1, 0 }, // 1 xor 0: 0 (label -1) /// new double[] { 1, 1 } // 1 xor 1: 1 (label +1) /// }; /// /// // Dichotomy SVM outputs should be given as [-1;+1] /// int[] labels = /// { /// // 1, 0, 0, 1 /// 1, -1, -1, 1 /// }; /// /// // Create a Kernel Support Vector Machine for the given inputs /// KernelSupportVectorMachine machine = new KernelSupportVectorMachine(new Gaussian(0.1), inputs[0].Length); /// /// // Instantiate a new learning algorithm for SVMs /// SequentialMinimalOptimization smo = new SequentialMinimalOptimization(machine, inputs, labels); /// /// // Set up the learning algorithm /// smo.Complexity = 1.0; /// /// // Run the learning algorithm /// double error = smo.Run(); /// /// // Compute the decision output for one of the input vectors /// int decision = System.Math.Sign(svm.Compute(inputs[0])); /// </code> /// </example> /// [Serializable] public class KernelSupportVectorMachine : SupportVectorMachine { private IKernel kernel; /// <summary> /// Creates a new Kernel Support Vector Machine. /// </summary> /// /// <param name="kernel">The chosen kernel for the machine.</param> /// <param name="inputs">The number of inputs for the machine.</param> /// /// <remarks> /// If the number of inputs is zero, this means the machine /// accepts a indefinite number of inputs. This is often the /// case for kernel vector machines using a sequence kernel. /// </remarks> /// public KernelSupportVectorMachine(IKernel kernel, int inputs) : base(inputs) { if (kernel == null) throw new ArgumentNullException("kernel"); this.kernel = kernel; } /// <summary> /// Gets or sets the kernel used by this machine. /// </summary> /// public IKernel Kernel { get { return kernel; } set { kernel = value; } } /// <summary> /// Computes the given input to produce the corresponding output. /// </summary> /// /// <remarks> /// For a binary decision problem, the decision for the negative /// or positive class is typically computed by taking the sign of /// the machine's output. /// </remarks> /// /// <param name="inputs">An input vector.</param> /// <returns>The output for the given input.</returns> /// public override double Compute(double[] inputs) { double s = Threshold; for (int i = 0; i < SupportVectors.Length; i++) s += Weights[i] * kernel.Function(SupportVectors[i], inputs); return s; } } |
Sequential Minimal Optimization
Here is the code for the Sequential Minimal Optimization (SMO) algorithm.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 |
/// <summary> /// Sequential Minimal Optimization (SMO) Algorithm /// </summary> /// /// <remarks> /// <para> /// The SMO algorithm is an algorithm for solving large quadratic programming (QP) /// optimization problems, widely used for the training of support vector machines. /// First developed by John C. Platt in 1998, SMO breaks up large QP problems into /// a series of smallest possible QP problems, which are then solved analytically.</para> /// <para> /// This class follows the original algorithm by Platt as strictly as possible.</para> /// /// <para> /// References: /// <list type="bullet"> /// <item><description> /// <a href="http://en.wikipedia.org/wiki/Sequential_Minimal_Optimization"> /// Wikipedia, The Free Encyclopedia. Sequential Minimal Optimization. Available on: /// http://en.wikipedia.org/wiki/Sequential_Minimal_Optimization </a></description></item> /// <item><description> /// <a href="http://research.microsoft.com/en-us/um/people/jplatt/smoTR.pdf"> /// John C. Platt, Sequential Minimal Optimization: A Fast Algorithm for Training Support /// Vector Machines. 1998. Available on: http://research.microsoft.com/en-us/um/people/jplatt/smoTR.pdf </a></description></item> /// <item><description> /// <a href="http://www.idiom.com/~zilla/Work/Notes/svmtutorial.pdf"> /// J. P. Lewis. A Short SVM (Support Vector Machine) Tutorial. Available on: /// http://www.idiom.com/~zilla/Work/Notes/svmtutorial.pdf </a></description></item> /// </list></para> /// </remarks> /// /// <example> /// <code> /// // Example XOR problem /// double[][] inputs = /// { /// new double[] { 0, 0 }, // 0 xor 0: 1 (label +1) /// new double[] { 0, 1 }, // 0 xor 1: 0 (label -1) /// new double[] { 1, 0 }, // 1 xor 0: 0 (label -1) /// new double[] { 1, 1 } // 1 xor 1: 1 (label +1) /// }; /// /// // Dichotomy SVM outputs should be given as [-1;+1] /// int[] labels = /// { /// 1, -1, -1, 1 /// }; /// /// // Create a Kernel Support Vector Machine for the given inputs /// KernelSupportVectorMachine machine = new KernelSupportVectorMachine(new Gaussian(0.1), inputs[0].Length); /// /// // Instantiate a new learning algorithm for SVMs /// SequentialMinimalOptimization smo = new SequentialMinimalOptimization(machine, inputs, labels); /// /// // Set up the learning algorithm /// smo.Complexity = 1.0; /// /// // Run the learning algorithm /// double error = smo.Run(); /// /// // Compute the decision output for one of the input vectors /// int decision = System.Math.Sign(svm.Compute(inputs[0])); /// </code> /// </example> /// public class SequentialMinimalOptimization : ISupportVectorMachineLearning { private static Random random = new Random(); // Training data private double[][] inputs; private int[] outputs; // Learning algorithm parameters private double c = 1.0; private double tolerance = 1e-3; private double epsilon = 1e-3; private bool useComplexityHeuristic; // Support Vector Machine parameters private SupportVectorMachine machine; private IKernel kernel; private double[] alpha; private double bias; // Error cache to speed up computations private double[] errors; /// <summary> /// Initializes a new instance of a Sequential Minimal Optimization (SMO) algorithm. /// </summary> /// /// <param name="machine">A Support Vector Machine.</param> /// <param name="inputs">The input data points as row vectors.</param> /// <param name="outputs">The classification label for each data point in the range [-1;+1].</param> /// public SequentialMinimalOptimization(SupportVectorMachine machine, double[][] inputs, int[] outputs) { // Initial argument checking if (machine == null) throw new ArgumentNullException("machine"); if (inputs == null) throw new ArgumentNullException("inputs"); if (outputs == null) throw new ArgumentNullException("outputs"); if (inputs.Length != outputs.Length) throw new ArgumentException("The number of inputs and outputs does not match.", "outputs"); for (int i = 0; i < outputs.Length; i++) { if (outputs[i] != 1 && outputs[i] != -1) throw new ArgumentOutOfRangeException("outputs", "One of the labels in the output vector is neither +1 or -1."); } if (machine.Inputs > 0) { // This machine has a fixed input vector size for (int i = 0; i < inputs.Length; i++) if (inputs[i].Length != machine.Inputs) throw new ArgumentException("The size of the input vectors does not match the expected number of inputs of the machine"); } // Machine this.machine = machine; // Kernel (if applicable) KernelSupportVectorMachine ksvm = machine as KernelSupportVectorMachine; this.kernel = (ksvm != null) ? ksvm.Kernel : new Linear(); // Learning data this.inputs = inputs; this.outputs = outputs; } //--------------------------------------------- #region Properties /// <summary> /// Complexity (cost) parameter C. Increasing the value of C forces the creation /// of a more accurate model that may not generalize well. Default value is the /// number of examples divided by the trace of the kernel matrix. /// </summary> /// <remarks> /// The cost parameter C controls the trade off between allowing training /// errors and forcing rigid margins. It creates a soft margin that permits /// some misclassifications. Increasing the value of C increases the cost of /// misclassifying points and forces the creation of a more accurate model /// that may not generalize well. /// </remarks> public double Complexity { get { return this.c; } set { this.c = value; } } /// <summary> /// Gets or sets a value indicating whether the Complexity parameter C /// should be computed automatically by employing an heuristic rule. /// </summary> /// <value> /// <c>true</c> if complexity should be computed automatically; otherwise, <c>false</c>. /// </value> public bool UseComplexityHeuristic { get { return useComplexityHeuristic; } set { useComplexityHeuristic = value; } } /// <summary> /// Insensitivity zone ε. Increasing the value of ε can result in fewer support /// vectors in the created model. Default value is 1e-3. /// </summary> /// <remarks> /// Parameter ε controls the width of the ε-insensitive zone, used to fit the training /// data. The value of ε can affect the number of support vectors used to construct the /// regression function. The bigger ε, the fewer support vectors are selected. On the /// other hand, bigger ε-values results in more flat estimates. /// </remarks> public double Epsilon { get { return epsilon; } set { epsilon = value; } } /// <summary> /// Convergence tolerance. Default value is 1e-3. /// </summary> /// <remarks> /// The criterion for completing the model training process. The default is 0.001. /// </remarks> public double Tolerance { get { return this.tolerance; } set { this.tolerance = value; } } #endregion //--------------------------------------------- /// <summary> /// Runs the SMO algorithm. /// </summary> /// /// <param name="computeError"> /// True to compute error after the training /// process completes, false otherwise. Default is true. /// </param> /// /// <returns> /// The misclassification error rate of /// the resulting support vector machine. /// </returns> /// public double Run(bool computeError) { // The SMO algorithm chooses to solve the smallest possible optimization problem // at every step. At every step, SMO chooses two Lagrange multipliers to jointly // optimize, finds the optimal values for these multipliers, and updates the SVM // to reflect the new optimal values // // Reference: http://research.microsoft.com/en-us/um/people/jplatt/smoTR.pdf // Initialize variables int N = inputs.Length; if (useComplexityHeuristic) c = computeComplexity(); // Lagrange multipliers this.alpha = new double[N]; // Error cache this.errors = new double[N]; // Algorithm: int numChanged = 0; int examineAll = 1; while (numChanged > 0 || examineAll > 0) { numChanged = 0; if (examineAll > 0) { // loop I over all training examples for (int i = 0; i < N; i++) numChanged += examineExample(i); } else { // loop I over examples where alpha is not 0 and not C for (int i = 0; i < N; i++) if (alpha[i] != 0 && alpha[i] != c) numChanged += examineExample(i); } if (examineAll == 1) examineAll = 0; else if (numChanged == 0) examineAll = 1; } // Store Support Vectors in the SV Machine. Only vectors which have lagrange multipliers // greater than zero will be stored as only those are actually required during evaluation. List<int> indices = new List<int>(); for (int i = 0; i < N; i++) { // Only store vectors with multipliers > 0 if (alpha[i] > 0) indices.Add(i); } int vectors = indices.Count; machine.SupportVectors = new double[vectors][]; machine.Weights = new double[vectors]; for (int i = 0; i < vectors; i++) { int j = indices[i]; machine.SupportVectors[i] = inputs[j]; machine.Weights[i] = alpha[j] * outputs[j]; } machine.Threshold = -bias; // Compute error if required. return (computeError) ? ComputeError(inputs, outputs) : 0.0; } /// <summary> /// Runs the SMO algorithm. /// </summary> /// /// <returns> /// The misclassification error rate of /// the resulting support vector machine. /// </returns> /// public double Run() { return Run(true); } /// <summary> /// Computes the error rate for a given set of input and outputs. /// </summary> /// public double ComputeError(double[][] inputs, int[] expectedOutputs) { // Compute errors int count = 0; for (int i = 0; i < inputs.Length; i++) { if (Math.Sign(compute(inputs[i])) != Math.Sign(expectedOutputs[i])) count++; } // Return misclassification error ratio return (double)count / inputs.Length; } //--------------------------------------------- /// <summary> /// Chooses which multipliers to optimize using heuristics. /// </summary> /// private int examineExample(int i2) { double[] p2 = inputs[i2]; // Input point at index i2 double y2 = outputs[i2]; // Classification label for p2 double alph2 = alpha[i2]; // Lagrange multiplier for p2 // SVM output on p2 - y2. Check if it has already been computed double e2 = (alph2 > 0 && alph2 < c) ? errors[i2] : compute(p2) - y2; double r2 = y2 * e2; // Heuristic 01 (for the first multiplier choice): // - Testing for KKT conditions within the tolerance margin if (!(r2 < -tolerance && alph2 < c) && !(r2 > tolerance && alph2 > 0)) return 0; // Heuristic 02 (for the second multiplier choice): // - Once a first Lagrange multiplier is chosen, SMO chooses the second Lagrange multiplier to // maximize the size of the step taken during joint optimization. Now, evaluating the kernel // function is time consuming, so SMO approximates the step size by the absolute value of the // absolute error difference. int i1 = -1; double max = 0; for (int i = 0; i < inputs.Length; i++) { if (alpha[i] > 0 && alpha[i] < c) { double error1 = errors[i]; double aux = System.Math.Abs(e2 - error1); if (aux > max) { max = aux; i1 = i; } } } if (i1 >= 0 && takeStep(i1, i2)) return 1; // Heuristic 03: // - Under unusual circumstances, SMO cannot make positive progress using the second // choice heuristic above. If it is the case, then SMO starts iterating through the // non-bound examples, searching for an second example that can make positive progress. int start = random.Next(inputs.Length); for (i1 = start; i1 < inputs.Length; i1++) { if (alpha[i1] > 0 && alpha[i1] < c) if (takeStep(i1, i2)) return 1; } for (i1 = 0; i1 < start; i1++) { if (alpha[i1] > 0 && alpha[i1] < c) if (takeStep(i1, i2)) return 1; } // Heuristic 04: // - If none of the non-bound examples make positive progress, then SMO starts iterating // through the entire training set until an example is found that makes positive progress. // Both the iteration through the non-bound examples and the iteration through the entire // training set are started at random locations, in order not to bias SMO towards the // examples at the beginning of the training set. start = random.Next(inputs.Length); for (i1 = start; i1 < inputs.Length; i1++) { if (takeStep(i1, i2)) return 1; } for (i1 = 0; i1 < start; i1++) { if (takeStep(i1, i2)) return 1; } // In extremely degenerate circumstances, none of the examples will make an adequate second // example. When this happens, the first example is skipped and SMO continues with another // chosen first example. return 0; } /// <summary> /// Analytically solves the optimization problem for two Lagrange multipliers. /// </summary> /// private bool takeStep(int i1, int i2) { if (i1 == i2) return false; double[] p1 = inputs[i1]; // Input point at index i1 double alph1 = alpha[i1]; // Lagrange multiplier for p1 double y1 = outputs[i1]; // Classification label for p1 // SVM output on p1 - y1. Check if it has already been computed double e1 = (alph1 > 0 && alph1 < c) ? errors[i1] : compute(p1) - y1; double[] p2 = inputs[i2]; // Input point at index i2 double alph2 = alpha[i2]; // Lagrange multiplier for p2 double y2 = outputs[i2]; // Classification label for p2 // SVM output on p2 - y2. Check if it has already been computed double e2 = (alph2 > 0 && alph2 < c) ? errors[i2] : compute(p2) - y2; double s = y1 * y2; // Compute L and H according to equations (13) and (14) (Platt, 1998) double L, H; if (y1 != y2) { // If the target y1 does not equal the target (13) // y2, then the following bounds apply to a2: L = Math.Max(0, alph2 - alph1); H = Math.Min(c, c + alph2 - alph1); } else { // If the target y1 does equal the target (14) // y2, then the following bounds apply to a2: L = Math.Max(0, alph2 + alph1 - c); H = Math.Min(c, alph2 + alph1); } if (L == H) return false; double k11, k22, k12, eta; k11 = kernel.Function(p1, p1); k12 = kernel.Function(p1, p2); k22 = kernel.Function(p2, p2); eta = k11 + k22 - 2.0 * k12; double a1, a2; if (eta > 0) { a2 = alph2 - y2 * (e2 - e1) / eta; if (a2 < L) a2 = L; else if (a2 > H) a2 = H; } else { // Compute objective function Lobj and Hobj at // a2=L and a2=H respectively, using (eq. 19) double L1 = alph1 + s * (alph2 - L); double H1 = alph1 + s * (alph2 - H); double f1 = y1 * (e1 + bias) - alph1 * k11 - s * alph2 * k12; double f2 = y2 * (e2 + bias) - alph2 * k22 - s * alph1 * k12; double Lobj = -0.5 * L1 * L1 * k11 - 0.5 * L * L * k22 - s * L * L1 * k12 - L1 * f1 - L * f2; double Hobj = -0.5 * H1 * H1 * k11 - 0.5 * H * H * k22 - s * H * H1 * k12 - H1 * f1 - H * f2; if (Lobj > Hobj + epsilon) a2 = L; else if (Lobj < Hobj - epsilon) a2 = H; else a2 = alph2; } if (Math.Abs(a2 - alph2) < epsilon * (a2 + alph2 + epsilon)) return false; a1 = alph1 + s * (alph2 - a2); if (a1 < 0) { a2 += s * a1; a1 = 0; } else if (a1 > c) { double d = a1 - c; a2 += s * d; a1 = c; } // Update threshold (bias) to reflect change in Lagrange multipliers double b1 = 0, b2 = 0; double new_b = 0, delta_b; if (a1 > 0 && a1 < c) { // a1 is not at bounds new_b = e1 + y1 * (a1 - alph1) * k11 + y2 * (a2 - alph2) * k12 + bias; } else { if (a2 > 0 && a2 < c) { // a1 is at bounds but a2 is not. new_b = e2 + y1 * (a1 - alph1) * k12 + y2 * (a2 - alph2) * k22 + bias; } else { // Both new Lagrange multipliers are at bound. SMO algorithm // chooses the threshold to be halfway in between b1 and b2. b1 = e1 + y1 * (a1 - alph1) * k11 + y2 * (a2 - alph2) * k12 + bias; b2 = e2 + y1 * (a1 - alph1) * k12 + y2 * (a2 - alph2) * k22 + bias; new_b = (b1 + b2) / 2; } } delta_b = new_b - bias; bias = new_b; // Update error cache using new Lagrange multipliers double t1 = y1 * (a1 - alph1); double t2 = y2 * (a2 - alph2); for (int i = 0; i < inputs.Length; i++) { if (0 < alpha[i] && alpha[i] < c) { double[] point = inputs[i]; errors[i] += t1 * kernel.Function(p1, point) + t2 * kernel.Function(p2, point) - delta_b; } } errors[i1] = 0f; errors[i2] = 0f; // Update lagrange multipliers alpha[i1] = a1; alpha[i2] = a2; return true; } /// <summary> /// Computes the SVM output for a given point. /// </summary> /// private double compute(double[] point) { double sum = -bias; for (int i = 0; i < inputs.Length; i++) { if (alpha[i] > 0) sum += alpha[i] * outputs[i] * kernel.Function(inputs[i], point); } return sum; } private double computeComplexity() { // Compute initial value for C as the number of examples // divided by the trace of the input sample kernel matrix. double sum = 0.0; for (int i = 0; i < inputs.Length; i++) sum += kernel.Function(inputs[i], inputs[i]); return inputs.Length / sum; } } |
Using the code
In the following example, we will be training a Polynomial Kernel Support Vector Machine to recognize the XOR classification problem. The XOR function is classic example of a pattern classification problem that is not linearly separable.
1 2 3 4 5 6 7 8 9 |
double[][] inputs = { new double[] { -1, -1}, new double[] { -1, 1}, new double[] { 1, -1}, new double[] { 1, 1} }; int[] xor = { -1, 1, 1, -1 }; |
Here, remember that the SVM is a margin classifier that classifies instances as either 1 or –1. So the training and expected output for the classification task should also be in this range. There are no such requirements for the inputs, though.
To create the Kernel Support Vector Machine with a Polynomial Kernel, do:
1 2 3 |
// Create Kernel Support Vector Machine with a Polynomial Kernel of 2nd degree KernelSupportVectorMachine machine = new KernelSupportVectorMachine( new Polynomial(2), inputs.Length); |
After the machine has been created, create a new Learning algorithm. As we are going to do classification, we will be using the standard SequentialMinimalOptimization algorithm.
1 2 3 4 5 |
// Create the sequential minimal optimization teacher var learn = new SequentialMinimalOptimization(machine, inputs, xor); // Run the learning algorithm double error = learn.Run(); |
After the model has been trained, we can compute its outputs for the given inputs.
1 |
double[] output = machine.Compute(inputs); |
The machine should be able to correctly identify all of the input instances.
Sample application
The sample application is able to perform both Classification and Regression using Support Vector Machines. It can read Excel spreadsheets and determines the task to be performed depending on the number of the columns in the sheet. If the input table contains two columns (e.g. X and Y) it will be interpreted as a regression problem X –> Y. If the input table contains three columns (e.g. x1, x2 and Y) it will be interpreted as a classification problem <x1,x2> belongs to class Y, Y being either 1 or -1.
Classification
To perform classification, load a classification task data such as the Yin Yang classification problem.
Yin Yang classification problem. The goal is to create a model which best determines whether a given point belongs to class blue or green. It is a clear example of a non-linearly separable problem.
Creation of a Gaussian Kernel Support Vector Machine with σ = 1.2236, C = 1.0, ε = 0.001 and T = 0.001.
Classification using the created Support Vector Machine. Notice it achieves an accuracy of 97%, with sensitivity and specifity rates of 98% and 96%, respectively.
Regression
To perform regression, we can load the Gaussian noise sine wave example.
Noise sine wave regression problem.
Creation of a Gaussian Kernel Support Vector Machine with σ = 1.2236, C = 1.0, ε = 0.2 and T = 0.001.
After the model has been created, we can plot the model approximation for the sine wave data. The blue line shows the curve approximation for the original red training dots.
Regression using the created Kernel Support Vector Machine. Notice the coefficient of determination r² of 0.95. The closer to one, the better.
See also
- Kernel Functions for Machine Learning Applications
- Principal Component Analysis (PCA)
- Kernel Principal Component Analysis (KPCA)
- Linear Discriminant Analysis (LDA)
- Non-Linear Discriminant Analysis with Kernels
- Partial Least Squares Analysis and Regression
- Logistic Regression Analysis in C#
References
- Wikipedia contributors, “Sequential Minimal Optimization,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Sequential_Minimal_Optimization (accessed April 24, 2010).
- Wikipedia contributors, “Support Vector Machine,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Support_vector_machine (accessed April 24, 2010).
- John C. Platt, Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, Microsoft Research, 1998.
- J. P. Lewis, A Short SVM (Support Vector Machine) Tutorial. CGIT Lab / IMSC, University of Southern California.
- A. J. Smola and B. Schoelkopf. A Tutorial on Support Vector Regression. NeuroCOLT2 Technical Report Series, 1998.
- S.K. Shevade et al. Improvements to SMO Algorithm for SVM Regression, 1999.
- G. W. Flake, S. Lawrence. Efficient SVM Regression Training with SMO.
This is amazing! You are amazing!!
Thanks, but I am not really this amazing 😛
I am glad you found what you was searching for – if you have any questions, please ask 🙂
Cheers,
César
Nice tutorial, very good for the development of science and technology. My thesis is now in progress. I took the theme of power forecasting system using support vector regression whether this accord can be used for forecasting process. please help and advice. thank you
Hi Tri,
It probably can. You may need to use a fixed length window to try to predict the next samples from your data sequence. A similar technique is used in the Times Series Prediction sample application of the AForge.NET Framework. Perhaps you could adapt it from using Neural Networks to use Support Vector Regression.
Best regards,
César
thank you in advance on Mr Cesar already giving advice. I had already downloaded the file to me and I have tried but the problem is my lack of understanding about the concept of neural network so that I find difficult to adapt to the support vector regression. actually almost same concept with the Times Series Prediction sample application of the AForge.NET Framework.
really helpful tutorial!
but i having problem for dataset more than 2 attribute.
such as
http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data
it only classifies 13% correctly. I have tested all the kernels.
Hi Anik,
I see that the output labels in this dataset are presented as 1 and 2. Are you converting those to -1 and 1 before feeding the learning algorithm? This is a necessary precondition for the learning process.
Also you could try to normalize those values to have zero mean and unit variance. After the data is normalized, try a Gaussian kernel with varying sigma values. You could first try with sigma values varying from 0.1 to 1.0 in 0.1 increments, then from 1 to 10 in 1.0 increments. If you are using the Accord.NET Framework, you can use the GridSearch class to help in the parameter tuning.
Regards,
César
Great Tutorial.
But i having problem for dataset more than 3 attribute as time series data to complete the process of regression. example dataset that has 4 attribute like
x1, x2, x3, y where y depends on the attribute value x.
Hello,
I will take a look on the multiple inputs problems described here. Can you provide more details on which kind of problems are you having? Also, are you normalizing the inputs before feeding the learning algorithm?
Best regards,
César
I will predict the value y by using SVM. The following data that I use for the prediction process.
x1 x2 x3 y
2 3 4 29
4 2 2 74
3 4 2 49
4 2 5 83
2 5 6 51
4 3 2 79
3 2 1 34
thank help me.
whats the class for it?
how to use SMO with c # application that you created above for more than 2-dimensional problems features?, please explain with examples ..
Thank you very much..
Hello,
@albert:
The following code snippet can be used to train a regression SVM using your data example:
// Input data
double[][] inputs =
{
new double[] { 2, 3, 4 },
new double[] { 4, 2, 2 },
new double[] { 3, 4, 2 },
new double[] { 4, 2, 5 },
new double[] { 2, 5, 6 },
new double[] { 4, 3, 2 },
new double[] { 3, 2, 1 },
};
// Output data
double[] outputs = { 29, 74, 49, 83, 51, 79, 34 };
// Create a new machine using a Polynomial kernel of degree 2
var machine = new KernelSupportVectorMachine(new Polynomial(2), 3);
// Create a new SMO learning algorithm with the given parameters
var smo = new SequentialMinimalOptimizationRegression(machine, inputs, outputs);
smo.Complexity = 1.0;
smo.Epsilon = 1e-2;
smo.Tolerance = 1e-2;
// Train the machine
double error = smo.Run();
// Retrieve the regression outputs
double[] y = machine.Compute(inputs);
// The outputs will be very close to the original:
// y = 29.01, 73.98, 49.00, 100.41, 50.98, 79.01, 34.00
@agung:
The sample application is just a sample application. It is not intended to be used as a full feature application for SVMs. In order to make full use of the source code available here, you may want to implement your own application using the Accord.NET Framework, which does include all the functionality presented here.
Best regards,
César
can i classify wisconsin breast canser data set with this application?? and i can’t dowload the source code..
Thank you very much..
thank you for your answer
Want to ask again how the implementation of the SVM multiclass and how to find the kernel PCA?
Thank you very much
@Anonymous:
With the sample application, probably no. But with the source code, yes. Something happened with the download servers, which are currently down. I have uploaded the source code and the sample application to the Accord.NET project page instead.
@agung:
Please download the full Accord.NET framework. It has support for both Kernel PCA and Multi-class SVMs. There are some examples on the documentation on how to use them. If the online documentation isn’t available, you can also check the help file which comes together with the framework.
Best regards,
César
Thanx for your attention.
I have been downloaded Accord.NET at your link. I want to use SVM and KPCA method for my C# application. But I can’t use Accord.NET. Please tell me how to use Accord.NET?
Hello,
Instructions on how to install and use Accord.NET are available in the Accord.NET’s Installation and Use Guide, which can be found in http://accord-net.origo.ethz.ch/wiki/accord_net_installation_and_usage_guide.
There is also the documentation help file which comes together with the framework’s installer. It should be available in the Start Menu after a successful installation.
Best regards,
César
Hello,
I want to know how can we find the distance of each sample from the seperating margin. I would also like to know how can in incorporate different formulation of SVM in the code.
Thanks
Thanx before,
I have 200 entries feature the results of the kernel principal component analysis with gamma = 500, as follows:
Example on one of the data:
Raw data:
(46 53 54 73 107 65 77 56 46 89 65 24 38 54 55 71 79 102 92 82 89 70 60 38 54 55 71 79 102 92 82 89 70 60 42 47 49 57 97 97 124 128 110 87 79 52 47 49 57 97 97 124 128 110 87 79 52 36 54 49 89 136 154 161 179 158 116 64 66 54 49 89 136 154 161 179 158 116 64 66 33 49 61 139 154 161 189 184 184 135 110 43 49 61 139 154 161 189 184 184 135 110 43 38 55 71 157 158 158 184 166 176 177 119 55 55 71 157 158 158 184 166 176 177 119 55 46 55 80 156 166 184 182 177 179 162 155 50 55 80 156 166 184 182 177 179 162 155 50 36 34 92 168 173 179 183 178 179 175 151 50 34 92 168 173 179 183 178 179 175 151 50 38 43 111 168 180 164 206 170 180 171 140 48 43 111 168 180 164 206 170 180 171 140 48 54 51 113 171 173 187 186 181 168 182 158 41 51 113 171 173 187 186 181 168 182 158 41 41 32 113 161 170 183 173 179 148 154 151 35 32 113 161 170 183 173 179 148 154 151 35 43 49 104 159 141 148 183 164 133 145 151 47 49 104 159 141 148 183 164 133 145 151 47 47 127 112 165 111 124 190 153 130 137 153 93 127 112 165 111 124 190 153 1)
KPCA feature (with 10 principal component):
(-0.0168221382595615 -0.0977834005677016 -0.0422072220804251 0.106677188631142 -0.00534655068037047 0.0921886970828544 0.0742572332258422 -0.0256983526385723 0.136221182959765 -0.0211490956645272)
Then my training KPCA feature using SMO with kernel = 1.2236
and the accuracy is only 53% only. How to increase the accuracy of the data features?
How can i use Kernel support vector machines if my data class is unbalanced. i.e. for a two class problem , i have large training examples for one class while very few training sample for the second class. The trained svm is biased to the class with large training sample. How can i correct this thing
Thanks, great work Cesar.
It would be great if the application supports RBF kernel also.
Thanks
how to calculate and display the number of iterations of SVM for regression
Hi Tri,
To calculate the number of steps used during learning, you may add a counter variable in the SequentialMinimalOptimization/Regression class and perform an increment every time the takeStep() method is called.
Best regards,
César
Hi,
I have downloaded the source code and tried to classify my dataset. My dataset contains more than 10,000 features and has only two classes. But, I couldn’t be sure about what is the input format for it.
Also, please help me if my my problem can be solved by this or not.
Thanks in advance.
Binod
Hi Binod,
Support vector machines are well suited for datasets containing a high number of features but a manageable number of samples. Yes, I would suppose your problem could be solved using a SVM.
The application demonstrated here is just a sample application. It is not designed to work with custom datasets. You can roll your own application using the Accord.NET Framework and use the sample application as a starting point. The getting started guide has a nice example on how to create a new C# project for SVM training using the framework.
Best regards,
César
Thank you Cesar.
My features have binary values i.e. either 1 or -1. And my ratio of two classes is more than 1:500.
This is making great problem in my case. The precision is very less.
Please let me know, if you have worked using SVM in this type of situation and how well SVM works for this type of dataset.
Thanking you.
Binod
Hi Binod,
For such cases, there a small modification which can be done in the Sequential Minimal Optimization to handle unbalanced classes. Unfortunately, the implementation of the algorithm presented on this page or in the Accord.NET Framework does not implement this feature at this time.
If I can remember correctly, it basically consists on using different cost (C) values for each of the classes. There is a brief section in this guide (section 7) dealing with the subject. I hope it helps.
Best regards,
César
Hi Souza,
Normally after training we get the weights. How to get these weights in the program and use it for computation of outputs(prediction)?
Hi bvis,
If you can, please download the latest version of the Accord.NET Framework. A starting guide is available here and shows how to create a simple application using support vector machines.
Best regards,
César
Dear César,
Can you give me the link to download SVM source code for classification…. Your help is highly appreciate. Thank you
Dear César,
I’m working with SVR. So can you send me SVM for regression applications to I reference. Sorry for my english is not good.
Thank you very much.
Hoang Tuyet
Hello Cesar,
Great work! The link you provide like accord-net.origo.ethz.ch/wiki/getting_started cannot be open using firefox and IE. Pls guide me to open this link
Hi Hurstian,
The project has moved to http://accord.googlecode.com. The getting started guide is now available in the project’s documentation center. I hope it helps!
Best regards,
Cesar
Thanks Cesar, I manage to find it. Thumbs up !
Hi, this excellent. Cesar u gave the most straight forward explanation for an dummy like me. Can you please tell me what is IKernel interface and where it can be found?
Hi Camara, thanks for the positive feedback!
The IKernel interface is available in the Accord.NET Framework. It is a common interface implemented by all kernel functions available in the framework, as shown here. If you would like to use those functions in your projects, you could follow the getting started guide here: http://accord-framework.net/get-started.html.
Hope it helps!
Best regards,
Cesar
Hey i need to solve a 5 dimensional training data. im using Excel sheet for it. how i can go for Matlab. Can u help me out?
I want to find the lagrangian multiplier for this data
Data | x1 x2 y
————————————-
1 | 1 6 +1
2 | 1 10 +1
3 | 4 11 +1
4 | 5 2 -1
5 | 7 6 -1
6 | 10 4 -1
Can someone help me find the alpha for all the data? (Using SMO)
Please explain each step in simple mathematical tutorial.
Hello,
Thanks Cesar for Accord, it helps me a lot.
can you provide a simple example on how to use RANSAC with SVM for inputs varaibles selection ?
Best Regards,
David Alexander
This comment has been removed by the author.
Hello:
I am a student in computer engineering, I’ve built a SVM network and I want to recognize the faces of people, I used SMO algorithm to obtain alpha i value, I get features by using of Legendre Polynomial of order 4.
When I train the model of SVM the error rate was very high ,
my question what is the best kernel function that fitting this
Work and what are the best parameters.
Best Regards
Hi Fatema,
One of the best kernels is undoubtedly the Gaussian kernel. However, it is necessary to configure its Sigma parameter accordingly in order to obtain good results. One good way to calibrate it is by using the framework’s Gaussian.Estimate method, that can automatically suggest a sigma value for your Gaussian kernel by analyzing your data.
Hope it helps!
Best regards,
Cesar
This comment has been removed by the author.
This comment has been removed by the author.
Hi César Souza
thanks very much for your reply, I will try then I will ask you again
best regards
Can we find new class in in svm ?
Hi César Souza,
thanks for your tutorial!
I tried your code snippets which you have explained above. Unfortunately I am running into some troubles by applying your training data (and other). My results are always constant across all data. The SMO algorithm does not change any alpha (it is always equal to zero)…. Can you give any hint what I am doing wrong?
Thanks in advance and regards