Partial Least Squares Analysis and Regression in C#


Partial Least Squares Regression (PLS) is a technique that generalizes and combines features from principal component analysis and (multivariate) multiple regression. It has been widely adopted in chemometrics and the social sciences.

The code presented here is also part of the Accord.NET Framework, a framework for developing machine learning, computer vision, computer audition, statistics and math applications. Please see the starting guide for more details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.

Contents

  1. Introduction
  2. Overview
    1. Multivariate Linear Regression in Latent Space
    2. Algorithm 1: NIPALS
    3. Algorithm 2: SIMPLS
  3. Source Code
    1. Class Diagram
    2. Performing PLS using NIPALS
    3. Performing PLS using SIMPLS
    4. Multivariate Linear Regression
  4. Using the code
  5. Sample application
  6. See also
  7. References

Introduction

Partial least squares regression (PLS-regression) is a statistical method that is related to principal components regression. The goal of this method is to find a linear regression model by projecting both the predicted variables and the observable variables to new, latent variable spaces. It was developed in the 1960s by Herman Wold to be used in econometrics. However, today it is most commonly used for regression in the field of chemometrics.

In statistics, latent variables (as opposed to observable variables) are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models.

A PLS model will try to find the multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space. PLS-regression is particularly suited when the matrix of predictors has more variables than observations, and when there is multicollinearity among the X values. It is interesting to note that standard linear regression would likely fail to produce meaningful, interpretable models in those cases.

Overview

Multivariate Linear Regression in Latent Space

Multiple Linear Regression is a generalization of simple linear regression for multiple inputs. In turn, Multivariate Linear Regression is a generalization of Multiple Linear Regression for multiple outputs. Multivariate linear regression is a general linear regression model which can map an arbitrary dimension space into another arbitrary dimension space using only linear relationships. In the context of PLS, it is used to map the latent variable space for the inputs X into the latent variable space for the output variables Y. Those latent variable spaces are spanned by the loading matrices for X and Y, commonly denoted P and Q, respectively.

The goal of PLS algorithms is therefore to find those two matrices. There are mainly two algorithms for doing this: NIPALS and SIMPLS.
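In matrix form, the relationships assembled by PLS can be summarized as below, where E and F denote residual terms. This is standard PLS notation, restated here to tie together the definitions used by the two algorithms that follow:

```latex
X = T P^{\top} + E, \qquad Y = U Q^{\top} + F, \qquad U \approx T B,
\quad \text{so that} \quad \hat{Y} \approx T B Q^{\top}.
```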

Algorithm 1: NIPALS

Here is an exposition of the NIPALS algorithm for finding the loading matrices required for PLS regression. There are, however, many variations of this algorithm which normalize or do not normalize certain vectors.

Algorithm:

  • Let X be the mean-centered input matrix,
  • Let Y be the mean-centered output matrix,
  • Let P be the loadings matrix for X, and let pi denote the i-th column of P;
  • Let Q be the loadings matrix for Y, and let qi denote the i-th column of Q;
  • Let T be the score matrix for X, and ti denote the i-th column of T;
  • Let U be the score matrix for Y, and ui denote the i-th column of U;
  • Let W be the PLS weight matrix, and wi denote the i-th column of W; and
  • Let B be a diagonal matrix with the regression coefficients bi on its diagonal.

Then:

  1. For each factor i to be calculated:
    1. Initially choose ui as the column of Y having the largest sum of squares
    2. While (ti has not converged to a desired precision)
      1. wi ∝ X’ui (estimate X weights)
      2. ti ∝ Xwi (estimate X factor scores)
      3. qi ∝ Y’ti (estimate Y weights)
      4. ui = Yqi (estimate Y scores)
    3. bi = ti’ui (compute the prediction coefficient bi)
    4. pi = X’ti (estimate X factor loadings)
    5. X = X – tipi’ (deflate X)

In other statistical analyses, such as PCA, it is often interesting to inspect how much of the variance can be explained by each of the principal component dimensions. The same can also be accomplished for PLS, both for the input (predictor) variables X and the output (response) variables Y. For the input variables, the amount of variance explained by each factor can be computed as the sum of the squared elements of the corresponding column of P, i.e. as Sum(pi²). For the output variables, it can be computed as bi².
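To make those formulas concrete, here is a small C# sketch. It assumes X and Y hold the mean-centered data (before deflation) and that the factor count, the loadings P and the coefficients B come from a NIPALS run such as the sketch given in the Source Code section below; all names are illustrative:

```csharp
// Proportion of variance explained per factor, following the formulas above.
double ssX = 0, ssY = 0; // total sums of squares of the centered X and Y
foreach (double[] row in X) foreach (double v in row) ssX += v * v;
foreach (double[] row in Y) foreach (double v in row) ssY += v * v;

for (int i = 0; i < factors; i++)
{
    double explainedX = 0; // Sum(pi²) over the i-th column of P
    for (int r = 0; r < P.Length; r++) explainedX += P[r][i] * P[r][i];

    Console.WriteLine($"Factor {i}: X {explainedX / ssX:P2}, Y {B[i] * B[i] / ssY:P2}");
}
```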

Algorithm 2: SIMPLS

SIMPLS is an alternative algorithm for finding the PLS matrices P and Q, derived by considering the true objective of maximizing the covariance between the latent factors and the output vectors. NIPALS and SIMPLS are equivalent when there is just one output variable Y to be regressed, but their answers can differ when Y is multidimensional. Moreover, because the construction of the weight vectors used by SIMPLS is based on the empirical variance–covariance matrix of the joint input and output variables, outliers present in the data can adversely impact its performance.

Algorithm:

  • Let X be the mean-centered input matrix,
  • Let Y be the mean-centered output matrix,
  • Let P be the loadings matrix for X, and let pi denote the i-th column of P;
  • Let C be the loadings matrix for Y, and let ci denote the i-th column of C;
  • Let T be the score matrix for X, and ti denote the i-th column of T;
  • Let U be the score matrix for Y, and ui denote the i-th column of U; and
  • Let W be the PLS weight matrix, and wi denote the i-th column of W.

Then:

  1. Create the covariance matrix S = X’Y
  2. For each factor i to be calculated:
    1. Perform SVD on the covariance matrix S; store the first left singular vector in wi, and the first right singular vector scaled by the corresponding singular value in ci.
    2. ti ∝ X*wi            (estimate X factor scores)
    3. pi = X’*ti           (estimate X factor loadings)
    4. ci = ci/norm(ti)     (estimate Y weights)
    5. wi = wi/norm(ti)     (estimate X weights)
    6. ui = Y*ci            (estimate Y scores)
    7. vi = pi              (form the basis vector vi)
    8. Make vi orthogonal to the previous loadings V
    9. Make ui orthogonal to the previous scores T
    10. Deflate the covariance matrix S
      1. S = S – vi*(vi‘*S)


Source Code

This section presents the implementation of the NIPALS and SIMPLS algorithms in C#. The models have been implemented with an object-oriented structure that is particularly suitable for data binding to Windows Forms (or WPF) controls.

Class Diagram


Class diagram for the Partial Least Squares Analysis.

Performing PLS using NIPALS

Here is the source code for computing PLS using the NIPALS algorithm:
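The full listing ships with the Accord.NET Framework; below is a condensed, self-contained C# sketch of the algorithm exactly as described in the Overview. It assumes X and Y arrive already mean-centered, follows the variant that normalizes w, t and q, and writes its results into caller-allocated matrices. All names and helpers are illustrative, not the framework's API:

```csharp
using System;

static class NipalsSketch
{
    // Computes 'factors' PLS components from mean-centered X (n x dx) and
    // Y (n x dy), filling the caller-allocated scores T (n x factors) and
    // U (n x factors), loadings P (dx x factors) and Q (dy x factors),
    // weights W (dx x factors), and the coefficient vector B (factors).
    public static void Compute(double[][] X, double[][] Y, int factors,
        double[][] T, double[][] U, double[][] P, double[][] Q,
        double[][] W, double[] B, double tolerance = 1e-14)
    {
        int n = X.Length;
        for (int i = 0; i < factors; i++)
        {
            // Start u as the column of Y with the largest sum of squares
            double[] u = LargestColumn(Y);
            double[] t = new double[n];
            double[] w = null, q = null;

            for (int iter = 0; iter < 1000; iter++)
            {
                w = Normalize(TransTimes(X, u));        // w ∝ X'u
                double[] tNew = Normalize(Times(X, w)); // t ∝ Xw
                q = Normalize(TransTimes(Y, tNew));     // q ∝ Y't
                u = Times(Y, q);                        // u = Yq

                double change = SquaredDistance(t, tNew);
                t = tNew;
                if (change < tolerance) break;          // t has converged
            }

            double b = Dot(t, u);          // prediction coefficient b = t'u
            double[] p = TransTimes(X, t); // X factor loadings p = X't

            // Deflate X: X = X - tp'  (some variants also deflate Y)
            for (int r = 0; r < n; r++)
                for (int c = 0; c < p.Length; c++)
                    X[r][c] -= t[r] * p[c];

            SetColumn(T, i, t); SetColumn(U, i, u); SetColumn(P, i, p);
            SetColumn(Q, i, q); SetColumn(W, i, w); B[i] = b;
        }
    }

    static double[] LargestColumn(double[][] m)
    {
        int best = 0; double bestSum = double.MinValue;
        for (int j = 0; j < m[0].Length; j++)
        {
            double s = 0;
            for (int r = 0; r < m.Length; r++) s += m[r][j] * m[r][j];
            if (s > bestSum) { bestSum = s; best = j; }
        }
        double[] col = new double[m.Length];
        for (int r = 0; r < m.Length; r++) col[r] = m[r][best];
        return col;
    }

    static double[] Times(double[][] m, double[] v) // m*v
    {
        var r = new double[m.Length];
        for (int i = 0; i < m.Length; i++)
            for (int j = 0; j < v.Length; j++) r[i] += m[i][j] * v[j];
        return r;
    }

    static double[] TransTimes(double[][] m, double[] v) // m'*v
    {
        var r = new double[m[0].Length];
        for (int i = 0; i < m.Length; i++)
            for (int j = 0; j < r.Length; j++) r[j] += m[i][j] * v[i];
        return r;
    }

    static double Dot(double[] a, double[] b)
    { double s = 0; for (int i = 0; i < a.Length; i++) s += a[i] * b[i]; return s; }

    static double[] Normalize(double[] v)
    {
        double d = Math.Sqrt(Dot(v, v));
        var r = new double[v.Length];
        for (int i = 0; i < v.Length; i++) r[i] = v[i] / d;
        return r;
    }

    static double SquaredDistance(double[] a, double[] b)
    { double s = 0; for (int i = 0; i < a.Length; i++) s += (a[i] - b[i]) * (a[i] - b[i]); return s; }

    static void SetColumn(double[][] m, int j, double[] v)
    { for (int i = 0; i < v.Length; i++) m[i][j] = v[i]; }
}
```

Note that X is deflated in place; pass a copy if the original matrix must be preserved.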

Performing PLS using SIMPLS

And here is the source code for computing PLS using the SIMPLS algorithm:
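As above, the listing below is a condensed sketch rather than the framework's exact code. It follows the SIMPLS steps from the Overview, assumes mean-centered inputs, and finds the dominant singular pair of the (deflated) covariance matrix by power iteration, which yields the same first singular vectors as a full SVD while keeping the sketch dependency-free:

```csharp
using System;

static class SimplsSketch
{
    // Computes 'factors' PLS components from mean-centered X (n x dx) and
    // Y (n x dy), filling scores T, U, loadings P, C and weights W.
    public static void Compute(double[][] X, double[][] Y, int factors,
        double[][] T, double[][] U, double[][] P, double[][] C, double[][] W)
    {
        int n = X.Length, dx = X[0].Length, dy = Y[0].Length;

        // 1. The covariance (cross-product) matrix S = X'Y
        var S = new double[dx][];
        for (int r = 0; r < dx; r++)
        {
            S[r] = new double[dy];
            for (int c = 0; c < dy; c++)
                for (int k = 0; k < n; k++)
                    S[r][c] += X[k][r] * Y[k][c];
        }

        var V = new double[factors][]; // orthonormal basis of X loadings

        for (int i = 0; i < factors; i++)
        {
            // First left singular vector of S (power iteration on SS')
            double[] w = Normalize(Column(S, 0));
            for (int it = 0; it < 100; it++)
                w = Normalize(Times(S, TransTimes(S, w)));
            double[] c = TransTimes(S, w); // right singular vector x sigma

            double[] t = Times(X, w);                 // t ∝ Xw
            double normT = Math.Sqrt(Dot(t, t));
            Scale(t, 1 / normT);
            double[] p = TransTimes(X, t);            // p = X't
            Scale(c, 1 / normT);                      // c = c / norm(t)
            Scale(w, 1 / normT);                      // w = w / norm(t)
            double[] u = Times(Y, c);                 // u = Yc
            var v = (double[])p.Clone();              // basis candidate

            for (int j = 0; j < i; j++)               // v ⟂ V, u ⟂ T
            {
                double[] tj = Column(T, j);
                AddScaled(v, V[j], -Dot(V[j], v));
                AddScaled(u, tj, -Dot(tj, u));
            }
            V[i] = Normalize(v);

            double[] vS = TransTimes(S, V[i]);        // deflate: S -= v(v'S)
            for (int r = 0; r < dx; r++)
                for (int c2 = 0; c2 < dy; c2++)
                    S[r][c2] -= V[i][r] * vS[c2];

            SetColumn(T, i, t); SetColumn(U, i, u); SetColumn(P, i, p);
            SetColumn(C, i, c); SetColumn(W, i, w);
        }
    }

    static double[] Times(double[][] m, double[] v) // m*v
    {
        var r = new double[m.Length];
        for (int i = 0; i < m.Length; i++)
            for (int j = 0; j < v.Length; j++) r[i] += m[i][j] * v[j];
        return r;
    }

    static double[] TransTimes(double[][] m, double[] v) // m'*v
    {
        var r = new double[m[0].Length];
        for (int i = 0; i < m.Length; i++)
            for (int j = 0; j < r.Length; j++) r[j] += m[i][j] * v[i];
        return r;
    }

    static double Dot(double[] a, double[] b)
    { double s = 0; for (int i = 0; i < a.Length; i++) s += a[i] * b[i]; return s; }

    static double[] Normalize(double[] v)
    {
        double d = Math.Sqrt(Dot(v, v));
        var r = new double[v.Length];
        for (int i = 0; i < v.Length; i++) r[i] = v[i] / d;
        return r;
    }

    static void Scale(double[] v, double s)
    { for (int i = 0; i < v.Length; i++) v[i] *= s; }

    static void AddScaled(double[] v, double[] d, double a)
    { for (int i = 0; i < v.Length; i++) v[i] += a * d[i]; }

    static double[] Column(double[][] m, int j)
    { var c = new double[m.Length]; for (int i = 0; i < m.Length; i++) c[i] = m[i][j]; return c; }

    static void SetColumn(double[][] m, int j, double[] v)
    { for (int i = 0; i < v.Length; i++) m[i][j] = v[i]; }
}
```

With a linear algebra library at hand, the power iteration can be replaced by taking the first singular triplet of the deflated S directly.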

Multivariate Linear Regression

Multivariate Linear Regression is computed in a similar manner to Multiple Linear Regression. The only difference is that, instead of a weight vector and a scalar intercept, we have a weight matrix and an intercept vector.

The weight matrix and the intercept vector are computed in the PartialLeastSquaresAnalysis class by the CreateRegression method. If the analyzed data was already mean-centered before being fed to the analysis, the constructed intercept vector will consist only of zeros.
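Conceptually, the created regression computes nothing more than the following (a minimal sketch with illustrative names, not the framework's API):

```csharp
// One prediction of a multivariate linear regression: a weight matrix
// W (inputs x outputs) and an intercept vector b replace the single
// weight vector and scalar intercept of multiple linear regression.
static double[] Transform(double[] x, double[][] W, double[] b)
{
    var y = (double[])b.Clone(); // start from the intercepts
    for (int j = 0; j < y.Length; j++)
        for (int i = 0; i < x.Length; i++)
            y[j] += x[i] * W[i][j];
    return y;
}
```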


Using the code

As an example, let's consider the example data from Hervé Abdi, where the goal is to predict the subjective evaluation of a set of 5 wines. The dependent variables we want to predict for each wine are its likeability and how well it goes with meat or dessert (as rated by a panel of experts). The predictors are the price and the sugar, alcohol, and acidity content of each wine.
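In code, the dataset can be laid out as two matrices, one row per wine (values as tabulated in Abdi's paper):

```csharp
// Columns of inputs: price, sugar, alcohol, acidity.
double[,] inputs =
{
    {  7, 7, 13, 7 },
    {  4, 3, 14, 7 },
    { 10, 5, 12, 5 },
    { 16, 7, 11, 3 },
    { 13, 3, 10, 3 },
};

// Columns of outputs: hedonic score, goes with meat, goes with dessert.
double[,] outputs =
{
    { 14, 7, 8 },
    { 10, 7, 6 },
    {  8, 5, 5 },
    {  2, 4, 7 },
    {  6, 2, 4 },
};
```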

Next, we proceed to create the Partial Least Squares Analysis using the Covariance method (the data will only be mean-centered, but not normalized) and the SIMPLS algorithm.
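A sketch of this step, assuming the constructor-style API of the blog-era framework and that the AnalysisMethod.Center option corresponds to the mean-centering ("Covariance") method described above (enum member names have varied across framework versions):

```csharp
// Create the analysis: mean centering only, using the SIMPLS algorithm.
var pls = new PartialLeastSquaresAnalysis(inputs, outputs,
    AnalysisMethod.Center, PartialLeastSquaresAlgorithm.SIMPLS);

pls.Compute(); // run the actual analysis
```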

After the analysis has been computed, we can proceed and create the regression model.
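Creating the regression model is then a single call (a sketch; some framework versions also let you pass the number of factors to keep):

```csharp
// Map the original input variables directly to the outputs.
var regression = pls.CreateRegression();
```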

Now that the regression has been computed, we can check how well it performs. The coefficient of determination r² for the variables Hedonic, Goes with Meat and Goes with Dessert can be computed by the CoefficientOfDetermination method of the MultivariateLinearRegression class and will be, respectively, 0.9999, 0.9999 and 0.8750 – the closer to one, the better.
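A sketch of that check (depending on the framework version, the matrices may first need to be converted to jagged double[][] arrays):

```csharp
// One r² value per output column: Hedonic, Goes with Meat, Goes with Dessert.
double[] r2 = regression.CoefficientOfDetermination(inputs, outputs);
// Expected here: approximately { 0.9999, 0.9999, 0.8750 }
```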

Sample application

The accompanying sample application performs Partial Least Squares Analysis and Regression in Excel worksheets. The predictors and dependent variables can be selected once the data has been loaded in the application.

Left: Wine example from Hervé Abdi. Right: Variance explained by PLS using the SIMPLS algorithm


Left: Partial Least Squares Analysis results and regression coefficients for the full regression model. Right: projection of the dependent and predictor variables using the first three factors.


Results from the Multivariate Linear Regression performed in Latent Space using
three factors from PLS.

Left: Example data from Geladi and Kowalski. Right: Analysis results for the data using NIPALS. We can see that just two factors are enough to explain the whole variance of the set.


Left: Projections to the latent spaces. Right: Loadings and coefficients for the
two factor Multivariate Linear Regression model.



Results from the Partial Least Squares Regression showing a perfect fit using only the first two factors.


References

  • Abdi, H. Partial Least Squares (PLS) Regression. In: Lewis-Beck, M., Bryman, A., Futing, T. (Eds.), Encyclopedia of Social Science Research Methods. Thousand Oaks, CA: Sage, 2003.
  • Geladi, P.; Kowalski, B. R. Partial least-squares regression: a tutorial. Analytica Chimica Acta, 185 (1986), 1–17.
  • de Jong, S. SIMPLS: an alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18 (1993), 251–263.


17 Comments

  1. Hi,

    I am not very sure if it is the same thing, but it is possible to get the proportion of variance explained by each variable in the projection. The proportions are available through the FactorProportions property.

    Regards,
    César

  2. Hi,

    The code currently does not calculate VIP, but it surely looks like a very useful tool to implement. I’ll take a look at it later, thanks for letting me know about it.

    Regards,
    César

  3. Hi,

    Thanks for adding VIP support.

    One more question: can the regression model be saved and reused for future predictions? I suggest implementing ISerializable in all of the regression classes (PLS, Simple/Multiple/Multivariate) to support this feature.

    Best regards,
    Erison

  4. Hi,

    Thanks for sharing these algorithms with the community! However, I’m having some problems with the “largest” method in the NIPALS algorithm, which should return the largest column of the “E” matrix. This method is not part of the DLLs in the source code/sample application. How should I fix this?

  5. Hi,

    This has been really useful. I have been able to get SIMPLS to work but when I try to use NIPALS I get an index out of range exception when I call .Compute(). I use the same inputs and outputs in each case.

    Thanks for the help in advance

  6. Hello,

    Are you using the Accord.NET Framework or just the samples provided here? If you are using the framework, please let me know if you can share a simple example that triggers that error. This way I can attempt to fix it for the next version!

    Regards,
    Cesar

  7. Hello again Cesar, sorry for my very delayed reply! I haven’t been working on this project for some time but I’m back on it now. To answer your question I’m using the Accord.NET Framework and I’ll try to get a snippet. But for now the data used for the PLS is:

    inputs: double[6500,700],
    outputs: double[6500,1],
    AnalysisMethod.Standardize,
    PartialLeastSquaresAlgorithm.NIPALS

    SIMPLS is working for me so it’s not a huge issue in the short term. A big issue for me is that the inputs into the PLSAnalysis are of type double[,]. I’m working with a very large data set, which is a matrix of at least 90,000 by 700. You can’t have a 2D array of that size in the .NET Framework (you run out of memory after 2 GB). Any thoughts on how I can solve this problem? One option could be using a jagged array or List, but obviously this is no use if I use the Accord framework.

    Kind Regards,
    James

  8. Hi, I’m doing customer satisfaction research. I want to know if NIPALS and SIMPLS are suitable for calculating customer satisfaction. I’ve already done path modeling, but I am wondering how to calculate it using partial least squares (PLS-SEM), just like SmartPLS does. My statistics background is really weak…

  9. Hi,
    Does the PLS algorithm in Accord.NET support batch PLS, also called extended PLS or multiway PLS?
    Thanks,
    Courtney
