# Partial Least Squares Analysis and Regression in C#

Partial Least Squares Regression (PLS) is a technique that generalizes and combines features from principal component analysis and (multivariate) multiple regression. It has been widely adopted in the fields of chemometrics and the social sciences.

The code presented here is also part of the Accord.NET Framework, a framework for developing machine learning, computer vision, computer audition, statistics and math applications. Please see the starting guide for more details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.

## Introduction

Partial least squares regression (PLS-regression) is a statistical method that is related to principal components regression. The goal of this method is to find a linear regression model by projecting both the predicted variables and the observable variables to new, latent variable spaces. It was developed in the 1960s by Herman Wold to be used in econometrics. However, today it is most commonly used for regression in the field of chemometrics.

In statistics, latent variables (as opposed to observable variables), are variables
that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models.

A PLS model will try to find the multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space. PLS regression is particularly suited when the matrix of predictors has more variables than observations, and when there is multicollinearity among the X values. It is interesting to note that standard linear regression would likely fail to produce meaningful, interpretable models in those cases.

## Overview

### Multivariate Linear Regression in Latent Space

Multiple Linear Regression is a generalization of simple linear regression for multiple inputs. In turn, Multivariate Linear Regression is a generalization of Multiple Linear Regression for multiple outputs. Multivariate linear regression is a general linear model that can map an arbitrary-dimension space into another arbitrary-dimension space using only linear relationships. In the context of PLS, it is used to map the latent variable space for the inputs X into the latent variable space for the outputs Y. Those latent variable spaces are spanned by the loading matrices for X and Y, commonly denoted P and Q, respectively.

The goal of PLS algorithms is therefore to find these two matrices. There are two main algorithms to do this: NIPALS and SIMPLS.

### Algorithm 1: NIPALS

Here is an exposition of the NIPALS algorithm for finding the loading matrices required for PLS regression. There are, however, many variations of this algorithm which normalize or do not normalize certain vectors.

Algorithm:

• Let X be the mean-centered input matrix;
• Let Y be the mean-centered output matrix;
• Let P be the loadings matrix for X, and let pi denote the i-th column of P;
• Let Q be the loadings matrix for Y, and let qi denote the i-th column of Q;
• Let T be the score matrix for X, and let ti denote the i-th column of T;
• Let U be the score matrix for Y, and let ui denote the i-th column of U;
• Let W be the PLS weight matrix, and let wi denote the i-th column of W; and
• Let B be a diagonal matrix of prediction coefficients bi.

Then:

1. For each factor i to be calculated:
    1. Initially choose ui as the column of Y with the largest sum of squares
    2. While ti has not converged to the desired precision:
        1. wi ∝ X’ui (estimate X weights)
        2. ti ∝ Xwi (estimate X factor scores)
        3. qi ∝ Y’ti (estimate Y weights)
        4. ui = Yqi (estimate Y scores)
    3. pi = X’ti / (ti’ti) (estimate X factor loadings)
    4. bi = ti’ui (compute prediction coefficient bi)
    5. X = X – tipi’ (deflate X)

In other statistical analyses such as PCA, it is often interesting to inspect how much of the variance can be explained by each of the principal component dimensions. The same can also be accomplished for PLS, both for the input (predictor) variables X and the output (response) variables Y. For the output variables, the amount of variance explained by each factor can be computed as bi². For the input variables, it can be computed as the sum of the squared elements of the corresponding column of the matrix P, i.e. as Sum(pi²).
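In code, these proportions reduce to normalized sums of squares. The following is a minimal sketch with hypothetical helper names, not the framework's own code (the framework exposes the same information through the FactorProportions property):

```csharp
using System;

static class VarianceSketch
{
    // Proportion of the variance of X explained by each factor: Sum(pi²)
    // over the columns pi of the loadings matrix P, normalized to sum to 1.
    public static double[] ForInputs(double[][] P)   // P[i] = i-th loading column
    {
        var ss = new double[P.Length];
        double total = 0;
        for (int i = 0; i < P.Length; i++)
        {
            foreach (double v in P[i]) ss[i] += v * v;
            total += ss[i];
        }
        for (int i = 0; i < ss.Length; i++) ss[i] /= total;
        return ss;
    }

    // Proportion of the variance of Y explained by each factor: bi²,
    // again normalized by the total.
    public static double[] ForOutputs(double[] b)
    {
        var ss = new double[b.Length];
        double total = 0;
        for (int i = 0; i < b.Length; i++) { ss[i] = b[i] * b[i]; total += ss[i]; }
        for (int i = 0; i < ss.Length; i++) ss[i] /= total;
        return ss;
    }
}
```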

### Algorithm 2: SIMPLS

SIMPLS is an alternative algorithm for finding the PLS matrices P and Q, derived by directly maximizing the covariance between the latent factors and the output vectors. NIPALS and SIMPLS are equivalent when there is just one output variable Y to be regressed, but their answers can differ when Y is multi-dimensional. Moreover, because the construction of the weight vectors used by SIMPLS is based on the empirical variance–covariance matrix of the joint input and output variables, outliers present in the data can adversely impact its performance.

Algorithm:

• Let X be the mean-centered input matrix,
• Let Y be the mean-centered output matrix,
• Let P be the loadings matrix for X, and let pi denote the i-th column of P;
• Let C be the loadings matrix for Y, and let ci denote the i-th column of C;
• Let T be the score matrix for X, and ti denote the i-th column of T;
• Let U be the score matrix for Y, and ui denote the i-th column of U; and
• Let W be the PLS weight matrix, and wi denote the i-th column of W.

Then:

1. Create the cross-product matrix S = X’Y
2. For each factor i to be calculated:
    1. Perform an SVD of S; store the first left singular vector in wi and the first right singular vector, scaled by the first singular value, in ci
    2. ti ∝ Xwi (estimate X factor scores)
    3. pi = X’ti (estimate X factor loadings)
    4. ci = ci/norm(ti) (estimate Y weights)
    5. wi = wi/norm(ti) (estimate X weights)
    6. ui = Yci (estimate Y scores)
    7. vi = pi (form the basis vector vi)
    8. Make vi orthogonal to the previous basis vectors and normalize it
    9. Make ui orthogonal to the previous scores in T
    10. Deflate the cross-product matrix: S = S – vi(vi’S)

## Source Code

This section contains the implementation of the NIPALS and SIMPLS algorithms in C#. The models have been implemented with an object-oriented structure that is particularly suitable for data-binding to Windows.Forms (or WPF) controls.

## Class Diagram

Class diagram for the Partial Least Squares Analysis.

## Performing PLS using NIPALS

Here is the source code for computing PLS using the NIPALS algorithm:
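The full listing belongs to the framework's source. As a condensed, self-contained sketch of the algorithm described above, assuming mean-centered jagged arrays and hypothetical helper names (the framework's own implementation adds normalization options and further safeguards):

```csharp
using System;

static class NipalsSketch
{
    // Extracts 'factors' PLS components from mean-centered X (n×p) and Y (n×q).
    // On return, W[i], P[i] and Q[i] hold the i-th weight, X-loading and
    // Y-weight vectors, and b[i] the i-th prediction coefficient.
    public static void Compute(double[][] X, double[][] Y, int factors,
        double[][] W, double[][] P, double[][] Q, double[] b)
    {
        int n = X.Length, p = X[0].Length;

        for (int i = 0; i < factors; i++)
        {
            double[] u = LargestColumn(Y);       // initial guess for the Y scores
            double[] t = new double[n], w = null, q = null;

            for (int iter = 0; iter < 1000; iter++)
            {
                w = Unit(TMul(X, u));            // w ∝ X'u  (X weights)
                double[] t1 = Mul(X, w);         // t ∝ Xw   (X factor scores)
                q = Unit(TMul(Y, t1));           // q ∝ Y't  (Y weights)
                u = Mul(Y, q);                   // u = Yq   (Y scores)

                double d = 0;                    // check convergence of t
                for (int r = 0; r < n; r++) d += (t1[r] - t[r]) * (t1[r] - t[r]);
                t = t1;
                if (d < 1e-14) break;
            }

            double tt = Dot(t, t);
            double[] load = TMul(X, t);          // p = X't / t't  (X loadings)
            for (int j = 0; j < p; j++) load[j] /= tt;

            b[i] = Dot(t, u) / tt;               // prediction coefficient

            for (int r = 0; r < n; r++)          // deflate: X = X - t p'
                for (int j = 0; j < p; j++)
                    X[r][j] -= t[r] * load[j];

            W[i] = w; P[i] = load; Q[i] = q;
        }
    }

    static double Dot(double[] a, double[] b)
    { double s = 0; for (int i = 0; i < a.Length; i++) s += a[i] * b[i]; return s; }

    static double[] Mul(double[][] A, double[] v)        // A·v
    { var r = new double[A.Length];
      for (int i = 0; i < A.Length; i++) r[i] = Dot(A[i], v);
      return r; }

    static double[] TMul(double[][] A, double[] v)       // A'·v
    { var r = new double[A[0].Length];
      for (int i = 0; i < A.Length; i++)
          for (int j = 0; j < r.Length; j++) r[j] += A[i][j] * v[i];
      return r; }

    static double[] Unit(double[] v)                     // v / ||v||
    { double norm = Math.Sqrt(Dot(v, v));
      for (int i = 0; i < v.Length; i++) v[i] /= norm;
      return v; }

    static double[] LargestColumn(double[][] A)          // column with largest Sum(a²)
    { int best = 0; double bestSS = -1;
      for (int j = 0; j < A[0].Length; j++)
      { double ss = 0;
        for (int i = 0; i < A.Length; i++) ss += A[i][j] * A[i][j];
        if (ss > bestSS) { bestSS = ss; best = j; } }
      var c = new double[A.Length];
      for (int i = 0; i < A.Length; i++) c[i] = A[i][best];
      return c; }
}
```

Note that the sketch deflates X in place, so callers who need the original matrix afterwards should pass a copy.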

## Performing PLS using SIMPLS

And here is the source code for computing PLS using the SIMPLS algorithm:
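Again, the full listing belongs to the framework's source. For the common case of a single output variable — where the SVD of the p×1 cross-product vector s = X’y reduces to a simple normalization — a self-contained sketch of the steps above, with hypothetical helper names, could look like this:

```csharp
using System;

static class SimplsSketch
{
    // SIMPLS for mean-centered X (n×p) and a single mean-centered output y.
    // W[i], P[i] and T[i] receive the i-th weight, loading and score vectors;
    // c[i] receives the i-th Y weight.
    public static void Compute(double[][] X, double[] y, int factors,
        double[][] W, double[][] P, double[][] T, double[] c)
    {
        int n = X.Length, p = X[0].Length;
        double[] s = TMul(X, y);                 // cross-product vector s = X'y
        var V = new double[factors][];           // orthonormal basis for deflation

        for (int i = 0; i < factors; i++)
        {
            // With one output, the first left singular vector of s is s / ||s||
            double[] w = Unit((double[])s.Clone());

            double[] t = Mul(X, w);              // X factor scores
            double normt = Math.Sqrt(Dot(t, t));
            for (int r = 0; r < n; r++) t[r] /= normt;
            for (int j = 0; j < p; j++) w[j] /= normt;

            double[] load = TMul(X, t);          // X loadings p_i = X't
            c[i] = Dot(y, t);                    // Y weight c_i = y't

            // Form the basis vector v_i = p_i, orthogonal to previous ones
            var v = (double[])load.Clone();
            for (int k = 0; k < i; k++)
            {
                double proj = Dot(V[k], load);
                for (int j = 0; j < p; j++) v[j] -= proj * V[k][j];
            }
            Unit(v);

            double vs = Dot(v, s);               // deflate: s = s - v (v's)
            for (int j = 0; j < p; j++) s[j] -= vs * v[j];

            W[i] = w; P[i] = load; T[i] = t; V[i] = v;
        }
    }

    static double Dot(double[] a, double[] b)
    { double r = 0; for (int i = 0; i < a.Length; i++) r += a[i] * b[i]; return r; }

    static double[] Mul(double[][] A, double[] v)        // A·v
    { var r = new double[A.Length];
      for (int i = 0; i < A.Length; i++) r[i] = Dot(A[i], v);
      return r; }

    static double[] TMul(double[][] A, double[] v)       // A'·v
    { var r = new double[A[0].Length];
      for (int i = 0; i < A.Length; i++)
          for (int j = 0; j < r.Length; j++) r[j] += A[i][j] * v[i];
      return r; }

    static double[] Unit(double[] v)                     // v / ||v||
    { double norm = Math.Sqrt(Dot(v, v));
      for (int i = 0; i < v.Length; i++) v[i] /= norm;
      return v; }
}
```

Unlike NIPALS, this version never modifies X or y; only the cross-product vector is deflated, which is one of the practical attractions of SIMPLS. The multi-output case replaces the normalization of s with a truncated SVD of the p×q matrix S.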

## Multivariate Linear Regression

Multivariate Linear Regression is computed in a similar manner to Multiple Linear Regression. The only difference is that, instead of having a weight vector and an intercept, we have a weight matrix and an intercept vector.

The weight matrix and the intercept vector are computed in the PartialLeastSquaresAnalysis class by the CreateRegression method. If the analyzed data was already mean-centered before being fed to the analysis, the constructed intercept vector will consist only of zeros.
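As a minimal sketch of this idea (hypothetical names, not the framework's code), applying such a regression to a single input vector amounts to one matrix–vector product plus the intercept vector:

```csharp
// Computes y = x·W + b for a weight matrix W (p×q) and intercept vector b
// (length q) — the multivariate analogue of the single weight vector and
// scalar intercept of multiple linear regression.
static double[] Transform(double[] x, double[,] W, double[] b)
{
    int q = b.Length;
    var y = new double[q];
    for (int j = 0; j < q; j++)
    {
        y[j] = b[j];                       // intercept (all zeros if the data was centered)
        for (int i = 0; i < x.Length; i++)
            y[j] += x[i] * W[i, j];
    }
    return y;
}
```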

## Using the code

As an example, let's consider the example data from Hervé Abdi, where the goal is to predict the subjective evaluation of a set of 5 wines. The dependent variables we want to predict for each wine are its likeability and how well it goes with meat or dessert (as rated by a panel of experts). The predictors are the price and the sugar, alcohol, and acidity content of each wine.

Next, we proceed to create the Partial Least Squares Analysis using the Covariance
method (data will only be mean centered but not normalized) and using the SIMPLS
algorithm.

After the analysis has been computed, we can proceed and create the regression model.

Now that the regression has been computed, we can check how well it performs. The coefficients of determination r² for the variables Hedonic, Goes with Meat and Goes with Dessert can be computed by the CoefficientOfDetermination method of the MultivariateLinearRegression class and are, respectively, 0.9999, 0.9999 and 0.8750 – the closer to one, the better.
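Putting the steps above together, a sketch of the whole example could look as follows. The class, enumeration and method names follow the ones mentioned in this article and its comments (PartialLeastSquaresAnalysis, AnalysisMethod, PartialLeastSquaresAlgorithm, Compute, CreateRegression, CoefficientOfDetermination), but exact signatures may differ between framework versions; the numbers are the wine data as given in Abdi's tutorial:

```csharp
using Accord.Statistics.Analysis;

// Predictors: price, sugar, alcohol and acidity content of each of the 5 wines
double[,] inputs =
{
    // Price  Sugar  Alcohol  Acidity
    {    7,     7,     13,      7 },
    {    4,     3,     14,      7 },
    {   10,     5,     12,      5 },
    {   16,     7,     11,      3 },
    {   13,     3,     10,      3 },
};

// Dependent variables: hedonic score, goes with meat, goes with dessert
double[,] outputs =
{
    { 14,  7,  8 },
    { 10,  7,  6 },
    {  8,  5,  5 },
    {  2,  4,  7 },
    {  6,  2,  4 },
};

// Mean-center (but do not normalize) the data — the Covariance method
// described in the text — and use the SIMPLS algorithm
var pls = new PartialLeastSquaresAnalysis(inputs, outputs,
    AnalysisMethod.Center, PartialLeastSquaresAlgorithm.SIMPLS);

pls.Compute();

// Create the multivariate linear regression in latent space
var regression = pls.CreateRegression();

// Coefficients of determination for Hedonic, Goes with Meat, Goes with Dessert
double[] r2 = regression.CoefficientOfDetermination(inputs, outputs);
```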

## Sample application

The accompanying sample application performs Partial Least Squares Analysis and Regression in Excel worksheets. The predictors and dependent variables can be selected once the data has been loaded in the application.

Left: Wine example from Hervé Abdi. Right: Variance explained by PLS using the SIMPLS algorithm

Left: Partial Least Squares Analysis results and regression coefficients for the full regression model. Right: projection of the dependent and predictor variables using the first three factors.

Results from the Multivariate Linear Regression performed in latent space using three factors from PLS.

Left: Example data from Geladi and Kowalski. Right: Analysis results for the data using NIPALS. We can see that just two factors are enough to explain the whole variance of the set.

Left: Projections to the latent spaces. Right: loadings and coefficients for the two-factor Multivariate Linear Regression model.

Results from the Partial Least Squares Regression showing a perfect fit using only the first two factors.

### Comments

1. Erison says:

Hi,

Is it possible to compute VIP (Variable Importance in Projection) in your code?

– Erison

2. César Souza says:

Hi,

I am not very sure if it is the same thing, but it is possible to get the proportion of variance explained by each variable in the projection. The proportions are available through the FactorProportions property.

Regards,
César

3. Erison says:

Hi,

There is an R package (http://mevik.net/work/software/VIP.R) to calculate VIP.

By the way, Thank you very much.

– Erison

4. César Souza says:

Hi,

The code currently does not calculate VIP, but it surely looks like a very useful tool to implement. I’ll take a look on it later, thanks for letting me know about it.

Regards,
César

5. César Souza says:

Hi,

The latest release of Accord.NET does support computation of VIP (Variable Importance in Projection) for PLS.

Regards,
César

6. Erison says:

Hi,

Thanks for the VIP support.

One more question: can the regression model be saved and reused to predict future data? I suggest implementing ISerializable in all of the regression classes (PLS, Simple/Multiple/Multivariate) to support this feature.

Best regards,
Erison

7. César Souza says:

Hi Erison,

Thanks for the suggestion. This is an already planned feature and most probably will be available in the next release of Accord.

Best regards,
César


9. Green Elegant says:

Hi,

Thanks for sharing these algorithms with the community! However I’m having some problems with the “largest” method in the NIPALS algorithm which should return the largest column in the “E” matrix. This method is not a part of the dll’s in the source code/sample application. How should I fix this?

10. César Souza says:

Hi there,

I am glad you found it useful. The entire source code for the framework is available in Google Code; the method you have asked is available here.

Regards,
Cesar

11. Anonymous says:

Hi,

This has been really useful. I have been able to get SIMPLS to work but when I try to use NIPALS I get an index out of range exception when I call .Compute(). I use the same inputs and outputs in each case.

Thanks for the help in advance

12. César Souza says:

Hello,

Are you using the Accord.NET Framework or just the samples provided here? If you are using the framework, please let me know if you can share a simple example that triggers that error. This way I can attempt to fix it for the next version!

Regards,
Cesar

13. Anonymous says:

Hello again Cesar, sorry for my very delayed reply! I haven’t been working on this project for some time but I’m back on it now. To answer your question I’m using the Accord.NET Framework and I’ll try to get a snippet. But for now the data used for the PLS is:

inputs: double[6500,700],
outputs: double[6500,1],
AnalysisMethod.Standardize,
PartialLeastSquaresAlgorithm.NIPALS

SIMPLS is working for me so it's not a huge issue in the short term. A big issue for me is that the inputs to the PLS analysis are of type double[,]. I'm working with a very large data set, a matrix of at least 90,000 by 700. You can't have a 2D array of that size in the .NET Framework (you run out of memory after 2 GB). Any thoughts on how I can solve this problem? One option could be using a jagged array or a List, but obviously this is no use if I use the Accord framework.

Kind Regards,
James

14. 霄陈 says:

Hi, I'm doing Customer Satisfaction research. I want to know if NIPALS and SIMPLS are suitable for calculating Customer Satisfaction. I've already done path modeling, but I am wondering how to calculate it using partial least squares (PLS-SEM), just like SmartPLS does. My statistics background is really weak…

15. Anonymous says:

Hi,
Does the PLS algorithm in Accord.NET support batch PLS, also called extended PLS or multiway PLS?
Thanks,
Courtney

16. César Souza says:

Hi Courtney,

Well, unfortunately not yet 🙁 But I have added a feature request to the framework page so it can be added in the future.

Best regards,
cesar

17. john cena says:

Hi,
It's a well-written article. I would like to work on it, but I am unable to download the code. Could you please mail me the code? My mail id is: alerts2info@gmail.com