Hidden Markov Models in C#


Hidden Markov Models (HMM) are stochastic methods to model temporal and sequence data. They are especially known for their application in temporal pattern recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical score following, partial discharges and bioinformatics.


The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications. It is based on the already excellent AForge.NET Framework. Please see the starting guide for mode details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.


  1. Introduction
  2. Definition
    1. Notation
    2. Canonical problems
    3. Choosing the structure
  3. Algorithms
    1. Evaluation
    2. Decoding
    3. Learning
  4. Using the code
  5. Remarks
    1. Known issues
  6. Acknowledgements
  7. See also
  8. References



Hidden Markov Models were first described in a series of statistical papers by Leonard E. Baum and other authors in the second half of the 1960s. One of the first applications of HMMs was speech recognition, starting in the mid-1970s. Indeed, one of the most comprehensive explanations on the topic was published in “A Tutorial On Hidden Markov Models And Selected Applications in Speech Recognition”, by Lawrence R. Rabiner in 1989. In the second half of the 1980s, HMMs began to be applied to the analysis of biological sequences, in particular DNA. Since then, they have become ubiquitous in the field of bioinformatics.

Dynamical systems of discrete nature assumed to be governed by a Markov chain emits a sequence of observable outputs. Under the Markov assumption, it is also assumed that the latest output depends only on the current state of the system. Such states are often not known from the observer when only the output values are observable.

Example of a hidden Markov model Hidden Markov Models attempt to model such systems and allow, among other things, (1) to infer the most likely sequence of states that produced a given output sequence, to (2) infer which will be the most likely next state (and thus predicting the next output) and (3) calculate the probability that a given sequence of outputs originated from the system (allowing the use of hidden Markov models for sequence classification).

The “hidden” in Hidden Markov Models comes from the fact that the observer does not know in which state the system may be in, but has only a probabilistic insight on where it should be.



Hidden Markov Models can be seem as finite state machines where for each sequence unit observation there is a state transition and, for each state, there is a output symbol emission.


Traditionally, HMMs have been defined by the following quintuple:

lambda = (N, M, A, B, pi)


  • N is the number of states for the model
  • M is the number of distinct observations symbols per state, i.e. the discrete alphabet size.
  • A is the NxN state transition probability distribution given in the form of a matrix A = {aij}
  • B is the NxM observation symbol probability distribution given in the form of a matrix B = {bj(k)}
  • π is the initial state distribution vector π = {πi}


Note that, if we opt out the structure parameters M and N we have the more often used compact notation

lambda = (A, B, pi)

Canonical problems

There are three canonical problems associated with hidden Markov models, which I’ll quote from Wikipedia:

  1. Given the parameters of the model, compute the probability of a particular output sequence. This requires summation over all possible state sequences, but can be done efficiently using the Forward algorithm, which is a form of dynamic programming.
  2. Given the parameters of the model and a particular output sequence, find the state sequence that is most likely to have generated that output sequence. This requires finding a maximum over all possible state sequences, but can similarly be solved efficiently by the Viterbi algorithm.
  3. Given an output sequence or a set of such sequences, find the most likely set of state transition and output probabilities. In other words, derive the maximum likelihood estimate of the parameters of the HMM given a dataset of output sequences. No tractable algorithm is known for solving this problem exactly, but a local maximum likelihood can be derived efficiently using the Baum-Welch algorithm or the Baldi-Chauvin algorithm. The Baum-Welch algorithm is an example of a forward-backward algorithm, and is a special case of the Expectation-maximization algorithm.

The solution for those problems are exactly what makes Hidden Markov Models useful. The ability to learn from the data (using the solution of problem 3) and then become able to make predictions (solution to problem 2) and able to classify sequences (solution of problem 2) is nothing but applied machine learning. From this perspective, HMMs can just be seem as supervisioned sequence classifiers and sequence predictors with some other useful interesting properties.

Choosing the structure

Choosing the structure for a hidden Markov model is not always obvious. The number of states depend on the application and to what interpretation one is willing to give to the hidden states. Some domain knowledge is required to build a suitable model and also to choose the initial parameters that an HMM can take. There is also some trial and error involved, and there are sometimes complex tradeoffs that have to be made between model complexity and difficulty of learning, just as is the case with most machine learning techniques.

Additional information can be found on http://www.cse.unsw.edu.au/~waleed/phd/tr9806/node12.html.



The solution to the three canonical problems are the algorithms that makes HMMs useful. Each of the three problems are described in the three subsections below.


The first canonical problem is the evaluation of the probability of a particular output sequence. It can be efficiently computed using either the Viterbi-forward or the Forward algorithms, both of which are forms of dynamic programming.

The Viterbi algorithm originally computes the most likely sequence of states which has originated a sequence of observations. In doing so, it is also able to return the probability of traversing this particular sequence of states. So to obtain Viterbi probabilities, please refer to the Decoding problem referred below.

The Forward algorithm, unlike the Viterbi algorithm, does not find a particular sequence of states; instead it computes the probability that any sequence of states has produced the sequence of observations. In both algorithms, a matrix is used to store computations about the possible state sequence paths that the model can assume. The forward algorithm also plays a key role in the Learning problem, and is thus implemented as a separate method.



The second canonical problem is the discovery of the most likely sequence of states that generated a given output sequence. This can be computed efficiently using the Viterbi algorithm. A trackback is used to detect the maximum probability path travelled by the algorithm. The probability of travelling such sequence is also computed in the process.



The third and last problem is the problem of learning the most likely parameters that best models a system given a set of sequences originated from this system. Most implementations I’ve seem did not consider the problem of learning from a set of sequences, but only from a single sequence at a time. The algorithm below, however, is fully suitable to learn from a set of sequences and also uses scaling, which is another thing I have not seem in other implementations.

The source code follows the original algorithm by Rabiner (1989). There are, however, some known issues with the algorithms detailed in Rabiner’s paper. More information about those issues is available in a next section of this article entitled “Remarks”.


Using the code

Lets suppose we have gathered some sequences from a system we wish to model. The sequences are expressed as a integer array such as:

For us, it can be obvious to see that the system is outputting sequences that always start with a zero and have one or more ones at the end. But lets try to fit a Hidden Markov Model to predict those sequences.

Once the model is trained, lets test to see if it recognizes some sequences:

Of course the model performs well as this a rather simple example. A more useful test case would consist of allowing for some errors in the input sequences in the hope that the model will become more tolerant to measurement errors.

We can see that, despite having a very low probability, the likelihood values for the sequences containing a simulated measurement error are greater than the likelihoods for the sequences which do not follow the sequence structure at all.

In a subsequent article, we will see that those low values for the likelihoods will not be a problem because HMMs are often used in sets to form sequence classifiers. When used in such configurations, what really matters is which HMM returns the highest probability among others in the set.



A practical issue in the use of Hidden Markov Models to model long sequences is the numerical scaling of conditional probabilities. The probability of observing a long sequence given most models is extremely small, and the use of these extremely small numbers in computations often leads to numerical instability, making application of HMMs to genome length sequences quite challenging.

There are two common approaches to dealing with small conditional probabilities. One approach is to rescale the conditional probabilities using carefully designed scaling factors, and the other approach is to work with the logarithms of the conditional probabilities. For more information on using logarithms please see the work entitled “Numerically Stable Hidden Markov Model Implementation”, by Tobias P. Mann.

Known issues

The code on this article is based on the Tutorial by Rabiner. There are, however, some problems with the scaling and other algorithms. An errata depicting all issues is available in the website “An Erratum for ‘A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition’” and is maintained by Ali Rahimi. I have not yet verified if the implementation presented here also suffers from the same mistakes explained there. This code has worked well under many situations, but I cannot guarantee its perfectness. Please use at your own risk.


Thanks to Guilherme C. Pedroso, for the help with the Baum-Welch generalization for multiple input sequences. He has also co-written a very interesting article using hidden Markov models for gesture recognition, entitled “Automatic Recognition of Finger Spelling for LIBRAS based on a Two-Layer Architecture” published in the 25th Symposium On Applied Computing (ACM SAC 2010).

See also



  1. Hello Cesar and thanks a lot for this great piece of information and your crystal clear code! Can you please give me some information about how your HiddenMarkovModel class could be helpfull in a speech recognition project (speaker independant)?

    Thanks again!

  2. Hi Mad,

    I am glad you found it useful. For the field of speech recognition perhaps it would be better to use Continuous-density hidden Markov models instead (which could be trained directly on Cepstrum coefficients of spoken words). Please note that I am not very familiar with the field, but I believe this is one of the most popular approaches for this problem.

    Continuous-density models are already available in the latest version of the Accord.NET Framework. Please note, however, that a major refactoring is planned for the HMM namespace in order to allow for different training algorithms other than Baum-Welch. Thus the class usage may change a little in a next release of the framework. I hope you also find it useful in your research.

    Happy new year,

  3. Hey Cesar, I have implemented an MFCC to get cepstral coefficients but results are like “4.852438240110826, 6.716542468921553, 6.289309087157678 , 5.6106505660776484, …. ” These vectors of decimal values are not accepted from the continuous HMM in Accord.NET Framework (i understand values must be like 4,5,6,7…) in order to be trained correctly…Is there something I am missing or I should normalize these values into something “continuous” in order to pass them directly in the CHMM?
    Also, why you are using 2 states in your continuous hmm classfier example? How can I find the correct number of states needed?
    Thanks for your time.

  4. Hi John,

    You mentioned “decimal” values. Are you using the “decimal” type from .NET? In this case, please use doubles instead of decimals. In continuous-density HMMs there is no need to discretize your values into integers as you mentioned, so you can just feed an double[] (or double[][]) directly.

    By the way, the code on this page is for discrete-density models only. Please make sure you are using the latest version of the Accord.NET Framework, which has classes for continuous-density models.

    And finally, if you can wait a little, a new version of the Accord.NET will have the HMM namespace redesigned to allow easier creation of specific model architectures, such as ergodic or forward-only models which are commonly used in speech recognition problems.

    Feel free to send me an email if you wish to take a look on the latest version before it is officially released.

    Best regards,

  5. Hi

    I’m not really sure i understand the reason for picking two and three states respectively in the “Usage of code”-section!

  6. Hello Cesar,

    Thanks for your effort in this library,it is really very helpful.
    I have some questions and hope to get answers or any guiding information from you.I don’t now what is the third parameter in HiddenMarkovClassifier constructor, is it the density for observations??
    I am trying to use the HMM classifier for classifying image frames sequences, the misclassification rate is too high for testing with the same training data when I used NormalDistribution so how can I fit observed feature vectors in a distribution. I want also to make use of information about possible hidden states for visemes for sequences and also fit this info to the best distribution instead of using Forward Topology that generates a uniform distribution. Please provide me with any clue or guiding info.
    Best regards

  7. Hi there,

    Are you using the complete version of the Accord.NET Framework? It contains the latest bugfixes and enhancements and most likely will perform better. The framework is available in


    Regarding your question about parameters, please take a look on the code examples given in the Accord.NET documentation for the continuous models:


    The third parameter is your initial guess for the observation probability densities. In the case of multivariate observations, it should be initialized carefully. A good approach to perform this initialization is using K-means or segmental K-Means. Basically, it involves clustering all observations and using the cluster information to initialize the states.

    This will most likely be available in a future version of the framework.

    Best regards,

  8. Thanks Cesar for your reply.
    I have been into the documentation before but wasn’t very clear for me that the third parameter is for observation probability density.
    I am using the latest version of accord, I successfully fit my observation into a mixture of mutlivariate gaussian but I have a problem when using it in HiddenMarkovClassifierLearning with BaumWelchLearning, i received an exception: “Covariance matrix is not positive “
    + “definite. Try specifying a regularization constant in the fitting options.”. So where should I specify this regularization constant? And why this is needed?

    Another question that I asked before in my previous post but wasn’t clear enough. If I know the sequences of states of the training data, can I pass it to the HiddenMarkov classifier in some way, instead of using “new Forward(nstates)”.
    Thanks a lot and sorry for bothering you.

  9. Hi CECUEG,

    Thanks for the note. I will try to make the documentation more clear.

    About the positive definite issue, the problem is that sometimes the data is either insufficient or it contains a constant variable, so the estimated Covariance matrix may become non-positive definite. To sidestep this issue, one adds a small regularization parameter (a scalar value) to the diagonal elements of the covariance matrix. To use such parameter, consider the following example:

    // Configure the learning algorithms to train the sequence classifier
    var teacher = new HiddenMarkovClassifierLearning(

    // Train each model until the log-likelihood changes less than 0.0001
    modelIndex => new BaumWelchLearning(
    Tolerance = 0.0001,
    Iterations = 0,

    FittingOptions = new NormalOptions()
    Diagonal = true, // only diagonal covariance matrices
    Regularization = 1e-5 // avoid non-positive definite errors

    In the example above, the regularization is passed as a FittingOptions object for the underlying Normal distributions.

    I will work on the documentation to make those steps more clear. Thanks for letting me know about it.

    About your last question, if you have the state path for the training sequences, then you could a supervised learning algorithm instead of Baum-Welch. Unfortunately one is not available in the framework at the current time, but may be in future versions.

    Best regards,

  10. César Souza, thank you for your article!

    But where is the code of “backward” method?
    Accord.NET contains this code without scaling coefficients and have many dependencies.
    Could you post it here?

  11. Hi Kvanttt!

    The “backward” algorithm can be seen on the file ForwardBackwardAlgorithm.cs. It contains the code with and without scaling coefficients. For the code with scaling coefficients you can check line 296; and for the code without scaling you can take a look on line 335. Those methods have only a single dependency, a Hidden Markov Model object passed as a parameter. But please note that the HMM object is simply a container for the transition matrix A, emission matrix B matrices and initial probabilities vector pi. Those are extracted on the first lines of the algorithm, so if you wish, you can pass those directly if you do not want to depend on the other parts of the Accord.NET Framework!

    Hope it helps!

    Best regards,

  12. Hi Cesar,
    I am using the latest version of Accord.net(v2.13) released on 30.08.2014,
    there is no method like “hmm.Learn” in it, they have made sperate libraries for learning, which are difficult for me to understand, some how i came up with this code

    static void Main(string[] args)

    int[][] sequences = new int[][]
    new int[] { 0,1,1,1,1,1,1 },
    new int[] { 0,1,1,1 },
    new int[] { 0,1,1,1,1 },
    new int[] { 0,1, },
    new int[] { 0,1,1 },

    // Creates a new Hidden Markov Model with 3 states for
    // an output alphabet of two characters (zero and one)
    HiddenMarkovModel hmm = new HiddenMarkovModel(2, 2);

    BaumWelchLearning Learn = new BaumWelchLearning(hmm);

    // Calculate the probability that the given
    // sequences originated from the model
    double l1 = hmm.Evaluate(new int[] { 0, 1, 1, 1, 1, 1, 1 }); // l1 = 0.9999
    double l2 = hmm.Evaluate(new int[] { 0, 1, 1, 1 }); // l2 = 0.9999

    double l3 = hmm.Evaluate(new int[] { 1, 1 }); // l3 = 0.0000
    double l4 = hmm.Evaluate(new int[] { 1, 0, 0, 0 }); // l4 = 0.0000

    “A :{0}t B :{1}t C :{2}t D :{3}”,l1, l2, l3, l4);


    The Results are awful, Please guide me how to use the new version, I am doing my FYP and I really need help, as i have to submit it in 10 days and I am really tensed, please guide me, I will be really really thankful to you if you guide me in detail, Thankyou.

    Best Regards,

    Results are
    A:0 B:0 C:-2.01642771949625E+28 D:-8.06571087798499E+28

  13. Hi Sheroz!

    The results that you are getting are the expected values, it is just that they are being reported as log-probabilities rather than likelihoods. Just apply a Math.Exp() function to your values and you will recover values in the 0…1 range. For example:

    A: Math.Exp(0) = 1
    B: Math.Exp(0) = 1
    C: Math.Exp(-2.01642771949625E+28) = 0
    D: Math.Exp(-8.06571087798499E+28) = 0

    Hope it helps! If you need further assistance, please ask!

    Best regards,

  14. Thank you Very much Cesar it worked.
    One more thing I need to ask, As I have big data set, I can’t do learning at every compile time, so is there any method to save The learning so that I don’t have to learn again and again.
    Hope to see your reply soon.
    Once again Thank you 🙂

  15. Hi Shehroz!

    Sure! After you perform the learning, you can save the hidden Markov models to disk using the .Save method of the HiddenMarkovModel class. The documentation page for the hidden Markov model save method is given here, although all you have to do is to pass the path on the disk where you would like to Markov model to be stored. Afterwards, you can load the model again by using the HiddenMarkovModel.Load method.

    Hope it helps!

    Best regards,

  16. Hello Cesar,

    Thank you for your excellent articles and code.

    I was wondering if it’s also possible to build Hierarchical Hidden Markov Models with the Accord framework and if so, how to go about it?

    Thanks in advance,

Leave a Reply

Your email address will not be published. Required fields are marked *