Hidden Markov Models (HMMs) are stochastic models for temporal and sequence data. They are especially known for their application in temporal pattern recognition, such as speech, handwriting and gesture recognition, part-of-speech tagging, musical score following, partial discharge analysis and bioinformatics.
- Download sample and source code
- Download the Accord.NET Framework
The code presented here is also part of the Accord.NET Framework, a framework for developing machine learning, computer vision, computer audition, statistics and math applications. It is based on the already excellent AForge.NET Framework. Please see the starting guide for more details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.
Introduction
Hidden Markov Models were first described in a series of statistical papers by Leonard E. Baum and other authors in the second half of the 1960s. One of the first applications of HMMs was speech recognition, starting in the mid-1970s. Indeed, one of the most comprehensive explanations on the topic was published in “A Tutorial On Hidden Markov Models And Selected Applications in Speech Recognition”, by Lawrence R. Rabiner in 1989. In the second half of the 1980s, HMMs began to be applied to the analysis of biological sequences, in particular DNA. Since then, they have become ubiquitous in the field of bioinformatics.
A dynamical system of discrete nature, assumed to be governed by a Markov chain, emits a sequence of observable outputs. Under the Markov assumption, the latest output depends only on the current state of the system. The states themselves, however, are often hidden from the observer, who can see only the output values.
Hidden Markov Models attempt to model such systems and allow, among other things, (1) inferring the most likely sequence of states that produced a given output sequence, (2) inferring which state will most likely come next (and thus predicting the next output) and (3) calculating the probability that a given sequence of outputs originated from the system (which allows the use of hidden Markov models for sequence classification).
The “hidden” in Hidden Markov Models comes from the fact that the observer does not know which state the system is in, but has only probabilistic insight into where it should be.
Definition
Hidden Markov Models can be seen as finite state machines where, for each observation in the sequence, there is a state transition and, in each state, an output symbol is emitted.
Notation
Traditionally, HMMs have been defined by the following quintuple:

λ = (N, M, A, B, π)

where
- N is the number of states for the model
- M is the number of distinct observation symbols per state, i.e. the discrete alphabet size.
- A is the NxN state transition probability distribution given in the form of a matrix A = {aij}
- B is the NxM observation symbol probability distribution given in the form of a matrix B = {bj(k)}
- π is the initial state distribution vector π = {πi}
Note that, if we omit the structure parameters M and N, we have the more often used compact notation λ = (A, B, π).
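As a concrete illustration (not part of the original article), the parameters of a small two-state, two-symbol model can be written out as plain arrays. The sketch below is in Python for brevity, and the numbers are made up purely for illustration:

```python
# Hypothetical parameters for a 2-state, 2-symbol HMM (illustrative values only).
N = 2  # number of states
M = 2  # number of distinct observation symbols (alphabet size)

# A[i][j]: probability of moving from state i to state j (each row sums to 1)
A = [[0.7, 0.3],
     [0.4, 0.6]]

# B[j][k]: probability of emitting symbol k while in state j (each row sums to 1)
B = [[0.9, 0.1],
     [0.2, 0.8]]

# pi[i]: probability of starting in state i
pi = [0.6, 0.4]

# Sanity checks: every probability distribution must sum to one
assert all(abs(sum(row) - 1.0) < 1e-12 for row in A + B)
assert abs(sum(pi) - 1.0) < 1e-12
```

Given N and M, the shapes of A, B and π are fixed, which is why the compact notation λ = (A, B, π) is usually enough.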
Canonical problems
There are three canonical problems associated with hidden Markov models, which I’ll quote from Wikipedia:
- Given the parameters of the model, compute the probability of a particular output sequence. This requires summation over all possible state sequences, but can be done efficiently using the Forward algorithm, which is a form of dynamic programming.
- Given the parameters of the model and a particular output sequence, find the state sequence that is most likely to have generated that output sequence. This requires finding a maximum over all possible state sequences, but can similarly be solved efficiently by the Viterbi algorithm.
- Given an output sequence or a set of such sequences, find the most likely set of state transition and output probabilities. In other words, derive the maximum likelihood estimate of the parameters of the HMM given a dataset of output sequences. No tractable algorithm is known for solving this problem exactly, but a local maximum likelihood can be derived efficiently using the Baum-Welch algorithm or the Baldi-Chauvin algorithm. The Baum-Welch algorithm is an example of a forward-backward algorithm, and is a special case of the Expectation-maximization algorithm.
The solutions to these problems are exactly what makes Hidden Markov Models useful. The ability to learn from data (the solution to problem 3), and then to make predictions (the solution to problem 2) and to classify sequences (the solution to problem 1), is nothing but applied machine learning. From this perspective, HMMs can be seen as supervised sequence classifiers and sequence predictors with some other useful and interesting properties.
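To make problem 1 concrete, the sketch below (in Python, with made-up parameter values, not the implementation presented later in this article) computes the same sequence probability twice: once by the naive summation over every possible hidden state path, and once with the Forward recursion, illustrating why dynamic programming makes evaluation tractable:

```python
import itertools

# Illustrative 2-state, 2-symbol model (values made up for this sketch)
A = [[0.7, 0.3], [0.4, 0.6]]   # transition probabilities
B = [[0.9, 0.1], [0.2, 0.8]]   # emission probabilities
pi = [0.6, 0.4]                # initial state probabilities
obs = [0, 1, 1]                # an observation sequence

def evaluate_naive(obs):
    # Sum over every possible hidden state path: exponential in len(obs)
    total = 0.0
    for path in itertools.product(range(len(pi)), repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

def evaluate_forward(obs):
    # Forward algorithm: the same quantity in O(T * N^2) time
    alpha = [pi[i] * B[i][obs[0]] for i in range(len(pi))]
    for o in obs[1:]:
        alpha = [sum(alpha[j] * A[j][i] for j in range(len(pi))) * B[i][o]
                 for i in range(len(pi))]
    return sum(alpha)

# Both compute the same probability; only the cost differs
print(evaluate_naive(obs))
print(evaluate_forward(obs))
```

For a sequence of length T with N states there are N^T possible paths, so the naive sum quickly becomes infeasible while the Forward recursion stays cheap.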
Choosing the structure
Choosing the structure for a hidden Markov model is not always obvious. The number of states depends on the application and on the interpretation one is willing to give to the hidden states. Some domain knowledge is required to build a suitable model and also to choose the initial parameters that an HMM can take. There is also some trial and error involved, and there are sometimes complex trade-offs to be made between model complexity and difficulty of learning, just as is the case with most machine learning techniques.
Additional information can be found at http://www.cse.unsw.edu.au/~waleed/phd/tr9806/node12.html.
Algorithms
The solutions to the three canonical problems are the algorithms that make HMMs useful. Each of the three problems is described in the three subsections below.
Evaluation
The first canonical problem is the evaluation of the probability of a particular output sequence. It can be computed efficiently using either the Viterbi or the Forward algorithm, both of which are forms of dynamic programming.
The Viterbi algorithm computes the most likely sequence of states that originated a sequence of observations. In doing so, it also returns the probability of traversing this particular sequence of states. So to obtain Viterbi probabilities, please refer to the Decoding problem described below.
The Forward algorithm, unlike the Viterbi algorithm, does not find a particular sequence of states; instead, it computes the probability that any sequence of states produced the sequence of observations. In both algorithms, a matrix is used to store computations about the possible state sequence paths that the model can assume. The Forward algorithm also plays a key role in the Learning problem, and is thus implemented as a separate method.
```csharp
/// <summary>
///   Calculates the probability that this model has generated the given sequence.
/// </summary>
/// <remarks>
///   Evaluation problem. Given the HMM M = (A, B, pi) and the observation
///   sequence O = {o1, o2, ..., oK}, calculate the probability that model
///   M has generated sequence O. This can be computed efficiently using
///   either the Viterbi or the Forward algorithms.
/// </remarks>
/// <param name="observations">A sequence of observations.</param>
/// <param name="logarithm">True to return the log-likelihood, false to return
///   the likelihood. Default is false.</param>
/// <returns>The probability that the given sequence has been generated by this model.</returns>
public double Evaluate(int[] observations, bool logarithm)
{
    if (observations == null)
        throw new ArgumentNullException("observations");

    if (observations.Length == 0)
        return 0.0;

    // Forward algorithm
    double likelihood = 0;
    double[] coefficients;

    // Compute forward probabilities
    forward(observations, out coefficients);

    for (int i = 0; i < coefficients.Length; i++)
        likelihood += Math.Log(coefficients[i]);

    // Return the sequence probability
    return logarithm ? likelihood : Math.Exp(likelihood);
}

/// <summary>
///   Baum-Welch forward pass (with scaling)
/// </summary>
/// <remarks>
///   Reference: http://courses.media.mit.edu/2010fall/mas622j/ProblemSets/ps4/tutorial.pdf
/// </remarks>
private double[,] forward(int[] observations, out double[] c)
{
    int T = observations.Length;
    double[] pi = Probabilities;
    double[,] A = Transitions;

    double[,] fwd = new double[T, States];
    c = new double[T];

    // 1. Initialization
    for (int i = 0; i < States; i++)
        c[0] += fwd[0, i] = pi[i] * B[i, observations[0]];

    if (c[0] != 0) // Scaling
    {
        for (int i = 0; i < States; i++)
            fwd[0, i] = fwd[0, i] / c[0];
    }

    // 2. Induction
    for (int t = 1; t < T; t++)
    {
        for (int i = 0; i < States; i++)
        {
            double p = B[i, observations[t]];

            double sum = 0.0;
            for (int j = 0; j < States; j++)
                sum += fwd[t - 1, j] * A[j, i];
            fwd[t, i] = sum * p;

            c[t] += fwd[t, i]; // scaling coefficient
        }

        if (c[t] != 0) // Scaling
        {
            for (int i = 0; i < States; i++)
                fwd[t, i] = fwd[t, i] / c[t];
        }
    }

    return fwd;
}
```
Decoding
The second canonical problem is the discovery of the most likely sequence of states that generated a given output sequence. This can be computed efficiently using the Viterbi algorithm. A backtracking step is used to recover the maximum probability path traveled by the algorithm. The probability of traveling such a sequence is also computed in the process.
```csharp
/// <summary>
///   Calculates the most likely sequence of hidden states
///   that produced the given observation sequence.
/// </summary>
/// <remarks>
///   Decoding problem. Given the HMM M = (A, B, pi) and the observation sequence
///   O = {o1, o2, ..., oK}, calculate the most likely sequence of hidden states Si
///   that produced this observation sequence O. This can be computed efficiently
///   using the Viterbi algorithm.
/// </remarks>
/// <param name="observations">A sequence of observations.</param>
/// <param name="probability">The state optimized probability.</param>
/// <returns>The sequence of states that most likely produced the sequence.</returns>
public int[] Decode(int[] observations, out double probability)
{
    // Viterbi algorithm.

    int T = observations.Length;
    int states = States;
    int minState;
    double minWeight;
    double weight;

    int[,] s = new int[states, T];
    double[,] a = new double[states, T];

    // Base
    for (int i = 0; i < states; i++)
    {
        a[i, 0] = (-1.0 * System.Math.Log(pi[i])) - System.Math.Log(B[i, observations[0]]);
    }

    // Induction
    for (int t = 1; t < T; t++)
    {
        for (int j = 0; j < states; j++)
        {
            minState = 0;
            minWeight = a[0, t - 1] - System.Math.Log(A[0, j]);

            for (int i = 1; i < states; i++)
            {
                weight = a[i, t - 1] - System.Math.Log(A[i, j]);

                if (weight < minWeight)
                {
                    minState = i;
                    minWeight = weight;
                }
            }

            a[j, t] = minWeight - System.Math.Log(B[j, observations[t]]);
            s[j, t] = minState;
        }
    }

    // Find minimum value for time T-1
    minState = 0;
    minWeight = a[0, T - 1];

    for (int i = 1; i < states; i++)
    {
        if (a[i, T - 1] < minWeight)
        {
            minState = i;
            minWeight = a[i, T - 1];
        }
    }

    // Trackback
    int[] path = new int[T];
    path[T - 1] = minState;

    for (int t = T - 2; t >= 0; t--)
        path[t] = s[path[t + 1], t + 1];

    probability = System.Math.Exp(-minWeight);
    return path;
}
```
Learning
The third and last problem is that of learning the most likely parameters of a model given a set of sequences originated from the system being modeled. Most implementations I have seen did not consider the problem of learning from a set of sequences, but only from a single sequence at a time. The algorithm below, however, is fully suitable for learning from a set of sequences, and also uses scaling, which is another thing I have not seen in other implementations.
The source code follows the original algorithm by Rabiner (1989). There are, however, some known issues with the algorithms detailed in Rabiner’s paper. More information about these issues is available in a later section of this article entitled “Remarks”.
```csharp
/// <summary>
///   Runs the Baum-Welch learning algorithm for hidden Markov models.
/// </summary>
/// <remarks>
///   Learning problem. Given some training observation sequences O = {o1, o2, ..., oK}
///   and general structure of HMM (numbers of hidden and visible states), determine
///   HMM parameters M = (A, B, pi) that best fit training data.
/// </remarks>
/// <param name="observations">An array of observation sequences to be used to train the model.</param>
/// <param name="iterations">
///   The maximum number of iterations to be performed by the learning algorithm. If
///   specified as zero, the algorithm will learn until convergence of the model average
///   likelihood respecting the desired limit.</param>
/// <param name="tolerance">
///   The likelihood convergence limit L between two iterations of the algorithm. The
///   algorithm will stop when the change in the likelihood for two consecutive iterations
///   has not changed by more than L percent of the likelihood. If left as zero, the
///   algorithm will ignore this parameter and iterate over the fixed number of iterations
///   specified by the previous parameter.</param>
/// <returns>The average log-likelihood for the observations after the model has been trained.</returns>
public double Learn(int[][] observations, int iterations, double tolerance)
{
    if (iterations == 0 && tolerance == 0)
        throw new ArgumentException("Iterations and limit cannot be both zero.");

    // Baum-Welch algorithm.
    // The Baum-Welch algorithm is a particular case of a generalized expectation-maximization
    // (GEM) algorithm. It can compute maximum likelihood estimates and posterior mode estimates
    // for the parameters (transition and emission probabilities) of an HMM, when given only
    // emissions as training data.

    // The algorithm has two steps:
    //  - Calculating the forward probability and the backward probability for each HMM state;
    //  - On the basis of this, determining the frequency of the transition-emission pair values
    //    and dividing it by the probability of the entire string. This amounts to calculating
    //    the expected count of the particular transition-emission pair. Each time a particular
    //    transition is found, the value of the quotient of the transition divided by the probability
    //    of the entire string goes up, and this value can then be made the new value of the transition.

    int N = observations.Length;
    int currentIteration = 1;
    bool stop = false;

    double[] pi = Probabilities;
    double[,] A = Transitions;

    // Initialization
    double[][,,] epsilon = new double[N][,,]; // also referred to as ksi or psi
    double[][,] gamma = new double[N][,];

    for (int i = 0; i < N; i++)
    {
        int T = observations[i].Length;
        epsilon[i] = new double[T, States, States];
        gamma[i] = new double[T, States];
    }

    // Calculate initial model log-likelihood
    double oldLikelihood = Double.MinValue;
    double newLikelihood = 0;

    do // Until convergence or max iterations is reached
    {
        // For each sequence in the observations input
        for (int i = 0; i < N; i++)
        {
            var sequence = observations[i];
            int T = sequence.Length;
            double[] scaling;

            // 1st step - Calculating the forward probability and the
            //            backward probability for each HMM state.
            double[,] fwd = forward(observations[i], out scaling);
            double[,] bwd = backward(observations[i], scaling);

            // 2nd step - Determining the frequency of the transition-emission pair values
            //            and dividing it by the probability of the entire string.

            // Calculate gamma values for next computations
            for (int t = 0; t < T; t++)
            {
                double s = 0;

                for (int k = 0; k < States; k++)
                    s += gamma[i][t, k] = fwd[t, k] * bwd[t, k];

                if (s != 0) // Scaling
                {
                    for (int k = 0; k < States; k++)
                        gamma[i][t, k] /= s;
                }
            }

            // Calculate epsilon values for next computations
            for (int t = 0; t < T - 1; t++)
            {
                double s = 0;

                for (int k = 0; k < States; k++)
                    for (int l = 0; l < States; l++)
                        s += epsilon[i][t, k, l] = fwd[t, k] * A[k, l] * bwd[t + 1, l] * B[l, sequence[t + 1]];

                if (s != 0) // Scaling
                {
                    for (int k = 0; k < States; k++)
                        for (int l = 0; l < States; l++)
                            epsilon[i][t, k, l] /= s;
                }
            }

            // Compute log-likelihood for the given sequence
            for (int t = 0; t < scaling.Length; t++)
                newLikelihood += Math.Log(scaling[t]);
        }

        // Average the likelihood for all sequences
        newLikelihood /= observations.Length;

        // Check if the model has converged or we should stop
        if (checkConvergence(oldLikelihood, newLikelihood, currentIteration, iterations, tolerance))
        {
            stop = true;
        }
        else
        {
            // 3. Continue with parameter re-estimation
            currentIteration++;
            oldLikelihood = newLikelihood;
            newLikelihood = 0.0;

            // 3.1 Re-estimation of initial state probabilities
            for (int k = 0; k < States; k++)
            {
                double sum = 0;
                for (int i = 0; i < N; i++)
                    sum += gamma[i][0, k];
                pi[k] = sum / N;
            }

            // 3.2 Re-estimation of transition probabilities
            for (int i = 0; i < States; i++)
            {
                for (int j = 0; j < States; j++)
                {
                    double den = 0, num = 0;

                    for (int k = 0; k < N; k++)
                    {
                        int T = observations[k].Length;

                        for (int l = 0; l < T - 1; l++)
                            num += epsilon[k][l, i, j];

                        for (int l = 0; l < T - 1; l++)
                            den += gamma[k][l, i];
                    }

                    A[i, j] = (den != 0) ? num / den : 0.0;
                }
            }

            // 3.3 Re-estimation of emission probabilities
            for (int i = 0; i < States; i++)
            {
                for (int j = 0; j < Symbols; j++)
                {
                    double den = 0, num = 0;

                    for (int k = 0; k < N; k++)
                    {
                        int T = observations[k].Length;

                        for (int l = 0; l < T; l++)
                        {
                            if (observations[k][l] == j)
                                num += gamma[k][l, i];
                        }

                        for (int l = 0; l < T; l++)
                            den += gamma[k][l, i];
                    }

                    // avoid locking a parameter in zero.
                    B[i, j] = (num == 0) ? 1e-10 : num / den;
                }
            }
        }

    } while (!stop);

    // Returns the model average log-likelihood
    return newLikelihood;
}
```
Using the code
Let’s suppose we have gathered some sequences from a system we wish to model. The sequences are expressed as integer arrays such as:
```csharp
int[][] sequences = new int[][]
{
    new int[] { 0,1,1,1,1,1,1 },
    new int[] { 0,1,1,1 },
    new int[] { 0,1,1,1,1 },
    new int[] { 0,1, },
    new int[] { 0,1,1 },
};
```
For us, it may be obvious that the system outputs sequences that always start with a zero and end with one or more ones. But let’s try to fit a Hidden Markov Model to those sequences.
```csharp
// Creates a new Hidden Markov Model with 2 states for
// an output alphabet of two characters (zero and one)
HiddenMarkovModel hmm = new HiddenMarkovModel(2, 2);

// Try to fit the model to the data until the difference in
// the average likelihood changes only by as little as 0.01
hmm.Learn(sequences, 0.01);
```
Once the model is trained, let’s test it to see if it recognizes some sequences:
```csharp
// Calculate the probability that the given
// sequences originated from the model
double l1 = hmm.Evaluate(new int[] { 0, 1 });       // l1 = 0.9999
double l2 = hmm.Evaluate(new int[] { 0, 1, 1, 1 }); // l2 = 0.9999

double l3 = hmm.Evaluate(new int[] { 1, 1 });       // l3 = 0.0000
double l4 = hmm.Evaluate(new int[] { 1, 0, 0, 0 }); // l4 = 0.0000
```
Of course the model performs well, as this is a rather simple example. A more useful test case would allow for some errors in the input sequences, in the hope that the model becomes more tolerant to measurement errors.
```csharp
int[][] sequences = new int[][]
{
    new int[] { 0,1,1,1,1,0,1,1,1,1 },
    new int[] { 0,1,1,1,0,1,1,1,1,1 },
    new int[] { 0,1,1,1,1,1,1,1,1,1 },
    new int[] { 0,1,1,1,1,1 },
    new int[] { 0,1,1,1,1,1,1 },
    new int[] { 0,1,1,1,1,1,1,1,1,1 },
    new int[] { 0,1,1,1,1,1,1,1,1,1 },
};

// Creates a new Hidden Markov Model with 3 states for
// an output alphabet of two characters (zero and one)
HiddenMarkovModel hmm = new HiddenMarkovModel(2, 3);

// Try to fit the model to the data until the difference in
// the average likelihood changes only by as little as 0.0001
hmm.Learn(sequences, 0.0001);

// Calculate the probability that the given
// sequences originated from the model
double l1 = hmm.Evaluate(new int[] { 0,1 });     // 0.9999
double l2 = hmm.Evaluate(new int[] { 0,1,1,1 }); // 0.9166

double l3 = hmm.Evaluate(new int[] { 1,1 });     // 0.0000
double l4 = hmm.Evaluate(new int[] { 1,0,0,0 }); // 0.0000

double l5 = hmm.Evaluate(new int[] { 0,1,0,1,1,1,1,1,1 }); // 0.0342
double l6 = hmm.Evaluate(new int[] { 0,1,1,1,1,1,1,0,1 }); // 0.0342
```
We can see that, despite being very low, the likelihood values for the sequences containing a simulated measurement error are greater than the likelihoods for the sequences that do not follow the sequence structure at all.
In a subsequent article, we will see that those low values for the likelihoods will not be a problem because HMMs are often used in sets to form sequence classifiers. When used in such configurations, what really matters is which HMM returns the highest probability among others in the set.
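That classifier idea can be sketched as follows (in Python for brevity; the evaluation functions here are made-up stand-ins for trained models, and all names are hypothetical): train one HMM per class, evaluate a new sequence under every model, and pick the class whose model gives the highest likelihood.

```python
# Sketch of HMM-based sequence classification: one model per class,
# choose the class whose model assigns the highest likelihood.

def evaluate_class_a(sequence):
    # Stand-in for a trained model's Evaluate(); favors sequences starting with 0
    return 0.9 if sequence and sequence[0] == 0 else 0.001

def evaluate_class_b(sequence):
    # Stand-in for a second trained model; favors sequences starting with 1
    return 0.9 if sequence and sequence[0] == 1 else 0.001

def classify(sequence, models):
    # 'models' maps a class label to that class's evaluation function;
    # the predicted label is the argmax over model likelihoods
    return max(models, key=lambda label: models[label](sequence))

models = {"starts-with-0": evaluate_class_a,
          "starts-with-1": evaluate_class_b}

print(classify([0, 1, 1], models))  # -> starts-with-0
print(classify([1, 0, 0], models))  # -> starts-with-1
```

Note that only the relative ordering of the likelihoods matters here, which is why the small absolute values seen above are not a problem in practice.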
Remarks
A practical issue in the use of Hidden Markov Models to model long sequences is the numerical scaling of conditional probabilities. The probability of observing a long sequence given most models is extremely small, and the use of these extremely small numbers in computations often leads to numerical instability, making application of HMMs to genome length sequences quite challenging.
There are two common approaches to dealing with small conditional probabilities. One approach is to rescale the conditional probabilities using carefully designed scaling factors, and the other approach is to work with the logarithms of the conditional probabilities. For more information on using logarithms please see the work entitled “Numerically Stable Hidden Markov Model Implementation”, by Tobias P. Mann.
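As a rough sketch of the logarithm-based approach (in Python, with made-up model parameters; this is not the scaled C# implementation shown above), the forward recursion can be carried out entirely in log space using the log-sum-exp trick:

```python
import math

def log_sum_exp(values):
    # Numerically stable log(sum(exp(v) for v in values))
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_forward(observations, log_pi, log_A, log_B):
    # Forward algorithm carried out entirely in log space.
    # Returns log P(observations | model).
    n = len(log_pi)
    alpha = [log_pi[i] + log_B[i][observations[0]] for i in range(n)]
    for o in observations[1:]:
        alpha = [log_sum_exp([alpha[j] + log_A[j][i] for j in range(n)]) + log_B[i][o]
                 for i in range(n)]
    return log_sum_exp(alpha)

# Illustrative 2-state model (values made up for this sketch)
def log_m(m):
    return [[math.log(x) for x in row] for row in m]

log_A = log_m([[0.7, 0.3], [0.4, 0.6]])
log_B = log_m([[0.9, 0.1], [0.2, 0.8]])
log_pi = [math.log(0.6), math.log(0.4)]

# Even a 1000-step sequence yields a usable log-likelihood
# instead of underflowing to zero
print(log_forward([0, 1] * 500, log_pi, log_A, log_B))
```

Because only log-likelihoods are ever stored, very long sequences do not underflow, at the cost of a log-sum-exp per recursion step instead of a plain sum.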
Known issues
The code in this article is based on the tutorial by Rabiner. There are, however, some problems with the scaling and other algorithms. An erratum describing all issues is available on the website “An Erratum for ‘A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition’”, maintained by Ali Rahimi. I have not yet verified whether the implementation presented here suffers from the same mistakes explained there. This code has worked well in many situations, but I cannot guarantee its correctness. Please use it at your own risk.
Acknowledgements
Thanks to Guilherme C. Pedroso for the help with the Baum-Welch generalization for multiple input sequences. He has also co-written a very interesting article using hidden Markov models for gesture recognition, entitled “Automatic Recognition of Finger Spelling for LIBRAS based on a Two-Layer Architecture”, published at the 25th Symposium on Applied Computing (ACM SAC 2010).
References
- L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE, Vol. 77, No. 2, pp. 257–286, Feb. 1989.
- N. D. Warakagoda, "A Hybrid ANN-HMM ASR system with NN based adaptive preprocessing," Norges Tekniske Høgskole, 2006.
- A. Rahimi, "An erratum for: A tutorial on hidden Markov models and selected applications in speech recognition." Available at: http://xenia.media.mit.edu/rahimi/rabiner/rabiner-errata/rabiner-errata.html
- T. P. Mann, "Numerically stable hidden Markov model implementation." Available at: http://www.phrap.org/compbio/mbt599/hmm_scaling_revised.pdf
- Wikipedia contributors, "Hidden Markov model," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Hidden_Markov_model (accessed March 2010).
Hello Cesar, and thanks a lot for this great piece of information and your crystal clear code! Can you please give me some information about how your HiddenMarkovModel class could be helpful in a speech recognition project (speaker independent)?
Thanks again!
Hi Mad,
I am glad you found it useful. For the field of speech recognition perhaps it would be better to use Continuous-density hidden Markov models instead (which could be trained directly on Cepstrum coefficients of spoken words). Please note that I am not very familiar with the field, but I believe this is one of the most popular approaches for this problem.
Continuous-density models are already available in the latest version of the Accord.NET Framework. Please note, however, that a major refactoring is planned for the HMM namespace in order to allow for training algorithms other than Baum-Welch. Thus the class usage may change a little in a future release of the framework. I hope you also find it useful in your research.
Happy new year,
César
Hey Cesar, I have implemented an MFCC to get cepstral coefficients, but the results are like “4.852438240110826, 6.716542468921553, 6.289309087157678, 5.6106505660776484, …”. These vectors of decimal values are not accepted by the continuous HMM in the Accord.NET Framework (I understand values must be like 4, 5, 6, 7, …) in order to be trained correctly. Is there something I am missing, or should I normalize these values into something “continuous” in order to pass them directly to the CHMM?
Also, why are you using 2 states in your continuous HMM classifier example? How can I find the correct number of states needed?
Thanks for your time.
Hi John,
You mentioned “decimal” values. Are you using the “decimal” type from .NET? In this case, please use doubles instead of decimals. In continuous-density HMMs there is no need to discretize your values into integers as you mentioned, so you can just feed a double[] (or double[][]) directly.
By the way, the code on this page is for discrete-density models only. Please make sure you are using the latest version of the Accord.NET Framework, which has classes for continuous-density models.
And finally, if you can wait a little, a new version of the Accord.NET will have the HMM namespace redesigned to allow easier creation of specific model architectures, such as ergodic or forward-only models which are commonly used in speech recognition problems.
Feel free to send me an email if you wish to take a look on the latest version before it is officially released.
Best regards,
César
Hi
I’m not really sure I understand the reason for picking two and three states, respectively, in the “Usage of code” section!
Hello Cesar,
Thanks for your effort on this library; it is really very helpful.
I have some questions and hope to get answers or any guiding information from you. I don’t know what the third parameter in the HiddenMarkovClassifier constructor is. Is it the density for the observations?
I am trying to use the HMM classifier for classifying sequences of image frames, but the misclassification rate is too high even when testing with the same training data using a NormalDistribution, so how can I fit the observed feature vectors to a distribution? I would also like to make use of information about the possible hidden states (visemes) for the sequences, and fit this information to the best distribution instead of using a Forward topology that generates a uniform distribution. Please provide me with any clue or guiding info.
Best regards
Hi there,
Are you using the complete version of the Accord.NET Framework? It contains the latest bug fixes and enhancements and will most likely perform better. The framework is available at
http://code.google.com/p/accord/
Regarding your question about the parameters, please take a look at the code examples given in the Accord.NET documentation for the continuous models:
http://accord.googlecode.com/svn/docs/html/T_Accord_Statistics_Models_Markov_Learning_HiddenMarkovClassifierLearning_1.htm
The third parameter is your initial guess for the observation probability densities. In the case of multivariate observations, it should be initialized carefully. A good approach to perform this initialization is to use K-means or segmental K-means: cluster all observations, then use the cluster information to initialize the states.
This will most likely be available in a future version of the framework.
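The clustering-based initialization mentioned above can be sketched in a few lines. This is a minimal, language-agnostic illustration (plain Python with made-up one-dimensional data, not Accord.NET code): cluster the pooled observations, then use each cluster’s mean and variance as the initial guess for one state’s Normal emission density.

```python
import random

def kmeans_1d(xs, k, iters=20, seed=0):
    """Tiny 1-D k-means, used only to illustrate density initialization."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    for _ in range(iters):
        # Assign every observation to its nearest center
        clusters = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            clusters[j].append(x)
        # Move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Pool all observations from all training sequences, cluster them,
# and take each cluster's (mean, variance) as one state's initial density.
observations = [0.1, 0.2, 0.15, 4.9, 5.1, 5.0, 0.05, 4.8]
centers, clusters = kmeans_1d(observations, k=2)
init = [(sum(c) / len(c),
         sum((x - sum(c) / len(c)) ** 2 for x in c) / len(c))
        for c in clusters]
print(init)  # one (mean, variance) pair per hidden state
```

In a multivariate setting the same idea applies, with per-cluster mean vectors and covariance matrices instead of scalar means and variances.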
Best regards,
Cesar
Thanks Cesar for your reply.
I had been through the documentation before, but it wasn’t very clear to me that the third parameter is the observation probability density.
I am using the latest version of Accord. I successfully fit my observations to a mixture of multivariate Gaussians, but I have a problem when using it in HiddenMarkovClassifierLearning with BaumWelchLearning; I received an exception: “Covariance matrix is not positive definite. Try specifying a regularization constant in the fitting options.” So where should I specify this regularization constant? And why is it needed?
Another question that I asked before in my previous post, but wasn’t clear enough: if I know the sequences of states of the training data, can I pass them to the HiddenMarkovClassifier in some way, instead of using “new Forward(nstates)”?
Thanks a lot and sorry for bothering you.
Hi CECUEG,
Thanks for the note. I will try to make the documentation more clear.
About the positive-definite issue: sometimes the data is either insufficient or contains a constant variable, so the estimated covariance matrix may become non-positive definite. To sidestep this issue, one adds a small regularization parameter (a scalar value) to the diagonal elements of the covariance matrix. To use this parameter, consider the following example:
// Configure the learning algorithms to train the sequence classifier
var teacher = new HiddenMarkovClassifierLearning(classifier,

    // Train each model until the log-likelihood changes less than 0.0001
    modelIndex => new BaumWelchLearning(classifier.Models[modelIndex])
    {
        Tolerance = 0.0001,
        Iterations = 0,

        FittingOptions = new NormalOptions()
        {
            Diagonal = true,       // only diagonal covariance matrices
            Regularization = 1e-5  // avoid non-positive definite errors
        }
    }
);
In the example above, the regularization is passed as a FittingOptions object for the underlying Normal distributions.
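To see why the regularization works, consider a tiny numeric sketch (plain Python with made-up numbers, independent of Accord.NET): a constant feature produces a zero row and column in the sample covariance, making the matrix singular, and adding a small constant to the diagonal (cov + eps * I) restores positive definiteness.

```python
# A constant feature yields zero variance, so the sample covariance
# matrix is singular and therefore not positive definite.
cov = [[2.0, 0.0],
       [0.0, 0.0]]  # second variable has zero variance

def is_positive_definite(m):
    """Sylvester's criterion for a symmetric 2x2 matrix:
    both leading principal minors must be strictly positive."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return m[0][0] > 0 and det > 0

print(is_positive_definite(cov))  # False

# Regularization: add a small constant to the diagonal (cov + eps * I).
eps = 1e-5
regularized = [[cov[0][0] + eps, cov[0][1]],
               [cov[1][0], cov[1][1] + eps]]
print(is_positive_definite(regularized))  # True
```

This is exactly what the Regularization option above does internally for each state’s estimated Gaussian.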
I will work on the documentation to make those steps more clear. Thanks for letting me know about it.
About your last question: if you have the state paths for the training sequences, then you could use a supervised learning algorithm instead of Baum-Welch. Unfortunately one is not available in the framework at the current time, but it may be in future versions.
Best regards,
Cesar
César Souza, thank you for your article!
But where is the code for the “backward” method?
The version in Accord.NET does not use scaling coefficients and has many dependencies.
Could you post it here?
Hi Kvanttt!
The “backward” algorithm can be seen in the file ForwardBackwardAlgorithm.cs, which contains the code both with and without scaling coefficients. For the code with scaling coefficients, check line 296; for the code without scaling, take a look at line 335. Those methods have only a single dependency: a hidden Markov model object passed as a parameter. But please note that the HMM object is simply a container for the transition matrix A, the emission matrix B, and the initial probabilities vector pi. Those are extracted in the first lines of the algorithm, so if you wish, you can pass them directly if you do not want to depend on the other parts of the Accord.NET Framework!
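For reference, the unscaled backward recursion that the file implements can be sketched in a few lines. This is a minimal Python sketch with made-up toy parameters, not the Accord.NET C# code: beta[T-1] is all ones, and each earlier step sums over the transitions into the next observation.

```python
def backward(A, B, pi, obs):
    """Unscaled backward pass: beta[t][i] = P(obs[t+1..T] | state i at time t)."""
    n, T = len(A), len(obs)
    beta = [[0.0] * n for _ in range(T)]
    beta[T - 1] = [1.0] * n  # base case: empty suffix has probability 1
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n))
    return beta

# Toy 2-state model over a binary alphabet (made-up numbers).
A = [[0.7, 0.3], [0.4, 0.6]]   # transition matrix
B = [[0.9, 0.1], [0.2, 0.8]]   # emission matrix
pi = [0.6, 0.4]                # initial probabilities
obs = [0, 1, 1]

beta = backward(A, B, pi, obs)
# The sequence likelihood recovered from the backward variables:
likelihood = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(len(A)))
print(likelihood)
```

The same likelihood falls out of the forward pass, which is how the forward and backward variables are checked for consistency.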
Hope it helps!
Best regards,
Cesar
Hi Cesar,
I am using the latest version of Accord.NET (v2.13), released on 30.08.2014. There is no method like “hmm.Learn” in it; they have made separate libraries for learning, which are difficult for me to understand. Somehow I came up with this code:
”
static void Main(string[] args)
{
    int[][] sequences = new int[][]
    {
        new int[] { 0,1,1,1,1,1,1 },
        new int[] { 0,1,1,1 },
        new int[] { 0,1,1,1,1 },
        new int[] { 0,1 },
        new int[] { 0,1,1 },
    };

    // Creates a new Hidden Markov Model with 2 states for
    // an output alphabet of two characters (zero and one)
    HiddenMarkovModel hmm = new HiddenMarkovModel(2, 2);

    BaumWelchLearning learn = new BaumWelchLearning(hmm);
    learn.Run(sequences);

    // Calculate the probability that the given
    // sequences originated from the model
    double l1 = hmm.Evaluate(new int[] { 0, 1, 1, 1, 1, 1, 1 }); // l1 = 0.9999
    double l2 = hmm.Evaluate(new int[] { 0, 1, 1, 1 });          // l2 = 0.9999
    double l3 = hmm.Evaluate(new int[] { 1, 1 });                // l3 = 0.0000
    double l4 = hmm.Evaluate(new int[] { 1, 0, 0, 0 });          // l4 = 0.0000

    Console.WriteLine(
        "A :{0}\t B :{1}\t C :{2}\t D :{3}", l1, l2, l3, l4);
    Console.ReadLine();
}
”
The results are awful. Please guide me on how to use the new version. I am doing my FYP and I really need help, as I have to submit it in 10 days and I am really tense. Please guide me in detail; I will be really, really thankful to you. Thank you.
Best Regards,
Shehroz
P.S,
Results are
A:0 B:0 C:-2.01642771949625E+28 D:-8.06571087798499E+28
Hi Sheroz!
The results that you are getting are the expected values; it is just that they are being reported as log-probabilities rather than likelihoods. Just apply the Math.Exp() function to your values and you will recover values in the 0…1 range. For example:
A: Math.Exp(0) = 1
B: Math.Exp(0) = 1
C: Math.Exp(-2.01642771949625E+28) = 0
D: Math.Exp(-8.06571087798499E+28) = 0
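The conversion above can be checked numerically; here is the same exponentiation in a small Python sketch (Math.Exp in C# behaves the same way as math.exp here), using the log-likelihood values from the results above:

```python
import math

# Evaluate() in newer Accord.NET versions returns log-likelihoods;
# exponentiate to recover probabilities in the 0..1 range.
log_likelihoods = {
    "A": 0.0,
    "B": 0.0,
    "C": -2.01642771949625e+28,
    "D": -8.06571087798499e+28,
}
probabilities = {k: math.exp(v) for k, v in log_likelihoods.items()}
print(probabilities)  # {'A': 1.0, 'B': 1.0, 'C': 0.0, 'D': 0.0}
```

The extremely negative values underflow to exactly 0.0, which is why log-probabilities are used in the first place: they stay representable where raw probabilities would vanish.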
Hope it helps! If you need further assistance, please ask!
Best regards,
Cesar
Thank you Very much Cesar it worked.
One more thing I need to ask: as I have a big data set, I can’t do the learning on every run, so is there any method to save the learned model so that I don’t have to train again and again?
Hope to see your reply soon.
Once again Thank you 🙂
Hi Shehroz!
Sure! After you perform the learning, you can save the hidden Markov model to disk using the .Save method of the HiddenMarkovModel class. The documentation page for the hidden Markov model’s Save method is given here; all you have to do is pass the path on disk where you would like the Markov model to be stored. Afterwards, you can load the model again by using the HiddenMarkovModel.Load method.
Hope it helps!
Best regards,
Cesar
Hello Cesar,
Thank you for your excellent articles and code.
I was wondering if it’s also possible to build Hierarchical Hidden Markov Models with the Accord framework and if so, how to go about it?
Thanks in advance,
-Maarten
Can you send me the project file?
Hello Cesar,
Thank you for your excellent Accord.NET project.
I have a question: can your hidden Markov model be applied to regime-switching identification for time series? Such as
http://blogs.mathworks.com/pick/2011/02/25/markov-regime-switching-models-in-matlab/
Can I use your model to reproduce the above results, and how?
Thanks.
Jack