Dynamic programming has always been one of my favorite topics, so I thought I'd spend some time visualizing the transformation of an exponential-time algorithm into a polynomial-time algorithm. A lot of dynamic programming algorithms are fairly straightforward translations of recursions via memoization. The real key, it turns out, is getting the recursion right.
The Viterbi Algorithm and Hidden Markov Models
A Markov model is a probabilistic model that can be imagined with the help of a state machine. Each state has some probability of emitting an output. After an output, each state then has some probability of transitioning to other states. You can then construct a stream of data by repeatedly querying this model. Of course, you can also go in the opposite direction and attempt to explain some series of data using a Markov model. In this case, you'll need to construct the most probable series of states for a given series of observations. How can you do this?
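To make that concrete, here is a minimal sketch (not from the original post) of generating a stream of observations from such a model. The two-state, three-output model, its probability tables, and the rand_choice helper are all made up for illustration.

#include <stdio.h>
#include <stdlib.h>

#define N_STATES  2
#define N_OUTPUTS 3

/* Hypothetical model parameters; each row sums to 1. */
static const float trans[N_STATES][N_STATES] = {{0.7f, 0.3f}, {0.4f, 0.6f}};
static const float emit[N_STATES][N_OUTPUTS] = {{0.5f, 0.4f, 0.1f}, {0.1f, 0.3f, 0.6f}};

/* Draw an index from a discrete distribution. */
static int rand_choice(const float *dist, int n){
    float r = (float)rand() / (float)RAND_MAX, acc = 0.0f;
    for(int i = 0; i < n; i++){
        acc += dist[i];
        if(r <= acc) return i;
    }
    return n - 1;
}

int main(void){
    int state = 0; /* start in state 0 for simplicity */
    for(int t = 0; t < 10; t++){
        int output = rand_choice(emit[state], N_OUTPUTS); /* emit an output from the current state */
        printf("t=%d state=%d output=%d\n", t, state, output);
        state = rand_choice(trans[state], N_STATES);      /* then transition to the next state */
    }
    return 0;
}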
The Viterbi algorithm provides us with a recursion that we can use to figure out the most likely series of transitions. The recursion is based on V_{t,k}, the probability of the most probable state sequence that accounts for the first t observations and ends at state k. From the Wikipedia entry:
V_{t,k} = \Pr[\, y_t \mid k \,] \cdot \max_{x \in \text{states}} \left( a_{x,k} \, V_{t-1,x} \right)
Where y_t is the t-th observation, and a_{x,k} is the probability of transitioning from state x to state k. The base case is V_{1,k} = \Pr[\, y_1 \mid k \,] \cdot \pi_k, where \pi_k is the prior probability of starting in state k (this is the pi(k) call in the code below). As you can see, this gives a recursion which can be solved directly.
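As a quick worked example with made-up numbers, take two states, V_{1,1} = 0.30, V_{1,2} = 0.12, a_{1,1} = 0.7, a_{2,1} = 0.4, and \Pr[\, y_2 \mid 1 \,] = 0.5. One step of the recursion then gives:

V_{2,1} = \Pr[\, y_2 \mid 1 \,] \cdot \max\left( a_{1,1} V_{1,1},\; a_{2,1} V_{1,2} \right) = 0.5 \cdot \max(0.21,\; 0.048) = 0.105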
Solving the Recursion Directly
Okay, so how does a direct solution look in the code?
float p_observations_to_state(int t, int k){
    /* P_observations_to_state (V_t_k) := Pr(the most probable sequence for
       t observations which arrive at state k) */
    visit(t, k);
    float ret;
    if(t == 1){ // base case
        ret = p(obs(0), k) * pi(k);
    }else{ // recursion
        float max_p_rec = -1;
        for(int x = 0; x < states(); x++){
            float p_x_to_k = transition(x, k) * p_observations_to_state(t-1, x);
            if(p_x_to_k > max_p_rec)
                max_p_rec = p_x_to_k;
        }
        ret = p(obs(t-1), k) * max_p_rec;
    }
    return ret;
}
Great! So what's going on here? First of all, I use a bunch of function calls to actually compute the transition and emission probabilities. That's fine. What else is happening? I make a visit() call so that I can track the runtime of the algorithm. You can also see the exponential nature of the recursion more directly once it is translated into code: each call actually generates K recursive calls, one for each state in the state space of the Markov model.
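For context, the helper functions aren't shown here. One plausible way to back them, assuming the 5-state, 10-observation setup from the plot and simple lookup tables (all of the names and table contents below are made up), is:

#define N_STATES  5
#define N_OBS     10
#define N_OUTPUTS 3

/* Hypothetical model parameters and observation sequence. */
static float trans_table[N_STATES][N_STATES];  /* trans_table[x][k] = P(x -> k)      */
static float emit_table[N_STATES][N_OUTPUTS];  /* emit_table[k][y]  = P(emit y | k)  */
static float pi_table[N_STATES];               /* pi_table[k]       = P(start in k)  */
static int   observations[N_OBS];              /* the observed output sequence       */

int   states(void)             { return N_STATES; }
int   obs(int t)               { return observations[t]; }   /* t-th observation             */
float pi(int k)                { return pi_table[k]; }       /* prior probability of state k */
float transition(int x, int k) { return trans_table[x][k]; } /* P(transition x -> k)         */
float p(int y, int k)          { return emit_table[k][y]; }  /* P(observe y | state k)       */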
What does this actually look like in practice? Let's imagine each function call of this algorithm as a pair of integers (t, k). How many times is each pair visited? Are pairs being visited many times? Or is there some fundamental reason for the exponential recursion that cannot be fixed with memoization? For a Markov model with 5 states and a series of 10 observations, I'll plot the number of times each pair (t, k) is visited.
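The visit() bookkeeping isn't shown either. A minimal sketch, continuing the tables above and assuming a global counter array plus a driver that asks for V_{T,k} for every final state, could be:

static long visit_counts[N_OBS + 1][N_STATES];  /* visit_counts[t][k] = times (t,k) was entered */

void visit(int t, int k){
    visit_counts[t][k]++;
}

/* Evaluate the recursion for every final state, then dump the per-pair counts. */
void count_visits(void){
    for(int k = 0; k < states(); k++)
        p_observations_to_state(N_OBS, k);
    for(int t = 1; t <= N_OBS; t++)
        for(int k = 0; k < states(); k++)
            printf("t=%d k=%d visits=%ld\n", t, k, visit_counts[t][k]); /* needs <stdio.h> */
}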
As you can see, pairs on the left edge of the graph, which comprise the "base cases" of the recursion, are visited very frequently. Pairs to the right are visited less frequently, with the counts shrinking by roughly a factor of 5 at each step. Of course, this isn't too surprising, given the implementation. What it does indicate, though, is that there is a lot to be gained from memoization. Why? Well, the pairs to the right are generating a whole slew of calls to the left, so memoization isn't only eliminating a bunch of recomputation in the "body" of the recursive calls; it's actually eliminating further recursive calls. Essentially, it is going to reduce the number of times the question "What is V_{t,k}?" is even asked. This is an important effect that really changes the runtime of the algorithm.
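To put a rough number on that, with K states and no caching, every evaluation of a pair at time t triggers K evaluations at time t-1, so the visit counts obey

\text{visits}(t-1) = K \cdot \text{visits}(t) \quad\Rightarrow\quad \text{visits}(1) = K^{T-1} \cdot \text{visits}(T)

which is exactly the factor-of-5 growth toward the left edge of the plot. Without caching the total work is therefore on the order of K^T; with caching there are only T \cdot K distinct subproblems, each doing O(K) work, for O(T K^2) total.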
Memoization and Dynamic Programming
Oftentimes, dynamic programming is presented as a technique involving matrices, operations updating cells, and so on. That's great, but it kind of obscures what I believe to be the real underlying principle, namely memoization. The general idea is to look at the "call graph" of a recursive algorithm and to ask: well, if I cache results, does this thin out the call graph significantly?
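As a standalone illustration of that idea (independent of the Viterbi example), caching the results of a naive recursion is often just a one-table change. The classic Fibonacci recursion is a minimal sketch of the pattern:

/* Naive recursion: the call graph has on the order of 2^n nodes. */
long fib_naive(int n){
    if(n < 2) return n;
    return fib_naive(n - 1) + fib_naive(n - 2);
}

/* Memoized recursion: each value of n is computed once, so the call
   graph thins out to O(n) distinct evaluations. */
long fib_memo(int n){
    static long cache[92];      /* -1 means "not computed yet"; fits a long up to n = 91 */
    static int initialized = 0;
    if(!initialized){
        for(int i = 0; i < 92; i++) cache[i] = -1;
        initialized = 1;
    }
    if(n < 2) return n;
    if(cache[n] != -1) return cache[n];
    cache[n] = fib_memo(n - 1) + fib_memo(n - 2);
    return cache[n];
}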
First let's see how memoization changes our implementation:
float p_observations_to_state(int t, int k){
    /* P_observations_to_state (V_t_k) := P(the most probable sequence for
       t observations which arrive at state k) */
    visit(t, k);
    if(single.memoizing && single.memoized[t][k] != -1){
        return single.memoized[t][k];
    }
    float ret;
    if(t == 1){ // base case
        ret = p(obs(0), k) * pi(k);
    }else{ // recursion
        float max_p_rec = -1;
        for(int x = 0; x < states(); x++){
            float p_x_to_k = transition(x, k) * p_observations_to_state(t-1, x);
            if(p_x_to_k > max_p_rec)
                max_p_rec = p_x_to_k;
        }
        ret = p(obs(t-1), k) * max_p_rec;
    }
    if(single.memoizing)
        single.memoized[t][k] = ret;
    return ret;
}
Pretty straightforward. It's a little hack-y in that I use "-1" to indicate that a value hasn't been cached yet, but I hack, so hack-y is what you get. Notice also that the visit() call occurs before the cache is consulted: this ensures that every function entry is counted, even the ones that only do a cached read. This keeps our runtime numbers honest. So how does this change the number of visits each pair receives?
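The setup of that cache isn't shown; a minimal sketch, assuming single is a global struct holding the flag and a table sized like the earlier sketches, could be:

struct {
    int   memoizing;                      /* toggle memoization on or off */
    float memoized[N_OBS + 1][N_STATES];  /* -1 marks "not computed yet"  */
} single;

void reset_memoization(int enable){
    single.memoizing = enable;
    for(int t = 0; t <= N_OBS; t++)
        for(int k = 0; k < N_STATES; k++)
            single.memoized[t][k] = -1;
}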
Woah. As you can see, the leftmost pairs are visited vastly less often than before. And this, of course, is why dynamic programming is so powerful: it can drastically reduce the cost of an algorithm without really changing the reasoning required. Another thing to note is that pairs are still being visited multiple times; however, the body of the function is only evaluated once for each pair. Why is there this discrepancy, then? Well, the pair (1, 1) is visited when any pair (2, k) is evaluated. Each of those pairs is evaluated once, and there are five such pairs, so (1, 1) will be visited five times. This leftover factor of K is harmless, though: with memoization the body runs once for each of the T*K pairs and does O(K) work per run, so the total work drops from exponential in T to O(T*K^2).
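To actually use the recursion, the probability of the best explanation of all the observations is just the best final state, as in this sketch (again building on the hypothetical helpers above). Recovering the state sequence itself would additionally require remembering which x achieved the max at each step.

/* Probability of the most probable state sequence over all N_OBS observations. */
float best_final_probability(void){
    reset_memoization(1);
    float best = -1;
    for(int k = 0; k < states(); k++){
        float v = p_observations_to_state(N_OBS, k);
        if(v > best) best = v;
    }
    return best;
}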