## The Viterbi Algorithm

Jan. 26, 2018

The Viterbi algorithm is a dynamic programming algorithm that efficiently computes the the most likely states of the latent variables of a Hidden Markov Model (HMM), given an observed sequence of emissions. The co-founder of Qualcomm and University of Southern California alum Andrew Viterbi (go Trojans!) proposed the algorithm in 1967. This post will gradually work from theory to a Python implementation.

### An Introduction to Hidden Markov Models

Hidden Markov Models (HMMs) model a system of discrete temporal unobserved (hidden) variables and discrete temporal observed variables. The observed and unobserved variables are related through emission probabilities.

The diagram above depicts the relations between hidden variables (states) \( \{z_1, z_2, \dots, z_m \} \) and observed emissions \( \{ x_1, x_2, \dots, x_m \} \). At each timestep \( t \) we only observe \( x_t \). We are given the conditional probability distribution \( P(x_t | z_t) \), which gives the probability of an observation, given the hidden state. We are also given the conditional probability distribution \( P(z_t | z_{t - 1}) \) which models hidden state transitions over discrete timesteps.

Formally, an HMM consists of the following:

A set of states \( Q = \{q_1, q_2, \dots, q_n\} \), where \( n \) is the number of states.

A transition probability matrix \( A \), where each \( a_{ij} \) represents the probability of transitioning from state \( q_i \) to state \( q_j \). A constraint is also enforced such that \( \forall i, \sum_{j=1}^n a_{ij} = 1 \), that is, the rows of \( A \) are well formed probability distributions. $$ A = \begin{bmatrix} a_{11} & \dots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \dots & a_{nn} \end{bmatrix} $$

A set of possible observations \( V = \{v_1, v_2, \dots, v_o\} \), where \( o \) is the number of unique observations.

An observed sequence of emissions \( X = \{ x_1, x_2, \dots, x_m \} \) where \( m \) is the number of timesteps.

An emission probability matrix \( B \), where each \( b_{ij} \) represents the probability of state \( q_i \) emitting the observation \( x_j \), such that \( \forall i, \sum_{j=1}^m b_{ij} = 1 \). $$ B = \begin{bmatrix} b_{11} & \dots & b_{1m} \\ \vdots & \ddots & \vdots \\ b_{n1} & \dots & b_{nm} \end{bmatrix} $$

A start state \( q_0 \) or a probability distribution over start states \( \pi = \{ \pi_0, \pi_1, \dots, \pi_n \} \), such that \( \sum_{i=1}^n \pi_i = 1 \). For simplicity we will assume the start state is known, that is, we know \( q_0 \) ahead of time.

Notice that with this information we can use Bayes' rule to extract the probability distribution of \( P(z_t | x_t) \) as

$$ P(z_t | x_t) = \frac{P(x_t | z_t) P(z_t)}{P(x_t)}. $$

We can rewrite the terms in the above equation in terms of the variables given for our HMM.

$$ \begin{equation} P(x_t | z_t) = b(q, x_t), \,\, \text{where $q$ is the state of hidden variable $z_t$.} \\ \end{equation} $$

$$ \begin{equation} P(z_t) = \sum_{q' \in Q} a(q', q) P(q', x_{t-1}), \,\, \text{where $q$ is the state of hidden variable $z_t$.} \end{equation} $$

Now we can rewrite this in terms of the variables defined above

$$ P(q | x_t) = \frac{b(q, x_t) \sum_{q' \in Q} a(q', q) P(q', x_{t-1})}{\sum_{q' \in Q} b(q', x) \sum_{q'' \in Q} a(q'', q') P(q'', x_{t-1})}. $$

That denominator term is complicated, but it turns out that we don't actually need it to calculate the most likely path, as we can model the joint distribution \( P(q, x_t) \). An alternative way of thinking about it is comparing the terms \( P(q_a | x_t) \) and \( P(q_b | x_t) \), according to Bayes' rule, they must have the same denominator \( P(x_t) \). This means dropping the denominator will not change the result of the comparison between the two terms. Dropping the denominator means we are comparing the two joint probabilities \( P(q_a, x_t) \) and \( P(q_b, x_t) \). We'll see in the next section that this is precisely the way Viterbi models the probability of state and emission.

### The Viterbi Algorithm

Naively computing the most likely path requires considering all possible paths through the state lattice and evaluating probabilities for each path. The problem with this approach is the exponential growth in the number of possible paths with respect to the number of timesteps \( m \).

Instead, the Viterbi algorithm computes incremental path probabilities, working from left to right. Consider a timestep \( t \) and a state \( q \), in order to find the most probable path through \( q \) at \( t \) we only require the path of highest probability that reaches \( q \). All other paths to \( q \) at time \( t \) can be discarded. This is because HMMs satisfy the Markov property, that is, the state at timestep \( t + 1 \) depends only on timestep \( t \), and is independent of timesteps \( \{1, 2, \dots, t - 1 \} \).

To incrementally calculate these probabilities, the Viterbi algorithm creates two in memory matrices \( P \) and \( \text{Back} \). \( P \) lives in \( \mathbb{R}^{n \times m} \). The elements of \( P \) are the probabilities of a state at a given point in time, and \( \text{Back} \) represents backpointers to the most likely previous state.

This implementation of Viterbi was sourced from Ron Artstein, CSCI 544 – Applied Natural Language Processing, Spring 2018, Written Homework 2.

### Implementing Viterbi in Python

Implementing Viterbi is now relatively straightforward. Since we know how the number of states \( m \) and the number of timesteps \( T \), we can allocate arrays ahead of time using numpy's array initialization methods. The only thing left to do is implement a `backtrace`

function that traces backwards through the `back`

array and constructs the final predictions for \( \{z_1, z_2, \dots, z_m\} \), and places the results in `z_pred`

.

```
import numpy as np
def backtrace(z_pred, back, T, labels):
prev_cursor = z_pred[-1]
for m in np.arange(T)[::-1]:
prev_cursor = back[prev_cursor, m]
z_pred[m] = prev_cursor
if labels is None:
return z_pred
return [labels[z] for z in z_pred]
def viterbi(q_size, A, X, B, q_init, labels=None):
Q = np.arange(q_size)
T = len(X)
p = np.zeros(shape=(len(Q), T))
back = np.zeros(shape=(len(Q), T), dtype=np.int)
z_pred = np.zeros(shape=(T, ), dtype=np.int)
for q in Q:
p[q, 0] = A[q_init, q] * B[q, X[0]]
back[q, 0] = q_init
for t in np.arange(T)[1:]:
for q in Q:
p[q, t] = np.max([p[qp, t - 1] * A[qp, q] * B[q, X[t]] for qp in Q])
back[q, t] = np.argmax([p[qp, t - 1] * A[qp, q] for qp in Q])
z_pred[T - 1] = np.argmax([p[q, T - 1] for q in Q])
return backtrace(z_pred, back, T, labels)
```

After writing the algorithm, a good next step might be finding the most probable state at the next timestep \( z_{m + 1} \). Also left out of this blog post are the forward-backward algorithm and extensions of HMMs. If mistakes are found, please comment in the section below.