# Jarrod Kahn

## Part-Of-Speech Tagging

Jan. 28, 2018

A Part-of-Speech (POS) tagger is a tool that assigns a part of speech, such as verb, noun, or adjective, to each word of a sentence. This post summarizes the POS tagger introduced in 1996 by Adwait Ratnaparkhi in his seminal paper *A Maximum Entropy Model for Part-Of-Speech Tagging*, which you can find here.

This method of POS tagging uses a probabilistic model that is optimized using maximum likelihood estimation. The model is defined on a set of word and tag contexts $$\mathcal{H}$$ and the set of possible tags $$\mathcal{T}$$. We want to represent a probability distribution over all the combinations of $$h$$ and $$t$$, that is, $$\mathcal{H} \times \mathcal{T}$$. We can write this probability as

$$P(h, t) = \pi \mu \prod_{j=1}^k \alpha_j^{\,\,f_j (h, t)}$$

where $$\pi$$ is the normalization constant, $$\{\mu, \alpha_1, \alpha_2, \dots, \alpha_k \}$$ are model parameters (or weights), and $$\{ f_1, f_2, \dots, f_k\}$$ are features. Features are binary, that is, $$f_j(h, t) \in \{ 0, 1 \}$$. There is a one-to-one correspondence between the parameters $$\{ \alpha_j \}_{j=1}^k$$ and the features $$\{ f_j \}_{j=1}^k$$, so the parameters can be thought of as feature weights.
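
To make the product form concrete, here is a minimal Python sketch (the function and argument names are my own, not from the paper) of how $$P(h, t)$$ could be computed from a list of binary feature functions and their weights:

```python
def joint_probability(h, t, features, alphas, mu, pi):
    """Sketch of P(h, t) = pi * mu * prod_j alpha_j^{f_j(h, t)}.

    `features` is a list of binary feature functions f_j(h, t) -> 0 or 1,
    and `alphas` is the matching list of positive parameters alpha_j.
    """
    p = pi * mu
    for f_j, alpha_j in zip(features, alphas):
        if f_j(h, t):          # only active features contribute their weight
            p *= alpha_j
    return p
```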

To train the model, we must maximize the expression

$$L(P) = \prod_{i=1}^n P(h_i, t_i) = \prod_{i=1}^n \pi \mu \prod_{j=1}^k \alpha_j^{\,\, f_j (h_i, t_i)}$$

over a set of training data $$\{ (h_i, t_i) \}_{i=1}^n$$ with respect to the parameters of the model.
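
Ratnaparkhi fits the parameters with Generalized Iterative Scaling; purely to illustrate the objective, here is a self-contained sketch (names are my own) of the log of $$L(P)$$ over a list of training pairs:

```python
import math

def log_likelihood(data, features, alphas, mu, pi):
    """Sketch: log L(P) = sum_i log P(h_i, t_i) over training pairs (h_i, t_i)."""
    total = 0.0
    for h, t in data:
        log_p = math.log(pi) + math.log(mu)
        for f_j, alpha_j in zip(features, alphas):
            if f_j(h, t):                  # active features add log(alpha_j)
                log_p += math.log(alpha_j)
        total += log_p
    return total
```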

A feature is said to be active when $$f_j(h, t) = 1$$. In this case, the corresponding model parameter $$\alpha_j$$ contributes multiplicatively to the joint probability of history $$h$$ and tag $$t$$. A feature may activate on any word or tag in the history $$h$$. The history $$h_i$$ available to a feature is defined as

$$h_i = \{ w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2} \}$$

where $$w_i$$ denotes the word at index $$i$$ and $$t_{i-1}, t_{i-2}$$ denote the previous tags. The following example feature is copied directly from Ratnaparkhi's paper.

$$f_j (h_i, t_i) = \begin{cases} 1 & \text{if } \text{suffix}(w_i) = \text{"ing"} \text{ and } t_i = \text{VBG} \\ 0 & \text{otherwise.} \end{cases}$$
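
Written as code, this example feature might look like the following sketch, where the layout of the history $$h$$ as a dictionary is my own assumption:

```python
def f_suffix_ing_vbg(h, t):
    """Fires when the current word ends in "ing" and the proposed tag is VBG.

    `h` is assumed to be a dict holding the current word under "w_i".
    """
    return 1 if h["w_i"].endswith("ing") and t == "VBG" else 0
```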

When this feature is active for some history and tag pair $$(h_i, t_i)$$, the model parameter $$\alpha_j$$ contributes to the joint probability $$P(h_i, t_i)$$. At inference time, beam search is used to find the most likely tag sequence $$\{ t_1, t_2, \dots, t_n \}$$ for a sentence $$\{ w_1, w_2, \dots, w_n \}$$. Beam search uses the conditional tag probability

$$P(t \mid h) = \frac{P(h, t)}{\sum_{t' \in \mathcal{T}} P(h, t')}.$$
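
As a sketch, this normalization can be computed by summing the joint probability over the tag set, assuming some function `joint(h, t)` that returns $$P(h, t)$$ (hypothetical names again):

```python
def conditional_probability(h, t, tags, joint):
    """Sketch: P(t | h) = P(h, t) / sum over t' of P(h, t').

    `joint(h, t)` is any function returning the joint probability and
    `tags` is the tag set T.
    """
    z = sum(joint(h, t_prime) for t_prime in tags)
    return joint(h, t) / z
```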

Beam search is thus approximating the expression

$$\arg \max_{T \in \mathcal{T}^n} P(T \mid w_1, w_2, \dots, w_n) = \arg \max_{T \in \mathcal{T}^n} \prod_{i=1}^n P(t_i \mid h_i)$$

where $$T = \{ t_1, t_2, \dots, t_n \}$$ is a candidate tag sequence. This is an approximation because beam search only keeps the $$N$$ most probable candidate tag sequences in memory as it traverses the sentence.
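
The following is a simplified sketch of such a beam search, assuming a function `cond_prob(history, tag)` that returns $$P(t \mid h)$$ and a much-reduced history representation:

```python
def beam_search(words, tags, cond_prob, beam_size=5):
    """Keep the `beam_size` highest-scoring partial tag sequences while
    moving left to right through the sentence.

    `cond_prob(history, tag)` is assumed to return P(tag | history); the
    history here is simplified to (words, position, previous tags).
    """
    beam = [([], 1.0)]                           # (partial tag sequence, score)
    for i in range(len(words)):
        candidates = []
        for seq, score in beam:
            history = (words, i, tuple(seq))     # simplified history h_i
            for t in tags:
                candidates.append((seq + [t], score * cond_prob(history, t)))
        # keep only the N most probable partial sequences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_size]
    return beam[0][0]                            # most probable full tag sequence
```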