## Part-Of-Speech Tagging

Jan. 28, 2018

A Part-of-Speech (POS) tagger is a tool that assigns a part of speech, such as verb, noun, or adjective, to each word of a sentence. This post summarizes the POS tagger introduced in 1996 by Adwait Ratnaparkhi in his seminal paper *A Maximum Entropy Model for Part-Of-Speech Tagging*.

This method of POS tagging uses a probabilistic model that is optimized using maximum likelihood estimation. The model is defined on a set of word and tag contexts \( \mathcal{H} \) and the set of possible tags \( \mathcal{T} \). We want to represent a probability distribution over all the combinations of \( h \) and \( t \), that is, \( \mathcal{H} \times \mathcal{T} \). We can write this probability as

$$ P(h, t) = \pi \mu \prod_{j=1}^k \alpha_j^{\,\,f_j (h, t)} $$

where \( \pi \) is a normalization constant, \( \{\mu, \alpha_1, \alpha_2, \dots, \alpha_k \} \) are the model parameters, and \( \{ f_1, f_2, \dots, f_k\} \) are features. Features are binary, that is, \( f_j(h, t) \in \{ 0, 1 \} \). There is a one-to-one correspondence between the parameters \( \{ \alpha_j \}_{j=1}^k \) and the features \( \{ f_j \}_{j=1}^k \), so each parameter can be thought of as the weight of its feature.
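
To make the model concrete, here is a minimal sketch of how the joint probability could be computed from a set of binary feature functions and their weights. The feature functions, the dictionary representation of a history, and the weights used below are illustrative assumptions, not the tagger's actual feature set.

```python
def joint_probability(h, t, features, alphas, pi=1.0, mu=1.0):
    """Compute P(h, t) = pi * mu * prod_j alpha_j^f_j(h, t).

    features: list of binary functions f_j(h, t) -> 0 or 1
    alphas:   the corresponding weights alpha_j
    """
    p = pi * mu
    for f_j, alpha_j in zip(features, alphas):
        if f_j(h, t):  # a binary feature contributes alpha_j only when active
            p *= alpha_j
    return p
```

Because each feature is binary, raising \( \alpha_j \) to the power \( f_j(h, t) \) reduces to multiplying by \( \alpha_j \) exactly when the feature is active.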

To train the model, we must maximize the expression

$$ L(P) = \prod_{i=1}^n P(h_i, t_i) = \prod_{i=1}^n \pi \mu \prod_{j=1}^k \alpha_j^{\,\, f_j (h_i, t_i)} $$

over a set of training data \( \{ (h_i, t_i) \}_{i=1}^n \) with respect to the parameters of the model.
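
Since the logarithm is monotonic, maximizing \( L(P) \) is equivalent to maximizing \( \log L(P) \), which turns the product over training pairs into a sum. A sketch of that log-likelihood computation, assuming features are represented as binary Python functions and that \( \log(\pi\mu) \) is folded into a single constant:

```python
import math

def log_likelihood(data, features, alphas, log_pi_mu=0.0):
    """Sum of log P(h_i, t_i) over training pairs (h_i, t_i).

    Maximizing this sum is equivalent to maximizing the product L(P).
    log_pi_mu stands in for log(pi * mu), constant across examples.
    """
    total = 0.0
    for h, t in data:
        total += log_pi_mu
        for f_j, alpha_j in zip(features, alphas):
            if f_j(h, t):  # each active feature contributes log(alpha_j)
                total += math.log(alpha_j)
    return total
```

In the paper the maximization itself is carried out with Generalized Iterative Scaling; the sketch above only evaluates the objective.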

A feature is said to be active when \( f_j(h, t) = 1 \). When a feature is active, its parameter \( \alpha_j \) contributes a multiplicative factor to the joint probability of the history \( h \) and tag \( t \). A feature may condition on any word or tag in the history \( h \). The history \( h_i \) that the features are given is defined as

$$ h_i = \{ w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2} \} $$

where \( w_i \) denotes the word at index \( i \) and \( t_{i-1}, t_{i-2} \) denote the two preceding tags. The following example feature is copied directly from Ratnaparkhi's paper.

$$ f_j (h_i, t_i)=\begin{cases} 1 & \text{if suffix}(w_i) = \text{"ing" and } t_i = \text{VBG}\\ 0 & \text{otherwise.} \end{cases} $$
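
Written as code, this feature is just a predicate on the history and a candidate tag. The dictionary representation of the history below is an illustrative assumption; the paper does not prescribe a particular data structure.

```python
def suffix_ing_and_vbg(h, t):
    """1 if the current word ends in "ing" and the candidate tag is VBG.

    h is assumed to be a dict holding the current word under "w_i".
    """
    return 1 if h["w_i"].endswith("ing") and t == "VBG" else 0
```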

When this feature is active for some history-tag pair \( (h_i, t_i) \), the model parameter \( \alpha_j \) contributes to the joint probability \( P(h_i, t_i) \). At inference time, beam search is used to find the most likely tag sequence \( \{ t_1, t_2, \dots, t_n \} \) for a sentence \( \{ w_1, w_2, \dots, w_n \} \). Beam search uses the conditional tag probability

$$ P(t | h) = \frac{P(h, t)}{\sum_{t' \in \mathcal{T}} P(h, t')}. $$
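
In code, this conditional is a simple normalization of the joint probability over all candidate tags. Note that \( \pi \) and \( \mu \) cancel between numerator and denominator, so they can be omitted. The feature representation below is an illustrative assumption:

```python
def conditional_probability(h, t, tagset, features, alphas):
    """P(t | h) = P(h, t) / sum over t' of P(h, t').

    pi and mu cancel in the ratio, so the unnormalized joint
    below drops them.
    """
    def unnormalized_joint(tag):
        p = 1.0
        for f_j, alpha_j in zip(features, alphas):
            if f_j(h, tag):
                p *= alpha_j
        return p

    z = sum(unnormalized_joint(t_prime) for t_prime in tagset)
    return unnormalized_joint(t) / z
```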

Beam search is thus approximating the expression

$$ \arg \max_{T \in \mathcal{T}^n} P(T | w_1, w_2, \dots, w_n) = \arg \max_{T \in \mathcal{T}^n} \prod_{i=1}^n P(t_i | h_i) $$

where \( T = \{ t_1, t_2, \dots, t_n \} \) is a candidate tag sequence. This is an approximation because beam search keeps only the \( N \) most probable partial tag sequences in memory as it traverses the sentence.
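
A minimal sketch of this beam search, assuming the model's conditional \( P(t_i \mid h_i) \) is supplied as a function of the position, sentence, and previously chosen tags (this signature is an assumption for illustration, not the paper's interface):

```python
import heapq

def beam_search_tags(words, tagset, cond_prob, beam_width=3):
    """Approximate the most probable tag sequence for a sentence.

    cond_prob(i, words, prev_tags, t) stands in for the model's
    P(t_i | h_i). Only the beam_width highest-probability partial
    sequences are kept at each word position.
    """
    beam = [(1.0, [])]  # (probability so far, tags so far)
    for i in range(len(words)):
        candidates = [
            (prob * cond_prob(i, words, tags, t), tags + [t])
            for prob, tags in beam
            for t in tagset
        ]
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])[1]
```

Because lower-probability prefixes can still lead to the best full sequence, pruning to \( N \) candidates per step is what makes the result approximate rather than exact.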

To learn more about this POS tagger, how it was developed, and the preceding research, please read Adwait Ratnaparkhi's paper *A Maximum Entropy Model for Part-Of-Speech Tagging*.