Learning What’s in a Name with Graphical Models
Abstract
“The UK” is a country, but “The UK Department of Transport” is an organization within that country. In a named entity recognition (NER) task, where we want to label each word with a name tag (organization/person/location/other/not a name), how can a computer model know one from the other?
In this article, we’ll explore three model families that are remarkably successful at NER: Hidden Markov Models (HMMs), Maximum-Entropy Markov Models (MEMMs), and Conditional Random Fields (CRFs). We’ll use interactive visualizations to explain the graphical structure of each. Our overarching goal is to demonstrate how visualizations can be effective tools for communicating and clarifying complex, abstract concepts. The visualizations will allow us to compare and contrast between model families, and understand how each builds on and addresses key issues affecting its predecessors.
Graphical modeling is a robust framework for representing probabilistic models. Complex multivariate probability distributions can be expressed with compact graphs that are easier to understand and interpret.
Let’s start with a simple example with two random variables, and . Assume that is conditionally dependent on . Through a canonical application of the chain rule, the joint distribution of and is:
The shorthand p(a) means p(A = a), that is, the probability of variable A taking value a.
This is a simple example with two factors in the right-hand side. Add more variables, however, and the result can get messy fast. To see this, assume that there are two more variables, and , and that is conditionally dependent on , , and . The factorization becomes:
The relationship between variables is more opaque, hidden behind second-order dependencies. For example, while it’s clear that is directly dependent on , we may miss the fact that there is another, second-order dependency between the two ( is dependent on , which in turn is dependent on ).
Directed Acyclic Graphs, or DAGs, offer a natural remedy to this problem. Each factor in the equation can be represented by a node. An arrow indicates conditional dependence. The resulting graph would look like:
With this graph, it’s easier to construct a story of how , , and are sampled. The process proceeds in topological order, for example → → → , to ensure that all dependencies have been resolved by the time each variable is sampled.
Below is what a sampled population of the given distributions would look like. For the sake of demonstration, many distribution parameters are modifiable — in reality these are the quantities that need to be learned from training data.
For more detailed accounts of probabilistic graphical models, consider reading the textbooks Probabilistic Graphical Models: Principles and Techniques by Daphne Koller and Nir Friedman [1] and Probabilistic Reasoning in Intelligent Systems by Judea Pearl [2].
Hidden Markov Models (HMMs) are an early class of probabilistic graphical models representing partially hidden (unobserved) sequences of events. Structurally, they are built with two main layers, one observed () and one hidden ():
HMMs have been successfully applied to a wide range of problems, including gene analysis [3], information extraction [4, 5], speech recognition [6], and named entity recognition [7].
The hidden layer is assumed to be a Markov process: a chain of events in which each event’s probability depends on the state of only the preceding event. More formally, given a sequence of random events , ,…, , the Markov assumption states:
In a graph, this translates to a linear chain of events where each event has one arrow pointing towards it (except for the first event) and one pointing away from it (except for the last):
A second assumption that HMMs make is time-homogeneity: that the probability of transitioning from one state to the next is independent of time. For example, if the last state was “O”, then the probability of the current state being “B-LOC” is the same value regardless of what time step we are at. In formal terms:
is called the transition probability and is one of the two key parameters to be learned during training.
The assumptions about the hidden layer — Markov and time-homogeneity — hold up in various time-based systems where the hidden, unobserved events occur sequentially, one after the other. Together, they meaningfully reduce the computational complexity of both learning and inference.
The hidden and observed layers are connected via a one-to-one mapping. We assume the probability of each observation depends only on the state of the hidden event at the same time step. Given a sequence of observed events , ,…, and hidden events , ,…, we have:
In a graph, this one-to-one relationship looks like:
The conditional probability is called the "emission probability." We also assume this is time-homogenous, further reducing the model's complexity. It is the second key parameter to be learned, alongside the transition probability.
Throughout this article, we’ll work with the CoNLL-2003 English dataset [8]. The dataset recognizes 9 possible name tags. Each, apart from the “O” tag, has either a “B-” (for beginning) or “I-” (for internal/inside) prefix, to eliminate confusion about when an entity stops and the next one begins. For example:
HMMs represent the input word tokens with the observed layer, with the respective name tags constitue the hidden layer:
Rather than labeling each node using the name of the variable it represents (X₁, Y₁) as we have until this point, we'll instead display the value of that variable (“O”, “Great”). This helps make the graphs easier to read.
Between any two consecutive hidden states, there are 9² = 81 possible transitions. Each transition has its own probability, :
Higher opacity indicates higher relative probability.
In the observed layer, each node can have any value from the vocabulary, whose size ranges anywhere from the thousands to the hundreds of thousands. The vocabulary created for the HMM in this article contains 23,622 tokens. Let N be the number of tokens in the vocabulary. The number of possible emission probabilities is 9N ().
Higher opacity indicates higher relative probability. Dashed lines are used for emission paths to other words in the vocabulary.
There are three sets of parameters to be learned during training: the transition, emission, and start probabilities. When the hidden states are known in the training set, which is the case with CoNLL-2003, all three can be computed as normalized rates of occurrence from the training data. For example, to get the transition probability from state “O” to state “B-LOC”, , we need two numbers: the number of times state “O” is followed by any other state (that is, it isn't the last state in the sequence), as , and the number of times state “O” is followed by state “B-LOC”, as . The desired transition probability is . The same calculation can be done for each of the remaining probabilities.
If however the hidden states are unknown, then the standard approach is to use the Baum-Welch (forward-backward) algorithm. Baum-Welch is an iterative algorithm, where we start with an initial estimate of the probabilities to be learned and use them to compute progressively better estimates. An excellent explanation of the algorithm can be found in the the book Speech and Language Processing by Jurafsky & Martin [9].
In the context of HMMs, inference involves answering useful questions about hidden states given observed values, or about missing values given a partially-observed sequence. In NER, we are focused on the first type of inference. Specifically, we want to perform maximum a posteriori (MAP) inference to identify the most likely state sequence conditioned on observed values.
There is usually an intractably large number of candidate state sequences. For any two consecutive states, there are 9² = 81 potential transition paths. For three states there are 9³ = 729 paths. This number continues to grow exponentially as the number of states increases.
Luckily, there is an efficient dynamic algorithm that returns the most likely path with relatively low computational complexity: the Viterbi algorithm [10]. It moves through the input sequence from left to right, at each step identifying and saving the most likely path in a trellis-shaped memory structure. For more details, refer to the Jurafsky & Martin book [11].
An HMM with the structure outlined above was trained on the CoNLL-2003 dataset. The training set contains 14,987 sentences and a total of 203,621 word tokens. Here's the model in action:
Evaluated against a test set, the model achieves satisfactory per-word accuracy:
Accuracy
90.1%
However, precision and recall — calculated per entity [8] — are decidedly low:
Precision
64.2%
Recall
55.8%
F₁ Score
59.7%
These metrics are lower than per-word accuracy because they are entity-level evaluations that count only exact matches as true positives. Long, multi-word entities are considered incorrect if one or more of their constituent words are misidentified, in effect, ignoring the other correctly identified words in the entity.
A closer look at the results reveals a discrepancy between entities with known words versus those containing at least one out-of-vocabulary (OOV) word. It is common for proper names to contain one or more OOV words, such as the "Pertile" in "Javier Pertile". Both precision and recall are significantly worse in those cases:
Precision
| Tag | No OOV | 1+ OOV |
|---|---|---|
| ORG | 0.8 | 0.21 |
| PER | 0.85 | 0.62 |
| LOC | 0.87 | 0.06 |
| MISC | 0.78 | 0.12 |
| ALL | 0.84 | 0.39 |
| Tag | No OOV | 1+ OOV |
Recall
| Tag | No OOV | 1+ OOV |
|---|---|---|
| ORG | 0.64 | 0.33 |
| PER | 0.58 | 0.59 |
| LOC | 0.71 | 0.05 |
| MISC | 0.54 | 0.06 |
| ALL | 0.64 | 0.41 |
| Tag | No OOV | 1+ OOV |
This makes sense: we expect the model to perform worse on tokens that it wasn't trained on. And because the CoNLL-2003 training set has a large number of OOV tokens — 80.6% of all test entities contain at least one OOV token — across-the-board precision and recall are heavily impacted.
HMMs are by no means perfect candidates for NER or text labeling problems in general, for at least three reasons.
First is the Markov assumption. We don't compose words and their respective labels (“Noun”, “Adjective”, “B-ORG”, “I-ORG”) in a tidy, unidirectional manner. A word may link to another word many positions in the sentence before or after. In “South African Breweries Ltd”, both “South African” and “Ltd” provide crucial context to “Breweries”, which would otherwise register as not a name. HMMs fail to capture these tangled interdependencies, instead assuming that information passes from left to right in a single, neat chain.
An indication of such limitation lies in the fast deterioration of the precision score as the entity length — counted as the number of tokens inside each entity — increases (the recall scores are much more noisy and thus less conclusive):
| Tag | Entity Length 1 | Entity Length 2 | Entity Length 3 | Entity Length 4 | Entity Length 5 |
|---|---|---|---|---|---|
| ORG | 0.61 | 0.68 | 0.3 | 0.12 | 0.28 |
| PER | 0.96 | 0.67 | 0.7 | ||
| LOC | 0.88 | 0.36 | |||
| MISC | 0.9 | 0.46 | 0.24 | 0.12 | |
| ALL | 0.77 | 0.61 | 0.31 | 0.12 | 0.29 |
| Tag | 1 | 2 | 3 | 4 | 5 |
An empty cell means there were not enough (fewer than five) relevant entities. E.g. in the precision tab, there were not enough entities predicted to be a location name that were five words long.
The second reason why HMMs fail to do well on text labeling problems is the emissions assumption. When composing sentences, we don’t go about forming a chain of labels (such as “B-ORG” – “I-ORG” – “I-ORG” – “O”…) before generating words from each label in that chain.
Semantic and grammatical rules may restrict the range of words that can appear in any given observation, but those restrictions are far from the strong emissions assumption made by HMMs. Beyond the current word label, there is a host of other features that, together, help determine which words are chosen. Additionally, while there is a link between part-of-speech tags and the words that are chosen, that link is less clear with name tags.
The third reason: HMMs rely solely on word identity (the exact characters that make up the word) in the observed layer. There are various word features that can provide additional information on the hidden label. Capitalization is a strong signal of named entities. Word shape may help account for common name patterns. HMMs take none of these features into account (in fact, they can’t: their generative chain structure requires that all observations be independent of each other).
Since the only information available are word identities, HMMs falter on out-of-vocabulary words. While the models can wring bits of useful information out of nearby predicted name tags — the Viterbi algorithm maximizes likelihood over the whole sequence — the results are nonetheless discouraging:
| Entity Length | OOV Rate 0 | OOV Rate 0.2 | OOV Rate 0.25 | OOV Rate 0.33 | OOV Rate 0.4 | OOV Rate 0.5 | OOV Rate 0.6 | OOV Rate 0.66 | OOV Rate 0.75 | OOV Rate 0.8 | OOV Rate 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.86 | 0.3 | |||||||||
| 2 | 0.8 | 0.5 | 0.52 | ||||||||
| 3 | 0.78 | 0.15 | 0.06 | 0 | |||||||
| 4 | 0.42 | 0.22 | 0 | 0 | 0 | ||||||
| 5 | 0.67 | 0.22 | 0 | 0.14 | |||||||
| Entity Length | 0 | 0.2 | 0.25 | 0.33 | 0.4 | 0.5 | 0.6 | 0.66 | 0.75 | 0.8 | 1 |
An empty cell means either it is impossible (e.g. a two-word entity can’t have an OOV rate of 0.25) or there were not enough (fewer than five) relevant entities to form a reliable score.
The concerns detailed above are a consequence of HMMs’ generative graph structure. Below, we’ll consider another class of probabilistic graphical models with a different, discriminative structure that will hopefully address some of those concerns and deliver better performance on long, OOV-heavy entities.
Maximum Entropy Markov Models (MEMMs) [12] resolve some of the concerns we had with HMMs by way of a simple yet meaningful modification to their graphical structure:
The arrows connecting observations with their respective states have switched directions. We'll discuss the implications of such a change in the sections below.
Similar to HMMs, MEMMs have been successfully applied to a wide range of sequence modeling problems [12, 13, 14].
There are two main approaches to building classification models: generative and discriminative [15]. Suppose there is a system with two variables, and . We want to make predictions about based on observations on . A generative classifier would do that by first learning the prior distribution and then applying Bayes' rule to find the posterior . This can be thought of as reconstructing the process that generated the observed data. A discriminative classifier, on the other hand, would model the posterior directly based on training data.
HMMs are generative classifiers, while MEMMs are discriminative. The former are generative because they model the joint distribution over both observations and hidden states (as a product of transition and emission probabilities) before using that joint distribution to find the most likely state sequence given observations (or solve some other inference problem). MEMMs, on the other hand, directly model the conditional probabilities without any intermediary.
A major advantage of MEMMs’ discriminative structure is that it permits overlapping word features as inputs. Two features can overlap when they contain the same or similar pieces of information, such as capitalization and word shape (e.g. “Xxxx”, for four-letter words where the first character is capitalized). HMMs don’t allow overlapping features. As a result of their generative structure, they require that all events in the observation layer be independent of one another. MEMMs, on the other hand, are discriminative and able to relax the independence requirement, so they can use arbitrary overlapping features [12].
Common practice is to represent features as binary functions, such as:
Applied to the example above, we have:
| ot | bshape=Xxxx(ot) |
|---|---|
| Stately | 0 |
| , | 0 |
| plump | 0 |
| Buck | 1 |
| Mulligan | 0 |
| came | 0 |
Binary features are paired with the current state to form feature-state pairs , also expressed as binary functions:
Feature-state pairs help MEMMs learn which features and states go together and which don’t. For example, we can expect pairs like "is_capitalized" + “B-ORG” to frequently co-occur, capturing the fact that in English named entities are often capitalized.
MEMMs’ state transition distributions have exponential form and contain a weighted sum of all feature-state pairs:
where and are the previous and current state, is the current observation, is a feature-state pair, is the learned weight for , and is a normalizing term to make the distribution sum to one across all next states .
Those familiar with neural networks will recognize that the function above is a softmax. Its exponential form is a result of the core principle of maximum entropy that underlies MEMMs’ statistical structure and gives them their name. Maximum entropy states that the model that best represents our knowledge about a system is one that makes the fewest possible assumptions except for certain constraints derived from prior data from that system [12, 16].
The training step involves learning the weights that satisfy MEMMs’ maximum entropy constraint [12]. Learning is done through Generalized Iterative Scaling, which iteratively updates the values in order to nudge the expected value of all features closer to their training set average. Convergence at a global optimum is guaranteed given the exponential form of the transition distribution.
As with HMMs, the Viterbi algorithm makes MAP inference tractable [12, 10]. The variable transition probability takes the place of HMMs’ fixed transition and emission probabilities.
A MEMM was trained on the CoNLL-2003 English dataset [8]. In addition to word identity, features used for training include the word’s lowercase version (“Algeria” → “algeria”), shape (“Xxxx”), whether it’s in title/upper case, and whether it contains only digits.
A list of the most informative features — those with the largest absolute weights — offers valuable insights into how the model found and remembers linguistic patterns:
Many of these features are word identities. This makes intuitive sense: certain words, like “Germany”, are almost always used as names irrespective of what comes before or after them.
Other features relate to established linguistic patterns. For example, if the current word has shape “X.X.”, such as “U.S.” and “U.N.”, it’s unlikely to have the “O” tag — the feature-state pair’s weight is a large negative number. This means the word is likely a named entity, most probably two-letter initialisms.
Here’s a live version of the trained model:
The model has better performance than its HMM counterpart. Per-word accuracy is higher than the HMM’s 90.1%:
Accuracy
93.1%
Per-entity precision and recall are notably higher, up from the HMM’s 64.2% and 55.8%, respectively:
Precision
72.9%
Recall
63.5%
F₁ Score
67.9%
A large part of the performance boost is attributable to higher precision on entities with at least one OOV word:
Precision
| Tag | No OOV | 1+ OOV |
|---|---|---|
| ORG | 0.81 | 0.36 |
| PER | 0.82 | 0.8 |
| LOC | 0.82 | 0.17 |
| MISC | 0.74 | 0.14 |
| ALL | 0.8 | 0.54 |
| Tag | No OOV | 1+ OOV |
Recall
| Tag | No OOV | 1+ OOV |
|---|---|---|
| ORG | 0.68 | 0.12 |
| PER | 0.72 | 0.57 |
| LOC | 0.89 | 0.29 |
| MISC | 0.78 | 0.02 |
| ALL | 0.79 | 0.37 |
| Tag | No OOV | 1+ OOV |
The ability to model word features allows MEMMs to fare better with OOV-dense name entities than HMMs. Faced with words that they have never seen before during training, these models can easily stumble. Word identity alone provides no useful information. In those cases, derived features such as word shape and capitalization can function as imperfect yet doubtlessly helpful proxies for word identity, allowing MEMMs to make better guesses at the name tag, resulting higher precision and recall scores:
| Entity Length | OOV Rate 0 | OOV Rate 0.2 | OOV Rate 0.25 | OOV Rate 0.33 | OOV Rate 0.4 | OOV Rate 0.5 | OOV Rate 0.6 | OOV Rate 0.66 | OOV Rate 0.75 | OOV Rate 0.8 | OOV Rate 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.83 | 0.34 | |||||||||
| 2 | 0.76 | 0.72 | 0.55 | ||||||||
| 3 | 0.6 | 0.56 | 0.59 | 0.5 | |||||||
| 4 | 0.16 | 0.2 | 0 | ||||||||
| 5 | 0.6 | 0.5 | |||||||||
| Entity Length | 0 | 0.2 | 0.25 | 0.33 | 0.4 | 0.5 | 0.6 | 0.66 | 0.75 | 0.8 | 1 |
An empty cell means either it is impossible (e.g. a two-word entity can’t have an OOV rate of 0.25) or there were not enough (fewer than five) relevant entities to form a reliable score.
With stronger predictive power on OOV words, we can additionally expect better performance on long, multi-word entities. That’s because OOV words are dangerous information gaps inside named entities. They’re easy to misclassify, and when they are the entire entity prediction is counted as incorrect. MEMMs are able to fill those gaps to an extent by using word features. As a result, we don’t see the drastic performance deterioration for longer entities observed with the HMM:
| Tag | Entity Length 1 | Entity Length 2 | Entity Length 3 | Entity Length 4 | Entity Length 5 |
|---|---|---|---|---|---|
| ORG | 0.76 | 0.69 | 0.84 | 0.36 | 0.8 |
| PER | 0.59 | 0.91 | 0.66 | 0.25 | |
| LOC | 0.8 | 0.33 | 0.35 | 0 | |
| MISC | 0.82 | 0.57 | 0.29 | 0.18 | |
| ALL | 0.77 | 0.7 | 0.58 | 0.16 | 0.43 |
| Tag | 1 | 2 | 3 | 4 | 5 |
An empty cell means there were not enough (fewer than five) relevant entities. E.g. in the precision tab, there were not enough entities predicted to be a location name that were five words long.
MEMMs’ discriminative structure offers great benefits as demonstrated above, but it comes with a downside: the label bias problem. First reported by Bottou [17], label bias mostly affects discriminative models, causing certain states to effectively ignore their observations, leading to demonstrably higher error rates [18].
Here’s how it happens. Recall that MEMMs model state transitions as probability distributions . For these probabilities to be valid, the distributions must sum up to one:
This requirement is satisfied by local normalization via the term baked directly into the definition of :
The normalization is “local” since it’s applied to individual parts of the graph (in this case, the transition between two states) as opposed to the whole structure. It’s here that the problem arises: scores leaving a state are rescaled to become proper probabilities, and in the process information about how likely the transitions are under the given observations is erased. A simple example can help illustrate this effect.
Let’s assume that there are only two hidden states in our model: “LOC” if the word is part of a location name, and “PER” if it’s part of a person’s name. There’s a special state “S” for the start of the sentence. We observe the two-word sequence “Stone Creek”. The transition graph is the following:
At inference time we proceed from left to right. The first observation is “Stone”. We need to find and . Let’s say that they both have the same value, :
We then go to the second observation, “Creek”. The first pair of transition probabilities that we need to calculate is and . Under local normalization, the two must sum up to one. Let’s say that they come out to and respectively.
So far so good. Now we need to calculate and . Let’s say probabilities are, in turn, and .
Now, intuitively the word “Creek” is much more likely to be part of a location name than a person’s name. But here the first set of transition probabilities, — , have the same total mass as the second set, — . Both sum up to one because of local normalization. The result is that taken together, the model judges paths that go through the “LOC” state first (i.e. “S” → “LOC” → “LOC” and “S” → “LOC” → “PER”) to be about as likely as those that go through the “PER” state first (“S” → “PER” → “LOC” and “S” → “PER” → “PER”), contradicting our intuition about “Creek” being more consistent with the former than the latter. The model has in effect ignored crucial information contained within the observation “Creek”.
More generally, due to MEMMs’ discriminative structure, observations can only affect how the incoming probabilities are distributed across the possible current states, but not how much total probability mass to pass on. That total mass must always equal one, due to local normalization. In the example above, given the previous state “PER”, the observation “Creek” may shift the distribution over the current state toward rather than some other split like , but it cannot change the total probability mass that the previous state “PER” passes on. We can’t have the transition probability pair be, say, , which would better reflect our intuition about “PER” being less likely under “Creek”.
So that is the crux of the label bias problem. Local probability normalization causes MEMMs to discard useful information about how likely state transitions are under a given observation. The problem extends beyond the simple case above. Hannun [19] offers an excellent, more detailed explanation of the problem and how it generalizes to more scenarios and model structures.
There have been various proposed solutions to the label bias problem. One particularly successful strategy is to eschew local normalization in favor of normalization at the global, graph-wide level. That brings us to Conditional Random Fields, a class of globally normalized, undirected probabilistic models, which we’ll cover in the next section.
Conditional Random Fields (CRFs) are a class of undirected probabilistic models. They have proved to be powerful models with a wide range of applications, including text processing [18, 20, 21], image recognition [22, 23, 24, 25], and bioinformatics [26, 27].
While CRFs can have any graph structure, in this article we’ll focus on the linear-chain version:
CRFs are a type of Markov Random Fields (MRFs) — probability distributions over random variables described by undirected graphs [28], for example:
Undirected graphs are appropriate when it’s difficult or implausible to establish causal, generative relationships between random variables. Social networks are a good example of undirected relationships. We can think of , , , and in the graph above as people in a simple network. and are friends and tend to share similar beliefs. The same goes for and as well as and . We might, for example, want to model how each person in the network thinks about a specific topic.
Acyclic Directed Graphs fail to adequately represent the mutual belief propagation that occurs within the group. For example, we might have an edge from to but no path from back to — there will always be at least one such exclusion in an acyclic directed graph.
Rather than assuming a generative relationship between variables, MRFs model their mutual relationships with non-negative scoring functions , called factors, that assign higher scores if the variables’ values are in agreement, for example: