Over the past several months, Large Language Models (LLMs) such as ChatGPT, GPT-4, and AutoGPT, have flooded the Internet with all kinds of different applications and use-cases. These are regarded as language models that can remember context as well as understand their own capabilities. They’re often treated as black-boxes where the majority of the implementation details are left to the researchers. However, having some understanding of how they work can also help people more clearly and concisely instruct the model to get the desired output.

Rather than jumping straight to how LLMs work, I think it’s helpful to cover some prerequisite knowledge to help us demystify LLMs. In this post, we’ll go back in time before neural networks and talk about language, language modeling, and n-gram language models since they’re simple to understand and we can do an example by hand.

Language

Before we start with n-gram models, we need to understand the kind of data we’re working with. If we were going to delve into convolutional neural networks (CNNs), we’d start our discussion with images and image data. Since we’re talking about language modeling, let’s talk about language so we can better motive why language modeling is very hard. One definition of language that’s particularly relevant to language modeling is a structured system of communication with a grammar and vocabulary (note this applies for spoken, written, and sign language). Given you’re reading this post in the English language, you’re probably already familiar with vocabulary and grammar so let me present to you a sentence.

The quick brown fox jumps over the lazy dog.

You might recognize this sentence as being one that uses each letter of the English/Latin alphabet at least once. Immediately we see the words belonging to the vocabulary and their part-of-speech: nouns like “fox” and “dog”; adjectives like “quick”, “brown”, “lazy”; articles like “the”; verbs like “jumps”; and prepositions like “over”.

Grammar is what dictates the ordering of the words in the vocabulary: the subject “fox” comes before the verb “jumps” and the direct object “dog”. This ordering depends on the language however. For example, if I translated the above sentence into Japanese, it would read: 素早い茶色のキツネが怠惰な犬を飛び越えます。A literal translation would go like “Quick brown fox lazy dog jumped over”. Notice how the verb came at the end rather than between the subject and direct object.

These problems help illustrate why we can’t simply have a model that performs a one-to-one mapping when we try to model languages. We might end up with more words, e.g., if the target language uses particles words, or fewer words, e.g., if the target language doesn’t have article words. Even if we did have the same number of words, the ordering might change. For example, in English, we’d say “red car” but in Spanish we’d say “carro rojo” which literally translates to “car red”: the adjective comes after the noun it describes.

To summarize, language is very difficult! Even for humans! So it’s going to be a challenge for computers too.

Applications of Language Modeling

With that little aside on languages, before we formally define language modeling, let’s look at a few applications that use some kind of language modeling under-the-hood.

Sentiment Analysis. When reading an Amazon review, as humans, we can tell if they’re positive or negative. We want to have a language model that can do the same kind of thing. Given a sequence of text, we want to see if the sentiment is good or bad. Cases like “It’s hard not to hate this movie” are particularly challenging and need to be handled correctly. This particular application of language modeling is used in “Voice of the Customer” style analysis to gauge perceptions about a company or their products.

Automatic Speech Recognition. Language modeling can be useful for speech recognition by being able to correctly model sentences, especially for words that sound the same but are written differently like “tear” and “tier”.

Neural Machine Translation. Google Translate is a great example of this! If we have language models of different languages, implicitly or explicitly, we can translate between the languages that they model!

Text Generation. This is what ChatGPT has grown famous for: generating text! This application of language modeling can be used for question answering, code generation, summarization, and a lot more applications.

Language Modeling

Now that we’ve seen a few applications, what do all of these haven in common? It seems like one point of commonality is that we want to understand and analyze text against the trained corpus to ensure that we’re consistent with it. In other words, if our model was trained on a dataset of English sentences, we don’t want it generating grammatically incorrect sentences. In other words, we want to ensure that the outputs “conform” to the dataset.

One way to measure this is to compute a probability of “belonging”. For a some random given input sequence, if the probability is high, then we expect that sequence to be close to what we’ve see in the dataset. If that probability is low, then that sequence is likely something that doesn’t make sense in the dataset. For example, a good language model would score something like $p(\texttt{The quick brown fox jumps over the lazy dog})$ high and something like $p(\texttt{The fox brown jumped dog laziness over lazy})$ low because the former has proper grammar and uses known words in the vocabulary.

This is what a language model does: given an input sequence $x_1,\cdots,x_N$, it assigns a probability $p(x_1,\cdots,x_N)$ that represents how likely it is to appear in the dataset. That seems a little strange given we’ve just discussed the above applications. What does something like generating text have to do with assigning probabilities to sequences? Well we want the generated text to match well with the dataset, don’t we? In other words, we don’t want text with poor grammar or broken sentences. This also explains why those phenomenal LLMs are trained on billions of examples: they need diversity in order to assign high probabilities to sentences that encode facts and data of the dataset.

So how do we actually compute this probability? Well the most basic definition of probability is “number of events that happened” / “number of all possible events” so we can try to do the same thing with this sequence of words.

\[p(w_1,\dots, w_N)=\displaystyle\frac{C(w_1,\dots, w_N)}{\displaystyle\sum_{w_1,\dots,w_N} C(w_1,\dots, w_N)}\]

So for a word sequence $w_1,\dots, w_N$, in our corpus, we count how many times we find that sequence divide by all possible word sequences of length $N$. There are several problems with this. To compute the numerator, we need to count a particular sequence in the dataset but notice that this gets harder to do the longer the sequence is. For example, finding the sequence “the cat” is far easier than finding the sequence “the cat sat on the mat wearing a burgundy hat”. To compute the denominator, we need the combination of all English words up to length $N$. To give a sense of scale, Merriam Webster estimates there are about ~1 million words so this becomes the combinatorial problem.

\[\binom{1\mathrm{e}6}{N} = \displaystyle\frac{1\mathrm{e}6!}{N!(1\mathrm{e}6-N)!}\]

In other words, for each word up to $N$, there are about a million possibilities we have to account for until we get up to the desired sequence length. The factorial of a million is an incredibly large number! So these reasons make it difficult to compute language model probabilities in that form so we have to try something else. If we remember some probability theory, we can try to rearrange the terms using the chain rule of probability.

\[\begin{align*} p(w_1,\dots, w_N) &= p(w_N|w_1,\dots,w_{N-1})p(w_1,\dots,w_{N-1})\\ &= p(w_N|w_1,\dots,w_{N-1})p(w_{N-1}|w_1,\dots,w_{N-2})p(w_1,\dots,w_{N-2})\\ &= \displaystyle\prod_{i=1}^N p(w_i|w_1,\dots,w_{i-1})\\ \end{align*}\]

So we’ve decomposed the joint distribution of the language model into a product of conditionals $p(w_i\vert w_1,\dots,w_{i-1})$. Intuitively, this measures the probability that word $w_i$ follows the previous sequence $w_1,\dots,w_{i-1}$. Basically for a word, we depend on all previous words. So let’s see if this is any easier to practically count up the sequences.

\[p(w_i|w_1,\dots,w_{i-1})=\displaystyle\frac{C(w_1,\dots,w_i)}{C(w_1,\dots,w_{i-1})}\]

This looks a little better! Intuitively, we count a particular sequence up to $i$: $w_1,\dots,w_i$ in the corpus. But the denominator, we only count up to the previous word $w_1,\dots,w_{i-1}$. This is a bit better than going up to the entire sequence length $N$ but still a problem. Particularly, the biggest problem is the history $w_1,\dots,w_{i-1}$. How do we deal with it?

n-gram Model

Rather than dealing with the entire history up to a certain word, we can approximate it using only the past few words! This is the premise behind n-gram models: we approximate the entire past history using the past $n$ words.

\[p(w_i|w_1,\dots,w_{i-1})\approx p(w_i|w_{1-(n-1)},\dots,w_{i-1})\]

A unigram model looks like $p(w_i)$; a bigram model looks like $p(w_i\vert w_{i-1})$; a trigram model looks like $p(w_i\vert w_{i-1},w_{i-2})$. Intuitively, a unigram model looks at no prior words; a bigram models looks only at the previous word; a trigram model looks at only the past two words. Now let’s see if it’s easier to compute these conditional distributions using the same counting equation.

\[\begin{align*} p(w_i|w_{i-1})&=\displaystyle\frac{C(w_{i-1}, w_i)}{\displaystyle\sum_w C(w_{i-1}, w)}\\ &\to\displaystyle\frac{C(w_{i-1}, w_i)}{C(w_{i-1})} \end{align*}\]

We go to the second line by using maximum likelihood estimation. Computing these counts is much easier! To see this, let’s actually compute an n-gram model by hand using a very small corpus.

\[\texttt{<SOS>}\text{I am Sam}\texttt{<EOS>}\] \[\texttt{<SOS>}\text{Sam I am}\texttt{<EOS>}\]

Practically, we use special tokens that denote the start of the sequence (<SOS>) and end of sequence (<EOS>). The <EOS> token is required to normalize the conditional distribution into a true probability distribution. The <SOS> token is optional but it becomes useful for sampling the language model later so we’ll add it. Treating these as two special tokens, let’s compute the bigram word counts and probabilities by hand.

$w_i$	$w_{i-1}$	$p(w_i\vert w_{i-1})$
I	<SOS>	$\frac{1}{2}$
Sam	<SOS>	$\frac{1}{2}$
<EOS>	Sam	$\frac{1}{2}$
I	Sam	$\frac{1}{2}$
Sam	am	$\frac{1}{2}$
<EOS>	am	$\frac{1}{2}$
am	I	$1$

Concretely, let’s see how to compute $p(\text{I}\vert\text{Sam})$. Intuitively, this is asking for the likelihood that “I” follows “Sam”. In our corpus, we have two instances of “Sam” and the words after are “<EOS>” and “I”. So overall, the likelihood is $\frac{1}{2}$. Notice how the conditionals form a valid probability distribution, e.g., $\sum_w p(w\vert\text{Sam}) = 1$.

With this model, we can approximate the full language model with a product of n-grams. Consider bigrams:

\[\begin{align*} p(w_1,\dots, w_N)&\approx p(w_2|w_1)p(w_3|w_2)\cdots p(w_N|w_{N-1})\\ p(\text{the cat sat on the mat}) &\approx p(\text{the}|\texttt{<SOS>})p(\text{cat}|\text{the})\cdots p(\texttt{<EOS>}|\text{mat}) \end{align*}\]

This is a lot more tractable! So now we have an approximation of the language model! What other kinds of things can we do? We can sample from language models. We start with the <SOS> and then use the conditionals to sample. We can either keep sampling until we hit a <EOS> or we can keep sampling for a fixed number of words. This is why we have a <SOS>: if we didn’t, we’d need to specific a start token. But since we used <SOS>, we have a uniform start token.

Practical Language Modeling

Now that we’ve covered the maths, let’s talk about some practical aspects of language modeling. The first problem we can address is what we just talked about: approximating a full language model with the product of n-grams.

\[p(w_1,\dots, w_N)\approx p(w_2|w_1)p(w_3|w_2)\cdots p(w_N|w_{N-1})\]

What’s the problem with this? Numerically, when we multiply a bunch of probabilities together, we’re multiplying together numbers that are in $[0, 1]$ which means the probability gets smaller and smaller. This has a risk of underflowing to 0. To avoid this, we use a trick called the exp-log-sum trick:

\[\exp\Big[\log p(w_2|w_1)+\log p(w_3|w_2)+\cdots+\log p(w_N|w_{N-1})\Big]\]

In the log-space, multiplying is adding so the number just gets increasingly negative rather than increasingly small. Then we can take the exponential to “undo” the log-space.

Going beyond the numerical aspects, practically, language models need to be trained on a large corpus because of sparsity. After we train, two major problems we encounter in the field are unknown words not in the training corpus and words that are known but used in an unknown context.

For the former, when we train language models, we often construct a vocabulary during training. This can either be an open vocabulary where we add words as we see them or a closed vocabulary where we agree on the words ahead of time (perhaps the most common $k$ words for example). In either case, during inference, we’ll encounter out-of-vocabulary (OOV) words. One solution to this is to create a special token called <UNK> that represents unknown words. For any OOV word, we map it to the <UNK> token and treat it like any other token in our vocabulary.

Smoothing

What about known words in an unknown context? Let’s consider how we compute bigrams.

\[p(w_i|w_{i-1})=\displaystyle\frac{C(w_{i-1},w_i)}{C(w_{i-1})}\]

Mathematically, the problem is that the numerator can be zero. So the simplest solution is to make it not zero by adding $1$. But we can’t simply add $1$ without correcting the denominator since we want a valid probability distribution. So we also need to add something to the denominator. Since we’re adding $1$ to each count for each word, we need to add a count for the total number of words in the vocabulary $V$.

\[p(w_i|w_{i-1})=\displaystyle\frac{C(w_{i-1},w_i)+1}{C(w_{i-1})+V}\]

With this, we’re guaranteed not to have zero counts! This is called Laplace Smoothing. The issue with this kind of smoothing is that the probability density moves too sharply since we’re just blindly adding $1$. We can generalize this so that we actually add some $k$ (and normalize by $kV$) to help better ease the probability density less sharply towards the unknown context event.

\[p(w_i|w_{i-1})=\displaystyle\frac{C(w_{i-1},w_i)+k}{C(w_{i-1})+kV}\]

This is called Add-$k$ Smoothing. It can perform better than Laplace Smoothing in most cases, with the appropriate choice of $k$ tuned.

Backoff and Interpolation

One alternative to smoothing is to try to use less information if it’s available. The intuition is that if we can’t find a bigram $p(w_{i-1},w_i)$, we can see if a unigram exists $p(w_i)$ that we can use in its place. This technique is called backoff because we back off to a smaller n-gram.

Going a step further, we don’t have to necessarily choose between using backing off to only the $(n-1)$-gram. We can choose to always consider all previous n-gram, but create a linear combination of them.

\[\begin{align*} p(w_i|w_{i-2},w_{i-1})&=\lambda_1 p(w_i)+\lambda_2 p(w_i|w_{i-1})+\lambda_3 p(w_i|w_{i-2},w_{i-1})\\ \displaystyle\sum_i \lambda_i &= 1 \end{align*}\]

Here the $\lambda_i$s are the interpolation coefficients and they have to sum to $1$ to create a valid probability distribution. This allows us to consider all previous n-grams in the absence of data. Backoff with interpolation works pretty well in practice.

Code

We’ve been talking about the theory of language models and n-gram models for a while but let’s actually try training one on a dataset and use it to generate text! Fortunately since they’ve been around for a while, training them is very simple with existing libraries.

from torchtext.datasets import AG_NEWS
import re

from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

N = 6

data = AG_NEWS(root='.', split='train')
train, vocab = padded_everygram_pipeline(N, 
    [re.sub(r'[^A-Za-z0-9 ]+', '', x[1]).split() for x in data])
lm = MLE(N)
lm.fit(train, vocab)
print(' '.join(lm.generate(20, random_seed=4)))

We’re using the AG_NEWS dataset that contains 120,000 training examples of news articles across World, Sports, Business, and Science/Tech. The padded_everygram_pipeline adds the <SOS> and <EOS> tokens and creates n-grams and backoff n-grams; we’re using 6-grams which tend to work well in practice. For simplicity, we ignore any non-alphanumeric character besides spaces. Then we use a maximum likelihood estimator (similar to the conditional distribution tables we created above) to train our model. Finally, we can generate some examples of length 20.

I tried a bunch of different seeds and here are a few cherry-picked examples (I’ve truncated them after the <EOS> token):

Belgian cancer patient made infertile by chemotherapy has given birth following revolutionary treatment
Two US citizens were killed when a truck bomb exploded in downtown Kabul in the second deadly blast to strike
This year the White House had rejected a similar request made by 130 Republican and Democratic members of Congress
Greatly enlarged museum is expected to turn into a cacophony on Saturday

These look pretty good for just an n-gram model! Notice they retain some information, probabilistically, across the sequence. For example, in the first one, the word “infertile” comes before “birth” since, when generating “birth”, we could see “infertile” in our previous history.

But I also found scenarios where the generated text didn’t really make any. Here are some of those lemon-picked examples:

For small to medium businesses
Can close the gap with SAP the world 39s biggest software company after buying US rival PeopleSoft Oracle 39s Chairman
British athletics appoint psychologist for 2008 Olympics British athletics chiefs have appointed sports psychologist David Collins
Can close the gap with SAP the world 39s biggest software company after buying US rival PeopleSoft Oracle 39s Chairman

These are sometimes short phrases or nonsensical with random digits. In one case, the language model just generated a bunch of <EOS> tokens! These examples also help show why neural language models tend to outperform simplistic n-gram models in general. Feel free to change the dataset and generate your own sentences!

Conclusion

Large Language Models (LLMs) are gaining traction online as being able to perform complex and sequential reasoning tasks. They’re often treated as black-box models but understanding a bit about how they work can make it easier to interact with them. Starting from the beginning, we learned a bit about language itself and why this problem is so difficult and why it wasn’t solved decades ago. We introduced language modeling as a task of assigning a probability to a sequence of words based on how likely it is to appear in the dataset. Then we learned about how $n$-gram models approximate this full previous history of a particular word using only the past $n$ words. We can use these models for language modeling and sampling. We finally discussed some practical considerations when training language models including handing unknown words and backoff and interpolation.

There’s still a lot more to cover! This is just the start of our journey to the precipice of language modeling 🙂