A newer technology, large language models (LLMs) predict a token or a sequence of tokens, sometimes many paragraphs' worth of predicted tokens. Remember that a token can be a word, a subword (a subset of a word), or even a single character. LLMs make much better predictions than N-gram language models or recurrent neural networks because they contain far more parameters and gather far more context.
This section introduces the most successful and widely used architecture for building LLMs: the Transformer.
What's a Transformer?
Transformers are the state-of-the-art architecture for a wide variety of language model applications, such as translation:
Figure 1. A Transformer-based application that translates from English to French.
Full Transformers consist of an encoder and a decoder: the encoder converts input text into an intermediate representation, and the decoder converts that intermediate representation into useful text.
For example, in a translator, the encoder converts the input English text into an intermediate representation, and the decoder converts that intermediate representation into the output French text.
This module focuses on full Transformers, which contain both an encoder and a decoder; however, encoder-only and decoder-only architectures also exist: encoder-only architectures map input text to a representation (such as an embedding), while decoder-only architectures generate new text from a prompt.
To enhance context, Transformers rely heavily on a concept called self-attention. Effectively, on behalf of each token of input, self-attention asks the following question:
"How much does each other token of input affect the interpretation of this token?"
The "self" in "self-attention" refers to the input sequence. Some attention mechanisms weight relations of input tokens to tokens in an output sequence like a translation or to tokens in some other sequence. But self-attention only weights the importance of relations between tokens in the input sequence.
To simplify matters, assume that each token is a word and the complete context is only a single sentence. Consider the following sentence:
The animal didn't cross the street because it was too tired.
The preceding sentence contains eleven words. Each of the eleven words is paying attention to the other ten, wondering how much each of those ten words matters to itself. For example, notice that the sentence contains the pronoun it. Pronouns are often ambiguous. The pronoun it typically refers to a recent noun or noun phrase, but in the example sentence, which recent noun does it refer to—the animal or the street?
The self-attention mechanism determines the relevance of each nearby word to the pronoun it. Figure 3 shows the results—the bluer the line, the more important that word is to the pronoun it. That is, animal is more important than street to the pronoun it.
Figure 3. Self-attention for the pronoun it. From Transformer: A Novel Neural Network Architecture for Language Understanding.
Conversely, suppose the final word in the sentence changes as follows:
The animal didn't cross the street because it was too wide.
In this revised sentence, self-attention would hopefully rate street as more relevant than animal to the pronoun it.
Some self-attention mechanisms are bidirectional, meaning that they calculate relevance scores for tokens preceding and following the word being attended to. For example, in Figure 3, notice that words on both sides of it are examined. So, a bidirectional self-attention mechanism can gather context from words on either side of the word being attended to. By contrast, a unidirectional self-attention mechanism can only gather context from words on one side of the word being attended to. Bidirectional self-attention is especially useful for generating representations of whole sequences, while applications that generate sequences token-by-token require unidirectional self-attention. For this reason, encoders use bidirectional self-attention, while decoders use unidirectional.
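As a sketch of the difference, the only change needed to make the preceding self-attention sketch unidirectional (causal) is a mask that blocks each token from attending to the tokens that follow it. This helper reuses the softmax function defined above.

```python
import numpy as np

def attention_weights(scores, bidirectional=True):
    """Turn raw attention scores into weights, optionally hiding the future.

    scores: (num_tokens, num_tokens) matrix of query-key dot products.
    """
    if not bidirectional:
        # Entries with column index > row index correspond to later tokens;
        # setting their scores to -inf makes softmax give them zero weight.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)
```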
What is multi-head multi-layer self-attention?
Each self-attention layer typically comprises multiple self-attention heads. The output of a layer is a mathematical operation (for example, a weighted average or a dot product) of the outputs of the different heads.
Since the parameters of each head are initialized to random values, different heads can learn different relationships between each word being attended to and the nearby words. For example, the self-attention head described in the previous section focused on determining which noun the pronoun it referred to. However, other self-attention heads within the same layer might learn the grammatical relevance of each word to every other word, or learn other interactions.
A complete transformer model stacks multiple self-attention layers on top of one another. The output from the previous layer becomes the input for the next. This stacking allows the model to build progressively more complex and abstract understandings of the text. While earlier layers might focus on basic syntax, deeper layers can integrate that information to grasp more nuanced concepts like sentiment, context, and thematic links across the entire input.
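Here is a hedged sketch of how multiple heads and multiple layers fit together, reusing the self_attention helper above. Real Transformers concatenate the head outputs and pass them through learned projections, feed-forward blocks, and normalization; this toy version simply averages the heads to show the stacking structure.

```python
import numpy as np

def multi_head_layer(x, heads):
    """One toy self-attention layer.

    heads: list of (w_query, w_key, w_value) tuples, one per head. Each head
    produces its own contextualized representation; this sketch averages them,
    whereas real models concatenate them and apply a learned projection.
    """
    outputs = [self_attention(x, w_q, w_k, w_v) for w_q, w_k, w_v in heads]
    return np.mean(outputs, axis=0)

def stack_layers(x, layers):
    """Feed the output of each self-attention layer into the next."""
    for heads in layers:
        x = multi_head_layer(x, heads)
    return x

# Toy model: 3 layers, 4 heads per layer, 8-dimensional embeddings.
rng = np.random.default_rng(1)
layers = [[tuple(rng.normal(size=(8, 8)) for _ in range(3)) for _ in range(4)]
          for _ in range(3)]
representation = stack_layers(rng.normal(size=(11, 8)), layers)
```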
Self-attention forces every word in the context to learn the relevance of all the other words in the context. So, it is tempting to proclaim this an O(N²) problem, where N is the number of tokens in the context.
As if the preceding Big O weren't disturbing enough, Transformers contain multiple self-attention layers and multiple self-attention heads per self-attention layer, so Big O is actually:
O(N² · S · D)
where N is the number of tokens in the context, S is the number of self-attention layers, and D is the number of self-attention heads per layer.
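To make that growth concrete, here is a back-of-the-envelope count of attention scores per forward pass for some illustrative (made-up) sizes:

```python
def attention_score_count(num_tokens, num_layers, num_heads):
    """Rough number of pairwise attention scores per forward pass: N^2 * S * D."""
    return num_tokens**2 * num_layers * num_heads

# Hypothetical model: 2,048-token context, 24 layers, 16 heads per layer.
print(attention_score_count(2_048, 24, 16))  # 1,610,612,736 scores
# Doubling the context length quadruples that count.
print(attention_score_count(4_096, 24, 16))  # 6,442,450,944 scores
```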
You will probably never train an LLM from scratch. Training an industrial-strength LLM requires enormous amounts of ML expertise, computational resources, and time. Even so, a brief overview of how that training works is worthwhile.
The primary ingredient in building an LLM is a phenomenal amount of training data (text), typically somewhat filtered. The first phase of training is usually some form of unsupervised learning on that training data. Specifically, the model trains on masked predictions, meaning that certain tokens in the training data are intentionally hidden. The model trains by trying to predict those missing tokens. For example, assume the following sentence is part of the training data:
The residents of the sleepy town weren't prepared for what came next.
Random tokens are removed, for example:
The ___ of the sleepy town weren't prepared for ___ came next.
An LLM is just a neural net, so the loss (a measure of how far the model's predictions for the masked tokens are from the actual hidden tokens) guides the degree to which backpropagation updates parameter values.
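The following sketch illustrates that training signal on the example sentence: hide a couple of tokens, have a stand-in model assign probabilities to every vocabulary entry at every position, and compute a cross-entropy loss over only the masked positions. The tiny vocabulary and the random "model" are purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

tokens = ("the residents of the sleepy town weren't prepared "
          "for what came next").split()
vocab = sorted(set(tokens))
token_ids = np.array([vocab.index(t) for t in tokens])

# 1. Hide two tokens, as in the example above ("residents" and "what").
masked_positions = [1, 9]
masked_input = ["___" if i in masked_positions else t
                for i, t in enumerate(tokens)]

# 2. Stand-in model: a probability distribution over the vocabulary at every
#    position. A real LLM would compute these probabilities with a Transformer.
logits = rng.normal(size=(len(tokens), len(vocab)))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# 3. Cross-entropy loss over the masked positions only. The more probability
#    the model places on the hidden tokens, the lower the loss, and
#    backpropagation updates the parameters to reduce it.
loss = -np.mean(np.log(probs[masked_positions, token_ids[masked_positions]]))
print(" ".join(masked_input))
print("loss:", round(float(loss), 3))
```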
A Transformer-based model trained to predict missing data gradually learns to detect patterns and higher-order structures in the data to get clues about the missing token. Consider the following example masked instance:
Oranges are traditionally ___ by hand. Once clipped from a tree, ___ don't ripen.
Extensive training on enormous numbers of masked examples enables an LLM to learn that "harvested" or "picked" are high-probability matches for the first missing token and that "oranges" or "they" are good choices for the second.
An optional further training step called instruction tuning can improve an LLM's ability to follow instructions.
Why are Transformers so large?
Transformers contain hundreds of billions or even trillions of parameters. This course has generally recommended building models with fewer parameters over models with more, since a model with fewer parameters uses fewer resources to make predictions. However, research shows that Transformers with more parameters consistently outperform Transformers with fewer parameters.
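To see where such parameter counts come from, here is a rough, hedged estimate for a hypothetical decoder-only Transformer. The formula counts only the embedding table, the attention projections, and the feed-forward blocks, and the sizes are illustrative rather than those of any specific model.

```python
def approx_transformer_params(vocab_size, d_model, num_layers):
    """Very rough parameter count for a decoder-only Transformer.

    Per layer, attention uses roughly 4 * d_model^2 weights (query, key,
    value, and output projections) and the feed-forward block roughly
    8 * d_model^2 (two projections with a 4x hidden expansion). Biases,
    normalization parameters, and other details are ignored.
    """
    embedding = vocab_size * d_model
    per_layer = 4 * d_model**2 + 8 * d_model**2
    return embedding + num_layers * per_layer

# Illustrative sizes: a 50,000-token vocabulary, 12,288-dimensional
# embeddings, and 96 layers land at roughly 175 billion parameters.
print(approx_transformer_params(vocab_size=50_000, d_model=12_288, num_layers=96))
```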
But how does an LLM generate text?
You've seen how researchers train LLMs to predict a missing word or two, and you might be unimpressed. After all, predicting a word or two is essentially the autocomplete feature built into various text, email, and authoring software. You might be wondering how LLMs can generate sentences or paragraphs or haikus about arbitrage.
In fact, LLMs are essentially autocomplete mechanisms that can automatically predict (complete) thousands of tokens. For example, consider a sentence followed by a masked sentence:
My dog, Max, knows how to perform many traditional dog tricks. ___ (masked sentence)
An LLM can generate probabilities for the masked sentence, including:
Probability  Word(s)
3.1%         For example, he can sit, stay, and roll over.
2.9%         For example, he knows how to sit, stay, and roll over.
A sufficiently large LLM can generate probabilities for paragraphs and entire essays. You can think of a user's questions to an LLM as the "given" sentence followed by an imaginary mask. For example:
User's question: What is the easiest trick to teach a dog?
LLM's response: ___
The LLM generates probabilities for various possible responses.
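Here is a hedged sketch of that autocomplete loop: repeatedly ask the model for a probability distribution over the next token, pick a token, append it, and repeat until an end-of-response token appears. The predict_next_token_probs function is a hypothetical stand-in for a real model.

```python
import numpy as np

def generate(prompt_tokens, predict_next_token_probs, max_new_tokens=30,
             end_token="<end>", rng=None):
    """Generate a response one token at a time, autocomplete-style.

    predict_next_token_probs: hypothetical function mapping the tokens so far
    to (vocabulary, probabilities) for the next token; a real LLM fills this
    role with a Transformer.
    """
    rng = rng or np.random.default_rng()
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        vocabulary, probs = predict_next_token_probs(tokens)
        next_token = rng.choice(vocabulary, p=probs)  # sample from the distribution
        if next_token == end_token:
            break
        tokens.append(next_token)
    return tokens[len(prompt_tokens):]  # only the newly generated tokens
```

Replacing the sampling line with vocabulary[int(np.argmax(probs))] gives greedy decoding, which always picks the single most probable next token.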
As another example, an LLM trained on a massive number of mathematical "word problems" can give the appearance of doing sophisticated mathematical reasoning. However, those LLMs are basically just autocompleting a word problem prompt.
Benefits of LLMs
LLMs can generate clear, easy-to-understand text for a wide variety of target audiences. LLMs can make predictions on tasks they are explicitly trained on. Some researchers claim that LLMs can also make predictions for input they were not explicitly trained on, but other researchers have refuted this claim.
Problems with LLMs
Training an LLM entails many problems, including gathering and filtering an enormous training set and consuming months of time along with enormous computational resources and electricity.
Using LLMs to infer predictions causes the following problems: LLMs hallucinate, meaning their predictions often contain mistakes; inference consumes substantial computational resources and electricity; and, like other ML models, LLMs can reflect the biases present in their training data.
Suppose a Transformer is trained on a billion documents, including thousands of documents containing at least one instance of the word elephant. Which of the following statements are probably true?
Acacia trees, an important part of an elephant's diet, will gradually gain a high self-attention score with the word elephant.
Yes, and this will enable the Transformer to answer questions about an elephant's diet.
The Transformer will associate the word elephant with various idioms that contain the word elephant.
Yes, the system will begin to attach high self-attention scores between the word elephant and other words in elephant idioms.
The Transformer will gradually learn to ignore any sarcastic or ironic uses of the word elephant in training data.
Sufficiently large Transformers trained on a sufficiently broad training set become quite adept at recognizing sarcasm, humor, and irony. So, rather than ignoring sarcasm and irony, the Transformer learns from them.