A language model is a probability distribution over sequences of tokens. Once you have trained a language model, you want to measure how accurately it predicts human language use. This is a difficult task, and you need a metric to evaluate the model. In this article, you will learn about the perplexity metric. Specifically, you will learn:
- What perplexity is, and how to compute it
- How to evaluate the perplexity of a language model with sample data
Let's get started.
Evaluating Perplexity on Language Models
Photo by Lucas Davis. Some rights reserved.
Overview
This article is divided into two parts; they are:
- What Is Perplexity and How to Compute It
- Evaluate the Perplexity of a Language Model with the HellaSwag Dataset
What Is Perplexity and How to Compute It
Perplexity is a measure of how well a language model predicts a sample of text. It is defined as the inverse of the geometric mean of the probabilities of the tokens in the sample. Mathematically, perplexity is defined as:
$$
PPL(x_{1:L}) = \prod_{i=1}^L p(x_i)^{-1/L} = \exp\Big(-\frac{1}{L} \sum_{i=1}^L \log p(x_i)\Big)
$$
Perplexity is a function of a particular sequence of tokens. In practice, it is more convenient to compute perplexity from the mean of the log probabilities, as shown in the formula above.
Perplexity quantifies how much a language model hesitates about the next token on average. If the language model is perfectly certain, the perplexity is 1. If the language model is completely uncertain, every token in the vocabulary is equally likely, and the perplexity equals the vocabulary size. You should not expect perplexity to fall outside this range.
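To make the definition concrete, here is a minimal sketch that computes perplexity from a list of per-token probabilities, using both the product form and the log form of the formula above. The function name perplexity() and the toy probability lists are made up for illustration:

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence, given the probability assigned to each of its tokens."""
    n = len(token_probs)
    # product form: inverse of the geometric mean of the probabilities
    product_form = math.prod(p ** (-1 / n) for p in token_probs)
    # log form: exponential of the negative mean log probability
    log_form = math.exp(-sum(math.log(p) for p in token_probs) / n)
    assert abs(product_form - log_form) < 1e-9  # the two forms agree
    return log_form

print(perplexity([1.0, 1.0, 1.0]))       # perfectly certain model: perplexity 1.0
print(perplexity([0.2, 0.2, 0.2, 0.2]))  # uniform guesses over a 5-token vocabulary: perplexity 5.0
```

The second example shows the upper end of the range: when every prediction is a uniform guess over the vocabulary, the perplexity equals the vocabulary size.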
Evaluate the Perplexity of a Language Model with the HellaSwag Dataset
Perplexity is a dataset-dependent metric. One dataset you can use is HellaSwag. It has train, test, and validation splits. It is available on the Hugging Face Hub, and you can load it with the following code:
```python
import datasets

dataset = datasets.load_dataset("HuggingFaceFW/hellaswag")
print(dataset)

for sample in dataset["validation"]:
    print(sample)
    break
```
Running this code will print the following:
```
DatasetDict({
    train: Dataset({
        features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings',
                   'source_id', 'split', 'split_type', 'label'],
        num_rows: 39905
    })
    test: Dataset({
        features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings',
                   'source_id', 'split', 'split_type', 'label'],
        num_rows: 10003
    })
    validation: Dataset({
        features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings',
                   'source_id', 'split', 'split_type', 'label'],
        num_rows: 10042
    })
})
{'ind': 24, 'activity_label': 'Roof shingle removal', 'ctx_a': 'A man is sitting on a roof.', 'ctx_b': 'he', 'ctx': 'A man is sitting on a roof. he', 'endings': ['is using wrap to wrap a pair of skis.', 'is ripping level tiles off.', "is holding a rubik's cube.", 'starts pulling up roofing on a roof.'], 'source_id': 'activitynet~v_-JhWjGDPHMY', 'split': 'val', 'split_type': 'indomain', 'label': '3'}
```
You can see that the validation split has 10,042 samples; this is the split you will use in this article. Each sample is a dictionary. The key "activity_label" describes the activity category, and the key "ctx" provides the context that needs to be completed. The model is expected to complete the sequence by selecting one of the four endings. The key "label", with values 0 to 3, indicates which ending is correct.
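Before running a full evaluation, it helps to see what the model will actually score: each candidate is the activity label, the context, and one of the four endings joined together. The short sketch below (the variable names are my own, and it uses the same load_dataset call as the full script below) prints the four candidates of the first validation sample and marks the correct one:

```python
import datasets

dataset = datasets.load_dataset("hellaswag", split="validation")
sample = dataset[0]

# the prefix shared by all four candidates
prefix = sample["activity_label"] + ". " + sample["ctx"]
for i, ending in enumerate(sample["endings"]):
    marker = "  <-- correct" if i == int(sample["label"]) else ""
    print(f"[{i}] {prefix} {ending}{marker}")
```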
With this, you can write a short program to evaluate your own language model. Let's use a small model from Hugging Face as an example:
```python
import datasets
import torch
import torch.nn.functional as F
import tqdm
import transformers

model = "openai-community/gpt2"

# Load the model
torch.set_default_device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = transformers.AutoTokenizer.from_pretrained(model)
model = transformers.AutoModelForCausalLM.from_pretrained(model)

# Load the dataset: HellaSwag has train, test, and validation splits
dataset = datasets.load_dataset("hellaswag", split="validation")

# Evaluate the model: Compute the perplexity of each ending
num_correct = 0
for sample in tqdm.tqdm(dataset):
    # tokenize text from the sample
    text = tokenizer.encode(" " + sample["activity_label"] + ". " + sample["ctx"])
    endings = [tokenizer.encode(" " + x) for x in sample["endings"]]  # 4 endings
    groundtruth = int(sample["label"])  # integer, 0 to 3
    # generate logits for each ending
    perplexities = [0.0] * 4
    for i, ending in enumerate(endings):
        # run the entire input and ending through the model
        input_ids = torch.tensor(text + ending).unsqueeze(0)
        output = model(input_ids).logits
        # extract the logits for each token in the ending
        logits = output[0, len(text)-1:, :]
        token_probs = F.log_softmax(logits, dim=-1)
        # accumulate the probability of generating the ending
        log_prob = 0.0
        for j, token in enumerate(ending):
            log_prob += token_probs[j, token]
        # convert the sum of log probabilities to perplexity
        perplexities[i] = torch.exp(-log_prob / len(ending))
    # print the perplexity of each ending
    print(sample["activity_label"] + ". " + sample["ctx"])
    correct = perplexities[groundtruth] == min(perplexities)
    for i, p in enumerate(perplexities):
        if i == groundtruth:
            symbol = '(O)' if correct else '(!)'
        elif p == min(perplexities):
            symbol = '(X)'
        else:
            symbol = '   '
        print(f"Ending {i}: {p:.4g} {symbol} - {sample['endings'][i]}")
    if correct:
        num_correct += 1

print(f"Accuracy: {num_correct}/{len(dataset)} = {num_correct / len(dataset):.4f}")
```
This code loads the smallest GPT-2 model from the Hugging Face Hub. It is a 124M-parameter model that you can easily run on a low-end computer. The model and tokenizer are loaded using the Hugging Face transformers library. You also load the HellaSwag validation dataset.
In the for-loop, you tokenize the activity label and the context, as well as each of the four endings. Note that tokenizer.encode() is the method for using a tokenizer from the transformers library. It is different from the tokenizer object you used in the previous article.
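If you are unsure what tokenizer.encode() returns, a quick check like the following (the example string is arbitrary) shows that it maps a string to a plain Python list of integer token IDs, and that tokenizer.decode() reverses it:

```python
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("openai-community/gpt2")

ids = tokenizer.encode(" A man is sitting on a roof. he")
print(ids)                    # a list of integer token IDs
print(tokenizer.decode(ids))  # back to the original string
```

The leading space matters for GPT-2's byte-pair encoding: " he" and "he" map to different tokens, which is why the script above prepends a space to the activity label and to each ending.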
Next, for each ending, you feed the concatenated context and ending to the model. The input_ids tensor is a 2D tensor of integer token IDs with batch size 1. The model returns an output object, from which you extract the logits tensor. This is different from the model you built in the previous article because this one is a model object from the transformers library. You can easily swap in your own trained model object with minor modifications.
GPT-2 is a decoder-only transformer model. It processes the input with a causal mask. For an input tensor of shape $(1, L)$, the output logits tensor has shape $(1, L, V)$, where $V$ is the vocabulary size. The output at position $p$ is the model's estimate of the token at position $p+1$, conditioned on the input at positions 1 to $p$. Therefore, you extract the logits starting at offset $n-1$, where $n$ is the length of the combined activity label and context. You then convert the logits to log probabilities and average over the length of each ending.
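Because the logit at position $p$ scores the token at position $p+1$, the log probability of an ending can also be gathered in one shot instead of with the inner Python loop. The following is a minimal, equivalent sketch that makes the offset arithmetic explicit; it assumes the model and the text and ending token lists from the script above, and the function name ending_perplexity() is my own:

```python
import torch
import torch.nn.functional as F

def ending_perplexity(model, text, ending):
    # text and ending are lists of token IDs; the model input is their concatenation
    input_ids = torch.tensor(text + ending).unsqueeze(0)  # shape (1, L)
    logits = model(input_ids).logits                      # shape (1, L, V)
    # logits at positions len(text)-1 .. L-2 predict exactly the tokens of the ending
    ending_logits = logits[0, len(text) - 1:-1, :]        # shape (len(ending), V)
    log_probs = F.log_softmax(ending_logits, dim=-1)
    # pick the log probability of each actual ending token
    index = torch.tensor(ending).unsqueeze(1)             # shape (len(ending), 1)
    token_log_probs = log_probs.gather(1, index).squeeze(1)
    return torch.exp(-token_log_probs.mean())
```

Averaging over len(ending) (here via .mean()) is what makes the comparison fair: the four endings have different lengths, and a longer ending would otherwise accumulate a smaller total log probability.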
The value token_probs[j, token] is the log probability at position j for the token with ID token. The mean log probability of the tokens in the ending is then converted to a perplexity. A good model is expected to give the correct ending the lowest perplexity. You can therefore evaluate a model by counting the number of correct predictions over the entire HellaSwag validation split. When you run this code, you will see the following:
```
...
Finance and Business. [header] How to buy a peridot Look at a variety of stones…
Ending 0: 13.02 (X) - Be sure to watch several of the gemstones, particularly eme…
Ending 1: 30.19     - Not only are they among the delicates among them, but they can be…
Ending 2: 34.96 (!) - Familiarize yourself with the different shades that it comes in, …
Ending 3: 28.85     - Neither peridot nor many other jade or allekite stones are necess…
Family Life. [header] How to tell if your teen is being abused Pay attention to…
Ending 0: 16.58     - Try to figure out why they are wearing something that is frowned…
Ending 1: 22.01     - Read the following as a rule for determining your teen's behaviou…
Ending 2: 15.21 (O) - [substeps] For instance, your teen may try to hide the signs of a…
Ending 3: 23.91     - [substeps] Ask your teen if they have black tights (with stripper…
Accuracy: 3041/10042 = 0.3028
```
The code prints the perplexity of each ending, marking the correct answer with (O) or (!) and the model's wrong prediction with (X). You can see that GPT-2 has a perplexity of 10 to 20 even for a correct answer. Advanced LLMs can achieve perplexity below 10, even with a much larger vocabulary than GPT-2. More important is whether the model can identify the correct ending, the one that naturally completes the sentence: it should be the one with the lowest perplexity; otherwise, the model cannot generate the correct ending. GPT-2 achieves only about 30% accuracy on this dataset.
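HellaSwag uses perplexity to rank candidate endings, but you can also report the perplexity of a plain passage of text, which is how the metric is usually quoted for a model on a corpus. Here is a minimal sketch under the same GPT-2 setup; the passage is arbitrary:

```python
import torch
import torch.nn.functional as F
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("openai-community/gpt2")
model = transformers.AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

text = "A man is sitting on a roof. He starts pulling up roofing on the roof."
ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # shape (1, L)

with torch.no_grad():
    logits = model(ids).logits                           # shape (1, L, V)

# logits at positions 0 .. L-2 predict the tokens at positions 1 .. L-1
log_probs = F.log_softmax(logits[0, :-1, :], dim=-1)
token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
print(f"Perplexity: {torch.exp(-token_log_probs.mean()).item():.2f}")
```

Note that the first token of the passage is not scored, because there is no preceding context to predict it from.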
You can also repeat the evaluation with a different model. Here are the results:
- Model openai-community/gpt2: the smallest GPT-2 model, with 124M parameters, used in the code above. The accuracy is 3041/10042, or 30.28%.
- Model openai-community/gpt2-medium: a larger GPT-2 model, with 355M parameters. The accuracy is 3901/10042, or 38.85%.
- Model meta-llama/Llama-3.2-1B: the smallest model in the Llama family, with 1B parameters. The accuracy is 5731/10042, or 57.07%.
As you would expect, larger models achieve higher accuracy.
Note that you should not compare perplexities across models with vastly different architectures. Since perplexity ranges from 1 to the vocabulary size, it depends heavily on the tokenizer. You can see why when you rerun the code above with GPT-2 replaced by Llama 3.2 1B: the perplexity is an order of magnitude higher for Llama 3, yet the accuracy is clearly better. This is because GPT-2 has a vocabulary of only 50,257 tokens, while Llama 3.2 1B has a vocabulary of 128,256 tokens.
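You can check the vocabulary sizes directly from the tokenizers. Here is a minimal sketch (note that meta-llama/Llama-3.2-1B is a gated model, so this assumes your Hugging Face account has been granted access):

```python
import transformers

for name in ["openai-community/gpt2", "meta-llama/Llama-3.2-1B"]:
    tokenizer = transformers.AutoTokenizer.from_pretrained(name)
    # len(tokenizer) counts the full vocabulary, including any added special tokens
    print(f"{name}: vocabulary size {len(tokenizer)}")
```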
Further Readings
Below are some resources that you may find useful:
Summary
In this article, you learned about the perplexity metric and how to evaluate the perplexity of a language model with the HellaSwag dataset. Specifically, you learned:
- Perplexity measures how much a model hesitates about the next token on average.
- Perplexity is sensitive to the vocabulary size.
- Computing perplexity means taking the inverse of the geometric mean of the token probabilities in the sample, usually via the mean log probability.
