
From Prompt to Prediction: Understanding Prefill, Decode, and the KV Cache in LLMs


In the previous article, we saw how a language model converts logits into probabilities and samples the next token. But where do these logits come from?

In this tutorial, we take a hands-on approach to understanding the generation pipeline:

  • How the prefill phase processes the entire prompt in a single parallel pass
  • How the decode phase generates tokens one at a time using previously computed context
  • How the KV cache eliminates redundant computation to make decoding efficient

By the end, you’ll understand the two-phase mechanics behind LLM inference and why the KV cache is essential for generating long responses at scale.

Let’s get started.

Photo by Neda Astani. Some rights reserved.

Overview

This article is divided into three parts; they are:

  • How Attention Works During Prefill
  • The Decode Phase of LLM Inference
  • KV Cache: How to Make Decode More Efficient

How Attention Works During Prefill

Consider the prompt:

Today’s weather is so …

As humans, we can infer that the next token should be an adjective, because the last word “so” sets one up. We also know it probably describes weather, so words like “nice” or “warm” are more likely than something unrelated like “delicious”.

Transformers arrive at the same conclusion through attention. During prefill, the model processes the entire prompt in a single forward pass. Every token attends to itself and all tokens before it, building up a contextual representation that captures relationships across the full sequence.

The mechanism behind this is the scaled dot-product attention formula:

$$
\text{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

We’ll walk through this concretely below.
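As a minimal NumPy sketch of this formula (the shapes and random inputs below are only for illustration and are not part of the toy example that follows):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=True):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with an optional causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # one score per query-key pair
    if causal:
        n = scores.shape[0]
        mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
        scores = np.where(mask, -1e9, scores)              # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                     # weighted sum of value vectors

# Example with a random 4-token, 8-dimensional prompt
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8)); K = rng.normal(size=(4, 8)); V = rng.normal(size=(4, 8))
context = scaled_dot_product_attention(Q, K, V)
print(context.shape)   # (4, 8): one context vector per prompt position
```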

To make the attention computation traceable, we assign each token a scalar value representing the information it carries:

Position   Token     Value
1          Today     10
2          weather   20
3          is        1
4          so        5

Words like “is” and “so” carry less semantic weight than “Today” or “weather”, and as we’ll see, attention naturally reflects this.

Attention Heads

In real transformers, attention weights are continuous values learned during training via the $Q$ and $K$ dot product. The behavior of attention heads is learned and usually hard to describe. No head is hardwired to “attend to even positions”. The four rules below are a simplified illustration to make the attention mechanism more intuitive, while the weighted aggregation over $V$ stays the same.

Here are the rules in our toy example:

  1. Attend to tokens at even-numbered positions
  2. Attend to the last token
  3. Attend to the first token
  4. Attend to every token

For simplicity in this example, the outputs from these heads are then combined (averaged).

Let’s walk through the prefill process:

Today

  1. Even tokens → none
  2. Last token → Today → 10
  3. First token → Today → 10
  4. All tokens → Today → 10

weather

  1. Even tokens → weather → 20
  2. Last token → weather → 20
  3. First token → Today → 10
  4. All tokens → average(Today, weather) → 15

is

  1. Even tokens → weather → 20
  2. Last token → is → 1
  3. First token → Today → 10
  4. All tokens → average(Today, weather, is) → 10.33

so

  1. Even tokens → average(weather, so) → 12.5
  2. Last token → so → 5
  3. First token → Today → 10
  4. All tokens → average(Today, weather, is, so) → 9

Parallelizing Attention

If the prompt contained 100,000 tokens, computing attention step by step would be extremely slow. Fortunately, attention can be expressed as tensor operations, allowing all positions to be computed in parallel.

This is the key idea of the prefill phase in LLM inference: when you provide a prompt, it contains multiple tokens, and they can all be processed in parallel. This parallel processing helps speed up the response time for the first generated token.

To prevent tokens from seeing future tokens, we apply a causal mask, so each token can only attend to itself and earlier tokens.
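Here is a small NumPy sketch of this setup, using the toy token values from the table above; the variable names are only for illustration:

```python
import numpy as np

tokens = ["Today", "weather", "is", "so"]
values = np.array([10.0, 20.0, 1.0, 5.0])   # toy scalar value per token
n = len(tokens)

# Lower-triangular causal mask: position i may attend only to positions <= i
causal_mask = np.tril(np.ones((n, n), dtype=bool))
print(causal_mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```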


Now, we can start writing the “rules” for the four attention heads.

Rather than computing scores from learned $Q$ and $K$ vectors, we handcraft them directly to match our four attention rules. Each head produces a score matrix of shape (n, n), with one score per query-key pair, which gets masked and passed through softmax to produce attention weights:
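Below is a minimal sketch of how such handcrafted score matrices might look. The helper names, and the choice to give an all-zero weight row when a rule has no valid position (the “none” case above), are assumptions made for illustration; the numbers it produces match the walkthrough above.

```python
import numpy as np

values = np.array([10.0, 20.0, 1.0, 5.0])   # Today, weather, is, so
n = len(values)
NEG_INF = -1e9

def head_scores(rule):
    """Handcrafted (n, n) score matrix for one toy rule, already causally masked."""
    scores = np.full((n, n), NEG_INF)
    for i in range(n):              # query position
        for j in range(i + 1):      # only positions <= i (causal)
            if (rule == "even" and (j + 1) % 2 == 0) or \
               (rule == "last" and j == i) or \
               (rule == "first" and j == 0) or \
               rule == "all":
                scores[i, j] = 0.0
    return scores

def masked_softmax(scores):
    """Row-wise softmax; rows with no valid score get all-zero weights ('attend to none')."""
    weights = np.zeros_like(scores)
    for i in range(len(scores)):
        valid = scores[i] > NEG_INF / 2
        if valid.any():
            e = np.exp(scores[i, valid])
            weights[i, valid] = e / e.sum()
    return weights

rules = ["even", "last", "first", "all"]
contexts = np.stack([masked_softmax(head_scores(r)) @ values for r in rules], axis=1)
print(contexts[-1])   # context for the last position "so": 12.5, 5.0, 10.0, 9.0
```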


The result of this step is called a context vector, which represents a weighted summary of all earlier tokens.

From Contexts to Logits

Each attention head has learned to pick up on different patterns in the input. Together, the four context values [12.5, 5.0, 10.0, 9.0] form a summary of what “Today’s weather is so…” represents. This is then projected through a matrix, in which each column encodes how strongly a given vocabulary word is associated with each attention head’s signal, to give a logit score per word.

For our example, let’s say we have “nice”, “warm”, and “delicious” in the vocab:
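Here is a sketch of that projection. The projection matrix below is entirely made up for illustration; only the context values come from the example above.

```python
import numpy as np

# Context vector for the last position, one value per attention head
context = np.array([12.5, 5.0, 10.0, 9.0])

# Hypothetical projection matrix: one column per vocabulary word, one row per head.
# Weather-related words load positively on the heads that picked up "weather";
# "delicious" barely loads at all.  (Illustrative numbers only.)
vocab = ["nice", "warm", "delicious"]
W = np.array([
    [0.8, 0.7, 0.0],   # head 1: even positions (saw "weather")
    [0.5, 0.6, 0.1],   # head 2: last token ("so")
    [0.6, 0.5, 0.1],   # head 3: first token ("Today")
    [0.7, 0.8, 0.0],   # head 4: all tokens
])

logits = context @ W
for word, logit in zip(vocab, logits):
    print(f"{word}: {logit:.2f}")
# nice: 24.80, warm: 23.95, delicious: 1.50
```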

So the logits for “nice” and “warm” are much higher than for “delicious”.

The Decode Phase of LLM Inference

Now suppose the model generates the next token: “nice”. The task is now to generate the next token with the extended prompt:

Today’s weather is so nice …

The first four words in the extended prompt are the same as in the original prompt, and now we have a fifth word in the prompt.

During decode, we don’t recompute attention for all earlier tokens, since the result would be the same. Instead, we compute attention only for the new token to save time and compute resources. This produces a single new attention row.


Now, we apply the four attention heads and compute the new context vector:
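As a sketch of this decode step (the scalar value assigned to the newly generated word “nice” is an assumption, since the table above only covers the first four tokens):

```python
import numpy as np

# Extended toy prompt: Today, weather, is, so, nice
values = np.array([10.0, 20.0, 1.0, 5.0, 8.0])   # 8.0 for "nice" is an assumed toy value
n = len(values)
NEG_INF = -1e9

def new_score_row(rule):
    """Handcrafted score row for the NEW position only; it may attend to all earlier tokens."""
    i = n - 1
    row = np.full(n, NEG_INF)
    for j in range(n):
        if (rule == "even" and (j + 1) % 2 == 0) or \
           (rule == "last" and j == i) or \
           (rule == "first" and j == 0) or \
           rule == "all":
            row[j] = 0.0
    return row

def softmax_row(row):
    """Softmax over the unmasked entries of a single score row."""
    valid = row > NEG_INF / 2
    weights = np.zeros_like(row)
    e = np.exp(row[valid])
    weights[valid] = e / e.sum()
    return weights

rules = ["even", "last", "first", "all"]
new_context = np.array([softmax_row(new_score_row(r)) @ values for r in rules])
print(new_context)   # 12.5, 8.0, 10.0, 8.8 with the assumed value for "nice"
```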


However, unlike prefill, where the entire prompt is processed in parallel, decoding must generate tokens one at a time (autoregressively) because the future tokens have not yet been generated. Without caching, every decode step would recompute keys and values for all earlier tokens from scratch, making the total work across all decode steps $O(n^2)$ in the sequence length. The KV cache reduces this to $O(n)$ by computing each token’s $K$ and $V$ exactly once.

KV Cache: How to Make Decode More Efficient

To make autoregressive decoding efficient, we can store the keys ($K$) and values ($V$) for every token separately for each attention head. In this simplified example we would use just one cache. Then, during decoding, when a new token is generated, the model doesn’t recompute keys and values for all earlier tokens. It computes the query for the new token and attends to the cached keys and values from earlier tokens.

If we look at the earlier code again, we can see that there is no need to recompute $K$ for the entire sequence.

Instead, we can simply compute $K$ for the new position and append it to the $K$ matrix we have already computed and stored in the cache, as the sketch below shows.

Here’s the decode phase using a KV cache:
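A minimal sketch of prefill plus one decode step with a KV cache is given below. It uses generic $Q$/$K$/$V$ projections with random weights rather than the handcrafted toy heads, so the shapes, weights, and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 8

# Hypothetical projection matrices (random here, just for illustration)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def prefill(prompt_embeddings):
    """Process the whole prompt in parallel and return (contexts, kv_cache)."""
    Q = prompt_embeddings @ W_q
    K = prompt_embeddings @ W_k
    V = prompt_embeddings @ W_v
    n = len(prompt_embeddings)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)          # causal mask
    scores = np.where(mask, -1e9, Q @ K.T / np.sqrt(d_k))
    contexts = softmax(scores) @ V
    return contexts, {"K": K, "V": V}

def decode_step(new_embedding, kv_cache):
    """Compute attention for ONE new token, appending its K and V to the cache."""
    q = new_embedding @ W_q                                    # query for the new token only
    k = new_embedding @ W_k
    v = new_embedding @ W_v
    kv_cache["K"] = np.vstack([kv_cache["K"], k])              # append, don't recompute
    kv_cache["V"] = np.vstack([kv_cache["V"], v])
    scores = q @ kv_cache["K"].T / np.sqrt(d_k)                # one new attention row
    context = softmax(scores) @ kv_cache["V"]                  # context for the new token
    return context, kv_cache

# Prefill on a 4-token prompt, then decode one more token
prompt = rng.normal(size=(4, d_model))
contexts, cache = prefill(prompt)
new_token = rng.normal(size=(d_model,))
new_context, cache = decode_step(new_token, cache)
print(new_context.shape, cache["K"].shape)   # (8,) (5, 8)
```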


Notice this is identical to the result we computed without the cache. The KV cache doesn’t change what the model computes, but it eliminates redundant computation.

The KV cache differs from caches in other applications in that the stored object is not replaced but updated: every new token added to the prompt appends a new row to the stored tensor. Implementing a KV cache that can efficiently update this tensor is key to making LLM inference faster.

Summary

In this article, we walked through the two phases of LLM inference. During prefill, the full prompt is processed in a single parallel forward pass, and the Keys and Values for every token are computed and stored. During decode, the model generates one token at a time, using only the new token’s Query against the cached Keys and Values to avoid redundant recomputation. Together, these two phases explain why LLMs can process long prompts quickly yet generate output token by token, and why the KV cache is essential for making that generation practical at scale.
