
LLM Fundamentals: Part 2 -- How LLMs Generate Text

LLMs predict one token at a time from a probability distribution. Temperature, top-k, and top-p control that distribution, not creativity.


This is Part 2 of the LLM Fundamentals series.

Part 1 broke down tokens. Now: how the model picks which token comes next.

Every LLM API exposes three sampling parameters that control how the model selects its next token:

  • Temperature controls how peaked or flat the probability distribution is. Low temperature means the model almost always picks the most likely token. High temperature gives lower-probability tokens a real chance.
  • Top-k limits the model to choosing from only the k most probable tokens, cutting off everything else.
  • Top-p (nucleus sampling) limits the model to choosing from however many tokens are needed to reach a cumulative probability threshold, adapting to how confident the model is at each step.

All three get treated as creativity dials. Slide temperature up for creative writing, slide it down for structured output. It works in practice, but the mental model is wrong. All three control a probability distribution over tokens, and once you see them that way, prompting decisions and debugging get a lot more concrete.

One Token at a Time

Language models generate text autoregressively. Given a sequence of tokens, the model predicts a probability distribution over all possible next tokens. It samples one token from that distribution, appends it to the sequence, and repeats. Every token in every response came out this way: one at a time, left to right, each conditioned on everything before it.
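The loop above can be sketched in a few lines. This is a toy illustration, not a real model: `next_token_probs` is a hypothetical stand-in for the network, mapping the sequence so far to a probability distribution over the next token.

```python
import random

def generate(next_token_probs, prompt_tokens, max_new_tokens):
    """Autoregressive decoding sketch.

    next_token_probs is a stand-in for the model: a function that
    maps the token sequence so far to a dict of {token: probability}.
    """
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)  # distribution over the next token
        next_token = random.choices(
            list(probs), weights=list(probs.values())
        )[0]                              # sample one token from it
        tokens.append(next_token)         # condition on everything so far
    return tokens
```

Everything a model produces comes out of a loop with this shape: predict a distribution, sample, append, repeat.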

The GPT-2 paper wrote this out as modeling p(x) = p(x₁) * p(x₂|x₁) * p(x₃|x₁,x₂) * …, where each factor is a conditional probability of the next token given all previous tokens. Nothing about this process is generative in the way people imagine. No planning, no outline, no revision. Just a sequence of probability lookups, each one informed by the full context so far.

This explains common failure modes. When a model contradicts itself mid-paragraph, it is because the token-by-token process carries no global plan. When it trails off into repetition, it is because the conditional probabilities have collapsed into a loop. Once you know the mechanism, the symptoms become predictable.

What the Model Actually Outputs

At each step, the model produces a vector of raw scores called logits, one per token in its vocabulary. A vocabulary of 100,000 tokens means 100,000 logits. Softmax converts these logits into a proper probability distribution that sums to 1.0.
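Softmax is a few lines of arithmetic. Here is a minimal sketch over a toy logit vector (real vocabularies have ~100,000 entries, and implementations subtract the max logit first for numerical stability, which leaves the result unchanged):

```python
import math

def softmax(logits):
    # Subtract the max logit before exponentiating; this avoids
    # overflow and does not change the resulting probabilities.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([3.1, 2.8, 2.5, -1.0])
# probs sums to 1.0 and preserves the ordering of the logits
```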

Most of the probability mass concentrates on a handful of tokens. For a sentence like “I drove my car to the ___”, the model might assign 12% to “store,” 9% to “shop,” 7% to “garage,” and scatter the remaining 72% across thousands of lower-probability options.

Token probability distribution showing most mass on a few candidates

Holtzman et al. showed that this distribution shape varies dramatically depending on context: some positions have one overwhelmingly likely continuation, while others spread probability across dozens of plausible tokens. A fixed decoding strategy cannot handle both cases well.

Temperature Scales the Distribution

Temperature modifies the logits before softmax converts them to probabilities. Each logit gets divided by the temperature value T, so the softmax input becomes logit/T instead of logit.

When T is low (say 0.2), dividing by a small number amplifies the differences between logits. High-probability tokens become even more dominant. Low-probability tokens become negligible. When T is high (approaching 1.0), the differences shrink and the distribution flattens, giving more tokens a realistic chance of being selected.
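The effect is easy to verify numerically. A small sketch with made-up logits, dividing by T before softmax:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, T):
    # Temperature divides every logit before softmax.
    return softmax([x / T for x in logits])

logits = [2.0, 1.0, 0.5]          # illustrative values
cold = apply_temperature(logits, 0.2)  # peaked: top token dominates
warm = apply_temperature(logits, 1.0)  # flatter: others keep real mass
```

At T=0.2 the top token takes nearly all the mass; at T=1.0 the lower-ranked tokens retain a realistic share.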

Anthropic’s API accepts temperature values from 0.0 to 1.0, with a default of 1.0. At temperature 0.0, the model still does not guarantee deterministic output across API calls. Internal floating-point precision and infrastructure details mean you can get different results from identical inputs, even at zero temperature.

Temperature comparison: T=0.2 peaked vs T=1.0 flat distribution

Temperature 0.0 works well for classification and structured extraction. Temperature 1.0 (the default) suits open-ended generation. Either way, it is not a creativity dial. It is a peakedness control on a probability distribution.

Top-k and Top-p: Two Other Ways to Shape the Distribution

Temperature scales the whole distribution, but two other parameters cut it instead. Top-k keeps only the k most probable tokens and zeros out everything else. Set k to 3, and the model picks from exactly three candidates regardless of how the probability spreads. Top-k’s weakness is rigidity: 3 tokens might be too few when the model is genuinely uncertain, and too many when one token dominates.

Top-p (nucleus sampling) solves this by adapting to the distribution’s shape. Instead of a fixed count, it includes tokens in descending probability order until their cumulative probability reaches the threshold p. Set p to 0.9, and you get however many tokens it takes to cover 90% of the mass. Confident positions include few tokens. Uncertain positions include many.
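Both filters are simple to state in code. A sketch over a toy distribution (token names and probabilities are illustrative, echoing the "I drove my car to the ___" example; real implementations work on the full vocabulary):

```python
def top_k(probs, k):
    """Keep only the k most probable tokens, then renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {t: p / total for t, p in kept}

def top_p(probs, threshold):
    """Keep tokens in descending probability order until their
    cumulative mass reaches the threshold, then renormalize."""
    kept, cumulative = {}, 0.0
    for t, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[t] = p
        cumulative += p
        if cumulative >= threshold:
            break
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

dist = {"store": 0.12, "shop": 0.09, "garage": 0.07,
        "beach": 0.04, "moon": 0.01}
```

Note the difference in behavior: `top_k(dist, 3)` always keeps exactly three tokens, while `top_p` keeps however many the distribution's shape demands.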

Three sampling methods compared: temperature scales, top-k cuts at a fixed count, top-p adapts to the distribution

Most production use cases never touch either one. Anthropic's documentation recommends adjusting them only for advanced use cases, and advises against altering top-p and temperature together. If output quality is off, rewriting the prompt will fix it before sampling parameters will.

One clarification worth making: “parameters” in the API sense (temperature, top-k, top-p) are not the same as model parameters. Model parameters are the billions of learned weights from training, covered in Part 0. API parameters control how you sample from the distribution. Model parameters define the distribution itself.

Practical Defaults

Common defaults, using the Anthropic Python SDK (the Messages API requires `max_tokens`):

```python
import anthropic

client = anthropic.Anthropic()

# Structured output: classification, extraction, JSON
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    temperature=0.0,
    messages=[{"role": "user", "content": prompt}],
)

# Open-ended generation: writing, brainstorming
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    temperature=1.0,
    messages=[{"role": "user", "content": prompt}],
)
```

Temperature 0.0 for deterministic-ish tasks. Temperature 1.0 (the default) for everything else. Top-k and top-p rarely need adjustment, because the prompt itself has far more influence on output quality than sampling parameter tuning.

Reframing the Mental Model

Asking “what temperature should I use?” is really asking what distribution shape you want. Lower temperature for consistent, high-probability continuations. Default temperature for broader exploration.

A model at temperature 1.0 is not more creative. It is sampling from a flatter distribution, which means lower-probability tokens have a better chance of being selected. Sometimes that produces surprising, useful output. Sometimes it produces incoherent garbage. Knowing the mechanism makes the outcome predictable for a given task.

Next up: context windows. Part 3 covers how much context a model can actually hold at once, and what happens when you push against that limit.