
LLM Fundamentals: Part 1 -- Tokens Are Not Words

LLMs don't process words or characters. They process tokens, subword units that determine cost, speed, and context limits for everything you build.


This is Part 1 of the LLM Fundamentals series.

How BPE Builds a Vocabulary

Most modern LLMs use a tokenization method called Byte Pair Encoding. Sennrich et al. introduced BPE for neural machine translation in 2016, adapting a data compression algorithm to solve an open-vocabulary problem: how do you handle words the model has never seen?

BPE starts with individual characters as its base vocabulary. It then scans the training corpus and repeatedly merges the most frequent pair of adjacent tokens into a new token. Given a tiny corpus containing “hug” (10 times), “pug” (5 times), and “hugs” (5 times), the pair “u” + “g” appears 20 times total, so BPE merges them into “ug” first. Next round, “h” + “ug” merges into “hug.” Common substrings get compressed into single tokens. Rare combinations stay split.
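The merge loop is simple enough to sketch directly. Below is a minimal, illustrative Python version of BPE training on the toy corpus above; a real tokenizer adds byte-level handling, tie-breaking rules, and a vocabulary-size cap, none of which appear here:

```python
from collections import Counter

def pair_counts(corpus):
    # Count adjacent token pairs, weighted by how often each word occurs.
    counts = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(corpus, pair):
    # Replace every occurrence of the chosen pair with one merged token.
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus from the example: "hug" x10, "pug" x5, "hugs" x5
corpus = {("h", "u", "g"): 10, ("p", "u", "g"): 5, ("h", "u", "g", "s"): 5}
merges = []
for _ in range(2):
    counts = pair_counts(corpus)
    best = max(counts, key=counts.get)  # most frequent adjacent pair
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)  # [('u', 'g'), ('h', 'ug')]
```

The first merge is "u" + "g" (20 occurrences), the second "h" + "ug" (15), exactly as traced above.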

After enough merges, the vocabulary reaches a target size, typically 30,000 to 100,000 tokens. At that point, any input text can be broken down by applying the learned merge rules in order. Common English words like “the” or “and” become single tokens. A rare technical term or foreign name might split into three, four, or five pieces.

Here is the core insight: BPE adapts its vocabulary based on frequency in the training data. Words the model saw millions of times during training get efficient single-token representations. Words it rarely encountered get split into smaller subword chunks, each of which the model has seen frequently in other contexts. No word is truly unknown, because every word decomposes into pieces the model recognizes.
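Decomposition of an unseen word is just the learned merge rules replayed in order. Here is a minimal sketch using the two rules from the toy corpus; the word "hugging" and the two-rule table are illustrative, not from any real tokenizer:

```python
def apply_merges(word, merges):
    # Tokenize a word by replaying learned merge rules in learning order.
    tokens = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# Merge rules learned from the toy corpus, in order
merges = [("u", "g"), ("h", "ug")]
print(apply_merges("hugging", merges))  # ['hug', 'g', 'i', 'n', 'g']
```

"hugging" never appeared in the corpus, yet it still tokenizes: the familiar chunk "hug" comes out as one token, and the leftover characters fall back to single-character tokens.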

One Token Is Not One Word

A rough rule of thumb: one token equals approximately four English characters, or about 0.75 words. But that ratio shifts dramatically depending on content.

Common English prose tokenizes efficiently. Code tokenizes less efficiently because variable names, syntax symbols, and whitespace all consume tokens. Non-Latin scripts can use 2 to 4 times more tokens per word than English, because the BPE training data skewed heavily toward English text.

Consider what this means for a 200,000-token context window. In clean English, that is roughly 150,000 words, enough for a full novel. In Python code with long variable names and docstrings, maybe 80,000 to 100,000 words equivalent. In Japanese or Korean, considerably less. Same token budget, different effective capacity depending on what you put in it.
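A back-of-the-envelope converter makes the point concrete. The prose ratio below follows the 0.75 words-per-token rule of thumb; the code and Japanese ratios are assumptions for illustration and vary by tokenizer and text:

```python
# Rough, illustrative words-per-token ratios; real values vary by
# tokenizer, model, and the specific text.
WORDS_PER_TOKEN = {
    "english_prose": 0.75,  # rule of thumb from above
    "python_code": 0.45,    # assumed: identifiers, syntax, whitespace cost extra
    "japanese": 0.35,       # assumed: non-Latin scripts tokenize less efficiently
}

def effective_words(token_budget: int, content_type: str) -> int:
    # Convert a context-window token budget into an approximate word capacity.
    return int(token_budget * WORDS_PER_TOKEN[content_type])

for kind in WORDS_PER_TOKEN:
    print(kind, effective_words(200_000, kind))
```

Same 200,000-token budget, three very different effective capacities.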

Why Tokens Matter for Cost

API pricing is per-token, and providers price input and output tokens separately. Input tokens (your prompt, system instructions, and conversation history) are cheaper. Output tokens (what the model generates) cost more, because generation requires sequential computation while input processing can be parallelized.

Every design decision affects token consumption. A verbose system prompt costs tokens on every single API call. A long conversation history accumulates input tokens turn by turn. Asking the model to produce structured JSON output instead of natural language can change output token count significantly.

I learned this building production workflows. When token awareness shifts from academic to financial, you start paying attention to things like system prompt length and conversation history growth. A chatbot serving 10,000 daily conversations with 2,000 tokens per exchange is processing 20 million tokens per day. At that scale, a 10% reduction in prompt size from tighter system instructions saves real money.
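Using those numbers, the arithmetic is easy to script. The per-million-token price below is a placeholder, not a real quote; substitute your provider's current rates:

```python
conversations_per_day = 10_000
tokens_per_exchange = 2_000
price_per_million = 3.00  # placeholder USD price per million tokens, not a real quote

daily_tokens = conversations_per_day * tokens_per_exchange  # 20,000,000 tokens/day
daily_cost = daily_tokens / 1_000_000 * price_per_million
savings = daily_cost * 0.10  # what a 10% prompt-size reduction recovers

print(daily_tokens, daily_cost, round(savings, 2))
```

At these placeholder rates the 10% trim recovers a few dollars per day; at real production prices and volumes, the same percentage compounds into a meaningful line item.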

Count Before You Send

Anthropic provides a dedicated token counting endpoint that accepts the same structured input as the Messages API, including system prompts, tools, images, and PDFs. It returns the exact input token count before you make the actual request.

import anthropic

client = anthropic.Anthropic()

response = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Hello, Claude"}],
)
print(response.input_tokens)  # e.g. 12

Token counting is free to use and rate-limited separately from message creation, so calling it does not eat into your messaging quota. For applications where you need to stay within context limits or estimate costs before committing to a generation, this endpoint removes the guesswork.

I use it for routing decisions in my own agent workflows. If a user’s message plus conversation history exceeds a certain token threshold, I can switch to a model with a larger context window, summarize older messages, or truncate. Knowing the exact count before sending lets you make that decision programmatically rather than catching a context-overflow error after the fact.
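That routing logic reduces to a small pure function once you have the exact count. The thresholds and action names here are hypothetical sketches; tune them to your models' actual context windows:

```python
# Hypothetical thresholds (in tokens); adjust to your models' context windows.
SUMMARIZE_ABOVE = 150_000
SWITCH_MODEL_ABOVE = 190_000

def route(input_tokens: int) -> str:
    # Decide how to handle a request given its exact pre-send token count.
    if input_tokens > SWITCH_MODEL_ABOVE:
        return "switch_to_larger_context_model"
    if input_tokens > SUMMARIZE_ABOVE:
        return "summarize_older_messages"
    return "send_as_is"

print(route(120_000))  # send_as_is
```

Feed it the result of the token counting endpoint and the overflow handling becomes an explicit branch instead of an exception handler.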

What Comes Next

Tokens are the atomic unit. Context windows, generation costs, sampling parameters: everything downstream is measured in tokens. If you have been thinking in words, you have been estimating with the wrong ruler.

Next up: how LLMs actually generate text. Once you know what tokens are, the next question is how the model picks which one comes next, and what temperature, top-p, and top-k really control.