Tokenizer

LLMs

In one line

The component that converts text to and from the integer token IDs an LLM actually consumes.

What it actually means

A tokenizer is trained alongside (or before) a language model. The most common approach is byte-pair encoding (BPE): start with single bytes, count which adjacent pairs occur most often in your corpus, and merge them into a new token. Repeat until you hit your vocabulary size. The result is a deterministic mapping from any string to a sequence of integers and back. Different model families ship different tokenizers — GPT-4o, Llama 3, and Claude all tokenize the same sentence to different lengths.
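The merge loop above can be sketched in a few lines of Python. This is a toy character-level version for illustration (real BPE starts from bytes and runs over a large corpus); the function name and corpus are made up for the example.

```python
# Toy BPE training sketch: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def train_bpe(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    tokens = list(corpus)  # start from single characters (single bytes in real BPE)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))  # count adjacent pairs
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing left worth merging
        merges.append((a, b))
        # Rewrite the token stream with the new merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges

merges = train_bpe("low low lowest", 3)
print(merges)  # "l"+"o" merge first, then "lo"+"w", building up "low"
```

The learned merge list is the tokenizer: applying the same merges in the same order to new text reproduces the deterministic string-to-IDs mapping.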

Why it matters

The tokenizer is the boundary between human text and the model. It controls how much information fits in your context window, how much you pay per request, and how the model “sees” code, JSON, math, and non-English scripts. A bad tokenizer for your domain (e.g. one that fragments URLs into 20 tokens) will quietly tank both quality and cost.

Example

import tiktoken

# Look up the tokenizer GPT-4o uses and round-trip a string.
enc = tiktoken.encoding_for_model("gpt-4o")
ids = enc.encode("Tokenizers matter more than you think.")
print(len(ids), ids)
print(enc.decode(ids))  # decoding the IDs restores the original string

You’ll hear it when

  • Switching model providers and your token counts change.
  • Building a chunker — you almost always want to chunk by tokens, not characters.
  • Adding special tokens for chat formatting or tool calls.
  • Optimising prompts for cost.
  • Investigating why a non-English language uses 3× as many tokens as English for equivalent text.
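The chunking point above is worth making concrete. A sketch of a token-based chunker, written tokenizer-agnostically: `encode`/`decode` are passed in (they could be `enc.encode`/`enc.decode` from the tiktoken example), and a toy character-level tokenizer stands in here so the sketch is self-contained. The function name and parameters are illustrative, not a library API.

```python
# Chunk text by token count, with optional overlap between chunks,
# so chunk sizes line up with the model's context window.
def chunk_by_tokens(text, encode, decode, max_tokens, overlap=0):
    ids = encode(text)
    chunks, i = [], 0
    while i < len(ids):
        chunks.append(decode(ids[i:i + max_tokens]))
        if i + max_tokens >= len(ids):
            break  # this chunk reached the end of the text
        i += max_tokens - overlap
    return chunks

# Toy stand-in tokenizer: one token per character.
encode = lambda s: list(s)
decode = lambda ids: "".join(ids)
print(chunk_by_tokens("abcdefghij", encode, decode, max_tokens=4, overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Chunking by characters instead would give chunks whose token counts vary wildly across languages and content types; counting tokens keeps every chunk safely inside the model's limit.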

Related terms