# Tokenization
LLMs don't see words. They see tokens — and getting this wrong silently wrecks your RAG, your fine-tuning, and your cost estimates.
## Why tokenization matters more than you think
When you send “Hello, world!” to GPT-4, it’s not processing 13 characters or 2 words. It’s processing tokens — usually ~4 of them. Every context window limit, every pricing row, every fine-tuning dataset size is measured in tokens, not characters or words.
If you don’t know how your text tokenizes, you don’t know how much it costs, how much of your context window it eats, or how your model sees it.
## What a token is
A token is a sub-word chunk the tokenizer has decided is a useful unit. Common English words are usually one token. Rare words, proper nouns, and non-English scripts break into multiple tokens. A leading space is usually folded into the token that follows it (e.g. ` world`).
Examples in the GPT-4 tokenizer (`cl100k_base`):

| Text | Tokens | Count |
|---|---|---|
| `hello` | `hello` | 1 |
| `Hello world` | `Hello`, ` world` | 2 |
| `antidisestablishmentarianism` | `anti`, `dis`, `est`, `abl`, `ish`, `ment`, `arian`, `ism` | 8 |
| `你好` | (each character is 3 bytes in UTF-8, split across multiple tokens) | 3 |
| `🎉` | (a 4-byte emoji, split into byte-level tokens) | 3 |
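The last two rows come down to UTF-8 byte counts: byte-level BPE tokenizers such as `cl100k_base` operate on UTF-8 bytes, so characters that need more bytes tend to need more tokens. A stdlib-only way to see the byte side of this:

```python
# Byte-level tokenizers see UTF-8 bytes, not characters. Characters that take
# more bytes (CJK, emoji) tend to split into more tokens.
texts = ["hello", "你好", "🎉"]
for t in texts:
    b = t.encode("utf-8")
    print(f"{t!r}: {len(t)} chars, {len(b)} UTF-8 bytes")
```

`hello` is 5 bytes, `你好` is 6 bytes for 2 characters, and `🎉` is 4 bytes for 1 character, which is why the non-ASCII entries fan out into multiple tokens.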
## The three algorithms you’ll encounter

### 1. BPE (Byte Pair Encoding)

Used by: GPT-2, GPT-3, GPT-4, Llama, Mistral, Claude.

BPE builds a vocabulary by starting with individual characters (or bytes), then repeatedly merging the most frequent adjacent pair. The result is a vocab that captures common sub-words.
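The merge loop can be sketched in a few lines. This is a toy trainer showing only the core idea; real tokenizers work on bytes, use word frequencies and pre-tokenization rules, and train to a fixed vocab size:

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    corpus = [tuple(w) for w in words]  # each word starts as a tuple of chars
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        new_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(tuple(out))
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_train(["low", "lower", "lowest", "low"], num_merges=3)
print(merges)  # first two merges: ('l', 'o'), then ('lo', 'w')
```

After three merges, the frequent word `low` has become a single symbol while the rarer suffixes `er` and `est` are still split, which is exactly the behavior the table above shows for real vocabularies.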
### 2. WordPiece

Used by: BERT and its descendants.

Similar to BPE, but the merge criterion is different: WordPiece merges the pair that most increases the likelihood of the training data, rather than the most frequent pair. In practice, BERT-family tokenizers mark sub-word continuations with a `##` prefix (`playing` → `play`, `##ing`).
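At inference time, WordPiece uses greedy longest-match-first segmentation against its vocabulary. A sketch with a tiny hypothetical vocab (not BERT's real one):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first sketch of WordPiece inference."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuations get the ## prefix
            if candidate in vocab:
                piece = candidate  # longest matching piece wins
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matches: fall back to the unknown token
        tokens.append(piece)
        start = end
    return tokens

vocab = {"play", "##ing", "##ed", "run"}
print(wordpiece_tokenize("playing", vocab))  # ['play', '##ing']
```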
### 3. SentencePiece / Unigram

Used by: T5, Llama 1 and 2 (SentencePiece with BPE merges), and many multilingual models.

Treats the input as a raw stream with no whitespace pre-tokenization: spaces become an explicit symbol (▁), which handles scripts that don’t put spaces between words more cleanly.
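The whitespace trick is simple enough to sketch. Spaces are replaced with the visible marker U+2581 (▁) so the model never needs a separate pre-tokenization step and decoding is lossless (a simplified sketch; real SentencePiece also normalizes text and can prepend a dummy ▁ prefix):

```python
def sentencepiece_normalize(text):
    """Replace spaces with the ▁ marker, SentencePiece-style (simplified)."""
    return text.replace(" ", "\u2581")

print(sentencepiece_normalize("Hello world"))  # Hello▁world
```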
## The practical rules

### Rule 1: Count tokens, not characters

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
len(enc.encode("Hello world"))  # 2
```
### Rule 2: Different models, different tokenizers
A 1000-token prompt in GPT-4’s tokenizer is NOT 1000 tokens in Llama 3’s tokenizer. When switching models, re-measure.
### Rule 3: Non-English is expensive
Chinese, Japanese, Korean, Arabic, Hindi — all tokenize at roughly 2-4× the rate of English. If your users speak non-English, your costs and context window effectively shrink.
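The back-of-envelope impact is easy to work out. With illustrative numbers (not measurements): at 3× the token rate, the same context window holds a third of the text and the same text costs three times as much.

```python
context_window = 128_000      # tokens (example window size)
rate_multiplier = 3           # hypothetical non-English vs English token rate
english_chars_per_token = 4   # common rule of thumb for English text

english_capacity = context_window * english_chars_per_token
effective_chars = english_capacity / rate_multiplier
print(f"~{effective_chars:,.0f} characters fit instead of ~{english_capacity:,}")
```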
### Rule 4: Numbers and code tokenize weirdly
A number like 12345 might be 1 token or 5, depending on the tokenizer. Code with unusual whitespace tokenizes inconsistently. Never assume — measure.
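One concrete reason: `cl100k_base`'s pre-tokenizer splits runs of digits into groups of at most three before BPE runs, which is why long numbers cost several tokens. A simplified regex sketch of just that clause:

```python
import re

# Simplified version of the digit clause in cl100k_base's pre-tokenizer regex:
# runs of digits are cut into groups of at most three before BPE merging.
digit_groups = re.findall(r"\d{1,3}", "12345")
print(digit_groups)  # ['123', '45']
```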
### Rule 5: Tokenizer-aware chunking
When chunking documents for RAG, chunk by token count, not character count. Otherwise you’ll silently exceed embedding model limits.
```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def chunk_by_tokens(text, max_tokens=512, overlap=50):
    """Yield chunks of at most max_tokens tokens, overlapping by `overlap` tokens."""
    tokens = enc.encode(text)
    for i in range(0, len(tokens), max_tokens - overlap):
        yield enc.decode(tokens[i:i + max_tokens])
```
## The gotchas that cost you

- Special tokens. `<|endoftext|>`, `<|im_start|>`, `[CLS]`, `[SEP]` — these are tokens too, and including them in user input can confuse the model.
- Tokenizer drift. OpenAI has silently updated tokenizers. Pin the version you’re benchmarking against.
- Vocabulary gaps. If you fine-tune a model and add new terms (e.g., product names), they may tokenize into 5-10 tokens each. Consider adding them as custom tokens if they appear frequently.
- Cross-language contamination in fine-tuning. Mixing English and Chinese training data without understanding tokenization leads to weird failure modes.
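For the special-token gotcha above, a minimal defensive sketch is to scrub the literal token strings from untrusted input before it reaches a prompt (the token list here is illustrative; substitute your model's actual special tokens):

```python
# Illustrative list -- replace with the special tokens your model actually uses.
SPECIAL_TOKENS = ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "[CLS]", "[SEP]"]

def scrub_special_tokens(user_text: str) -> str:
    """Strip literal special-token strings from untrusted user input."""
    for tok in SPECIAL_TOKENS:
        user_text = user_text.replace(tok, "")
    return user_text

print(scrub_special_tokens("hi<|endoftext|> there"))  # hi there
```

Libraries offer stricter options too; `tiktoken`'s `encode` raises on special tokens by default, so you can also catch them at encoding time rather than scrubbing.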
## Useful tools

- tiktokenizer.vercel.app — paste text, see tokens.
- `tiktoken` Python package — programmatic OpenAI tokenization.
- `transformers.AutoTokenizer` — works for most open-weight models.