Context Window

context length
LLMs

In one line

The maximum number of tokens an LLM can consider in a single forward pass — prompt plus generated output combined.

What it actually means

Every model is trained with a fixed positional-encoding range, which sets a hard ceiling on how many tokens it can attend to at once. That ceiling — 8k, 128k, 1M, depending on the model — is the context window. It covers everything: the system prompt, the conversation history, any retrieved documents, the user message, and the tokens the model is about to generate. Exceed it and the API errors out, or the oldest tokens are silently truncated.
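A minimal sketch of the pre-flight check this implies — here `count_tokens` is a hypothetical stand-in for a real provider tokenizer, using a rough characters-per-token heuristic:

```python
def count_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English text.
    # A real tokenizer (e.g. the provider's own) should be used in practice.
    return max(1, len(text) // 4)

def fits_in_window(prompt: str, max_new_tokens: int, window: int = 128_000) -> bool:
    """True if the prompt plus the planned generation stays under the ceiling."""
    return count_tokens(prompt) + max_new_tokens <= window

fits_in_window("Summarize this report.", max_new_tokens=1_000)  # → True
```

The key point is that the generation budget counts against the same ceiling as the prompt, so `max_new_tokens` has to be reserved up front.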

Why it matters

Context window size dictates whether you can stuff a whole document in versus needing RAG, how long an agent can run before forgetting earlier steps, and what you can fit in a prompt cache. Bigger isn’t always better: attention cost scales quadratically, latency grows, and most models suffer from “lost in the middle” — they recall the start and end of long contexts much better than the middle.

Example

system prompt          : 400 tokens
retrieved chunks       : 6 200 tokens
chat history           : 1 800 tokens
user message           : 250 tokens
budget for generation  : ?
─────────────────────────────────────
8 650 used → on a 128 000-token window, 119 350 tokens left for generation — plenty of room.
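The arithmetic above, as a quick sketch (numbers taken from the example):

```python
window = 128_000
used = 400 + 6_200 + 1_800 + 250  # system + retrieved chunks + history + user message
budget_for_generation = window - used
print(used, budget_for_generation)  # → 8650 119350
```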

You’ll hear it when

  • Choosing between RAG and “just put it all in the prompt”.
  • Designing an agent’s memory strategy.
  • Diagnosing why a long conversation suddenly forgets early instructions.
  • Comparing model providers on cost and latency for long inputs.
  • Setting up prompt caching.

Related terms