arXiv 2023

LLaMA: Open and Efficient Foundation Language Models

Touvron, Lavril, Izacard, et al.

TL;DR
A family of 7B–65B decoder transformers trained on public data. The 13B model matches GPT-3 and the weights (eventually) leaked, kicking off the modern open-weights era.

What it says

Meta trains a family of decoder-only transformers (7B, 13B, 33B, 65B) on 1.0–1.4T tokens of publicly available text — deliberately more tokens per parameter than the Chinchilla-optimal ratio, trading extra training compute for cheaper inference. Architectural choices: pre-normalization with RMSNorm, SwiGLU activations, and rotary position embeddings (RoPE). They report that LLaMA-13B matches GPT-3 (175B) on most benchmarks while being far cheaper to serve, and that LLaMA-65B is competitive with PaLM-540B.
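The three architectural choices above became the de facto open-model defaults. A minimal NumPy sketch of each, to make the math concrete — function names, shapes, and the single-head RoPE layout are illustrative assumptions, not the paper's code:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # RMSNorm: rescale by root-mean-square only -- no mean subtraction,
    # no variance, unlike LayerNorm. Applied pre-attention and pre-FFN.
    return x * gain / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: a SiLU-gated linear unit in place of ReLU/GELU.
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))      # SiLU(z) = z * sigmoid(z)
    return (silu * (x @ w_up)) @ w_down

def rope(x, pos, base=10000.0):
    # Rotary position embedding: rotate consecutive dimension pairs of a
    # query/key vector by position-dependent angles theta_i = pos * base^(-2i/d).
    d = x.shape[-1]
    theta = pos * base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Two properties worth noticing: RoPE at position 0 is the identity, and because each pair is a pure rotation it preserves vector norms, so attention dot products depend only on relative position.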

Why it matters

LLaMA (and especially its follow-ups LLaMA-2 and LLaMA-3) is the reason production-quality open-weights LLMs exist. Nearly every “open” model family — Mistral, Qwen, Yi, DeepSeek — adopted the same architectural defaults, and the huge ecosystem of fine-tunes, quantizations, and inference engines (llama.cpp, vLLM, Ollama) grew directly out of this release.

  • LLaMA 2 (Touvron et al., 2023) — chat-tuned, with a license allowing commercial use.
  • Mistral 7B (Jiang et al., 2023) — adds GQA and sliding-window attention.
  • Chinchilla (Hoffmann et al., 2022) — the scaling-law paper that informed LLaMA’s data-heavy recipe.