arXiv 2018 · NAACL 2019

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, Chang, Lee, Toutanova

TL;DR
Pretrain an encoder-only transformer with masked language modeling and next-sentence prediction, then fine-tune for downstream tasks. Set state of the art on 11 NLP benchmarks.

What it says

BERT takes the encoder half of the original transformer and pretrains it on a huge text corpus with two objectives: masked language modeling (predict randomly hidden tokens from their bidirectional context) and next-sentence prediction. Because attention is bidirectional, each token sees the full left and right context — unlike GPT-style left-to-right models. Fine-tuning adds a small task head on top of the [CLS] token or per-token outputs. The result pushes SOTA across GLUE, SQuAD, and NER by large margins.
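The masking objective can be sketched in a few lines. This is a toy illustration (token strings instead of IDs; `mask_tokens` and `mask_prob` are names chosen here, not from the paper's code), but the 80/10/10 split it implements is the one BERT uses: of the ~15% of positions selected for prediction, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style MLM corruption (toy sketch).

    Selects ~mask_prob of positions as prediction targets; of those,
    80% become [MASK], 10% a random vocab token, 10% stay unchanged.
    Returns corrupted inputs and per-position labels (None = no loss).
    """
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)  # model must predict the original here
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")        # 80%: mask
            elif r < 0.9:
                inputs.append(random.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)             # 10%: unchanged
        else:
            inputs.append(tok)
            labels.append(None)  # position not selected, no loss
    return inputs, labels
```

The unchanged-and-random cases keep the model from relying on [MASK] being present at prediction positions, since [MASK] never appears at fine-tuning time.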

Why it matters

BERT turned “pretrain then fine-tune” into the dominant NLP workflow for half a decade. Encoder-only transformers are still the backbone of nearly all production embedding and classification systems: BGE, E5, sentence-transformers, and most reranker models are direct descendants. Even now, when someone says “just embed it” they mean a BERT-lineage model.

  • RoBERTa (Liu et al., 2019) — the same architecture with a more careful training recipe (more data, no NSP) that beats BERT.
  • DistilBERT (Sanh et al., 2019) — a distilled student model, ~40% smaller and ~60% faster, retaining most of the accuracy.
  • DeBERTa (He et al., 2020) — disentangled attention over content and position, a stronger encoder.