Papers

Short, opinionated notes on the papers that shape how we build AI systems. Each page has a TL;DR, the key contributions, why it still matters, and where to read next.


Transformers & Language Models

Attention Is All You Need 2017
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin
NeurIPS 2017

Replaces recurrence and convolutions with self-attention. Introduces the Transformer architecture that powers every modern LLM.

#transformers #attention #foundational
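The core operation the paper introduces is scaled dot-product attention: softmax(QKᵀ/√d_k)·V. A minimal NumPy sketch (function and variable names are mine, not from the paper; real implementations add multiple heads, masking, and projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- every query position takes a
    similarity-weighted average of the value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V                                 # weighted sum of values

# toy example: 3 positions attending over 3 key/value positions, d_k = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4)
```

The 1/√d_k scaling keeps the dot products from saturating the softmax as the key dimension grows, which is the paper's stated reason for it.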
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding 2018
Devlin, Chang, Lee, Toutanova
NAACL 2019

Pretrain an encoder-only transformer with masked language modeling and next-sentence prediction, then fine-tune for downstream tasks. Set state of the art on 11 NLP benchmarks.

#transformers #pretraining #nlp
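The masked-LM objective corrupts 15% of input positions with the paper's 80/10/10 rule. A sketch of that corruption step, assuming string tokens for readability (helper name and label convention are mine):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style MLM corruption: select ~15% of positions; of those,
    80% -> [MASK], 10% -> a random vocab token, 10% -> left unchanged.
    Returns (corrupted tokens, labels), with -1 at unselected positions."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [-1] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok        # the model must predict the original here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token (but it still gets a label)
    return corrupted, labels

toks = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(toks, vocab=toks)
```

The 10% random / 10% unchanged cases exist so the encoder cannot rely on [MASK] appearing at every prediction site at fine-tuning time, when no tokens are masked.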
Training Compute-Optimal Large Language Models 2022
Hoffmann, Borgeaud, Mensch, et al.
NeurIPS 2022

For a fixed training compute budget, you should scale model size and training tokens roughly equally. GPT-3 and PaLM were massively undertrained by this rule; a 70B model (Chinchilla) trained on 1.4T tokens beats the 280B Gopher.

#scaling-laws #llms #pretraining
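You can back out the headline numbers from the standard C ≈ 6·N·D FLOPs approximation plus the paper's roughly-20-tokens-per-parameter ratio. A sketch under those two assumptions (exact exponents in the paper differ slightly across its three fitting approaches):

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Rough compute-optimal split: solve C = 6*N*D with D = k*N,
    giving N = sqrt(C / (6k)) parameters and D = k*N tokens."""
    n = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n, tokens_per_param * n

# Chinchilla's own budget of ~5.76e23 FLOPs recovers ~70B params, ~1.4T tokens
n, d = chinchilla_optimal(5.76e23)
print(f"params ≈ {n / 1e9:.0f}B, tokens ≈ {d / 1e12:.1f}T")
```

Both N and D grow as √C, which is the "scale them roughly equally" claim: doubling compute should roughly multiply both model size and token count by √2, not pour everything into parameters.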
Language Models are Few-Shot Learners 2020
Brown, Mann, Ryder, Subbiah, et al.
NeurIPS 2020

Scales the GPT decoder to 175B parameters and shows that a single model, with no gradient updates, can do many tasks from a handful of in-context examples.

#llms #scaling #few-shot
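"In-context learning" here is literal: the demonstrations are just concatenated into the prompt and the model completes it, with no weight updates. A sketch of assembling such a prompt (helper name and Q/A format are illustrative, not from the paper):

```python
def few_shot_prompt(task_description, examples, query):
    """Build a GPT-3-style few-shot prompt: a task description, k worked
    demonstrations, then the query with the answer left for the model."""
    blocks = [task_description]
    for x, y in examples:
        blocks.append(f"Q: {x}\nA: {y}")
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)

prompt = few_shot_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
print(prompt)
```

The paper's finding is that the benefit of adding such demonstrations grows sharply with model scale: the 175B model extracts far more from k in-context examples than smaller models do.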
LLaMA: Open and Efficient Foundation Language Models 2023
Touvron, Lavril, Izacard, et al.
arXiv 2023

A family of 7B–65B decoder transformers trained on public data. The 13B model matches GPT-3 and the weights (eventually) leaked, kicking off the modern open-weights era.

#llms #open-weights #efficiency
