Papers
Short, opinionated notes on the papers that shape how we build AI systems. Each page has a TLDR, the key contributions, why it still matters, and where to read next.
Foundational ML & Deep Learning (4 papers)
Adam (2014): An adaptive optimizer that tracks first and second moments of the gradient per parameter. Became the default optimizer for deep learning almost overnight.
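The per-parameter bookkeeping is only a few lines. A minimal NumPy sketch of the update rule on a toy quadratic (defaults match the paper's except the learning rate, which is tuned for the toy problem):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: EMAs of gradient and squared gradient, bias-corrected."""
    m = b1 * m + (1 - b1) * g            # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * g * g        # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)            # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem: minimize f(w) = (w - 3)^2 starting from w = 0.
w = np.array([0.0]); m = np.zeros(1); v = np.zeros(1)
for t in range(1, 5001):
    g = 2.0 * (w - 3.0)                  # gradient of (w - 3)^2
    w, m, v = adam_step(w, g, m, v, t, lr=0.05)
```

Note how the step is normalized by the running gradient scale, which is why one learning rate works across parameters of very different magnitudes.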
AlexNet (2012): The paper that kicked off the deep learning era. A large CNN trained on GPUs crushed ImageNet 2012, cutting the runner-up's top-5 error from 26% to 15%.
Batch Normalization (2015): Normalize each layer's activations over the mini-batch, then rescale with learned parameters. Dramatically speeds up training and reduces sensitivity to initialization.
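The train-time forward pass fits in a few lines; this toy sketch leaves out the inference path, which swaps in running averages of the batch statistics:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then rescale and shift
    with the learned per-feature parameters gamma and beta."""
    mu = x.mean(axis=0)                       # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # zero mean, unit variance
    return gamma * x_hat + beta               # restore representational freedom

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(256, 8))    # batch of 256, 8 features
y = batchnorm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
```

With gamma = 1 and beta = 0 the output is exactly standardized; training then moves gamma and beta wherever the network finds useful.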
ResNet (2015): Adds identity "skip connections" across layers so gradients flow through very deep networks. Made training 150+ layer CNNs practical and won ImageNet 2015.
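Real residual blocks use convolutions plus batch norm; a linear-layer sketch keeps the core idea visible, namely that the block computes x + F(x) rather than F(x):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = x + F(x): the identity path lets the gradient skip the transform,
    so stacking many blocks doesn't starve early layers of signal."""
    return x + relu(x @ W1) @ W2

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(4, d))
W1 = 0.1 * rng.normal(size=(d, d))
W2 = np.zeros((d, d))   # zero init: the block starts as the identity map
y = residual_block(x, W1, W2)
```

Because a zero-initialized block is exactly the identity, a very deep stack starts out as a shallow network and only gradually becomes deeper as the residual branches learn.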
Transformers & Language Models (5 papers)
Attention Is All You Need (2017): Replaces recurrence and convolutions with self-attention. Introduces the Transformer architecture that powers every modern LLM.
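A single attention head, minus the multi-head split, masking, and learned layer norms, is a short NumPy function:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over X of shape (seq, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise token affinities
    A = softmax(scores)                      # each row: a distribution over tokens
    return A @ V, A                          # each output: a weighted mix of values

rng = np.random.default_rng(0)
seq, d = 5, 8
X = rng.normal(size=(seq, d))
Wq, Wk, Wv = [rng.normal(size=(d, d)) for _ in range(3)]
out, A = self_attention(X, Wq, Wk, Wv)
```

Every token attends to every other in one matrix multiply, which is what makes the architecture parallelize so much better than an RNN.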
BERT (2018): Pretrain an encoder-only transformer with masked language modeling and next-sentence prediction, then fine-tune for downstream tasks. Set state of the art on 11 NLP benchmarks.
Chinchilla (2022): For a fixed training compute budget, you should scale model size and training tokens roughly equally. GPT-3 and PaLM were massively undertrained; a 70B model trained on 1.4T tokens beats a 280B one.
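The paper's precise optimum comes from fitted loss curves, but two rules of thumb capture the takeaway: training FLOPs C is roughly 6 * N * D, and the compute-optimal token count D is roughly 20x the parameter count N, so both scale as sqrt(C). A back-of-envelope sketch under those assumptions:

```python
# Back-of-envelope compute-optimal sizing. Both the C ~ 6*N*D FLOP estimate
# and the ~20 tokens-per-parameter ratio are rough heuristics, not the
# paper's fitted curves.
def chinchilla_optimal(C, tokens_per_param=20.0):
    N = (C / (6.0 * tokens_per_param)) ** 0.5    # parameters
    D = tokens_per_param * N                     # training tokens
    return N, D

# Sanity check against Chinchilla itself: 70B params on 1.4T tokens.
C = 6 * 70e9 * 1.4e12        # ~5.9e23 training FLOPs
N, D = chinchilla_optimal(C)
```

Plugging Chinchilla's own budget back in recovers 70B parameters and 1.4T tokens, since 1.4T / 70B is exactly 20.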
GPT-3 (2020): Scales the GPT decoder to 175B parameters and shows that a single model, with no gradient updates, can do many tasks from a handful of in-context examples.
LLaMA (2023): A family of 7B–65B decoder transformers trained on public data. The 13B model matches GPT-3 and the weights (eventually) leaked, kicking off the modern open-weights era.
RAG & Retrieval (3 papers)
ColBERT (2020): Keep per-token embeddings for both query and document and score via a sum of max-similarities. Much more expressive than a single-vector dual encoder, while document embeddings can still be precomputed and indexed offline.
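The MaxSim scoring rule itself is tiny. A toy sketch with random unit vectors standing in for BERT token embeddings (the real model normalizes its per-token embeddings similarly):

```python
import numpy as np

def maxsim_score(Q, D):
    """Late interaction: for each query token embedding, take its best match
    among the document's token embeddings, then sum over query tokens."""
    return (Q @ D.T).max(axis=1).sum()

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = unit(rng.normal(size=(4, 16)))    # 4 query token embeddings (unit norm)
# One document that contains the query's tokens verbatim, one that doesn't.
doc_match = unit(np.vstack([Q, rng.normal(size=(6, 16))]))
doc_other = unit(rng.normal(size=(10, 16)))
s_match = maxsim_score(Q, doc_match)
s_other = maxsim_score(Q, doc_other)
```

A document containing exact matches for all 4 query tokens scores 4.0 (each max-similarity hits 1.0); an unrelated document scores well below that.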
Dense Passage Retrieval (2020): Train a BERT-based dual encoder with in-batch negatives to embed queries and passages into the same space. Beats BM25 for open-domain QA retrieval.
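The in-batch-negatives trick is just a softmax cross-entropy over the batch's score matrix. A toy sketch with random vectors standing in for the BERT encoders, to show the loss shape:

```python
import numpy as np

def in_batch_nll(Q, P):
    """Softmax cross-entropy where passage i is the positive for query i and
    every other passage in the batch serves as a free negative."""
    scores = Q @ P.T                                      # (batch, batch)
    scores = scores - scores.max(axis=1, keepdims=True)   # stabilize softmax
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # -log P(positive | query)

rng = np.random.default_rng(0)
B, d = 8, 32
Q = rng.normal(size=(B, d))
P_aligned = Q + 0.01 * rng.normal(size=(B, d))   # positives near their queries
P_random = rng.normal(size=(B, d))               # untrained embeddings
loss_good = in_batch_nll(Q, P_aligned)
loss_bad = in_batch_nll(Q, P_random)
```

One batch of B pairs yields B positives and B*(B-1) negatives for free, which is why the technique is so sample-efficient.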
Retrieval-Augmented Generation (2020): Introduces the term "RAG" and a model that retrieves passages from a Wikipedia index and conditions a seq2seq generator on them. The blueprint for every RAG system that came after.
Prompting & Reasoning (3 papers)
Chain-of-Thought Prompting (2022): Asking a large model to "think step by step", or showing a few examples that include reasoning traces, dramatically improves accuracy on math and multi-step problems.
ReAct (2022): Interleave reasoning traces ("Thought") with tool calls ("Action") and their results ("Observation") in a single prompt loop. The template every agent framework still uses.
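The control flow is the interesting part, and it is runnable without a model. In this sketch both the "LLM" and the single tool are hard-coded stubs (the `lookup` tool, its tiny knowledge base, and the scripted `respond` are all invented for illustration; in a real agent `respond` would be an LLM call):

```python
def lookup(term):
    """Stubbed tool: a tiny hard-coded knowledge base."""
    kb = {"capital of France": "Paris"}
    return kb.get(term, "not found")

def respond(transcript):
    """Scripted stand-in for the LLM: think and act, then answer once a
    tool result (Observation) is present in the transcript."""
    if "Observation:" not in transcript:
        return "Thought: I should look this up.\nAction: lookup[capital of France]"
    return "Answer: Paris"

def react(question, max_steps=5):
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = respond(transcript)
        transcript += "\n" + step
        if step.startswith("Answer:"):
            return step.removeprefix("Answer: ").strip(), transcript
        # Parse the Action line, run the tool, append the Observation.
        term = step.split("Action: lookup[")[1].rstrip("]")
        transcript += f"\nObservation: {lookup(term)}"
    return None, transcript

answer, transcript = react("What is the capital of France?")
```

The whole pattern is that growing transcript: each model call sees its own prior thoughts plus the environment's observations, in one prompt.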
Tree of Thoughts (2023): Search over a tree of partial reasoning paths with lookahead and backtracking, using the LLM to both generate branches and evaluate them. Big wins on puzzles that trip up linear CoT.
Fine-Tuning & Alignment (4 papers)
Constitutional AI (2022): Replace most human harmlessness labels with model-generated self-critiques guided by a written set of principles ("constitution"). Scales alignment data and gives more transparent rules.
Direct Preference Optimization (2023): Derive a closed-form loss that optimizes a policy against preference data without training a separate reward model or running PPO. Much simpler than RLHF, competitive quality.
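The loss itself is one line once you have per-response log-probabilities from the policy and the frozen reference model. A sketch with made-up scalar log-probs:

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: -log sigmoid(beta * margin), where the margin compares how much
    the policy prefers the chosen response relative to the frozen reference.
    No explicit reward model, no PPO loop."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))

# If the policy matches the reference exactly, the margin is 0 and the
# loss is -log(1/2).
base = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen response's likelihood above the reference lowers it.
better = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

The implicit reward is beta times the policy-to-reference log-ratio, which is how the derivation eliminates the separate reward model.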
InstructGPT (2022): The three-stage recipe (SFT on demonstrations, reward model on human preferences, PPO fine-tuning) turns a raw GPT into a model that follows instructions and refuses harmful requests.
LoRA (2021): Fine-tune giant models by only training small low-rank update matrices instead of all the original weights. Cuts trainable parameters by 10,000x with no quality loss.
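The forward pass is just the frozen weight plus a scaled rank-r product. A toy sketch (the alpha/r scaling and zero-init of one factor follow the paper; the dimensions are arbitrary):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0, r=4):
    """Forward pass with a LoRA adapter: frozen weight W plus the low-rank
    update B @ A, scaled by alpha / r. Only A and B are trained."""
    return x @ (W + (alpha / r) * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4
W = rng.normal(size=(d_out, d_in))        # pretrained weight, frozen
A = 0.01 * rng.normal(size=(r, d_in))     # trainable "down" projection
B = np.zeros((d_out, r))                  # zero init: adapter starts as a no-op
x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B, r=r)
```

Here A and B together hold 512 trainable numbers versus 4,096 in W; at GPT-3 scale, applied only to attention projections, that gap is where the 10,000x comes from. After training, B @ A can be merged into W, so inference costs nothing extra.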
Computer Vision (3 papers)
CLIP (2021): Train an image encoder and a text encoder to embed matching (image, caption) pairs nearby in a shared space. Gives zero-shot image classification across thousands of categories.
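The zero-shot classifier is just nearest-neighbor by cosine similarity between the image embedding and one embedded prompt per class. A toy sketch with random vectors standing in for the trained encoders (the prompts and dimensions are invented):

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def zero_shot_classify(image_emb, text_embs):
    """CLIP-style zero-shot classification: cosine similarity between the
    image embedding and one text embedding per class prompt; highest wins."""
    sims = unit(text_embs) @ unit(image_emb)
    return int(np.argmax(sims)), sims

rng = np.random.default_rng(0)
d = 32
prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embs = rng.normal(size=(len(prompts), d))   # stand-ins for encoded prompts
# Pretend the encoders were trained well: this image lands near the "cat" text.
image_emb = text_embs[1] + 0.05 * rng.normal(size=d)
pred, sims = zero_shot_classify(image_emb, text_embs)
```

Swapping the class list changes the classifier with no retraining, which is the whole point: the "weights" of the classification head are just embedded text.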
Latent Diffusion (2022): Run the diffusion denoising process in the compressed latent space of a pretrained autoencoder instead of in pixel space. Cuts compute by an order of magnitude and enables text-to-image at home.
Vision Transformer (2020): Slice an image into 16x16 patches, treat them as tokens, run a plain transformer. With enough data, it matches or beats CNNs on ImageNet and other benchmarks.
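The patchify step is a pair of reshapes. ViT then linearly projects each flattened patch, prepends a class token, and adds position embeddings; this sketch covers only the slicing:

```python
import numpy as np

def patchify(img, p=16):
    """Slice an (H, W, C) image into non-overlapping p x p patches and
    flatten each patch into a token vector of length p * p * C."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0
    x = img.reshape(H // p, p, W // p, p, C)   # split both spatial axes
    x = x.transpose(0, 2, 1, 3, 4)             # bring the patch grid together
    return x.reshape((H // p) * (W // p), p * p * C)

img = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(img)    # 224x224 RGB -> 196 tokens of dimension 768
```

A standard 224x224 input becomes a sequence of 196 tokens, short enough that the quadratic cost of full self-attention over patches is no problem.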