Linear Algebra for ML
The bits of linear algebra you actually need to read ML papers and debug models. Vectors, matrices, dot products, eigenvalues — tied to embeddings, PCA, and attention.
Why you care
Almost every model you’ll touch — from logistic regression to GPT — is ultimately a pile of matrix multiplications followed by a non-linearity. If that sentence is opaque, the rest of this page is for you. If it’s obvious, skim anyway — the intuition for why cosine similarity works and why attention is a matmul is worth refreshing.
Vectors
A vector is an ordered list of numbers. In ML it’s usually a row of features, a word embedding, or a hidden state. Geometrically, a vector points from the origin to a point in n-dimensional space.
import numpy as np
v = np.array([0.2, -0.5, 1.3])
w = np.array([0.1, 0.4, 0.9])
Three operations you use every day:
- Addition: v + w — componentwise. Adds the arrows.
- Scalar multiply: 3 * v — scales the arrow.
- Dot product: v @ w — a single number summarizing how aligned they are. v @ w = sum(v_i * w_i).
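Using the v and w defined above, a quick check of all three (the expected values are just the arithmetic worked out by hand):

```python
import numpy as np

v = np.array([0.2, -0.5, 1.3])
w = np.array([0.1, 0.4, 0.9])

print(v + w)   # componentwise: [0.3, -0.1, 2.2]
print(3 * v)   # scaled arrow: [0.6, -1.5, 3.9]
print(v @ w)   # 0.02 - 0.20 + 1.17 = 0.99
```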
The dot product and cosine similarity
The dot product has a beautiful second form:
v · w = |v| * |w| * cos(θ)
where θ is the angle between the two vectors. Divide both sides by the
magnitudes and you get cosine similarity:
cos(θ) = (v · w) / (|v| * |w|)
This is why embeddings work. An embedding model maps text into vectors where similar meanings point in similar directions. Cosine similarity scores how close two directions are, independent of magnitude.
def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
Every vector database in the world is, under the hood, doing cosine (or dot product, or L2) similarity between a query vector and a stored set.
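The magnitude-independence is easy to verify: scaling a vector changes its length but not its direction, so cosine similarity is unchanged. A small sketch using the cosine function above (the test vector is arbitrary):

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.2, -0.5, 1.3])
print(cosine(a, a))       # 1.0: identical direction
print(cosine(a, 10 * a))  # still 1.0: magnitude doesn't matter
print(cosine(a, -a))      # -1.0: exactly opposite direction
```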
Matrices as linear transformations
A matrix is a grid of numbers, but it’s better to think of it as a
function that takes in a vector and returns another vector. Multiplying
A @ v transforms v into a new point in space.
- Rotations, reflections, and scalings are all matrices.
- A linear layer in a neural network — y = Wx + b — is a matrix multiplication plus a shift.
- Stacking layers composes transformations: h2 = W2 @ (W1 @ x + b1) + b2.
The shape rule: (m, k) @ (k, n) -> (m, n). The inner dimensions must
match; the outer dimensions are the result.
W = np.random.randn(4, 3) # 4 rows, 3 cols
x = np.random.randn(3) # 3-vector
y = W @ x # 4-vector
Matrix multiplication as dot products
Here’s the intuition that unlocks attention. When you compute A @ B, each
entry of the result is a dot product between a row of A and a column
of B. So a matrix multiply is just “all the dot products at once, in
parallel”.
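You can verify this entry by entry. A quick NumPy check (the shapes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = A @ B  # shape (2, 4)

# Every entry of C is one dot product: row i of A with column j of B.
for i in range(2):
    for j in range(4):
        assert np.isclose(C[i, j], A[i] @ B[:, j])
```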
That’s exactly what attention does. Given queries Q, keys K, and
values V:
scores = Q @ K.T                       # every query · every key = similarities
weights = softmax(scores / sqrt(d_k))  # each row becomes a probability distribution
output = weights @ V                   # weighted sum of values
Q @ K.T is a matrix of pairwise dot products — each query checking how
similar it is to each key. It’s cosine similarity without the normalization,
run in parallel across every position. Attention isn’t magic; it’s a batched
similarity search followed by a weighted average.
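Putting the pieces together, here's a minimal single-head sketch in plain NumPy — the softmax helper and the random shapes are illustrative, not taken from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift by max for stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise query-key similarities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted average of the values

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))  # 5 positions, model dim 8
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 8))

out = attention(Q, K, V)
print(out.shape)  # (5, 8): one output vector per query position
```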
Eigenvalues (enough to not be scared)
If A @ v = λv for some scalar λ, then v is an eigenvector of A
and λ is its eigenvalue. Intuitively: v is a direction that A
only stretches, never rotates. Eigenvectors are the “natural axes” of the
transformation.
Why you care:
- PCA finds the eigenvectors of the covariance matrix. They are the directions of maximum variance in the data. Projecting onto the top-k eigenvectors gives you the best k-dimensional approximation.
- Spectral methods, graph embeddings, and stability analysis of optimization all rely on eigenvalues.
- Understanding along which directions a gradient blows up or shrinks during training.
You don’t need to compute them by hand. np.linalg.eig(A) exists.
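The defining property is easy to check numerically. A sketch with a small symmetric matrix (chosen so the eigenvalues come out to nice numbers, 1 and 3):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # symmetric, so the eigenvalues are real

eigvals, eigvecs = np.linalg.eig(A)  # eigenvectors are the *columns* of eigvecs
lam, v = eigvals[0], eigvecs[:, 0]

# The defining property: A only stretches v, it never rotates it.
assert np.allclose(A @ v, lam * v)
print(np.sort(eigvals))  # eigenvalues 1 and 3 for this matrix
```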
Where this shows up in models
- Embeddings → vectors and cosine similarity.
- Linear layers → matrix multiplies.
- Attention → matmul of queries and keys, softmax, matmul with values.
- Convolutions → structured, sparse matrix multiplies (really).
- PCA and dimensionality reduction → eigendecomposition.
- Gradient descent → vector field on a loss surface; eigenvalues of the Hessian tell you the curvature.
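The convolution claim above is worth seeing concretely. A 1-D sketch, using the cross-correlation convention that deep-learning "convolutions" actually use — the same filter written once as a sliding window and once as a banded matrix:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # input signal
k = np.array([1.0, -1.0])           # kernel: a difference filter

# "Valid" 1-D convolution as a sliding window
direct = np.array([x[i] * k[0] + x[i + 1] * k[1] for i in range(3)])

# The same operation as multiplication by a structured, sparse (banded) matrix
K = np.array([
    [1.0, -1.0,  0.0,  0.0],
    [0.0,  1.0, -1.0,  0.0],
    [0.0,  0.0,  1.0, -1.0],
])
assert np.allclose(K @ x, direct)
print(direct)  # [-1. -1. -1.]
```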
The minimum NumPy toolkit
import numpy as np
# Create
v = np.array([1.0, 2.0, 3.0])
A = np.array([[1, 2], [3, 4]])
# Shape and transpose
A.shape # (2, 2)
A.T # transpose
# Products
v @ v # dot product (scalar)
A @ v # matrix-vector (vector)
A @ A # matrix-matrix
# Norms and decomps
np.linalg.norm(v)
np.linalg.inv(A)
np.linalg.eig(A)
np.linalg.svd(A)
If you can read that block and picture what each line does, you can read almost any ML paper’s math section.