Linear Algebra for ML
The bits of linear algebra you actually need to read ML papers and debug models. Vectors, matrices, dot products, eigenvalues — tied to embeddings, PCA, and attention.
Why you care
Almost every model you’ll touch — from logistic regression to GPT — is ultimately a pile of matrix multiplications followed by a non-linearity. If that sentence is opaque, the rest of this page is for you. If it’s obvious, skim anyway — the intuition for why cosine similarity works and why attention is a matmul is worth refreshing.
Vectors
A vector is an ordered list of numbers. In ML it’s usually a row of features, a word embedding, or a hidden state. Geometrically, a vector points from the origin to a point in n-dimensional space.
import numpy as np
v = np.array([0.2, -0.5, 1.3])
w = np.array([0.1, 0.4, 0.9])
Three operations you use every day:
- Addition: v + w — componentwise. Adds the arrows.
- Scalar multiply: 3 * v — scales the arrow.
- Dot product: v @ w — a single number summarizing how aligned they are. v @ w = sum(v_i * w_i).
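Using the v and w defined above, a quick check of all three (the expected values are just the arithmetic worked out by hand):

```python
import numpy as np

v = np.array([0.2, -0.5, 1.3])
w = np.array([0.1, 0.4, 0.9])

print(v + w)   # componentwise: [0.3, -0.1, 2.2]
print(3 * v)   # scaled arrow: [0.6, -1.5, 3.9]
print(v @ w)   # 0.02 - 0.20 + 1.17 = 0.99
```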
The dot product and cosine similarity
The dot product has a beautiful second form:
v · w = |v| * |w| * cos(θ)
where θ is the angle between the two vectors. Divide both sides by the
magnitudes and you get cosine similarity:
cos(θ) = (v · w) / (|v| * |w|)
This is why embeddings work. An embedding model maps text into vectors where similar meanings point in similar directions. Cosine similarity scores how close two directions are, independent of magnitude.
def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
Every vector database in the world is, under the hood, doing cosine (or dot product, or L2) similarity between a query vector and a stored set.
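The magnitude-independence is easy to verify: scaling a vector changes its length but not its direction, so cosine similarity is unchanged. A small sketch using the cosine function above (the test vector is arbitrary):

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.2, -0.5, 1.3])
print(cosine(a, a))       # 1.0: identical direction
print(cosine(a, 10 * a))  # still 1.0: magnitude doesn't matter
print(cosine(a, -a))      # -1.0: exactly opposite direction
```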
Matrices as linear transformations
A matrix is a grid of numbers, but it’s better to think of it as a
function that takes in a vector and returns another vector. Multiplying
A @ v transforms v into a new point in space.
- Rotations, reflections, and scalings are all matrices.
- A linear layer in a neural network — y = Wx + b — is a matrix multiplication plus a shift.
- Stacking layers composes transformations: h2 = W2 @ (W1 @ x + b1) + b2.
The shape rule: (m, k) @ (k, n) -> (m, n). The inner dimensions must
match; the outer dimensions are the result.
W = np.random.randn(4, 3) # 4 rows, 3 cols
x = np.random.randn(3) # 3-vector
y = W @ x # 4-vector
Matrix multiplication as dot products
Here’s the intuition that unlocks attention. When you compute A @ B, each
entry of the result is a dot product between a row of A and a column
of B. So a matrix multiply is just “all the dot products at once, in
parallel”.
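You can verify this entry by entry. A quick NumPy check (the shapes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = A @ B  # shape (2, 4)

# Every entry of C is one dot product: row i of A with column j of B.
for i in range(2):
    for j in range(4):
        assert np.isclose(C[i, j], A[i] @ B[:, j])
```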
That’s exactly what attention does. Given queries Q, keys K, and
values V:
scores = Q @ K.T                       # every query · every key = similarities
weights = softmax(scores / sqrt(d_k))  # each row becomes a probability distribution
output = weights @ V                   # weighted sum of values
Q @ K.T is a matrix of pairwise dot products — each query checking how
similar it is to each key. It’s cosine similarity without the normalization,
run in parallel across every position. Attention isn’t magic; it’s a batched
similarity search followed by a weighted average.
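Putting the pieces together, here's a minimal single-head sketch in plain NumPy — the softmax helper and the random shapes are illustrative, not taken from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift by max for stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise query-key similarities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted average of the values

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))  # 5 positions, model dim 8
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 8))

out = attention(Q, K, V)
print(out.shape)  # (5, 8): one output vector per query position
```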
Eigenvalues (enough to not be scared)
If A @ v = λv for some scalar λ, then v is an eigenvector of A
and λ is its eigenvalue. Intuitively: v is a direction that A
only stretches, never rotates. Eigenvectors are the “natural axes” of the
transformation.
Why you care:
- PCA finds the eigenvectors of the covariance matrix. They are the directions of maximum variance in the data. Projecting onto the top-k eigenvectors gives you the best k-dimensional approximation.
- Spectral methods, graph embeddings, and stability analysis of optimization all rely on eigenvalues.
- Understanding along which directions a gradient blows up or shrinks during training.
You don’t need to compute them by hand. np.linalg.eig(A) exists.
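The defining property is easy to check numerically. A sketch with a small symmetric matrix (chosen so the eigenvalues come out to nice numbers, 1 and 3):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # symmetric, so the eigenvalues are real

eigvals, eigvecs = np.linalg.eig(A)  # eigenvectors are the *columns* of eigvecs
lam, v = eigvals[0], eigvecs[:, 0]

# The defining property: A only stretches v, it never rotates it.
assert np.allclose(A @ v, lam * v)
print(np.sort(eigvals))  # eigenvalues 1 and 3 for this matrix
```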
Where this shows up in models
- Embeddings → vectors and cosine similarity.
- Linear layers → matrix multiplies.
- Attention → matmul of queries and keys, softmax, matmul with values.
- Convolutions → structured, sparse matrix multiplies (really).
- PCA and dimensionality reduction → eigendecomposition.
- Gradient descent → vector field on a loss surface; eigenvalues of the Hessian tell you the curvature.
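The convolution claim above is worth seeing concretely. A 1-D sketch, using the cross-correlation convention that deep-learning "convolutions" actually use — the same filter written once as a sliding window and once as a banded matrix:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # input signal
k = np.array([1.0, -1.0])           # kernel: a difference filter

# "Valid" 1-D convolution as a sliding window
direct = np.array([x[i] * k[0] + x[i + 1] * k[1] for i in range(3)])

# The same operation as multiplication by a structured, sparse (banded) matrix
K = np.array([
    [1.0, -1.0,  0.0,  0.0],
    [0.0,  1.0, -1.0,  0.0],
    [0.0,  0.0,  1.0, -1.0],
])
assert np.allclose(K @ x, direct)
print(direct)  # [-1. -1. -1.]
```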
The minimum NumPy toolkit
import numpy as np
# Create
v = np.array([1.0, 2.0, 3.0])
A = np.array([[1, 2], [3, 4]])
# Shape and transpose
A.shape # (2, 2)
A.T # transpose
# Products
v @ v # dot product (scalar)
A @ v # matrix-vector (vector)
A @ A # matrix-matrix
# Norms and decomps
np.linalg.norm(v)
np.linalg.inv(A)
np.linalg.eig(A)
np.linalg.svd(A)
If you can read that block and picture what each line does, you can read almost any ML paper’s math section.