NumPy Essentials for ML

NumPy is the substrate every ML framework is built on. Vectorized operations, broadcasting, axis semantics — the stuff that makes the difference between a fast and a slow model.

Python beginner #python #numpy #arrays #vectorization

Prereqs: Python basics

Why NumPy matters

PyTorch tensors, TensorFlow tensors, and JAX arrays all borrow heavily from NumPy’s array semantics: shapes, dtypes, broadcasting, and indexing. Master NumPy and most framework APIs feel familiar.

More importantly: a vectorized NumPy operation is typically 10-100× faster than the equivalent Python for-loop, and sometimes more, because the loop runs in compiled code instead of the interpreter. Getting this reflex right early is the single biggest performance skill in data-heavy ML work.

The things that matter

1. Vectorization (not loops)

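import numpy as np
data = np.arange(1_000_000, dtype=np.float64)  # keep data as an ndarray, not a list

# Bad: Python-level loop, roughly 100 ms for a million elements (machine-dependent)
result = [x ** 2 for x in data]

# Good: one vectorized operation, roughly 1 ms
result = data ** 2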

Every time you write a for-loop over array elements, ask yourself: can I do this with + - * / ** @ and some axis arguments?
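As a rough sketch of that reflex, here is a nested loop rewritten with arithmetic operators and an axis argument. The array `points` and both helper functions are made up for illustration, and timings are left out because they depend on the machine.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 3))   # hypothetical data: 1000 points in 3D

def norms_loop(p):
    """Euclidean length of each row, computed with Python loops."""
    out = []
    for row in p:
        total = 0.0
        for value in row:
            total += value ** 2
        out.append(total ** 0.5)
    return np.array(out)

def norms_vectorized(p):
    """Same result: square elementwise, sum along axis 1, take the square root."""
    return np.sqrt((p ** 2).sum(axis=1))

loop_result = norms_loop(points)
vec_result = norms_vectorized(points)
print(np.allclose(loop_result, vec_result))
```

Both functions return an array of shape (1000,). The vectorized one replaces two nested loops with three array operations.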

2. Broadcasting

NumPy auto-aligns shapes when they’re compatible: dimensions are compared from the trailing end, and each pair must either match or contain a 1. A (3, 1) and a (1, 4) broadcast to (3, 4). This is how you vectorize operations between rows and columns without explicit loops.

mean = data.mean(axis=0)         # shape (features,)
centered = data - mean           # shape (n, features) - (features,) → (n, features)

When broadcasting fails, the error message is your friend. Read the shapes.
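A small sketch of both cases, one broadcast that works and one that fails. The array names are illustrative only.

```python
import numpy as np

col = np.arange(3).reshape(3, 1)     # shape (3, 1)
row = np.arange(4).reshape(1, 4)     # shape (1, 4)
table = col * 10 + row               # broadcasts to shape (3, 4)
print(table.shape)

# Shapes are compared right to left: each pair of dimensions must be equal,
# or one of them must be 1. (3, 4) against (3,) fails because 4 != 3.
data = np.ones((3, 4))
per_row = np.ones(3)
try:
    data + per_row
    failed = False
except ValueError as err:
    failed = True
    print("broadcast error:", err)

# Adding a trailing axis states the intent: (3, 4) + (3, 1) -> (3, 4)
fixed = data + per_row[:, np.newaxis]
print(fixed.shape)
```

The ValueError text lists the two shapes that could not be aligned, which is usually enough to spot the missing or misplaced axis.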

3. Axis semantics

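axis=0 runs down the rows and collapses them, producing one value per column. axis=1 runs across the columns, producing one value per row. The axis you name is the one that disappears from the result’s shape. Get this backward once and every pipeline bug after will trace back to it.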

matrix.sum(axis=0)   # column sums, shape (n_cols,)
matrix.sum(axis=1)   # row sums, shape (n_rows,)
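To make the shapes concrete, here is a minimal sketch on a 2×3 array. The `keepdims=True` argument is an addition not covered above; it is shown because it pairs naturally with broadcasting.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)   # 2 rows, 3 columns: [[0, 1, 2], [3, 4, 5]]

col_sums = matrix.sum(axis=0)         # axis 0 (rows) collapses -> shape (3,)
row_sums = matrix.sum(axis=1)         # axis 1 (columns) collapses -> shape (2,)

# keepdims=True leaves the reduced axis in place with length 1, so the
# result broadcasts back against the original without a reshape.
row_sums_2d = matrix.sum(axis=1, keepdims=True)   # shape (2, 1)
row_fractions = matrix / row_sums_2d              # each row now sums to 1

print(col_sums, row_sums, row_sums_2d.shape)
```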

4. Indexing and slicing

Basic slicing: a[1:4], a[:, 2], a[::2].
Boolean masking: a[a > 0].
Fancy indexing: a[[0, 3, 5]].
reshape, expand_dims, squeeze, transpose — the shape-wrangling quartet.
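The sketch below runs through these on a small array. It also shows one detail not spelled out above: basic slicing returns a view into the original data, while boolean masks and fancy indexing return copies. Variable names are illustrative.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# Basic slicing returns a view: writing through it changes `a`.
first_col = a[:, 0]
first_col[0] = 100
print(a[0, 0])

# Fancy indexing returns a copy: `a` is untouched by writes to `picked`.
picked = a[[0, 2]]        # rows 0 and 2, shape (2, 4)
picked[0, 1] = -1
print(a[0, 1])

# Boolean masking also copies, and always yields a 1D result.
evens = a[a % 2 == 0]

# Shape wrangling on a 1D vector of length 4.
v = np.arange(4)
as_column = v.reshape(-1, 1)            # shape (4, 1)
as_row = np.expand_dims(v, axis=0)      # shape (1, 4)
back_to_1d = np.squeeze(as_row)         # shape (4,)
flipped = a.T                           # transpose, shape (4, 3)
```

If a slice must not alias the original, calling `.copy()` on it is the usual way to detach it.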

5. Common operations

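np.dot(a, b)           # matrix multiply for 2D arrays; a @ b is the usual spelling
np.linalg.norm(v)      # Euclidean (L2) norm of a vector
np.argmax(x, axis=-1)  # index of the largest value along the last axis
np.where(mask, a, b)   # elementwise: a where mask is True, otherwise b
# Joining and splitting: np.concatenate, np.stack, np.split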
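Here is how these calls might fit together in a toy classification step. The `features` and `weights` arrays are hypothetical random data, not taken from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 3))    # hypothetical batch: 5 samples, 3 features
weights = rng.normal(size=(3, 4))     # hypothetical weights: 3 features -> 4 classes

logits = features @ weights                   # matrix multiply, shape (5, 4)
predicted = np.argmax(logits, axis=-1)        # best class per sample, shape (5,)
clipped = np.where(logits > 0, logits, 0.0)   # keep positives, zero the rest (a ReLU)
lengths = np.linalg.norm(features, axis=1)    # Euclidean norm of each row, shape (5,)

# concatenate joins along an existing axis; stack creates a new one.
joined = np.concatenate([features, features], axis=0)   # shape (10, 3)
stacked = np.stack([features, features], axis=0)        # shape (2, 5, 3)
top, bottom = np.split(joined, 2, axis=0)               # two (5, 3) pieces
```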

The gotcha nobody warns you about

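np.array([1, 2, 3]).shape is (3,), NOT (3, 1) or (1, 3). The trailing comma is just Python’s notation for a one-element tuple: the array has a single axis, so it is neither a row nor a column. In broadcasting, a (3,) array lines up like a (1, 3) row, never like a (3, 1) column, so mixing (3,) with (3, 1) silently produces a (3, 3) result instead of raising an error. When in doubt, .reshape(-1, 1) to force a 2D column, or .reshape(1, -1) for a 2D row.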
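A minimal sketch of how this bites in practice, using made-up predictions and targets: the shapes disagree, nothing raises, and the loss is quietly wrong.

```python
import numpy as np

v = np.array([1, 2, 3])          # shape (3,)
col = v.reshape(-1, 1)           # shape (3, 1)

# (3,) - (3, 1) does not raise an error. It broadcasts to (3, 3).
diff = v - col
print(diff.shape)

# Hypothetical case: predictions of shape (n, 1) against targets of shape (n,).
preds = np.array([[1.0], [2.0], [3.0]])     # shape (3, 1)
targets = np.array([1.0, 2.0, 3.0])         # shape (3,)
wrong_mse = ((preds - targets) ** 2).mean()           # averages over a (3, 3) grid
right_mse = ((preds.ravel() - targets) ** 2).mean()   # averages over 3 elements
print(wrong_mse, right_mse)
```

The predictions match the targets exactly, so the correct error is zero, yet the mismatched shapes report a nonzero loss. Checking `.shape` before subtracting is a cheap guard.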

What to skip

np.matrix: no longer recommended and may be removed in a future release; use a regular np.ndarray with @ instead. Also most of scipy.ndimage, unless you’re doing image work without a DL framework.