Hyperparameter
In one line
A configuration value you set before training that controls how the model learns — distinct from the parameters the model learns itself.
What it actually means
Weights and biases are parameters: the model learns them from data by gradient descent. Learning rate, batch size, number of layers, hidden dimension, dropout rate, regularization strength, optimizer choice, and so on are hyperparameters: you choose them before training starts. Picking them well is its own optimization problem, usually solved with grid search, random search, Bayesian optimization (e.g., with Optuna), or population-based training. The line gets blurry — temperature for inference is often called a hyperparameter even though it isn’t trained.
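Random search is the simplest of those strategies to show concretely. A minimal stdlib-only sketch, assuming a toy validation_loss stand-in (in practice this function would train a model and return its validation metric; the loss surface here is invented for illustration):

```python
import math
import random

def validation_loss(lr, dropout):
    # Hypothetical objective standing in for "train, then evaluate".
    # Toy surface with a minimum near lr=3e-4, dropout=0.1.
    return (math.log10(lr) + 3.5) ** 2 + (dropout - 0.1) ** 2

random.seed(0)
best = None
for _ in range(50):
    # Sample the learning rate log-uniformly (it spans orders of
    # magnitude) and dropout uniformly on a plausible range.
    lr = 10 ** random.uniform(-5, -2)
    dropout = random.uniform(0.0, 0.5)
    loss = validation_loss(lr, dropout)
    if best is None or loss < best[0]:
        best = (loss, lr, dropout)

print(best)  # (best loss found, lr, dropout)
```

The log-uniform sampling for the learning rate is the one detail worth copying: searching it on a linear scale wastes most trials at the large end of the range.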
Why it matters
Hyperparameters often matter more than architecture. A well-tuned baseline can beat a fancy model with default settings. Knowing which knobs are high-leverage for your task — and which are red herrings — is what separates someone who reads a paper from someone who reproduces it.
Example
model:
  hidden_dim: 768
  num_layers: 12
  dropout: 0.1
optim:
  lr: 3.0e-4
  weight_decay: 0.01
batch_size: 32
You’ll hear it when
- Setting up an Optuna or Ray Tune sweep.
- Reproducing a paper and the result depends on a learning-rate schedule.
- Discussing why “your model is tuned, mine isn’t” makes a comparison unfair.
- Defending compute budgets for hyperparameter search.
- Tuning generation settings (temperature, top-p) for an LLM.
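The last bullet is worth making concrete: temperature is a knob applied at inference, not learned. A minimal sketch of how it reshapes a softmax distribution, using toy logits and only the standard library:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Divide logits by the temperature before normalizing.
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # toy next-token logits
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
```

At temperature 0.5 the top token absorbs most of the probability mass; at 2.0 the distribution spreads out, which is why higher temperatures read as "more random" generations.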