Hyperparameter

Classical ML

In one line

A configuration value you set before training that controls how the model learns — distinct from the parameters the model learns itself.

What it actually means

Weights and biases are parameters: the model learns them from data by gradient descent. Learning rate, batch size, number of layers, hidden dimension, dropout rate, regularization strength, optimizer choice, and so on are hyperparameters: you choose them before training starts. Picking them well is its own optimization problem, usually solved with grid search, random search, Bayesian optimization (e.g., via Optuna), or population-based training. The line gets blurry — temperature for inference is often called a hyperparameter even though it isn’t trained.
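The search loop itself is simple. Here is a minimal random-search sketch in pure Python; the `validation_loss` function is a hypothetical stand-in for training and evaluating a real model, and in practice you would hand this loop to a tool like Optuna or Ray Tune:

```python
import random

def validation_loss(lr, dropout):
    # Hypothetical stand-in for "train the model, return validation loss".
    # Here we just pretend the best settings are lr=3e-4, dropout=0.1.
    return (lr - 3e-4) ** 2 + (dropout - 0.1) ** 2

def random_search(n_trials=100, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        # Sample lr log-uniformly and dropout uniformly -- typical
        # search-space choices for these two hyperparameters.
        lr = 10 ** rng.uniform(-5, -2)
        dropout = rng.uniform(0.0, 0.5)
        loss = validation_loss(lr, dropout)
        if best is None or loss < best[0]:
            best = (loss, {"lr": lr, "dropout": dropout})
    return best

loss, params = random_search()
print(params)  # best hyperparameters found across the trials
```

Grid search replaces the sampling with an exhaustive sweep over fixed values; Bayesian optimization replaces it with a model that proposes promising points based on past trials.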

Why it matters

Hyperparameters often matter more than architecture. A well-tuned baseline can beat a fancy model with default settings. Knowing which knobs are high-leverage for your task — and which are red herrings — is what separates someone who reads a paper from someone who reproduces it.

Example

model:
  hidden_dim: 768
  num_layers: 12
  dropout: 0.1
optim:
  lr: 3.0e-4
  weight_decay: 0.01
  batch_size: 32

You’ll hear it when

  • Setting up an Optuna or Ray Tune sweep.
  • Reproducing a paper and the result depends on a learning-rate schedule.
  • Discussing why “your model is tuned, mine isn’t” makes a comparison unfair.
  • Defending compute budgets for hyperparameter search.
  • Tuning generation settings (temperature, top-p) for an LLM.

Related terms