Inference
Running a trained model to produce predictions on new inputs — the serving side of ML, as opposed to training.
In one line
Running the forward pass of a trained model to get predictions — what “using the model” actually means in production.
What it actually means
Training is where weights change; inference is where weights are frozen and you just evaluate the model on incoming inputs. The concerns are totally different. For training you care about throughput, gradient stability, and data loading. For inference you care about latency, memory footprint, cost per request, and tail behavior. LLM inference in particular is dominated by two phases: prefill (processing the prompt, compute-bound) and decode (generating tokens one at a time, memory-bandwidth-bound). Techniques like KV caching, paged attention, speculative decoding, batching, and quantization all exist to make decode cheaper.
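The prefill/decode split and the KV cache can be sketched with toy single-head attention. This is an illustrative sketch in NumPy, not any library's real implementation: all names, shapes, and the "next-token embedding" stand-in are assumptions made for the example.

```python
import numpy as np

# Toy single-head attention showing prefill vs. decode with a KV cache.
# Dimensions and weights are illustrative, not from a real model.
d = 8  # head dimension (assumed)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    # Scaled dot-product attention for one query over all cached keys/values.
    scores = (K @ q) / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Prefill: process the whole prompt in one batch, filling the cache.
# This phase is compute-bound -- big matrix multiplies.
prompt = rng.standard_normal((5, d))   # 5 prompt "token" embeddings
K_cache = prompt @ Wk
V_cache = prompt @ Wv

# Decode: one token at a time. Each step re-reads the entire cache but
# appends only one new K/V row -- memory-bandwidth-bound.
x = prompt[-1]
for _ in range(3):
    q = x @ Wq
    x = attend(q, K_cache, V_cache)    # stand-in for the next token's embedding
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])

print(K_cache.shape)  # cache grows by one row per generated token -> (8, 8)
```

Without the cache, every decode step would recompute K and V for the whole sequence; with it, each step does O(1) new projection work plus one read over the cache, which is exactly why decode speed is bounded by memory bandwidth.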
Why it matters
Model quality is half the job; the other half is serving it at a price and speed that make the product work. Teams that only think about training ship models they can't afford to run.
Example
# Minimal inference loop for a Hugging Face model
# (model and input_ids come from the usual transformers loading/tokenization steps)
import torch

with torch.inference_mode():  # disables autograd bookkeeping: weights are frozen
    outputs = model.generate(input_ids, max_new_tokens=128)
You’ll hear it when
- Sizing a GPU budget for a launch.
- Debating vLLM, TensorRT-LLM, or llama.cpp as a serving backend.
- Measuring tokens/second and time-to-first-token.
- Deciding when to quantize or distill for production.
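Time-to-first-token and tokens/second fall out of the prefill/decode split: TTFT is dominated by prefill, throughput by decode. A minimal sketch of how you might measure both, assuming a hypothetical generate_stream() that yields tokens one at a time (any streaming backend fits this shape):

```python
import time

def measure(token_stream):
    # Returns (time-to-first-token, decode tokens/second) for any iterable
    # that yields generated tokens one at a time.
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # prefill + first decode step
        count += 1
    total = time.perf_counter() - start
    decode_time = total - ttft
    # Throughput over the decode phase only (tokens after the first).
    tps = (count - 1) / decode_time if decode_time > 0 else float("inf")
    return ttft, tps

# Hypothetical stand-in stream that "emits" 4 tokens with a small delay.
def fake_stream():
    for _ in range(4):
        time.sleep(0.01)
        yield "tok"

ttft, tps = measure(fake_stream())
```

Separating the two numbers matters because batching and quantization move them in different directions: bigger batches usually raise tokens/second but can worsen TTFT for any individual request.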